Suppose for business reasons we need to segment a dataset into buckets of different sizes to perform an A/B test.
In this example, we generate two test splits of 20% each and a 60% validation set. The validation set is where we would apply whichever logic won in the A/B test
We’re making use of two functions.
NTILE(5) is a windowed function which allows us to split our data set into 5 groups.
NEWID() is used to randomize the sampling on each transaction.
Then in the main body of the query we use a case statement to reduce the 5 groups to our final splits.
WITH cteSplit AS ( SELECT Email , NTILE(5) OVER (ORDER BY NEWID()) AS RN FROM music.account ) SELECT Email , CASE WHEN RN = 1 THEN 'Split A' WHEN RN = 2 THEN 'Split B' ELSE 'Validation' END AS Split INTO #split FROM cteSplit;
NEWID() produces a different result at each transaction!
We can check the percentage distribution like so:
SELECT Split , COUNT(*)*100.0/SUM(COUNT(*)) OVER() AS Pcnt FROM #split GROUP BY Split;