Business “reasons”
Suppose for business reasons we need to segment a dataset into buckets of different sizes to perform an A/B test.
In this example, we generate two test splits of 20% each and a 60% validation set. The validation set is where we would apply whichever logic won in the A/B test
SQL
We’re making use of two functions.
NTILE(5)
is a windowed function which allows us to split our data set into 5 groups.
NEWID()
is used to randomize the sampling on each transaction.
Then in the main body of the query we use a case statement to reduce the 5 groups to our final splits.
WITH cteSplit AS (
SELECT
Email
, NTILE(5) OVER (ORDER BY NEWID()) AS RN
FROM music.account
)
SELECT
Email
, CASE
WHEN RN = 1 THEN 'Split A'
WHEN RN = 2 THEN 'Split B'
ELSE 'Validation'
END AS Split
INTO #split
FROM cteSplit;
Warming: NEWID()
produces a different result at each transaction!
Percentage Distribution
We can check the percentage distribution like so:
SELECT
Split
, COUNT(*)*100.0/SUM(COUNT(*)) OVER() AS Pcnt
FROM #split
GROUP BY Split;