Suppose for business reasons we need to segment a dataset into buckets of different sizes to perform an A/B test.

In this example, we generate two test splits of 20% each and a 60% validation set. The validation set is where we would apply whichever logic won in the A/B test

### SQL

We’re making use of two functions.

NTILE(5) is a windowed function which allows us to split our data set into 5 groups.

NEWID() is used to randomize the sampling on each transaction.

Then in the main body of the query we use a case statement to reduce the 5 groups to our final splits.

WITH cteSplit AS (
SELECT
Email
,	NTILE(5) OVER (ORDER BY NEWID()) AS RN
FROM music.account
)
SELECT
Email
,	CASE
WHEN RN = 1 THEN 'Split A'
WHEN RN = 2 THEN 'Split B'
ELSE 'Validation'
END AS Split
INTO #split
FROM cteSplit;


Warming: NEWID() produces a different result at each transaction!

### Percentage Distribution

We can check the percentage distribution like so:

SELECT
Split
,	COUNT(*)*100.0/SUM(COUNT(*)) OVER() AS Pcnt
FROM #split
GROUP BY Split;