How Many Visitors Do You Need for A/B Testing?
“How long should I run this A/B test?” is one of the most common CRO questions—and getting it wrong can completely invalidate your results.
Run too short, and you'll make decisions based on noise. Run too long, and you waste time and incur a real opportunity cost. This guide explains how to calculate the right sample size for valid, actionable results.
Why Sample Size Matters
The Problem With Small Samples
Imagine flipping a coin 10 times and getting 7 heads. You might conclude the coin favors heads. But flip it 1,000 times, and you’ll get close to 50/50.
A/B testing works the same way. Small samples have high variance—random chance can easily look like a real pattern.
Real-World Consequences
Underpowered test scenario:
- You test a new headline
- After 500 visitors per variation, B shows 15% higher conversion
- You ship B, excited about the win
- Over the next month, conversion rate returns to baseline
- The “improvement” was just random fluctuation
You wasted development time, potentially hurt conversions during the test, and learned nothing reliable.
The Four Factors That Determine Sample Size
1. Baseline Conversion Rate
Your current conversion rate before the test.
Impact: Lower baseline rates need larger samples. Going from 1% to 1.1% is harder to detect than going from 10% to 11%.
Example:
- 2% baseline, detecting a 10% lift: ~95,000 visitors per variation
- 10% baseline, detecting a 10% lift: ~19,000 visitors per variation
2. Minimum Detectable Effect (MDE)
The smallest improvement you want to reliably detect.
Impact: Smaller effects need larger samples. Detecting a 5% lift requires 4x the sample of detecting a 10% lift.
How to choose MDE:
- What improvement would be meaningful to your business?
- What’s realistic based on your change?
- What can you afford to detect given your traffic?
Common MDEs:
- 5% relative improvement: Very sensitive, requires large samples
- 10% relative improvement: Balanced approach
- 20% relative improvement: Faster tests, may miss smaller wins
3. Statistical Significance Level
How confident you need to be that results aren’t due to chance.
Industry standard: 95% (p < 0.05)
What it means: If there’s truly no difference, there’s only a 5% chance you’d incorrectly conclude there is one (false positive).
Trade-offs:
- 90% confidence: Faster tests, more false positives
- 95% confidence: Standard balance
- 99% confidence: Slower tests, fewer false positives
4. Statistical Power
The probability of detecting a real effect if it exists.
Industry standard: 80%
What it means: If B truly is 10% better, you have an 80% chance of detecting it (and 20% chance of missing it—false negative).
Trade-offs:
- 70% power: Faster tests, miss more real winners
- 80% power: Standard balance
- 90% power: Slower tests, catch more real effects
Sample Size Calculations
The Formula
The actual formula is complex:
n = (2 × (Z_α/2 + Z_β)² × p × (1 − p)) / δ²
Where:
- Z_α/2 = Z-score for the significance level (1.96 for 95% confidence, two-sided)
- Z_β = Z-score for statistical power (0.84 for 80% power)
- p = baseline conversion rate
- δ = minimum detectable effect in absolute terms (baseline × relative MDE)
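To sanity-check a calculator (or your own numbers), here is a minimal Python sketch of the approximation above; the function name and defaults are illustrative, not taken from any particular library:

```python
from statistics import NormalDist

def sample_size_per_variation(baseline, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors per variation for a two-sided test of proportions.

    baseline:      current conversion rate, e.g. 0.02 for 2%
    relative_mde:  smallest relative lift to detect, e.g. 0.10 for 10%
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # Z for significance (two-sided)
    z_beta = NormalDist().inv_cdf(power)           # Z for power
    delta = baseline * relative_mde                # absolute effect size
    return 2 * (z_alpha + z_beta) ** 2 * baseline * (1 - baseline) / delta ** 2

# Example: 2% baseline, 10% relative lift
print(round(sample_size_per_variation(0.02, 0.10)))  # ~77,000 per variation
```

This textbook approximation returns somewhat smaller numbers than the conservative figures in the reference table below; published calculators differ in the exact variance formula and rounding they use. For planning purposes, the order of magnitude is what matters.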
Use a Calculator Instead
Several free tools do this math for you:
Recommended calculators:
- Evan Miller’s A/B Test Calculator (https://www.evanmiller.org/ab-testing/sample-size.html)
- Optimizely Sample Size Calculator
- VWO Sample Size Calculator
- CXL Sample Size Calculator
Sample Size Table
Reference table for common scenarios (95% confidence, 80% power, 50/50 split):
| Baseline Rate | 10% Relative Lift | 15% Relative Lift | 20% Relative Lift |
|---|---|---|---|
| 1% | 190,000 | 85,000 | 48,000 |
| 2% | 95,000 | 42,000 | 24,000 |
| 3% | 63,000 | 28,000 | 16,000 |
| 5% | 38,000 | 17,000 | 9,500 |
| 10% | 19,000 | 8,400 | 4,800 |
| 20% | 9,500 | 4,200 | 2,400 |
Per variation. Double for total test traffic.
Converting Sample Size to Duration
The Calculation
Test duration (days) = Total sample size needed ÷ Daily eligible traffic
Example:
- Sample needed: 20,000 per variation (40,000 total)
- Daily traffic to tested page: 2,000 visitors
- Duration: 40,000 ÷ 2,000 = 20 days
Account for Weekly Patterns
If your traffic varies significantly by day of week:
- Always run for complete weeks
- Round up to nearest full week
- 20 days → 3 weeks (21 days)
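Putting the two steps together, a small sketch that estimates duration and rounds up to complete weeks (the figures are the example numbers from above):

```python
import math

def test_duration_days(sample_per_variation, variations, daily_traffic):
    """Days needed to fill every variation, rounded up to full weeks."""
    total_needed = sample_per_variation * variations
    raw_days = math.ceil(total_needed / daily_traffic)
    return math.ceil(raw_days / 7) * 7  # always run complete weeks

# 20,000 per variation, 2 variations, 2,000 visitors/day -> 21 days (3 weeks)
print(test_duration_days(20_000, 2, 2_000))
```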
Minimum Duration Guidelines
Regardless of sample size math:
- Minimum 7 days: Captures weekly patterns
- Minimum 14 days preferred: More stable estimate
- Avoid holiday periods: Unless testing holiday-specific changes
What If You Don’t Have Enough Traffic?
Many sites can’t reach adequate sample sizes in reasonable timeframes. Options:
Option 1: Test Bigger Changes
A 50% improvement needs a far smaller sample than a 10% improvement.
| Baseline | Detect 10% lift | Detect 50% lift |
|---|---|---|
| 3% | 63,000/var | 2,800/var |
Test bolder hypotheses: complete redesigns, major copy changes, different offers.
Option 2: Accept Lower Confidence
Moving from 95% to 90% confidence reduces sample needs by roughly 20%.
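You can verify that figure directly from the Z-scores in the formula above:

```python
from statistics import NormalDist

z = NormalDist().inv_cdf
ratio = (z(0.95) + z(0.80)) ** 2 / (z(0.975) + z(0.80)) ** 2
print(f"sample size reduction: {1 - ratio:.0%}")  # ~21% smaller at 90% confidence
```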
When acceptable:
- Low-risk changes
- Easily reversible
- You’ll monitor post-implementation
When not acceptable:
- Major business decisions
- Permanent changes
- High-stakes pages
Option 3: Test Higher-Funnel Metrics
Metrics closer to the top of the funnel have more volume.
Instead of testing for purchase (2% rate), test for add-to-cart (15% rate)—you’ll reach significance much faster.
Caveat: Validate that improving the proxy metric actually improves the final goal.
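Reusing the sample-size sketch from earlier makes the difference concrete:

```python
# Purchase metric (2% baseline) vs. add-to-cart proxy (15% baseline),
# both detecting a 10% relative lift:
print(round(sample_size_per_variation(0.02, 0.10)))  # ~77,000 per variation
print(round(sample_size_per_variation(0.15, 0.10)))  # ~9,000 per variation
```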
Option 4: Focus on High-Volume Pages
Test where traffic concentrates. Your homepage or main landing pages likely have 10x the traffic of deep product pages.
Option 5: Use Qualitative Methods
When A/B testing isn’t feasible:
- User testing (5-10 participants reveal major issues)
- Session recordings (patterns visible in dozens of recordings)
- Surveys (directional feedback)
- Before/after analysis (less rigorous but still valuable)
Option 6: Multi-Armed Bandit
Some testing tools offer “bandit” algorithms that:
- Automatically allocate more traffic to winning variations
- Reach conclusions faster
- Trade statistical rigor for practical efficiency
Useful for low-traffic situations, but understand the trade-offs.
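To make the idea concrete, here is a minimal Thompson-sampling sketch for two variations with binary conversions; it is one common bandit approach, and each testing tool implements its own variant:

```python
import random

# Beta(1, 1) priors: [successes, failures] observed per variation
arms = {"A": [1, 1], "B": [1, 1]}

def choose_arm():
    """Sample a plausible conversion rate per arm; play the highest draw."""
    draws = {name: random.betavariate(s, f) for name, (s, f) in arms.items()}
    return max(draws, key=draws.get)

def record(name, converted):
    """Update the chosen arm's Beta posterior with the observed outcome."""
    arms[name][0 if converted else 1] += 1

# Simulate visitors: pick an arm, observe a conversion, update.
# Over time, more traffic flows to the better-performing variation.
for _ in range(10_000):
    arm = choose_arm()
    true_rate = 0.030 if arm == "A" else 0.036  # simulated true rates
    record(arm, random.random() < true_rate)

print(arms)  # arm B should have accumulated most of the traffic
```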
Common Sample Size Mistakes
Mistake 1: Stopping When Significant
“B is winning at 95% significance after 3 days—ship it!”
Problem: This is called “peeking.” If you check repeatedly and stop when significant, your actual false positive rate can exceed 30%.
Solution: Calculate sample size in advance. Stop at that point regardless of interim results. Or use sequential testing methods designed for early stopping.
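A quick simulation shows why peeking is dangerous. The sketch below runs many A/A tests (where no real difference exists) and "peeks" once per day, stopping at the first significant result; the stop rate is the realized false positive rate, and it lands well above the nominal 5%:

```python
import random
from statistics import NormalDist

def z_test_p(conv_a, n_a, conv_b, n_b):
    """Two-sided p-value for a difference in proportions (pooled z-test)."""
    p = (conv_a + conv_b) / (n_a + n_b)
    se = (p * (1 - p) * (1 / n_a + 1 / n_b)) ** 0.5
    if se == 0:
        return 1.0
    z = abs(conv_a / n_a - conv_b / n_b) / se
    return 2 * (1 - NormalDist().cdf(z))

RATE, DAILY, DAYS, TRIALS = 0.05, 500, 14, 500
stopped_early = 0
for _ in range(TRIALS):
    ca = cb = na = nb = 0
    for _ in range(DAYS):                        # one "peek" per day
        na += DAILY; nb += DAILY
        ca += sum(random.random() < RATE for _ in range(DAILY))
        cb += sum(random.random() < RATE for _ in range(DAILY))
        if z_test_p(ca, na, cb, nb) < 0.05:      # stop at first "significant" peek
            stopped_early += 1
            break

print(f"false positive rate with daily peeking: {stopped_early / TRIALS:.0%}")
```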
Mistake 2: Ignoring Baseline Rate
Using a generic “10,000 visitors per variation” rule.
Problem: Sample needs vary dramatically by conversion rate. A 1% baseline needs 10x the sample of a 10% baseline.
Solution: Always calculate based on your actual baseline.
Mistake 3: Testing Too Small an Effect
“I want to detect a 2% relative improvement.”
Problem: Detecting tiny effects requires enormous samples. A 2% lift from 3% baseline needs ~1.5 million visitors per variation.
Solution: Be realistic about minimum detectable effect. Can you actually achieve a 2% lift? Is it worth detecting?
Mistake 4: Splitting Traffic Too Many Ways
Testing 4 variations means 25% traffic each.
Problem: Each variation still needs the full sample size, but each now receives only 25% of traffic instead of 50%, doubling your test duration versus a simple A/B test. Multiple variations also raise the false positive rate unless you correct for multiple comparisons.
Solution: Test fewer variations. A/B (two versions) is much more practical than A/B/C/D/E.
Mistake 5: Not Accounting for Traffic Fluctuations
“We get 5,000 visitors per day, so we’ll have 70,000 in two weeks.”
Problem: Traffic varies. Weekends, holidays, marketing campaigns, seasonality all affect volume.
Solution: Use conservative estimates. Plan for variability.
Pre-Test Checklist
Before launching any A/B test:
- Baseline conversion rate documented
- Minimum detectable effect chosen (and justified)
- Sample size calculated
- Duration estimated (including buffer for traffic variation)
- Test will run for at least one full week
- You can commit to running the full duration
When to Deviate From Standard Power
Lower Power (70%) Might Be Acceptable When:
- Running many tests (learning velocity matters more than perfect confidence in any single test)
- Changes are easily reversible
- You’re trying to learn quickly
- The change is low-risk
Higher Power (90%) Is Worth It When:
- Major business decisions
- Expensive implementation
- Can’t easily reverse
- Leadership needs high confidence
Revenue-Based Sample Size
Instead of conversion rate, you might optimize for revenue per visitor.
The difference: Revenue is continuous (vs. binary conversion), which affects the statistics.
General principle: Revenue-based tests typically need larger samples because revenue has higher variance than conversion.
Recommendation: If possible, use conversion rate as primary metric. Track revenue as secondary. This usually gives faster, clearer results.
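If you do test on revenue, a standard approach is a Welch t-test on per-visitor revenue, with zeros included for non-converters. Here is a sketch using SciPy on hypothetical simulated data:

```python
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(0)

# Hypothetical per-visitor revenue: mostly zeros (non-converters) plus
# occasional purchases. Real data would come from your analytics export.
revenue_a = np.where(rng.random(20_000) < 0.020, rng.gamma(2, 40, 20_000), 0.0)
revenue_b = np.where(rng.random(20_000) < 0.022, rng.gamma(2, 40, 20_000), 0.0)

# Welch's t-test does not assume equal variance between variations
t_stat, p_value = ttest_ind(revenue_a, revenue_b, equal_var=False)
print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
```

The zero-inflated distribution is exactly why revenue tests need larger samples: most of the variance comes from whether a visitor converts at all.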
Summary: The Sample Size Reality Check
- Calculate before testing: Know your sample size needs before launching
- Be realistic about MDE: Can you achieve that improvement? Can you afford to detect it?
- Plan for full duration: Commit to running the complete test
- Don’t peek and stop: Early stopping invalidates results
- Low traffic? Adapt: Use alternative methods rather than underpowered tests
The math isn’t optional. It’s the difference between making decisions based on evidence versus making decisions based on noise that looks like evidence.
Ready to Improve Your Conversions?
Get a comprehensive CRO audit with actionable insights you can implement right away.
Request Your Audit — $2,500