AI Chatbot A/B Testing Guide: Run Experiments That Win
Most chatbot teams test wrong: before/after comparisons, no holdout, peeking at p-values. Here's how to run experiments that actually convince your CFO, with the 7 tests that consistently win across 600+ deployments.
The non-negotiables
1) Visitor-level holdout (not page-level). 2) One variable at a time. 3) Pre-compute sample size. 4) Don't peek. 5) Pre-register the primary metric.
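Visitor-level assignment is usually done by hashing a stable visitor ID, so the same person always lands in the same arm. The sketch below is one minimal way to do that; the function name `assign_arm`, the experiment label, and the 50/50 default are illustrative assumptions, not part of any specific product API.

```python
import hashlib


def assign_arm(visitor_id: str, experiment: str, holdout_share: float = 0.5) -> str:
    """Deterministically map a visitor to 'control' or 'treatment'.

    Hypothetical helper: the visitor ID is assumed to be a stable
    first-party cookie value. Salting the hash with the experiment name
    keeps assignments independent across experiments.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it into [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "control" if bucket < holdout_share else "treatment"
```

Because the arm depends only on the visitor ID and the experiment name, no assignment table is needed, and a returning visitor sees the same variant on every page and session as long as the cookie survives.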
Sample Size Math
For a binary outcome (converted vs not) at baseline conversion p, to detect relative lift Δ at 80% power and 95% confidence:
n_per_arm ≈ 16 × p × (1 − p) / (p × Δ)²
- Baseline 5%, detect +10% relative → ~31K visitors/arm
- Baseline 5%, detect +30% relative → ~3.4K visitors/arm
- Baseline 20%, detect +10% relative → ~6.4K visitors/arm
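The rule of thumb above is easy to turn into a small calculator. This is a sketch of that approximation only (80% power, 95% confidence, two arms of equal size); the function and parameter names are illustrative.

```python
import math


def n_per_arm(baseline: float, relative_lift: float) -> int:
    """Approximate visitors needed per arm using the rule-of-16 formula.

    baseline:      control conversion rate p, e.g. 0.05 for 5%
    relative_lift: relative lift to detect, e.g. 0.10 for +10%
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be between 0 and 1")
    if relative_lift <= 0:
        raise ValueError("relative_lift must be positive")
    # Absolute difference in conversion rate between the two arms.
    absolute_lift = baseline * relative_lift
    return math.ceil(16 * baseline * (1 - baseline) / absolute_lift ** 2)


for p, lift in [(0.05, 0.10), (0.05, 0.30), (0.20, 0.10)]:
    print(f"baseline {p:.0%}, +{lift:.0%} relative: {n_per_arm(p, lift):,} per arm")
```

The first scenario comes out at roughly 30.4K per arm, in line with the ~31K quoted above; the small gap presumably reflects a slightly more exact variance term than the rule of thumb uses. Because the required sample grows with the inverse square of the lift, halving the detectable lift roughly quadruples the traffic needed.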
The 7 Experiments That Win
1. Greeting copy
+12% engagement. Specific outperforms generic. "Stuck on pricing?" beats "How can I help?"
2. Opening question
+18% qualified leads. Lead with the highest-signal qualifier (use case, role, or company size).
3. Trigger delay
+9% open rate. Try 15s, 30s, 60s, exit-intent. Optimum varies by page intent.
4. Model (GPT-5 vs Claude 5)
+5–8% CSAT. Route per query if possible; see our model routing guide.
5. Refusal style
+0.4 CSAT. "I'm not sure, let me get a teammate" beats stiff legalistic refusals.
6. Handoff threshold
+11% resolution. Hand off earlier when sentiment turns negative. Two failed turns is enough.
7. Proactive vs reactive
+22% lead conv. Proactive wins on pricing/comparison pages; reactive wins on docs.
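Once any of these experiments reaches its pre-computed sample size, the pre-registered metric can be compared across arms in a single read. A two-proportion z-test is one common choice for a binary outcome; the sketch below uses only the standard library, and the counts in the usage example are made-up numbers for illustration, not results from any deployment.

```python
import math


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a: conversions and visitors in the control arm
    conv_b / n_b: conversions and visitors in the treatment arm
    """
    rate_a = conv_a / n_a
    rate_b = conv_b / n_b
    # Pooled rate under the null hypothesis that both arms convert equally.
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if se == 0:
        raise ValueError("standard error is zero; no conversions or all converted")
    z = (rate_b - rate_a) / se
    # Two-sided p-value from the normal distribution: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts, for illustration only.
z, p = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
print(f"z = {z:.2f}, p = {p:.4f}")
```

Running this once, at the planned sample size, is what keeps the 95% confidence level meaningful; running it repeatedly during the test is the peeking problem described under Common Pitfalls.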
Common Pitfalls
- Peeking. Checking p-values mid-test inflates false positives 5–10x.
- Multiple metrics. If you test 10 independent metrics at 95% confidence, there is roughly a 40% chance that at least one shows a "winner" purely by chance. Pre-register one.
- Session bucketing. Bucket by visitor cookie, not session; otherwise returning users can flip arms between sessions.
- Holiday confounds. Avoid running across holidays or competitor launches.
- Tiny absolute lift. A 5% relative lift on a 1% baseline is only 0.05 percentage points, which is rarely real money. Pick experiments by absolute impact.
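Two of the pitfalls above come down to simple arithmetic, sketched here. The family-wise calculation assumes the metrics are statistically independent, which real metrics rarely are, so treat it as a rough guide; the Bonferroni correction is shown as one standard remedy when more than one metric must be tested, not as a method this guide prescribes. All function names are illustrative.

```python
def family_wise_error_rate(num_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** num_metrics


def bonferroni_alpha(num_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance threshold that keeps the overall rate near alpha."""
    return alpha / num_metrics


def absolute_lift(baseline: float, relative_lift: float) -> float:
    """Convert a relative lift into an absolute change in conversion rate."""
    return baseline * relative_lift


print(f"10 metrics at 95%: {family_wise_error_rate(10):.0%} chance of a false winner")
print(f"Bonferroni threshold for 10 metrics: {bonferroni_alpha(10):.3f}")
print(f"+5% relative on a 1% baseline: {absolute_lift(0.01, 0.05) * 100:.2f} points")
```

Multiplying the absolute lift by monthly traffic and value per conversion gives the figure that matters when ranking experiments by impact.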
Frequently Asked Questions
Best way to test?
Visitor-level 50/50 holdout, one variable, pre-registered metric.
Sample size?
~31K visitors/arm to detect a +10% relative lift on a 5% baseline. With less traffic, test bigger swings.
Built-in experiments
EzyConn ships with native holdout, sample-size calculators, and one-click model A/B routing. Free to start.