AI Chatbot A/B Testing Guide: Run Experiments That Win

Most chatbot teams test wrong: before/after comparisons, no holdout, peeking at p-values. Here's how to run experiments that actually convince your CFO, with the 7 tests that consistently win across 600+ deployments.

12 min read · Experimentation

The non-negotiables

1) Visitor-level holdout (not page-level). 2) One variable at a time. 3) Pre-compute sample size. 4) Don't peek. 5) Pre-register the primary metric.
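
A visitor-level holdout is often implemented by hashing a stable visitor ID, so the same person always lands in the same arm. The sketch below is one way to do that, not a description of any particular product's implementation; the function name `assign_arm`, the `experiment` salt, and the 50/50 default are illustrative assumptions.

```python
import hashlib


def assign_arm(visitor_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a visitor to 'control' or 'treatment'.

    Hypothetical helper: visitor_id is assumed to be a stable identifier
    (for example a first-party cookie value), not a session ID.
    """
    # Salting with the experiment name keeps assignments independent across tests.
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if bucket < treatment_share else "control"


# The same visitor always gets the same arm for a given experiment.
print(assign_arm("visitor-123", "greeting-copy"))
```

Because the assignment depends only on the ID and the experiment name, no lookup table is needed and a returning visitor cannot flip arms.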

Sample Size Math

For a binary outcome (converted vs. not) with baseline conversion rate p, the visitors needed per arm to detect a relative lift Δ at 80% power and 95% confidence (two-sided) is approximately:

n_per_arm ≈ 16 × p × (1 − p) / (p × Δ)²
  • Baseline 5%, detect +10% relative → ~31K visitors/arm
  • Baseline 5%, detect +30% relative → ~3.4K visitors/arm
  • Baseline 20%, detect +10% relative → ~6.4K visitors/arm
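
The rule of thumb above can be checked in a few lines. The sketch below computes it alongside the standard two-proportion formula it approximates (two-sided test, equal arm sizes); the function names are illustrative, and the second formula is a textbook approximation rather than something taken from this guide.

```python
import math
from statistics import NormalDist


def rule_of_thumb_n(p: float, rel_lift: float) -> int:
    """n per arm from the shortcut: 16 * p * (1 - p) / (p * rel_lift)^2."""
    return math.ceil(16 * p * (1 - p) / (p * rel_lift) ** 2)


def two_proportion_n(p: float, rel_lift: float,
                     alpha: float = 0.05, power: float = 0.80) -> int:
    """n per arm from the normal-approximation formula for two proportions."""
    p2 = p * (1 + rel_lift)  # conversion rate in the treatment arm
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    variance = p * (1 - p) + p2 * (1 - p2)
    return math.ceil(z ** 2 * variance / (p2 - p) ** 2)


for baseline, lift in [(0.05, 0.10), (0.05, 0.30), (0.20, 0.10)]:
    print(baseline, lift, rule_of_thumb_n(baseline, lift), two_proportion_n(baseline, lift))
```

For a 5% baseline and a +10% relative lift, the shortcut gives about 30.4K per arm and the fuller formula about 31.2K, in line with the ~31K figure above. The two drift apart for larger lifts, because the shortcut uses the baseline variance for both arms.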

The 7 Experiments That Win

1. Greeting copy

+12% engagement

Specific outperforms generic. "Stuck on pricing?" beats "How can I help?"

2. Opening question

+18% qualified leads

Lead with the highest-signal qualifier (use case, role, or company size).

3. Trigger delay

+9% open rate

Try 15s, 30s, 60s, and exit-intent triggers. The optimum varies by page intent.

4. Model (GPT-5 vs Claude 5)

+5–8% CSAT

Route per query if possible — see our model routing guide.

5. Refusal style

+0.4 CSAT

"I'm not sure, let me get a teammate" beats stiff legalistic refusals.

6. Handoff threshold

+11% resolution

Hand off earlier when sentiment turns negative. Two failed turns is enough.
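
As a rough illustration, a handoff rule along these lines could look like the sketch below. Everything in it is an assumption made for the example: the `Turn` fields, the sentiment scale of -1 to 1, and the -0.3 cutoff are hypothetical, and the threshold values are exactly what this experiment would vary.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Turn:
    resolved: bool    # did the bot's reply resolve the request? (hypothetical flag)
    sentiment: float  # hypothetical score in [-1, 1]; negative means unhappy


def should_hand_off(turns: List[Turn],
                    max_failed_turns: int = 2,
                    negative_below: float = -0.3) -> bool:
    """Return True when the conversation should go to a human."""
    if not turns:
        return False
    failed = sum(1 for t in turns if not t.resolved)
    # Two failed turns is enough, regardless of sentiment.
    if failed >= max_failed_turns:
        return True
    # Hand off earlier if the latest message reads as negative.
    return turns[-1].sentiment < negative_below


print(should_hand_off([Turn(False, 0.1), Turn(False, 0.0)]))
```

In an A/B test, the control arm would keep the current values and the treatment arm would use a lower `max_failed_turns` or a less strict `negative_below`.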

7. Proactive vs reactive

+22% lead conversion

Proactive wins on pricing/comparison pages; reactive wins on docs.

Common Pitfalls

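  • Peeking. Checking p-values mid-test and stopping at the first significant result inflates false positives 5–10x.
  • Multiple metrics. If you test 10 independent metrics at 95% confidence, there is roughly a 40% chance that at least one shows a "winner" by chance alone. Pre-register one.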
  • Session bucketing. Bucket by visitor cookie, not session — otherwise users flip arms.
  • Holiday confounds. Avoid running across holidays or competitor launches.
  • Tiny absolute lift. A 5% relative lift on a 1% baseline moves conversion to just 1.05%, which is rarely real money. Pick experiments by absolute impact.
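
Two of these pitfalls come down to simple arithmetic, sketched below. The function names are illustrative, the multiple-metrics calculation assumes the metrics are independent (correlated metrics inflate less), and the 100,000-visitor figure is just an example input.

```python
def chance_of_false_winner(num_metrics: int, alpha: float = 0.05) -> float:
    """Probability that at least one of several independent metrics
    looks significant when there is no real effect."""
    return 1 - (1 - alpha) ** num_metrics


def extra_conversions(visitors: int, baseline: float, rel_lift: float) -> float:
    """Absolute number of additional conversions from a relative lift."""
    return visitors * baseline * rel_lift


# Ten metrics at 95% confidence: roughly a 40% chance of a spurious "winner".
print(round(chance_of_false_winner(10), 3))

# The same 5% relative lift, compared at a 1% and a 20% baseline.
print(extra_conversions(100_000, 0.01, 0.05))
print(extra_conversions(100_000, 0.20, 0.05))
```

On 100,000 visitors, the 5% relative lift is worth about 50 extra conversions at a 1% baseline and about 1,000 at a 20% baseline, which is why absolute impact is the better way to rank experiments.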

Frequently Asked Questions

Best way to test?

Visitor-level 50/50 holdout, one variable, pre-registered metric.

Sample size?

~31K visitors/arm to detect a +10% relative lift on a 5% baseline. Smaller traffic → test bigger swings.
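
Once the pre-computed sample size has been reached, the pre-registered metric can be evaluated a single time. The sketch below uses a pooled two-proportion z-test, which is one common choice and not necessarily the method behind the numbers in this guide; the function name and the example counts are made up for illustration.

```python
import math
from statistics import NormalDist


def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """Return (z, two-sided p-value) for the difference in conversion rates.

    conv_a / n_a is the control arm, conv_b / n_b is the treatment arm.
    """
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    # Standard error under the null hypothesis that both arms share one rate.
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Illustrative counts only: 31,000 visitors per arm.
z, p = two_proportion_z_test(1500, 31_000, 1750, 31_000)
print(round(z, 2), p < 0.05)
```

Running this once at the planned sample size, rather than repeatedly during the test, is what keeps the 5% false-positive rate honest.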

Built-in experiments

EzyConn ships with native holdout, sample-size calculators, and one-click model A/B routing. Free to start.
