How do I A/B test an AI chatbot?

To measure the chatbot's overall impact, use a 50/50 visitor-level holdout in which treatment sees the chatbot and control doesn't. To compare chatbot variants, change one variable at a time (greeting copy, opening question, model, trigger timing). Either way, run until you reach the pre-computed sample size, and don't peek at results before then.

What sample size do I need?

To detect a 10% relative lift on a 5% baseline conversion rate (5.0% → 5.5%) at 80% power and 95% confidence, you need ~31,000 visitors per arm. To detect a 30% relative lift, you need ~3,400 per arm. Lower-traffic sites should therefore test bigger swings, not smaller ones.

AI Chatbot A/B Testing Guide: Run Experiments That Win

Most chatbot teams test wrong: before/after comparisons, no holdout, peeking at p-values. Here's how to run experiments that actually convince your CFO, plus the 7 tests that consistently win across 600+ deployments.

12 min read · Updated May 12, 2026 · Experimentation

The non-negotiables

1) Visitor-level holdout (not page-level). 2) One variable at a time. 3) Pre-compute sample size. 4) Don't peek. 5) Pre-register the primary metric.
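
Visitor-level assignment is usually done by hashing a stable visitor identifier, so the same person lands in the same arm on every page and every session. The sketch below is one minimal way to do that in Python; the function name `assign_arm`, the `experiment` salt, and the idea that you already have a stable `visitor_id` (for example from a first-party cookie) are assumptions for illustration, not a description of any particular product's implementation.

```python
import hashlib


def assign_arm(visitor_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a visitor to 'treatment' or 'control'.

    visitor_id is assumed to be a stable identifier (hypothetical: a cookie value).
    experiment acts as a salt so different tests get independent splits.
    """
    key = f"{experiment}:{visitor_id}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Interpret the first 8 bytes as an integer and scale it to [0, 1).
    fraction = int.from_bytes(digest[:8], "big") / 2**64
    return "treatment" if fraction < treatment_share else "control"


if __name__ == "__main__":
    # The same visitor always gets the same arm for a given experiment.
    print(assign_arm("visitor-123", "greeting-copy-test"))
    print(assign_arm("visitor-123", "greeting-copy-test"))
```

Because the arm depends only on the visitor ID and the experiment name, no assignment table is needed, and a returning visitor cannot flip arms as long as the identifier persists.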

Sample Size Math

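For a binary outcome (converted vs. not) with baseline conversion rate p, the number of visitors per arm needed to detect a relative lift Δ at 80% power and 95% confidence (two-sided) is approximately:

n_per_arm ≈ 16 × p × (1 − p) / (p × Δ)²

Here p × Δ is the absolute lift, and the constant 16 rounds 2 × (1.96 + 0.84)² ≈ 15.7.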

• Baseline 5%, detect +10% relative → ~31K visitors/arm
• Baseline 5%, detect +30% relative → ~3.4K visitors/arm
• Baseline 20%, detect +10% relative → ~6.4K visitors/arm
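
The rule of thumb above is easy to wrap in a small helper so you can check your own baseline and target lift. This is a sketch of that approximation only; the function name `visitors_per_arm` is an assumption, and a more exact two-proportion power calculation will give somewhat different numbers (for the first case the rule of thumb lands nearer 30.4K than the ~31K quoted).

```python
import math


def visitors_per_arm(baseline: float, relative_lift: float) -> int:
    """Rule-of-thumb sample size per arm at 80% power, 95% confidence.

    baseline      -- baseline conversion rate p, e.g. 0.05 for 5%
    relative_lift -- relative lift to detect, e.g. 0.10 for +10%
    """
    if not 0 < baseline < 1:
        raise ValueError("baseline must be between 0 and 1")
    if relative_lift <= 0:
        raise ValueError("relative_lift must be positive")
    absolute_lift = baseline * relative_lift
    return math.ceil(16 * baseline * (1 - baseline) / absolute_lift**2)


if __name__ == "__main__":
    for p, lift in [(0.05, 0.10), (0.05, 0.30), (0.20, 0.10)]:
        n = visitors_per_arm(p, lift)
        print(f"baseline {p:.0%}, +{lift:.0%} relative: about {n:,} visitors/arm")
```

Halving the detectable lift roughly quadruples the required traffic, which is why lower-traffic sites are better off testing bigger swings.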

The 7 Experiments That Win

1. Greeting copy

+12% engagement

Specific outperforms generic. "Stuck on pricing?" beats "How can I help?"

2. Opening question

+18% qualified leads

Lead with the highest-signal qualifier (use case, role, or company size).

3. Trigger delay

+9% open rate

Try 15s, 30s, 60s, exit-intent. Optimum varies by page intent.

4. Model (GPT-5 vs Claude 5)

+5–8% CSAT

Route per query if possible — see our model routing guide.

5. Refusal style

+0.4 CSAT

"I'm not sure, let me get a teammate" beats stiff legalistic refusals.

6. Handoff threshold

+11% resolution

Hand off earlier when sentiment turns negative. Two failed turns is enough.
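
A handoff threshold like this can be expressed as a small rule that each arm of the test parameterizes differently. The sketch below is one reading of the advice (hand off after two consecutive failed turns, or sooner if the latest message reads as negative). The `Turn` fields, the sentiment scale of −1 to 1, and the default cutoffs are all hypothetical and would need to match whatever signals your chatbot actually exposes.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Turn:
    resolved: bool    # hypothetical flag: did the bot's answer resolve this turn?
    sentiment: float  # hypothetical score in [-1, 1]; negative means unhappy


def should_hand_off(
    turns: Sequence[Turn],
    max_failed_turns: int = 2,       # the variable under test
    sentiment_floor: float = -0.3,   # assumed cutoff for "sentiment turned negative"
) -> bool:
    """Return True when the conversation should go to a human."""
    if not turns:
        return False
    # Count unresolved turns at the end of the conversation.
    failed_streak = 0
    for turn in reversed(turns):
        if turn.resolved:
            break
        failed_streak += 1
    if failed_streak >= max_failed_turns:
        return True
    # Hand off early if the most recent message is clearly negative.
    return turns[-1].sentiment < sentiment_floor


if __name__ == "__main__":
    conversation = [Turn(resolved=False, sentiment=0.1), Turn(resolved=False, sentiment=0.0)]
    print(should_hand_off(conversation))  # two failed turns in a row
```

In an experiment, control and treatment would call the same function with different `max_failed_turns` values, which keeps the test to a single variable.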

7. Proactive vs reactive

+22% lead conversion

Proactive wins on pricing/comparison pages; reactive wins on docs.

Common Pitfalls

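• Peeking. Checking p-values mid-test and stopping at the first significant result can inflate false positives 5–10x when you check often.
• Multiple metrics. If you test 10 independent metrics at 95% confidence, there is roughly a 40% chance (1 − 0.95¹⁰) that at least one shows a "winner" by chance alone. Pre-register one primary metric.
• Session bucketing. Bucket by visitor cookie, not session. Otherwise users flip arms between visits.
• Holiday confounds. Avoid running across holidays or competitor launches.
• Tiny absolute lift. A 5% relative lift on a 1% baseline is only 0.05 percentage points (1.00% → 1.05%), which rarely adds up to real money. Pick experiments by absolute impact.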
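
The peeking problem is easy to see in a simulated A/A test, where both arms have the same true conversion rate and any "significant" result is a false positive. The sketch below compares stopping at the first significant look against testing once at the end. The parameters (10 looks, 200 visitors per arm between looks, a 5% conversion rate) and the function names are illustrative assumptions, and the size of the inflation depends on how often you look.

```python
import math
import random


def z_stat(conv_a: int, conv_b: int, n: int) -> float:
    """Pooled two-proportion z statistic for two arms of equal size n."""
    pooled = (conv_a + conv_b) / (2 * n)
    if pooled in (0.0, 1.0):
        return 0.0  # no variation yet, so no evidence either way
    se = math.sqrt(2 * pooled * (1 - pooled) / n)
    return ((conv_b - conv_a) / n) / se


def false_positive_rates(p=0.05, looks=10, per_look=200, n_sims=400, seed=0):
    """Return (rate with peeking, rate with a single final test) for an A/A test."""
    rng = random.Random(seed)
    peek_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        conv_a = conv_b = 0
        crossed = False
        z = 0.0
        for look in range(1, looks + 1):
            # Both arms draw from the same rate p: there is no real effect.
            conv_a += sum(rng.random() < p for _ in range(per_look))
            conv_b += sum(rng.random() < p for _ in range(per_look))
            z = z_stat(conv_a, conv_b, look * per_look)
            if abs(z) > 1.96:
                crossed = True  # a peeker would have stopped and declared a winner
        peek_hits += crossed
        final_hits += abs(z) > 1.96
    return peek_hits / n_sims, final_hits / n_sims


if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print(f"false positives with peeking:   {peeking:.1%}")
    print(f"false positives, final look only: {final_only:.1%}")
```

The final-look rate should hover near the nominal 5%, while the peeking rate climbs as the number of looks grows. If interim checks are unavoidable, a sequential testing method with adjusted thresholds is the usual alternative.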

Frequently Asked Questions

Best way to test?

Visitor-level 50/50 holdout, one variable, pre-registered metric.

Sample size?

~31K/arm to detect +10% on 5% baseline. Smaller traffic → test bigger swings.

Built-in experiments

EzyConn ships with native holdout, sample-size calculators, and one-click model A/B routing. Free to start.

Last updated May 12, 2026. Pair with chatbot CRO guide.