Stats Engine: How Statistical Decisions Work

advanced

The problem: when is an experiment actually done?

The most dangerous moment in experimentation is when someone looks at early results, sees a 15% lift with 90% confidence, and declares a winner. Traditional fixed-horizon statistics require you to decide your sample size before starting and only check results once, at the end. In practice, nobody does this. Teams peek at results daily, stop experiments early when results look good, and unknowingly inflate their false positive rate.

Stats Engine is Optimizely’s answer to this real-world behavior. It produces results that are valid whenever you check them — no peeking penalty, no inflated error rates, no need to pre-calculate a sample size.

Sequential testing

Traditional statistical tests assume you collect all your data first and analyze it once. Sequential testing flips this: it analyzes data continuously as it arrives and adjusts confidence intervals so they remain valid at every point in time.

Stats Engine uses a sequential testing framework whose confidence intervals are wider than fixed-horizon intervals at the same sample size, so they remain valid no matter how many times the data is evaluated. Early in the experiment, intervals are wide (more uncertainty). As data accumulates, they narrow. At no point are they misleadingly tight.

This means you can:

Check results daily without inflating your false positive rate
Stop an experiment as soon as you see a statistically significant result
Let an experiment run longer if you want more precision

You do not need to commit to a sample size upfront, though setting a minimum runtime (typically one to two business cycles) is still recommended to capture natural traffic variation.
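
To make the interval behavior in this section concrete, here is a minimal sketch that compares a fixed-horizon interval with one always-valid construction. It assumes a normal mixture sequential test with known variance; the function names and the parameters sigma and tau are illustrative assumptions, and this is not presented as Stats Engine's actual implementation.

```python
import math


def fixed_horizon_half_width(n, sigma=1.0, z=1.96):
    """Half-width of a classical 95% interval, valid only for a single look."""
    return z * sigma / math.sqrt(n)


def always_valid_half_width(n, sigma=1.0, tau=0.1, alpha=0.05):
    """Half-width of an interval that holds at every sample size n.

    Assumption: normal observations with known variance sigma**2 and a
    N(0, tau**2) mixing distribution over the effect (a mixture sequential
    test). tau is a tuning parameter chosen here purely for illustration.
    """
    s2, t2 = sigma ** 2, tau ** 2
    scale = 2.0 * s2 * (s2 + n * t2) / (n ** 2 * t2)
    log_term = math.log(1.0 / alpha) + 0.5 * math.log((s2 + n * t2) / s2)
    return math.sqrt(scale * log_term)


for n in (100, 1000, 10000):
    print(n, round(fixed_horizon_half_width(n), 4), round(always_valid_half_width(n), 4))
```

Under these assumptions the always-valid interval is wider than the fixed-horizon one at every sample size, and both shrink as data accumulates. That extra width is the price of being allowed to look at any time.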

Always-valid p-values

A traditional p-value answers: “If there is no real difference, what is the probability I would see data this extreme?” But that answer is only valid if you committed to a single analysis point. Check the p-value five times at evenly spaced points during the experiment, and the actual false positive rate climbs to roughly 14% instead of the intended 5%; with daily checks over several weeks it can exceed 20%.
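
A small simulation makes the inflation concrete. The sketch below runs A/A experiments (no real difference) and applies an ordinary fixed-horizon z-test either once at the end or at five interim looks. The sample sizes, look schedule, and function name are assumptions chosen for illustration; the exact rate depends on how many looks are taken and how they are spaced.

```python
import math
import random


def false_positive_rate(looks, n_experiments=2000, seed=0, z_crit=1.96):
    """Share of null experiments declared significant at any of the given looks.

    Each observation stands for a treatment-minus-control difference drawn
    from N(0, 1), so the true effect is zero and every rejection is a false
    positive.
    """
    rng = random.Random(seed)
    final_n = max(looks)
    look_set = set(looks)
    false_positives = 0
    for _ in range(n_experiments):
        total = 0.0
        for n in range(1, final_n + 1):
            total += rng.gauss(0.0, 1.0)
            # Naive rule: stop and declare a winner the first time |z| > 1.96.
            if n in look_set and abs(total) / math.sqrt(n) > z_crit:
                false_positives += 1
                break
    return false_positives / n_experiments


single = false_positive_rate([500])
peeking = false_positive_rate([100, 200, 300, 400, 500])
print("one look at the end:", single)
print("five looks, stop at first significant:", peeking)
```

The single-look rate should land near the nominal 5%, while the five-look rate should be clearly higher, even though nothing about the data changed.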

Stats Engine computes always-valid p-values that account for continuous monitoring. The reported significance level means what it says, regardless of how many times you have looked at the results.
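
One known way to build a p-value with this property is a mixture sequential probability ratio test. The sketch below is a simplified version for normal data with known variance. The function name and the parameters sigma and tau are assumptions for illustration, and the source does not confirm that Stats Engine computes its p-values exactly this way.

```python
import math
import random


def always_valid_p_values(observations, sigma=1.0, tau=0.1):
    """Running always-valid p-values for H0: mean effect = 0.

    Assumption: observations are N(theta, sigma**2) with known sigma, and the
    alternative is averaged over a N(0, tau**2) mixing distribution. The
    p-value is the running minimum of 1 / (mixture likelihood ratio), so it
    can only decrease as evidence accumulates.
    """
    s2, t2 = sigma ** 2, tau ** 2
    p, total, history = 1.0, 0.0, []
    for n, x in enumerate(observations, start=1):
        total += x
        mean = total / n
        log_lr = (0.5 * math.log(s2 / (s2 + n * t2))
                  + (n ** 2 * t2 * mean ** 2) / (2.0 * s2 * (s2 + n * t2)))
        p = min(p, math.exp(-log_lr))
        history.append(p)
    return history


rng = random.Random(1)
stream = [rng.gauss(0.2, 1.0) for _ in range(2000)]  # hypothetical true effect of 0.2
p_values = always_valid_p_values(stream)
print("p-value after 200, 1000, 2000 observations:",
      p_values[199], p_values[999], p_values[1999])
```

Because the sequence never increases and its validity does not depend on when you stop, reading it off after any number of observations keeps the false positive guarantee intact under the stated assumptions.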

False discovery rate control

When an experiment has multiple metrics (primary conversion, secondary revenue, engagement, bounce rate), each metric is a separate statistical test. Running many tests increases the chance that at least one shows a false positive — pure statistical noise masquerading as a real effect.

Stats Engine applies false discovery rate (FDR) control across all metrics in an experiment. Instead of controlling the error rate per metric, it controls the expected proportion of false discoveries among all metrics flagged as significant. This is less conservative than the Bonferroni correction (which controls family-wise error rate) but more protective than no correction at all.

In practice, this means:

If Stats Engine flags three metrics as significant, the expected share of false positives among them stays at or below the FDR threshold, so most of them should represent real effects
You do not need to restrict yourself to a single primary metric to avoid statistical noise
Secondary metrics are genuinely informative, not just directional
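
The contrast between FDR control and Bonferroni described above can be sketched with the classic Benjamini-Hochberg procedure. The metric names and p-values below are hypothetical, and the plain Benjamini-Hochberg step-up rule is used here as a stand-in; Stats Engine's own procedure, which works with always-valid p-values, may differ in detail.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return the metrics flagged as discoveries at FDR level q.

    p_values maps metric name -> p-value. The rule finds the largest rank k
    such that the k-th smallest p-value is <= k / m * q and flags the k
    metrics with the smallest p-values.
    """
    m = len(p_values)
    ranked = sorted(p_values.items(), key=lambda item: item[1])
    cutoff_rank = 0
    for rank, (_, p) in enumerate(ranked, start=1):
        if p <= rank / m * q:
            cutoff_rank = rank
    return {name for name, _ in ranked[:cutoff_rank]}


def bonferroni(p_values, alpha=0.05):
    """Return the metrics significant under the family-wise Bonferroni rule."""
    m = len(p_values)
    return {name for name, p in p_values.items() if p <= alpha / m}


# Hypothetical p-values for six metrics in one experiment.
metrics = {
    "conversion": 0.001,
    "revenue": 0.012,
    "bounce rate": 0.02,
    "time on page": 0.03,
    "scroll depth": 0.2,
    "CTR": 0.6,
}
print("FDR discoveries:", sorted(benjamini_hochberg(metrics)))
print("Bonferroni discoveries:", sorted(bonferroni(metrics)))
```

With these example values the FDR procedure flags four metrics while Bonferroni flags only one, which illustrates why FDR control keeps secondary metrics usable.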

Bayesian and frequentist perspectives

Stats Engine’s core methodology is frequentist sequential testing. However, Optimizely also provides Bayesian interpretations of results:

Output	Frequentist framing	Bayesian framing
Improvement	Point estimate of the treatment effect	Posterior mean of the treatment effect
Confidence interval	Range where the true effect lies (95% coverage)	Credible interval from the posterior distribution
Statistical significance	p-value below threshold (always-valid)	Probability that variation beats the baseline

Both views are computed from the same underlying data. The Bayesian view can be more intuitive for stakeholders (“there is a 96% chance this variation is better”) while the frequentist view provides the formal guarantees that protect against false positives.
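
As a rough illustration of the Bayesian framing, the probability that a variation beats the baseline can be estimated by sampling from each arm's posterior. The sketch assumes a conversion metric with uniform Beta(1, 1) priors; the function name, the priors, and the visitor counts are assumptions, not a description of how Optimizely computes this value.

```python
import random


def prob_variation_beats_baseline(conv_a, n_a, conv_b, n_b, draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under Beta(1, 1) priors.

    conv_* are conversion counts and n_* are visitor counts for the
    baseline (a) and the variation (b).
    """
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        wins += rate_b > rate_a
    return wins / draws


# Hypothetical data: 100 of 2000 baseline visitors convert, 130 of 2000 in the variation.
print(prob_variation_beats_baseline(100, 2000, 130, 2000))
```

A value close to 1 reads as strong evidence for the variation, and a value near 0.5 means the data cannot separate the two arms.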

Minimum detectable effect

The minimum detectable effect (MDE) is the smallest improvement you care about detecting. Setting an MDE does not change when Stats Engine declares significance — it affects how long you should plan to run the experiment to have adequate statistical power.

A smaller MDE (e.g., 1% lift) requires more data to detect reliably. A larger MDE (e.g., 10% lift) can be detected with less data. Think of MDE as a planning tool: “If the real improvement is at least X%, I want to be confident my experiment will detect it.”

Stats Engine displays estimated time-to-significance based on your current traffic rate and the MDE you configure.
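
For planning purposes, the relationship between MDE and required data can be approximated with the standard fixed-horizon sample size formula for two proportions. This is a sketch under stated assumptions (two-sided 5% test, 80% power, equal traffic split, a hypothetical function name); a sequential test generally needs somewhat more data, and this is not the estimate Stats Engine itself displays.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.8):
    """Approximate visitors needed per variation to detect a relative lift.

    baseline_rate is the control conversion rate (e.g. 0.05) and
    relative_mde is the smallest relative improvement worth detecting
    (e.g. 0.10 for a 10% lift).
    """
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


for mde in (0.01, 0.05, 0.10):
    print(f"relative MDE {mde:.0%}: {visitors_per_variation(0.05, mde)} visitors per variation")
```

Required traffic grows roughly with the inverse square of the MDE, so halving the effect you want to detect needs about four times the visitors.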

CUPED: variance reduction

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses each visitor’s pre-experiment behavior to reduce noise in experiment results.

Here is the intuition: a conversion from a visitor who typically converts 5% of the time tells you less than a conversion from a visitor who typically converts 0.5% of the time, because the first was more likely to happen regardless of the variation. CUPED adjusts for each visitor’s baseline behavior, reducing the variance in your metric estimates.

The practical effect: experiments reach statistical significance faster — often 20-40% faster — without any change to your experiment design or traffic volume.

CUPED activates automatically when Optimizely has enough pre-experiment data for visitors in the experiment. You do not need to configure it.
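
The adjustment itself is short. The sketch below applies the standard CUPED formula, adjusted = Y - theta * (X - mean(X)) with theta = cov(X, Y) / var(X), to synthetic data. The data generating process and helper names are assumptions for illustration, and the variance reduction you see in practice depends on how strongly pre-experiment behavior predicts the metric.

```python
import random


def mean(values):
    return sum(values) / len(values)


def variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)


def cuped_adjust(metric, pre_metric):
    """Return the CUPED-adjusted metric using pre-experiment values as covariate."""
    mx, my = mean(pre_metric), mean(metric)
    cov = sum((x - mx) * (y - my) for x, y in zip(pre_metric, metric)) / (len(metric) - 1)
    theta = cov / variance(pre_metric)
    # Subtract the part of each value that is explained by pre-experiment behavior.
    return [y - theta * (x - mx) for x, y in zip(pre_metric, metric)]


# Synthetic visitors: in-experiment behavior correlates with pre-experiment behavior.
rng = random.Random(0)
pre = [rng.gauss(10, 3) for _ in range(5000)]
during = [0.8 * x + rng.gauss(0, 2) for x in pre]
adjusted = cuped_adjust(during, pre)

print("variance before:", round(variance(during), 2))
print("variance after: ", round(variance(adjusted), 2))
```

The adjusted metric keeps the same mean as the original, so the estimated treatment effect is not biased by the adjustment; only the noise around it shrinks.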

Decision guidance

Question	Guidance
Can I check results early?	Yes. Stats Engine’s always-valid intervals mean every check is legitimate.
Do I need to set a sample size?	No, but set a minimum runtime of one to two business cycles to capture weekly patterns.
Should I worry about multiple metrics?	No. FDR control handles the multiple comparisons problem automatically.
When should I stop?	When the confidence interval excludes zero and you have run for at least one business cycle.
Is the Bayesian probability trustworthy?	Yes. It is calibrated against the frequentist framework and will not mislead you.
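
The stopping guidance in the table can be written as a simple check. This is an illustrative helper with hypothetical names and a seven-day business cycle as the default assumption, not an Optimizely API.

```python
def ready_to_stop(ci_lower, ci_upper, days_run, business_cycle_days=7):
    """Apply the two stopping conditions from the guidance table.

    The interval must exclude zero, and the experiment must have run for at
    least one full business cycle.
    """
    interval_excludes_zero = ci_lower > 0 or ci_upper < 0
    ran_full_cycle = days_run >= business_cycle_days
    return interval_excludes_zero and ran_full_cycle


print(ready_to_stop(0.01, 0.08, days_run=3))   # significant, but too early
print(ready_to_stop(0.01, 0.08, days_run=14))  # significant after two cycles
```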

1. Your stakeholder checks experiment results every day and sees 92% confidence on day 3 of a two-week experiment. They want to stop the test and ship the variation. What should you tell them?

2. Your experiment tracks 6 metrics: primary conversion, revenue, bounce rate, time on page, scroll depth, and CTR. Stats Engine flags 3 of them as statistically significant. A colleague worries that some might be false positives because of the multiple comparisons. How does Stats Engine handle this?

3. An experiment is taking longer than expected to reach significance. A data scientist suggests using CUPED to accelerate results. What do you need to do to enable it?
