Stats Engine: How Statistical Decisions Work

The problem: when is an experiment actually done?

The most dangerous moment in experimentation is when someone looks at early results, sees a 15% lift with 90% confidence, and declares a winner. Traditional fixed-horizon statistics require you to decide your sample size before starting and only check results once, at the end. In practice, nobody does this. Teams peek at results daily, stop experiments early when results look good, and unknowingly inflate their false positive rate.

Stats Engine is Optimizely’s answer to this real-world behavior. It produces results that are valid whenever you check them — no peeking penalty, no inflated error rates, no need to pre-calculate a sample size.

Traditional statistical tests assume you collect all your data first and analyze it once. Sequential testing flips this: it analyzes data continuously as it arrives and adjusts confidence intervals so they remain valid at every point in time.

Stats Engine uses a sequential testing framework whose confidence intervals are wider than their fixed-horizon counterparts, so they remain valid no matter how often the data is evaluated. Early in the experiment, intervals are wide, reflecting high uncertainty. As data accumulates, they narrow. At no point are they misleadingly tight.

This means you can:

  • Check results daily without inflating your false positive rate
  • Stop an experiment as soon as you see a statistically significant result
  • Let an experiment run longer if you want more precision

You do not need to commit to a sample size upfront, though setting a minimum runtime (typically one to two business cycles) is still recommended to capture natural traffic variation.

A traditional p-value answers: “If there is no real difference, what is the probability I would see data this extreme?” But that answer is only valid if you committed to a single analysis point. Check the p-value at five evenly spaced points during the experiment, stopping the first time it dips below 0.05, and the actual false positive rate rises to roughly 14% instead of the intended 5%. With twenty checks it is close to 25%.
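
The inflation caused by peeking can be reproduced with a small A/A simulation. The sketch below is illustrative only: it assumes equally sized batches of data between looks, a two-sided z-test at the 5% level, and a team that stops at the first significant look. The function name `peeking_false_positive_rate` and its parameters are made up for this example.

```python
import math
import random


def peeking_false_positive_rate(looks, sims=20000, z_crit=1.96, seed=0):
    """Share of A/A experiments declared significant at any of `looks` checks."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(sims):
        total = 0.0
        for k in range(1, looks + 1):
            # Under the null, each batch between looks adds a standardized
            # N(0, 1) increment to the running sum.
            total += rng.gauss(0.0, 1.0)
            z = total / math.sqrt(k)  # z-statistic on all data seen so far
            if abs(z) > z_crit:
                hits += 1
                break  # the team stops early and "declares a winner"
    return hits / sims


for looks in (1, 5, 20):
    print(looks, "looks ->", round(peeking_false_positive_rate(looks), 3))
```

With a single look the simulated rate stays near the nominal 5%, and it climbs steadily as the number of looks increases.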

Stats Engine computes always-valid p-values that account for continuous monitoring. The reported significance level means what it says, regardless of how many times you have looked at the results.
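
One published way to build an always-valid p-value is the mixture sequential probability ratio test (mSPRT). The text above does not say which construction Stats Engine uses internally, so treat the following as a sketch of the general idea rather than Optimizely's implementation. It assumes normally distributed observations with known variance `sigma2` and a normal mixing distribution with variance `tau2`; both names and their defaults are assumptions for illustration.

```python
import math
import random


def always_valid_p_values(observations, sigma2=1.0, tau2=1.0, theta0=0.0):
    """Running always-valid p-values from a mixture SPRT.

    Tests H0: mean == theta0 for i.i.d. normal data with known variance
    sigma2, mixing over alternatives with a N(theta0, tau2) distribution.
    """
    p = 1.0
    total = 0.0
    p_values = []
    for n, x in enumerate(observations, start=1):
        total += x
        mean = total / n
        denom = sigma2 + n * tau2
        # Log of the mixture likelihood ratio after n observations.
        log_lr = 0.5 * math.log(sigma2 / denom) + (
            n * n * tau2 * (mean - theta0) ** 2
        ) / (2.0 * sigma2 * denom)
        # The p-value is the running minimum of 1 / likelihood ratio,
        # so it can only stay flat or decrease as data arrives.
        p = min(p, math.exp(-log_lr))
        p_values.append(p)
    return p_values


rng = random.Random(42)
no_effect = [rng.gauss(0.0, 1.0) for _ in range(500)]
real_effect = [rng.gauss(0.5, 1.0) for _ in range(500)]
print("final p, no effect:  ", always_valid_p_values(no_effect)[-1])
print("final p, real effect:", always_valid_p_values(real_effect)[-1])
```

Because the p-value is a running minimum of a quantity whose chance of ever dropping below a threshold is bounded by that threshold under the null, comparing it to the significance level at any moment keeps the false positive rate at or below that level.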

When an experiment has multiple metrics (primary conversion, secondary revenue, engagement, bounce rate), each metric is a separate statistical test. Running many tests increases the chance that at least one shows a false positive — pure statistical noise masquerading as a real effect.

Stats Engine applies false discovery rate (FDR) control across all metrics in an experiment. Instead of controlling the error rate per metric, it controls the expected proportion of false discoveries among all metrics flagged as significant. This is less conservative than the Bonferroni correction (which controls family-wise error rate) but more protective than no correction at all.

In practice, this means:

  • If Stats Engine flags three metrics as significant, the expected share of false positives among them is held at or below the FDR level, so most of them are likely to represent real effects
  • You do not need to restrict yourself to a single primary metric to avoid statistical noise
  • Secondary metrics are genuinely informative, not just directional
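
To make the contrast between FDR control and Bonferroni concrete, here is a minimal sketch of the standard Benjamini-Hochberg procedure next to a Bonferroni cut-off. It illustrates the textbook technique and is not taken from Stats Engine. The p-values and the 10% level are invented for the example, and the metric names in the comment simply mirror the ones mentioned earlier.

```python
def benjamini_hochberg(p_values, fdr=0.10):
    """Flag metrics as significant while controlling the false discovery rate."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its stepped threshold.
        if p_values[i] <= rank * fdr / m:
            cutoff_rank = rank
    flagged = [False] * m
    for i in order[:cutoff_rank]:
        flagged[i] = True
    return flagged


def bonferroni(p_values, alpha=0.10):
    """Flag metrics while controlling the family-wise error rate."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


# Illustrative p-values for: conversion, revenue, engagement, bounce rate.
p_values = [0.001, 0.02, 0.04, 0.30]
print("Benjamini-Hochberg:", benjamini_hochberg(p_values))
print("Bonferroni:        ", bonferroni(p_values))
```

On these made-up inputs, Benjamini-Hochberg flags the first three metrics while Bonferroni flags only the first two, which is the sense in which FDR control is less conservative.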

Stats Engine’s core methodology is frequentist sequential testing. However, Optimizely also provides Bayesian interpretations of results:

Output | Frequentist framing | Bayesian framing
Improvement | Point estimate of the treatment effect | Posterior mean of the treatment effect
Confidence interval | Range where the true effect lies (95% coverage) | Credible interval from the posterior distribution
Statistical significance | p-value below threshold (always-valid) | Probability that variation beats the baseline

Both views are computed from the same underlying data. The Bayesian view can be more intuitive for stakeholders (“there is a 96% chance this variation is better”) while the frequentist view provides the formal guarantees that protect against false positives.

The minimum detectable effect (MDE) is the smallest improvement you care about detecting. Setting an MDE does not change when Stats Engine declares significance — it affects how long you should plan to run the experiment to have adequate statistical power.

A smaller MDE (e.g., 1% lift) requires more data to detect reliably. A larger MDE (e.g., 10% lift) can be detected with less data. Think of MDE as a planning tool: “If the real improvement is at least X%, I want to be confident my experiment will detect it.”
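
The relationship between MDE and required data can be approximated with the classical fixed-horizon sample size formula for comparing two conversion rates. This is a rough planning sketch, not the estimator Stats Engine uses, and a sequential test generally needs somewhat more data than the fixed-horizon figure. The function name, the 5% baseline rate, and the default significance and power values are assumptions for illustration.

```python
import math
from statistics import NormalDist


def visitors_per_variation(baseline_rate, relative_mde, alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative lift."""
    p1 = baseline_rate
    p2 = baseline_rate * (1.0 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    variance = p1 * (1.0 - p1) + p2 * (1.0 - p2)
    return math.ceil((z_alpha + z_power) ** 2 * variance / (p2 - p1) ** 2)


for mde in (0.01, 0.05, 0.10):
    print(f"MDE {mde:.0%}: {visitors_per_variation(0.05, mde):,} visitors per variation")
```

Required sample size scales roughly with the inverse square of the MDE, so cutting the MDE by a factor of ten multiplies the data requirement by about a hundred.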

Stats Engine displays estimated time-to-significance based on your current traffic rate and the MDE you configure.

CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses each visitor’s pre-experiment behavior to reduce noise in experiment results.

Here is the intuition: if a visitor who typically converts at 5% converts during the experiment, that tells you less than a visitor who typically converts at 0.5% converting. CUPED adjusts for each visitor’s baseline behavior, reducing the variance in your metric estimates.
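
In its usual textbook form, the CUPED adjustment subtracts from each visitor's metric a multiple of how far their pre-experiment value deviates from the average, with the multiplier chosen as cov(pre, metric) / var(pre). The sketch below applies that formula to synthetic data. It is an illustration of the technique rather than Optimizely's implementation, and the function name and the data-generating numbers are invented.

```python
import random
from statistics import fmean, variance


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values using a pre-experiment covariate."""
    mean_y = fmean(metric)
    mean_x = fmean(covariate)
    n = len(metric)
    cov_xy = sum(
        (x - mean_x) * (y - mean_y) for x, y in zip(covariate, metric)
    ) / (n - 1)
    theta = cov_xy / variance(covariate)
    # Remove the part of each visitor's outcome predicted by past behavior.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


# Synthetic visitors: in-experiment behavior correlates with past behavior.
rng = random.Random(7)
pre = [rng.gauss(10.0, 3.0) for _ in range(5000)]
during = [0.8 * x + rng.gauss(0.0, 2.0) for x in pre]

adjusted = cuped_adjust(during, pre)
print("mean before/after:    ", round(fmean(during), 3), round(fmean(adjusted), 3))
print("variance before/after:", round(variance(during), 3), round(variance(adjusted), 3))
```

The adjustment leaves the mean unchanged and shrinks the variance by a factor of about one minus the squared correlation between the covariate and the metric. In an actual experiment the multiplier is typically estimated on data pooled across variations so the same adjustment applies to every arm.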

The practical effect: experiments reach statistical significance faster — often 20-40% faster — without any change to your experiment design or traffic volume.

CUPED activates automatically when Optimizely has enough pre-experiment data for visitors in the experiment. You do not need to configure it.

Question | Guidance
Can I check results early? | Yes. Stats Engine’s always-valid intervals mean every check is legitimate.
Do I need to set a sample size? | No, but set a minimum runtime of one to two business cycles to capture weekly patterns.
Should I worry about multiple metrics? | No. FDR control handles the multiple comparisons problem automatically.
When should I stop? | When the confidence interval excludes zero and you have run for at least one business cycle.
Is the Bayesian probability trustworthy? | Yes. It is calibrated against the frequentist framework and will not mislead you.