Stats Engine: How Statistical Decisions Work
The problem: when is an experiment actually done?
The most dangerous moment in experimentation is when someone looks at early results, sees a 15% lift with 90% confidence, and declares a winner. Traditional fixed-horizon statistics require you to decide your sample size before starting and only check results once, at the end. In practice, nobody does this. Teams peek at results daily, stop experiments early when results look good, and unknowingly inflate their false positive rate.
Stats Engine is Optimizely’s answer to this real-world behavior. It produces results that are valid whenever you check them — no peeking penalty, no inflated error rates, no need to pre-calculate a sample size.
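The peeking problem can be reproduced with a small simulation. The sketch below is an illustration only and is not part of Stats Engine: it runs A/A experiments (no true difference), applies an ordinary fixed-horizon two-sided z-test at each of several equally spaced looks, and counts how often at least one look crosses the 5% threshold. The function name and its parameters are hypothetical.

```python
import random
from statistics import NormalDist


def peeking_false_positive_rate(looks, alpha=0.05, trials=20000, seed=0):
    """Estimate the chance that an A/A test is declared significant at
    least once when a fixed-horizon z-test is checked `looks` times."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    false_positives = 0
    for _ in range(trials):
        cumulative = 0.0
        for k in range(1, looks + 1):
            # Each equally sized batch of data contributes one standardized
            # increment; under the null it is a standard normal draw.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / k ** 0.5  # z-statistic on all data seen so far
            if abs(z) > z_crit:
                false_positives += 1
                break  # the team "stops early and ships"
    return false_positives / trials


if __name__ == "__main__":
    for looks in (1, 5, 20):
        rate = peeking_false_positive_rate(looks)
        print(f"{looks:>2} looks: false positive rate ~ {rate:.3f}")
```

With a single look the estimate stays near the nominal 5%, and it climbs as more looks are added, which is the inflation that always-valid methods are designed to remove.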
Sequential testing
Traditional statistical tests assume you collect all your data first and analyze it once. Sequential testing flips this: it analyzes data continuously as it arrives and adjusts confidence intervals so they remain valid at every point in time.
Stats Engine uses a sequential testing framework in which confidence intervals are wider than their fixed-horizon counterparts to account for continuous monitoring, so they stay valid no matter how many times the data is evaluated. Early in the experiment, intervals are wide, reflecting greater uncertainty. As data accumulates, they narrow. At no point are they misleadingly tight.
This means you can:
- Check results daily without inflating your false positive rate
- Stop an experiment as soon as you see a statistically significant result
- Let an experiment run longer if you want more precision
You do not need to commit to a sample size upfront, though setting a minimum runtime (typically one to two business cycles) is still recommended to capture natural traffic variation.
Always-valid p-values
A traditional p-value answers: “If there is no real difference, what is the probability I would see data this extreme?” But that answer is only valid if you committed to a single analysis point. Check the p-value five times during the experiment, and the actual false positive rate rises to roughly 14% instead of the intended 5%; with daily checks over three weeks it can exceed 20%.
Stats Engine computes always-valid p-values that account for continuous monitoring. The reported significance level means what it says, regardless of how many times you have looked at the results.
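One published way to build always-valid p-values is the mixture sequential probability ratio test (mSPRT). The sketch below is a minimal version under simplifying assumptions, not Optimizely's implementation: observations are per-visitor differences with known standard deviation `sigma`, the null hypothesis is a mean of `theta0`, and the alternative is averaged over a normal prior with standard deviation `tau`. All names and default values are hypothetical.

```python
import math
import random


def always_valid_p_values(observations, sigma=1.0, tau=1.0, theta0=0.0):
    """Return the running always-valid p-value after each observation,
    using a normal-mixture SPRT with known variance sigma**2."""
    p_values = []
    p = 1.0
    total = 0.0
    for n, x in enumerate(observations, start=1):
        total += x
        mean = total / n
        pooled = sigma ** 2 + n * tau ** 2
        # Log of the mixture likelihood ratio against the null mean theta0.
        log_ratio = 0.5 * math.log(sigma ** 2 / pooled) + (
            n ** 2 * tau ** 2 * (mean - theta0) ** 2
        ) / (2 * sigma ** 2 * pooled)
        # The p-value is the smallest inverse likelihood ratio seen so far,
        # so it can only decrease as evidence accumulates.
        p = min(p, math.exp(-log_ratio))
        p_values.append(p)
    return p_values


if __name__ == "__main__":
    rng = random.Random(7)
    data = [rng.gauss(0.3, 1.0) for _ in range(400)]  # synthetic true effect
    trajectory = always_valid_p_values(data)
    for n in (50, 100, 200, 400):
        print(f"n={n:>3}: p = {trajectory[n - 1]:.4f}")
```

Because the running p-value never increases and its false positive guarantee holds at every sample size, stopping the first time it drops below the threshold does not inflate the error rate. The price is that it is more conservative than a fixed-horizon p-value computed on the same data.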
False discovery rate control
When an experiment has multiple metrics (primary conversion, secondary revenue, engagement, bounce rate), each metric is a separate statistical test. Running many tests increases the chance that at least one shows a false positive: pure statistical noise masquerading as a real effect.
Stats Engine applies false discovery rate (FDR) control across all metrics in an experiment. Instead of controlling the error rate per metric, it controls the expected proportion of false discoveries among all metrics flagged as significant. This is less conservative than the Bonferroni correction (which controls family-wise error rate) but more protective than no correction at all.
In practice, this means:
- If Stats Engine flags three metrics as significant, you can be confident that most of them represent real effects
- You do not need to restrict yourself to a single primary metric to avoid statistical noise
- Secondary metrics are genuinely informative, not just directional
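To see how FDR control differs from Bonferroni, the sketch below applies the standard Benjamini-Hochberg procedure and a Bonferroni correction to the same set of p-values. This is an assumption-laden illustration: the source does not state which FDR procedure Stats Engine uses, the p-values are invented, and the function names are hypothetical.

```python
def benjamini_hochberg(p_values, q=0.10):
    """Flag p-values as significant while controlling the false
    discovery rate at level q (Benjamini-Hochberg step-up procedure)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, i in enumerate(order, start=1):
        # Keep the largest rank whose p-value sits under its threshold.
        if p_values[i] <= rank / m * q:
            cutoff_rank = rank
    significant = [False] * m
    for i in order[:cutoff_rank]:
        significant[i] = True
    return significant


def bonferroni(p_values, alpha=0.10):
    """Flag p-values as significant while controlling the family-wise
    error rate at level alpha."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


if __name__ == "__main__":
    # Invented p-values for six metrics, for illustration only.
    metrics = ["conversion", "revenue", "bounce rate",
               "time on page", "scroll depth", "CTR"]
    p_values = [0.001, 0.012, 0.03, 0.2, 0.5, 0.8]
    bh = benjamini_hochberg(p_values)
    bonf = bonferroni(p_values)
    for name, p, a, b in zip(metrics, p_values, bh, bonf):
        print(f"{name:<13} p={p:<6} BH={a!s:<6} Bonferroni={b}")
```

On these inputs Benjamini-Hochberg flags one more metric than Bonferroni does, which is the sense in which FDR control is less conservative while still guarding against noise.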
Bayesian and frequentist perspectives
Stats Engine’s core methodology is frequentist sequential testing. However, Optimizely also provides Bayesian interpretations of results:
| Output | Frequentist framing | Bayesian framing |
|---|---|---|
| Improvement | Point estimate of the treatment effect | Posterior mean of the treatment effect |
| Confidence interval | Range where the true effect lies (95% coverage) | Credible interval from the posterior distribution |
| Statistical significance | p-value below threshold (always-valid) | Probability that variation beats the baseline |
Both views are computed from the same underlying data. The Bayesian view can be more intuitive for stakeholders (“there is a 96% chance this variation is better”) while the frequentist view provides the formal guarantees that protect against false positives.
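A statement such as “there is a 96% chance this variation is better” can be illustrated with a simple conversion-rate model. The sketch below assumes independent uniform Beta(1, 1) priors on each arm's conversion rate and estimates the probability by sampling from the two posteriors. This is a textbook construction chosen for illustration, not a description of how Optimizely computes its Bayesian outputs; the function name and the counts are hypothetical.

```python
import random


def probability_variation_beats_baseline(conv_a, n_a, conv_b, n_b,
                                         draws=20000, seed=0):
    """Monte Carlo estimate of P(rate_b > rate_a) under independent
    Beta(1, 1) priors and binomial conversion data."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(draws):
        # Posterior for each arm is Beta(1 + conversions, 1 + non-conversions).
        rate_a = rng.betavariate(1 + conv_a, 1 + n_a - conv_a)
        rate_b = rng.betavariate(1 + conv_b, 1 + n_b - conv_b)
        if rate_b > rate_a:
            wins += 1
    return wins / draws


if __name__ == "__main__":
    # Invented counts: 100/1000 conversions on baseline, 130/1000 on variation.
    prob = probability_variation_beats_baseline(100, 1000, 130, 1000)
    print(f"P(variation beats baseline) ~ {prob:.3f}")
```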
Minimum detectable effect
The minimum detectable effect (MDE) is the smallest improvement you care about detecting. Setting an MDE does not change when Stats Engine declares significance; it affects how long you should plan to run the experiment to have adequate statistical power.
A smaller MDE (e.g., 1% lift) requires more data to detect reliably. A larger MDE (e.g., 10% lift) can be detected with less data. Think of MDE as a planning tool: “If the real improvement is at least X%, I want to be confident my experiment will detect it.”
Stats Engine displays estimated time-to-significance based on your current traffic rate and the MDE you configure.
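The relationship between MDE and required data can be made concrete with the classical fixed-horizon sample size approximation for comparing two conversion rates. The sketch below is a planning aid under stated assumptions (two-sided test, 5% significance level, 80% power, relative lift), not the calculation Stats Engine uses for its time-to-significance estimate; a sequential test generally needs somewhat more data than this figure. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def sample_size_per_variation(baseline_rate, relative_mde,
                              alpha=0.05, power=0.80):
    """Approximate visitors needed per variation to detect a relative
    lift of `relative_mde` over `baseline_rate` in a fixed-horizon test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_sum / (p2 - p1) ** 2)


if __name__ == "__main__":
    for lift in (0.01, 0.05, 0.10):
        n = sample_size_per_variation(0.05, lift)
        print(f"baseline 5%, MDE {lift:.0%}: about {n:,} visitors per variation")
```

Required sample size scales with the inverse square of the absolute difference, so an MDE ten times smaller needs roughly a hundred times more visitors.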
CUPED: variance reduction
CUPED (Controlled-experiment Using Pre-Experiment Data) is a variance reduction technique that uses each visitor’s pre-experiment behavior to reduce noise in experiment results.
Here is the intuition: a conversion from a visitor who typically converts at 5% is less surprising, and therefore less informative, than a conversion from a visitor who typically converts at 0.5%. CUPED adjusts for each visitor’s baseline behavior, reducing the variance in your metric estimates.
The practical effect: experiments reach statistical significance faster — often 20-40% faster — without any change to your experiment design or traffic volume.
CUPED activates automatically when Optimizely has enough pre-experiment data for visitors in the experiment. You do not need to configure it.
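The core CUPED adjustment can be sketched in a few lines. Each visitor's in-experiment metric is shifted by a multiple of how far their pre-experiment value sits from the average, where the multiplier `theta` is the covariance between the two divided by the variance of the pre-experiment value. The sketch below uses synthetic data and hypothetical function names; it shows the general technique, not Optimizely's implementation. In a real experiment `theta` and the covariate mean would be computed over all variations pooled together.

```python
import random


def sample_variance(values):
    """Unbiased sample variance."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / (len(values) - 1)


def cuped_adjust(metric, covariate):
    """Return CUPED-adjusted metric values using a pre-experiment covariate."""
    n = len(metric)
    mean_y = sum(metric) / n
    mean_x = sum(covariate) / n
    cov_xy = sum((x - mean_x) * (y - mean_y)
                 for x, y in zip(covariate, metric)) / (n - 1)
    theta = cov_xy / sample_variance(covariate)
    # Subtract the part of each value that pre-experiment behavior predicts.
    return [y - theta * (x - mean_x) for x, y in zip(covariate, metric)]


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic visitors: in-experiment spend correlates with prior spend.
    pre = [rng.gauss(10, 3) for _ in range(5000)]
    during = [0.8 * x + rng.gauss(0, 2) for x in pre]
    adjusted = cuped_adjust(during, pre)
    ratio = sample_variance(adjusted) / sample_variance(during)
    print(f"variance after CUPED / variance before: {ratio:.2f}")
```

The adjustment leaves the metric's mean unchanged and shrinks its variance by a factor of one minus the squared correlation between the pre-experiment and in-experiment values, so the gain depends on how predictive past behavior is.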
Decision guidance
| Question | Guidance |
|---|---|
| Can I check results early? | Yes. Stats Engine’s always-valid intervals mean every check is legitimate. |
| Do I need to set a sample size? | No, but set a minimum runtime of one to two business cycles to capture weekly patterns. |
| Should I worry about multiple metrics? | No. FDR control handles the multiple comparisons problem automatically. |
| When should I stop? | When the confidence interval excludes zero and you have run for at least one business cycle. |
| Is the Bayesian probability trustworthy? | Yes. It is calibrated against the frequentist framework and will not mislead you. |
1. Your stakeholder checks experiment results every day and sees 92% confidence on day 3 of a two-week experiment. They want to stop the test and ship the variation. What should you tell them?
Stats Engine uses sequential testing that produces always-valid p-values. Unlike traditional statistics, checking results daily does not inflate the false positive rate. The 92% confidence is real. However, running for at least one full business cycle is recommended to capture natural traffic variation.
2. Your experiment tracks 6 metrics: primary conversion, revenue, bounce rate, time on page, scroll depth, and CTR. Stats Engine flags 3 of them as statistically significant. A colleague worries that some might be false positives because of the multiple comparisons. How does Stats Engine handle this?
Stats Engine applies false discovery rate (FDR) control across all metrics. This is less conservative than Bonferroni but more protective than no correction. When it flags 3 metrics as significant, you can be confident that most represent real effects.
3. An experiment is taking longer than expected to reach significance. A data scientist suggests using CUPED to accelerate results. What do you need to do to enable it?
CUPED activates automatically when Optimizely has enough pre-experiment data for visitors. It reduces variance by adjusting for each visitor's baseline behavior, often reaching significance 20-40% faster without any configuration changes.