The “Don’t Calculate Post-Hoc Power” Tool
Demonstrating the redundancy of observed power calculations in statistical inference.
[Interactive calculator outputs: Resulting P-Value, Observed Z-Statistic, Critical Z-Value]
[Chart: Power vs. Effect Size Curve; the red dot marks your study’s observed power]
[Table: A Priori Power (n=50) across hypothetical true effects (d), with statistical status; power is calculated for different possible true population effects given your sample size]
What is Post-Hoc Power?
Post-hoc power, also known as observed power or retrospective power, is the statistical power of a test calculated after the study has been completed, using the effect size actually observed in the data. While it might seem like a useful way to understand why a result was non-significant, the consensus among statisticians is clear: you should not calculate post-hoc power using the observed estimate of the effect size.
Statistical power is the probability of rejecting a null hypothesis when it is actually false. Ideally, this calculation is done a priori (before the study) to determine the necessary sample size. Calculating it after the study using the observed effect size creates a circular logic that offers no new information beyond the p-value itself.
Why You Shouldn’t Calculate Post-Hoc Power from the Observed Effect Size
The primary reason not to calculate post-hoc power from the observed effect size is known as the “Power Approach Paradox.” For a fixed sample size and alpha, observed power is a 1:1 function of the p-value: if your p-value is high, your observed power will inevitably be low, and if your p-value is low (significant), your observed power will inevitably be high.
The Redundancy Problem
When you calculate power based on the observed effect, you are essentially asking: “If the true effect were exactly what I just found, what would the probability of finding a significant result be?” Since your result was non-significant, the observed effect is by definition small relative to the noise, and thus the “power” will always look “inadequate.” This adds zero diagnostic value to your interpretation of the data.
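The 1:1 mapping described above can be made explicit: given only the p-value and alpha, “observed power” is fully determined, with no reference to the data at all. A minimal Python sketch (function names are ours; Φ⁻¹ is approximated by bisection to keep the example self-contained):

```python
from math import sqrt, erf

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi_inv(p):
    """Inverse standard normal CDF via bisection (adequate for a demo)."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

def observed_power(p_value, alpha=0.05):
    """Observed power implied by a two-sided p-value: a pure function of p."""
    z_obs = phi_inv(1 - p_value / 2)   # |z| implied by the p-value
    z_crit = phi_inv(1 - alpha / 2)    # critical value for alpha
    return phi(z_obs - z_crit)

# A lower p always means higher observed power: no new information.
for p in (0.50, 0.20, 0.12, 0.05, 0.01):
    print(f"p = {p:.2f}  ->  observed power = {observed_power(p):.2f}")
```

Note that the effect size and sample size never appear in `observed_power`: once the p-value is known, they add nothing.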
Variable Explanations and Formula
For a simple one-sample z-test, the mathematical relationship is:

Power = Φ( d · √n − z₁₋α/₂ )

For a two-sample comparison with n participants per group, √n is replaced by √(n/2).
| Variable | Meaning | Typical Range |
|---|---|---|
| d | Observed Cohen’s d (Effect Size) | 0.0 to 1.5+ |
| n | Sample Size (per group) | 10 to 1000+ |
| α (Alpha) | Significance Threshold | 0.01 to 0.10 |
| Φ | Standard Normal Cumulative Distribution | 0 to 1 |
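The formula above transcribes directly into code. A sketch (function names are ours; the two-sample variant reflects the equal-groups assumption the calculator uses):

```python
from math import sqrt, erf

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

Z_CRIT = 1.959964  # z_{1-alpha/2} for alpha = 0.05, two-sided

def post_hoc_power(d, n, z_crit=Z_CRIT, two_sample=False):
    """Power = Phi(d * sqrt(n) - z_crit); for two equal groups of
    size n, the noncentrality term uses sqrt(n / 2) instead of sqrt(n)."""
    scale = sqrt(n / 2) if two_sample else sqrt(n)
    return phi(d * scale - z_crit)

print(f"d=0.35, n=50, one-sample:  {post_hoc_power(0.35, 50):.2f}")
print(f"d=0.35, n=50, two-sample:  {post_hoc_power(0.35, 50, two_sample=True):.2f}")
```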
Practical Examples (Real-World Use Cases)
Case 1: The “Underpowered” Trap
A researcher conducts a clinical trial with 30 patients per group. The p-value comes back as 0.12 (not significant at the 0.05 level). The researcher then calculates the post-hoc power based on the observed effect size of d=0.35. They find the power is only 35% and conclude: “The study was underpowered; that’s why we didn’t find significance.”
Interpretation: This is a mistake. The low power is simply a restatement of the p-value being 0.12. It does not prove that a larger study would have found a significant effect. You should not calculate post-hoc power from the observed effect size here, because it offers false reassurance that the effect is real and that only the sample size was to blame.
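The circularity in Case 1 can be checked directly: starting from the reported p-value alone, with no effect size or sample size, one recovers roughly the same “observed power” the researcher computed. A sketch assuming a two-sided z-test at α = 0.05:

```python
from math import sqrt, erf

def phi(x):
    """Standard normal CDF."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def phi_inv(p):
    """Inverse standard normal CDF via bisection."""
    lo, hi = -10.0, 10.0
    for _ in range(100):
        mid = (lo + hi) / 2
        lo, hi = (mid, hi) if phi(mid) < p else (lo, mid)
    return (lo + hi) / 2

p_value = 0.12                      # the non-significant result from Case 1
z_obs = phi_inv(1 - p_value / 2)    # |z| implied by p = 0.12
power = phi(z_obs - phi_inv(0.975))
print(f"observed power = {power:.0%}")  # roughly the 35% the researcher reported
```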
Case 2: The Significant Success
A marketer runs an A/B test with 500 visitors per variation. They find a significant result (p=0.01) with an effect size of d=0.15. They calculate observed power and find it to be 88%. They report that the study was “highly powered.”
Interpretation: Again, the power calculation is redundant. The significance (p=0.01) already tells us the effect was large enough relative to the sample size to reject the null. The 88% figure is just another way of looking at the same data.
How to Use This Post-Hoc Power Calculator
- Enter Sample Size: Input the number of participants in a single group (assumes two equal groups).
- Enter Observed Effect Size: Use the Cohen’s d value reported in your analysis.
- Select Alpha: Usually 0.05, representing the 5% risk of a Type I error.
- Analyze the Results: Observe how the Power % and P-value move in inverse lockstep.
- Review the Chart: See where your study falls on the power curve compared to other potential effect sizes.
Key Factors That Affect Post-Hoc Results
- Sample Size (n): Larger samples increase the Z-score for the same effect size, thus lowering the p-value and raising observed power.
- Observed Effect Size: This is the most volatile factor. In small samples, the observed effect size is a very poor estimate of the true population effect.
- Alpha Level: A more lenient alpha (0.10) will naturally show higher “power” because the bar for significance is lower.
- Variance: High noise in data shrinks the observed Cohen’s d, directly impacting the post-hoc calculation.
- Sampling Error: Because of random chance, your observed effect size might be much larger or smaller than reality. Post-hoc power ignores this uncertainty.
- The Redundancy Law: Once alpha is fixed, observed power is determined entirely by the p-value; sample size influences it only through the p-value itself.
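The volatility of observed effect sizes, and hence of any power figure computed from them, is easy to see by simulation. A sketch assuming two normal groups with a true d of 0.35 and n = 30 per group, as in Case 1 (all names are ours):

```python
import random
from statistics import mean, stdev

random.seed(42)

TRUE_D = 0.35   # true population effect
N = 30          # per-group sample size

def observed_d():
    """One simulated two-group study; returns the observed Cohen's d."""
    a = [random.gauss(0.0, 1.0) for _ in range(N)]
    b = [random.gauss(TRUE_D, 1.0) for _ in range(N)]
    pooled_sd = ((stdev(a) ** 2 + stdev(b) ** 2) / 2) ** 0.5
    return (mean(b) - mean(a)) / pooled_sd

estimates = sorted(observed_d() for _ in range(2000))
print(f"true d = {TRUE_D}")
print(f"middle 95% of observed d: [{estimates[50]:.2f}, {estimates[1950]:.2f}]")
```

With samples this small, the observed d routinely lands anywhere from near zero (or below) to more than double the true effect, so any “power” computed from a single observed d inherits that same scatter.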
Frequently Asked Questions (FAQ)
Why is post-hoc power considered biased?
It is biased because it assumes the observed effect size is the true population effect size. In reality, observed effects in non-significant studies are often underestimated due to sampling error.
Should I ever report post-hoc power in a paper?
Most modern statisticians and journals discourage it. Instead, you should report confidence intervals for the effect size to show the range of plausible values.
What is the alternative to post-hoc power?
The best alternative is to conduct a prospective power analysis for future studies or use Confidence Intervals to evaluate the precision of your estimates.
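As a sketch of the confidence-interval alternative, here is the common large-sample standard-error approximation for Cohen’s d (the helper name is ours), applied to the Case 1 numbers:

```python
from math import sqrt

def cohens_d_ci(d, n1, n2, z=1.959964):
    """Approximate 95% CI for Cohen's d, using the large-sample
    standard error sqrt((n1+n2)/(n1*n2) + d^2 / (2*(n1+n2)))."""
    se = sqrt((n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2)))
    return d - z * se, d + z * se

lo, hi = cohens_d_ci(0.35, 30, 30)   # Case 1: d = 0.35, n = 30 per group
print(f"d = 0.35, 95% CI [{lo:.2f}, {hi:.2f}]")
```

The interval straddles zero, which says directly what post-hoc power only hints at: the data are compatible both with no effect and with a sizable one.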
Is post-hoc power the same as observed power?
Yes, the terms are interchangeable. Both refer to calculations performed after data collection using the results of that data.
Can post-hoc power be high even if the result is non-significant?
No. For a two-sided test, if p > alpha, the observed power falls below roughly 50%; it equals about 50% when p sits exactly at alpha and drops further as p grows.
Why do some software packages still calculate it?
Legacy reasons and historical practice. Just because a software package outputs a value doesn’t mean that value is scientifically valid for interpretation.
Does sample size planning solve this?
Yes, a priori power analysis helps you design a study with enough sensitivity to detect a *meaningful* effect, rather than just reacting to whatever effect you happened to observe.
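A priori planning can be sketched with the usual z-approximation for a two-sample test at α = 0.05 and 80% power (exact t-test answers run a participant or two higher; the function name is ours):

```python
from math import ceil

Z_ALPHA = 1.959964   # z_{1-alpha/2} for two-sided alpha = 0.05
Z_POWER = 0.841621   # z corresponding to 80% power

def n_per_group(mde):
    """Approximate per-group n for a two-sample z-test:
    n = 2 * ((z_{1-alpha/2} + z_{power}) / d)^2."""
    return ceil(2 * ((Z_ALPHA + Z_POWER) / mde) ** 2)

# Plan around the minimum effect worth detecting, not an observed one.
for mde in (0.2, 0.5, 0.8):
    print(f"MDE d = {mde}: n = {n_per_group(mde)} per group")
```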
Is there a specific effect size I should use for power?
You should use the “Minimum Detectable Effect” (MDE) that is clinically or practically significant, not the effect you observed in a single small trial.
Related Tools and Internal Resources
- Statistical Power Analysis Guide – A deep dive into a priori planning.
- A/B Testing Mistakes – Common pitfalls in digital experimentation.
- P-hacking prevention – How to ensure your p-values are honest and robust.
- Cohen’s d Calculator – Standardizing your effect sizes for better reporting.
- Sample Size Calculator – Determine how many users you need before you start.
- Confidence Interval Tool – A better way to report uncertainty than post-hoc power.