Precise null vs. underpowered

A nonsignificant result is ambiguous. It could mean the effect is truly close to zero — a precise null. Or it could mean your study didn’t have enough power to detect any effect of interest — underpowered. Standard significance testing doesn’t tell you which one you have, because both produce the same “fail to reject $H_0$” verdict.

This page is about how to actually distinguish them, and which standard tools the literature uses to do so. The short version: don’t read “no effect” from a nonsignificant result alone. Pre-specify a smallest effect size of interest (SESOI), then ask whether your confidence interval rules out everything outside that band. If it does, you have a precise null. If it doesn’t, you’re underpowered relative to that question.

The conceptual problem

Classical significance testing is asymmetric: a small p-value is evidence that the null is wrong, but a large p-value is not evidence that the null is right. It’s the absence of evidence either way. So a p-value of 0.32 leaves you stranded: you can’t conclude there’s an effect, but you also can’t conclude there isn’t.

The asymmetry comes from how the null is set up. The standard null is a point: $H_0: \beta = 0$. Failure to reject it could happen because $\beta$ is exactly zero, or because $\beta$ is small but nonzero, or because $\beta$ is large but your sample is too small to detect it. The data don’t distinguish these cases unless you bring extra information to the analysis — namely, what counts as a substantively meaningful effect.

The eyeball test

Before any formal procedure, you can almost always tell the two apart just by looking at the confidence interval. Three rules:

1. Look at the CI, not the p-value. A p-value of $0.4$ with CI $[-0.010, 0.025]$ is a precise null. A p-value of $0.4$ with CI $[-0.40, 1.00]$ is underpowered. Same p-value, opposite stories. The p-value hides the precision; the CI bounds expose it.

2. Ask one question of the CI. “If the truth were at the worst edge of this interval, would I still care?” If no — even at the edge, the effect is too small to be substantively interesting — you have a precise null. If yes — the interval still allows for an effect you’d consider meaningful — you’re underpowered. The “would I care” answer is substantive, not statistical.

3. Compare CI width to the natural scale of the outcome. A blood-pressure study with a CI of $(-0.3,\, +0.5)$ mm Hg is narrow on the clinical scale (blood pressure has to move 5–10 mm Hg to matter) → precise null. A CI of $(-4,\, +5)$ mm Hg is wide on that scale → underpowered. Same logic applies to any outcome: compare interval width to “how big is big” in that variable’s natural units.

These three checks get you the right answer in most applied situations without ever running TOST or computing a Bayes factor. The formal methods below are for when you need to report a specific claim — “we can rule out effects larger than $\delta$ with 95% confidence” — rather than just read the picture.
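The first rule can be checked numerically. The sketch below builds two hypothetical studies that share a z-statistic (and so a p-value of 0.4) but differ in standard error; the helper name `ci_summary` and all numbers are illustrative assumptions, and the intervals use a normal approximation.

```r
# Two hypothetical studies with the same z-statistic (hence the same p-value)
# but very different standard errors. All numbers are illustrative.
ci_summary <- function(estimate, se, level = 0.95) {
  z_crit <- qnorm(1 - (1 - level) / 2)      # about 1.96 for a 95% interval
  c(p_value = 2 * pnorm(-abs(estimate / se)),
    ci_lo   = estimate - z_crit * se,
    ci_hi   = estimate + z_crit * se)
}

z <- qnorm(1 - 0.4 / 2)                     # z-statistic that gives p = 0.4

tight <- ci_summary(estimate = 0.0075, se = 0.0075 / z)
wide  <- ci_summary(estimate = 0.30,   se = 0.30 / z)

round(rbind(tight, wide), 3)
```

Both rows report the same p-value; only the interval columns reveal which study pinned the effect down.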

TOST: turn the question around

Equivalence testing — also called the two one-sided tests (TOST) procedure — flips the null. Instead of asking “is the effect zero?”, it asks “is the effect inside a band that’s small enough to be practically zero?”

You pre-specify equivalence bounds $\pm\delta$ representing the smallest effect size of substantive interest (SESOI). Then you run two one-sided tests:

\[ H_{0,L}: \beta \leq -\delta \qquad \text{vs.} \qquad H_{1,L}: \beta > -\delta \] \[ H_{0,U}: \beta \geq +\delta \qquad \text{vs.} \qquad H_{1,U}: \beta < +\delta \]

If you reject both at the $\alpha$ level (so the 100(1−2α)% CI lies entirely inside $(-\delta, +\delta)$), you’ve established equivalence: the effect is too small to matter. That’s a positive claim — “the data support a null effect” — not just a failure to reject.

If you can’t reject both, either the effect is truly negligible but your sample isn’t big enough to show it, or there really is an effect at or beyond the equivalence bounds. Either way, inconclusive.

The accessible reference is Lakens (2017), “Equivalence Tests: A Practical Primer for t Tests, Correlations, and Meta-Analyses,” in Social Psychological and Personality Science. The TOSTER R package, which Lakens created, implements these tests.
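As a rough illustration of the two one-sided tests, here is a minimal base-R sketch that works from an estimate and its standard error under a normal approximation. The function name `tost_z` and the inputs are assumptions made for this example; a real analysis would more likely use t-based tests from a dedicated package.

```r
# Minimal TOST sketch under a normal approximation.
# `tost_z`, `estimate`, `se`, and `delta` are illustrative names.
tost_z <- function(estimate, se, delta, alpha = 0.05) {
  # H0_L: beta <= -delta, rejected when the estimate sits far enough above -delta
  p_lower <- pnorm((estimate + delta) / se, lower.tail = FALSE)
  # H0_U: beta >= +delta, rejected when the estimate sits far enough below +delta
  p_upper <- pnorm((estimate - delta) / se)

  # Equivalent check: the 100(1 - 2*alpha)% CI lies strictly inside (-delta, delta)
  z_crit <- qnorm(1 - alpha)
  ci <- estimate + c(-1, 1) * z_crit * se

  list(p_lower    = p_lower,
       p_upper    = p_upper,
       p_tost     = max(p_lower, p_upper),   # both tests must reject
       ci         = ci,
       equivalent = max(p_lower, p_upper) < alpha,
       ci_inside  = ci[1] > -delta && ci[2] < delta)
}

narrow <- tost_z(estimate = 0.05, se = 0.05, delta = 0.30)
broad  <- tost_z(estimate = 0.05, se = 0.20, delta = 0.30)
c(narrow = narrow$equivalent, broad = broad$equivalent)
```

The `equivalent` and `ci_inside` flags agree by construction, which is the TOST-to-CI correspondence stated above.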

The CI-against-SESOI shortcut

TOST has a clean visual equivalent. Draw the 95% CI for your effect, then draw the equivalence bounds $\pm\delta$ as a second band. There are exactly three pictures:

Picture	Verdict
CI is entirely inside $(-\delta, +\delta)$	Precise null. With a 95% CI this is equivalent to TOST rejecting both equivalence-null hypotheses at $\alpha = 0.025$ (use a 90% CI for the $\alpha = 0.05$ version). The effect, even at the edge of the CI, is too small to matter, and that holds even if the CI happens to exclude 0.
CI excludes 0 and stretches outside $(-\delta, +\delta)$	Detected effect. Standard significance, and the interval includes magnitudes large enough to be substantive.
CI includes 0 and extends past $\pm\delta$ in at least one direction	Underpowered (or inconclusive). Effects of policy-relevant size are still inside your interval. The data don’t pin the truth down.

The headline: the precision of your study is the width of your CI, and “precise” is defined relative to a pre-specified SESOI. Whether you have a precise null isn’t a property of $\hat\beta$ alone; it’s a joint property of $\hat\beta$, $\text{SE}(\hat\beta)$, and $\delta$.

The simulation below shows how this works. Drag the standard error and the SESOI and watch the same point estimate cycle between verdicts.

#| standalone: true
#| viewerHeight: 580

library(shiny)

ui <- fluidPage(
  titlePanel("Same effect estimate, different verdict"),

  sidebarLayout(
    sidebarPanel(
      width = 4,
      sliderInput("beta", "Observed effect β̂:",
                  min = -1, max = 1, value = 0.05, step = 0.01),
      sliderInput("se", "Standard error SE(β̂):",
                  min = 0.01, max = 0.6, value = 0.05, step = 0.01),
      sliderInput("sesoi", "Smallest effect size of interest δ (SESOI):",
                  min = 0.05, max = 1, value = 0.30, step = 0.05),
      hr(),
      htmlOutput("readout")
    ),
    mainPanel(
      width = 8,
      plotOutput("ci_plot", height = "480px")
    )
  )
)

server <- function(input, output, session) {

  verdict <- reactive({
    b   <- input$beta
    se  <- input$se
    d   <- input$sesoi
    # 95% CI under a normal approximation (z = 1.96)
    ci_lo <- b - 1.96 * se
    ci_hi <- b + 1.96 * se

    # Verdict inputs: does the CI exclude zero, and does it sit strictly inside (-d, d)?
    excludes_zero  <- (ci_lo > 0) || (ci_hi < 0)
    ci_inside_band <- (ci_lo > -d) && (ci_hi < d)

    if (ci_inside_band) {
      v <- "PRECISE NULL"
      col <- "#27ae60"
    } else if (excludes_zero) {
      v <- "DETECTED EFFECT"
      col <- "#185FA5"
    } else {
      v <- "UNDERPOWERED / INCONCLUSIVE"
      col <- "#c0392b"
    }
    list(v = v, col = col, ci_lo = ci_lo, ci_hi = ci_hi,
         p = 2 * pnorm(-abs(b / se)),
         p_tost_lower = pnorm((b + d) / se, lower.tail = FALSE),
         p_tost_upper = pnorm((b - d) / se))
  })

  output$ci_plot <- renderPlot({
    r <- verdict()
    b <- input$beta; d <- input$sesoi

    xlim <- c(min(-1.2, r$ci_lo - 0.1, -d - 0.1),
              max( 1.2, r$ci_hi + 0.1,  d + 0.1))

    par(mar = c(4, 4, 1.5, 1))
    plot(NA, xlim = xlim, ylim = c(0, 1.4),
         xlab = "Effect size", ylab = "", yaxt = "n", main = "")

    # SESOI band
    rect(-d, 0.1, d, 1.0, col = adjustcolor("#27ae60", 0.10), border = NA)
    abline(v = c(-d, d), col = "#27ae60", lwd = 2, lty = 2)
    text(-d, 1.1, expression(-delta), col = "#27ae60", cex = 1.1)
    text( d, 1.1, expression(+delta), col = "#27ae60", cex = 1.1)
    text(0, 1.25, "SESOI band (effect 'too small to matter')",
         col = "#27ae60", cex = 0.95)

    # null
    abline(v = 0, col = "gray60", lwd = 1, lty = 3)

    # CI bar
    segments(r$ci_lo, 0.55, r$ci_hi, 0.55, col = r$col, lwd = 5)
    points(b, 0.55, pch = 19, col = r$col, cex = 1.7)
    text(b, 0.7, sprintf("β̂ = %.2f", b), col = r$col, cex = 1, pos = 3)
    text(r$ci_lo, 0.55, sprintf("%.2f", r$ci_lo), col = r$col, cex = 0.9, pos = 1)
    text(r$ci_hi, 0.55, sprintf("%.2f", r$ci_hi), col = r$col, cex = 0.9, pos = 1)
    text(b, 0.42, "95% CI", col = r$col, cex = 0.9)
  })

  output$readout <- renderUI({
    r <- verdict()
    HTML(sprintf(paste(
      "<div style='margin-top:10px;font-size:13px;line-height:1.7;'>",
      "<b>95%% CI:</b> [%.3f, %.3f]<br>",
      "<b>SESOI band:</b> [−%.2f, +%.2f]<br><br>",
      "<b>Standard p-value</b> (H₀: β = 0): %.3f<br>",
      "<b>TOST p-values</b> (equivalence): %.3f (lower), %.3f (upper)<br><br>",
      "<div style='padding:8px;background:%s;color:white;font-weight:bold;border-radius:4px;text-align:center;'>%s</div>",
      "</div>"
    ),
    r$ci_lo, r$ci_hi,
    input$sesoi, input$sesoi,
    r$p, r$p_tost_lower, r$p_tost_upper,
    r$col, r$v))
  })
}

shinyApp(ui, server)

Things to try.

Start with β̂ = 0.05, SE = 0.05, δ = 0.30. The CI is roughly $(-0.05, 0.15)$, entirely inside $(-0.30, 0.30)$. Precise null. You have evidence the effect is too small to matter — even though the standard p-value is unimpressive.
Now raise SE to 0.20 with everything else fixed. CI becomes roughly $(-0.34, 0.44)$ — extends past $\delta$ in both directions. Same point estimate, but now underpowered.
Move β̂ to 0.40 and keep SE at 0.05. CI is roughly $(0.30, 0.50)$, which excludes zero and sits at or beyond $\delta$. Detected effect.
Tighten δ to 0.10 with β̂ = 0.05, SE = 0.05. The CI no longer fits inside the SESOI band. Underpowered for a stricter equivalence margin. Same data; the verdict changed because the substantive question got more demanding.
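For readers without the interactive app, the four scenarios above can be reproduced with a plain function. This is a sketch that mirrors the app's three-way rule; `classify_ci` is an illustrative name, not part of any package.

```r
# Same three-way classification as the app, as a plain function.
# `classify_ci` is an illustrative name, not part of any package.
classify_ci <- function(estimate, se, delta, z_crit = 1.96) {
  ci_lo <- estimate - z_crit * se
  ci_hi <- estimate + z_crit * se
  if (ci_lo > -delta && ci_hi < delta) {
    "precise null"                      # CI strictly inside the SESOI band
  } else if (ci_lo > 0 || ci_hi < 0) {
    "detected effect"                   # CI excludes zero and leaves the band
  } else {
    "underpowered / inconclusive"       # CI covers zero and reaches past the band
  }
}

scenarios <- data.frame(
  estimate = c(0.05, 0.05, 0.40, 0.05),
  se       = c(0.05, 0.20, 0.05, 0.05),
  delta    = c(0.30, 0.30, 0.30, 0.10)
)
scenarios$verdict <- mapply(classify_ci,
                            scenarios$estimate, scenarios$se, scenarios$delta)
print(scenarios)
```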

The Bayesian alternative

In a Bayesian framework, you don’t need a separate testing procedure for the null — the posterior already quantifies what you believe.

Posterior probability inside the SESOI: compute $P(|\beta| < \delta \mid \text{data})$. If it’s, say, > 95%, you’ve established a precise null in the Bayesian sense.
Bayes factor: $BF_{01}$ compares the null model ($\beta$ near 0) to an alternative ($\beta$ distributed under some prior). $BF_{01} > 3$ is conventionally “moderate evidence for the null”; $BF_{01} > 10$ is “strong evidence for the null.” $BF \approx 1$ is the clean statement “the data are inconclusive.”
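The posterior-probability check is easy to sketch if you are willing to assume a flat prior and a normal likelihood, so that the posterior for $\beta$ is approximately $\mathcal{N}(\hat\beta, SE^2)$. That assumption, the function name `p_inside_sesoi`, and the inputs are all illustrative; an informative prior would change the numbers.

```r
# Posterior probability that |beta| < delta, ASSUMING a flat prior and a
# normal likelihood, so that beta | data ~ N(estimate, se^2).
p_inside_sesoi <- function(estimate, se, delta) {
  pnorm(delta, mean = estimate, sd = se) -
    pnorm(-delta, mean = estimate, sd = se)
}

p_tight <- p_inside_sesoi(estimate = 0.05, se = 0.05, delta = 0.30)
p_wide  <- p_inside_sesoi(estimate = 0.05, se = 0.20, delta = 0.30)
round(c(tight = p_tight, wide = p_wide), 3)
```

With the tight standard error nearly all posterior mass sits inside the band; with the wide one, a noticeable share falls outside it.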

Worked example: a small evaluation, large SE

A workforce training program is evaluated by RCT. Monthly wage difference between treatment and control: $\hat\beta = \$50$ with $SE = \$200$. The standard t-statistic is $0.25$; the p-value is $0.80$. The 95% CI is roughly $[-\$342,\, +\$442]$.

A p-value of $0.80$ feels like “no effect.” But the CI is wide. Is this a precise null, or is the study just uninformative? Let’s see what a Bayes factor says.

Pick a weakly informative prior under $H_1$: $\beta \sim \mathcal{N}(0, \tau^2)$ with $\tau = \$500$ (you think effects up to a few hundred dollars per month are plausible). Under $H_0$, $\beta = 0$. With Gaussian likelihood, there’s a closed-form Bayes factor:

\[ BF_{01} \;=\; \sqrt{1 + \frac{\tau^2}{SE^2}}\;\cdot\; \exp\!\left[-\,\frac{1}{2}\,t^2\cdot\frac{\tau^2}{\tau^2 + SE^2}\right], \]

where $t = \hat\beta/SE$. Plugging in $\tau = 500$, $SE = 200$, $t = 0.25$:

$\tau^2/SE^2 = 250000/40000 = 6.25$, so the first factor is $\sqrt{7.25} \approx 2.69$.
$\tau^2/(\tau^2 + SE^2) = 250000/290000 \approx 0.862$, so the exponent is $-\tfrac{1}{2}(0.0625)(0.862) \approx -0.027$, giving $\exp(-0.027) \approx 0.973$.
$BF_{01} \approx 2.69 \cdot 0.973 \approx 2.6$.

So the data are about 2.6 times more likely under the null than under the alternative. On the Jeffreys scale that’s “anecdotal,” not even moderate evidence. Inconclusive, leaning slightly toward the null. Not a precise null, not a detected effect: underpowered, the same conclusion the eyeball test gives.

Now suppose you ran a larger study and got $\hat\beta = \$50$ with $SE = \$30$ (CI roughly $[-\$9, \$109]$). Same prior, same point estimate. Now $t \approx 1.67$, $\tau^2/SE^2 \approx 278$, and $BF_{01} \approx 16.7 \cdot 0.25 \approx 4.2$. Moderate evidence for the null. Same headline number, much tighter CI, and the Bayes factor moves from “inconclusive” to “the null is winning.” That shift is exactly what the eyeball test was tracking, just quantified.
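The closed-form Bayes factor is short enough to code directly. The sketch below implements the formula above and cross-checks it against the ratio of marginal densities of $\hat\beta$ under the two models; the function names are illustrative assumptions.

```r
# Closed-form BF01 for H0: beta = 0 against H1: beta ~ N(0, tau^2),
# with a Gaussian likelihood for the estimate.
# `bf01_normal` and `bf01_check` are illustrative names.
bf01_normal <- function(estimate, se, tau) {
  t_stat <- estimate / se
  shrink <- tau^2 / (tau^2 + se^2)
  sqrt(1 + tau^2 / se^2) * exp(-0.5 * t_stat^2 * shrink)
}

# Cross-check: the estimate is N(0, se^2) under H0 and
# N(0, se^2 + tau^2) under H1, so BF01 is a ratio of two normal densities.
bf01_check <- function(estimate, se, tau) {
  dnorm(estimate, 0, se) / dnorm(estimate, 0, sqrt(se^2 + tau^2))
}

small_study <- bf01_normal(50, se = 200, tau = 500)
large_study <- bf01_normal(50, se = 30,  tau = 500)
round(c(small_study = small_study, large_study = large_study), 2)
```

Rerunning with a different `tau` is a quick way to see how strongly the result depends on the prior scale.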

Why this matters

Bayes factors handle the inconclusive case better than frequentist equivalence testing: a frequentist failure to reject doesn’t quantify how uninformative the data are, but $BF \approx 1$ does. See Model Comparison for the mechanics and Bayesian Estimation for the posterior framework.

The catch: Lindley’s paradox

Bayes factors depend on the prior you assign to the parameter under $H_1$, and that dependence is sharp. A diffuse prior makes $H_1$ predict the data poorly on average, which shrinks $BF_{10}$ (equivalently, inflates $BF_{01}$) and tilts the comparison toward the null. Push this hard enough and you get a situation where a frequentist test rejects the null while a Bayes factor with a vague prior favors it: Lindley’s paradox (Lindley 1957).

The toy example. Suppose you measure the proportion of male births in a sample of $n = 98{,}000$ newborns against the null hypothesis of $50\%$. The observed proportion is $50.36\%$. With that sample size, the z-statistic is about $2.25$, a p-value of roughly $0.02$. The frequentist verdict: reject the null.

Now compute a Bayes factor against a flat prior on $H_1$ — “the true proportion could be anywhere from 0 to 1 with equal plausibility.” Under this prior, $H_1$ spreads its predictive mass across all possible proportions, most of which are far from $50.36\%$. The Bayes factor turns out to favor the null by a substantial margin. The two procedures disagree on the same data.

Why? The null is a point — “exactly 50%” — so it predicts the observed $50.36\%$ with a definite probability density. The alternative is a distribution. With a wide prior the alternative spreads its mass over a huge range, most of which is far from $50.36\%$. Averaged over the whole range, the alternative model predicts the observed value poorly. So the data look “close enough to 50%” relative to a model that allowed anything. The null wins by being specific.
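The disagreement can be reproduced in a few lines. This sketch uses the rounded figures from the toy example, so the implied count of male births is itself an approximation. It treats the flat prior as a Beta(1, 1), under which every possible count from 0 to $n$ is equally likely a priori.

```r
# Lindley's paradox with the rounded figures from the text.
n     <- 98000
p_hat <- 0.5036
x     <- round(n * p_hat)           # implied number of male births (rounded)

# Frequentist side: z-test of H0: theta = 0.5
z       <- (x / n - 0.5) / sqrt(0.25 / n)
p_value <- 2 * pnorm(-abs(z))

# Bayesian side: H0 puts all mass on theta = 0.5; H1 uses a flat Beta(1, 1) prior.
# Under a flat prior every count 0..n is equally likely, so m1 = 1 / (n + 1).
m0   <- dbinom(x, size = n, prob = 0.5)   # probability of x under H0
m1   <- 1 / (n + 1)                       # marginal probability of x under H1
bf01 <- m0 / m1

c(z = z, p_value = p_value, bf01 = bf01)
```

The p-value falls below 0.05 while `bf01` comes out well above 10, which is the paradox in miniature.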

The takeaway has two parts:

Bayes factors and p-values answer different questions. A p-value asks “if the null were true, how surprising is this result?” A Bayes factor asks “which of two specific models predicts this data better?” When sample size is large, even tiny deviations from the null become very surprising under $H_0$ — so p-values shrink. But the deviation is also small relative to the range the prior allowed, so the Bayes factor favors the null. Different questions, different answers, both legitimate.
Don’t use diffuse priors with Bayes factors. A standard practice is to set the prior under $H_1$ to match what you actually believed about plausible effect sizes before seeing the data — informed by the literature, by physical limits, by past studies. Reporting a Bayes factor without disclosing the prior is incomplete.

What not to do: post-hoc power

A surprisingly common move is to take a nonsignificant result, compute the observed (post-hoc) power using $|\hat\beta|$ as the effect size, and report it as evidence for or against the null. This is mathematically useless. For a fixed significance level, observed power is a one-to-one function of the observed p-value: a small p-value gives high observed power, a large p-value gives low observed power, and so the “observed power” carries no information beyond what the p-value already told you.

Hoenig and Heisey (2001), “The Abuse of Power: The Pervasive Fallacy of Power Calculations for Data Analysis” (The American Statistician), is the canonical takedown. Their point: power calculations belong in study design, with a pre-specified effect size. Computing power after the fact using the observed effect adds nothing.
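To see the one-to-one relationship concretely, the sketch below writes observed power for a two-sided z-test as a function of the p-value alone. The z-test setting and the name `observed_power` are assumptions for illustration; the same logic carries over to other tests.

```r
# Observed ("post-hoc") power for a two-sided z-test, written as a function
# of the p-value alone. Plugging the observed effect back in means the
# assumed true effect (in SE units) is just |z|, which is recoverable from p.
observed_power <- function(p, alpha = 0.05) {
  z_obs  <- qnorm(1 - p / 2)          # |z| implied by the two-sided p-value
  z_crit <- qnorm(1 - alpha / 2)
  pnorm(z_obs - z_crit) + pnorm(-z_obs - z_crit)
}

p_values <- c(0.01, 0.05, 0.20, 0.50, 0.80)
power_table <- data.frame(p = p_values,
                          observed_power = round(observed_power(p_values), 3))
print(power_table)
```

No data enter the function beyond `p`, so the second column is just a relabeling of the first. A result sitting exactly at $p = 0.05$ always has observed power of about one half.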

So if a referee or coauthor asks for “observed power” to interpret a null result, redirect to one of the three legitimate methods above instead.

How to report

A clean nonsignificant result writeup includes four things:

Pre-specified SESOI — what magnitude of effect would have mattered, decided before looking at the data, with a citation or substantive justification.
Point estimate and confidence interval — the actual data summary.
Verdict against the SESOI — either equivalence-test result (TOST p-values) or the visual CI-vs-SESOI comparison, with the conclusion stated explicitly: “we can rule out effects larger than $\delta$” vs. “effects up to $\delta$ remain consistent with our data.”
What it would have taken to be definitive — e.g., “to detect an effect of $\delta$ with 80% power would require $n = N^*$, which is $k\times$ our actual sample.” This puts the underpowered case in context.

That language turns a nonsignificant result from “we don’t know” into one of two informative statements: “the effect, if any, is small” or “we can’t tell from this study.”
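For the fourth item, base R's `power.t.test()` gives the required sample size directly. The SESOI, outcome standard deviation, and actual per-arm sample size below are hypothetical placeholders, and the two-arm design with equal allocation is an assumption.

```r
# Sample size needed to detect an effect of size delta with 80% power.
# delta, sd_y, and n_actual are hypothetical placeholders.
delta    <- 100    # SESOI in outcome units
sd_y     <- 800    # outcome standard deviation
n_actual <- 32     # per-arm sample size of the study that was run

needed <- power.t.test(delta = delta, sd = sd_y, sig.level = 0.05,
                       power = 0.80, type = "two.sample",
                       alternative = "two.sided")
n_star <- as.integer(ceiling(needed$n))   # required per-arm sample size
k      <- n_star / n_actual               # multiple of the actual sample

cat(sprintf("n* = %d per arm, %.1fx the actual sample\n", n_star, k))
```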

Did you know?

The standard test framework treats null and alternative asymmetrically by design. Fisher (1935) was explicit that the null hypothesis is never proved or established, only possibly disproved: a small p-value counts against the null, but a large one is absence of evidence, not evidence of absence. Equivalence testing was developed in the 1970s and 1980s for the bioequivalence question (is generic drug X “as good as” name-brand drug Y?) and only later imported into the social sciences.
The “absence of evidence is not evidence of absence” line is sometimes attributed to Carl Sagan, but the phrasing dates at least to the 1990s in clinical-trial methodology, where it was the warning to FDA reviewers about underpowered equivalence trials.
The 2016 ASA statement on p-values (Wasserstein & Lazar) states explicitly that p-values do not measure the probability that the studied hypothesis is true. Equivalence testing and Bayes factors are the standard fixes for the question p-values can’t answer.