May 7, 2026 · 9 min read

Your 95% Prediction Interval Probably Isn't 95%

And the four methods that fix it — without retraining your forecast model.

Forecasting · Supply Chain · Pharma · Inventory · Risk Management

In One Paragraph

The forecast says 100 ± 30 with 95% confidence. You set the reorder point against that band. Six months later your stockouts and your overage both look about twice what the model predicted. This is miscalibration — the gap between the coverage your intervals claim and the coverage they actually deliver — and on a 100,000-SKU catalogue, every percentage point of miscoverage is roughly $1–4M of inventory. There are four post-hoc calibration methods that fix this without retraining your forecast model. We benchmarked all four on the full Walmart M5 dataset (30,490 SKU-stores). Here's which one to pick and why.

A Point Forecast Is an Opinion. A Prediction Interval Is a Decision.

Every downstream business choice — where to set reorder points, what service level to commit to, when to escalate to expedited freight, how much working capital to tie up in inventory — is sized off the interval, not the point estimate.

The cost of miscalibration is asymmetric and concrete:

  • Under-coverage — the band is too narrow, real demand exceeds the upper bound more often than promised — causes stockouts. On a fast-moving SKU you lose the margin on the missed sale; on a critical input you lose the entire downstream production batch. Service-level commitments break, contractual penalties trigger, buyers stop trusting the forecast, and they start adding their own private buffers on top of yours.
  • Over-coverage — the band is too wide, you set reorder points for a worst case that's far worse than reality — causes inventory bloat. Capital sits in warehouses, working-capital ratios deteriorate, and on perishable or short-life-cycle goods you end up writing it off entirely.

A retail planner running a 95% service level wants intervals that actually cover 95% of the realisations — not 99.97%, not 87%. Each percentage point of miscoverage on a 100k-SKU catalogue is roughly $1–4M of inventory, depending on category margin and capital cost. It's not a rounding error.

The whole job of forecasting in production is producing intervals that mean what they say. Point accuracy matters; interval calibration is what determines whether anyone can act on the forecast safely.

Why Most Prediction Intervals Miss the Mark

Forecasting models almost always make mathematical assumptions to produce intervals — Gaussian residuals, constant variance, stationary patterns — and real-world demand data almost always violates them. A typical ARIMA's "95% interval" routinely covers 87% on weekly SKU sales, or 99.7% on slow-movers where every band is far too wide. The model isn't broken; it's working as designed against an assumption that doesn't hold.

Conformal prediction is the family of post-hoc calibrators that fixes this with a single distribution-free guarantee: feed in any forecast model and a recent calibration window, get out an interval that delivers nominal coverage in expectation — no matter what assumptions the underlying model violated.
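Formally, the guarantee behind split conformal (the simplest member of the family) fits on one line, under the exchangeability assumption the conformal literature relies on:

    P( y_new ∈ Ĉ(x_new) ) ≥ 1 − α

where Ĉ is the calibrated interval and α is the nominal miscoverage rate (0.05 for a 95% band). Note what is not in that statement: no Gaussian residuals, no constant variance, no correctly specified model.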

The Four Methods, in Plain Language

Split conformal

Hold out part of your training set as a calibration window. Measure how far each historical point fell from the forecast. Take the 95th-percentile worst miss, and use that distance as the half-width of every future interval. Distribution-free, dead simple, never under-covers in expectation. The price is that it produces one global half-width — every horizon, every regime gets the same band, regardless of how much harder some moments are to forecast than others.
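A minimal numpy sketch of that recipe, assuming you already have arrays of calibration actuals and point forecasts (the function name and signature are illustrative, not a specific library's API):

```python
import numpy as np

def split_conformal_half_width(y_cal, y_cal_pred, alpha=0.05):
    """Global half-width from a calibration window: the (1 - alpha) quantile of
    absolute forecast misses, with the usual (n + 1) finite-sample correction."""
    scores = np.abs(np.asarray(y_cal) - np.asarray(y_cal_pred))
    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

# Every future interval gets the same band:
#   y_hat - half_width, y_hat + half_width
```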

CQR — Conformalised Quantile Regression

Same idea, smarter scoring. Instead of measuring absolute miss distance, score each calibration point against the base learner's quantile band: the score is negative when the actual fell inside the band, and positive by the size of the overshoot when it fell outside. The 95th percentile of those scores adjusts both bounds simultaneously. When the base learner already produces a sensibly-shaped band, CQR fine-tunes it; when it doesn't, CQR pulls it back. Tighter than split conformal whenever forecast difficulty varies across the data — which is most real-world forecasting.
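A sketch of that scoring, under the same assumptions as above, now with the base learner's lower and upper quantile forecasts for the calibration window:

```python
import numpy as np

def cqr_adjustment(y_cal, lo_cal, hi_cal, alpha=0.05):
    """CQR nonconformity scores: negative if the actual fell inside the base
    learner's band, positive by the size of the overshoot if it fell outside."""
    y, lo, hi = map(np.asarray, (y_cal, lo_cal, hi_cal))
    scores = np.maximum(lo - y, y - hi)
    n = len(scores)
    q_level = min(np.ceil((n + 1) * (1 - alpha)) / n, 1.0)
    return np.quantile(scores, q_level, method="higher")

# Calibrated band: (lo_future - q, hi_future + q). A positive q widens the
# base band, a negative q tightens it.
```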

EnbPI — Ensemble Bootstrap Prediction Intervals

Train K bootstrap-resampled forecasters, then for each calibration point use the leave-one-out ensemble prediction. Score against actuals, take the quantile. No retraining loop, no calibration holdout — the bagging structure is the calibration. Adapts to drift via a sliding residual window without any new model fits. The catch: it needs a bootstrap ensemble to leverage; if you have a single point forecaster, this method has nothing to grip onto.
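A compressed sketch of the idea, using a simplified symmetric-residual variant rather than the full EnbPI machinery; fit_predict is a hypothetical wrapper around whatever base forecaster you bootstrap, and X, y are assumed to be numpy arrays:

```python
import numpy as np

def enbpi_half_width(X, y, fit_predict, n_boot=25, alpha=0.05, seed=0):
    """Simplified EnbPI: bootstrap ensemble plus leave-one-out residuals.
    fit_predict(X_train, y_train, X_eval) returns point predictions for X_eval."""
    rng = np.random.default_rng(seed)
    n = len(y)
    preds = np.empty((n_boot, n))
    in_bag = np.zeros((n_boot, n), dtype=bool)
    for b in range(n_boot):
        idx = rng.integers(0, n, n)                  # bootstrap resample
        in_bag[b, np.unique(idx)] = True
        preds[b] = fit_predict(X[idx], y[idx], X)    # predict on every point
    # Leave-one-out ensemble: for point i, average only models that never saw i.
    loo = np.array([
        preds[~in_bag[:, i], i].mean() if (~in_bag[:, i]).any() else preds[:, i].mean()
        for i in range(n)
    ])
    residuals = np.abs(y - loo)                      # calibration scores, no holdout
    # In production the residual window slides: append the newest residual,
    # drop the oldest, re-take the quantile. No model refits required.
    return np.quantile(residuals, 1 - alpha)
```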

ACI — Adaptive Conformal Inference

Online. Maintain a working miscoverage rate and update it after every new observation: if the most recent actual fell outside the interval, widen the band slightly; if it fell inside, narrow it. Under arbitrary distribution drift — promotions, seasonal turn-on, post-launch ramp — ACI provably converges to the nominal coverage rate without any distributional assumption. The catch: the convergence is online — the benefit only shows up if you're actually streaming new observations through the model in production.
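The update itself is one line; this sketch assumes the standard step-size (gamma) formulation, with the surrounding streaming loop shown only as comments:

```python
import numpy as np

def aci_step(alpha_t, covered, alpha_target=0.05, gamma=0.01):
    """One ACI update: nudge the working miscoverage rate after each observation.
    A miss widens the next interval, a hit narrows it slightly."""
    err = 0.0 if covered else 1.0
    return alpha_t + gamma * (alpha_target - err)

# Streaming loop sketch (scores = recent absolute residuals, maintained elsewhere):
#   q = np.quantile(scores, 1 - np.clip(alpha_t, 0.001, 0.999))
#   lo, hi = y_hat - q, y_hat + q
#   alpha_t = aci_step(alpha_t, lo <= actual <= hi)
```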

The Results

We ran the benchmark on the full M5 retail dataset (30,490 SKU-store series) with a 9-method ensemble and a 95% target coverage. The conformal calibration method was the only thing we varied. Lower MIS is better — it's the standard metric for prediction-interval quality, combining width and miss penalty into a single number.
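For readers who want to recompute the metric: the interval score for one observation is the band width plus a 2/α penalty on the size of any miss. A minimal numpy version, with illustrative names, looks like this:

```python
import numpy as np

def mean_interval_score(y, lo, hi, alpha=0.05):
    """Mean interval score (Winkler score): band width plus a 2/alpha penalty
    for every observation that falls outside the band."""
    y, lo, hi = map(np.asarray, (y, lo, hi))
    width = hi - lo
    below = (lo - y) * (y < lo)      # penalty only when the actual undershoots
    above = (y - hi) * (y > hi)      # penalty only when the actual overshoots
    return float(np.mean(width + (2.0 / alpha) * (below + above)))
```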

Method                      Empirical coverage (target: 95%)    Interval score (MIS, lower is better)
Split (industry default)    99.99%                              49.51
CQR                         97.4%                               12.42
EnbPI                       95.6%                               10.38
ACI                         95.6%                               10.38

The directional story is clean.

Split conformal is unusable here. It lands at 99.99% empirical coverage with an interval-score of 49.5 — nearly 4× wider than CQR for almost no extra calibration safety. On retail data with heavy-tailed residuals, this is the worst of both worlds: enormous capital tied up in inventory against a pessimism the data didn't warrant.

CQR delivers coverage at or above the target with respectable width: 97.4% empirical coverage and much tighter intervals. The asymmetric scoring is naturally robust to heavy tails, because points deep inside the band can't inflate the quantile; only the size of the misses registers. This is the safe production default.

EnbPI and ACI win at the ensemble level. 95.6% empirical coverage — the closest of any method to the 95% target — with the tightest interval score (10.38). On a single-method standalone benchmark these methods showed under-coverage; the ensemble is what gets EnbPI and ACI to nominal coverage at production scale.

Which One to Pick

  • For a static-batch ensemble forecaster on real demand data, use CQR or EnbPI. CQR is the safer default — it slightly over-covers (97.4% vs 95% target), trading a small amount of inventory cost for a strong "never under-cover" guarantee. EnbPI lands closer to nominal at a tighter width and is the right pick when calibration accuracy matters more than safety margin.
  • Reach for ACI when you're streaming and care about long-run coverage under drift — promotion regimes, seasonal turn-on, post-launch ramp. The online update is what makes ACI work; without a streaming pipeline, it's static split conformal under a different name.
  • Split conformal is the safe fallback when you have no quantile-producing base learner, no bootstrap ensemble, and no stream. On heavy-tailed data (anything intermittent) it will over-cover heavily, so reach for CQR as soon as the base model can produce a quantile band, and for EnbPI as soon as you have a bootstrap ensemble.

Why This Matters in Production

A 95% interval that delivers 99.99% coverage isn't "extra safety" — it's a band that's wider than necessary, paid for in inventory, paid for in capital tied up against demand that won't materialise. The split-conformal interval score of 49.5 versus EnbPI's 10.38 isn't just an academic ratio: it's roughly 4–5× the inventory to deliver the same service level, sized against essentially the same demand realisation. That's the difference between a working-capital ratio executives can defend and one they can't.

A 95% interval that delivers 91% coverage isn't "close enough" — it's a band that misses the actual four percentage points more often than it's allowed to, which on a million-unit catalogue is forty thousand stockouts a year more than your model promised. Service-level commitments break in slow motion. Buyers stop trusting the forecast and add their own buffer on top of yours.

The math behind these methods is from 2019–2021 — recent enough that most production stacks haven't caught up, established enough that there's no implementation risk. The infrastructure to apply them on your actual catalogue is now in anofox-forecast. End-to-end on the full 30,490-leaf M5 with conformal calibration enabled: 40 minutes cold-cache on a single machine; seconds on a warm one. The same approach scales to the catalogue sizes we see in real engagements (50k–500k+ active SKUs).

If your prediction intervals don't cover at the rate they claim, your reorder points are calibrated to fiction. If they cover at triple the rate they claim, your working capital is paying for it.

Audit Your Intervals

The simplest diagnostic: take the last six months of forecasts your team produced, count how often the actual fell inside the 95% band you reported, and compare to 95%. If the answer is materially different, your reorder points are sized wrong — usually too high.
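That audit is a few lines of numpy, assuming you can pull the reported bounds and the realised actuals into arrays (names here are illustrative):

```python
import numpy as np

def empirical_coverage(actuals, lower, upper):
    """Share of realised demand points that landed inside the reported band."""
    y, lo, hi = map(np.asarray, (actuals, lower, upper))
    return float(np.mean((y >= lo) & (y <= hi)))

# empirical_coverage(last_6_months_actuals, reported_lo, reported_hi) vs. 0.95
```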

A typical engagement looks like: empirical coverage audit on your existing pipeline (1 week), a backtest with the four conformal methods so you can quantify the lift before committing (2–3 weeks), then production rollout with the right method for your data shape (3–6 weeks).

Book a 30-min Call

— Simon

Need help implementing this?

Dr. Simon Müller builds production forecasting systems for manufacturing and pharma companies. If your team is dealing with the challenges described in this article, let's talk.

