When the Forecast Pyramid Lies — and How MinT Stops It at 30,000 Leaves
Why your category total never matches the sum of your SKU forecasts — and what to do about it.
In One Paragraph
Your category total never matches the sum of your SKU forecasts. Your buyers add safety stock to absorb the gap. Your planners spend Mondays reconciling spreadsheets by hand. Your CFO has stopped trusting the demand dashboard because the numbers don't tie out. There is a 40-year-old statistical method that fixes this — it makes every level of your forecast pyramid add up by construction, while improving accuracy at the levels executives report on by 6–9%. Until now it was considered too computationally expensive for real catalogues. We made it run in 25 minutes on a laptop, on the full Walmart M5 dataset.
The Monday Morning Problem
If your company runs a multi-level demand plan, you've seen this:
- The CFO's dashboard shows a total revenue forecast.
- The category plan shows a different number.
- The store-level rollup shows a third.
- The SKU-store buyers' orders, summed up, show a fourth.
All four were generated by different teams, with different models, on different cadences. None of them agree. Every Monday, someone has to stitch them back together — usually by overriding the SKU buyers, or padding the category plan with an "unallocated" bucket nobody owns.
The variance between levels doesn't disappear. It gets absorbed somewhere. Three places, in our experience:
- Excess safety stock. Buyers can't reconcile the numbers, so they buffer. We've seen safety stock sit at 2–3× the statistical optimum in companies that don't reconcile properly.
- Manual reconciliation labour. Planners and analysts spend 4–8 hours a week aligning category plans to SKU rollups in Excel. At 30+ planners across a global organisation, that's multiple FTEs' worth of work going to spreadsheet plumbing.
- Eroded executive trust. The CFO learns not to trust the demand dashboard because the numbers don't match the budget. So they ask for a manual update. Which takes another two days. Which is wrong by Friday.
Why the Standard Fixes Don't Work
There are three textbook approaches to making a forecast pyramid add up. They all have the same flaw: they pick one level to be right, and force every other level to live with that choice.
- Bottom-up. Forecast every SKU-store, sum upwards. The leaf forecasts are good, but the errors from 30,000 individual SKUs accumulate upward: random noise only partially cancels, and any shared bias adds up in full, so your category total is worse than if you'd forecast the category directly. The boardroom pays for the buyer's noise.
- Top-down. Forecast the total, divide it by historical proportions. The total is good. But every leaf is just a fraction of one shared forecast — useless for products with promotions, stockouts, or new store openings. The buyer pays for the boardroom's blind spot.
- Middle-out. Pick a middle level and split the difference. Better than top-down at the leaves, better than bottom-up at the top, worse than both at the level you didn't pick. Someone always loses.
If you've ever asked "why doesn't our SKU plan match our category plan" and been told "different models, different bias, that's just how it is" — this is what they were describing.
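The trade-off above can be made concrete with a toy simulation. This is an illustrative sketch with invented numbers, not data from the benchmark below: it shows a small shared per-SKU bias surviving the bottom-up sum in full, and a top-down split that ties out at the top while knowing nothing about individual leaves.

```python
import numpy as np

# Illustrative only: why bottom-up and top-down each fail at the
# level they didn't prioritise. All numbers are made up.
rng = np.random.default_rng(0)
n_leaves = 1_000
true_leaf, true_total = 10.0, 10.0 * 1_000

# Bottom-up: 1,000 noisy leaf forecasts summed to a total. Independent
# noise partially cancels in the sum, but a small shared bias
# (0.2 units per SKU) survives in full: 1,000 x 0.2 = 200 units of
# systematic error at the top.
leaf_fc = true_leaf + 0.2 + rng.normal(0.0, 3.0, n_leaves)
bottom_up_total = leaf_fc.sum()

# Top-down: one accurate total, split by historical proportions that
# know nothing about promotions or stockouts at individual leaves.
total_fc = true_total + rng.normal(0.0, 30.0)
top_down_leaves = total_fc * rng.dirichlet(np.ones(n_leaves))

# Top-down ties out at the top by construction; its damage is at the leaves.
assert np.isclose(top_down_leaves.sum(), total_fc)
```

Run it a few times with different seeds: the bottom-up total carries the accumulated bias, while the top-down leaves are generic fractions of one number. Each method exports its weakness to the other end of the pyramid.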
A Method That Doesn't Pick a Loser
MinT (Minimum-Trace reconciliation) is a statistical technique that fixes this. The intuition is simple: forecast every level independently with whichever model works best at that level, then compute the unique adjustment that (a) makes everything add up across the pyramid and (b) does the least possible damage to forecast accuracy in the process.
In plain language: the top-level forecast informs the leaves, and the leaf forecasts inform the top, weighted by how reliable each one is. Levels that are noisy (the leaves) borrow stability from levels that aren't (the top). Levels that are blind to local detail (the top) borrow specificity from levels that aren't (the leaves). Every level keeps something. Nothing has to be sacrificed.
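That weighted adjustment can be sketched in a few lines. This is a minimal illustration on a toy three-series hierarchy (one total, two leaves) using a diagonal error-covariance estimate; full MinT estimates and shrinks a dense covariance matrix, but the diagonal special case (often called variance scaling) keeps the example readable. All numbers and names are invented for illustration.

```python
import numpy as np

# Toy hierarchy: total = leaf_A + leaf_B.
# S maps the 2 leaf series to all 3 series (total, A, B).
S = np.array([[1.0, 1.0],
              [1.0, 0.0],
              [0.0, 1.0]])

# Independent base forecasts per level. Note they do NOT add up:
y_hat = np.array([100.0, 55.0, 52.0])   # total, A, B (55 + 52 = 107 != 100)

# Estimated base-error variances: the leaves are noisier than the total,
# so the total gets more say in the reconciliation.
W_inv = np.diag(1.0 / np.array([1.0, 4.0, 4.0]))

# MinT projection: y_tilde = S (S' W^-1 S)^-1 S' W^-1 y_hat
G = np.linalg.solve(S.T @ W_inv @ S, S.T @ W_inv)
y_tilde = S @ G @ y_hat

# Reconciled forecasts add up by construction,
# and the total lands between its base forecast (100) and the leaf sum (107).
assert np.isclose(y_tilde[0], y_tilde[1] + y_tilde[2])
```

Every series moves a little, weighted by how much its base forecast can be trusted; no level is frozen and no level is discarded.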
The mathematics has been understood for decades: hierarchical reconciliation goes back to the 1980s, and the minimum-trace formulation was established in the 2010s. The reason your planning system probably doesn't use it is that the conventional implementation breaks down past a few thousand leaves — exactly where most real catalogues start.
The Technical Achievement: 30,000 Leaves in 25 Minutes
We rewrote the MinT solver from scratch using algorithms that exploit the tree structure of the hierarchy itself. The full reconciliation pass on the Walmart M5 dataset — 30,490 SKU-store combinations across 12 levels of aggregation — now runs in seconds per planning cycle.
The end-to-end pipeline (base forecasting + meta-learner blending + MinT reconciliation across 12 levels) completes in 25 minutes on a single machine. The same approach handles the catalogue sizes we see in real retail and pharmaceutical engagements — 50,000 to 500,000+ active leaves, monthly or weekly cycles, on infrastructure most companies already own.
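We won't reproduce the solver here, but the core idea can be sketched: the summing matrix of a hierarchy is extremely sparse, so aggregation is a single traversal of the tree rather than a dense matrix multiply. A toy roll-up, with invented node names, under that assumption:

```python
# Illustrative sketch: aggregating leaf values up a hierarchy in O(nodes)
# by walking the tree, instead of multiplying by a dense summing matrix.
# Node names are invented for illustration.
children = {
    "total": ["store_1", "store_2"],
    "store_1": ["sku_a@1", "sku_b@1"],
    "store_2": ["sku_a@2", "sku_b@2"],
}
leaf_values = {"sku_a@1": 5.0, "sku_b@1": 3.0, "sku_a@2": 4.0, "sku_b@2": 6.0}

def rollup(node: str) -> float:
    """Post-order traversal: each node's value is the sum of its leaves."""
    if node not in children:          # leaf node
        return leaf_values[node]
    return sum(rollup(c) for c in children[node])

assert rollup("total") == 18.0
```

A dense summing matrix for 30,490 leaves and 12 levels would have hundreds of millions of mostly-zero entries; the tree walk touches each node once. The same structural sparsity is what makes the reconciliation step itself tractable at catalogue scale.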
The Results
We benchmarked four reconciliation methods on the full M5 retail dataset, holding the underlying base forecasts constant; the table below compares the two that matter in practice — the industry default (bottom-up) and MinT. The only thing that changed between runs was the reconciliation step. Lower error scores are better.
| Level | Bottom-up (industry default) | MinT | Improvement |
|---|---|---|---|
| Total revenue (CFO view) | 0.752 | 0.684 | −9.0% |
| By region | 0.761 | 0.700 | −8.0% |
| By store | 0.795 | 0.738 | −7.2% |
| By category | 0.761 | 0.698 | −8.3% |
| By department | 0.823 | 0.759 | −7.7% |
| Region × category | 0.776 | 0.722 | −6.9% |
| Store × department | 0.841 | 0.798 | −5.1% |
| Item-level (buyer view) | 0.841 | 0.837 | ≈ 0% |
| SKU × store (operational view) | 0.843 | 0.842 | ≈ 0% |
The pattern is clear: significant improvement at every level executives and category managers report on, with no degradation at the SKU-store level your buyers act on. The leaf-level forecasts stay as accurate as the industry default; the rollups your management team uses get materially tighter.
What This Means in Practice
The numbers above translate to operational outcomes that decision-makers care about:
- Your CFO's dashboard reconciles to your category plan reconciles to your buyers' orders. Without anyone overriding anything. Without an "unallocated" bucket. The numbers add up because they were computed to add up.
- Monday-morning reconciliation work goes away. The planners who spend hours weekly aligning category and SKU plans get those hours back for actual analysis.
- Safety stock buffers can come down. The variance the buffer was absorbing came from reconciliation error, not real demand uncertainty. Real customers want what real demand says they want — not the spread between four conflicting forecasts.
- Executive trust returns. When the demand dashboard, the budget, and the operational plan tell the same story, planning meetings become decisions instead of debates about whose number is right.
If This Sounds Like Your Pyramid
Most companies have lived with reconciliation gaps for so long that they've stopped noticing the cost. The first sign it's worth investigating: ask your supply-chain lead "do our category totals match the sum of our SKU forecasts on Monday morning?" — and watch how long the pause is before the answer.
A typical engagement looks like: hierarchy mapping (1 week), backtest against your existing pipeline (2–3 weeks) so you can quantify the lift before committing, then production rollout (3–6 weeks). The implementation is in anofox-forecast and scales to catalogue sizes from 10,000 to 500,000+ leaves.
— Simon
Need help implementing this?
Dr. Simon Müller builds production forecasting systems for manufacturing and pharma companies. If your team is dealing with the challenges described in this article, let's talk.