March 30, 2026 · 10 min read

Wielding Occam's Razor

Why Simpler Forecasting Models Beat Complex Ones — and Save Millions

Forecasting · Supply Chain · Retail · Cost Optimization

Summary for Decision Makers

The Finding: A research team from the University of Bath, University of Virginia, and the National Technical University of Athens tested whether large, complex model pools actually produce better forecasts than small, carefully chosen ones. They don't. A reduced pool of 8 exponential smoothing models outperforms the full pool of 19 — both in accuracy and in uncertainty estimation.

Why It Matters: Large retailers forecast demand for millions of SKU-store combinations every week. Every unnecessary model burns compute time and money. For a Walmart-scale operation (1 billion series), switching from a full to a reduced model pool saves $2,600 per week — over $135,000 annually on weekly cycles, and up to $1.5M on daily cycles.

The Surprising Result: The reduced model pool doesn't just save money — it actually produces better forecasts. Complex models (especially those with multiplicative trends) generate explosive, unrealistic predictions that degrade overall accuracy. Removing them improves everything.

The Bottom Line: If your forecasting system runs 19+ exponential smoothing variants or ARIMA models with high orders, you're likely spending more and getting less. Simplify your model pool. The math is clear: fewer models, better results, lower costs.

The Research

The paper "Wielding Occam's Razor: Fast and Frugal Retail Forecasting" by Fotios Petropoulos, Yael Grushka-Cockayne, Enno Siemsen, and Evangelos Spiliotis (2023) asks a deceptively simple question:

"Can we maintain the accuracy of statistical forecasts while reducing the cost required to generate them?"

The answer, tested across nearly 50,000 time series from the M, M3, and M4 competitions plus 30,000 Walmart SKU-store combinations, is an emphatic yes.

Why More Models ≠ Better Forecasts

The conventional wisdom in demand forecasting is "more models, better selection." If you run 19 exponential smoothing variants and let AICc pick the winner, you should get the best possible fit. The paper demonstrates this logic is flawed for three reasons:

  1. Multiplicative trend models produce explosive forecasts.

    Four of the 19 ETS variants use multiplicative trends. These are frequently selected by information criteria because they fit in-sample data well — but they generate unrealistic, exponentially growing predictions at longer horizons. These "ticking time bombs" drag down the entire pool's out-of-sample accuracy.

  2. Overfitting masquerades as better fit.

    When expanding from 8 to 19 models, ~20% of series previously fitted with simple level-only models get reassigned to trended models. These reassignments look better in-sample but perform worse out-of-sample. The simpler model was right all along.

  3. Computational cost scales non-linearly.

    The 11 additional models don't just add 60% more compute time — they add 247% more cost, primarily because multiplicative error models require simulation for prediction intervals rather than closed-form solutions.
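The pruning step behind point 1 can be sketched in a few lines. This is an illustrative, non-seasonal pool with hypothetical model codes (error + trend + seasonality, following the ETS naming convention), not the paper's exact 19-model pool or its code:

```python
# Illustrative ETS pool; codes read error + trend + seasonality,
# e.g. "AMdN" = additive error, damped multiplicative trend, no seasonality.
FULL_POOL = [
    "ANN", "AAN", "AAdN", "AMN", "AMdN",
    "MNN", "MAN", "MAdN", "MMN", "MMdN",
]

def trend_of(code):
    """The trend component is whatever sits between error and seasonality."""
    return code[1:-1]

def prune_multiplicative_trends(pool):
    """Drop the 'ticking time bomb' variants whose trend is M or Md."""
    return [m for m in pool if trend_of(m) not in ("M", "Md")]

SAFE_POOL = prune_multiplicative_trends(FULL_POOL)
```

In this toy pool the filter removes the four multiplicative-trend variants before any information-criterion selection runs; the paper's reduced pool goes further and also drops undamped trends.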

The Evidence: Exponential Smoothing

The authors tested five ETS pools of increasing size across nearly 50,000 series:

Pool | Models | MASE | MSIS | Cost (sec/series)
---- | ------ | ---- | ---- | -----------------
Reduced | 8 | 0.942 | 8.201 | 0.365
Default | 14 | 0.947 | 8.378 | 0.793
All models | 19 | 0.953 | 8.564 | 1.267

The reduced pool wins on every metric: best accuracy (lowest MASE), best uncertainty estimation (lowest MSIS), and lowest computational cost, while using fewer than half as many models.
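For readers unfamiliar with the accuracy metric above, here is a minimal sketch of MASE as standardly defined in the M competitions (forecast MAE scaled by the in-sample MAE of the seasonal naive method); this is an assumption about the metric's definition, not the authors' evaluation code:

```python
def mase(train, actual, forecast, m=1):
    """Mean Absolute Scaled Error.

    Out-of-sample MAE divided by the in-sample MAE of the naive
    method with seasonal period m. Values below 1 mean the forecast
    beats the naive benchmark's in-sample steps.
    """
    n = len(train)
    scale = sum(abs(train[t] - train[t - m]) for t in range(m, n)) / (n - m)
    mae = sum(abs(a - f) for a, f in zip(actual, forecast)) / len(actual)
    return mae / scale

# A perfect forecast scores 0; errors the size of naive steps score 1.
print(mase([1, 2, 3, 4, 5], [6, 7], [6, 7]))  # 0.0
print(mase([1, 2, 3, 4, 5], [6, 8], [5, 7]))  # 1.0
```

RMSSE, used later for the Walmart data, is the analogous scaled metric built on squared errors.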

The Evidence: ARIMA

For ARIMA, the authors varied the maximum order K (controlling model complexity). The sweet spot is K=4:

  • K=4: Best accuracy (MASE 0.931) at 4.28 sec/series
  • K=8: Worst accuracy despite being 300x slower than K=3
  • Going from K=1 to K=4 improves accuracy by 3.2% — but at a 15,756% cost increase

The diminishing returns are dramatic. Beyond K=4, you're paying exponentially more for worse results.
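A simplified way to see why cost explodes with K: if each of the non-seasonal orders (p, q) and seasonal orders (P, Q) ranges over 0..K, the candidate space grows as a fourth power. This sketch only counts combinations; real auto-ARIMA implementations use stepwise search, and each higher-order fit is itself costlier, so actual cost grows even faster than this count suggests:

```python
def candidate_count(K, seasonal=True):
    """Number of (p, q, P, Q) order combinations with each order in 0..K
    (differencing orders d, D assumed fixed). Non-seasonal: (p, q) only."""
    per_order = K + 1
    return per_order ** (4 if seasonal else 2)

for K in (1, 3, 4, 8):
    print(f"K={K}: {candidate_count(K)} seasonal candidates")
```

Going from K=1 (16 candidates) to K=8 (6,561 candidates) multiplies the search space by more than 400x, which is consistent in spirit with the steep cost figures reported above.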

Machine Learning Is Not the Answer Either

The paper also benchmarked LightGBM against the statistical methods:

  • LightGBM achieved marginally better accuracy (0.921 MASE) on competition data
  • But required 7.724 seconds per series — 21x slower than reduced ETS
  • On Walmart retail data, reduced ETS actually beat LightGBM in accuracy (0.714 vs 0.717 RMSSE) while being 41% cheaper

The takeaway: for large-scale retail forecasting, well-chosen statistical models outperform ML on both accuracy and cost.

The Real-World Impact

For a Walmart-scale retailer forecasting 1 billion SKU-store combinations:

  • Weekly cycles: $135,000 – $210,000 in annual savings
  • Daily cycles: $949,000 – $1,478,000 in annual savings

Environmental impact (extrapolated to 40 billion series with daily forecasts):

  • Annual CO₂ reduction: 108,286 tonnes
  • Equivalent to 89,000 vehicles' yearly emissions
  • Equivalent to 3.2 million trees' annual CO₂ absorption

Practical Recommendations

The authors provide clear, actionable guidance:

  1. Exclude multiplicative trend ETS models. They generate explosive forecasts and drag down accuracy. The four variants with multiplicative trends should never be in your production pool.
  2. Restrict ARIMA maximum orders to K≤4. Higher orders increase complexity exponentially with diminishing (or negative) returns. Seasonal parameters (P, Q) are the main cost driver.
  3. Prioritise damped trends. When trend components are needed, damped specifications consistently outperform their undamped counterparts.
  4. Don't default to ML for scale. At retail scale, reduced statistical models match or beat LightGBM in accuracy at a fraction of the cost.

Our Benchmark: Rust Confirms the Findings

We replicated the paper's methodology with our own Rust-native implementation (full benchmark), running the complete M5 dataset — 30,490 item-store combinations with 1,941 days of daily Walmart sales data and a 28-day forecast horizon.

Accuracy: Virtually Identical

Metric | Complete (19 models) | Reduced (8 models) | Difference
------ | -------------------- | ------------------ | ----------
Avg RMSE | 1.4332 | 1.4331 | -0.007%
Median RMSE | 0.9257 | 0.9257 | 0.0000
Avg MAPE | 63.72% | 63.70% | -0.02 pp
Success rate | 100% | 100% | —

The difference is statistically negligible. The reduced pool is not a compromise — it is the same result with half the work.

Speed: 2x Faster

Metric | Complete | Reduced | Speedup
------ | -------- | ------- | -------
Wall-clock time | 370.8 s | 184.9 s | 2.01x
CPU time / series | 250.7 ms | 121.4 ms | 2.06x
Throughput | 82 series/s | 165 series/s | 2.01x

At 165 series per second, the reduced pool processes all 30,490 Walmart series in just over 3 minutes. The full pool needs more than 6 minutes for identical accuracy.

What the Models Choose

The model selection distribution reveals why the reduced pool works. Over 90% of series select just two models:

  • ETS(A,N,A) — seasonal, no trend: 70.1% of series (reduced pool)
  • ETS(A,N,N) — simple exponential smoothing: 23.8%
  • The remaining 6.1% split between damped trend variants

The 11 models removed from the complete pool were almost never the right choice — and when they were selected (e.g., undamped trends), they hurt out-of-sample accuracy. The reduced pool's design uses damped trends exclusively and balances additive/multiplicative error types with 2-model coverage per demand profile.

Full benchmark results and methodology on GitHub

How This Connects to Our Work

The principles from this paper are directly embedded in our tools. The anofox-forecast DuckDB extension implements 32 forecasting models — but the key insight is that you don't run all of them. Intelligent model selection, informed by the statistical properties identified through anofox-statistics (demand classification, distribution fitting), ensures each SKU gets the right model — not every model.

fast_forecast.sql

```sql
-- Fast & frugal: forecast 10,000 products
-- with automatically selected models
SELECT * FROM ts_forecast_by(
  'sales', item_id, date, quantity,
  'AutoETS', 12, '1M',
  MAP{'seasonal_period': '12'}
);

Try our live forecasting demo to see this in action — running entirely in your browser via WebAssembly.

Citation

Petropoulos, F., Grushka-Cockayne, Y., Siemsen, E., & Spiliotis, E. (2023). Wielding Occam's Razor: Fast and Frugal Retail Forecasting. arXiv preprint arXiv:2102.13209v4. doi:10.48550/arXiv.2102.13209

Need help implementing this?

Dr. Simon Müller builds production forecasting systems for manufacturing and pharma companies. If your team is dealing with the challenges described in this article, let's talk.
