
The Three Metrics That Actually Tell You If Your Demand Forecast Is Working


Every demand planning team we talk to tracks MAPE. It's the default, the metric that ships with every forecasting tool, the number that ends up in the quarterly review deck. And it's not wrong, exactly — it does tell you something about average forecast error. The problem is what it doesn't tell you: whether that error is going to cost you a stockout, a write-down, or a missed promotional window.

We built Automcore specifically because mid-market inventory teams need signals that map to purchasing decisions, not just forecast validation reports. After a year of working through how to surface the right information to planners, we've converged on three metrics that together tell a coherent story about forecast quality. MAPE is one component of one of them. By itself, it's not enough.

Why MAPE Flatters Bad Forecasts

MAPE — Mean Absolute Percentage Error — averages error magnitude across all your SKUs and periods. The structural problem is that it treats a 20% overforecast identically to a 20% underforecast. From a purchasing standpoint, those two errors have completely different consequences and completely different costs.

A 20% overforecast means you bought inventory you don't need. You're carrying capital, warehouse space, and insurance on product that may end up in a clearance bin or obsolescence write-down. A 20% underforecast means you ran out. You missed sales, frustrated customers, and potentially handed a competitor a trial opportunity.

The second problem: MAPE inflates on low-volume SKUs. If you forecast 5 units and sell 4, that's a 25% error, because percentage error is measured against actual demand. If you forecast 500 units and sell 400, that's also a 25% error, but the second miss is 100x larger in units, cost, and operational impact. An unweighted average MAPE across all SKUs treats them identically.

The third problem is what statisticians call the "zero demand issue." MAPE is undefined when actual demand is zero. Many planning systems silently exclude these periods from the average, which systematically inflates how good your forecast looks during low-demand windows like post-holiday or product end-of-life.
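To make the symmetry and zero-demand problems concrete, here is a minimal Python sketch. The `mape` helper and the sample numbers are our own illustration, not any particular tool's implementation, and the helper deliberately drops zero-demand periods the way many planning systems do.

```python
def mape(forecasts, actuals):
    """Mean absolute percentage error, skipping periods with zero actual demand."""
    pairs = [(f, a) for f, a in zip(forecasts, actuals) if a != 0]
    return 100 * sum(abs(f - a) / a for f, a in pairs) / len(pairs)

# Over- and underforecasting by the same amount score identically.
over = mape([120], [100])
under = mape([80], [100])
print(over, under)

# A low-volume miss and a high-volume miss also score identically,
# even though one is off by 2 units and the other by 200.
small = mape([12], [10])
large = mape([1200], [1000])
print(small, large)

# The second period (forecast 30, actual 0) is silently dropped,
# so a 30-unit overforecast never shows up in the average.
with_zero = mape([110, 30], [100, 0])
print(with_zero)
```

The direction of the error, the unit size of the miss, and the zero-demand period all vanish from the final number.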

Metric 1: Bias-Weighted Error by SKU Tier

The first metric we track is forecast bias — the signed version of error — segmented by your ABC inventory classification. Bias tells you whether your model consistently over- or underforecasts, and which inventory tier is bearing that error.

The formula is simple: sum of (Forecast minus Actual) divided by sum of Actual, expressed as a percentage, calculated separately for A-tier, B-tier, and C-tier SKUs. A positive bias means you're consistently overforecasting; negative means you're underforecasting.
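As a rough sketch of that formula in Python: the `bias_by_tier` function, the `(tier, forecast, actual)` row layout, and the sample numbers are assumptions made for illustration, not Automcore's actual implementation.

```python
from collections import defaultdict

def bias_by_tier(rows):
    """rows: iterable of (tier, forecast, actual) tuples.

    Returns {tier: bias %}, where bias = sum(forecast - actual) / sum(actual).
    Tiers with zero total actual demand are skipped.
    """
    forecast_sum = defaultdict(float)
    actual_sum = defaultdict(float)
    for tier, forecast, actual in rows:
        forecast_sum[tier] += forecast
        actual_sum[tier] += actual
    return {
        tier: 100 * (forecast_sum[tier] - actual_sum[tier]) / actual_sum[tier]
        for tier in actual_sum
        if actual_sum[tier] != 0
    }

# Hypothetical sample data: A-tier is overforecast, C-tier is underforecast.
rows = [
    ("A", 118, 100),
    ("A", 236, 200),
    ("C", 8, 10),
    ("C", 9, 10),
]
bias = bias_by_tier(rows)
print(bias)  # positive = overforecasting, negative = underforecasting
```

Summing before dividing keeps high-volume SKUs from being drowned out by low-volume ones within a tier, which is the point of using this form rather than averaging per-SKU percentages.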

What matters operationally: you can tolerate different bias levels in different tiers. A-tier SKUs, roughly the top 20% of your SKUs ranked by revenue contribution, should be held to very tight bias tolerances, typically within ±5%. The cost of being wrong on A-tier items is highest. C-tier SKUs, which might represent 50% of your SKU count but only 5% of revenue, can carry higher error without significant financial consequence.

When we first built this view for one of our early users — a housewares distributor with about 2,400 active SKUs — they found their A-tier bias was running at +18%. They'd been consistently overordering their best-selling products for months, which explained a dead stock accumulation problem that had been attributed to "weak Q4 demand." The forecast wasn't wrong about total volume; it was directionally wrong in a way that MAPE masked.

Metric 2: Forecast Value Add (FVA)

FVA answers a question that almost no one thinks to ask: is your sophisticated forecasting model actually better than doing nothing?

The benchmark comparison is a naive model, typically last period's actuals or a simple moving average. FVA is the naive baseline's error minus your model's error, measured with the same error metric over the same periods. Positive FVA means your model beats the naive baseline. Negative FVA, which is more common than you'd think, means your model is actively making things worse.
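Here is one way the comparison could be sketched in Python. Two assumptions are ours: the naive baseline is last period's actual, and error is measured as total absolute error as a percentage of total actual demand. The function names are hypothetical.

```python
def pct_abs_error(forecasts, actuals):
    """Total absolute error as a percentage of total actual demand."""
    total_error = sum(abs(f - a) for f, a in zip(forecasts, actuals))
    return 100 * total_error / sum(actuals)

def forecast_value_add(model_forecasts, actuals):
    """FVA in percentage points: naive error minus model error.

    The first period is dropped from scoring because the naive
    forecast needs a prior period's actual.
    """
    naive_forecasts = actuals[:-1]      # last period's actual, shifted forward
    scored_actuals = actuals[1:]
    scored_model = model_forecasts[1:]
    naive_error = pct_abs_error(naive_forecasts, scored_actuals)
    model_error = pct_abs_error(scored_model, scored_actuals)
    return naive_error - model_error

# Hypothetical demand history and two candidate models.
actuals = [100, 110, 90, 120]
helpful_model = [100, 105, 95, 110]
harmful_model = [100, 140, 60, 150]

fva_helpful = forecast_value_add(helpful_model, actuals)
fva_harmful = forecast_value_add(harmful_model, actuals)
print(fva_helpful, fva_harmful)  # positive beats naive, negative is worse than naive
```

Whatever error measure you choose, the important part is that the model and the baseline are scored on exactly the same periods.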

This matters because many forecasting systems add complexity without adding accuracy. Elaborate seasonal adjustment algorithms, promotion lift factors, and trend smoothing all sound like improvements. But if they're adding noise rather than signal, you'd be better off with a 4-week moving average. FVA forces that question into the open.

We're not saying sophisticated models are bad. We're saying that without FVA measurement, you can't tell if the complexity is earning its keep. A model that adds 8-12% FVA against a naive baseline is genuinely helping your team. A model with negative FVA is a liability dressed up in dashboard aesthetics.

Track FVA at the aggregate level and also by product category. In our experience, models often add value on certain product types (stable, high-velocity SKUs with clear seasonality) and subtract value on others (new products, irregular promotional items, items with lumpy demand). Category-level FVA tells you where to trust your model and where to put a human in the loop.

Metric 3: Service Level Deviation Against Cycle Service Target

This is the metric that closes the loop between forecast quality and inventory outcomes. Your safety stock calculation has a target cycle service level baked in — typically 95% or 98% for A-tier items. Service Level Deviation measures whether you're actually achieving that target in practice.

The reason this belongs in forecast accuracy measurement is that stockouts and overstock both trace back to forecast error. If your service level is running 4-6 percentage points below target, you have a forecast problem (or a safety stock parameter problem, which is downstream of forecast error). If your service level is consistently above target but your inventory turns are poor, you're over-provisioning, which is also a forecast signal.
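A minimal sketch of the calculation, assuming cycle service level is measured as the share of replenishment cycles that finished without a stockout. The function name and the per-cycle boolean input are hypothetical.

```python
def service_level_deviation(cycle_had_stockout, target_pct):
    """Achieved cycle service level minus target, in percentage points.

    cycle_had_stockout: list of bools, one per replenishment cycle,
    True if that cycle experienced a stockout.
    """
    cycles = len(cycle_had_stockout)
    clean_cycles = cycles - sum(cycle_had_stockout)
    achieved_pct = 100 * clean_cycles / cycles
    return achieved_pct - target_pct

# Hypothetical SKU: 20 replenishment cycles, 2 of them with a stockout,
# measured against a 95% target.
history = [False] * 18 + [True] * 2
deviation = service_level_deviation(history, target_pct=95)
print(deviation)  # negative = below target, positive = above target
```

A negative result means you are missing the target. A persistently positive result alongside poor inventory turns points to over-provisioning.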

What makes this metric actionable is pairing it with which SKUs are driving the deviation. Service level deviation at the aggregate level tells you there's a problem. Breaking it down by supplier lead time bucket, by category, and by location tells you where the forecast model needs tuning or where you need revised safety stock parameters.

One pattern we see repeatedly: service level deviation spikes in the 6-8 weeks following a supplier lead time change. The forecast model is using historical lead time parameters that no longer apply. The fix is lead time re-parameterization, not forecasting improvement — but you'd never know that without decomposing the service level deviation data.

How These Three Metrics Work Together

Bias-weighted error tells you if your model is directionally wrong. FVA tells you if your model is adding value relative to doing nothing. Service level deviation closes the loop to inventory outcomes.

Run all three in the same weekly review and patterns emerge quickly. High bias on A-tier SKUs + low FVA in a category + service level deviation on that supplier's items is a clear diagnostic signal: the model isn't learning the right patterns for those SKUs, probably because the training data has a structural problem (lead time changes, new supplier, promotional periods in the history that aren't flagged).

That diagnostic specificity is what MAPE alone can't give you. MAPE might be holding steady at 22% while those three metrics flash warning signs on a SKU cluster that represents 40% of your margin. The average obscures the distribution.

Practical Implementation Notes

If you're building this measurement framework on top of an existing planning system, a few things to watch:

First, decide upfront how you handle new SKUs and promotional periods in your error calculations. New SKUs shouldn't be included in FVA calculations until they have at least 8-12 weeks of history. Promotional periods where you've overridden the model forecast should be excluded from model accuracy metrics (you're measuring human judgment, not model performance, in those windows).
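Those exclusion rules can be expressed as a simple filter applied before any model accuracy metric is computed. This is a sketch only: the record fields `weeks_of_history` and `planner_override` are hypothetical names, and the 8-week default is the low end of the range above.

```python
def eligible_for_model_metrics(records, min_history_weeks=8):
    """Keep only records that should count toward model accuracy metrics.

    Drops SKUs with too little history and periods where a planner
    overrode the model forecast.
    """
    return [
        r for r in records
        if r["weeks_of_history"] >= min_history_weeks
        and not r["planner_override"]
    ]

# Hypothetical records: one clean, one too new, one overridden for a promotion.
records = [
    {"sku": "SKU-1", "weeks_of_history": 52, "planner_override": False},
    {"sku": "SKU-2", "weeks_of_history": 3, "planner_override": False},
    {"sku": "SKU-3", "weeks_of_history": 40, "planner_override": True},
]
kept = eligible_for_model_metrics(records)
print([r["sku"] for r in kept])
```

Keeping the excluded records in a separate set rather than discarding them lets you score planner overrides on their own.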

Second, set separate targets for each metric by inventory tier. Don't try to manage to a single combined score — the business decisions downstream of A-tier errors versus C-tier errors are too different to compress into one number.

Third, review FVA by category quarterly, not just annually. Model performance drifts. A product category that showed +12% FVA in Q1 might slide to negative territory in Q3 if demand patterns have shifted and the model hasn't been retrained. Quarterly FVA reviews catch this before it costs you a full season of suboptimal orders.

What we track inside Automcore is essentially a live version of these three metrics, surfaced alongside the forecast itself so planners can see accuracy context while they're making replenishment decisions — not after the fact in a separate reporting tool. The goal is to make "is this forecast reliable right now?" answerable in the moment, not at the monthly review.
