Every major supply chain software vendor has relabeled its forecasting module as "AI-enabled" or "machine learning-powered" over the past three years. The actual methodology change in most cases was marginal: gradient boosting instead of ARIMA, a few more features in the input vector, a slightly more sophisticated seasonal decomposition. The marketing changed more than the math.
Demand planners who've sat through vendor pitches know this. They've seen "AI forecast" outputs that look almost identical to what their previous system produced. They've been told that accuracy would improve 20–30% and found the actual improvement was 3–5% on aggregate metrics and zero or negative on the SKUs that actually matter. Their skepticism is earned.
This piece is not a defense of AI in demand planning broadly. It's an attempt to draw a line between the types of tools that deserve skepticism and the types that don't — and to explain what the meaningful distinction actually is.
The Three Flavors of "AI" in Demand Planning
Not all demand planning AI is the same. There are roughly three categories of what vendors are selling under that label:
Better statistical models with ML feature engineering. This is the most common category. The vendor has replaced a simple ARIMA or Holt-Winters model with a gradient boosting or random forest model, added more calendar and promotional features, and called the result machine learning. The accuracy improvement is real but modest: typically a 5–15% reduction in mean absolute deviation (MAD) on stable SKUs with good historical data. The primary benefit is that the model automatically finds feature interactions that a statistician would have had to discover manually. The limitation is that it is still entirely backward-looking: it cannot anticipate demand changes that have no precedent in the training data.
Black-box "AI" with no explainability. This is the category that deserves the most skepticism. The vendor has a complex model — often a deep learning architecture — that produces forecast outputs without any mechanism for the demand planner to understand why a particular SKU is being forecast at a particular level. When the forecast is wrong, the planner has no way to diagnose why or to apply judgment to override it intelligently. Accuracy on benchmark datasets may be impressive; accuracy on the specific SKU mix and demand patterns of a real business may be much less impressive.
External signal integration with explainable adjustments. This is the category that's actually new and worth evaluating seriously. These tools go beyond historical sales data to incorporate leading external signals (weather, freight, social trends, commodity prices), and they surface the signal-to-adjustment logic explicitly. The demand planner can see: "this SKU's forecast is increased 18% for the week of March 15 because temperature forecasts show a cold snap 14°F below average across this distribution region, and demand for this SKU category has a 0.73 correlation with cold weather in the historical data." That's a statement a planner can evaluate, challenge, or accept.
The third category is worth adopting. The first is modestly useful. The second deserves exactly the skepticism it gets.
Why Explainability Matters More Than Accuracy
Demand planners spend their careers building judgment about their specific product mix, customer base, and regional demand patterns. That judgment is valuable and cannot be replaced by a model. The question is not "does the AI know better than the planner?" — it's "does the AI give the planner better inputs for their judgment?"
A black-box forecast output asks the planner to abdicate their judgment. When the forecast says to order 4,200 units and the planner's read of the situation says 3,600, they have two options: override the AI (and be blamed if it was right) or follow the AI (and be unable to explain why if it was wrong). Neither option is good. The planning team's accountability structure gets distorted by a tool they can't understand or audit.
An explainable forecast with external signal attribution changes the interaction. The planner sees that the 4,200-unit recommendation is driven by a weather adjustment applied to a stable base of 3,400. They can evaluate the weather adjustment on its merits. They might know that this specific SKU doesn't actually respond much to cold snaps because the customer base is mostly indoor restaurants. They can override the weather adjustment while accepting the base forecast. The tool amplifies their judgment rather than replacing it.
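To make that interaction concrete, here is a minimal Python sketch of a forecast that carries its own attribution. The class names, the `reject` method, and the SKU identifier are hypothetical, invented for illustration rather than taken from any vendor's API; only the 3,400 base and the 4,200 recommendation come from the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Adjustment:
    """One signal-driven change to the base forecast, with its stated reason."""
    signal: str
    units: float
    reason: str
    accepted: bool = True  # the planner can reject each adjustment on its own


@dataclass
class ExplainableForecast:
    sku: str
    base_units: float
    adjustments: List[Adjustment] = field(default_factory=list)

    def total(self) -> float:
        """Base forecast plus every adjustment the planner has not rejected."""
        return self.base_units + sum(a.units for a in self.adjustments if a.accepted)

    def reject(self, signal: str) -> None:
        """Planner override: drop one signal's adjustment, keep the base."""
        for adjustment in self.adjustments:
            if adjustment.signal == signal:
                adjustment.accepted = False


forecast = ExplainableForecast(
    sku="SKU-001",  # hypothetical identifier
    base_units=3400,
    adjustments=[Adjustment("weather", 800, "cold snap forecast for the region")],
)

recommended = forecast.total()      # base 3,400 + weather 800
forecast.reject("weather")          # planner knows this SKU ignores cold snaps
after_override = forecast.total()   # back to the base forecast
print(recommended, after_override)
```

The design point is that the override is scoped to one adjustment. The planner disagrees with the weather signal, not with the whole forecast, and the record shows exactly which part they rejected.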
This is the test we'd apply to any demand planning tool: can a demand planner with 10 years of category experience look at a specific forecast output and understand the reasoning well enough to agree or disagree intelligently? If the answer is no, the tool is probably creating accountability problems that outweigh its accuracy benefits.
The Accuracy Measurement Problem
Vendor accuracy claims deserve careful scrutiny for a specific reason: aggregate accuracy metrics hide the distribution of errors in ways that matter for inventory decision-making.
A vendor might accurately claim "15% MAD reduction across the SKU portfolio." That aggregate improvement could be driven entirely by better accuracy on stable, high-volume SKUs that were already well-forecast — SKUs where the safety stock implications of a 15% MAD reduction are minor because the absolute error was small to begin with.
The SKUs that drive inventory cost and stockout risk are typically the volatile, lower-volume SKUs where demand is driven by external events — weather, trends, promotions — and where traditional statistical models have the highest error rates. If the accuracy improvement doesn't apply to those SKUs, the inventory benefit of the tool is much smaller than the aggregate headline suggests.
When evaluating any demand planning tool, the accuracy metrics worth examining are: MAD improvement on weather-sensitive SKUs, on social-trend-sensitive SKUs, on SKUs with high demand variability (coefficient of variation above 0.5), and during external demand event periods specifically. If the vendor can't produce those segmented accuracy numbers, the aggregate metric is not a useful guide to whether the tool will reduce your actual inventory costs.
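A small Python sketch shows how an aggregate number can hide the segment that matters. The demand history below is made up purely for illustration, and the helper names are our own; the only figure taken from the text is the 0.5 coefficient-of-variation cutoff.

```python
from statistics import mean, pstdev

CV_THRESHOLD = 0.5  # the variability cutoff named in the text


def total_abs_error(actuals, forecasts):
    """Sum of absolute forecast errors over all periods."""
    return sum(abs(a - f) for a, f in zip(actuals, forecasts))


def coefficient_of_variation(actuals):
    """Population standard deviation over mean demand (assumes mean > 0)."""
    return pstdev(actuals) / mean(actuals)


def error_reduction(rows):
    """Fractional drop in absolute error from the old forecast to the new one.

    With the same number of periods per SKU this equals the MAD reduction.
    """
    old_err = sum(total_abs_error(actuals, old) for actuals, old, _ in rows)
    new_err = sum(total_abs_error(actuals, new) for actuals, _, new in rows)
    return (old_err - new_err) / old_err


# Made-up history: (actual demand, old forecast, new forecast) per SKU.
history = {
    "stable_high_volume": (
        [1000, 1020, 980, 1000], [1040, 980, 1020, 960], [1010, 1010, 990, 990]),
    "volatile_low_volume": (
        [20, 120, 10, 90], [60, 60, 60, 60], [60, 60, 60, 60]),
}

# Split SKUs by demand variability before measuring improvement.
segments = {"stable": [], "volatile": []}
for actuals, old, new in history.values():
    name = "volatile" if coefficient_of_variation(actuals) > CV_THRESHOLD else "stable"
    segments[name].append((actuals, old, new))

aggregate = error_reduction(list(history.values()))
by_segment = {name: error_reduction(rows) for name, rows in segments.items() if rows}

print(f"aggregate: {aggregate:.0%}")
for name, value in by_segment.items():
    print(f"{name}: {value:.0%}")
```

On this toy data the aggregate reduction is about 35%, all of it from the stable SKU; the volatile SKU shows no improvement at all. The same split can be repeated for weather-sensitive or trend-sensitive SKUs, assuming you have a way to label them.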
What Good Adoption Looks Like in Practice
The demand planning teams we've seen adopt external signal tools successfully share a few common patterns.
They start with a limited SKU set where the signal-to-demand relationship is clearest and most measurable. Weather-sensitive categories — seasonal beverages, hot/cold food items, weather-related household goods — are often the right starting point because the causal mechanism is well-understood and the feedback loop is fast enough to validate within weeks. Starting with a subset of SKUs lets the team build confidence in the tool before applying it to the full portfolio.
They run the tool in parallel with their existing forecast for 4–8 weeks before using its output for actual replenishment decisions. Parallel running lets the demand planner observe where the signal-augmented forecast diverges from the existing forecast, understand the reasons for the divergence, and form an opinion on which forecast was right. Skipping parallel running is the fastest path to poor adoption: planners who never got to form their own judgment about the tool's reliability are the first to abandon it when it makes a high-profile miss.
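A parallel run can be tracked with very little machinery. The sketch below is one hedged way to do it in Python: the function name, the record layout, and the 10% divergence threshold are all assumptions chosen for illustration, and the weekly numbers are invented.

```python
def parallel_run_report(weeks, divergence_threshold=0.10):
    """List the weeks where the two forecasts meaningfully disagreed.

    weeks: dicts with keys "week", "existing", "candidate", "actual".
    divergence_threshold: assumed cutoff, as a fraction of the existing forecast.
    """
    report = []
    for w in weeks:
        divergence = abs(w["candidate"] - w["existing"]) / w["existing"]
        if divergence < divergence_threshold:
            continue  # forecasts agree closely; nothing to learn from this week
        existing_err = abs(w["actual"] - w["existing"])
        candidate_err = abs(w["actual"] - w["candidate"])
        if candidate_err < existing_err:
            closer = "candidate"
        elif existing_err < candidate_err:
            closer = "existing"
        else:
            closer = "tie"
        report.append({"week": w["week"], "divergence": divergence, "closer": closer})
    return report


# Invented weekly data for one SKU during a parallel run.
weeks = [
    {"week": 1, "existing": 100, "candidate": 104, "actual": 102},
    {"week": 2, "existing": 100, "candidate": 125, "actual": 120},
    {"week": 3, "existing": 100, "candidate": 80, "actual": 98},
    {"week": 4, "existing": 100, "candidate": 103, "actual": 90},
]

report = parallel_run_report(weeks)
for row in report:
    print(row)
```

Only the divergent weeks are reported, because those are the weeks where the planner needs to look at the stated reason for the adjustment and decide whether the signal was real.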
They define clear boundaries for when the tool's output should be trusted versus when planner override should be expected. A sudden competitor product launch or a major promotional event may produce demand patterns outside the range of the model's training data. The planner should know in advance that those situations are outside the tool's reliable range, rather than discovering it after a missed forecast.
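One crude way to make such a boundary explicit is a range check on the model's inputs. The Python sketch below flags any feature whose current value falls outside what the training data covered; the function name and feature names are hypothetical, and a real tool may expose this differently or not at all. It will also miss events, such as a competitor launch, that are not represented as an input feature in the first place.

```python
def out_of_range_features(training_rows, current_row):
    """Return the features whose current value lies outside the training min/max.

    A feature never seen in training is flagged as well.
    """
    flagged = []
    for name, value in current_row.items():
        seen = [row[name] for row in training_rows if name in row]
        if not seen or value < min(seen) or value > max(seen):
            flagged.append(name)
    return flagged


# Invented training history and a current week with an unusually deep promotion.
training_rows = [
    {"promo_discount_pct": 0, "avg_temp_f": 30},
    {"promo_discount_pct": 10, "avg_temp_f": 62},
    {"promo_discount_pct": 20, "avg_temp_f": 90},
]
current_row = {"promo_discount_pct": 40, "avg_temp_f": 55}

flags = out_of_range_features(training_rows, current_row)
print(flags)
```

A non-empty result is a prompt for planner review, not a verdict that the forecast is wrong.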
The Right Frame for Evaluation
The skepticism demand planners bring to AI tools is productive when it's specific. "AI forecasting doesn't work" is a blanket dismissal that prevents adoption of tools that genuinely improve outcomes. "This tool can't explain why it's making a particular recommendation, which means I can't audit it or apply my judgment to override it intelligently" is specific and actionable skepticism that leads to better vendor selection.
The questions worth asking any demand planning vendor: Can you show me, for a specific SKU and a specific time period, exactly what drove the forecast output? What external signals are you reading, and at what resolution? What's your accuracy breakdown on high-variability SKUs versus stable SKUs? What happens to your model's output during an unusual demand event with no historical precedent?
Vendors who can answer those questions clearly are building tools worth evaluating. Vendors who deflect to aggregate accuracy metrics and phrases like "our proprietary algorithm" are selling a black box — and the skepticism it deserves is entirely justified.