The Supplytrx platform reads weather forecasts, port congestion data, freight capacity signals, social trend velocity, and commodity price movements — then translates those inputs into demand adjustment signals that feed our users' replenishment systems. That description makes it sound simpler than it is. This post is about the architecture decisions we made and, more honestly, the ones we got wrong first.
We're a small team. The engineers working on the signal ingestion layer can be counted on one hand. That constraint shapes every architectural decision: we can't afford systems that require full-time maintenance overhead, we need to be able to debug production issues at 11pm without specialized expertise, and we need the flexibility to add new signal sources without rearchitecting the whole pipeline.
The Core Problem: Signal Sources Are Not Uniform
The first thing we learned — quickly — is that different signal sources have completely different delivery characteristics. Weather forecast data from NOAA's NWS API delivers structured JSON at predictable intervals. Port dwell time data scraped from Marine Exchange and port authority sources comes in inconsistent formats, with some ports updating hourly and others daily, some providing structured data and others providing HTML tables we have to parse. Social trend data comes from multiple APIs with different rate limits, different authentication mechanisms, and wildly different data schemas.
Our initial instinct was to build a generic ingestion layer that could handle all of these through a common interface. That was the wrong instinct. The diversity in source characteristics — latency, schema, reliability, refresh rate — is fundamental, not incidental. A weather API that updates every hour has different pipeline requirements than a port authority data source that posts a daily summary at an unpredictable time.
We ended up with source-specific connectors that share a common output interface. Each connector handles the idiosyncrasies of its source and emits normalized events to a Kafka topic. Downstream consumers don't know or care which source produced an event — they consume from a topic and see a consistent schema.
Kafka as the Backbone: What It Solved and What It Didn't
We chose Apache Kafka as the backbone of the pipeline early, and that decision has held up. The reasons were straightforward: we need durable message storage (signal events need to be replayable for backfilling and debugging), we need multiple consumers to read the same signal stream independently, and we need to decouple ingestion latency from processing latency.
The topic structure we landed on after several iterations looks like this:
signals.raw.weather
signals.raw.freight
signals.raw.social
signals.raw.commodity
signals.processed.demand-adjustments
signals.processed.disruption-events
Raw topics receive events from source connectors. Processed topics receive output from the transformation layer that converts raw signals into structured demand signals. Consumers downstream of the processed topics — our demand adjustment engine, our alert system, our user-facing API — only read from processed topics.
What Kafka didn't solve: schema evolution. When a weather data provider changes their API response format, every consumer downstream of that source breaks unless the connector handles the migration. We wasted two weeks early on dealing with a breaking API change that we could have mitigated with a proper schema registry. We now use Confluent Schema Registry with Avro schemas for all processed topics, which gives us backward compatibility guarantees and makes schema evolution manageable.
The Transformation Layer: From Raw Signals to Demand Adjustments
Raw signals are not demand signals. A NOAA forecast saying "temperature 12°F below seasonal average across Atlanta metro for days 8–14" is a raw signal. "Increase demand planning adjustment for hot beverage SKUs in Southeast DC network by 18–24% for replenishment week starting [date]" is a demand signal. The transformation layer is where that translation happens, and it's the hardest part of the system to build.
We use Apache Flink for the stream processing in this layer. The decision was largely pragmatic — Flink handles stateful stream processing well, which we need because some transformations require joining signals from multiple sources and maintaining state across time windows. A weather event's demand impact depends on the base season, the specific product category, and whether competing signals are present.
The transformation logic itself is model-based. We have a set of category-specific adjustment models that take normalized signal inputs and produce demand adjustment factors. These models were the slowest part of the system to develop because they required historical validation — we needed to confirm that our weather-to-demand model actually predicted demand movement before shipping it. We ran three months of backtests against historical POS data before we were confident enough to ship any weather-based adjustment logic to production users.
One thing we got wrong: we initially tried to build a single unified transformation model that could handle all signal types together. The interactions between signal types — weather and social trends both moving simultaneously, for example — are real, but they're also rare enough that trying to model them jointly made everything harder to validate and debug. We now have separate models per signal type that run in parallel, with a combining step that handles interaction cases explicitly. Simpler, more debuggable, and easier to explain to users.
Latency Targets and Where We Actually Land
For demand sensing to be useful, signal processing latency needs to be measured in minutes or hours, not days. The latency target we designed for is: raw signal event available within 15 minutes of source update → processed demand signal available within 45 minutes of raw event → user-facing signal available within 60 minutes end-to-end.
In practice, we hit that target roughly 85–90% of the time. The exceptions are mostly source-side: a port authority data source that's down for maintenance, a social API rate limit we've hit, a weather API that's running slow. On the processing side, we're reliably under 30 minutes for the transformation layer under normal load.
The honest caveat about latency: for most demand planning decisions, 60-minute latency is overkill. A demand planner running a weekly replenishment cycle doesn't need signals updated every hour. What they need is signals that are current when they sit down to make replenishment decisions — which in practice means daily refresh is sufficient for most use cases, and hourly is a quality-of-life improvement rather than a fundamental requirement.
We built for hourly because the architecture that supports hourly also supports daily trivially, and because some disruption detection use cases — a factory fire, a sudden port closure — genuinely benefit from faster signal propagation. But we're not claiming that hourly refresh is what makes demand sensing valuable. The signal quality and the connection to replenishment logic are what matter.
What We'd Do Differently
The two decisions I'd change with hindsight: the schema registry choice and the social data sourcing strategy.
On schema registry: we should have implemented it before we shipped any production data flows, not after a painful breaking change. The cost of adding it retroactively — migrating existing consumers, updating all connector code — was significantly higher than the cost of building it in from the start would have been. This is the classic "we'll add it later" trap.
On social data: we started by pulling from three different social APIs directly. The maintenance overhead of managing three separate authentication mechanisms, rate limit strategies, and schema variations was disproportionate to the signal value. We've since consolidated to a single data partner that aggregates social trend signals across platforms and delivers them in a normalized format. The signal quality is slightly lower than what we could get from direct access, but the operational overhead reduction was worth it at our current scale.
The pipeline is not finished — no production data system ever is. But it's stable enough that we're spending more time improving signal quality and expanding category coverage than maintaining infrastructure. That's roughly where we want to be at this stage of building.