
Rebuilding a Failing Data Pipeline — Live, With Zero Downtime

Neil Simpson
production-systems · ai-engineering

Client

Series A SaaS startup (anonymised)

Platform

TypeScript / PostgreSQL / Redis / AWS

Industry

B2B SaaS

  • 85% → 99.99% reliability
  • Zero-downtime migration
  • 47ms → 12ms p95 latency
  • 15% data loss eliminated

The Client

A Series A SaaS startup with 400 business customers and a team of 12 engineers. Their platform tracks customer engagement events — page views, feature usage, API calls — and pipes them into analytics dashboards that their customers rely on for business decisions.

The platform was processing roughly 50 million events per day across all tenants.

The Problem

Their event pipeline was losing data. Not dramatically — not enough to trigger alerts — but enough that customers were noticing discrepancies. Dashboard totals didn't match API exports. Historical reports showed gaps. A few customers had started asking questions.

The engineering team knew something was wrong but couldn't quantify it. There was no observability on the pipeline itself — only on the end-state data. When they investigated, they found the problem was worse than expected: approximately 15% of events were being dropped under peak load.

The pipeline had been built early, optimised for speed-to-market, and never revisited. It was a series of Lambda functions connected by SQS queues with no dead letter queues, no retry logic, no idempotency guarantees, and no way to replay failed events. When a Lambda timed out or SQS throttled, events vanished silently.

They couldn't afford downtime. Their enterprise customers had SLAs. Ripping out the old pipeline and replacing it wasn't an option — not with 50 million events per day flowing through it.

What We Built

Week 1: Instrumentation

Before changing anything, we needed to see what was actually happening. We instrumented every stage of the existing pipeline:

  • Ingestion counter at the API gateway — how many events enter the system
  • Stage counters at each Lambda — how many events arrive, how many proceed, how many fail
  • Terminal counter at the database write — how many events reach their destination
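The funnel those counters produce can be sketched as follows. This is an illustrative model, not the client's actual instrumentation code; the stage names, counts, and function names are invented for the example.

```typescript
// Per-stage counters: how many events arrived at a stage and how many
// proceeded to the next one. The gap between the two is that stage's loss.
interface StageCounts {
  stage: string;
  arrived: number;
  proceeded: number;
}

// Fraction of events lost at a single stage.
function stageDropRate(s: StageCounts): number {
  return s.arrived === 0 ? 0 : (s.arrived - s.proceeded) / s.arrived;
}

// End-to-end delivery rate: terminal writes over ingested events.
function deliveryRate(ingested: number, written: number): number {
  return ingested === 0 ? 1 : written / ingested;
}

// Hypothetical day of traffic shaped like the incident described above:
// one stage ("enrich") accounts for the bulk of the losses.
const funnel: StageCounts[] = [
  { stage: "validate", arrived: 1_000_000, proceeded: 990_000 },
  { stage: "enrich",   arrived: 990_000,   proceeded: 900_000 },
  { stage: "write",    arrived: 900_000,   proceeded: 850_000 },
];

const worst = funnel.reduce((a, b) =>
  stageDropRate(a) > stageDropRate(b) ? a : b,
);
console.log(worst.stage, deliveryRate(1_000_000, 850_000)); // → enrich 0.85
```

With counters at every boundary, "where are events dying?" becomes a single reduce over the funnel rather than a forensic investigation.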

Within 24 hours we had a dashboard showing the full funnel. The 15% drop rate was confirmed. More importantly, we could see where events were dying: 60% of losses happened at a single stage where a Lambda was doing synchronous database lookups that occasionally timed out under load.

Week 2–3: Shadow Pipeline

We built the new pipeline alongside the old one. Every event entering the system was duplicated — one copy went through the existing path, the other went through the new path. The new pipeline wrote to a separate set of tables.

The new architecture:

Ingestion. Events land in a PostgreSQL-backed queue with write-ahead logging. Nothing is acknowledged to the client until the event is durably stored. This is the "no event left behind" guarantee — if we acknowledge receipt, it's on disk.
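The acknowledgement contract can be sketched like this, with an in-memory Map standing in for the PostgreSQL-backed queue. All names here are illustrative; in the real pipeline the write is a transactional INSERT, and the client is acknowledged only after the commit succeeds.

```typescript
interface IngestEvent {
  id: string;
  payload: unknown;
  receivedAt: number;
}

class DurableQueue {
  // Stand-in for a Postgres table with write-ahead logging.
  private store = new Map<string, IngestEvent>();

  // "No event left behind": the ack is returned only after the event is
  // stored. (The real version is async and awaits a DB commit here.)
  enqueue(e: IngestEvent): { ack: true; id: string } {
    this.store.set(e.id, e); // real code: INSERT ... ; COMMIT
    return { ack: true, id: e.id };
  }

  size(): number {
    return this.store.size;
  }
}
```

The key property is ordering: storage happens strictly before acknowledgement, so a client that has received an ack can never lose that event to a downstream crash.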

Processing. A pool of workers pulls events from the queue in batches. Each batch is processed with idempotency keys — if a worker crashes and the batch is retried, no duplicates are created. The lookups that caused the old pipeline to choke are now cached in Redis with a 30-second TTL, eliminating the synchronous database dependency.
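A minimal sketch of the two mechanisms in that paragraph: idempotency-key deduplication and a TTL cache. A Set stands in for the database's unique constraint on the idempotency-key column, and the TtlCache class stands in for Redis with a 30-second expiry; the names are assumptions, not the client's API.

```typescript
// Stand-in for Redis GET/SET with EXPIRE.
class TtlCache<V> {
  private entries = new Map<string, { value: V; expires: number }>();
  constructor(private ttlMs: number) {}

  get(key: string, now: number): V | undefined {
    const e = this.entries.get(key);
    return e !== undefined && e.expires > now ? e.value : undefined;
  }

  set(key: string, value: V, now: number): void {
    this.entries.set(key, { value, expires: now + this.ttlMs });
  }
}

const seen = new Set<string>(); // stand-in for a DB unique constraint
const processed: string[] = [];

// Retry-safe batch processing: replaying a batch after a worker crash
// skips every event whose idempotency key was already recorded.
function processBatch(batch: { idempotencyKey: string }[]): void {
  for (const ev of batch) {
    if (seen.has(ev.idempotencyKey)) continue; // duplicate — already done
    seen.add(ev.idempotencyKey);
    processed.push(ev.idempotencyKey);
  }
}
```

Running `processBatch` twice on the same batch leaves `processed` unchanged the second time, which is exactly what makes crash-and-retry safe.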

Delivery. Processed events are written to the analytics tables in batches. Failed writes go to a dead letter queue with automatic retry (exponential backoff, 5 attempts, then alert). Nothing is silently dropped.
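The retry policy above can be sketched as follows. The five attempts, exponential backoff, and dead-letter-then-alert behaviour are from the text; the 1-second base delay and all function names are assumptions.

```typescript
const MAX_ATTEMPTS = 5;
const BASE_DELAY_MS = 1_000;

// Delay before retry `attempt` (1-based): 1s, 2s, 4s, 8s, 16s.
function backoffMs(attempt: number): number {
  return BASE_DELAY_MS * 2 ** (attempt - 1);
}

// Attempt the write up to MAX_ATTEMPTS times; on final failure, hand the
// error to the dead letter queue instead of dropping it.
async function deliverWithRetry(
  write: () => Promise<void>,
  deadLetter: (err: unknown) => void,
): Promise<boolean> {
  for (let attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
    try {
      await write();
      return true;
    } catch (err) {
      if (attempt === MAX_ATTEMPTS) {
        deadLetter(err); // nothing is silently dropped
        return false;
      }
      await new Promise((r) => setTimeout(r, backoffMs(attempt)));
    }
  }
  return false;
}
```

The invariant worth noticing: every code path out of `deliverWithRetry` either confirms the write or lands the event in the dead letter queue. There is no third exit.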

Replay. Every event is retained in the ingestion queue for 72 hours. If we discover a processing bug, we can replay any window of events through the pipeline. This didn't exist before — historical data loss was permanent.
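Selecting a replay window can be sketched like this. In the real system this would be a SQL range query over the ingestion table; the 72-hour retention figure is from the text, and the function and field names are illustrative.

```typescript
const RETENTION_MS = 72 * 60 * 60 * 1_000; // 72-hour replay window

interface StoredEvent {
  id: string;
  receivedAt: number; // epoch ms
}

// Events inside both the requested [from, to] range and the retention
// window. Anything older than now - RETENTION_MS is already purged.
function selectForReplay(
  queue: StoredEvent[],
  from: number,
  to: number,
  now: number,
): StoredEvent[] {
  const oldestRetained = now - RETENTION_MS;
  return queue.filter(
    (e) => e.receivedAt >= Math.max(from, oldestRetained) && e.receivedAt <= to,
  );
}
```

Because processing is idempotent (previous section), replaying a window that overlaps already-processed events is safe: duplicates are skipped rather than double-counted.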

Running both pipelines in parallel for two weeks gave us hard numbers. The old pipeline: 85.2% delivery rate under peak load. The new pipeline: 99.99% delivery rate under the same load, with the missing 0.01% recoverable from the dead letter queue.

Week 4: Cutover

The cutover was anticlimactic by design. We changed one configuration flag to route live traffic to the new pipeline and kept the old pipeline running in shadow mode (receiving duplicates but not writing to production tables).
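A sketch of what that one flag controls: both pipelines still receive every event, but only the live one writes to production tables. The flag name, types, and routing function are assumptions for illustration.

```typescript
type Route = "legacy" | "next";

interface PipelineConfig {
  liveRoute: Route; // the single flag flipped at cutover
}

// Duplicate the event to both pipelines; only the live route is told to
// write to production. The other runs in shadow mode.
function routeEvent(
  cfg: PipelineConfig,
  sinks: Record<Route, (e: unknown, writeToProd: boolean) => void>,
  event: unknown,
): void {
  sinks.next(event, cfg.liveRoute === "next");
  sinks.legacy(event, cfg.liveRoute === "legacy");
}
```

Because the only change at cutover is which boolean is true, rollback is the same one-line change in reverse.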

We monitored for 48 hours. Zero anomalies. Then we shut down the old pipeline.

The only customer-visible change was that their dashboards became more accurate. Several customers noticed their numbers went up slightly — they were now seeing 100% of their events instead of 85%.

Week 5: Backfill

The shadow pipeline had captured two weeks of events processed by both systems. We compared the outputs and identified 2.3 million events that the old pipeline had dropped during the shadow period. These were replayed through the new pipeline and backfilled into the production analytics tables.
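The comparison step reduces to a set difference over delivered event IDs: anything the new pipeline delivered that the old one didn't is a dropped event to backfill. A minimal sketch, with invented names:

```typescript
// IDs delivered by the new pipeline but missing from the old pipeline's
// output during the shadow period — the backfill candidates.
function droppedByOldPipeline(
  newPipelineIds: Iterable<string>,
  oldPipelineIds: Iterable<string>,
): string[] {
  const old = new Set(oldPipelineIds);
  const missing: string[] = [];
  for (const id of newPipelineIds) {
    if (!old.has(id)) missing.push(id);
  }
  return missing;
}
```

At shadow-period scale this diff runs over the stored outputs of both table sets; the 2.3 million IDs it yields are then fed back through the replay mechanism described earlier.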

Customers with enterprise plans received a one-line changelog entry: "Improved event processing accuracy." No one needed to know how bad it had been.

The Result

| Metric | Before | After |
| --- | --- | --- |
| Event delivery rate | 85.2% | 99.99% |
| p95 processing latency | 47ms | 12ms |
| Recovery from failures | Not possible | 72-hour replay window |
| Observability | End-state only | Full pipeline funnel |
| Monthly dropped events | ~225M | Under 5,000 (all recoverable) |

The infrastructure cost increased by 15% — the durable queue and Redis cache aren't free. But the cost of losing 15% of customer data, once quantified in churn risk and SLA exposure, made the investment trivial.

What Made This Work

Shadow pipeline pattern. Building alongside the live system instead of replacing it eliminated the biggest risk — downtime during migration. It also gave us hard comparison data, not estimates.

Instrumentation first. We didn't start fixing until we could measure. The dashboard we built in week one became a permanent part of their infrastructure and has caught two unrelated issues since.

Idempotency everywhere. Every stage of the pipeline can be safely retried. This single design decision eliminated an entire class of failure modes — partial processing, duplicate events, and the cascade failures that happen when retry logic creates duplicates that trigger downstream errors.

Boring technology. PostgreSQL queues, Redis caching, batch processing with dead letter queues. Nothing exotic. The old pipeline failed because it relied on complex orchestration (Lambda + SQS + DynamoDB) with no error handling. The new pipeline succeeds because every component is simple, observable, and independently recoverable.
