Work · DevTools · 10 weeks · 2024

Telemetry ingest pipeline, 400k events/sec sustained

A developer-facing observability product had accumulated forty-three Node services in its ingest path over six years of feature work. Unbounded in-memory queues meant that a single bursty tenant could exhaust heap on shared services and produce correlated outages across customers.

The situation

The team knew the architecture was the problem. Three previous proposals to consolidate had stalled because each one required rewriting the entire pipeline at once — a risk nobody would sign off on. We were brought in specifically to find a path that did not require a flag day.

The product's SLA was a 95th-percentile end-to-end ingest latency of 3 seconds. Actual p95 was 11 seconds during bursts, and the on-call rotation had been paging 14 times a week, with a median time-to-resolution of 47 minutes. The paging was what forced the conversation; the architecture was what caused the paging.

What we did

The assessment produced a call-graph map, built from four weeks of OpenTelemetry trace data, which revealed that most of the forty-three services could be collapsed into three: an edge receiver terminating TLS and doing per-tenant rate accounting, a normalizer performing schema validation and unit conversion, and a sink writer batching into ClickHouse and Kafka.
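To make the first stage concrete, here is a rough sketch of what the edge receiver's per-tenant rate accounting could look like in Go, using token buckets from golang.org/x/time/rate keyed by tenant. The header name, the per-tenant budget, and the 429 response for an over-budget tenant are illustrative assumptions, not the production values.

```go
package main

import (
	"log"
	"net/http"
	"sync"

	"golang.org/x/time/rate"
)

// tenantLimiter hands out one token bucket per tenant.
type tenantLimiter struct {
	mu       sync.Mutex
	limiters map[string]*rate.Limiter
	perSec   rate.Limit
	burst    int
}

func newTenantLimiter(perSec rate.Limit, burst int) *tenantLimiter {
	return &tenantLimiter{limiters: make(map[string]*rate.Limiter), perSec: perSec, burst: burst}
}

func (t *tenantLimiter) get(tenant string) *rate.Limiter {
	t.mu.Lock()
	defer t.mu.Unlock()
	l, ok := t.limiters[tenant]
	if !ok {
		l = rate.NewLimiter(t.perSec, t.burst)
		t.limiters[tenant] = l
	}
	return l
}

func main() {
	// Hypothetical budget: 10k events/sec per tenant with a 20k burst.
	limits := newTenantLimiter(10_000, 20_000)

	http.HandleFunc("/v2/ingest", func(w http.ResponseWriter, r *http.Request) {
		tenant := r.Header.Get("x-tenant-id") // header name assumed for illustration
		if !limits.get(tenant).Allow() {
			// Over budget: shed this tenant at the edge instead of queueing in memory.
			http.Error(w, "tenant rate limit exceeded", http.StatusTooManyRequests)
			return
		}
		// ... hand the event to the normalizer stage ...
		w.WriteHeader(http.StatusAccepted)
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}
```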

We wrote each in Go, deployed them alongside the existing pipeline, and used header-based routing (an x-ingest-version: v2 header injected at the load balancer for selected tenants) to shift traffic over one tenant at a time. Backpressure was propagated end-to-end using bounded channels sized against measured downstream capacity, with explicit shed policies rather than implicit OOM: when the sink writer is full it returns 429 to the edge, and when the edge is full it returns 503 to the client. Both policies are documented in the runbook and exercised in the CI suite.
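Condensed, the backpressure pattern looks like this: a bounded channel with a non-blocking enqueue, where a full queue surfaces as an explicit HTTP status rather than heap growth. Capacities and names are stand-ins; the real services size their channels against measured downstream throughput.

```go
// Package ingest sketches the bounded-channel stage used in the rewrite.
package ingest

import (
	"errors"
	"net/http"
)

// Event stands in for a normalized telemetry event.
type Event struct {
	Tenant  string
	Payload []byte
}

var errQueueFull = errors.New("ingest queue full")

// Stage wraps a bounded channel. The capacity is a placeholder; the real
// services size it against measured downstream throughput.
type Stage struct {
	queue chan Event
}

func NewStage(capacity int) *Stage {
	return &Stage{queue: make(chan Event, capacity)}
}

// Offer attempts a non-blocking enqueue. A full channel becomes an explicit
// error at the call site rather than an ever-growing in-memory queue.
func (s *Stage) Offer(ev Event) error {
	select {
	case s.queue <- ev:
		return nil
	default:
		return errQueueFull
	}
}

// HandleIngest shows the edge's shed policy: a full queue is answered with
// HTTP 503 so the client backs off, instead of the process OOMing later.
func HandleIngest(stage *Stage) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		ev := Event{Tenant: r.Header.Get("x-tenant-id")} // header name assumed
		if err := stage.Offer(ev); err != nil {
			http.Error(w, "ingest at capacity, retry later", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusAccepted)
	}
}
```

The value of the select/default form is that the decision to shed is made at the moment of enqueue, in code that can be unit-tested, rather than implicitly by the runtime when the heap runs out.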

The replay mechanism, arguably the most valuable part of the work, lets operators rewind Kafka offsets and re-ingest a specified window of data into a shadow partition without touching the live path. This turned incident recovery from an all-hands event into a command that on-call can run with confidence. The command is ingestctl replay --from=<offset> --to=<offset> --shadow, and we spent a full week of the engagement making it unambiguous which offsets it would touch.
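For a sense of the core loop, here is a compressed sketch of an offset-window replay, assuming the segmentio/kafka-go client. The topic names, the single-partition handling, and the shadow topic standing in for a shadow partition are all illustrative; the production ingestctl does far more validation around the offset window than this.

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
)

// replayWindow re-reads [from, to) on one partition of the live topic and
// re-publishes each message to a shadow destination, never touching the
// live consumer path.
func replayWindow(ctx context.Context, brokers []string, from, to int64) error {
	r := kafka.NewReader(kafka.ReaderConfig{
		Brokers:   brokers,
		Topic:     "ingest-events", // live topic name assumed
		Partition: 0,               // single-partition sketch
		MinBytes:  1,
		MaxBytes:  10e6,
	})
	defer r.Close()

	w := &kafka.Writer{
		Addr:  kafka.TCP(brokers...),
		Topic: "ingest-events-shadow", // shadow destination assumed
	}
	defer w.Close()

	// Rewind to the start of the requested window.
	if err := r.SetOffset(from); err != nil {
		return err
	}
	for {
		msg, err := r.ReadMessage(ctx)
		if err != nil {
			return err
		}
		if msg.Offset >= to {
			return nil // reached the end of the requested window
		}
		if err := w.WriteMessages(ctx, kafka.Message{Key: msg.Key, Value: msg.Value}); err != nil {
			return err
		}
	}
}

func main() {
	// The offsets would come from ingestctl's --from/--to flags.
	if err := replayWindow(context.Background(), []string{"localhost:9092"}, 1000, 2000); err != nil {
		log.Fatal(err)
	}
}
```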

Outcome

  • Sustained ingest throughput of 400,000 events/sec, benchmarked on a reduced cluster at roughly a third of the prior footprint
  • Infrastructure spend on the ingest path reduced by approximately 62%
  • Correlated tenant outages eliminated across two full quarters post-cutover (zero multi-tenant pages)
  • On-call pages related to ingest dropped from 14/week to fewer than 2/week
  • The replay tool has been invoked in production six times in the first six months, each time resolving an incident in under fifteen minutes

Stack

Go 1.21 · Kafka · ClickHouse · Kubernetes · OpenTelemetry · Envoy
