Work · Fintech · 14 weeks · 2024

Ledger throughput rewrite for a B2B payments platform

The client's double-entry ledger, written in Ruby over eight years, had become the single gating resource for the entire product. Under payroll-day load it would cap at roughly 340 postings per second and exhibit pathological tail latency as row-level locks piled up on high-traffic operating accounts.

The situation

A small internal team had already attempted two rewrites. The first was abandoned because the replacement diverged from the original's accounting semantics in subtle ways that only surfaced during month-end reconciliation — ordering of same-microsecond postings, rounding on fractional-cent splits, and whether refunds to a closed account produced a zombie balance or an error. The second attempt was technically sound but architecturally isolated — it never earned the trust to handle live traffic, because nobody could articulate what "equivalent behaviour" meant precisely enough to test it.

What we did

Weeks one and two were assessment. We produced a thirty-four-page review identifying three real bottlenecks (row-level locks on the operating-account row, a synchronous journal write on the posting hot path, and an N+1 query inside the idempotency check) and two places where the team's diagnosis was inverted: the connection pool they blamed was fine, and the JSON serializer they trusted was not.
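
For a sense of the N+1 fix: the check moved from one SELECT per posting to one batched query per request. A minimal sketch in Go with pgx v5 — the table and column names here are invented for this page, not taken from the client's schema:

```go
package ledger

import (
	"context"

	"github.com/jackc/pgx/v5"
)

// alreadyApplied returns which of the given idempotency keys already exist,
// in a single round trip via ANY($1) instead of one SELECT per posting.
// The postings table and idempotency_key column are assumptions for this sketch.
func alreadyApplied(ctx context.Context, conn *pgx.Conn, keys []string) (map[string]bool, error) {
	rows, err := conn.Query(ctx,
		`SELECT idempotency_key FROM postings WHERE idempotency_key = ANY($1)`,
		keys)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	seen := make(map[string]bool, len(keys))
	for rows.Next() {
		var k string
		if err := rows.Scan(&k); err != nil {
			return nil, err
		}
		seen[k] = true
	}
	return seen, rows.Err()
}
```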

Weeks three through five were spent building a differential tester before a single line of the replacement engine was written. The harness replayed the payday-2024-03-15 corpus — a six-hour window containing 4.1 million postings — against both the legacy Ruby implementation and any candidate replacement, and failed loudly on any deviation in final balances, journal ordering within an account, or observable error codes. Nothing shipped until the tester ran clean on a full year of production data.
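
A sketch of the harness's core loop, simplified from the real thing: the Engine interface, Posting shape, and error-code convention below are stand-ins of our own invention, but the three deviation checks are the ones the tester enforced.

```go
package difftest

import (
	"fmt"
	"reflect"
)

// Posting is the replayed unit; the real corpus carried more fields.
type Posting struct {
	ID            string
	Debit, Credit string // account IDs
	AmountCents   int64
}

// Engine is the surface both implementations had to expose to the harness:
// apply one posting (returning an error code, "" on success), then report
// final balances and per-account journal order.
type Engine interface {
	Apply(p Posting) (errCode string)
	Balances() map[string]int64      // account -> final balance in cents
	Journal(account string) []string // posting IDs in commit order
}

// Diff replays the corpus through both engines and reports the first
// observable deviation: per-posting error codes, final balances, or
// journal ordering within any account.
func Diff(corpus []Posting, legacy, candidate Engine) error {
	for _, p := range corpus {
		if le, ce := legacy.Apply(p), candidate.Apply(p); le != ce {
			return fmt.Errorf("posting %s: error code %q vs %q", p.ID, le, ce)
		}
	}
	lb, cb := legacy.Balances(), candidate.Balances()
	if !reflect.DeepEqual(lb, cb) {
		return fmt.Errorf("final balances diverge")
	}
	for acct := range lb {
		if !reflect.DeepEqual(legacy.Journal(acct), candidate.Journal(acct)) {
			return fmt.Errorf("journal order diverges on account %s", acct)
		}
	}
	return nil
}
```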

The replacement engine was written in Go with per-account serial queues, a Raft-replicated journal (etcd/raft) for durability, and strict FIFO ordering preserved at the account level. Cross-account postings — roughly 18% of volume — were resolved through a two-phase commit coordinator that we were able to keep under 200 lines of reviewable code because the invariants were narrow: an account participates in at most one multi-account posting at a time, coordinator state is journaled before any account is contacted, and timeouts always resolve to rollback.
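The per-account serial queue is the heart of the throughput win, so a condensed sketch is worth showing; every identifier here is invented for illustration. One goroutine drains each account's channel, which gives strict FIFO within an account while unrelated accounts proceed in parallel:

```go
package engine

import "sync"

// posting carries the minimum a worker needs; done lets the submitter
// block until its posting has been journaled and applied.
type posting struct {
	id          string
	amountCents int64
	done        chan error
}

type accountQueues struct {
	mu     sync.Mutex
	queues map[string]chan posting
	apply  func(account string, p posting) error // journal-then-apply, supplied by the engine
}

func newAccountQueues(apply func(string, posting) error) *accountQueues {
	return &accountQueues{queues: make(map[string]chan posting), apply: apply}
}

// submit enqueues a posting on its account's serial queue, creating the
// queue and its single worker goroutine on first use. Because exactly one
// goroutine drains each channel, postings on one account are strictly FIFO.
func (q *accountQueues) submit(account string, p posting) {
	q.mu.Lock()
	ch, ok := q.queues[account]
	if !ok {
		ch = make(chan posting, 1024)
		q.queues[account] = ch
		go func() {
			for p := range ch {
				p.done <- q.apply(account, p)
			}
		}()
	}
	q.mu.Unlock()
	ch <- p
}
```

The two-phase commit coordinator sits above queues like these; the invariant that an account participates in at most one multi-account posting at a time is what keeps its state machine small enough to review in one sitting.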

Cutover was tenant-by-tenant, largest last, over three weeks. Each tenant's postings were shadow-executed on the new engine for 48 hours before the new engine became authoritative, with an automated diff alert firing on any balance discrepancy above $0.00. The alert fired twice, both times due to a tenant-specific rounding rule the team had forgotten existed.
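
In sketch form, the diff check reduces to comparing every account balance across the two engines with a threshold of exactly zero. The types and names below are ours, not the client's:

```go
package cutover

import "fmt"

// balanceSource abstracts whichever engine we read balances from; both the
// legacy ledger and the shadow engine satisfied an interface like this.
type balanceSource interface {
	Balance(tenant, account string) (cents int64, err error)
}

// diffTenant compares every account balance for one tenant across the two
// engines. The alert threshold is exactly zero: any discrepancy fires.
func diffTenant(tenant string, accounts []string, legacy, shadow balanceSource) ([]string, error) {
	var alerts []string
	for _, acct := range accounts {
		lb, err := legacy.Balance(tenant, acct)
		if err != nil {
			return nil, err
		}
		sb, err := shadow.Balance(tenant, acct)
		if err != nil {
			return nil, err
		}
		if lb != sb {
			alerts = append(alerts, fmt.Sprintf("%s/%s: legacy %d vs shadow %d (cents)", tenant, acct, lb, sb))
		}
	}
	return alerts, nil
}
```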

Outcome

  • Sustained throughput rose from ~340 to over 11,000 postings per second on the same underlying database hardware
  • p99 end-to-end posting latency dropped from 2.1s to 38ms
  • Month-end reconciliation, previously a four-hour window during which postings were blocked, now completes in under twenty minutes and runs online
  • The differential tester is now part of the client's CI and blocks deploys on any semantic drift; it has caught three candidate regressions in the nine months since handoff
  • Two engineers on the client's team are now comfortable on-call for the posting engine; neither was before the engagement

Stack

Go 1.22 · PostgreSQL 15 · NATS JetStream · etcd/raft · OpenTelemetry · pgx v5
