February 2025 · Halvorsen
Bounded channels, or: why your pipeline OOMs at 3 a.m.
An unbounded channel is a queue whose backpressure is your process's RSS limit. This works, for a while, because most workloads sit comfortably below the limit and you never learn otherwise. Then a bursty tenant appears, or a downstream dependency slows by 40ms, and the queue grows until the kernel decides the question for you.
Our rule: every channel has a capacity, chosen by measuring the slowest downstream stage and sizing the buffer to hold roughly two seconds of its steady-state throughput. A shed policy — drop, block, or redirect — is written alongside the make(chan T, N) call, in a comment that names the policy owner. If you cannot answer "what happens when this is full?", the channel is not production-ready.
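A minimal sketch of the shape we mean, with a drop-on-full policy; the event type, the capacity arithmetic, and the owner named in the comment are all stand-ins:

    package main

    import (
        "fmt"
        "time"
    )

    type Event struct{ TenantID string }

    // Capacity: roughly two seconds of the slowest stage's steady-state
    // throughput (assumed ~512 events/s here, so 1024).
    // Shed policy: drop newest on overflow. Owner: ingestion team.
    const eventBufferSize = 1024

    func main() {
        events := make(chan Event, eventBufferSize)

        go func() {
            for e := range events {
                time.Sleep(2 * time.Millisecond) // simulated slow downstream stage
                _ = e
            }
        }()

        publish := func(e Event) {
            select {
            case events <- e: // normal path
            default:
                // Shed: the buffer is full; count the drop instead of growing RSS.
                fmt.Println("dropped event for tenant", e.TenantID)
            }
        }

        for i := 0; i < 2000; i++ {
            publish(Event{TenantID: "bursty"})
        }
        time.Sleep(200 * time.Millisecond) // let the consumer drain a little
    }

A blocking send in place of the select would instead propagate backpressure upstream; the point is that either choice is written down, not inherited from whatever the producer happens to do.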
November 2024 · Halvorsen
On channels, and when not to reach for them
New Go developers tend to model everything as a channel because the language makes it feel canonical. In practice the channel-per-problem instinct produces systems that are harder to reason about than the mutex-guarded struct they replaced. A useful heuristic: reach for a channel when ownership of a value must transfer between goroutines. Reach for a mutex when ownership stays put and you are coordinating access. Most coordination bugs we fix are the first case pretending to be the second, or vice versa.
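Both cases, side by side, in a sketch; the job and counter types are invented for illustration:

    package main

    import (
        "fmt"
        "sync"
    )

    // Case 1: ownership transfers. The worker takes sole ownership of each
    // job and nothing else touches it afterwards, so a channel fits.
    type job struct{ id int }

    func worker(jobs <-chan job, done chan<- struct{}) {
        for j := range jobs {
            fmt.Println("processing", j.id) // sole owner of j here
        }
        done <- struct{}{}
    }

    // Case 2: ownership stays put. Many goroutines update one counter in
    // place; a mutex-guarded struct is the simpler tool.
    type counter struct {
        mu sync.Mutex
        n  int
    }

    func (c *counter) inc() {
        c.mu.Lock()
        defer c.mu.Unlock()
        c.n++
    }

    func main() {
        jobs := make(chan job, 4)
        done := make(chan struct{})
        go worker(jobs, done)
        for i := 0; i < 4; i++ {
            jobs <- job{id: i}
        }
        close(jobs)
        <-done

        var c counter
        var wg sync.WaitGroup
        for i := 0; i < 8; i++ {
            wg.Add(1)
            go func() { defer wg.Done(); c.inc() }()
        }
        wg.Wait()
        fmt.Println("count:", c.n)
    }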
September 2024 · Pritchard
The "one more service" fallacy
When a team is under delivery pressure and an existing service is painful to modify, the path of least resistance is to stand up a new service beside it. Six months later the new service has grown its own dependencies and the old one is still there. We arrive, usually, around month fourteen. The lesson we repeat: the cost of a new service is not measured the day you deploy it. It is measured the third time someone is paged for it at 2 a.m.
July 2024 · Addo-Mensah
Benchmarks that lie, and how to spot them
A benchmark is a scientific claim. It should be reproducible, it should state its conditions, and it should report variance. If your performance improvement is "30% faster" with no mention of percentile, workload, hardware, or run count, it is not a benchmark; it is a press release. We require every performance claim in our final reports to include p50, p95, p99, sample size, standard deviation, and the exact commit hash of both the old and new implementation. This discipline is boring and it catches real mistakes before they are handed to a VP.
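A sketch of the arithmetic behind those numbers, using the simple nearest-rank method for percentiles; the sample values and the output line are placeholders, not our report template:

    package main

    import (
        "fmt"
        "math"
        "sort"
        "time"
    )

    // report summarizes latency samples with the fields our reports require.
    func report(samples []time.Duration, oldCommit, newCommit string) {
        sort.Slice(samples, func(i, j int) bool { return samples[i] < samples[j] })
        // Nearest-rank percentile: the smallest sample at or above rank p.
        pct := func(p float64) time.Duration {
            idx := int(math.Ceil(p/100*float64(len(samples)))) - 1
            if idx < 0 {
                idx = 0
            }
            return samples[idx]
        }
        var sum, sumSq float64
        for _, s := range samples {
            f := float64(s)
            sum += f
            sumSq += f * f
        }
        mean := sum / float64(len(samples))
        stddev := time.Duration(math.Sqrt(sumSq/float64(len(samples)) - mean*mean))

        fmt.Printf("n=%d p50=%v p95=%v p99=%v stddev=%v old=%s new=%s\n",
            len(samples), pct(50), pct(95), pct(99), stddev, oldCommit, newCommit)
    }

    func main() {
        samples := []time.Duration{ // stand-in measurements
            9 * time.Millisecond, 11 * time.Millisecond, 10 * time.Millisecond,
            12 * time.Millisecond, 48 * time.Millisecond,
        }
        report(samples, "abc1234", "def5678")
    }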
A specific failure mode we see often: the benchmark ran on an idle laptop, the production system shares cores with a noisy neighbour, and the "3×" improvement evaporates under contention. Isolated-core measurements belong in a paper. Production claims belong on production-shaped hardware.
April 2024 · Pritchard
Strangling a Rails monolith without drama
The strangler-fig pattern works. It works slowly, it works unevenly, and it works only if the team has the patience to run two systems in parallel for longer than is politically comfortable. Our rule of thumb: if leadership is not willing to run the old and new paths side by side for at least a full quarter, the rewrite will fail — not because the code is wrong, but because the cutover will be rushed. We have declined engagements on these grounds, twice.
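Mechanically, the front door can be as small as a reverse proxy that pins a stable cohort of users to the new path while everyone else stays on the monolith; the header name, the percentage, and the upstream addresses below are assumptions:

    package main

    import (
        "hash/fnv"
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    func main() {
        legacy, _ := url.Parse("http://legacy-rails:3000")
        next, _ := url.Parse("http://new-service:8080")
        legacyProxy := httputil.NewSingleHostReverseProxy(legacy)
        nextProxy := httputil.NewSingleHostReverseProxy(next)

        const newPathPercent = 10 // widen gradually over the quarter

        http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
            h := fnv.New32a()
            h.Write([]byte(r.Header.Get("X-User-ID"))) // stable per-user cohort
            if h.Sum32()%100 < newPathPercent {
                nextProxy.ServeHTTP(w, r)
                return
            }
            legacyProxy.ServeHTTP(w, r)
        })
        log.Fatal(http.ListenAndServe(":8000", nil))
    }

Hashing the user rather than rolling a die per request matters: a user who flaps between old and new paths mid-session is how cutovers get blamed for bugs that are really inconsistency.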
January 2024 · Addo-Mensah
Why we write fewer generics than you'd expect
Generics arrived in Go 1.18 and opened a door that most production codebases should step through carefully. A generic function is a contract with every future caller. In consulting we tend to write concrete types first, duplicate twice, and extract a generic only on the third occurrence — and even then only if the type parameter is doing real work beyond saving a few keystrokes. Our handoff code tends to read as slightly verbose and very unsurprising. That is deliberate.
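What the third occurrence looks like when we finally extract it; filter is our name for the sketch, not a standard library function:

    package main

    import "fmt"

    // Third occurrence of the same "keep items matching a predicate" shape;
    // only now is the generic extracted, and the type parameter is doing
    // real work: one body, any element type, no interface{} round-trips.
    func filter[T any](in []T, keep func(T) bool) []T {
        out := make([]T, 0, len(in))
        for _, v := range in {
            if keep(v) {
                out = append(out, v)
            }
        }
        return out
    }

    func main() {
        evens := filter([]int{1, 2, 3, 4}, func(n int) bool { return n%2 == 0 })
        short := filter([]string{"ab", "abcd"}, func(s string) bool { return len(s) <= 2 })
        fmt.Println(evens, short)
    }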
October 2023 · Addo-Mensah
Connection pools are not free and neither is SELECT *
We have audited seven production Go services in the last two years where the PostgreSQL pool was sized at the library default of two connections per CPU, and the driver was silently serializing under load. If your pgx pool is saturated for more than 200ms at a time, your p99 is a function of your pool size, not your query plan. Measure pool_wait_seconds. Log it. Alert on it.
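One way to get that number out of pgx v5, sketched below; the connection string and poll interval are placeholders, and the printed line stands in for wherever your metrics actually go:

    package main

    import (
        "context"
        "fmt"
        "log"
        "time"

        "github.com/jackc/pgx/v5/pgxpool"
    )

    func main() {
        ctx := context.Background()
        pool, err := pgxpool.New(ctx, "postgres://app@localhost:5432/app")
        if err != nil {
            log.Fatal(err)
        }
        defer pool.Close()

        ticker := time.NewTicker(15 * time.Second)
        defer ticker.Stop()
        var lastWait time.Duration
        for range ticker.C {
            st := pool.Stat()
            // AcquireDuration is cumulative over the pool's lifetime; the
            // delta between polls is the time callers spent waiting for a
            // connection in that window.
            wait := st.AcquireDuration()
            fmt.Printf("pool_wait_seconds=%f acquires=%d empty_acquires=%d\n",
                (wait - lastWait).Seconds(), st.AcquireCount(), st.EmptyAcquireCount())
            lastWait = wait
        }
    }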
And SELECT * through a Go ORM is not lazy — it materializes every column you did not need, through three layers of reflection, on every call. Name your columns. Your driver's allocator will thank you, and so will your p99.
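The named-column version, sketched against pgx; the users table and its columns are illustrative:

    package db

    import (
        "context"

        "github.com/jackc/pgx/v5/pgxpool"
    )

    // fetchEmail names its column instead of SELECT *: one value crosses
    // the wire, one destination gets scanned, and adding a column to the
    // table later costs this call path nothing.
    func fetchEmail(ctx context.Context, pool *pgxpool.Pool, userID int64) (string, error) {
        var email string
        err := pool.QueryRow(ctx,
            "SELECT email FROM users WHERE id = $1", userID,
        ).Scan(&email)
        return email, err
    }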