Work · Media · 8 weeks · 2023
Deterministic job scheduler for a video encoding farm
A video infrastructure company was missing SLAs on premium encoding jobs during peak hours. Their scheduler was a priority queue with a handful of ad-hoc overrides and no model of deadlines, which meant premium jobs could wait behind long-running bulk work.
The situation
The team knew the solution involved deadline awareness but had twice deferred the rewrite because the existing scheduler touched too many parts of the platform. Every attempt to modify it had produced regressions nobody could reproduce outside production — the classic "works on my laptop, breaks at 9 p.m. Pacific on Fridays" pattern that indicates a workload the test suite does not represent.
What we did
Before writing any scheduler code we built a simulator: a Go harness that could replay a year of historical job submissions against any candidate scheduler and measure deadline miss rate, preemption count, encoder utilization, and a per-tier fairness metric. This reduced the argument from "which scheduler feels right" to "which scheduler wins on the replay corpus." The simulator became the client's most-used engineering artifact from the engagement.
The production scheduler we shipped is a variant of EDF (Earliest Deadline First) with bounded preemption and a tiebreaker that preserves locality for jobs already partially encoded on a given node. Preemption is only permitted when the preempting job's deadline is within 1.5× its expected runtime and the preempted job has less than 30% remaining. These constants were tuned against the simulator rather than guessed at. The core fits in under 900 lines of Go, including comments, because we resisted the temptation to generalize beyond the workload.
Cutover was done by job tier: free-tier traffic moved first for two weeks while we watched the miss-rate curve, then standard, then premium. At no point did both schedulers run on the same job — a premium job either went through the new path or did not, with the selection made at submission time and recorded for replay.
Outcome
- Premium-tier SLA miss rate fell from 4.2% to under 0.1%
- Encoder utilization rose from 63% to 81% at peak
- The simulator is now run against every scheduler change in CI, and has blocked two candidate changes in the months since handoff
- Incident count related to scheduler behavior went to zero over the six months following handoff
- Per-tier fairness metric (ratio of p95 wait times across tiers) improved from 8.4 to 1.9
Stack
Go 1.20 · Redis Streams · FFmpeg orchestration · Prometheus · Grafana