Evidence-gated rollout
Tie behaviour checks, telemetry, and rollback paths into one promotion decision. Model promotions, adapter swaps, GPU failover: anything with ugly tail risk gets a proof of safety before it touches production.
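A minimal sketch of the shape of that single decision, in TypeScript; every name and threshold below is illustrative, not a real API:

```ts
// Illustrative only: field names and budgets are hypothetical.
type Evidence = {
  behaviourChecksPassed: boolean; // eval suite ran against the candidate
  p99LatencyMs: number;           // live telemetry from the canary slice
  errorRate: number;              // canary error rate over the soak window
  rollbackRehearsed: boolean;     // rollback path exercised, not just documented
};

type Verdict = { promote: boolean; reasons: string[] };

function promotionVerdict(e: Evidence): Verdict {
  const reasons: string[] = [];
  if (!e.behaviourChecksPassed) reasons.push("behaviour checks failed");
  if (e.p99LatencyMs > 800) reasons.push(`p99 ${e.p99LatencyMs}ms over budget`);
  if (e.errorRate > 0.01) reasons.push(`error rate ${e.errorRate} over budget`);
  if (!e.rollbackRehearsed) reasons.push("rollback path not rehearsed");
  return { promote: reasons.length === 0, reasons };
}
```

Flip any one input and the verdict flips with it. That is the point of gating on evidence instead of vibes.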
AI can write the implementation. Production still needs proof. I build the evidence systems that close the gap between 'it compiles' and 'it's safe to ship' — behavioural verification, context-aware rollout, and rollback discipline for model routing, eval updates, and GPU-heavy deploys.
In AI-native entertainment, a model regression is a product outage. p99 latency is UX. Eval coverage is product quality. Rollback discipline is how you keep shipping daily.
I write TypeScript and Python. I also write the proof that what I shipped won't page someone at 3am.
Uncheck things. Watch the verdict change. This is how I think about shipping risk.
Built the eval systems that catch model behaviour failures before 25M users do.
Shipped AI where bad deploys trigger regulatory incidents, not retros. Audit trails, rollback, real validation.
Compressed GPU-heavy pipelines from weeks to hours. Heavy compute is my native environment.
Published on how models break. Adversarial rigour without governance theatre.
Every high-risk operation (backfills, provider migrations, capacity shifts) gets a documented runbook before anyone touches production. Cheap failure is what makes fast iteration possible.
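A runbook can be a typed artifact instead of a wiki page. A sketch, with every field and value hypothetical:

```ts
// Hypothetical runbook shape; adapt the fields to your own operations.
interface Runbook {
  operation: string;       // e.g. a backfill or provider migration
  preconditions: string[]; // what must already be true before step one
  steps: string[];         // ordered, each independently verifiable
  abortSignals: string[];  // observations that mean stop immediately
  rollback: string[];      // the concrete path back, rehearsed in advance
  owner: string;           // who gets paged if it goes sideways
}

const gpuFailover: Runbook = {
  operation: "GPU capacity failover",
  preconditions: ["standby pool warm", "shadow traffic verified on standby"],
  steps: ["shift 10% of traffic", "hold and verify p99", "shift the remainder"],
  abortSignals: ["standby p99 regression", "queue depth climbing"],
  rollback: ["re-route to primary pool", "confirm error rate recovers"],
  owner: "on-call platform engineer",
};
```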
Pull logs, metrics, analytics, and incident context into the verification layer. The repo looking fine and the system being fine are different problems.
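A sketch of that aggregation, reusing the Evidence shape from the first snippet. Every fetcher here is a hypothetical stand-in for your own logging, metrics, analytics, and incident-tracker integrations:

```ts
// Hypothetical fetchers: stand-ins for real observability APIs.
declare function fetchEvalResults(deployId: string): Promise<{ allPassed: boolean }>;
declare function fetchCanaryP99(deployId: string): Promise<number>;
declare function fetchCanaryErrorRate(deployId: string): Promise<number>;
declare function fetchOpenIncidents(): Promise<string[]>;
declare function rollbackDrillCompleted(deployId: string): Promise<boolean>;

async function gatherEvidence(deployId: string): Promise<Evidence> {
  const [evals, p99LatencyMs, errorRate, incidents] = await Promise.all([
    fetchEvalResults(deployId),
    fetchCanaryP99(deployId),
    fetchCanaryErrorRate(deployId),
    fetchOpenIncidents(),
  ]);
  return {
    // A green repo is one signal; an open incident vetoes promotion anyway.
    behaviourChecksPassed: evals.allPassed && incidents.length === 0,
    p99LatencyMs,
    errorRate,
    rollbackRehearsed: await rollbackDrillCompleted(deployId),
  };
}
```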
Pick your scariest pending change. I'll turn it into a boring, evidence-backed rollout in the trial.
send me the scary one