0. context

Shipping one ugly coreflow change

This page follows one concrete production change from design to incident response: promoting a new route policy for a live multimodal session.

At coreflow, model behaviour is product behaviour. Latency is UX. Eval coverage is product quality. The code is the easy bit. The hard part is context, verification, rollout, and rollback under real production pressure.

This is how I make risky AI changes boring to ship.

1. the design

Live route design

Design the request path for a real-time session that can fail over without losing observability or rollback.

Route graph: connect the components required for a safe route-policy promotion.

Route status: unsafe.
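
A minimal sketch of how a route status like the one above could be computed, assuming the route is modelled as a directed graph and "safe" means every required component is reachable from the session entry point. The component names (`session_gateway`, `primary_model`, `fallback_model`, `telemetry`, `rollback_switch`) are hypothetical; the page does not name the real ones.

```python
from collections import deque

# Hypothetical component names, chosen for illustration only.
REQUIRED = {"primary_model", "fallback_model", "telemetry", "rollback_switch"}


def reachable(graph, start):
    """Return every node reachable from start in a {node: [neighbours]} graph."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def route_status(graph, entry="session_gateway"):
    """Return ("safe", []) or ("unsafe", [missing components, sorted])."""
    missing = sorted(REQUIRED - reachable(graph, entry))
    return ("safe", []) if not missing else ("unsafe", missing)


# A route that can fail over but has no telemetry wired in stays unsafe.
draft = {
    "session_gateway": ["router"],
    "router": ["primary_model", "fallback_model", "rollback_switch"],
}
complete = {**draft, "router": draft["router"] + ["telemetry"]}
```

With these two graphs, `route_status(draft)` reports the missing telemetry link and `route_status(complete)` reports safe. Reachability is a deliberately weak check; a real design review would also look at what happens on each edge when a component fails.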

2. the gate

Promotion gate

Tune the new route policy until quality improves without wrecking latency or cost.

Policy controls: adjust.

Promotion gate: not met.
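
One way to express a gate like this, as a sketch: compare the candidate policy against the current baseline and require a quality gain while capping latency and cost regressions. The metric names, the `GateBudget` thresholds, and the sample numbers are all assumptions for illustration, not values from a real system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PolicyMetrics:
    quality: float           # eval score, higher is better (assumed scale 0..1)
    p95_latency_ms: float
    cost_per_session: float


@dataclass(frozen=True)
class GateBudget:
    # Hypothetical thresholds; tune to the product's own tolerances.
    min_quality_gain: float = 0.01
    max_latency_regression: float = 0.05   # fraction of baseline
    max_cost_regression: float = 0.10      # fraction of baseline


def promotion_gate(baseline, candidate, budget=GateBudget()):
    """Return the list of failed checks; an empty list means the gate is met."""
    failures = []
    if candidate.quality - baseline.quality < budget.min_quality_gain:
        failures.append("quality")
    if candidate.p95_latency_ms > baseline.p95_latency_ms * (1 + budget.max_latency_regression):
        failures.append("latency")
    if candidate.cost_per_session > baseline.cost_per_session * (1 + budget.max_cost_regression):
        failures.append("cost")
    return failures


baseline = PolicyMetrics(quality=0.80, p95_latency_ms=400, cost_per_session=0.020)
good = PolicyMetrics(quality=0.83, p95_latency_ms=410, cost_per_session=0.021)
slow = PolicyMetrics(quality=0.83, p95_latency_ms=500, cost_per_session=0.021)
```

Returning the failed checks rather than a bare boolean keeps the "not met" state explainable: the person tuning the policy sees which budget was blown.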

3. the ship

Evidence-gated rollout

Promote the route policy behind shadow traffic, canary guards, and live business metrics.

Rollout trace: step through.

Rollout status: waiting.

This is the job: tie behaviour checks, telemetry, and rollback paths into one promotion decision.
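
As a sketch, that single promotion decision can be written as a small state machine over rollout stages. The stage names follow the text above (shadow, canary, then full traffic); the evidence keys `behaviour_ok`, `telemetry_ok`, and `rollback_ready` are hypothetical stand-ins for whatever checks a real pipeline runs.

```python
from enum import Enum


class Stage(Enum):
    SHADOW = "shadow"
    CANARY = "canary"
    FULL = "full"
    ROLLED_BACK = "rolled_back"


_NEXT = {Stage.SHADOW: Stage.CANARY, Stage.CANARY: Stage.FULL, Stage.FULL: Stage.FULL}


def next_stage(stage, evidence):
    """Advance, hold, or roll back based on one bundle of evidence.

    evidence is a dict of booleans; missing keys count as False.
    """
    if stage is Stage.ROLLED_BACK:
        return stage
    healthy = evidence.get("behaviour_ok", False) and evidence.get("telemetry_ok", False)
    if not healthy:
        # Shadow traffic serves no users, so hold there; anywhere else, roll back.
        return stage if stage is Stage.SHADOW else Stage.ROLLED_BACK
    if not evidence.get("rollback_ready", False):
        # Healthy, but never advance without a verified way back.
        return stage
    return _NEXT[stage]


all_green = {"behaviour_ok": True, "telemetry_ok": True, "rollback_ready": True}
```

For example, `next_stage(Stage.SHADOW, all_green)` moves to canary, while a canary whose telemetry check fails goes straight to `ROLLED_BACK`. Missing evidence is treated as failing evidence, which is the conservative default for a gate.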

4. the recover

GPU failover / provider incident

P99 latency blows out during peak demand. Stabilise the system without turning a latency event into a product outage.

Interactive panels: investigation terminal, mitigation actions.
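
A rough sketch of the first decision in an incident like this, assuming latency samples are available per provider and a single P99 target is in force. It uses the nearest-rank percentile; the action names (`hold`, `shift_to_fallback`, `degrade`) and the sample numbers are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    if not samples:
        raise ValueError("percentile of empty sample set")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def choose_mitigation(primary_ms, fallback_ms, slo_ms):
    """Pick a mitigation from recent latency samples (milliseconds)."""
    if percentile(primary_ms, 99) <= slo_ms:
        return "hold"                  # primary is within target, change nothing
    if fallback_ms and percentile(fallback_ms, 99) <= slo_ms:
        return "shift_to_fallback"     # fallback is healthy, move traffic
    return "degrade"                   # no healthy route, shed features instead of failing


primary = [100] * 98 + [900, 950]      # tail latency has blown out
fallback = [200] * 100                 # slower on average, but a stable tail
```

Here `choose_mitigation(primary, fallback, slo_ms=500)` picks the fallback, because the primary's P99 is 900 ms while the fallback's is 200 ms. If both tails are over target the sketch degrades rather than fails, which is the "latency event, not outage" trade the scenario describes.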

Bring me a route change, model promotion, adapter swap, stateful backfill, or GPU failover scenario. I will turn it into a boring rollout in the trial.