gareth roberts / coreflow / founding engineer
0. thesis

I make risky AI changes boring to ship.

AI can write the implementation. Production still needs proof. I build the evidence systems that close the gap between 'it compiles' and 'it's safe to ship' — behavioural verification, context-aware rollout, and rollback discipline for model routing, eval updates, and GPU-heavy deploys.

In AI-native entertainment, a model regression is a product outage. p99 latency is UX. Eval coverage is product quality. Rollback discipline is how you keep shipping daily.

I write TypeScript and Python. I also write the proof that what I shipped won't page someone at 3am.

Sydney · on-site · paid-trial ready · multi-agent evals · 0→1 AI builds · heavy compute
1. rollout console

Uncheck things. Watch the verdict change. This is how I think about shipping risk.

configure change: route_policy
controls: scenario · available context · validation gates

scenario: Model route-policy promotion, a reversible behavioural change

risk class          R1
context coverage    83%
verification score  78%
autonomy boundary   guarded

further panels: recommended rollout · watch these signals · why I am useful here · event log · evidence bundle (json)
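A minimal sketch of the verdict logic behind that readout, in TypeScript. The thresholds and names are illustrative assumptions, not the console's real internals; the shape of the decision is the point.

  // Illustrative thresholds; the live console's values are not these.
  type Boundary = "autonomous" | "guarded" | "blocked";

  interface ChangeSignals {
    gatesPassed: boolean;      // every validation gate green?
    contextCoverage: number;   // 0..1: how much relevant context was available
    verificationScore: number; // 0..1: fraction of behavioural checks passed
  }

  function autonomyBoundary(s: ChangeSignals): Boundary {
    if (!s.gatesPassed || s.verificationScore < 0.5) return "blocked";
    if (s.contextCoverage >= 0.9 && s.verificationScore >= 0.9) return "autonomous";
    return "guarded"; // ship, but staged, watched, and with rollback armed
  }

  // The default readout above: 83% coverage, 78% verification -> guarded.
  autonomyBoundary({ gatesPassed: true, contextCoverage: 0.83, verificationScore: 0.78 });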
2. proof / selected systems

Culture Amp / multi-agent coaching at enterprise scale
20+ languages · 6,500 enterprise clients · 25M users
multi-agent / behavioural eval / deterministic state
• Architected production multi-agent LLM systems for autonomous multilingual AI coaching across 25M users.
• Built translation validation using multi-agent debate patterns and psychometric rigour — accuracy, fairness, and behavioural reliability all had to be measured, not assumed.
• Engineered deterministic state guarantees and full audit trails for bias-detection and fairness pipelines.

Built the eval systems that catch model behaviour failures before 25M users do.

NEOS / production AI from first hire in a regulated insurer
built the AI function from zero · 4.4× throughput
model risk / APRA audit / end-to-end ownership
• Recruited the team, set technical direction, and shipped production AI in a regulated insurance environment.
• Designed a multi-agent RAG underwriting system across structured and unstructured inputs — 4.4× throughput improvement.
• Implemented model validation, monitoring, and audit procedures aligned to APRA-grade operational expectations.

Shipped AI where bad deploys trigger regulatory incidents, not retros. Audit trails, rollback, real validation.

Source Localisation / terabyte-scale ML under time pressure
computer vision · geospatial ML · Tier-1 mining
deep learning / GPU pipelines / terabyte ingest
• Built production computer vision and geospatial ML pipelines for Tier-1 mining clients.
• Collapsed exploration target identification from 6+ weeks to under 24 hours using deep learning.
• Designed data systems handling terabyte-scale datasets — feature extraction, dimensionality reduction, predictive modelling.

Compressed GPU-heavy pipelines from weeks to hours. Heavy compute is my native environment.

Research / model misbehaviour under adversarial conditions
insideLLMs · adversarial debate · prompt injection
adversarial testing / cross-model eval / prompt injection
• Published on LLM misbehaviour, prompt injection, and cross-model behavioural comparison.
• Bring research-grade experimental design into product systems without the process overhead.
• Comfortable operating where model behaviour, reliability, and organisational risk collide.

Published on how models break. Adversarial rigour without governance theatre.

3. first 30 days / what I would own

wedge 01

Evidence-gated rollout

Tie behaviour checks, telemetry, and rollback paths into one promotion decision. Model promotions, adapter swaps, GPU failover — anything with ugly tail-risk gets proof-of-safety before it touches production.
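A minimal sketch of what one promotion decision means in practice. The evidence-bundle shape and budgets are illustrative assumptions, not an existing contract; the point is that behaviour checks, telemetry, and a tested rollback path are one required input, not three optional ones.

  // Illustrative only: field names and budgets are assumptions.
  interface EvidenceBundle {
    behaviourChecks: { passed: number; failed: number };        // eval suite results
    telemetry: { p99LatencyMs: number; errorRate: number };     // canary signals
    rollbackPath: { tested: boolean; minutesToRevert: number };
  }

  function canPromote(e: EvidenceBundle): boolean {
    return (
      e.behaviourChecks.failed === 0 &&     // no known behavioural regressions
      e.telemetry.errorRate < 0.01 &&       // hypothetical error budget
      e.telemetry.p99LatencyMs < 800 &&     // p99 latency is UX
      e.rollbackPath.tested &&              // an untested rollback is no rollback
      e.rollbackPath.minutesToRevert <= 15  // hypothetical revert budget
    );
  }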

wedge 02

Scary-change playbooks

Every high-risk operation — backfills, provider migrations, capacity shifts — gets a documented runbook before anyone touches production. Cheap failure is fast iteration.
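A runbook is only real if it's structured enough to check. A sketch of the schema I'd reach for, with field names that are assumptions rather than an existing standard:

  // Hypothetical runbook schema: a scary change is blocked until one exists.
  interface Runbook {
    operation: string;       // e.g. "provider migration"
    preconditions: string[]; // must all be true before step one
    steps: string[];         // ordered; each one independently abortable
    abortCriteria: string[]; // signals that stop the operation immediately
    rollback: string[];      // the way back, written before the way forward
    lastRehearsed?: string;  // ISO date; a stale rehearsal is itself a warning
  }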

wedge 03

Repo-to-reality loop

Pull logs, metrics, analytics, and incident context into the verification layer. The repo looking fine and the system being fine are different problems.
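A sketch of that loop, with signal sources and thresholds that are illustrative assumptions:

  // Illustrative: merge repo-side and prod-side signals into one answer.
  interface RealitySignals {
    ciGreen: boolean;          // repo-side: tests and evals pass
    errorRate: number;         // prod-side: from metrics
    openIncidents: number;     // prod-side: from incident tooling
    anomalousLogLines: number; // prod-side: from log scanning
  }

  function systemIsFine(s: RealitySignals): boolean {
    const repoFine = s.ciGreen;
    const prodFine =
      s.errorRate < 0.01 && s.openIncidents === 0 && s.anomalousLogLines === 0;
    return repoFine && prodFine; // either one alone is not an answer
  }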

Pick your scariest pending change. I'll turn it into a boring, evidence-backed rollout in the trial.

send me the scary one