gareth roberts / coreflow / founding engineer
0. thesis

I make risky AI changes boring to ship.

AI can write the implementation. Production still needs proof. I build the evidence systems that close the gap between 'it compiles' and 'it's safe to ship' — behavioural verification, context-aware rollout, and rollback discipline for model routing, eval updates, and GPU-heavy deploys.

In AI-native entertainment, a model regression is a product outage. p99 latency is UX. Eval coverage is product quality. Rollback discipline is how you keep shipping daily.

I write TypeScript and Python. I also write the proof that what I shipped won't page someone at 3am.

Sydney · on-site · paid-trial ready · multi-agent evals · 0→1 AI builds · heavy compute
1. rollout console

Uncheck things. Watch the verdict change. This is how I think about shipping risk.

configure change: route_policy
controls: scenario · available context · validation gates

scenario: Model route-policy promotion, a reversible behavioural change

risk class          R1
context coverage    83%
verification score  78%
autonomy boundary   guarded

further panels: recommended rollout · watch these signals · why I am useful here · event log · evidence bundle (json)
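A minimal sketch of the verdict logic behind that readout, in TypeScript. The thresholds and names are illustrative assumptions, not the console's real internals; the shape of the decision is the point.

  // Illustrative thresholds; the live console's values are not these.
  type Boundary = "autonomous" | "guarded" | "blocked";

  interface ChangeSignals {
    gatesPassed: boolean;      // every validation gate green?
    contextCoverage: number;   // 0..1: how much relevant context was available
    verificationScore: number; // 0..1: fraction of behavioural checks passed
  }

  function autonomyBoundary(s: ChangeSignals): Boundary {
    if (!s.gatesPassed || s.verificationScore < 0.5) return "blocked";
    if (s.contextCoverage >= 0.9 && s.verificationScore >= 0.9) return "autonomous";
    return "guarded"; // ship, but staged, watched, and with rollback armed
  }

  // The default readout above: 83% coverage, 78% verification -> guarded.
  autonomyBoundary({ gatesPassed: true, contextCoverage: 0.83, verificationScore: 0.78 });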
2. proof / selected systems

Culture Amp / multi-agent coaching at enterprise scale
20+ languages · 6,500 enterprise clients · 25M users
multi-agent / behavioural eval / deterministic state
• Architected production multi-agent LLM systems for autonomous multilingual AI coaching across 25M users.
• Built translation validation using multi-agent debate patterns and psychometric rigour — accuracy, fairness, and behavioural reliability all had to be measured, not assumed.
• Engineered deterministic state guarantees and full audit trails for bias-detection and fairness pipelines.

Built the eval systems that catch model behaviour failures before 25M users do.

NEOS / production AI from first hire in a regulated insurer
built the AI function from zero · 4.4× throughput
model risk / APRA audit / end-to-end ownership
• Recruited the team, set technical direction, and shipped production AI in a regulated insurance environment.
• Designed a multi-agent RAG underwriting system across structured and unstructured inputs — 4.4× throughput improvement.
• Implemented model validation, monitoring, and audit procedures aligned to APRA-grade operational expectations.

Shipped AI where bad deploys trigger regulatory incidents, not retros. Audit trails, rollback, real validation.

Source Localisation / terabyte-scale ML under time pressure
computer vision · geospatial ML · Tier-1 mining
deep learning / GPU pipelines / terabyte ingest
• Built production computer vision and geospatial ML pipelines for Tier-1 mining clients.
• Collapsed exploration target identification from 6+ weeks to under 24 hours using deep learning.
• Designed data systems handling terabyte-scale datasets — feature extraction, dimensionality reduction, predictive modelling.

Compressed GPU-heavy pipelines from weeks to hours. Heavy compute is my native environment.

Research / model misbehaviour under adversarial conditions
insideLLMs · adversarial debate · prompt injection
adversarial testing / cross-model eval / prompt injection
• Published on LLM misbehaviour, prompt injection, and cross-model behavioural comparison.
• Bring research-grade experimental design into product systems without the process overhead.
• Comfortable operating where model behaviour, reliability, and organisational risk collide.

Published on how models break. Adversarial rigour without governance theatre.

3. first 30 days / what I would own

wedge 01

Evidence-gated rollout

Tie behaviour checks, telemetry, and rollback paths into one promotion decision. Model promotions, adapter swaps, GPU failover — anything with ugly tail-risk gets proof-of-safety before it touches production.
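A minimal sketch of what one promotion decision means in practice. The evidence-bundle shape and budgets are illustrative assumptions, not an existing contract; the point is that behaviour checks, telemetry, and a tested rollback path are one required input, not three optional ones.

  // Illustrative only: field names and budgets are assumptions.
  interface EvidenceBundle {
    behaviourChecks: { passed: number; failed: number };        // eval suite results
    telemetry: { p99LatencyMs: number; errorRate: number };     // canary signals
    rollbackPath: { tested: boolean; minutesToRevert: number };
  }

  function canPromote(e: EvidenceBundle): boolean {
    return (
      e.behaviourChecks.failed === 0 &&     // no known behavioural regressions
      e.telemetry.errorRate < 0.01 &&       // hypothetical error budget
      e.telemetry.p99LatencyMs < 800 &&     // p99 latency is UX
      e.rollbackPath.tested &&              // an untested rollback is no rollback
      e.rollbackPath.minutesToRevert <= 15  // hypothetical revert budget
    );
  }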

wedge 02

Scary-change playbooks

Every high-risk operation — backfills, provider migrations, capacity shifts — gets a documented runbook before anyone touches production. Cheap failure is fast iteration.
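A runbook is only real if it's structured enough to check. A sketch of the schema I'd reach for, with field names that are assumptions rather than an existing standard:

  // Hypothetical runbook schema: a scary change is blocked until one exists.
  interface Runbook {
    operation: string;       // e.g. "provider migration"
    preconditions: string[]; // must all be true before step one
    steps: string[];         // ordered; each one independently abortable
    abortCriteria: string[]; // signals that stop the operation immediately
    rollback: string[];      // the way back, written before the way forward
    lastRehearsed?: string;  // ISO date; a stale rehearsal is itself a warning
  }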

wedge 03

Repo-to-reality loop

Pull logs, metrics, analytics, and incident context into the verification layer. The repo looking fine and the system being fine are different problems.
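A sketch of that loop, with signal sources and thresholds that are illustrative assumptions:

  // Illustrative: merge repo-side and prod-side signals into one answer.
  interface RealitySignals {
    ciGreen: boolean;          // repo-side: tests and evals pass
    errorRate: number;         // prod-side: from metrics
    openIncidents: number;     // prod-side: from incident tooling
    anomalousLogLines: number; // prod-side: from log scanning
  }

  function systemIsFine(s: RealitySignals): boolean {
    const repoFine = s.ciGreen;
    const prodFine =
      s.errorRate < 0.01 && s.openIncidents === 0 && s.anomalousLogLines === 0;
    return repoFine && prodFine; // either one alone is not an answer
  }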

Pick your scariest pending change. I'll turn it into a boring, evidence-backed rollout in the trial.

send me the scary one