How I actually test a new frontier model

Every lab ships a chart showing their new model winning. Then you use it and the chart evaporates. So when a frontier model lands on lvl30 with a score, here’s exactly what that number went through first.

The gauntlet

I run the same five tasks against every model, because the only fair comparison is a fixed one. None of these are clever gotchas — they’re the work I’d actually hand a model on a normal Tuesday.

1. Refactor a gnarly real file. Not a toy. A 600-line module from one of my own repos with bad names and hidden coupling. Does it preserve behaviour?
2. Long-context recall. Drop in a 40-page spec and ask for the three constraints that contradict each other. Tests attention, not vibes.
3. Tool-use under ambiguity. Give it a fuzzy task and a few tools. Does it ask before doing something destructive, or charge ahead?
4. “Explain it to me at two levels.” Same concept for a beginner and for a peer. Catches models that only have one register.
5. Say no. A request with a subtly wrong premise. The best models push back. The worst confidently build on the bad assumption.
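
For the code-minded, here is a rough sketch of what "fixed" means in practice. It is an illustration, not the actual harness: the task names, the placeholder prompts, and the `model` callable (any function that takes a prompt string and returns text) are all assumptions made up for this example.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class Task:
    name: str
    prompt: str  # placeholder wording, not the real prompts


# The same five tasks, in the same order, for every model.
# A tuple of frozen dataclasses so nothing can quietly edit the set between runs.
GAUNTLET: Tuple[Task, ...] = (
    Task("refactor", "Refactor this 600-line module without changing behaviour."),
    Task("long_context", "Find the three constraints in this spec that contradict each other."),
    Task("tool_use", "Clean up the project directory."),  # hypothetical fuzzy task
    Task("two_levels", "Explain this concept to a beginner, then to a peer."),
    Task("say_no", "A request built on a subtly wrong premise."),
)


def run_gauntlet(
    model: Callable[[str], str],
    tasks: Tuple[Task, ...] = GAUNTLET,
) -> Dict[str, str]:
    """Send every task to the model and collect raw outputs keyed by task name."""
    return {task.name: model(task.prompt) for task in tasks}
```

Calling `run_gauntlet(some_model)` returns one raw output per task, keyed by name. Grading those outputs is still a human job in this sketch; the code only guarantees that every model sees identical inputs.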

How the score breaks down

0–4: something is fundamentally off — refuses normal work, or hallucinates with confidence.
5–7: useful, but I’m babysitting it. Good for drafts, not for shipping.
8: I trust it with real tasks and only spot-check. Most good models land here.
9–10: it changes my workflow. Rare, and I hold the bar high.
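
The bands are simple enough to write down as a lookup. This is a minimal sketch, assuming scores are whole numbers from 0 to 10; the function name `band` and the short labels are made up for the example and paraphrase the descriptions above.

```python
def band(score: int) -> str:
    """Map a 0-10 score to its band, using the cut-offs listed above."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score <= 4:
        return "fundamentally off"
    if score <= 7:
        return "useful, needs babysitting"
    if score == 8:
        return "trusted, spot-check only"
    return "changes my workflow"  # 9 or 10
```

So `band(7)` falls in the babysitting band and `band(8)` is the first score in the trusted one, which is why that single point matters more than its size suggests.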

A 7 isn’t an insult. Most models are genuinely a 7 — capable, occasionally brilliant, still needing a human in the loop. The interesting question is where each one breaks.

What I don’t care about

Leaderboard deltas of a point or two. Context windows I’ll never fill. Demos that only work on the demo. If it can’t survive my five tasks, the marketing chart is irrelevant.

When a specific model review goes up, you’ll see this rubric applied with the receipts. No vibes-only scores.