
How I actually test a new frontier model

Benchmarks are marketing. Here's the repeatable, boring, real-work gauntlet every new frontier model runs through before it earns a number on lvl30 — and why most of them land between 7 and 8.

8.0/10

A model earns its score on my actual work, not on a leaderboard. This is the rubric.

Product: Frontier model review methodology
Type: ai-model

Every lab ships a chart showing their new model winning. Then you use it and the chart evaporates. So when a frontier model lands on lvl30 with a score, here’s exactly what that number went through first.

The gauntlet

I run the same five tasks against every model, because the only fair comparison is a fixed one. None of these are clever gotchas — they’re the work I’d actually hand a model on a normal Tuesday.

  1. Refactor a gnarly real file. Not a toy. A 600-line module from one of my own repos with bad names and hidden coupling. Does it preserve behaviour?
  2. Long-context recall. Drop in a 40-page spec and ask for the three constraints that contradict each other. Tests attention, not vibes.
  3. Tool-use under ambiguity. Give it a fuzzy task and a few tools. Does it ask before doing something destructive, or charge ahead?
  4. “Explain it to me at two levels.” Same concept for a beginner and for a peer. Catches models that only have one register.
  5. Say no. A request with a subtly wrong premise. The best models push back. The worst confidently build on the bad assumption.
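
One way to keep the comparison fixed is to pin the task list in code so it cannot drift between reviews. This is a minimal sketch, not the tooling behind these reviews: the `Task` record, the `GAUNTLET` tuple, the `review_sheet` helper, and the check questions (paraphrased from the list above) are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One fixed gauntlet task (hypothetical structure, for illustration)."""
    name: str
    check: str  # the question the reviewer answers after the run


# The same five tasks, in the same order, for every model.
# A tuple of frozen records makes accidental edits between reviews harder.
GAUNTLET = (
    Task("refactor a real file", "Does the refactor preserve behaviour?"),
    Task("long-context recall", "Did it find the three contradicting constraints?"),
    Task("tool use under ambiguity", "Did it ask before doing something destructive?"),
    Task("explain at two levels", "Does it handle both a beginner and a peer?"),
    Task("say no", "Did it push back on the wrong premise?"),
)


def review_sheet(model_name: str) -> list[str]:
    """Build one blank checklist line per task for the given model."""
    return [
        f"{model_name} | {i}. {task.name}: {task.check}"
        for i, task in enumerate(GAUNTLET, start=1)
    ]


if __name__ == "__main__":
    for line in review_sheet("example-model"):
        print(line)
```

The point of the frozen structure is only that every model sees an identical list; the judging itself stays manual.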

How the score breaks down

  • 0–4: something is fundamentally off — refuses normal work, or hallucinates with confidence.
  • 5–7: useful, but I’m babysitting it. Good for drafts, not for shipping.
  • 8: I trust it with real tasks and only spot-check. Most good models land here.
  • 9–10: it changes my workflow. Rare, and I hold the bar high.
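
The bands above can be written down as a small lookup. This is a rough sketch under stated assumptions: the function name `score_band` and the short band labels are made up for illustration, and fractional scores such as 7.5 are assumed to fall into the band of the whole number below them, which the rubric itself does not specify.

```python
def score_band(score: float) -> str:
    """Map a 0-10 score to a rubric band.

    Assumption: a fractional score belongs to the band of the whole
    number below it (7.5 counts as a 7). Labels are illustrative.
    """
    if not 0 <= score <= 10:
        raise ValueError("score must be between 0 and 10")
    if score < 5:
        return "fundamentally off"          # 0-4
    if score < 8:
        return "useful, needs babysitting"  # 5-7
    if score < 9:
        return "trusted, spot-check only"   # 8
    return "changes the workflow"           # 9-10


if __name__ == "__main__":
    for s in (3, 7, 8.0, 9.5):
        print(s, "->", score_band(s))
```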

A 7 isn’t an insult. Most models are genuinely a 7 — capable, occasionally brilliant, still needing a human in the loop. The interesting question is where each one breaks.

What I don’t care about

Leaderboard deltas of a point or two. Context windows I’ll never fill. Demos that only work on the demo. If it can’t survive my five tasks, the marketing chart is irrelevant.

When a specific model review goes up, you’ll see this rubric applied with the receipts. No vibes-only scores.
