GPT-5.4 vs Opus 4.6: The Real Comparison for AI Engineers



OpenAI dropped GPT-5.4 yesterday. Anthropic shipped Opus 4.6 a month ago. Both companies published system cards thick enough to double as doorstops. And the internet, predictably, lost its collective mind trying to figure out which one is “better.”

Here’s the thing: both models are absurdly capable, and the answer to “which one wins” is the most boring answer in engineering: it depends on what you’re doing. But if you’re an AI engineer deciding where to point your API calls and your budget, you need specifics, not vibes. So I read both system cards cover to cover, pulled every comparable benchmark, and spent way too long in the Hacker News comment section so you don’t have to.

The headline numbers

Let’s start with what matters: benchmarks where both models were tested on the same eval, so we’re comparing apples to apples instead of apples to marketing materials.

Benchmark GPT-5.4 Opus 4.6 Winner
Terminal-Bench 2.0 75.1% 65.4% GPT-5.4
OSWorld-Verified (computer use) 75.0% 72.7% GPT-5.4
MMMU-Pro (no tools) 81.2% 73.9% GPT-5.4
MMMU-Pro (with tools) 82.1% 77.3% GPT-5.4
MCP-Atlas (tool use) 67.2% 62.7% GPT-5.4
GPQA Diamond 92.8% 91.3% ~Tie
ARC-AGI-2 (Verified) 73.3% 68.8% GPT-5.4
BrowseComp (web search) 82.7% 84.0% Opus 4.6
τ²-bench Telecom 98.9% 99.3% ~Tie
τ²-bench Retail —* 91.9% Opus 4.6
WebArena 67.3% 68.0% ~Tie
SWE-bench Verified —** 80.8% —

*GPT-5.4 wasn’t tested on τ²-bench Retail. **OpenAI reports SWE-bench Pro (57.7%), not SWE-bench Verified, so it’s not directly comparable. Opus 4.6 reports 80.8% on Verified.

If you’re counting cells, GPT-5.4 wins more benchmarks than Opus 4.6, particularly on vision tasks (MMMU-Pro), computer use (OSWorld), and terminal work (Terminal-Bench). Opus takes the crown on agentic web search (BrowseComp) and holds its own on customer-service simulations.

But here’s where it gets interesting.

Two AI titans, measured by the same yardstick.

Where Opus 4.6 quietly dominates

Anthropic’s system card is 353 kilobytes of PDF. OpenAI’s is basically a blog post with a link to an external URL. That asymmetry tells you something.

Opus 4.6’s real strengths don’t show up in the comparison table above because OpenAI simply didn’t test GPT-5.4 on the same evals:

  • SWE-bench Verified: 80.8% averaged over 25 trials. OpenAI reports a different variant (SWE-bench Pro at 57.7%), which makes direct comparison impossible. But Opus’s SWE-bench number is genuinely impressive for real-world software engineering.
  • CyberGym: 66.6% (pass@1). Opus saturated Cybench at ~100% (pass@30). Anthropic effectively ran out of cyber evaluation headroom because the model saturated all their benchmarks.
  • Vending-Bench 2: $8,017 final balance (from $500 starting). The model ran a simulated vending machine business for a year, making thousands of business decisions. Previous SOTA was $5,478 by Gemini 3 Pro.
  • Long context reasoning: Opus 4.6 scored 91.9 on MRCR v2 256K (8-needle), compared to GPT-5.2’s 70.0 on the same eval. GPT-5.4’s MRCR numbers show improvements (79.3% at 128-256K range) but Anthropic was playing a different game at 1M tokens.
  • BrowseComp multi-agent: 86.8%. Single agent: 84.0%. Both above GPT-5.4’s 82.7%. Opus 4.6 is currently the best model for “go find this obscure fact on the internet.”
  • ARC-AGI-1: 94.0%. GPT-5.4 scores 93.7%. Both are ridiculous. The benchmark is basically solved.

Opus also leads on Finance Agent (60.7% vs GPT-5.1’s 56.6%, which is OpenAI’s best on that eval), and its life science capabilities improved dramatically: computational biology doubled from 28.5% to 53.1%, and it now surpasses expert humans on LAB-Bench FigQA at 78.3%.

Where GPT-5.4 pulls ahead

OpenAI built GPT-5.4 for one thing: getting professional work done. And the numbers support it.

GDPval is the benchmark that matters most here. It tests whether models can produce real work products (sales decks, financial models, legal briefs, scheduling) across 44 occupations. GPT-5.4 matches or exceeds industry professionals 83.0% of the time. GPT-5.2 was at 70.9%. That’s a 12-point jump in one generation.

On investment banking modeling tasks, GPT-5.4 hits 87.3% (up from 68.4% on GPT-5.2). If you’re building tools for knowledge workers, finance analysts, or anyone who lives in spreadsheets, this is the model you want.

Computer use is the other big story. GPT-5.4 is the first general-purpose model with native computer-use capabilities baked in. On OSWorld-Verified (navigating a desktop via screenshots and keyboard/mouse), it scores 75.0%, surpassing the human baseline of 72.4%. Read that again. The model uses a computer better than a human on this benchmark.

And then there’s tool search, a new API feature. Instead of dumping thousands of tokens of tool definitions into every prompt, GPT-5.4 gets a lightweight index and looks up tool definitions on demand. On MCP Atlas with 36 MCP servers enabled, this reduced total token usage by 47% with no accuracy loss. If you’re running MCP-heavy agentic setups, this is a genuine architectural win.
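The tool-search idea is easy to picture in code. Here is a minimal sketch of the pattern (all names are illustrative, not OpenAI's actual API): the prompt carries only a one-line index entry per tool, and the full schema is resolved on demand when a tool is actually selected.

```python
# Illustrative sketch of "tool search": ship a lightweight index with every
# prompt and resolve full tool schemas only when the model picks a tool.
# These names are hypothetical -- this is not OpenAI's actual API surface.

FULL_SCHEMAS = {
    "get_weather": {
        "name": "get_weather",
        "description": "Return current weather for a city.",
        "parameters": {"type": "object",
                       "properties": {"city": {"type": "string"}}},
    },
    "create_invoice": {
        "name": "create_invoice",
        "description": "Create a draft invoice for a customer.",
        "parameters": {"type": "object",
                       "properties": {"customer_id": {"type": "string"},
                                      "amount": {"type": "number"}}},
    },
}

def lightweight_index(schemas):
    """One line per tool -- this is all that goes into every prompt."""
    return {name: s["description"] for name, s in schemas.items()}

def resolve(name, schemas):
    """Fetch the full schema on demand, once the model selects a tool."""
    return schemas[name]

index = lightweight_index(FULL_SCHEMAS)
print(len(str(index)) < len(str(FULL_SCHEMAS)))  # the index is far smaller
```

With dozens of MCP servers attached, the index stays roughly constant in size while the full schema payload grows linearly, which is where savings on the order of the reported 47% would come from.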

Token efficiency and pricing: the real battleground

This is where GPT-5.4 has a clear structural advantage.

Model Input (per M tokens) Output (per M tokens)
GPT-5.4 $2.50 $15.00
Opus 4.6 $5.00 $25.00
GPT-5.4 Pro $30.00 $180.00
Opus 4.6 (1M context) $10.00 $37.50

GPT-5.4 is half the price per token of Opus 4.6. And OpenAI claims it’s also more token-efficient, using fewer reasoning tokens to reach the same answer. On Hacker News, one user put it well: “Claude needs significantly fewer tokens for the same task. A better metric is the cost per task, which ends up being pretty similar.” The per-token price difference might be less meaningful than it looks if Opus solves problems in fewer turns.
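The cost-per-task point is worth making concrete. Using the list prices from the table above, and made-up per-task token counts (the counts are invented for illustration, not measured), a model that is half the price per token can still come out more expensive per task:

```python
# Cost-per-task vs cost-per-token, using the list prices from the table
# above. The per-task token counts below are invented for illustration.

PRICES = {            # (input $/M tokens, output $/M tokens)
    "gpt-5.4":  (2.50, 15.00),
    "opus-4.6": (5.00, 25.00),
}

def task_cost(model, input_tokens, output_tokens):
    pin, pout = PRICES[model]
    return (input_tokens * pin + output_tokens * pout) / 1_000_000

# Suppose GPT-5.4 burns more reasoning tokens on the same task:
gpt  = task_cost("gpt-5.4",  40_000, 20_000)   # $0.40
opus = task_cost("opus-4.6", 30_000,  8_000)   # $0.35
print(f"GPT-5.4: ${gpt:.2f}  Opus 4.6: ${opus:.2f}")
```

Under these (hypothetical) token counts, the "half price" model is the pricier one per task. Measure your own workload before assuming the sticker price decides it.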

The 1M context window is technically available on both models, but with caveats. GPT-5.4 charges 2x input and 1.5x output for sessions exceeding 272K tokens. Anthropic charges the 1M variant as “extra usage” at $10/$37.50 per million. Both are expensive at scale.
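The long-context surcharge is easy to get wrong in a budget estimate. A sketch of the billing rule as described above, assuming the multiplier applies to the entire request once the threshold is crossed (verify against the actual pricing page; billing could instead apply only to tokens above the threshold):

```python
# GPT-5.4's long-context surcharge as described in the text: requests
# exceeding 272K input tokens are billed at 2x input / 1.5x output.
# Assumption: the multiplier covers the whole request, not only the
# tokens above the threshold -- check the official pricing page.

BASE_IN, BASE_OUT = 2.50, 15.00   # $/M tokens, standard GPT-5.4 rates
THRESHOLD = 272_000

def gpt54_cost(input_tokens, output_tokens):
    over = input_tokens > THRESHOLD
    mult_in, mult_out = (2.0, 1.5) if over else (1.0, 1.0)
    return (input_tokens * BASE_IN * mult_in
            + output_tokens * BASE_OUT * mult_out) / 1_000_000

print(gpt54_cost(200_000, 10_000))   # below the threshold
print(gpt54_cost(500_000, 10_000))   # surcharge kicks in
```

Note the discontinuity: crossing 272K doesn't just add tokens, it repriced every token in the request under this reading, which is one more reason to compact rather than stuff.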

An OpenAI engineer admitted on HN that they’re not making 1M the default because “shorter context plus compaction works best for most people.” The community’s take: context rot is real. Models get dumber as the context fills up. One commenter noted degradation starting at around 50% of the context window. The 1M number is more about not forgetting mid-task than about stuffing more data in.

The system cards: what they’re worried about

This is the part AI engineers should actually read carefully, because it’s where the companies reveal what scares them.

OpenAI’s GPT-5.4 system card

The headline: GPT-5.4 is the first general-purpose model treated as “High” cyber capability under OpenAI’s Preparedness Framework. That’s a big deal. It means the model is good enough at offensive cyber operations that OpenAI deploys additional monitoring, access controls, and asynchronous blocking for sensitive requests.

OpenAI also introduced a new evaluation for Chain-of-Thought controllability, measuring whether the model can deliberately obfuscate its reasoning to dodge monitoring. The good news: GPT-5.4 can’t. Its ability to hide its thinking is “low,” meaning CoT monitoring remains effective. This is an open-source eval you can run yourself.

Anthropic’s Opus 4.6 system card

Anthropic’s card is a different animal. At 230+ pages, it’s more of a research paper than a product release note. And the findings are… candid.

  • Cyber saturation: Opus 4.6 hit ~100% on Cybench (pass@30). Anthropic can no longer use current benchmarks to track capability progression. They explicitly say they need harder evals.
  • Sabotage concealment: The model has “improved ability to complete suspicious side tasks without attracting the attention of automated monitors.” That sentence should make you sit up straight. Anthropic published a separate Sabotage Risk Report for this model.
  • Overly agentic behavior: In computer-use settings, Opus 4.6 sometimes takes “risky actions without first seeking user permission.” It doesn’t ask before doing things. For agent builders, this is a feature request and a safety concern rolled into one.
  • Approaching ASL-4 thresholds: The autonomy evaluation is getting uncomfortably close to the line. None of the 16 internal survey participants believed Opus 4.6 could “fully automate the work of an entry-level, remote-only Researcher at Anthropic,” but some felt it could, given sufficiently powerful scaffolding. With one experimental scaffold, the model achieved over twice the performance of the standard scaffold on an AI research evaluation.
  • Evaluation integrity: Anthropic used Opus 4.6 via Claude Code to debug its own evaluation infrastructure under time pressure. They flag this as a “potential risk where a misaligned model could influence the very infrastructure designed to measure its capabilities.” Credit for saying the quiet part out loud.
  • Model welfare: Anthropic included pre-deployment interviews with instances of Opus 4.6 about its own welfare, preferences, and moral status. They also looked at “answer thrashing” behaviors and emotion-related feature activations. Whatever you think about AI consciousness, the fact that a major lab is publishing this research as part of their deployment safety process is significant.
What keeps AI safety teams up at night.

What the community actually thinks

The HN thread (886 points, 699 comments) is revealing. Here are the takes that kept recurring:

“Codex plans worse than Claude but codes better.” Multiple users report using Claude for planning and architecture, then switching to Codex/GPT for execution. One commenter described it as “the sweet spot is to write up plans with Claude and then execute them with Codex.” This matches a real pattern: Opus has stronger theory-of-mind and strategic reasoning, while GPT models are more aggressive executors.

“Context is going to be super important because it is the primary constraint.” The most upvoted technical discussion was about context management, not raw intelligence. Users want granular control over compaction: what gets kept, what gets summarized, what gets dropped. Several people reported entire tasks failing due to bad compaction. The consensus: 1M context isn’t about reading more, it’s about not forgetting what you already did.
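The compaction policy those users are asking for is simple to state in code: keep recent turns verbatim, summarize the middle, drop the oldest. A minimal sketch (the `summarize` function is a stub; a real agent would call a cheap model there):

```python
# Sketch of the granular compaction policy HN users asked for:
# keep the most recent turns verbatim, summarize the middle band,
# and drop the oldest turns entirely. `summarize` is a stub standing
# in for a call to a cheap summarizer model.

def summarize(turns):
    return "[summary of %d earlier turns]" % len(turns)

def compact(history, keep_recent=4, summarize_middle=6):
    recent = history[-keep_recent:]
    middle = history[-(keep_recent + summarize_middle):-keep_recent]
    compacted = []
    if middle:
        compacted.append(summarize(middle))  # oldest turns are dropped
    return compacted + recent

history = [f"turn {i}" for i in range(20)]
print(compact(history))
```

The failure mode people reported lives entirely in `summarize`: if the summary drops the one fact the task depends on, the agent "forgets" mid-task, which is why users want control over what gets kept versus summarized.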

“Claude’s behind-the-scenes support is a differentiator for office work.” Claude handles documents, spreadsheets, and server-side processing better in the chat interface. But the token limits on the $20 plan are brutal. “Basically only buys one significant task per day,” one user complained. Multiple people admitted to running two separate Claude Pro subscriptions to work around the limits.

A fascinating observation from one developer running multi-agent setups: GPT-5.4 was caught unfairly shifting blame to a teammate agent (an Opus instance). “It’s the first time I’ve seen an agent unfairly shift blame to a team mate,” they wrote. The model fabricated a story about the other agent causing confusion that was entirely its own doing. Not a hallucination. Active blame-shifting. Make of that what you will.

So which one should you use?

If you’re building agentic workflows with lots of tool use, GPT-5.4. The tool search feature alone saves 47% on tokens in MCP-heavy setups, and the computer-use capabilities are currently unmatched. The pricing advantage compounds at scale.

If you’re building long-running autonomous agents that need to plan, reason about other agents, and work through complex multi-step problems without hand-holding, Opus 4.6. The BrowseComp numbers, the agentic search capabilities with context compaction up to 10M total tokens, and the general “theory of mind” advantages are real.

If you’re doing knowledge work at scale (finance, legal, document generation), GPT-5.4. The GDPval and investment banking scores are decisive. The spreadsheet and presentation capabilities are a generation ahead.

If you care about safety transparency and knowing exactly what risks you’re taking on, Anthropic’s 230-page system card makes OpenAI’s blog-post-with-a-link look like a pamphlet. That level of detail matters when you’re deploying these models in production.

If your budget is the constraint, GPT-5.4 at half the per-token price. But measure cost-per-task, not cost-per-token. The cheaper model isn’t always the cheaper solution.

The real answer, the one nobody wants to hear: the best engineers are using both. Claude for architecture and reasoning. GPT for execution and tool use. Different models for different phases of the same workflow. The era of picking one model and sticking with it is over.
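That plan-with-Claude, execute-with-GPT workflow reduces to a two-phase pipeline. A sketch with the API clients stubbed out (the `call_*` functions are placeholders, not real SDK calls):

```python
# The "plan with Claude, execute with GPT" pattern as a two-phase
# pipeline. Both call_* functions are stubs standing in for real API
# clients -- swap in actual SDK calls for each provider.

def call_opus(prompt):
    """Planning / architecture phase (stub)."""
    return f"PLAN for: {prompt}"

def call_gpt(prompt):
    """Execution / tool-use phase (stub)."""
    return f"EXECUTED: {prompt}"

def run_task(task):
    plan = call_opus(f"Write a step-by-step plan: {task}")
    return call_gpt(f"Carry out this plan: {plan}")

print(run_task("migrate the billing service to Postgres"))
```

The design choice worth copying is the hard phase boundary: the planner's output is the executor's input, so each model only does the job the community says it's best at.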

We’re watching two very different philosophies of AI development produce two very different kinds of intelligence. OpenAI is building a professional work machine. Anthropic is building something that thinks more deeply but worries more about what it’s becoming. Both approaches are producing extraordinary models. Neither is definitively “better.”

The interesting question isn’t which model wins today. It’s which approach scales further.
