On March 5, 2026, OpenAI released a number that should make every knowledge worker pause and reconsider their relationship with software.
75%.
That's GPT-5.4's success rate on OSWorld-Verified, a benchmark that measures how well an AI can navigate a desktop computer using screenshots, keyboard input, and mouse clicks. The kind of work you do every day: opening applications, filling forms, moving files, managing spreadsheets.
The reported human performance on the same test? 72.4%.
For the first time, a general-purpose AI model operates a computer more reliably than the average person.
What GPT-5.4 Actually Is
OpenAI is calling it "our most capable and efficient frontier model for professional work," and for once, the marketing copy might be underselling it.
GPT-5.4 ships in three configurations. The standard version handles general tasks. GPT-5.4 Thinking adds extended chain-of-thought reasoning for complex problems. GPT-5.4 Pro targets the heaviest workloads and is reserved for ChatGPT Pro ($200/month) and Enterprise users.
But the configurations matter less than the capabilities. Here is what actually changed:
Native computer use. This is the headline feature. GPT-5.4 can control a computer directly through the API and Codex. It reads screenshots, issues mouse and keyboard commands, writes code to drive applications through libraries like Playwright, and completes multi-step workflows across different programs. This isn't a plugin or an add-on. It's built into the model's architecture.
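To make that concrete: when the model "writes code to drive applications," the output looks like ordinary automation scripts. Here is a hypothetical sketch of the kind of function a computer-use model might generate against Playwright's sync `Page` API; the URL, selectors, and form fields are invented for illustration:

```python
def fill_contact_form(page, name: str, email: str) -> None:
    """Drive a hypothetical contact form.

    `page` is expected to expose Playwright's sync Page API
    (`goto`, `fill`, `click`). URL and selectors are illustrative.
    """
    page.goto("https://example.com/contact")  # navigate to the form
    page.fill("#name", name)                  # type into the name field
    page.fill("#email", email)                # type into the email field
    page.click("button[type=submit]")         # submit the form
```

The point isn't this particular script; it's that the model can emit and execute code like this as one step inside a longer workflow, instead of relying on an external integration layer.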
1-million-token context window. More than double the 400,000-token limit of GPT-5.3. For developers working with large codebases or organizations processing massive document sets, this means keeping the full context in-window instead of relying on retrieval workarounds. One caveat: OpenAI charges double the standard rate per million tokens once input exceeds 272,000 tokens.
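The pricing caveat is worth modeling before you commit to long-context workloads. OpenAI hasn't published the exact billing formula in the material above, so this sketch assumes the literal reading: once input crosses 272,000 tokens, the entire input bills at double the base rate. The `base_rate_per_m` value is a placeholder, not a real price:

```python
def input_cost_usd(input_tokens: int, base_rate_per_m: float,
                   threshold: int = 272_000) -> float:
    """Estimate input cost under tiered long-context pricing.

    Assumption: the doubled rate applies to the WHOLE input once it
    exceeds `threshold`, creating a pricing cliff at the boundary.
    """
    rate = base_rate_per_m * 2 if input_tokens > threshold else base_rate_per_m
    return input_tokens * rate / 1_000_000
```

Note the cliff this implies: one token past the threshold nearly doubles the bill, which is exactly the kind of detail that decides whether a retrieval workaround is still cheaper than the big window.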
Tool Search. A structural rethinking of how the API handles tool calling. Previously, every request included full definitions for all available tools, a practice that burned through tokens as tool ecosystems grew. Now the model retrieves tool definitions on demand. In testing across 250 tasks using 36 MCP servers, this cut total token usage by 47% while maintaining the same accuracy.
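OpenAI hasn't documented Tool Search internals, but the core idea, keeping full tool definitions out of the prompt until they're actually needed, is easy to sketch. In this illustrative version, the registry holds lightweight descriptions, a search step matches the task against them, and only the winners' full schemas would be sent to the model:

```python
from dataclasses import dataclass, field

@dataclass
class Tool:
    name: str
    description: str                 # short text, cheap to search over
    schema: dict = field(default_factory=dict)  # full definition, sent only on match

REGISTRY = [
    Tool("create_invoice", "create a billing invoice"),
    Tool("send_email", "send an email message"),
    Tool("query_crm", "look up a customer record in the CRM"),
]

def search_tools(query: str, registry: list[Tool], limit: int = 2) -> list[Tool]:
    """Naive keyword-overlap retrieval; a stand-in for whatever OpenAI uses."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(t.description.lower().split())), t) for t in registry]
    return [t for score, t in sorted(scored, key=lambda p: -p[0]) if score > 0][:limit]
```

With three tools the savings are trivial; with hundreds of tools across dozens of MCP servers, only shipping the matched definitions is where the claimed 47% token reduction comes from.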
Why the OSWorld Number Matters More Than You Think
Let's put the 75% vs 72.4% comparison in context.
OSWorld-Verified isn't a trivia test. It measures real desktop tasks: opening applications, configuring settings, moving between programs, filling out forms, managing files. The kind of mundane, repetitive computer work that fills most office hours.
GPT-5.2 scored 47.3% on the same benchmark just months ago. That's a 27.7 percentage point jump in a single model generation. And it didn't just catch up to humans. It passed them.
This matters because computer use was supposed to be the hard problem. Language? AI solved that years ago. Code generation? Getting there. But the physical act of operating a graphical interface (reading screen elements, deciding what to click, recovering from errors) was considered deeply difficult.
Anthropic's Claude Computer Use, which launched in October 2024, was the first serious attempt. It worked, but it was clearly a beta: slow, sometimes confused by complex interfaces, and prone to the kind of mistakes that made you hesitant to let it run unsupervised.
GPT-5.4 suggests the gap between "impressive demo" and "reliable tool" is closing faster than anyone predicted.
The Spreadsheet Play Nobody Is Talking About
Buried beneath the computer use headlines is something equally significant for the enterprise market: ChatGPT for Excel and Google Sheets.
OpenAI launched a suite of integrations that embed GPT-5.4 directly into spreadsheet cells. Not as a sidebar assistant. Inside the actual spreadsheet, building formulas, running analysis, and updating financial models in the structures teams already use.
They also announced partnerships with FactSet, MSCI, Third Bridge, and Moody's to feed market data, company fundamentals, and research directly into ChatGPT workflows. And they introduced reusable "Skills" for recurring finance work: earnings previews, comparable company analysis, DCF modeling, and investment memo drafting.
The internal benchmark claim is hard to ignore. On an investment banking evaluation, model performance jumped from 43.7% with GPT-5 to 88% with GPT-5.4 Thinking.
This is a direct shot at Anthropic, which rolled out Claude for Finance with similar Excel integrations just weeks earlier. The AI companies are not competing on raw intelligence anymore. They are competing on who can most effectively replace the Excel jockey at a consulting firm.
What This Means for AI Agents
If you've been following the AI agent space, GPT-5.4 represents a shift from proof-of-concept to production infrastructure.
Consider what "native computer use" means for agent builders. Until now, giving an AI agent the ability to interact with software required layering additional tools, custom integrations, and brittle screen-reading pipelines on top of a language model. Each layer added complexity and potential failure points.
With GPT-5.4, computer interaction is a first-class capability. An agent built on this model can browse the web, fill out a CRM, update a spreadsheet, and draft a report without needing separate tools for each step. The model handles the translation from intent to mouse click internally.
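That "intent to mouse click" translation is usually structured as an observe-act loop: screenshot in, action out, repeat until done. Here is a minimal, model-agnostic sketch of that loop; `model_step` stands in for a computer-use model call, and the action vocabulary is invented for illustration:

```python
from typing import Callable

# An action is a plain dict, e.g. {"type": "click", "x": 120, "y": 340},
# {"type": "type", "text": "hello"}, or {"type": "done"}.
Action = dict

def run_agent(goal: str,
              take_screenshot: Callable[[], bytes],
              model_step: Callable[[str, bytes], Action],
              execute: Callable[[Action], None],
              max_steps: int = 20) -> bool:
    """Generic observe-act loop for a computer-use agent.

    Returns True if the model signals completion within `max_steps`.
    """
    for _ in range(max_steps):
        action = model_step(goal, take_screenshot())  # observe, then decide
        if action["type"] == "done":
            return True
        execute(action)                               # click, type, scroll, ...
    return False  # budget exhausted without a "done" signal
```

The architectural shift is that with native computer use, `model_step` is a single model call rather than a pipeline of screen parsers, planners, and action translators bolted together.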
The 1-million-token context window compounds this advantage. An agent handling a complex workflow (say, researching a market opportunity across multiple sources and producing a formatted report) no longer needs to chunk its work into separate, context-limited steps. It can hold the entire task in memory.
And the Tool Search system addresses what has been a growing pain point in agentic AI: tool sprawl. As agents connect to more services and APIs, the overhead of managing tool definitions was eating into performance and cost. Reducing token usage by 47% on tool-heavy tasks isn't just an efficiency win. It makes certain agent architectures financially viable for the first time.
The Honest Limitations
Let's be clear about what GPT-5.4 does not solve.
First, the APEX-Agents benchmark context. Mercor's professional task evaluation, which GPT-5.4 now leads, was designed to test sustained professional work across law, finance, and consulting. When the benchmark launched in January, the best models completed fewer than 25% of tasks on the first attempt. With eight tries, the ceiling was around 40%. Mercor's own CEO described the state of play: "Right now it's fair to say it's like an intern that gets it right a quarter of the time."
GPT-5.4 tops that leaderboard, but topping a leaderboard where no model performs reliably is different from being production-ready for high-stakes professional work.
Second, the hallucination improvements are relative, not absolute. Claims are 33% less likely to contain errors compared to GPT-5.2, and responses are 18% less likely to have any errors at all. Those are meaningful gains. But "less likely to be wrong" is not "reliable enough to trust without verification." For workflows where accuracy matters, human oversight is still essential.
Third, cost. The 1-million-token context window sounds impressive until you hit the 272,000-token threshold where pricing doubles. Google's Gemini 3.1 Pro offers a 2-million-token context at a lower base price. For organizations processing large document sets, the math matters.
And fourth, the competitive picture is messier than OpenAI's launch post suggests. Anthropic's Claude continues to lead on certain reasoning and safety benchmarks. Google's Gemini models offer competitive performance at lower price points. The "best model" depends entirely on your specific use case, and that's been true for over a year now.
What Happens Next
The speed of improvement is the real story. GPT-5.2 to GPT-5.4 in the span of months, with a 27.7 percentage point jump on desktop navigation. If that trajectory holds, the conversation shifts from "can AI agents do this?" to "how fast do we need to adapt?"
For developers, the immediate action is clear: test GPT-5.4's computer use capabilities against your actual workflows, not benchmark tasks. The gap between benchmark performance and real-world reliability is where most agent projects fail.
For businesses, the spreadsheet integrations deserve serious evaluation. If GPT-5.4 Thinking really does perform at 88% on investment banking tasks, the ROI calculation for AI agent adoption just changed significantly.
For everyone else, the OSWorld number is worth sitting with. An AI that operates your computer better than you do isn't a party trick. It's the foundation of a new category of software, one where you describe what you want done and an agent handles the clicking, typing, and navigating.
We crossed a line this week. The question is no longer whether AI agents can use computers effectively. It's what we do with that fact.