On March 11, 2026, OpenAI released GPT-5.4 with a feature worth highlighting: native capability to control a computer. No third-party plugin, no fragile overlay. The model can see your screen, click, type, and navigate between applications. And on the OSWorld benchmark, it completes 75% of tasks successfully, where an expert human tops out at 72.4%.
Before we get carried away, let’s get the basics straight. A benchmark is a lab test. 75% also means 25% failure. The useful question isn’t “is it perfect?” but “is it better than what I have today for automating repetitive tasks?” Often, the answer is yes.
What Has Changed With GPT-5.4
GPT-5.2 already handled text, code, and images. But to operate a graphical interface, you had to rely on external tools, manually script interactions, and maintain fragile connectors. GPT-5.4 brings Computer Use directly into the model.
The difference with Codex is clear. Codex generates code. GPT-5.4 executes actions in a real environment: open an application, fill out a CERFA form, extract data from a protected Excel spreadsheet, click “Confirm” in an online banking portal.
It’s the difference between a chef who writes a recipe and a chef who cooks the dish.
GPT-5.4 marks the shift from AI that advises to AI that takes action. It’s a leap in kind, not just degree.
The other major technical innovation: Tool Search. The model dynamically picks the tools it needs for each step, instead of loading the entire environment for every request.
The result: 47% fewer tokens consumed on complex workflows. For an automation that runs 10,000 times a month, that's meaningful on your bill.
How the 4-Step Action Cycle Works
The process follows a simple, repeating loop. GPT-5.4 takes a screenshot of the current interface state. It analyzes what it sees: position of elements, texts, available buttons.
It generates an action (a click at specific coordinates, typed text, a scroll). It captures the new state and starts over.
This perceive-decide-act-observe loop runs until the task is complete or a fallback is triggered.
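The loop above can be sketched in a few lines. This is a minimal illustration, not the real OpenAI SDK: `capture_screen` and `ask_model` are stand-ins for actual screenshot capture and an actual API call.

```python
# Sketch of the perceive-decide-act-observe loop.
# capture_screen() and ask_model() are illustrative stubs, not real APIs.

def capture_screen(state):
    # In production: grab a real screenshot of the desktop or browser.
    return f"screen showing step {state['step']}"

def ask_model(screenshot, goal):
    # In production: send the screenshot + goal to the model and get
    # back a structured action (click, type, scroll, done).
    return {"type": "done"} if "step 3" in screenshot else {"type": "click"}

def run_task(goal, max_steps=10):
    state = {"step": 0}
    for _ in range(max_steps):
        shot = capture_screen(state)       # 1. perceive
        action = ask_model(shot, goal)     # 2. decide
        if action["type"] == "done":
            return "complete"
        state["step"] += 1                 # 3. act, 4. observe new state
    return "fallback"                      # loop budget exhausted

print(run_task("update the price sheet"))  # → complete
```

The `max_steps` budget is the important design detail: without it, a confused agent can loop on the same screen indefinitely.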
For developers deploying this: Docker is still the recommended infrastructure to isolate the execution environment. Playwright is used as the browser automation layer. The OpenAI API orchestrates it all.
The 1 million token context makes a real difference here: the model can keep the entirety of a long workflow in memory without losing track.
For example, extracting data from 50 different vendors—memorizing the navigation patterns for each site—remains consistent from start to finish.
5 Use Cases That Work
1. Multi-source Price Extraction
A buyer typically spends 2.5 hours a week visiting suppliers’ portals to update a price comparison spreadsheet.
GPT-5.4 does the same job in 8 minutes: browsing each site, identifying price columns, copying them into a structured spreadsheet. The gains are evident from week one.
2. Health Insurance Reimbursement Claim
A typical French scenario: an HR department needs to submit reimbursement requests on Ameli for employees. The portal has no API. GPT-5.4 navigates the site, fills out the forms using provided data, submits them, and captures the confirmation.
With confirmation policies (human validation before sending), the loop remains under control.
3. Accelerated Financial Analysis
An analyst used to spend 30 minutes aggregating data from 4 sources (internal Excel, Bloomberg terminal, a banking portal, an annual report PDF) before writing their summary.
With the GPT-5.4 workflow, it now takes just 2 minutes. The analyst reviews, corrects if needed, and signs off. They keep responsibility; the AI just handles the data gathering.
4. Customer Support Debugging with Screenshot Analysis
A support agent receives a picture of an error screen. GPT-5.4 reads the image, identifies the error code, searches the internal knowledge base, finds the resolution procedure, and opens a ticket in Zendesk with the relevant context pre-filled.
The agent verifies and sends. What used to take 12 minutes now takes just 90 seconds.
5. GDPR Compliance on Client Portals
In the banking and insurance sectors in France, compliance teams manually check that legal notices are properly displayed on hundreds of partner portal pages.
GPT-5.4 navigates, captures, compares against a compliance checklist, and generates an anomaly report.
Work that took several days is reduced to just a few hours of automated processing.
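A toy version of that compliance sweep: check each captured page against a checklist of required legal notices. The page contents are hard-coded here; in practice they would come from the automated crawl.

```python
# Toy compliance audit: flag pages missing any required legal notice.
# Page texts are hard-coded samples; a real run would use crawled content.

CHECKLIST = ["mentions légales", "politique de confidentialité", "cookies"]

def audit(pages):
    report = {}
    for url, text in pages.items():
        missing = [item for item in CHECKLIST if item not in text.lower()]
        if missing:
            report[url] = missing      # only anomalies go in the report
    return report

pages = {
    "https://partner-a.example/accueil": "Mentions légales | Cookies",
    "https://partner-b.example/accueil": "Mentions légales | Politique de confidentialité | Cookies",
}
print(audit(pages))  # partner-a is missing the privacy policy notice
```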
GPT-5.4 vs Established RPA (UiPath, Make, Zapier)
An honest comparison shows GPT-5.4 and classic RPA tools tackle different problems, but the overlap is growing.
On a structured, stable process, UiPath or Automation Anywhere is unbeatable for reliability.
These tools use precise selectors: if the interface doesn’t change, they’re flawless.
The problem: deployment for a custom process takes 2 to 4 weeks, they break with every UI update, and they’re blind to unscripted interfaces.
GPT-5.4 adapts visually. If a button changes color or moves, the model finds it again because it “reads” the interface like a human.
Initial deployment takes 2 to 5 days. The tradeoff: the failure rate hovers around 25% on complex interfaces.
| Criterion | GPT-5.4 Computer Use | UiPath/Classic RPA | Make/Zapier |
|---|---|---|---|
| Deployment time | 2–5 days | 2–4 weeks | 1–2 days (if API available) |
| Resistance to UI changes | Good (visual) | Low (selectors) | N/A (API) |
| Works without API | Yes | Yes | No |
| Success rate | 75% (OSWorld) | 95%+ (if stable) | 99%+ (if API stable) |
| Setup cost | Low | High | Low to medium |
The OpenAI Operator vs Anthropic Computer Use comparison covers more aspects of this rivalry if you want to dive deeper into agent architecture choices.

GPT-5.4 vs Claude Sonnet 4.6
Claude is not GPT-5.4’s only competitor, but it’s the most serious one for Computer Use. Claude Sonnet 4.6 scores 72.5% on OSWorld, 2.5 points lower. On SWE-Bench Verified (real-world code bug resolution), Claude Opus 4.6 scores 80.8% against 77.2% for GPT-5.4.
Practical takeaway: GPT-5.4 is stronger at operating GUIs. Claude retains an advantage in multi-file refactoring and multi-agent architectures.
GPT-5.4’s starting price ($0.50/million tokens input) remains competitive at scale, especially when combined with Tool Search, which reduces real consumption.
For teams evaluating both: test on your actual use case. Benchmarks show a direction, not a guarantee.
Security, GDPR and Fallback Handling
OpenAI classifies GPT-5.4 at “High” for its internal cybersecurity assessment. The Chain-of-Thought is auditable with a low rate of obfuscation, meaning each model decision can be reviewed step by step.
This is a strong argument for security teams.
For GDPR compliance: confirmation policies allow you to define actions that require human approval before execution.
Sending an email, submitting a form with personal data, making a payment: these actions can be paused until manual approval is given.
The workflow remains automated except at control points you define.
An AI agent that can be stopped at any time is infinitely more deployable than one that rushes ahead. Confirmation policies are GPT-5.4’s real security feature.
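A confirmation policy reduces to a simple gate: actions on a blocklist pause until a human approves. The policy set and the `approve` hook below are assumptions for illustration, not the real OpenAI configuration surface.

```python
# Illustrative confirmation gate. REQUIRES_APPROVAL and approve()
# are assumptions, not the actual OpenAI policy API.

REQUIRES_APPROVAL = {"submit_form", "send_email", "make_payment"}

def execute(action, approve):
    if action["type"] in REQUIRES_APPROVAL:
        if not approve(action):        # e.g. a webhook into your own tooling
            return "rejected"
    return "executed"

# A reviewer who approves form submissions only:
reviewer = lambda a: a["type"] == "submit_form"
print(execute({"type": "submit_form"}, reviewer))   # → executed
print(execute({"type": "make_payment"}, reviewer))  # → rejected
print(execute({"type": "scroll"}, reviewer))        # → executed
```

Note that non-sensitive actions like scrolling never hit the gate: the workflow stays automated everywhere except the control points.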
About failures: a 25% failure rate is unacceptable without a fallback strategy. Three patterns work in production:
- Retry with extended reasoning (xhigh parameter): the model tries again with deeper analysis. This solves about half of simple failures.
- Human escalation via webhook: when the AI gets stuck (CAPTCHA, unrecognized interface, repeated error), it notifies an operator to take over.
- Screenshot logs: every step is archived so you can audit and debug after execution.
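The first two patterns combine naturally into a retry-then-escalate wrapper. A minimal sketch, where `run_step` and `notify_operator` are illustrative stand-ins for your own step runner and webhook:

```python
# Retry-then-escalate fallback: one normal attempt, one retry with
# extended reasoning, then a webhook to a human operator.
# run_step() and notify_operator() are illustrative stand-ins.

def run_with_fallback(step, run_step, notify_operator):
    for effort in ("default", "xhigh"):   # retry with extended reasoning
        try:
            return run_step(step, effort=effort)
        except RuntimeError:
            continue
    notify_operator(step)                 # human escalation via webhook
    return "escalated"

# Simulate a step that only succeeds with extended reasoning:
def flaky(step, effort):
    if effort != "xhigh":
        raise RuntimeError("element not found")
    return "ok"

print(run_with_fallback("click confirm", flaky, print))  # → ok
```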
Sustained failures: image CAPTCHAs, interfaces with very small or blurry text, multi-monitor setups (not natively supported). These limits are documented, not hidden.
Security and Isolation: Why Docker Is Essential
Giving an AI control of a browser without isolation opens up an attack surface.
A Docker container with a headless browser (Chrome or Firefox via Playwright) contains the damage if the model performs an unintended action: no access to the host file system, no unwanted session persistence, container destroyed after each run if needed.
For businesses under strict regulatory obligations (banking, health, insurance), this isolation is usually non-negotiable. The implementation of OpenAI AI agents in enterprise covers recommended isolation architectures for these sectors.
About sensitive data: screenshots sent to the OpenAI API travel through OpenAI servers.
For data subject to professional secrecy or data localization requirements, you need either a data processing agreement with OpenAI or an on-premises solution with a local model. GPT-5.4 does not yet offer a deployable local version.
Run a Pilot in 4 Steps
Step 1: Choose the right process. Look for tasks with more than 5 hours of copy-paste per week, little variation in the data handled, and zero added value from manual work.
Exporting data between portals, updating tracking spreadsheets, repetitive compliance checks: good candidates.
Step 2: Set up the isolated environment. Docker container + Playwright + OpenAI API key. Two days for a developer familiar with the tools.
Add screenshot logs from the start: you’ll need them for debugging.
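A minimal step logger is enough to start with: one folder per run, one image plus one JSON record per step. The screenshot bytes below are fake placeholders standing in for real captures.

```python
import json
import pathlib
import time

# Minimal audit trail: step_NNN.png + an append-only trace.jsonl per run.
# The screenshot bytes here are placeholders, not real PNG captures.

def log_step(run_dir, step_no, action, screenshot_bytes):
    run_dir = pathlib.Path(run_dir)
    run_dir.mkdir(parents=True, exist_ok=True)
    (run_dir / f"step_{step_no:03d}.png").write_bytes(screenshot_bytes)
    record = {"step": step_no, "action": action, "ts": time.time()}
    with open(run_dir / "trace.jsonl", "a") as f:
        f.write(json.dumps(record) + "\n")

log_step("runs/demo", 1, {"type": "click", "x": 120, "y": 310}, b"\x89PNG")
log_step("runs/demo", 2, {"type": "type", "text": "42"}, b"\x89PNG")
```

The JSONL trace is what makes post-mortem debugging cheap: you can replay exactly which action was taken on which screen, in order.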
Step 3: Define confirmation policies. Before any deployment, list the irreversible actions and set up human validation points.
Form submission, file uploads, actions on real accounts: all these must be routed through a human in the pilot phase.
Step 4: Measure over 2 weeks. Success rate by task type, time saved, actual API cost (helped by Tool Search), number of human fallbacks triggered.
If the ROI is visible after two weeks of pilot, it’s easy to make the case for expansion.
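The four pilot metrics reduce to a small scorecard over your run records. A sketch with hand-written sample records (the record shape is an assumption, shaped however your logger emits them):

```python
# Pilot scorecard: two weeks of run records reduced to the four
# numbers that matter. Records below are hand-written samples.

runs = [
    {"task": "price_export", "ok": True,  "seconds_saved": 540, "usd": 0.03, "fallback": False},
    {"task": "price_export", "ok": False, "seconds_saved": 0,   "usd": 0.02, "fallback": True},
    {"task": "gdpr_check",   "ok": True,  "seconds_saved": 900, "usd": 0.05, "fallback": False},
]

success_rate = sum(r["ok"] for r in runs) / len(runs)
hours_saved = sum(r["seconds_saved"] for r in runs) / 3600
api_cost = sum(r["usd"] for r in runs)
fallbacks = sum(r["fallback"] for r in runs)

print(f"{success_rate:.0%} success, {hours_saved:.2f}h saved, "
      f"${api_cost:.2f} API cost, {fallbacks} fallback(s)")
```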

Honest Limitations
GPT-5.4 fails 1 in 4 tasks under benchmark conditions. In production, on less standardized interfaces than in the lab, this number can go up. CAPTCHAs block the model.
Poorly scanned PDF files are problematic. Workflows requiring multi-factor authentication every step are hard to automate without human help.
The model has no persistent memory between sessions unless you build infrastructure for that. If your process needs to resume where it left off yesterday, you’ll need to manage state manually.
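Managing that state manually can be as simple as a JSON file the workflow reads on startup and writes after each unit of work. A minimal sketch (file name and record shape are illustrative):

```python
import json
import pathlib

# Manual state management: persist the workflow cursor to disk so a
# new session can resume where yesterday's run stopped.

STATE_FILE = pathlib.Path("workflow_state.json")

def load_state():
    if STATE_FILE.exists():
        return json.loads(STATE_FILE.read_text())
    return {"last_vendor": None, "done": []}   # fresh start

def save_state(state):
    STATE_FILE.write_text(json.dumps(state))

state = load_state()
state["done"].append("vendor-a")
state["last_vendor"] = "vendor-a"
save_state(state)
print(load_state()["last_vendor"])  # → vendor-a
```

For anything multi-user or concurrent you would swap the file for a database, but the principle is the same: the model gets the cursor in its prompt, it never remembers it on its own.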
Usage-based billing can deliver surprises on lengthy workflows. Test with Tool Search enabled and measure real consumption before scaling up. For massive automations, the 47% token savings make a significant difference.
OpenClaw, OpenAI’s internal framework for multi-agent workflows, integrates with GPT-5.4 for architectures where several agents coordinate.
It’s promising but still in early access, with limited documentation.
To understand the different levels of AI autonomy, check out our article Why you shouldn’t let AI decide for you.
Roadmap: What’s Coming
OpenAI has announced for coming months: native integration into the ChatGPT desktop app (no need to use the API for simple use cases), multi-monitor support, one-click templates for the most common workflows, and enhanced approval flows for enterprises.
Multi-monitor support is expected Q3 2026. It’s a real blocker for finance professionals and developers working on several screens at once.
The direction is clear: autonomous AI agents in 2025 were mostly demos.
In 2026, these become production tools with measurable success rates and predictable costs.
If you spend more than 5 hours a week copy-pasting between interfaces, a two-week pilot is definitely worth it.
The next article will show how to set up your Docker + Playwright + GPT-5.4 API environment step by step, with pitfalls to avoid right from the start.
This isn’t perfect technology. But on the right use cases, it delivers measurable results as early as the second week.
That’s enough reason to start taking a closer look.
FAQ
Is GPT-5.4 Computer Use available to all developers?
Yes, the API has been open since launch on March 11, 2026. Access is through the standard OpenAI API with an API key. No special access is needed, but Computer Use pricing applies based on tokens consumed and actions generated.
What’s the difference between OSWorld and a real-world test?
OSWorld is a standardized benchmark that tests predefined desktop tasks in a controlled environment. In production, interfaces vary, data is unpredictable, and systems have variable load times. The 75% score is a useful reference, not a performance guarantee for your specific use case.
Can GPT-5.4 access applications that require authentication?
Technically yes, provided the credentials are given in a secure environment. Secret management (passwords, tokens) must use a secret manager (AWS Secrets Manager, HashiCorp Vault) and never be put directly in the prompt. Authentication sessions should be handled at the Docker container level.
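The key discipline is that secrets reach the environment, never the prompt. A sketch, where the variable name is illustrative and the environment would be populated by Vault or Secrets Manager at container start:

```python
import os

# Never inline credentials in the prompt. Read them from the environment
# (populated by your secret manager at container start) and inject them
# only at the browser/form level, outside the model's context.

def get_secret(name):
    value = os.environ.get(name)
    if value is None:
        raise RuntimeError(f"secret {name} not provisioned")
    return value

os.environ["PORTAL_PASSWORD"] = "demo-only"  # stand-in for Vault / Secrets Manager
password = get_secret("PORTAL_PASSWORD")

# The instruction the model sees contains no credential:
prompt = "Log in to the portal and export the price list."
assert "demo-only" not in prompt
```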
How do I handle CAPTCHAs that block automation?
GPT-5.4 cannot resolve image or audio CAPTCHAs. Options are: third-party CAPTCHA-solving services (2captcha, Anti-Captcha), automatic human escalation when a CAPTCHA is detected, or negotiation with the site provider for direct API access to bypass the web interface.
What is the real cost of running a GPT-5.4 workflow in production?
Base rate is $0.50/million input tokens. With Tool Search enabled, actual consumption drops by 47%. A 10-minute workflow that generates 50,000 tokens costs about $0.025. For 1,000 runs a month, that’s $25 in API cost—compare that to the human time saved.
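The arithmetic behind those numbers, spelled out:

```python
# Back-of-envelope cost from the figures above:
# $0.50 per million input tokens, 50,000 tokens per run, 1,000 runs/month.

price_per_token = 0.50 / 1_000_000
tokens_per_run = 50_000
cost_per_run = tokens_per_run * price_per_token
monthly = 1_000 * cost_per_run

print(round(cost_per_run, 4))  # → 0.025
print(round(monthly, 2))       # → 25.0
```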
Does GPT-5.4 work with desktop applications (not just browsers)?
Yes, that’s a strength of native Computer Use. Excel, Word, proprietary enterprise software without an API—the model interacts with anything visible onscreen. Performance varies with interface complexity. Apps with lots of tiny text or dense layouts are more challenging.
How do I configure confirmation policies to stay GDPR compliant?
Confirmation policies are set at the API call level. You define a list of actions that trigger a pause and wait for human validation (via webhook payload into your system). For GDPR, actions that handle personal data, send communications, or submit forms with sensitive data should be included in this list.
Can GPT-5.4 be used to automate processes in regulated industries (banking, health)?
With proper precautions: Docker isolation, confirmation policies for sensitive actions, complete logs for audit, and a signed data processing agreement with OpenAI. For data subject to strict medical or banking secrecy, check with your DPO whether data can transit through the OpenAI API, or if an on-premises solution is required.
What’s the difference between GPT-5.4 Computer Use and OpenAI Operator?
OpenAI Operator is the consumer interface (available via ChatGPT) that leverages GPT-5.4’s Computer Use abilities. The Computer Use API is the developer version, offering more control over environment, fallbacks, and configuration. For production workflows, the API is the right choice. OpenAI Operator is covered in this article if you want to explore the no-code option.
When can we expect a 90%+ success rate on OSWorld?
OpenAI doesn’t give a precise date. The jump from 47.3% (GPT-5.2) to 75% (GPT-5.4) in under a year is significant. Expected improvements on multi-monitor setup and complex interface handling should boost the score. Exceeding 85% by the end of 2026 is plausible, but the last few percentage points are always the hardest to gain.