Luminous arrow cursor made of crystalline code sculpting a neural sphere in dark space: a metaphor for Cursor Composer 2 creating its own AI model

Cursor built its own AI brain: why Composer 2 changes everything for developers

Artificial Intelligence
Nicolas
11 min read

On March 19, 2026, Anysphere published what looked like a quiet blog post: three benchmarks, a pricing table, and an announcement that repositioned the entire AI coding market.

Cursor, the VS Code-based IDE with over one million daily users, is no longer content to resell models from Anthropic and OpenAI.

With Composer 2, the startup valued at $29.3 billion has launched its own AI model, trained entirely in-house on data exclusively tied to code.

The signal is strategic: Cursor moves from reseller to manufacturer, and the numbers deserve serious analysis before declaring victory or defeat.

Key takeaways:

  • Composer 2 beats Claude Opus 4.6 on 2 out of 3 benchmarks, but GPT-5.4 still leads on Terminal-Bench 2.0 (75.1 vs 61.7)
  • Pricing changes everything: $0.50/M input tokens vs $5.00 for Opus 4.6, making it 10x cheaper
  • Self-summarization compresses 5,000 tokens into 1,000 with 50% fewer errors, integrated directly into the RL loop
  • Claude Code is still 5.5x more token-efficient than Cursor on identical tasks: Composer 2’s price advantage shrinks in real-world use
  • Three generations in 5 months (Oct. 2025, Feb. 2026, Mar. 2026): Anysphere’s velocity is as much an industry signal as a technical one

Cursor Composer 2: genesis and architecture

To understand what Composer 2 represents, a quick timeline is in order.

In October 2025, Cursor launched Composer 1 alongside Cursor 2.0: the startup’s first proprietary model, built on a Mixture-of-Experts (MoE) architecture with reinforcement learning, 4x faster than comparable models, but less precise than frontier models.

In February 2026, Composer 1.5 improved benchmark scores (44.2 on CursorBench, 47.9 on Terminal-Bench 2.0), but still trailed Opus 4.6 by 10 points on Terminal-Bench according to The New Stack.

On March 19, 2026, Composer 2 introduced two training phases that were new territory for Anysphere.

Phase 1: continued pretraining.

Cursor took a base model (which it has not disclosed) and retrained it intensively on a massive code corpus, before any fine-tuning phase.

This is the same playbook as DeepSeek Coder or CodeLlama: saturate a solid general model with code data to create a specialized foundation, more receptive to the reinforcement signal.

Phase 2: RL on long-horizon tasks.

Cursor then applied reinforcement learning specifically to coding tasks requiring hundreds of consecutive actions.

The result is a model with a 200,000-token context window, a claimed generation speed of 200+ tokens per second, and a deliberate specialization: Composer 2 doesn’t write poems or fill out your tax returns.

Co-founder Aman Sanger, speaking to Bloomberg: "It won't help you do your taxes."

The focus on code is the competitive advantage, not a limitation.

To explore Cursor’s features before diving into the benchmarks, our complete Cursor guide for developers covers all the IDE’s features in depth.

Benchmarks decoded

Cursor publishes three benchmarks for Composer 2.

Before reading the numbers, a methodological note is in order.

Evaluation harnesses differ across models: Anthropic models are evaluated using the Claude Code harness, OpenAI models with the Simple Codex harness, and Cursor uses the Harbor framework (the officially designated harness for Terminal-Bench 2.0).

These differences are not trivial: Anthropic reports that Claude Opus 4.6 reaches 65.4% on Terminal-Bench 2.0 in its own optimized configuration, compared to 58.0% in Cursor’s measurements.

For a deeper look at reading AI benchmarks critically, our article on how to read AI benchmarks without being misled covers the key analysis tools.

Terminal-Bench 2.0

Terminal-Bench 2.0 is maintained by the Laude Institute: it tests an agent’s ability to complete command-line tasks autonomously, without step-by-step human assistance.

Cursor ran 5 iterations per model/agent pair and reported the average.

| Model | Terminal-Bench 2.0 |
| --- | --- |
| GPT-5.4 Thinking | 75.1 |
| Composer 2 | 61.7 |
| Claude Opus 4.6 | 58.0 |
| Composer 1.5 | 47.9 |

Composer 2 beats Claude Opus 4.6 by 3.7 points on this external benchmark, but GPT-5.4 remains well ahead with a 13.4-point lead.

CursorBench (internal)

CursorBench is Cursor’s proprietary benchmark: tasks are based on real requests from Anysphere engineers, with an average of 352 lines of code across 8 files per exercise.

| Model | CursorBench |
| --- | --- |
| GPT-5.4 Thinking (High) | 63.9 |
| Composer 2 | 61.3 |
| Claude Opus 4.6 | 58.2 |
| Composer 1.5 | 44.2 |

Composer 2 surpasses Claude Opus 4.6 (61.3 vs 58.2) for the first time on this internal test, while staying slightly behind GPT-5.4.

Keep things in perspective: this is Cursor’s own benchmark, built around tasks specific to its own editor.

SWE-bench Multilingual

SWE-bench Multilingual is the most neutral of the three: it tests the resolution of real GitHub tickets across multiple programming languages.

| Model | SWE-bench Multilingual |
| --- | --- |
| Claude Opus 4.6 | 77.8 |
| Composer 2 | 73.7 |
| Composer 1.5 | 65.9 |
| Composer 1 | 56.9 |

This is the only benchmark of the three where Claude Opus 4.6 keeps the advantage: 77.8% vs 73.7% for Composer 2.

The improvement over Composer 1.5 remains remarkable (+7.8 points), but the honest summary is this: Composer 2 leads on 2 out of 3 benchmarks, and falls behind on the most neutral one.

Small luminous cursor floating freely facing a massive gold cube in chains: a metaphor for Cursor Composer 2's economic advantage vs Claude Opus 4.6

The real advantage: pricing

Benchmarks make headlines, but it’s the pricing table that should capture development teams’ attention.

Composer 2 targets the price-to-performance sweet spot, and at this price level, the comparison shifts radically.

| Model | Input (per 1M tokens) | Output (per 1M tokens) |
| --- | --- | --- |
| Composer 2 Standard | $0.50 | $2.50 |
| Composer 2 Fast (default) | $1.50 | $7.50 |
| Claude Opus 4.6 | $5.00 | $25.00 |
| GPT-5.4 (short context) | $2.50 | $15.00 |
| GPT-5.4 (long context) | $5.00 | $22.50 |

The Standard tier is 10x cheaper than Opus 4.6 on both input and output.

Even the Fast variant (which becomes the default mode in Cursor and delivers the same intelligence at higher speed) remains 3x cheaper than Anthropic and 2x cheaper than OpenAI in short-context mode.

The math becomes striking at scale: a team consuming 10 million input tokens per day pays roughly $5 with Composer 2 Standard, compared to $50 with Claude Opus 4.6.
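That back-of-the-envelope calculation can be sketched in a few lines (prices come from the table above; the daily volume is the illustrative figure from this paragraph, not measured data):

```python
# Input-token prices per million tokens, from the pricing table above.
PRICES_PER_M_INPUT = {
    "composer-2-standard": 0.50,
    "composer-2-fast": 1.50,
    "claude-opus-4.6": 5.00,
}

def daily_input_cost(model: str, tokens_per_day: int) -> float:
    """Input-token cost in dollars for one day of usage."""
    return tokens_per_day / 1_000_000 * PRICES_PER_M_INPUT[model]

team_volume = 10_000_000  # 10M input tokens/day, as in the example above
print(daily_input_cost("composer-2-standard", team_volume))  # 5.0
print(daily_input_cost("claude-opus-4.6", team_volume))      # 50.0
```

Output tokens are priced separately (and higher), so real bills will be larger on both sides, but the 10x ratio holds.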

This is the Apple M-chip vs Intel analogy applied to the world of models: by controlling the entire stack, from training to inference, Cursor can offer comparable performance at a structurally lower cost.

Self-summarization: the technical innovation

Composer 2’s real innovation, less visible than the benchmarks, is what Cursor calls compaction-in-the-loop RL.

The problem it solves is a classic one: complex coding tasks generate hundreds of actions and tens of thousands of context tokens, and when the window fills up, models “forget” and make mistakes that invalidate all previous work.

Standard solutions (sliding the window, compressing via separate prompts) invariably lose critical information.

Cursor’s solution: integrating compression directly into the reinforcement learning loop.

Composer 2 learns, during training itself, which information is worth keeping.

Compression is integrated into the reinforcement learning cycle, not added as an afterthought.

In practice, when Composer 2 reaches a token threshold, it inserts a summary of its current context: it compresses 5,000+ tokens into roughly 1,000 tokens, with a selection learned through reinforcement.

Results measured by Cursor on high-difficulty tasks:

  • Traditional prompt-based summaries require an average of 5,000+ output tokens
  • Composer 2’s self-summarization produces summaries of ~1,000 tokens on average (5x more token-efficient)
  • Compression errors are reduced by 50%

A documented example in Cursor’s research notes: a task with 170 interaction turns, compressing 100,000+ tokens down to 1,000 tokens over the course of the session, with a correct final result.
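Cursor has not published the mechanism itself, but the control flow it describes (watch a token threshold, then replace the history with a bounded summary) can be sketched as a toy. Everything below is illustrative: `summarize`-style selection here just keeps the most recent entries, whereas Composer 2 learns what to keep during RL training.

```python
# Toy sketch of threshold-triggered context compaction. The thresholds
# mirror the figures quoted in the article; the selection policy is a
# naive stand-in, NOT Cursor's learned compaction.
from dataclasses import dataclass, field

TOKEN_THRESHOLD = 5_000  # when to compact (illustrative)
SUMMARY_BUDGET = 1_000   # target summary size, per the article

def count_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: one token per whitespace word.
    return len(text.split())

@dataclass
class AgentContext:
    entries: list = field(default_factory=list)

    def total_tokens(self) -> int:
        return sum(count_tokens(e) for e in self.entries)

    def append(self, entry: str) -> None:
        self.entries.append(entry)
        if self.total_tokens() > TOKEN_THRESHOLD:
            self.compact()

    def compact(self) -> None:
        # Keep the most recent entries until the summary budget is spent,
        # then prepend a placeholder summary of everything dropped.
        kept, used = [], 0
        for entry in reversed(self.entries):
            n = count_tokens(entry)
            if used + n > SUMMARY_BUDGET:
                break
            kept.append(entry)
            used += n
        self.entries = ["[summary of earlier turns]"] + list(reversed(kept))
```

The interesting part of Composer 2 is precisely what this sketch fakes: the model is rewarded during training for summaries that preserve task-relevant information, not just recent turns.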

Cascade of code and data compressing into a compact luminous crystal: an illustration of Cursor Composer 2's self-summarization

Composer 2 vs Claude Code: two philosophies

The most relevant comparison for a developer: Cursor with Composer 2 vs Claude Code, far more than Composer 2 vs Opus 4.6 in isolation.

These two tools converged in 2026 on many features (background agents, CLI, agentic workflows), but their philosophy remains structurally different.

| Criterion | Cursor + Composer 2 | Claude Code |
| --- | --- | --- |
| Main interface | Full IDE (VS Code fork) | CLI + VS Code extensions |
| Model access | Multi-model (Claude, GPT, Gemini, Composer) | Anthropic models only |
| Effective context | 200K announced, 70 to 120K in practice | 200K reliable (1M in beta) |
| Tab completions | Yes, dedicated model | No |
| Token efficiency | Market standard | 5.5x fewer tokens for identical tasks |

The 5.5x figure is documented: Claude Code completes a reference task with 33,000 tokens without errors, while Cursor (with GPT-5) consumes 188,000 with errors (The New Stack, builder.io).

Composer 2’s raw price advantage ($0.50 vs $5.00) must be weighed against this higher token consumption in real use.
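To make that trade-off concrete, here is the per-task math under a deliberately crude assumption: all tokens billed at the input rate, and Composer 2 consuming roughly what the measured Cursor+GPT-5 run did (both simplifications; real output tokens cost more):

```python
# Effective cost per reference task: measured token consumption from the
# benchmark cited above, times the input price per million tokens.
# Billing everything at the input rate is a simplification.
def task_cost(tokens_used: int, price_per_m_input: float) -> float:
    return tokens_used / 1_000_000 * price_per_m_input

cursor_cost = task_cost(188_000, 0.50)  # Cursor-style consumption, Composer 2 Standard price
claude_cost = task_cost(33_000, 5.00)   # Claude Code consumption, Opus 4.6 price

print(round(cursor_cost, 3))  # 0.094
print(round(claude_cost, 3))  # 0.165
```

Under these assumptions the nominal 10x price gap shrinks to roughly 1.8x per task, which is the article's point: per-token price and per-task cost are different questions.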

On the context window, multiple threads on the Cursor forum report an effective limit of 70,000 to 120,000 tokens, despite the 200K announcement, due to internal performance truncations.

For developers getting started with Claude Code, our Claude Code commands guide for beginners is the best starting point.

Each tool’s philosophy reflects two distinct visions of AI coding:

  • Cursor: the IDE enhanced by AI, with visual feedback, tab completions, multi-model access, and now its own specialized model
  • Claude Code: the autonomous agent that delegates complex tasks, with reliable context and maximum efficiency

The pattern emerging among the most productive developers: Claude Code for large refactors and architecture, Cursor for fast editing and UI/frontend work.

Cursor Composer 2’s impact on the AI coding market

Composer 2 is as much an industry signal as a product.

For the past two years, Cursor, GitHub Copilot, and Windsurf were all “resellers with a nice interface”: the real models came from Anthropic, OpenAI, or Google.

With Composer 2, Cursor becomes the first AI coding platform to produce its own frontier-class model while continuing to offer third-party models in parallel.

The business logic is direct: Cursor subscriptions operate at negative margins, subsidized by enterprise contracts.

If Cursor serves the majority of its requests on its own model, its cost structure collapses and its dependency on competing providers disappears.

The pressure is structural: Anthropic has Claude Code, OpenAI has standalone Codex, and Google has Gemini CLI. Every major lab is building its own coding tool.

Cursor couldn’t remain indefinitely dependent on models supplied by its own competitors.

Composer 2 is as much a survival response as a technical ambition.

A current valuation of $29.3 billion (November 2025), over $2 billion ARR, and 50,000 business customers (Stripe, Figma) give Anysphere the resources to accelerate this strategy.

Three generations of Composer in five months demonstrate an R&D velocity few labs could sustain.

Anysphere researchers have already begun publicly discussing Composer 3.

Our verdict on Composer 2

Is Composer 2 worth adopting? The answer depends on your usage profile.

Composer 2 on Cursor is clearly worth testing if:

  • You’re already a Cursor Pro subscriber and want to stretch your request budget
  • Your daily tasks are well-defined: clear refactors, features with precise specifications, multi-file changes on a familiar codebase
  • Generation speed matters in your workflow: 200+ tokens/s changes the feel of repeated iteration
  • You work on a large monorepo where self-summarization gives Composer 2 a structural advantage on long tasks

Claude Opus 4.6 (or Claude Code) remains superior if:

  • Your tasks require deep reasoning on complex architectures or ambiguous requirements
  • You need reliable 200K-token context across a large codebase
  • Token efficiency at runtime matters to you (5.5x savings measured on complex tasks)
  • You’re resolving real GitHub tickets: Opus 4.6 still leads on SWE-bench Multilingual (77.8% vs 73.7%)

Test Composer 2 on an existing project: pick a refactoring task you know well, measure the result and the actual token cost.

And to discover AI coding from scratch, our Claude Code guide for beginners remains the best entry point into the world of coding agents.

The real lesson from Composer 2: the AI coding market is entering a phase of vertical specialization where the strongest platforms build their own foundations.

FAQ

Does Composer 2 actually beat Claude Opus 4.6?

On 2 out of 3 benchmarks published by Cursor, yes.

Composer 2 surpasses Opus 4.6 on CursorBench (61.3 vs 58.2) and Terminal-Bench 2.0 (61.7 vs 58.0).

On SWE-bench Multilingual (the most neutral), Opus 4.6 keeps the advantage (77.8 vs 73.7).

Differences in evaluation harnesses mean the head-to-head numbers are only partially comparable.

How much does Composer 2 cost?

The Standard tier is $0.50/M input tokens and $2.50/M output tokens.

The Fast tier (default in Cursor) is $1.50/$7.50.

Claude Opus 4.6 is $5.00/$25.00, making it 10x more expensive than Standard.

Can Composer 2 be used outside the Cursor IDE?

Yes, via the Cursor API with the identifiers composer-2 (Standard) and composer-2-fast.

For Cursor Pro subscribers, Composer 2 usage draws from a dedicated quota, separate from credits for third-party models.
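As a sketch only: the article gives the model identifiers but not the API's shape, so the endpoint URL, request body, and authentication below are assumptions modeled on common chat-completion APIs, not documented Cursor behavior.

```python
import json
import urllib.request

# Hypothetical endpoint: only the model IDs "composer-2" and
# "composer-2-fast" come from the announcement; the URL is a placeholder.
API_URL = "https://api.cursor.example/v1/chat/completions"

def build_request(model: str, prompt: str, api_key: str) -> urllib.request.Request:
    """Assemble a chat-style request (shape assumed, not documented)."""
    payload = {
        "model": model,  # "composer-2" (Standard) or "composer-2-fast"
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        API_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
```

Check Cursor's own API documentation for the real endpoint and request schema before wiring anything up.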

What base model is Composer 2 built on?

Cursor does not disclose the base model used for continued pretraining.

The startup only confirms: a continued-pretraining phase on code, followed by an RL phase on long-horizon tasks.

What is Cursor’s self-summarization?

Cursor calls it compaction-in-the-loop RL: when Composer 2 approaches the limit of its context window, it automatically compresses its history (5,000+ tokens down to ~1,000).

This capability is learned during reinforcement training, not applied as post-processing.

Result: 5x more token-efficient than traditional methods and 50% fewer compression errors.

How fast is Composer 2?

Cursor reports more than 200 tokens per second (measured on real traffic from March 18, 2026, normalized for token size differences across providers).

The Fast variant is the default mode in the IDE.

Should you choose Cursor or Claude Code in 2026?

Cursor excels at fast editing, tab completions, multi-model flexibility, and visual workflows.

Claude Code is superior for large autonomous refactors, reliable 200K-token context, and token efficiency (5.5x fewer tokens for identical tasks).

Many developers use both in parallel depending on the nature of the task.

Is GPT-5.4 better than Composer 2?

On Terminal-Bench 2.0, GPT-5.4 Thinking leads with 75.1% vs 61.7% for Composer 2.

On CursorBench, GPT-5.4 High is slightly ahead (63.9 vs 61.3).

GPT-5.4 is priced at $2.50 to $5.00/M input tokens, compared to $0.50 for Composer 2 Standard.

Can Composer 2 handle large enterprise projects?

For well-defined coding tasks on large codebases, self-summarization gives Composer 2 a specific advantage during long sessions.

For complex architectures with ambiguous requirements, Opus 4.6 maintains superior reasoning depth, according to the consensus of developers who have tested both.

Is Cursor still reliable with Composer 2?

The Cursor community reported reliability issues in March 2026, including a confirmed code reversion bug.

Some developers report monthly costs of $40 to $50 with intensive use and an effective context limit of 70 to 120K tokens despite the 200K announcement.
