Claude Opus 4.6 crushes benchmarks with 1M-token beta window
Claude Opus 4.6, the newest upgrade to Anthropic's flagship model, lands with notable gains in coding, reasoning, and long-context understanding. The headline feature is a 1-million-token context window in beta, which opens the door to tasks that previously required stitching together dozens of prompts or external retrieval workarounds.
Anthropic frames this release as a step forward for real-world knowledge work. The company says Opus 4.6 plans more carefully, sustains focus for longer, and is more reliable across large codebases and complex documents. On top of that, it posts strong results across a range of industry benchmarks covering agentic coding, deep reasoning, web browsing, and long-context retrieval.
Here is what changed under the hood, how it performs, and where it fits into practical workflows.
What is new in Claude Opus 4.6
The 4.6 update is designed to address a common pain point in advanced AI work: models drift or drop context during long sessions. According to Anthropic, Opus 4.6 handles extended tasks more reliably, plans steps more intentionally, and resists losing the thread. This behavior is especially relevant in large codebases and multi-document analysis where coherence matters as much as raw accuracy.
For software teams, Opus 4.6 also improves code review and debugging. The company notes better self-checking, including detecting and correcting its own mistakes. That translates into more useful draft reviews, fewer missed edge cases, and more consistent follow-through during iterative development.
Beyond engineering, Anthropic positions Opus 4.6 for everyday professional work. The model can run financial analysis, conduct research, and create or edit documents, spreadsheets, and presentations. Inside Anthropic's Cowork environment, Claude can manage multiple tasks autonomously, which points to more agentic workflows where the model coordinates steps with minimal hand-holding.
Why the 1M-token window matters
The beta 1-million-token context window is the most visible change. A longer window helps the model keep source material and prior steps in view, which reduces the need to chunk and reassemble inputs. It also lowers the risk of inconsistencies that creep in when a task is split across many separate prompts.
Examples of where this could help in practice include:
- Large codebase comprehension. Keeping numerous files in context during refactors or cross-cutting changes.
- Due diligence and research. Parsing lengthy reports, filings, or literature reviews without heavy pre-trimming.
- Enterprise knowledge work. Summarizing years of meeting notes, support tickets, or system logs while preserving nuance.
- Regulatory analysis. Cross-referencing long policy documents and amendments in one session.
That said, a big window only helps if the model can use it effectively. Anthropic reports reduced context rot, the tendency to lose track of earlier information as prompts grow. The long-context results below suggest Opus 4.6 can retrieve buried facts and maintain coherence across very large inputs more reliably than before.
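For a concrete sense of what that looks like in practice, here is a minimal sketch of a single long-context request using the Anthropic Python SDK. The model ID and the beta flag below are assumptions for illustration only; check Anthropic's documentation for the current values.

```python
# A minimal long-context sketch. "claude-opus-4-6" and the beta flag are
# assumptions, not confirmed identifiers -- consult the official docs.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("annual_report_bundle.txt") as f:
    corpus = f.read()  # e.g. several filings concatenated into one string

response = client.beta.messages.create(
    model="claude-opus-4-6",          # assumed model ID
    betas=["context-1m-2025-08-07"],  # assumed long-context beta flag
    max_tokens=4096,
    messages=[{
        "role": "user",
        "content": f"{corpus}\n\nSummarize the revenue trends across all filings above.",
    }],
)
print(response.content[0].text)
```

The point is the shape of the workflow: one request holds the entire corpus, so there is no chunking pipeline to build or reconcile afterward.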
Performance across benchmarks
Anthropic says Opus 4.6 delivers state-of-the-art results on several public and internal benchmarks that stress different capabilities. Highlights include:
- Terminal-Bench 2.0. Highest score on a benchmark focused on agentic coding, which tests a model's ability to plan and execute coding tasks over multiple steps.
- Humanity's Last Exam. Leading performance on a complex reasoning benchmark spanning multiple disciplines, signaling improved general problem solving.
- GDPval-AA. A test for economically valuable knowledge work in finance and legal domains. Opus 4.6 reportedly outperforms OpenAI's GPT-5.2 by about 144 Elo points and improves on Claude Opus 4.5 by 190 points.
- BrowseComp. Top results on a web browsing benchmark that evaluates a model's ability to find hard-to-locate information online.
On long-context retrieval, Anthropic points to stronger performance on large document sets and less degradation across hundreds of thousands of tokens. The company reports:
- On the 8-needle 1M variant of MRCR v2, which hides key facts in massive volumes of text, Opus 4.6 scored 76%. This compares to 18.5% for Sonnet 4.5 on the same test.
Beyond these headline scores, Opus 4.6 shows gains in software engineering, multilingual coding, cybersecurity, long-term coherence, root cause analysis, and life sciences knowledge. These areas reflect day-to-day use cases where consistent reasoning and precise recall are essential.
Safety and alignment
Performance jumps often raise questions about safety and alignment. Anthropic says the gains in Opus 4.6 did not come at the expense of guardrails. On its automated behavioral audit, the model showed low rates of problematic behaviors such as deception, sycophancy, or encouraging harmful misuse.
The company states that Opus 4.6 is as aligned as Claude Opus 4.5 and has the lowest rate of unnecessary refusals among recent Claude models. That matters for productivity, since models that refuse too often force users to rewrite prompts or find manual workarounds.
For this release, Anthropic expanded safety evaluations with new tests focused on user well-being, refusal behavior, and hidden harmful actions. The team also applied new interpretability methods to better understand how the model arrives at decisions. Given Opus 4.6's stronger cybersecurity skills, Anthropic added six new cybersecurity probes to monitor potential misuse. At the same time, the company emphasizes defensive use cases, like identifying and fixing vulnerabilities in open-source software.
Developer and product updates
Alongside the model upgrade, Anthropic rolled out several platform updates that affect cost, speed, and workflow design. The goal is to help teams control effort and budget while scaling into longer tasks.
Adaptive thinking levels
Developers can now select from four effort levels that signal when the model should lean into deeper reasoning:
- Low
- Medium
- High (default)
- Max
This setting lets teams balance intelligence, speed, and cost per request. Lighter queries can run faster and cheaper, while complex problems can allocate more compute for stronger analysis. It is a practical way to match output quality to the stakes of the task.
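As a sketch of how this might look in code, the snippet below routes the effort level through the SDK's extra_body escape hatch. The field name "effort" and the model ID are assumptions, not confirmed API parameters; the real API may expose this setting differently.

```python
# Hedged sketch of per-request effort selection. "effort" is a hypothetical
# field name passed via extra_body; verify the actual parameter in the docs.
import anthropic

client = anthropic.Anthropic()

def ask(prompt: str, effort: str = "high") -> str:
    """Send one request at a given thinking level: low, medium, high, or max."""
    response = client.messages.create(
        model="claude-opus-4-6",        # assumed model ID
        max_tokens=2048,
        extra_body={"effort": effort},  # hypothetical field, for illustration only
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

# Cheap and fast for routine transforms, maximum effort for the hard problem.
ask("Convert 'March 5, 2026' to ISO 8601.", effort="low")
ask("Find the race condition in this scheduler code: ...", effort="max")
```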
Context compaction in beta
Opus 4.6 introduces context compaction in beta, which automatically summarizes older parts of a conversation as it grows. This helps maintain continuity in longer-running sessions without exceeding limits or manually pruning content.
In effect, compaction acts like a rolling memory that preserves essential information while trimming repetitive or stale details. For projects that span days or weeks, that can reduce friction and prevent accidental drift.
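Compaction runs server-side, but the underlying idea is easy to picture. The sketch below is a client-side analogue that assumes nothing about the beta API itself: once the history passes a budget, older turns are folded into a summary and only the recent tail is kept verbatim.

```python
# Client-side illustration of the compaction idea, not the beta API.
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-opus-4-6"  # assumed model ID

def compact(history: list[dict], keep_recent: int = 6) -> list[dict]:
    """Summarize everything except the last `keep_recent` messages."""
    if len(history) <= keep_recent:
        return history
    older, recent = history[:-keep_recent], history[-keep_recent:]
    transcript = "\n".join(f"{m['role']}: {m['content']}" for m in older)
    summary = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{
            "role": "user",
            "content": "Summarize this conversation, keeping decisions, "
                       f"constraints, and open questions:\n\n{transcript}",
        }],
    ).content[0].text
    # Replace the older turns with one summary message plus the recent tail.
    return [{"role": "user", "content": f"Summary of earlier discussion: {summary}"}] + recent
```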
Output and context limits
Anthropic lists the following I/O limits and options:
- Up to 128,000 output tokens
- 1-million-token context window in beta, with premium pricing for prompts above 200,000 tokens
- US-only inference option at 1.1x token pricing for workloads that require US-based processing
Pricing for standard usage remains $5 per million input tokens and $25 per million output tokens, with higher rates applied for extended context tiers. Opus 4.6 is available through claude.ai, Anthropic's API, and major cloud platforms.
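Those numbers make cost estimation simple arithmetic. The helper below applies the listed base rates; the article does not state the exact premium rate above 200,000 tokens, so the 2.0 multiplier is a placeholder assumption.

```python
BASE_INPUT_PER_MTOK = 5.00    # USD per million input tokens
BASE_OUTPUT_PER_MTOK = 25.00  # USD per million output tokens

def estimate_cost(input_tokens: int, output_tokens: int,
                  long_context_multiplier: float = 2.0,  # assumed premium factor
                  us_only: bool = False) -> float:
    """Rough per-request cost in USD under the assumptions noted above."""
    rate_in, rate_out = BASE_INPUT_PER_MTOK, BASE_OUTPUT_PER_MTOK
    if input_tokens > 200_000:
        # Premium tier for long prompts; assumed here to apply to both sides.
        rate_in *= long_context_multiplier
        rate_out *= long_context_multiplier
    cost = input_tokens / 1e6 * rate_in + output_tokens / 1e6 * rate_out
    if us_only:
        cost *= 1.1  # US-only inference option at 1.1x token pricing
    return cost

# Example: an 800K-token prompt with a 20K-token answer, standard region.
print(f"${estimate_cost(800_000, 20_000):.2f}")  # $9.00 under these assumptions
```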
How the gains translate to real work
Benchmarks are useful signals, but the real test is whether teams can close knowledge gaps, ship more reliable software, and accelerate analysis. Several workflows stand to benefit immediately from Opus 4.6.
Software engineering and data work
Better planning, self-checks, and debugging support are valuable in large codebases. With a bigger window, the model can keep more files, comments, and PR history in view, which helps maintain thread continuity during reviews and refactors.
The gains in multilingual coding, cybersecurity, and root cause analysis are also valuable here. These areas often involve tracing subtle interactions across services or languages. A model that resists context rot is less likely to miss a dependency or lose a variable across steps.
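To make "more files in view" concrete, here is a small sketch that packs a repository into a single prompt for a review pass. The file filter and the character budget (a rough proxy for tokens) are illustrative choices, not recommendations from Anthropic.

```python
# Pack a repo's source files into one prompt, tagged with their paths.
from pathlib import Path

def pack_repo(root: str, suffixes: tuple[str, ...] = (".py", ".md"),
              max_chars: int = 2_000_000) -> str:
    """Concatenate matching files up to a rough budget (~4 chars per token)."""
    parts, total = [], 0
    for path in sorted(Path(root).rglob("*")):
        if path.suffix not in suffixes or not path.is_file():
            continue
        chunk = f"\n=== {path} ===\n{path.read_text(errors='ignore')}"
        if total + len(chunk) > max_chars:
            break  # stay under the budget rather than truncating mid-file
        parts.append(chunk)
        total += len(chunk)
    return "".join(parts)

prompt = pack_repo("./my_service") + "\n\nReview this codebase for cross-file bugs."
```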
Finance and legal analysis
Results on GDPval-AA suggest Opus 4.6 is more competitive on economically valuable tasks in finance and legal domains. The ability to handle long filings, multi-year datasets, or complex contracts in a single workflow can reduce the overhead of manual curation.
For analysts and counsel, that can mean faster first passes, better cross-references, and clearer summaries that preserve nuance from the source material.
Research and knowledge management
Teams that manage large knowledge stores often face a choice between retrieval pipelines and brittle prompt chains. A longer window, plus compaction, helps keep critical context in one place while the model moves from exploration to synthesis to final outputs.
In practice, that could look like literature reviews, long-form writing, or internal investigations where dozens of sources must be integrated without losing fidelity. Inside Anthropic's Cowork environment, autonomous task handling can orchestrate these steps with minimal guidance.
Costs, trade-offs, and limits
The expanded window is a powerful capability, but it is not free. Large prompts carry higher costs and may introduce latency while the system reads and reasons over long inputs. Teams should match the window size to the task rather than defaulting to the maximum.
Anthropic's adaptive thinking and context compaction features are practical levers. Use lower effort levels for routine actions, then ramp to max for critical reasoning. Let compaction prune trailing context in long sessions instead of manually curating every turn.
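One lightweight way to encode that advice, building on the effort sketch shown earlier, is a static routing table from task type to effort level. The categories and mappings below are illustrative choices, not Anthropic guidance.

```python
# Illustrative routing table: match effort to the stakes of the task.
EFFORT_BY_TASK = {
    "format_conversion": "low",      # mechanical transforms
    "summarization": "medium",       # routine synthesis
    "code_review": "high",           # matches the API default
    "root_cause_analysis": "max",    # stakes justify the extra compute
}

def effort_for(task_type: str) -> str:
    """Pick an effort level, defaulting to the API's own default of high."""
    return EFFORT_BY_TASK.get(task_type, "high")
```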
It is also worth remembering that the 1M-token window is in beta. While results are promising, real workloads will reveal corner cases, especially with noisy source data or mixed formats. As always with benchmarks, consider how evaluation conditions compare to your environment before drawing conclusions about production impact.
How it stacks up
Opus 4.6 is presented as a state-of-the-art model across several respected tests. On GDPval-AA, Anthropic reports an advantage of about 144 Elo points over GPT-5.2, and a 190-point improvement over Opus 4.5. It also leads on Terminal-Bench 2.0 and BrowseComp, and shows strong long-context retrieval in MRCR v2.
Performance on Humanity's Last Exam signals breadth in reasoning across disciplines. Combined with better long-context handling, that breadth is likely to matter for complex, open-ended tasks where the right answer depends on connecting distant pieces of information.
What to watch next
The most interesting frontier is how teams push the 1M-token window from lab demos to daily workflows. Expect experimentation with:
- Massive code and doc audits where the model keeps a projects full surface area in view.
- Longitudinal analyses that track changes across time, from telemetry to policy updates.
- Agentic orchestration inside environments like Cowork, where the model coordinates steps across tasks.
On the safety side, stronger cybersecurity skills call for careful monitoring. The added probes and extended audits are a good start. Continued investment in interpretability will help teams understand failure modes and correct them early.
Bottom line
Claude Opus 4.6 pairs benchmark-leading performance with a 1M-token beta context window and thoughtful platform updates. For practical work, the gains in planning, long-context reliability, and self-checking are the real story. They reduce friction in large-scale coding, analysis, and research where context continuity and steady reasoning matter most.
The result is not a flashy parlor trick. It is a steadier tool that can hold more of your problem in mind, refuse less often when it should not, and still respect safety boundaries. If your workload has been constrained by context length or context rot, Opus 4.6 is worth a close look.
Key takeaways
- 1M-token context window (beta) enables long, coherent sessions across large codebases and document sets, with reduced context rot.
- Crushes benchmarks, including top results on Terminal-Bench 2.0, Humanity's Last Exam, GDPval-AA, BrowseComp, and strong MRCR v2 long-context retrieval.
- Stronger at real-world tasks, from code review and debugging to finance, legal, research, and document workflows.
- Safety and alignment maintained, with expanded audits, new interpretability work, and six cybersecurity probes to monitor misuse.
- Developer controls include adaptive thinking levels, context compaction in beta, up to 128,000 output tokens, US-only inference option, and base pricing of $5 per million input tokens and $25 per million output tokens, with higher rates for extended context tiers.

Written by
Tharun P Karun
Full-Stack Engineer & AI Enthusiast. Writing tutorials, reviews, and lessons learned.