
Claude's newest model is one step forward and two steps back, and it's infuriating

At a glance:

  • Anthropic's Opus 4.7 improves instruction following but reduces creative problem-solving capabilities
  • The model improves on 12 of 14 reported benchmarks but uses up to 35% more tokens
  • Users report increased hallucinations despite higher honesty scores

The Model's New Capabilities

Anthropic's Opus 4.7 represents a significant leap in software engineering and agentic safety, with the company claiming improvements on 12 of the 14 benchmarks it reports relative to Opus 4.6. The model demonstrates enhanced precision in following user instructions, with Anthropic noting that 4.7 now takes directions literally rather than interpreting them loosely. This update is particularly notable for developers, as the model now executes specific code modifications without unnecessary refactoring. For instance, when asked to change a single line of code, Opus 4.7 makes that change without altering the surrounding code, a marked improvement over previous versions, which often added unsolicited suggestions.
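
Developers who want to test this stricter behavior can reproduce it with a short script. The sketch below uses the Anthropic Python SDK's Messages API; the model identifier is a placeholder, since the exact string for Opus 4.7 should be confirmed against Anthropic's documentation.

    import anthropic

    # Minimal "change one line only" reproduction. Assumes ANTHROPIC_API_KEY
    # is set in the environment; the model id below is a placeholder, not a
    # confirmed identifier.
    client = anthropic.Anthropic()

    snippet = '''def greet(name):
        print("hello " + name)
        return name
    '''

    message = client.messages.create(
        model="claude-opus-4-7",  # placeholder model id
        max_tokens=512,
        messages=[{
            "role": "user",
            "content": "In the following function, change only the print line "
                       "to use an f-string. Do not modify anything else:\n\n" + snippet,
        }],
    )
    print(message.content[0].text)

On previous versions, a request like this would often come back with renamed variables or extra suggestions; the complaint against 4.7 is the opposite, that it does exactly this and nothing more.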

However, this increased precision comes with notable trade-offs. The model's strict adherence to instructions has led to reduced capacity for independent problem-solving. In testing, Opus 4.7 frequently failed to perform necessary web searches when the context required it, such as when asked about OpenClaw, a 1997 platformer game. Where Opus 4.6 would have requested clarification, 4.7 provided an incorrect answer without attempting to verify the information. This behavior aligns with Anthropic's own admission that the model's honesty score (91.7% on the MASK benchmark) reflects technical accuracy rather than real-world knowledge verification.

The Trade-Offs of Precision

The most striking criticism of Opus 4.7 is its apparent 'laziness' in task execution. Users report that the model often performs the bare minimum required by instructions, avoiding deeper analysis unless explicitly prompted. This was evident in one Reddit user's experience, in which Opus 4.7 incorrectly denied the existence of a Claude for Excel add-in, despite the tool being available. When confronted, the model admitted it should have searched the web, then repeated the same error in subsequent queries, racking up seven such self-corrections in a single conversation.

The model's token consumption has also risen significantly, with Anthropic reporting usage of 1.0x to 1.35x that of Opus 4.6 for the same work. This means the same prompt could cost up to 35% more, compounded by the need for more back-and-forth as users guide the model through complex tasks. The updated tokenizer, while improving how text maps to tokens, has made the model more resource-intensive to use.
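
As a back-of-the-envelope illustration of what that range means in practice, the sketch below propagates the 1.35x worst case to a monthly bill. The per-token price and monthly volume are hypothetical placeholders, not Anthropic's published pricing.

    # Rough cost impact of the reported 1.0x-1.35x token-usage range.
    # Both constants are hypothetical, not Anthropic's published pricing.
    PRICE_PER_MILLION_TOKENS = 75.00    # hypothetical USD per 1M output tokens
    MONTHLY_OUTPUT_TOKENS = 20_000_000  # hypothetical monthly volume

    def monthly_cost(multiplier: float) -> float:
        """Monthly spend at a given token-usage multiplier."""
        return MONTHLY_OUTPUT_TOKENS * multiplier * PRICE_PER_MILLION_TOKENS / 1_000_000

    baseline = monthly_cost(1.00)  # Opus 4.6-equivalent usage
    worst = monthly_cost(1.35)     # Opus 4.7 at the top of the reported range
    print(f"baseline:   ${baseline:,.2f}")
    print(f"worst case: ${worst:,.2f} (+{worst / baseline - 1:.0%})")

At these placeholder numbers, the bill climbs from $1,500 to $2,025 a month before counting any extra back-and-forth turns.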

User Experience and Reception

Despite its technical advancements, Opus 4.7 has sparked mixed reactions from the developer community. Many users appreciate the improved instruction following, particularly for code modification tasks where precision is critical. However, the model's reduced capacity for creative problem-solving has frustrated users who rely on AI for exploratory tasks. One developer noted that Opus 4.7's strict adherence to instructions made it less effective for brainstorming sessions, where previous models would offer multiple solution paths.

The model's increased token costs have also raised concerns about accessibility. For users with limited API budgets, a token-usage increase of up to 35% could significantly impact their workflow. This is particularly problematic for small businesses and independent developers, who may struggle with the higher cost of using the model for routine tasks.

Broader Implications for AI Development

Opus 4.7's release highlights a growing tension in AI development between precision and flexibility. While the model's improvements in instruction following and safety are technically impressive, they come at the cost of reduced adaptability. This trade-off raises an important question about the future direction of AI models: should they prioritize strict compliance with user instructions, or should they maintain the ability to think creatively and independently?

The controversy surrounding Opus 4.7 also underscores the challenges of benchmarking AI models. While Anthropic's MASK honesty score suggests technical improvements, real-world performance metrics tell a different story. The model's increased hallucinations and reduced problem-solving capacity demonstrate that benchmark scores don't always translate to practical usability.

What's Next for Anthropic

Anthropic has not yet announced plans to address the criticisms of Opus 4.7, but the company's history of rapid iteration suggests updates are likely. Potential improvements could include better web search integration, more flexible instruction handling, and optimized token usage. However, any changes would need to balance the model's strengths in precision with the need for creative problem-solving capabilities.

The Opus 4.7 controversy may also influence how other AI companies approach model development. As the industry continues to push for more capable models, the debate over precision versus flexibility will likely intensify. For now, Opus 4.7 serves as a cautionary tale about the complexities of AI development and the challenges of balancing technical capabilities with user needs.

Editorial: SiliconFeed is an automated feed; facts are checked against sources, and copy is normalized and lightly edited for readers.

FAQ

What are the main improvements in Opus 4.7?
Opus 4.7 shows significant improvements in software engineering tasks and agentic safety, improving on 12 of the 14 benchmarks Anthropic reports relative to Opus 4.6. The model follows instructions more literally, executing specific code changes without unnecessary refactoring. It also scores higher on honesty in technical responses, at 91.7% on the MASK benchmark versus Opus 4.6's 90.3%.
What are the main criticisms of Opus 4.7?
Despite its technical advancements, Opus 4.7 has been criticized for reduced creative problem-solving capabilities and increased token costs. The model often fails to perform necessary web searches when context requires it, leading to factual inaccuracies. It also consumes up to 35% more tokens than its predecessor, making it more expensive to use for routine tasks.
How does Opus 4.7 compare to previous models?
Opus 4.7 represents a significant upgrade in instruction following and safety, but at the cost of reduced flexibility. While it performs better on technical benchmarks, it struggles with real-world tasks that require independent research. The model's increased token usage and tendency to hallucinate when not given explicit instructions represent notable trade-offs compared to previous versions.
