AI

I asked Claude, ChatGPT, and Gemini to fix the same bug, and only one understood it

At a glance:

  • Anthropic's Claude Sonnet 4.6 achieved a perfect score by identifying all three intentional bugs in a Pygame project.
  • OpenAI's ChatGPT 5.5 successfully diagnosed two out of three errors but failed to detect inverted wall collision logic.
  • Google's Gemini 3.1 failed to identify any of the planted bugs, opting instead to rewrite the entire movement system.

Frontier language models from OpenAI, Google, and Anthropic are increasingly being integrated into professional coding workflows. While many developers use these tools to generate boilerplate code or suggest new functions, the true test of an AI's utility lies in its ability to perform forensic debugging. Identifying a specific, logical error within an existing codebase requires a level of precision and context retention that goes beyond simple pattern matching.

To test this capability, a controlled experiment was conducted using a deliberately sabotaged Pygame project titled "Captain Hat." The goal was to see how three leading models—Claude Sonnet 4.6, ChatGPT 5.5, and Gemini 3.1—would respond to a zero-shot prompt: "There's a problem with the movement in this platformer game. Can you find and fix it, and detail the issue?" No hints, context, or fine-tuning were provided, forcing the models to rely entirely on their inherent reasoning abilities.

The three sabotaged bugs

The experiment utilized three specific logical errors designed to disrupt the physics and movement of the platformer game. These bugs were not syntax errors, which are easily caught by standard linters, but rather logical flaws that would cause unpredictable behavior during gameplay.

  • Conditional Gravity (Line 812): A ternary operator was inserted to set gravity to zero whenever the player moved to the right, causing the character to drift upward without accumulating downward force.
  • Swapped Platform Momentum (Lines 809-810): The coordinate axes for platform momentum were corrupted, meaning the script updated the X-position using the vertical change (dy) and the Y-position using the horizontal change (dx).
  • Inverted Wall Collisions (Lines 818-819): The logic for horizontal wall collisions was flipped, causing the player to phase through obstacles and appear on the opposite side of the barrier rather than being clamped against it.

ChatGPT 5.5 delivers a partial fix

OpenAI's ChatGPT 5.5 performed respectably, successfully identifying two of the three planted bugs. It correctly diagnosed the swapped axis issue regarding platform-riding mechanics, noting that the jittery movement was caused by vertical and horizontal deltas being applied to the wrong axes. It also successfully traced the inconsistent jumping behavior back to the conditional gravity bug on line 812.

However, the model failed to address the most disruptive error: the inverted wall collision logic. Because it did not recognize the cause of the phasing behavior, the resulting fix was incomplete. While ChatGPT remains a powerful tool for general assistance, this test highlighted a gap in its ability to catch subtle, non-obvious logical inversions in complex physics scripts.

Gemini 3.1 fails the technical test

In a surprising result, Google's Gemini 3.1 failed to identify any of the specific bugs. Rather than performing the requested task of finding and fixing the existing errors, the model took a highly divergent approach. It flagged the entire movement system as being poorly designed and attempted to rebuild the control scheme from scratch.

Gemini's response included the implementation of new acceleration curves, friction multipliers, and velocity capping. While these additions might improve a game's "feel," they completely ignored the sabotaged lines of code. This behavior places the model in the "confidently incorrect" category, as it answered a question that was never asked—reimagining the architecture instead of repairing the specific logic provided.

Claude Sonnet 4.6 takes the crown

Anthropic's Claude Sonnet 4.6 was the only model to achieve a perfect score, identifying and fixing all three bugs with high precision. It demonstrated a superior ability to parse the specific logic of the codebase without drifting into unsolicited redesigns or generalized advice. This performance aligns with Claude's growing reputation as the preferred model for professional coding workflows.

Claude correctly pinpointed the conditional gravity problem on line 812, the swapped platform axes on lines 809 and 810, and—most importantly—the sneaky wall collision logic on lines 818-819. By focusing strictly on the task at hand, Claude proved to be the most reliable tool for precision debugging under zero-shot conditions, making it a standout choice for developers looking to solve specific, high-stakes technical issues.

Editorial SiliconFeed is an automated feed: facts are checked against sources; copy is normalized and lightly edited for readers.

FAQ

Which AI model performed best in the debugging test?
Claude Sonnet 4.6 by Anthropic was the only model to achieve a perfect score. It successfully identified and fixed all three intentional logical bugs in the Pygame project without providing unsolicited redesigns.
What were the specific bugs used to test the models?
The test involved three logical errors: a ternary operator on line 812 that zeroed out gravity during rightward movement, swapped coordinate axes on lines 809-810 for platform momentum, and inverted wall collision logic on lines 818-819.
How did Gemini 3.1 respond to the debugging prompt?
Gemini 3.1 failed to detect any of the three bugs. Instead of fixing the existing code, it attempted to overhaul the entire movement system by introducing new mechanics like acceleration curves and friction multipliers, effectively ignoring the specific errors.

More in the feed

Prepared by the editorial stack from public data and external sources.

Original article