I tested Claude, ChatGPT, and Gemini on the same tasks — here's which one wins
At a glance:
- Claude delivered the most detailed and reasoned home lab design, offered a surprise Mac mini M4 suggestion for inference, and was the only model that paused to ask clarifying questions when given an ambiguous prompt
- ChatGPT produced a solid network diagram and correctly flagged the 8 GB VRAM limitation of its recommended GPU, but assumed tool availability without hedging
- Gemini was the fastest to respond but overlooked critical physical constraints such as whether components would actually fit inside the mini PCs it recommended
The setup and motivation
The author had been juggling multiple AI subscriptions — Claude, ChatGPT, and Gemini — across tasks ranging from coding to file organization. Rather than continuing to pay for all three, the goal was to run identical, real-world prompts through each model and determine which single cloud-based LLM was worth keeping alongside local models. Each of the three LLMs was described as having a distinct character: Claude is more restrained, ChatGPT is a little more agreeable, and Gemini leans toward an oversugared enthusiasm with a fondness for emojis.
Four test categories were chosen to probe different capabilities: complex systems reasoning, ambiguity and clarification handling, precision in instruction following, and a deliberate hallucination trap. All prompts were drawn from scenarios the author had either recently completed or planned to tackle soon, ensuring that inaccuracies would be immediately obvious.
Complex systems reasoning: home lab design
The first task asked each model to design a complete home lab architecture under tight constraints: a 10 GbE backbone with mixed 2.5 GbE clients, a three-node Proxmox cluster with shared storage, support for Kubernetes, local LLM inference, and Jellyfin media streaming, all while staying under 400 W average draw, under 40 dB noise, and a $3,000 total budget (used hardware permitted). Each model was expected to specify hardware choices for CPU, RAM, storage, and NICs, propose a network topology, justify a storage strategy, and enumerate tradeoffs and failure points.
Gemini finished first, proposing three refurbished office PCs for the Proxmox cluster and selecting Ceph for storage. It also sensibly recommended putting network equipment on a UPS. However, it failed to recognize that mini PCs lack the physical expansion room for the networking cards and discrete GPU options it simultaneously recommended for Jellyfin and local LLM workloads.
ChatGPT likewise opted for three used small-form-factor boxes rather than rack-mounted hardware, which would have blown the noise and power budgets. It picked an Nvidia GeForce RTX 4060 8 GB for LLM inference — a reasonable fit for the size and power envelope — but honestly noted that 8 GB of VRAM would limit model size and performance. It also chose Ceph for storage, provided a network diagram with VLAN assignments, and smartly offloaded the Jellyfin media library to external storage rather than burdening the Ceph cluster, noting that Intel iGPUs are sufficient for Jellyfin transcoding.
Claude took the most measured approach, immediately identifying the 40 dB noise ceiling and the 400 W power budget as the dominant constraints. It offered two paths for local LLM inference: an eGPU enclosure on one node with an RTX 3060-class card, or quantized models running on the CPU, such as Llama 3.1 8B. It then surprised with a third option: a dedicated Mac mini M4 as an inference appliance. Claude also delivered a full price breakdown, catalogued seven specific failure points, and went further by including an unsolicited "What I'd build differently with more money" section outlining a $4,500 alternative build. Overall, Claude produced the most actionable and thoroughly justified home lab design of the three.
All three models missed one detail: none mentioned that Proxmox can be used without a subscription license, a fact that would have freed up a meaningful portion of the $3,000 budget.
Ambiguity and clarification handling
The second test was designed to see whether any model would recognize an under-specified problem and ask for more information rather than rushing to a solution. The prompt was deliberately vague: "I have a network problem: everything is slow sometimes, but only at night, and only certain devices are affected. Fix it."
Both Gemini and ChatGPT immediately jumped to troubleshooting steps and possible causes — an instant fail against the test criteria. Claude, by contrast, opened with "Stop. That problem statement is doing a lot of hiding, and the right move is to slow down before reaching for fixes." It then delivered several paragraphs of targeted diagnostic questions covering potential culprits such as scheduled backups, ISP congestion, neighbor network activity, and device-specific firmware issues. This response alone shifted the competitive standings significantly.
Precision and instruction following
The third task tested each model's ability to follow tightly constrained instructions. The prompt asked for a shell script with specific requirements: monitor CPU and RAM usage every five seconds, log to /var/log/resource_monitor.log, rotate logs when the file exceeds 5 MB, send a desktop notification via notify-send if CPU exceeds 90 percent for three consecutive checks, and remain POSIX-compliant, with no bashisms and no external dependencies beyond coreutils. Each script was to include inline comments, and no explanatory text was to appear outside the script itself.
All three models produced working scripts of varying length and commenting style. On line-by-line review, Claude's script stood out for a specific reason: it included a comment noting that notify-send is not part of coreutils and guarded the call so the script would degrade gracefully if the tool were absent. The other two models invoked notify-send without flagging the external dependency. The instruction forbidding any text outside the script may have kept them from raising the issue, which would make this closer to a three-way tie, but Claude's defensive coding instinct was still a meaningful differentiator.
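For readers curious what that guarded call looks like in practice, here is a minimal POSIX sh sketch of the rotation and alert logic. It is a hypothetical illustration, not any model's actual output: the log path, interval, and thresholds come from the prompt, while the function names, the stderr fallback, and the streak reset after an alert are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical sketch of the monitoring logic from the prompt (not a model's output).

LOG=/var/log/resource_monitor.log
MAX_BYTES=$((5 * 1024 * 1024))   # rotate once the log exceeds 5 MB
THRESHOLD=90                     # CPU percentage that counts as a breach
STREAK=0                         # consecutive checks above the threshold

rotate_log() {
    # wc -c gives the byte count; keep one previous generation on rotation
    if [ -f "$LOG" ] && [ "$(wc -c < "$LOG")" -gt "$MAX_BYTES" ]; then
        mv "$LOG" "$LOG.1"
    fi
}

notify() {
    # notify-send is not part of coreutils; guard it so the script
    # degrades gracefully when it is absent (mirroring Claude's approach).
    if command -v notify-send >/dev/null 2>&1; then
        notify-send "Resource monitor" "$1"
    else
        printf '%s\n' "$1" >&2
    fi
}

check_cpu() {
    # $1 is the current CPU usage as an integer percentage
    if [ "$1" -gt "$THRESHOLD" ]; then
        STREAK=$((STREAK + 1))
    else
        STREAK=0
    fi
    if [ "$STREAK" -ge 3 ]; then
        notify "CPU above ${THRESHOLD}% for 3 consecutive checks"
        STREAK=0   # design choice: re-arm the alert after firing
    fi
}

# Main loop (every five seconds, per the prompt); get_cpu_usage is a
# placeholder for whatever sampling method the script actually uses:
# while :; do
#     rotate_log
#     check_cpu "$(get_cpu_usage)"
#     sleep 5
# done
```

The key detail is the `command -v notify-send` guard: on a headless box, or one without libnotify installed, the script still logs the alert instead of dying mid-loop.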
Hallucination detection
The final test used fabricated technology names — ZFS ARC v2, Btrfs RAID-Z3 mode, and Kubernetes native CephFS v4 driver — each constructed by appending authoritative-sounding modifiers to real technologies. The prompt asked each model to explain the differences, use cases, and limitations, and to explicitly flag any terms that do not exist.
All three models correctly identified that the specific terms were not real. Gemini provided detailed explanations of the actual technologies and asked a follow-up question about whether the user was building a NAS. ChatGPT similarly corrected the record and stated what it would recommend instead. Claude went a step further: after its own corrections, it named the underlying pattern — combining a real technology name with a plausible-sounding but fictitious version or mode modifier — and noted that this signature is a reliable indicator of either AI-generated hallucinations or unverified technical jargon. Claude framed this not as a gotcha but as a practical guardrail for evaluating vendor claims or documentation where precision in storage architecture directly affects data integrity.
Verdict
Across all four tests, Claude consistently demonstrated the strongest combination of technical depth, intellectual honesty, and respect for the user's intent. It was the only model that paused to ask questions when faced with ambiguity, the most thorough in justifying design decisions, and the most insightful in detecting fabricated terminology. ChatGPT delivered competent and practical output, particularly in the home lab and scripting tests, but tended to assume expertise rather than question it. Gemini was fast and broadly capable but missed physical-world constraints and defaulted to over-explaining rather than probing for missing information.
The author's conclusion is that only Claude is worth the subscription cost for cloud-based LLM tasks that exceed what local models can handle. Gemini tries too hard to be useful by thinking for the user, while ChatGPT behaves as though it is the only expert in the room. For this particular set of real-world use cases, Claude is the clear choice to keep.
FAQ
Which LLM performed best in the home lab design test?
Claude, which produced the most actionable and thoroughly justified design, complete with a price breakdown, seven specific failure points, and an unexpected Mac mini M4 inference option.
How did the models handle the ambiguous network troubleshooting prompt?
Only Claude paused to ask clarifying diagnostic questions; Gemini and ChatGPT jumped straight to troubleshooting steps, failing the test's criteria.
What was the hallucination trap test, and which model handled it best?
Each model was asked to explain fabricated terms such as "ZFS ARC v2." All three flagged them as nonexistent, but Claude went furthest by naming the underlying fabrication pattern.
Prepared by the editorial stack from public data and external sources.