I tested Claude, ChatGPT, and Gemini on the same tasks — here's which one wins
At a glance:
- Claude delivered the most detailed and reasoned home lab design, offered a surprise Mac mini M4 suggestion for inference, and was the only model that paused to ask clarifying questions when given an ambiguous prompt
- ChatGPT produced a solid network diagram and correctly flagged the 8 GB VRAM limitation of its recommended GPU, but assumed tool availability without hedging
- Gemini was the fastest to respond but overlooked critical physical constraints such as whether components would actually fit inside the mini PCs it recommended
The setup and motivation
The author had been juggling multiple AI subscriptions — Claude, ChatGPT, and Gemini — across tasks ranging from coding to file organization. Rather than continuing to pay for all three, the goal was to run identical, real-world prompts through each model and determine which single cloud-based LLM was worth keeping alongside local models. Each of the three LLMs was described as having a distinct character: Claude is more restrained, ChatGPT is a little more agreeable, and Gemini leans toward an oversugared enthusiasm with a fondness for emojis.
Four test categories were chosen to probe different capabilities: complex systems reasoning, ambiguity and clarification handling, precision in instruction following, and a deliberate hallucination trap. All prompts were drawn from scenarios the author had either recently completed or planned to tackle soon, ensuring that inaccuracies would be immediately obvious.
Complex systems reasoning: home lab design
The first task asked each model to design a complete home lab architecture under tight constraints: a 10 GbE backbone with mixed 2.5 GbE clients, a three-node Proxmox cluster with shared storage, support for Kubernetes, local LLM inference, and Jellyfin media streaming, all while staying under 400 W average draw, under 40 dB noise, and a $3,000 total budget (used hardware permitted). Each model was expected to specify hardware choices for CPU, RAM, storage, and NICs, propose a network topology, justify a storage strategy, and enumerate tradeoffs and failure points.
Gemini finished first, proposing three refurbished office PCs for the Proxmox cluster and selecting Ceph for storage. It also sensibly recommended putting network equipment on a UPS. However, it failed to recognize that mini PCs lack the physical expansion room for the networking cards and discrete GPU options it simultaneously recommended for Jellyfin and local LLM workloads.
ChatGPT likewise opted for three used small-form-factor boxes rather than rack-mounted hardware, which would have blown the noise and power budgets. It picked an Nvidia GeForce RTX 4060 8 GB for LLM inference — a reasonable fit for the size and power envelope — but honestly noted that 8 GB of VRAM would limit model size and performance. It also chose Ceph for storage, provided a network diagram with VLAN assignments, and smartly offloaded the Jellyfin media library to external storage rather than burdening the Ceph cluster, noting that Intel iGPUs are sufficient for Jellyfin transcoding.
Claude took the most measured approach, immediately identifying the 40 dB noise ceiling and the 400 W power budget as the dominant constraints. It offered two paths for local LLM inference: an eGPU enclosure on one node with an RTX 3060-class card, or quantized models running on the CPU, such as Llama 3.1 8B. It then surprised with a third option: a dedicated Mac mini M4 as an inference appliance. Claude also delivered a full price breakdown, catalogued seven specific failure points, and went further by including an unsolicited "What I'd build differently with more money" section outlining a $4,500 alternative build. Overall, Claude produced the most actionable and thoroughly justified home lab design of the three.
All three models missed one detail: none mentioned that Proxmox can be used without a subscription license, a fact that would have freed up a meaningful portion of the $3,000 budget.
Ambiguity and clarification handling
The second test was designed to see whether any model would recognize an under-specified problem and ask for more information rather than rushing to a solution. The prompt was deliberately vague: "I have a network problem: everything is slow sometimes, but only at night, and only certain devices are affected. Fix it."
Both Gemini and ChatGPT immediately jumped to troubleshooting steps and possible causes — an instant fail against the test criteria. Claude, by contrast, opened with "Stop. That problem statement is doing a lot of hiding, and the right move is to slow down before reaching for fixes." It then delivered several paragraphs of targeted diagnostic questions covering potential culprits such as scheduled backups, ISP congestion, neighbor network activity, and device-specific firmware issues. This response alone shifted the competitive standings significantly.
Precision and instruction following
The third task tested each model's ability to follow tightly constrained instructions. The prompt asked for a shell script with specific requirements: monitor CPU and RAM usage every five seconds, log to /var/log/resource_monitor.log, rotate logs when the file exceeds 5 MB, send a desktop notification via notify-send if CPU exceeds 90 percent for three consecutive checks, and remain POSIX-compliant, with no bashisms and no external dependencies beyond coreutils. Each script was to include inline comments, and no explanatory text was to appear outside the script itself.
All three models produced working scripts of varying length and commenting style. On line-by-line review, Claude's script stood out for a specific reason: it included a comment noting that notify-send is not part of coreutils and guarded the call so the script would degrade gracefully if the tool were absent. The other two models invoked notify-send without flagging the external dependency. The instruction forbidding any text outside the script may have kept them from raising the issue, which would make this closer to a three-way tie, but Claude's defensive coding instinct was still a meaningful differentiator.
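For readers curious what that guarded call looks like in practice, here is a minimal POSIX sh sketch of the rotation and alert logic. It is a hypothetical illustration, not any model's actual output: the log path, interval, and thresholds come from the prompt, while the function names, the stderr fallback, and the streak reset after an alert are assumptions made for this sketch.

```shell
#!/bin/sh
# Hypothetical sketch of the monitoring logic from the prompt (not a model's output).

LOG=/var/log/resource_monitor.log
MAX_BYTES=$((5 * 1024 * 1024))   # rotate once the log exceeds 5 MB
THRESHOLD=90                     # CPU percentage that counts as a breach
STREAK=0                         # consecutive checks above the threshold

rotate_log() {
    # wc -c gives the byte count; keep one previous generation on rotation
    if [ -f "$LOG" ] && [ "$(wc -c < "$LOG")" -gt "$MAX_BYTES" ]; then
        mv "$LOG" "$LOG.1"
    fi
}

notify() {
    # notify-send is not part of coreutils; guard it so the script
    # degrades gracefully when it is absent (mirroring Claude's approach).
    if command -v notify-send >/dev/null 2>&1; then
        notify-send "Resource monitor" "$1"
    else
        printf '%s\n' "$1" >&2
    fi
}

check_cpu() {
    # $1 is the current CPU usage as an integer percentage
    if [ "$1" -gt "$THRESHOLD" ]; then
        STREAK=$((STREAK + 1))
    else
        STREAK=0
    fi
    if [ "$STREAK" -ge 3 ]; then
        notify "CPU above ${THRESHOLD}% for 3 consecutive checks"
        STREAK=0   # design choice: re-arm the alert after firing
    fi
}

# Main loop (every five seconds, per the prompt); get_cpu_usage is a
# placeholder for whatever sampling method the script actually uses:
# while :; do
#     rotate_log
#     check_cpu "$(get_cpu_usage)"
#     sleep 5
# done
```

The key detail is the `command -v notify-send` guard: on a headless box, or one without libnotify installed, the script still logs the alert instead of dying mid-loop.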
Hallucination detection
The final test used fabricated technology names — ZFS ARC v2, Btrfs RAID-Z3 mode, and Kubernetes native CephFS v4 driver — each constructed by appending authoritative-sounding modifiers to real technologies. The prompt asked each model to explain the differences, use cases, and limitations, and to explicitly flag any terms that do not exist.
All three models correctly identified that the specific terms were not real. Gemini provided detailed explanations of the actual technologies and asked a follow-up question about whether the user was building a NAS. ChatGPT similarly corrected the record and stated what it would recommend instead. Claude went a step further: after its own corrections, it named the underlying pattern — combining a real technology name with a plausible-sounding but fictitious version or mode modifier — and noted that this signature is a reliable indicator of either AI-generated hallucinations or unverified technical jargon. Claude framed this not as a gotcha but as a practical guardrail for evaluating vendor claims or documentation where precision in storage architecture directly affects data integrity.
Verdict
Across all four tests, Claude consistently demonstrated the strongest combination of technical depth, intellectual honesty, and respect for the user's intent. It was the only model that paused to ask questions when faced with ambiguity, the most thorough in justifying design decisions, and the most insightful in detecting fabricated terminology. ChatGPT delivered competent and practical output, particularly in the home lab and scripting tests, but tended to assume expertise rather than question it. Gemini was fast and broadly capable but missed physical-world constraints and defaulted to over-explaining rather than probing for missing information.
The author's conclusion is that only Claude is worth the subscription cost for cloud-based LLM tasks that exceed what local models can handle. Gemini tries too hard to be useful by thinking for the user, while ChatGPT behaves as though it is the only expert in the room. For this particular set of real-world use cases, Claude is the clear choice to keep.
FAQ
Which LLM performed best in the home lab design test?
Claude, which produced the most actionable and thoroughly justified design, complete with a price breakdown, seven specific failure points, and an unexpected Mac mini M4 inference option.
How did the models handle the ambiguous network troubleshooting prompt?
Only Claude paused to ask clarifying diagnostic questions; Gemini and ChatGPT jumped straight to troubleshooting steps, failing the test's criteria.
What was the hallucination trap test, and which model handled it best?
Each model was asked to explain fabricated terms such as "ZFS ARC v2." All three flagged them as nonexistent, but Claude went furthest by naming the underlying fabrication pattern.
Prepared by the editorial stack from public data and external sources.