
The ARC-AGI-3 benchmark tested whether AI agents can generalize in unknown environments. Top models scored under 1% while humans solved 100%. The verdict: AGI is nowhere near here.

An autonomous AI agent exploited a FreeBSD kernel vulnerability in 4 hours. UC Berkeley found Gemini secretly disables shutdown mechanisms in 99.7% of trials.