Ollama's new MLX engine doubles local LLM speed on MacBook Air M5

At a glance:
- Ollama's MLX engine doubles inference speed on a MacBook Air M5 with 16GB RAM.
- The engine improves memory usage and adds NVIDIA NVFP4 quantization, cutting quality loss in half for models like Gemma 4 12B.
- Agent workflows gain a snapshot system that reduces redundant context processing for coding assistants such as Claude Code, OpenClaw, and Aider.

## Performance boost on Apple Silicon
Ollama's new MLX engine changes how local LLMs run on Apple Silicon Macs. Previously, running models on a MacBook Air M5 with 16GB RAM often slowed down the entire system because of heavy memory and compute demands. After upgrading to the MLX engine, the author observed that inference became almost twice as fast, making the notebook feel far more responsive during everyday tasks.

The engine leverages Apple's unified memory architecture, which lets the CPU and GPU share the same memory pool without costly data copies. MLX's just-in-time compiler combines multiple GPU operations into larger Metal kernels, and sampling now runs on the GPU as well, so unnecessary memory movement is cut and token generation speeds up. Ollama itself claims roughly 20% higher output speed than the previous Q4_K_M implementation, which the author says is consistent with daily use.

## Quality improvements with NVFP4 quantization
Beyond raw speed, the MLX engine now supports NVIDIA's model-optimized NVFP4 quantization format. In Ollama's own tests with Gemma 4 12B, NVFP4 reduces quality loss by about half compared to the widely used Q4_K_M while keeping memory usage similar. The benchmark shows lower perplexity, indicating the model behaves closer to its original BF16 version.

For users of smaller models on memory-constrained Macs, this means stronger outputs without needing more RAM. The author notes that generated code follows instructions more consistently, follow-up prompts need fewer corrections, and conversations stay coherent longer, which reduces the time spent rewriting prompts. These quality gains complement the speed improvements, making local AI more practical for daily development work.

## Agent workflow enhancements via snapshot system
Modern coding assistants constantly resend large contexts (system prompts, tool definitions, conversation history, and loaded files) on each tool call. Traditional prefix caching only helps when each request directly follows the previous one, but agents often branch into sub-agents, retry failed steps, or drop reasoning tokens, forcing the model to rebuild the same context repeatedly. This inefficiency adds latency and wastes compute resources during interactive sessions.

Ollama addresses this with a new snapshot system that stores reusable model states at key points in a conversation. Separate agent sessions can resume from these saved states instead of reconstructing everything from scratch, and thinking models benefit because snapshots preserve useful states before reasoning tokens disappear from the visible history. The result is faster tool execution and smoother multi-agent workflows.

Coding assistants that benefit include:

- Claude Code
- OpenClaw
- Aider

Ollama is a lot better now. The new update improves everything you use local LLMs for, whether it's chatting with a model or using it as a coding assistant. My own local workflows feel much quicker because repeated tool calls no longer spend as much time rebuilding context. Faster response times, combined with better output quality, make the new MLX engine one of the most worthwhile upgrades I have made to my local AI setup.

Editorial SiliconFeed is an automated feed: facts are checked against sources; copy is normalized and lightly edited for readers.

FAQ

What is Ollama's new MLX engine and how does it improve performance on Mac?
Ollama's MLX engine is an updated inference backend built on Apple's unified memory architecture and the Metal framework. It reduces unnecessary data movement between CPU and GPU by combining GPU operations into larger kernels via MLX's just-in-time compiler. On a MacBook Air M5 with 16GB RAM, the author observed inference becoming almost twice as fast; Ollama's own claim is roughly 20% higher output speed than the previous Q4_K_M implementation.
How does the MLX engine's support for NVIDIA's NVFP4 quantization affect model quality and memory usage?
The MLX engine now supports NVIDIA's NVFP4 format, a 4-bit quantization scheme that keeps the memory footprint low while preserving quality better than older 4-bit methods. In Ollama's tests with Gemma 4 12B, NVFP4 cut quality loss by about half relative to the widely used Q4_K_M while keeping memory usage similar. Benchmarks showed lower perplexity, indicating outputs closer to the original BF16 model, and the author noted more consistent code generation and fewer corrections needed.
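
To make the idea of 4-bit block quantization concrete, here is a minimal Python sketch. It is not Ollama's or NVIDIA's implementation. It assumes the publicly described NVFP4 layout of 16-value blocks whose entries are rounded to the eight E2M1 magnitudes, and it simplifies by keeping each block scale as an ordinary Python float instead of an FP8 value. All function names are hypothetical.

```python
import math
import random

# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # assumption: 16-value blocks, per public NVFP4 descriptions


def quantize_block(block):
    """Return (scale, codes); each code is a signed E2M1 level."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest level (6.0).
    # Simplification: the scale stays a Python float, not an FP8 value.
    scale = max_abs / E2M1_LEVELS[-1] if max_abs > 0 else 1.0
    codes = []
    for x in block:
        target = abs(x) / scale
        level = min(E2M1_LEVELS, key=lambda v: abs(v - target))
        codes.append(math.copysign(level, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Rebuild approximate weights from a block scale and its codes."""
    return [c * scale for c in codes]


def roundtrip(weights):
    """Quantize and dequantize a flat list of weights block by block."""
    out = []
    for i in range(0, len(weights), BLOCK_SIZE):
        scale, codes = quantize_block(weights[i:i + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


def rms_error(a, b):
    """Root-mean-square difference between two equal-length lists."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)) / len(a))


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 0.02) for _ in range(64)]  # made-up weights
    restored = roundtrip(weights)
    print(f"RMS round-trip error: {rms_error(weights, restored):.6f}")
```

Under the assumed layout, a 4-bit code plus one 8-bit scale per 16 values would come to roughly 4 + 8/16 = 4.5 bits per weight, which is why memory use can stay in the same range as other 4-bit formats. The round-trip error printed here is only a toy stand-in for the perplexity comparison mentioned above.
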
In what ways does the updated MLX engine benefit coding assistants and agent workflows?
Modern AI agents repeatedly resend large contexts like system prompts and conversation history, which traditional prefix caching cannot handle efficiently when agents branch or retry. Ollama's new snapshot system stores reusable model states at key points in a conversation, allowing agents to resume from those states instead of rebuilding everything. This speeds up tool calls for assistants such as Claude Code, OpenClaw, and Aider, and also helps thinking models by preserving useful states before reasoning tokens disappear.
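
The difference between a single-slot prefix cache and stored snapshots can be illustrated with a toy Python model. This is a sketch under stated assumptions, not Ollama's implementation: contexts are modeled as lists of named segments with made-up token counts, the "traditional" cache only remembers the previous request, and the snapshot store (a hypothetical `SnapshotStore` class) records every segment boundary it has seen. A real system would store model state at those points and would have to bound memory use.

```python
def cost(segments):
    """Total tokens in a list of (name, token_count) segments."""
    return sum(n for _, n in segments)


def common_prefix(a, b):
    """Number of leading segments that two requests share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


class SnapshotStore:
    """Toy store that remembers which context prefixes have a saved state."""

    def __init__(self):
        self._saved = set()

    def save_boundaries(self, request):
        # Assumption: a snapshot is taken at every segment boundary.
        for end in range(1, len(request) + 1):
            self._saved.add(tuple(request[:end]))

    def longest_match(self, request):
        """Length (in segments) of the longest saved prefix of request."""
        for end in range(len(request), 0, -1):
            if tuple(request[:end]) in self._saved:
                return end
        return 0


def tokens_processed(requests, use_snapshots):
    """Tokens the model must process across a sequence of requests."""
    store = SnapshotStore()
    last = []
    total = 0
    for req in requests:
        if use_snapshots:
            reused = store.longest_match(req)
        else:
            reused = common_prefix(last, req)  # single-slot prefix cache
        total += cost(req[reused:])
        store.save_boundaries(req)
        last = req
    return total


# Made-up session: the main agent works, a sub-agent branches off,
# then the main agent continues from its own earlier history.
SHARED = [("system prompt", 800), ("tool definitions", 1200)]
MAIN_1 = SHARED + [("user: fix the bug", 50), ("file: app.py", 3000)]
SUB_1 = SHARED + [("sub-agent: search repo", 40)]
MAIN_2 = MAIN_1 + [("tool result", 300)]
REQUESTS = [MAIN_1, SUB_1, MAIN_2]

if __name__ == "__main__":
    print("single-slot prefix cache:", tokens_processed(REQUESTS, False))
    print("snapshot store:          ", tokens_processed(REQUESTS, True))
```

With these illustrative numbers, the single-slot cache processes 8,440 tokens and the snapshot store 5,390. The saving comes entirely from the last request: after the sub-agent branch, the single-slot cache has forgotten the main agent's history, while the snapshot store can resume from it.
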
