AI

Local LLM gets dumber over time on RTX 5090 due to context window and memory issues

At a glance:

  • Running Qwen 3.6 27B at 256K context on an RTX 5090 causes performance degradation, not model deterioration
  • The hybrid architecture reduces KV cache requirements, but still exceeds 32GB VRAM when including system overhead
  • Memory spilling to system RAM over PCIe creates slowdowns that make the model appear less capable

The experiment that revealed the problem

While local large language models have reached a point where they're useful for most coding tasks, the author discovered an unexpected issue when running Qwen 3.6 27B at Q4_K_M quantization inside LM Studio on an Nvidia RTX 5090. Over extended conversations, the model appeared to become less coherent, with answers drifting, token generation slowing, and overall performance declining even during idle periods.

This wasn't an isolated incident tied to a specific model or server setup. Testing with smaller models and alternative setups like vLLM produced the same degradation patterns. Initial suspicion fell on context length calculations, but the root cause proved to be more nuanced than simple napkin math would suggest.

The KV cache miscalculation

The conventional wisdom about VRAM usage doesn't apply to Qwen 3.6 27B, which uses a hybrid transformer architecture. While standard transformers would require approximately 64GB for a 256K context window at fp16 precision, Qwen 3.6's architecture only applies full attention to 16 of its 64 layers. This dramatically reduces the KV cache requirement to roughly 16GB instead of the expected 64GB.

However, this doesn't solve the memory problem entirely. The model weights consume 16.8GB, and Windows 11 system overhead, browser processes, vision encoder, and CUDA buffering all compete for the remaining VRAM. The total memory footprint exceeds the RTX 5090's 32GB capacity, forcing the GPU driver to silently offload data to system RAM over the PCIe bus.

Context length KV cache (fp16) + 16.8GB weights Fits in 32GB?
262K (the setting) ~16 GB ~33 GB + overhead No — just over → spills
128K ~8 GB ~25 GB Yes
64K ~4 GB ~21 GB Yes, easily
32K ~2 GB ~19 GB Yes, lots of room

Why the model isn't actually getting worse

The fundamental misunderstanding here is attributing performance degradation to the model itself. The model weights remain frozen during inference—there's no continual learning or bad habit acquisition happening. What's deteriorating is the computational environment surrounding the model.

Context window management becomes critical as conversations extend. Each interaction feeds history back into the model, increasing token count and creating longer sequences for the transformer to process. This hits a fundamental limitation of transformer architectures: they're measurably worse at recalling information from the middle of long context windows, tending to focus on the beginning and end of conversations.

Qwen 3.6 compounds this issue as a reasoning model with hidden think traces that consume context rapidly. The 256K context window, which initially seems advantageous, becomes a liability as it fills with conversation history that the model struggles to effectively utilize.

The silent performance killer: Windows GPU drivers

Using an Nvidia GPU on Windows introduces a specific problem: the driver doesn't fail when VRAM becomes full. Instead of throwing an error, it silently offloads the excess memory to system RAM via PCIe. This creates a bottleneck that slows the entire system, making even an RTX 5090 perform like an underpowered machine.

This behavior differs significantly from what developers might expect based on server-grade GPU behavior, where memory pressure typically results in explicit failures rather than silent degradation.

The simple fix that works

The solution to this apparent intelligence degradation is surprisingly straightforward: restart the conversation or reload the model entirely. Opening a fresh chat, restarting LM Studio, or manually reloading the model clears the KV cache and restores performance.

This reveals that the issue isn't cumulative model degradation but rather cache saturation. As VRAM headroom disappears over time or across multiple sessions, the system increasingly relies on slower system memory, accelerating the performance decline.

While rebooting might seem like a primitive solution for AI systems, treating local LLMs as computational tools rather than magical intelligences leads to more practical usage patterns. Regular cache management becomes as important as any prompt engineering technique.

What this means for local LLM users

The degradation in local LLM performance over time stems from a combination of context window management, memory pressure, and system-level bottlenecks rather than any inherent model limitations. Users should expect that longer conversations will eventually hit performance walls, and that restarting sessions can restore optimal performance.

For those with unified memory architectures like Apple's M-series chips, these issues may be less pronounced since the unified memory design handles memory pressure more gracefully than discrete GPU setups with PCIe offloading.

The key takeaway is that local LLMs are tools with physical constraints, not infinitely scalable intelligences. Understanding these limitations—particularly around memory management and context handling—enables more effective usage patterns. Just as traditional software benefits from periodic restarts, local LLMs perform better when treated as finite computational resources rather than magical black boxes.

Editorial SiliconFeed is an automated feed: facts are checked against sources; copy is normalized and lightly edited for readers.

FAQ

Why does my local LLM seem to get dumber over time?
The model itself doesn't change—weights remain frozen during inference. Performance degradation comes from context window saturation, KV cache filling up, and memory spilling to system RAM over PCIe. As the conversation history grows, the transformer becomes less effective at recalling middle-context information, and when VRAM exceeds capacity, Windows silently offloads to slower system memory.
Does this happen with all local LLMs or just specific models?
This affects most local LLMs to varying degrees. The author tested Qwen 3.6 27B but saw similar patterns with smaller models and different setups like vLLM. The hybrid architecture of Qwen 3.6 actually helps reduce KV cache requirements compared to standard transformers, but the total memory footprint still exceeds 32GB when including system overhead.
What's the best way to prevent this performance degradation?
Start fresh conversations regularly, restart LM Studio, or manually reload the model to clear the KV cache. Setting context windows to 64K or 32K can also help stay within VRAM limits—the table in the article shows 64K uses ~4GB for KV cache plus 16.8GB for weights, fitting comfortably in 32GB. Users with Apple's unified memory systems may experience fewer issues.

More in the feed

Prepared by the editorial stack from public data and external sources.

Original article