AI

How I Built a Free Local LLM Pipeline on a 10-Year-Old GTX 1080 with llama.cpp

At a glance:

  • A writer repurposed a decade-old GTX 1080 and a Ryzen 5 1600 system running Proxmox to build a fully local LLM inference pipeline using llama.cpp with Vulkan acceleration — no cloud API costs involved.
  • By switching from Ollama to llama.cpp and running Mixture of Experts models such as Gemma-4-26B-A4B, the setup achieved 15 tokens per second after tuning container memory from 8 GB to 24 GB, all on Pascal-era hardware.
  • The entire stack is Linux-based and self-hosted, keeping private files on the local network and integrating with FOSS tools including Open WebUI, Paperless-GPT, Blinko, Karakeep, and Claude Code.

Why Ollama Wasn't Enough

Ollama is, by most accounts, a fantastic entry point for anyone curious about running large language models locally. It abstracts away driver headaches, model management, and server configuration into a single CLI-friendly tool, and it remains a rock-solid starting point for newcomers to the self-hosted AI ecosystem. For the author — a PC hardware and gaming writer who had already been running simple Proxmox workloads on a Ryzen 5 1600 paired with 32 GB of DDR4 memory — Ollama was the original plan.

However, Ollama has real limitations once you move past casual experimentation. It lacks many of the fine-grained settings that power users need for serious LLM work, it can be slow to add support for newly released models, and its performance ceiling on older hardware is noticeable. Those shortcomings pushed the author toward llama.cpp, an open-source inference engine that offers far more granular control over GPU offloading, thread allocation, and model quantization. The trade-off is a significantly steeper setup process, but for someone willing to get their hands dirty in the terminal, the payoff is substantial.

Preparing the Hardware: GPU Passthrough to an LXC Container

Rather than running llama.cpp directly on bare metal, the author chose to spin up an LXC container inside Proxmox and pass the GTX 1080 through to it. Virtual machines were ruled out because the additional abstraction layers would introduce unnecessary bottlenecks for an already constrained GPU. The LXC container (ID 100) was configured with ample storage and system resources, and GPU passthrough was achieved by editing /etc/pve/lxc/100.conf via the Proxmox shell and pasting the following parameters:

lxc.cgroup2.devices.allow: c 195:* rwm
lxc.cgroup2.devices.allow: c 235:* rwm
lxc.cgroup2.devices.allow: c 237:* rwm
lxc.mount.entry: /dev/nvidia0 dev/nvidia0 none bind,optional,create=file
lxc.mount.entry: /dev/nvidiactl dev/nvidiactl none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm dev/nvidia-uvm none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-uvm-tools dev/nvidia-uvm-tools none bind,optional,create=file
lxc.mount.entry: /dev/nvidia-modeset dev/nvidia-modeset none bind,optional,create=file

A note for anyone following the same path: the device IDs (195, 235, 237) are specific to this particular GPU. Running ls -l /dev/nvidia* on your own system will reveal the correct major and minor numbers for your card.

Installing Legacy NVIDIA Drivers for a Deprecated GPU

Nvidia officially dropped driver support for the entire Pascal lineup back in December 2025, which meant the author had to install a slightly older driver manually. The same set of commands had already been used on the Proxmox host, so they were simply repeated inside the freshly configured LXC:

wget https://us.download.nvidia.com/XFree86/Linux-x86_64/580.119.02/NVIDIA-Linux-x86_64-580.119.02.run
chmod +x NVIDIA-Linux-x86_64-580.119.02.run
./NVIDIA-Linux-x86_64-580.119.02.run

Inside the LXC container, there is one critical difference: the --no-kernel-modules flag must be appended to the final command. Without it, the installer attempts to compile kernel modules that an LXC container cannot support, and the process fails partway through.

The CUDA Dead End and the Pivot to Vulkan

With the NVIDIA drivers in place, the next challenge was compiling llama.cpp in a way that could actually detect and use the GPU. The author initially attempted the CUDA build path, since CUDA is generally considered the gold standard for GPU-accelerated inference on Nvidia hardware. Unfortunately, installing the CUDA toolkit on this aging setup turned into a nightmare of incompatible packages. Even after painstaking manual configuration, llama.cpp stubbornly refused to recognize the CUDA installation. After hours of troubleshooting, the author reverted the LXC to an earlier snapshot and decided to pivot entirely.

The Vulkan backend of llama.cpp is a less-traveled path, but it proved dramatically easier to set up. The following command installed all the necessary Vulkan libraries and build tools:

apt install glslc glslang-tools libvulkan1 vulkan-tools libvulkan-dev spirv-tools build-essential git cmake curl

Next, the Vulkan ICD (Installable Client Driver) needed to be configured so that the runtime could find the Nvidia driver. The author created the file /usr/share/vulkan/icd.d/nvidia_icd.json with the following contents:

{"file_format_version" : "1.0.0","ICD": {"library_path": "libGLX_nvidia.so.0","api_version" : "1.3"}}

With Vulkan and the prerequisites in place, compiling llama.cpp was straightforward:

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
cmake -B build -DGGML_VULKAN=ON
cmake --build build --config Release -j$(nproc)

The entire build took roughly four to five minutes on the Ryzen 5 1600 — a modest but workable compile time.

Running Mixture of Experts Models on Weak Hardware

The real motivation behind this project was not just to run any local LLM, but to run large ones. The author had recently encountered Mixture of Experts (MoE) architectures and found them to be a game-changer for resource-constrained hardware. Unlike dense models that activate every parameter layer for each token, MoE models route different inputs to specialized sub-networks called "experts." This means the GPU only needs to hold the attention mechanisms and the most frequently used experts in VRAM, while less-active experts can be offloaded to system RAM. The result is access to a much larger knowledge base without the catastrophic token-generation slowdowns that plague dense models on limited VRAM.

For the first experiment, the author chose Gemma-4-26B-A4B — a 26-billion-parameter model with a 4-bit activation per token, selected partly for its strong reputation and partly as a change of pace from Qwen3.6-35B-A3B, which was already on the testing roadmap. The model was launched with the following server command:

./llama-server -m "/root/models/gemma-4-26B-A4B-it-Q4_K_M.gguf" -c 65536 -ngl 999 --n-cpu-moe 40 -t 6 -b 2048 -ub 2048 --no-mmap --host 0.0.0.0 --port 8082

The --n-cpu-moe 40 flag was the critical piece — it tells llama.cpp to keep 40 experts on the CPU (system RAM) rather than trying to load them onto the GPU, which is what makes running a model this size feasible on a GTX 1080 with only 8 GB of VRAM.

A Painful Bottleneck — and the Fix

Within seconds of launching, the llama.cpp server was active and responding through its built-in web UI. Initial results, however, were underwhelming: token generation sat at just 2.5 to 3 tokens per second, far below expectations. After further troubleshooting, the root cause was identified — the LXC container had been allocated only 8 GB of system memory. With the GPU's VRAM already saturated and system RAM maxed out, the model was forced to page to disk storage, causing a severe performance penalty.

Increasing the LXC container's RAM allocation to 24 GB and restarting the server transformed the experience. Gemma-4-26B-A4B jumped to approximately 15 tokens per second — a roughly fivefold improvement. For context, this is a significant step up from the DeepSeek R1 7B model the author had previously run through Ollama on the same machine, and it was achieved on a GPU that is ten years old as of 2026.

Plugging Into the FOSS Ecosystem

A local LLM server is only as useful as the tools that connect to it. Over the following hours, the author wired the llama.cpp server into a curated stack of free and open-source applications:

  • Open WebUI — a ChatGPT-style web interface for interacting with local models
  • Paperless-GPT — an AI-powered document management system
  • Blinko — a knowledge management and note-taking tool
  • Karakeep — a lightweight journaling and memory app
  • VS Code — for development workflows augmented by local AI assistance
  • Claude Code — an agentic coding tool that can be pointed at local inference endpoints

This combination effectively replicates much of what cloud-based AI platforms offer, with the crucial difference that no prompts, documents, or personal data ever leave the local network.

The Cost of Running It All

Beyond the initial experiment, the ongoing operational cost of this setup is negligible. The GTX 1080 was purchased years ago and has been repurposed rather than bought new. The underlying Ryzen 5 1600 system was already running 24/7 for Proxmox workloads, so there is no additional hardware expense. Energy consumption is the only recurring cost, and it is minimal: the author notes that LLM tasks cause the GPU to spike in short bursts rather than sustain heavy loads, and most inference jobs complete within seconds. The system idles for the vast majority of its uptime, and the author has already optimized the CPU scaling governor and other power-related settings to reduce baseline wattage.

What Comes Next

The author plans to test Qwen3.6-35B-A3B over the coming days, expecting it to work with a few parameter adjustments. That model is even larger than Gemma-4-26B-A4B, so further tuning of the CPU expert offloading and RAM allocation will likely be necessary. The broader takeaway is encouraging: with the right combination of open-source tools, a Vulkan-compatible inference engine, and Mixture of Experts architectures, genuinely capable local LLM inference is within reach of anyone with a modest GPU and a willingness to tinker.

As cloud API prices continue to climb and privacy concerns around commercial AI platforms grow, self-hosted pipelines like this one represent a practical — and nearly free — alternative.

Editorial SiliconFeed is an automated feed: facts are checked against sources; copy is normalized and lightly edited for readers.

FAQ

What GPU and system specs does this local LLM setup require?
The setup runs on a GTX 1080 (Pascal-era, 8 GB VRAM) paired with a Ryzen 5 1600 CPU and 32 GB of DDR4 system memory. The operating system is Linux under Proxmox, with llama.cpp running inside an LXC container that was allocated 24 GB of RAM for optimal performance. Even with decade-old hardware, the pipeline achieves around 15 tokens per second on Gemma-4-26B-A4B.
Why use llama.cpp instead of Ollama for local LLM inference?
Ollama is beginner-friendly but lacks fine-grained configuration options, lags in adding support for newer models, and underperforms on constrained hardware. llama.cpp offers deeper customization — including Vulkan GPU acceleration, explicit Mixture of Experts CPU offloading via flags like --n-cpu-moe, and precise control over context length and thread allocation — making it better suited for running larger models on older GPUs.
What are Mixture of Experts models and why do they matter for older GPUs?
Mixture of Experts (MoE) models activate only a subset of their parameters for each input token, routing different tasks to specialized sub-networks called experts. This allows less-frequently used experts to reside in system RAM rather than GPU VRAM, enabling large models like Gemma-4-26B-A4B (26 billion parameters) to run on a GPU with only 8 GB of VRAM while still maintaining usable token generation speeds.

More in the feed

Prepared by the editorial stack from public data and external sources.

Original article