Running a Local Coding Agent with Qwen3-Coder-Next


23 Feb 2026 · Adam Lewis

Open-source coding models have gotten seriously good. Qwen3-Coder-Next is an 80B-parameter Mixture-of-Experts model that activates only 3B parameters per token, and according to Qwen’s own benchmarks it scores competitively with Claude Sonnet 4.5 on SWE-Bench Verified. That means a model you can run on consumer hardware now matches, or at least approaches, one of the best proprietary coding models available. I wanted to see how far I could push it on my desktop: an RTX 3060 12GB, a Ryzen 9 7950X, and 128GB of DDR5 RAM.

Here’s what I did, what worked, what didn’t, and what I learned along the way.

Step 1: Building llama.cpp with CUDA Support

The first step was getting llama.cpp built locally with CUDA support. llama.cpp is the go-to inference engine for running quantized models on consumer hardware, and it has native support for serving an Anthropic-compatible API — which matters a lot when you want to plug it into coding agent frontends.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Nothing too complicated, but you do need the CUDA toolkit installed. Once built, you get llama-server which can serve models over HTTP with both OpenAI and Anthropic-compatible API endpoints.
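Before wiring anything else up, it’s worth confirming the server actually responds. A quick sanity check with curl, assuming the default port of 8080 (the `/health` and `/v1/chat/completions` routes are llama-server’s standard endpoints; the model name in the payload is whatever the server loaded, so any value works for a single-model server):

```shell
# Is the server up and the model loaded?
curl -s http://localhost:8080/health

# OpenAI-compatible chat completion against the loaded model:
curl -s http://localhost:8080/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{
          "model": "Qwen3-Coder-Next",
          "messages": [{"role": "user", "content": "Write hello world in C."}],
          "max_tokens": 128
        }'
```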

Step 2: First Run (and Disappointing Results)

I grabbed the Q4_K_M quantization of Qwen3-Coder-Next (~48GB on disk) and fired up the server with some basic settings. The initial results were… fine. Around 77 tokens/second for prompt processing and 23 tokens/second for generation. Usable, but I had a feeling the hardware could do better.

The key variable I needed to tune was -ncmoe — this controls how many of the model’s 48 MoE layers have their experts processed on CPU versus GPU. With only 12GB of VRAM and a 48GB model, most of the expert computation has to happen on CPU, but the question was: exactly how much can we push onto the GPU before running out of memory?

Step 3: The Parameter Sweep

Rather than guess, I wrote a series of benchmark scripts to systematically sweep through different configurations. I tested:

  • ncmoe values from 40 to 48 (how many MoE layers stay on CPU)
  • Batch and ubatch sizes (how the prompt gets chunked for processing)
  • Thread counts (8, 16, 32)
  • KV cache quantization (f16 vs q8_0)
  • Context sizes from 32K up to 256K
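My actual scripts are messier than this, but the core of the sweep is just a nested loop over configurations. A minimal sketch, assuming a llama-bench build recent enough to understand --n-cpu-moe (the model path is a placeholder, and the grid below is a trimmed version of what I ran):

```shell
#!/usr/bin/env bash
# Sweep ncmoe and ubatch combinations; llama-bench's built-in pp/tg
# measurements do the timing. MODEL is a placeholder path.
MODEL="${MODEL:-./Qwen3-Coder-Next-Q4_K_M.gguf}"

for ncmoe in 40 41 42 44 46 48; do
    for ub in 512 2048 4096; do
        cmd="llama-bench -m $MODEL -ngl 99 -t 8 \
             --n-cpu-moe $ncmoe -ub $ub -b 4096 -p 8192 -n 128"
        if command -v llama-bench >/dev/null && [ -z "$DRY_RUN" ]; then
            eval "$cmd"
        else
            echo "$cmd"    # dry run: print what would be measured
        fi
    done
done
```

With DRY_RUN=1 (or without llama-bench on PATH) it just prints the 18 commands, which is handy for eyeballing the grid before committing to a multi-hour run.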

Some highlights from the results:

Config                      | Prompt processing | Token generation | Notes
----------------------------|-------------------|------------------|-----------------------------
ncmoe=41 (best perf)        | 323 t/s           | 25.2 t/s         | +7% pp, +16% tg vs baseline
ncmoe=48 (all CPU MoE)      | 302 t/s           | 21.8 t/s         | Safest, frees most VRAM
ncmoe=48, ub=4096, pp8192   | 631 t/s           | 24.7 t/s         | Sweet spot for long prompts

The most interesting finding was about VRAM management. When you need to free up VRAM (say, for a large context window), you have two options: shrink the ubatch size or push more MoE layers to CPU. Pushing layers to CPU is dramatically better — bumping ncmoe from 42 to 48 freed nearly 6GB of VRAM with only a 3% performance hit, while cutting ubatch from 2048 to 512 freed less than 1GB and cost 45% of prompt processing speed.

I also confirmed that 128K context works on the RTX 3060 (using q8_0 KV cache quantization), and 160K is the practical maximum. 256K just OOMs.

Step 4: Plugging It into Claude Code

Here’s where it gets interesting. llama-server has native Anthropic API support, so you can point Claude Code directly at it:

claude-local() {
    ANTHROPIC_BASE_URL="http://localhost:8080" \
    ANTHROPIC_API_KEY="sk-no-key-required" \
    CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
    claude --model "Qwen3-Coder-Next" "$@"
}

But notice that CLAUDE_CODE_ATTRIBUTION_HEADER=0 — that was a critical discovery.

The Prefix Caching Problem

When running with a single slot (-np 1), llama-server keeps the KV/recurrent state from the previous request. If the next request starts with the same token sequence, it only needs to process the new tokens at the end. This is prefix caching, and for a coding agent conversation where the system prompt is 20K+ tokens, it’s the difference between a 40-second wait and a 200-millisecond one on every single turn.

The problem: Claude Code injects a dynamic billing/attribution header into the beginning of the system prompt on every request. This header includes a content hash that changes each turn, which means the tokenized prefix is different every time and the server can never reuse its cached state.

The fix is simple — CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header when talking to a local server. This is only relevant for local use; the header is just analytics when talking to Anthropic’s actual API.

You can verify it’s working by watching the server logs:

# Cache hit — only new tokens to process:
prompt eval time = 193.48 ms / 14 tokens

# Cache miss — full re-processing:
prompt eval time = 43204.96 ms / 20637 tokens
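The two numbers in those lines are enough to classify every turn automatically. A small awk sketch, fed here by the sample lines above for self-containment (in practice you would pipe the real server log; the under-100-tokens threshold for calling something a cache hit is my own heuristic):

```shell
# Classify each turn as a prefix-cache hit or miss based on how many
# prompt tokens actually had to be evaluated.
awk '/prompt eval time/ {
    ms = $5; toks = $8;   # "prompt eval time = <ms> ms / <toks> tokens"
    printf "%6d tokens, %10.2f ms -> %s\n", toks, ms,
           (toks < 100 ? "cache hit" : "cache miss")
}' <<'EOF'
prompt eval time =     193.48 ms /    14 tokens
prompt eval time =   43204.96 ms / 20637 tokens
EOF
```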

Step 5: Trying the Pi Coding Agent

While Claude Code works, it has a heavy system prompt (~20K+ tokens) and relies on tool-calling patterns that aren’t always a natural fit for Qwen3-Coder-Next. Community users have reported the model getting stuck in loops with Claude Code’s XML-based tool format.

So I tried Pi, a lighter-weight terminal-based coding agent. Pi’s system prompt is much smaller (~1.3K tokens), and prefix caching works out of the box with no configuration needed. Setup was straightforward:

npm install -g @mariozechner/pi-coding-agent

Configure a local model in ~/.pi/agent/models.json, add a bash alias, and you’re off:
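For reference, the entry I added looks roughly like this. This is an illustrative sketch only: the field names are my assumptions about pi’s schema, not documented values, so check pi’s own docs for the exact shape. The only details carried over from this post are the base URL, the placeholder API key, and the model name:

```json
{
  "llama-local": {
    "baseUrl": "http://localhost:8080/v1",
    "apiKey": "sk-no-key-required",
    "models": ["Qwen3-Coder-Next"]
  }
}
```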

alias pi-local='pi --provider llama-local --model Qwen3-Coder-Next'

Pi felt noticeably snappier than Claude Code on the local model — partly because of the smaller system prompt, and partly because it avoids the tool-calling translation layer that can trip up non-Claude models.

The Final Setup

After all the tuning, here’s what I landed on:

llama-server \
    -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M \
    -ngl 99 -ncmoe 48 -fa on -t 8 \
    -c 131072 -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 -np 1 \
    --jinja --host 0.0.0.0 --port 8080 \
    --temp 0

I set it up as a systemd user service with bash aliases for easy management (llama-start, llama-stop, llama-logs, etc.) so it’s on-demand — no resources wasted when I’m not using it.
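For anyone replicating that last step, a minimal user unit looks something like this; the ExecStart path to your llama.cpp build is an assumption (adjust as needed), and the flags are just the final command from above. Save it as ~/.config/systemd/user/llama-server.service:

```ini
[Unit]
Description=llama.cpp server (Qwen3-Coder-Next)

[Service]
# %h expands to the user's home directory; build path is an assumption.
ExecStart=%h/llama.cpp/build/bin/llama-server \
    -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M \
    -ngl 99 -ncmoe 48 -fa on -t 8 \
    -c 131072 -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 -np 1 \
    --jinja --host 0.0.0.0 --port 8080 --temp 0
Restart=on-failure

[Install]
WantedBy=default.target
```

The aliases then reduce to systemctl --user calls, e.g. alias llama-start='systemctl --user start llama-server'.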

Honest Assessment

A caveat worth mentioning: despite the strong benchmark numbers, community experience (including my own) suggests that Qwen3-Coder-Next in practice feels more like Claude Haiku than Sonnet for complex multi-step reasoning tasks. It’s great for quick prototyping, simpler coding tasks, and situations where you want local/private inference. But if you’re expecting Claude Sonnet-level performance on hard refactors, temper your expectations.

That said, the fact that you can run a model with competitive benchmark scores on a consumer desktop — for free, with no API costs, and with full privacy — is remarkable. And the inference speed (~25 tokens/second for generation, 460-630 tokens/second for prompt processing) is genuinely usable for real work.

Resources