I'm Adam Lewis


👋 Hi, I'm a PhD Data Scientist currently at OpenTeams (formerly at Quansight), where I perform a wide range of data science tasks to meet our clients' needs. I've created algorithms enabling targeted advertising based on 50+ TB of geospatial data, built training material demonstrating Dask, Dask-ML, and distributed hyperparameter tuning on larger-than-memory datasets, and developed an optical character recognition pipeline that reduced manual entry of business data by 70%.

🎓 Before I joined Quansight, I earned my PhD at The University of Texas at Austin, where I used Python and Java to analyze terabytes of multimodal imaging data from an industrial 3D printing process, detecting in-situ defects via aggregation, registration, and visualization.

🌟 Outside of data science I enjoy disc golf 🥏, hiking 🥾, tennis 🎾, traveling ✈️, and I'm always up for a good board game. Check out my social media for more info!

📝 Recent Posts

Running a Local Coding Agent with Qwen3-Coder-Next

February 23, 2026

Open-source coding models have gotten seriously good. Qwen3-Coder-Next is an 80B parameter Mixture-of-Experts model that only activates 3B parameters per token, and according to Qwen’s own benchmarks, it scores competitively with Claude Sonnet 4.5 on SWE-Bench Verified. That’s a model you can run on consumer hardware matching (or at least approaching) one of the best proprietary coding models available. I wanted to see how far I could push it on my desktop — an RTX 3060 12GB, Ryzen 9 7950X, and 128GB of DDR5 RAM.

Here’s what I did, what worked, what didn’t, and what I learned along the way.

Step 1: Building llama.cpp with CUDA Support

The first step was getting llama.cpp built locally with CUDA support. llama.cpp is the go-to inference engine for running quantized models on consumer hardware, and it has native support for serving an Anthropic-compatible API — which matters a lot when you want to plug it into coding agent frontends.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Nothing too complicated, but you do need the CUDA toolkit installed. Once built, you get llama-server which can serve models over HTTP with both OpenAI and Anthropic-compatible API endpoints.
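
Once it's up, you can sanity-check the server with curl. Here's a tiny helper for building the endpoint URL (the host and port are assumptions that should match your serve command; `/v1/chat/completions` is llama-server's OpenAI-compatible chat endpoint):

```shell
# Build the chat-completions URL for a local llama-server instance.
llama_chat_url() {
  local host="${1:-localhost}" port="${2:-8080}"
  echo "http://${host}:${port}/v1/chat/completions"
}

echo "chat endpoint: $(llama_chat_url)"

# Live check (requires a running server):
# curl -s "$(llama_chat_url)" \
#   -H "Content-Type: application/json" \
#   -d '{"messages":[{"role":"user","content":"Say hello"}],"max_tokens":16}'
```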

Step 2: First Run (and Disappointing Results)

I grabbed the Q4_K_M quantization of Qwen3-Coder-Next (~48GB on disk) and fired up the server with some basic settings. The initial results were… fine. Around 77 tokens/second for prompt processing and 23 tokens/second for generation. Usable, but I had a feeling the hardware could do better.

The key variable I needed to tune was -ncmoe — this controls how many of the model’s 48 MoE layers have their experts processed on CPU versus GPU. With only 12GB of VRAM and a 48GB model, most of the expert computation has to happen on CPU, but the question was: exactly how much can we push onto the GPU before running out of memory?

Step 3: The Parameter Sweep

Rather than guess, I wrote a series of benchmark scripts to systematically sweep through different configurations. I tested:

  • ncmoe values from 40 to 48 (how many MoE layers stay on CPU)
  • Batch and ubatch sizes (how the prompt gets chunked for processing)
  • Thread counts (8, 16, 32)
  • KV cache quantization (f16 vs q8_0)
  • Context sizes from 32K up to 256K
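
The sweep itself is easy to script. Here's a dry-run sketch (flag names mirror my serve command; the ncmoe and ubatch grids are illustrative) that just prints each candidate command line; in the real sweep, each configuration gets launched, benchmarked against a fixed prompt, and shut down:

```shell
#!/usr/bin/env bash
# Dry-run sketch of the parameter sweep: enumerate configurations and
# print the llama-server command line for each one.
MODEL="Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M"
for ncmoe in 40 41 42 44 46 48; do
  for ub in 512 2048 4096; do
    echo "llama-server -hf $MODEL -ngl 99 -ncmoe $ncmoe -ub $ub -fa on -np 1 --port 8080"
    # Real sweep: start this server, send a fixed long prompt, parse
    # 'prompt eval time' from the logs, record t/s, then kill the server.
  done
done
```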

Some highlights from the results:

| Config | Prompt Processing | Token Generation | Notes |
|---|---|---|---|
| ncmoe=41 (best perf) | 323 t/s | 25.2 t/s | +7% pp, +16% tg vs baseline |
| ncmoe=48 (all CPU MoE) | 302 t/s | 21.8 t/s | Safest, frees most VRAM |
| ncmoe=48, ub=4096, pp8192 | 631 t/s | 24.7 t/s | Sweet spot for long prompts |

The most interesting finding was about VRAM management. When you need to free up VRAM (say, for a large context window), you have two options: shrink the ubatch size or push more MoE layers to CPU. Pushing layers to CPU is dramatically better — bumping ncmoe from 42 to 48 freed nearly 6GB of VRAM with only a 3% performance hit, while cutting ubatch from 2048 to 512 freed less than 1GB and cost 45% of prompt processing speed.

I also confirmed that 128K context works on the RTX 3060 (using q8_0 KV cache quantization), and 160K is the practical maximum. 256K just OOMs.

Step 4: Plugging It into Claude Code

Here’s where it gets interesting. llama-server has native Anthropic API support, so you can point Claude Code directly at it:

claude-local() {
    ANTHROPIC_BASE_URL="http://localhost:8080" \
    ANTHROPIC_API_KEY="sk-no-key-required" \
    CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
    claude --model "Qwen3-Coder-Next" "$@"
}

But notice that CLAUDE_CODE_ATTRIBUTION_HEADER=0 — that was a critical discovery.

The Prefix Caching Problem

When running with a single slot (-np 1), llama-server keeps the KV/recurrent state from the previous request. If the next request starts with the same token sequence, it only needs to process the new tokens at the end. This is prefix caching, and for a coding agent conversation where the system prompt is 20K+ tokens, it’s the difference between a 40-second wait and a 200-millisecond one on every single turn.
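
Some quick shell arithmetic on my cache-miss vs. cache-hit log timings makes the stakes concrete:

```shell
# Back-of-envelope math on prefix caching, using the cache-miss and
# cache-hit timings from my server logs. Integer arithmetic is precise
# enough here.
full_ms=43205; full_tokens=20637   # cache miss: full re-processing
hit_ms=193;   hit_tokens=14        # cache hit: only the new tokens
echo "full reprocess:   $((full_tokens * 1000 / full_ms)) t/s over ${full_ms} ms"
echo "cache hit:        ${hit_tokens} tokens in ${hit_ms} ms"
echo "per-turn speedup: ~$((full_ms / hit_ms))x"
```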

The problem: Claude Code injects a dynamic billing/attribution header into the beginning of the system prompt on every request. This header includes a content hash that changes each turn, which means the tokenized prefix is different every time and the server can never reuse its cached state.

The fix is simple — CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header when talking to a local server. This is only relevant for local use; the header is just analytics when talking to Anthropic’s actual API.

You can verify it’s working by watching the server logs:

# Cache hit — only new tokens to process:
prompt eval time = 193.48 ms / 14 tokens

# Cache miss — full re-processing:
prompt eval time = 43204.96 ms / 20637 tokens

Step 5: Trying the Pi Coding Agent

While Claude Code works, it has a heavy system prompt (~20K+ tokens) and relies on tool-calling patterns that aren’t always a natural fit for Qwen3-Coder-Next. Community users have reported the model getting stuck in loops with Claude Code’s XML-based tool format.

So I tried the Pi coding agent — a lighter-weight terminal-based coding agent. Pi’s system prompt is much smaller (~1.3K tokens), and prefix caching works out of the box with no configuration needed. Setup was straightforward:

npm install -g @mariozechner/pi-coding-agent

Configure a local model in ~/.pi/agent/models.json, add a bash alias, and you’re off:

alias pi-local='pi --provider llama-local --model Qwen3-Coder-Next'

Pi felt noticeably snappier than Claude Code on the local model — partly because of the smaller system prompt, and partly because it avoids the tool-calling translation layer that can trip up non-Claude models.

The Final Setup

After all the tuning, here’s what I landed on:

llama-server \
    -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M \
    -ngl 99 -ncmoe 48 -fa on -t 8 \
    -c 131072 -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 -np 1 \
    --jinja --host 0.0.0.0 --port 8080 \
    --temp 0

I set it up as a systemd user service with bash aliases for easy management (llama-start, llama-stop, llama-logs, etc.) so it’s on-demand — no resources wasted when I’m not using it.
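
For reference, here's a minimal sketch of such a systemd user unit; the paths and flags are illustrative and should be adapted to your own binary location and serve command:

```ini
# ~/.config/systemd/user/llama-server.service
[Unit]
Description=Local llama-server (Qwen3-Coder-Next)

[Service]
ExecStart=%h/llama.cpp/build/bin/llama-server -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M -ngl 99 -ncmoe 48 -fa on -t 8 -c 131072 -ctk q8_0 -ctv q8_0 -b 4096 -ub 4096 -np 1 --jinja --port 8080 --temp 0
Restart=on-failure

[Install]
WantedBy=default.target
```

Start it with `systemctl --user start llama-server`; the llama-start/llama-stop/llama-logs aliases can then just wrap `systemctl --user start`, `systemctl --user stop`, and `journalctl --user -u llama-server -f`.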

Honest Assessment

A caveat worth mentioning: despite the strong benchmark numbers, community experience (including my own) suggests that Qwen3-Coder-Next in practice feels more like Claude Haiku than Sonnet for complex multi-step reasoning tasks. It’s great for quick prototyping, simpler coding tasks, and situations where you want local/private inference. But if you’re expecting Claude Sonnet-level performance on hard refactors, temper your expectations.

That said, the fact that you can run a model with competitive benchmark scores on a consumer desktop — for free, with no API costs, and with full privacy — is remarkable. And the inference speed (~25 tokens/second for generation, 460-630 tokens/second for prompt processing) is genuinely usable for real work.


My Take on DeepLearning.AI's Post-Training of LLMs Course

July 29, 2025

So I just finished DeepLearning.AI’s Post-Training of LLMs course, and honestly? It was pretty much exactly what I needed—a straightforward intro to how you actually fine-tune these big language models after they’ve done their initial training.

What the Course Covers

They break it down into three main ways to do this stuff:

Supervised Fine-Tuning (SFT) is basically when you want to make big changes to how your model behaves. Want to turn a regular foundation model into something that actually follows instructions? Or maybe teach it to use tools? That’s SFT territory. The big takeaway here is that quality beats quantity every time—1,000 really good, diverse examples will crush a million mediocre ones.

Direct Preference Optimization (DPO) is kind of like showing the model examples of “do this, not that.” You give it both good and bad responses so it learns what you actually want. This works great for smaller adjustments like making it safer, better at multilingual stuff, or just following instructions better. Pro tip: start with a model that can already answer questions, then use DPO to polish it up.

Online Reinforcement Learning is where things get really interesting (and complicated). The model generates responses in real-time, gets scored by humans or other models, and then updates itself based on that feedback. Think about how ChatGPT was trained with PPO, or what DeepSeek does with GRPO.

What I Actually Liked About It

The best part? They actually tell you when to use each method instead of just throwing theory at you. You get real advice on how to curate your data, what mistakes to avoid (like when DPO gets obsessed with surface-level patterns), and how much memory each approach is going to eat up.

Plus, they handle all the setup through their Jupyter notebook thing, which is honestly a relief when you just want to learn the concepts without spending half your time fighting with dependencies.

The Not-So-Great Parts

Okay, real talk—some of the hands-on stuff felt a bit like when your older sibling lets you “play” video games but gives you the controller that’s not actually plugged in. 😄 You’re going through the motions, but you’re not really in control. Still, it gives you a decent foundation if you want to actually implement this stuff yourself later.

Also, this definitely isn’t for people who are new to LLMs. You should already get the basics of how language models work before jumping into the fine-tuning world.

For me, this course was pretty much perfect for what I needed—an intro to post-training methods without having to slog through dense academic papers. It’s short, well-organized, and gives you enough understanding to figure out which rabbit holes are actually worth exploring.


Building Polished Bespoke Solutions Fast with Vibe Coding

July 05, 2025

Sometimes the best solutions come from the most personal problems. My wife, an amateur photographer, had a classic modern problem: duplicate photos scattered across multiple Google Takeout extractions from different email accounts. She needed help organizing thousands of photos and removing duplicates without losing precious memories or paying for unnecessary cloud storage.

This is exactly the kind of problem where AI-assisted coding shines. With a newborn at home and precious little free time, I spent about two hours building a proper Python package that not only solved the immediate problem but is maintainable and extensible. I could probably have written a quick-and-dirty script in the same amount of time, but not with this level of polish.

It has unit tests and a README, so I can come back to it in the future if needed. I've built tools like this in the past, but after enough time had gone by without using them, it was easier to start over than to pick up where I'd left off.

What It Does

The organize-photos tool is relatively simple. It tackles two main challenges:

  1. Smart Organization: Automatically sorts JPEG images into a clean YYYY/MM/DD folder structure using EXIF metadata
  2. Duplicate Detection: Uses SHA256 hashing to identify identical files and generates a CSV report for review before deletion

The tool handles edge cases gracefully: logging errors without crashing, managing filename conflicts, and giving you control over whether to copy or move files. You can see the code at https://github.com/Adam-D-Lewis/organize-photos if you'd like.
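
The duplicate-detection half can be sketched in a few lines of shell. The real tool does this in Python with hashlib and writes a CSV report; this snippet just demonstrates the core trick, that byte-identical files produce identical SHA256 hashes (uniq's -w/-D options are GNU coreutils):

```shell
#!/usr/bin/env bash
# Demo: find byte-identical files by hashing, sorting, and printing only
# duplicated hash groups. Uses a throwaway directory with known contents.
set -euo pipefail
dir="$(mktemp -d)"
printf 'same bytes'    > "$dir/a.jpg"
printf 'same bytes'    > "$dir/b.jpg"   # byte-identical duplicate of a.jpg
printf 'different one' > "$dir/c.jpg"

# -w64 compares only the 64-hex-char hash prefix of each sha256sum line;
# -D prints every line belonging to a duplicated group.
find "$dir" -type f -exec sha256sum {} + | sort | uniq -w64 -D
rm -rf "$dir"
```

Only a.jpg and b.jpg show up in the output; c.jpg hashes differently and is filtered out.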

Why I Love Vibe Coding

Vibe coding - that flow state where AI helps you rapidly prototype and refine solutions - allowed me to create something much better than a throwaway script, even with the time constraints of new parenthood. The key benefit isn’t speed (I could have hacked something together just as fast), but that this maximizes the value of those precious few hours of coding time. This approach gave me:

  • Proper structure: A real Python package with pyproject.toml, proper imports, and CLI interface
  • Quality foundations: Tests, error handling, and clean separation of concerns
  • Future-proof: Dependencies properly captured, code that’s readable and extensible
  • Confidence: I can modify this later without fear of breaking everything

The Result

Two hours of focused development produced a tool that’s both immediately useful and built to last. My wife got her photos organized and duplicates identified safely. More importantly, I have a solid foundation that I could expand on in the future - maybe adding support for other image formats, more sophisticated duplicate detection, or integration with cloud storage.

The real win isn’t just solving today’s problem quickly - it’s building solutions that respect your future self.