I'm Adam Lewis


👋 Hi, I'm a PhD Data Scientist currently at OpenTeams (formerly at Quansight), where I perform a wide range of data science tasks to meet our clients' needs. I've created algorithms enabling targeted advertising based on 50+ TB of geospatial data, built training material demonstrating Dask, Dask-ML, and distributed hyperparameter tuning on out-of-memory datasets, and developed an optical character recognition pipeline that reduced manual entry of business data by 70%.

🎓 Before I joined Quansight, I got my PhD at The University of Texas at Austin where I analyzed terabytes of multimodal imaging data taken from an industrial 3D printing process using Python and Java to detect in-situ defects via aggregation, registration, and visualization.

🌟 Outside of data science I enjoy disc golf 🥏, hiking 🥾, tennis 🎾, traveling ✈️, and I'm always up for a good board game. Check out my social media for more info!

📝 Recent Posts

What I Learned Making a Local LLM Do Real Work

April 09, 2026


In my previous post, I described building an AI agent for Harvest time tracking using Pydantic AI — driven partly by security concerns with the skill-based approach. The agent worked perfectly with Claude. Then I tried running it locally.

It added a time entry on Tuesday instead of Monday. When I asked it to fix that, it added the Monday entry but forgot to delete the Tuesday one. This wasn’t a damning verdict on local models — but it was a useful lesson about where to invest your time when building agents.

Going Local

Why run locally? Partly practical — no API costs, no data leaving my machine. But honestly, partly just for fun. The dream of doing everything locally without worrying about privacy or third-party dependencies is appealing, even if it’s not strictly necessary for a time-tracking tool.

I’d already been experimenting with running models locally for coding tasks, which I wrote about previously. Since then, llama.cpp has added built-in router mode — you run a single server process that auto-loads and unloads models on demand based on the model field in your API request. With --models-max 1 it evicts the current model when you request a different one, which works well for my setup with an RTX 3060 12GB where only one large model fits in VRAM at a time.

My first test model was Qwen3-Coder-Next — 80B total parameters but only 3B active per token (it’s a Mixture-of-Experts model with 512 experts, 10 selected per token). I was using the Q4_K_M quantization, about 46GB on disk. It’s built for coding tasks, not calendar math and time-tracking. So it wasn’t exactly a fair fight from the start, and the quantization may have further degraded its reasoning on this kind of task.

Where Things Went Wrong

I want to be careful here — this isn’t a story about local models being bad. Local models can be very good, and they’re getting better fast. GLM-5.1, for example, is a 754B parameter model with reportedly near-Opus 4.6 capability under an MIT license. The fact that you can self-host something at that level at all is incredible, even if you need serious hardware to run it.

This was more of an exercise in seeing if I could make the agent robust to a poorly-performing model. The agent worked fine with Claude Sonnet 4.6 — I wanted to see what would break when I threw a much weaker model at it. Here’s what I ran into:

Date math: I asked it to log time for “next Monday.” It picked Tuesday. When dates crossed month boundaries, it got worse.

Day-of-week hallucination: Given the ISO date “2026-04-07,” the model confidently identified it as Monday. It’s Tuesday.

Silent substitution: I typed a project name with a slight typo. Instead of asking for clarification, the model quietly logged my time to a completely different real project.

These aren’t exotic edge cases — dates and project names are the entire job of a time-tracking agent. Could it have been the model, the quantization, something in llama.cpp, or just that 3B active parameters isn’t enough for this? Hard to say. Probably some combination.
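Day-of-week lookups are exactly the kind of thing that should never be left to a language model, because three lines of code get it right every time:

```python
import datetime

# The model confidently called 2026-04-07 a Monday; the standard library disagrees.
d = datetime.date(2026, 4, 7)
print(d.strftime("%A"))  # Tuesday
```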

Pushing Logic Out of the LLM

Rather than guessing at fixes, I started by building evals — 22 test cases covering tool selection, date parsing, shortcut resolution, hallucination detection, and project validation. Evals are valuable regardless of model quality. Model life cycles are short; you’re going to be swapping models regularly, and evals let you validate each swap quickly and catch regressions.
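The shape of such an eval suite is simple. Here is a minimal sketch; `run_agent` and the case data are hypothetical stand-ins for the real agent and my actual 22 cases:

```python
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    expected_tool: str
    expected_args: dict

def run_agent(prompt: str):
    # Hypothetical stub standing in for the real agent loop; it returns
    # the tool the model chose and the arguments it supplied.
    if "log" in prompt:
        return ("add_entry", {"date": "2026-04-06", "hours": 2.0})
    return ("list_entries", {})

CASES = [
    Case("log 2 hours on Monday", "add_entry", {"date": "2026-04-06", "hours": 2.0}),
    Case("show my entries", "list_entries", {}),
]

def run_evals(cases):
    # Collect the prompts whose tool call didn't match expectations.
    failures = []
    for case in cases:
        tool, args = run_agent(case.prompt)
        if (tool, args) != (case.expected_tool, case.expected_args):
            failures.append(case.prompt)
    return failures

print(f"{len(CASES) - len(run_evals(CASES))}/{len(CASES)} passed")
```

Swapping in a new model means rerunning one script, which is what makes regular model swaps cheap.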

With the evals showing me exactly where the model was failing, I systematically moved deterministic work out of the LLM and into Python:

Date parsing: Instead of asking the model to calculate “next Friday,” the tool code parses relative dates in Python. The system prompt includes a three-week calendar table — last week through next week — so the model just reads dates off the table instead of doing arithmetic.
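Generating that calendar table is deterministic work. A minimal sketch, assuming Monday-start weeks (the exact formatting of my real prompt table differs):

```python
import datetime

def calendar_table(today: datetime.date) -> str:
    # Monday of last week through Sunday of next week: 21 rows the model
    # can read dates from instead of doing arithmetic.
    start = today - datetime.timedelta(days=today.weekday() + 7)
    rows = []
    for i in range(21):
        d = start + datetime.timedelta(days=i)
        marker = "  <-- today" if d == today else ""
        rows.append(f"{d.strftime('%A'):<9} {d.isoformat()}{marker}")
    return "\n".join(rows)

print(calendar_table(datetime.date(2026, 4, 9)))
```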

Project validation: At startup, the agent builds an index of real Harvest projects. Every tool call validates the project name against this index before hitting the API. Typos get fuzzy-matched with suggestions (“did you mean Deep Learning?”). The model is explicitly told: never substitute a project the user didn’t ask for.
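The validation step can be as small as a lookup plus `difflib` for the fuzzy suggestions. A sketch with hypothetical project names:

```python
import difflib

# Index of real Harvest projects, built once at agent startup (names hypothetical).
PROJECTS = ["Deep Learning", "Data Engineering", "Internal Tools"]

def validate_project(name: str) -> str:
    # Exact matches pass through; anything else raises with a suggestion
    # instead of letting the model silently substitute another project.
    if name in PROJECTS:
        return name
    suggestions = difflib.get_close_matches(name, PROJECTS, n=1, cutoff=0.6)
    hint = f' Did you mean "{suggestions[0]}"?' if suggestions else ""
    raise ValueError(f'Unknown project "{name}".{hint}')
```

Because this runs inside the tool, it catches typos even when the model doesn't.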

Shortcut resolution and hour rounding: Lookup tables and rounding logic in Python, not left to the model.
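Both of these are a few lines each. A sketch, with a hypothetical shortcut table and quarter-hour rounding:

```python
# Hypothetical shorthand-to-project table, loaded from the preferences file.
SHORTCUTS = {"dl": "Deep Learning", "int": "Internal Tools"}

def resolve_shortcut(name: str) -> str:
    # Unknown names pass through unchanged for validation downstream.
    return SHORTCUTS.get(name.lower(), name)

def round_hours(hours: float, increment: float = 0.25) -> float:
    # Round to the nearest quarter hour in code, not in the prompt.
    return round(hours / increment) * increment

print(resolve_shortcut("dl"))  # Deep Learning
print(round_hours(2.6))        # 2.5
```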

The pattern behind all of these: the LLM handles intent — understanding what the user wants. Code handles precision — getting the details exactly right. Anything deterministic belongs in code, not in the prompt.

After these changes, Qwen3-Coder-Next passed 100% of my eval cases. But in real-world usage, it still had rough edges not covered by my (admittedly quick) test suite. A 100% pass rate means your test suite isn’t comprehensive enough yet — not that you’re production-ready.

A Better Model Changes Everything

I tried a few other local options. Gemma 4 E4B (roughly 4B parameters, fits entirely in VRAM) was fast but just wasn’t reliable enough for agent tasks.

Then I loaded Gemma 4 26B-A4B — a Mixture-of-Experts model with about 4B active parameters out of 26B total. It fits on my 12GB GPU by offloading the MoE expert layers to CPU. And it just… worked. The rough edges I’d been fighting with Qwen largely disappeared. Not a frontier model, not even a huge model — but a better fit for this task, still running entirely on my hardware.

I’d spent a lot of time engineering around Qwen’s weaknesses. A better model — still local, still on the same GPU — solved most of those problems without extra effort.

To be fair, the engineering work wasn’t wasted. Moving deterministic logic to code made the agent better for all models, including Claude. That kind of improvement is worth making regardless. But the hours spent debugging model-specific failures? Those were mostly absorbed by a better model.

Practical Takeaways

These are suggestions from my experience, not hard rules — your mileage will vary.

Default to the best model you can use, then scale back with measurement. Don’t start by trying to make a weak model work. Get the product right first, validate the concept, then optimize if needed. In my case, going from a coding-specialized 3B-active model to a general-purpose 4B-active model made a dramatic difference — and both ran locally on the same hardware.

Invest in evals early. Even if you’re using a frontier model, evals give you a regression safety net for when you swap models, update prompts, or change tool implementations. They’re not just for debugging weak models.

Think carefully before fine-tuning a smaller model. I’ve done fine-tuning work on a separate project — training a small model for GitHub issue classification — and the maintenance burden is real. Adding a new label means regenerating your training dataset and retraining. Any requirement change means re-doing that work. For most use cases, I suspect the maintenance cost exceeds the inference savings. It might make sense at industrial scale — a company running chatbots for hundreds of clients, where a stable fine-tuned model is amortized across huge volume. But for most teams building agents, a better base model plus good engineering is probably the more practical path.

Before fine-tuning, consider the alternatives: better prompts, moving more logic to code (as I did here), structured output constraints, or honestly just waiting — small models are getting better fast. By the time you finish a fine-tuning pipeline, the next generation of base models might have closed the gap.

The Bottom Line

The best agent architecture is one that doesn’t depend on the model being brilliant. Push precision into code, let the LLM handle intent, invest in evals, and start with the best model available to you. You can always optimize later — and by the time you need to, there might be a better small model anyway.


From Skill to Agent: When a Text File Isn't Enough

April 08, 2026


A coworker of mine built a Go CLI for the Harvest time-tracking API. It’s a solid tool, and I wanted to make it even easier to use from Claude Code. So I wrote a skill — essentially a markdown file with instructions, examples, and patterns — and in about an hour I had a working integration. Claude could log time, view entries, edit hours, and delete entries. It just worked.

What surprised me was how much it could do with so little. The skill handled first-time onboarding — prompting new users to install the CLI, verifying their credentials, pulling their recent time entries to learn their billing patterns, and creating a preferences file mapping shorthand names to Harvest projects. It also walked them through setting up their API token. All of this from a text file describing the flow in natural language.

I sent it to my coworkers; they dropped it into their skills directory and it worked for them too.

Software in a Text File

A Claude Code skill is a structured text file that tells an LLM what tools exist, how to call them, and what patterns to follow. There’s no compilation, no packaging, no dependency management. You write a markdown file describing the interface, and the LLM figures out the rest. Anyone with Claude Code can install the skill and use it immediately.

I keep seeing this pattern show up in different forms. Andrej Karpathy recently shared his LLM Knowledge Base concept — an “idea file” that you paste into an LLM agent, and it builds you a personal wiki. The gist got over 2,100 stars in under 12 hours. OpenClaw has built an entire ecosystem around this — over 13,000 community skills, essentially text files that extend what an agent can do. Same underlying pattern: text as software.

There’s a meaningful difference in accessibility, though. Karpathy’s idea file is a starting point — you’re expected to spend an hour or two customizing it, adapting it to your needs. It’s still a developer tool. A skill is closer to an app. You install it and it works. My Harvest skill didn’t require the user to understand the Harvest API or write any code. They just talk to Claude and their time gets logged.

It’s not perfect, and there are certainly flaws. But it’s remarkable how fast you can get something that looks and feels like working software: what used to take days of development now takes an hour with a markdown file, if you have a good model backing it. I’m not saying this is the future of software distribution, but it’s already starting to take hold in some niches. It’s part of why ecosystems like OpenClaw have grown so fast: the barrier to creating and sharing useful integrations has dropped dramatically.

That said, this approach has real limitations, the biggest of which is security.

The Security Problem

When you use a skill with Claude Code, the LLM operates in your full environment. My Harvest API token was sitting in an environment variable. Nothing in the skill architecture stops the LLM from reading it. We’re trusting the LLM to remember and follow instructions — and that workflow has limits.

If you need a reminder of how that can go wrong, look at what happened with OpenClaw and Meta’s AI Safety Director in February 2026. She connected OpenClaw to her work email with a clear instruction: “don’t do anything without my approval.” When the context window filled up and the agent compacted its memory, that safety constraint got dropped from the summary. The agent then deleted over 200 emails, ignoring her repeated commands to stop. The instruction was there — the agent just forgot it.

Anthropic’s own Claude 4 System Card documents that Opus “seems more willing than prior models to take initiative on its own in agentic contexts.” The Opus 4.6 Risk Report goes further, flagging “overeager agentic behavior” including “aggressively acquiring authentication tokens” in coding and GUI settings.

Here’s a concrete scenario: the agent calls the Harvest CLI and gets an authentication error. A capable, initiative-taking model might decide to debug by reading your .env file or checking your shell configuration to verify the token. Now your secret is part of the conversation context, sent to Anthropic’s servers. The model wasn’t being malicious — it was being helpful. But the result is the same: your credential has left your machine.

You can (and I did) write “never read credentials” in the skill instructions. But that’s a suggestion to the model, not a guardrail. There’s no enforcement mechanism — and as the OpenClaw incident showed, even an agent that’s trying to comply can lose explicit instructions when its context gets compacted.

What You Can Do About It

A more robust approach is separating the agent from the credentials via something like a credential-injecting proxy. The agent never sees the secret — a network proxy intercepts outgoing HTTP requests and attaches the authorization header before forwarding them. NVIDIA’s guidance on sandboxing agentic workflows covers this pattern well, including credential brokers and short-lived tokens.
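To make the injection point concrete, here is a minimal sketch of the idea (not a production proxy — a real one would intercept arbitrary outbound HTTP, e.g. via mitmproxy). The base URL matches Harvest’s v2 API; the function names are my own:

```python
import os
import urllib.request

HARVEST_API = "https://api.harvestapp.com/v2"

def build_request(path: str, token: str) -> urllib.request.Request:
    # The proxy attaches the Authorization header itself; the agent's
    # request never contained, and never sees, the token.
    req = urllib.request.Request(f"{HARVEST_API}{path}")
    req.add_header("Authorization", f"Bearer {token}")
    return req

def forward(path: str) -> bytes:
    # The token lives only in the proxy process's environment, not the agent's.
    token = os.environ["HARVEST_TOKEN"]
    with urllib.request.urlopen(build_request(path, token)) as resp:
        return resp.read()
```

Even if the agent is fully compromised by a prompt injection, the worst it can do is ask the proxy to make Harvest API calls — it cannot exfiltrate the credential itself.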

I think skills and idea files are a thought-provoking new pattern, and I’m curious to see how that evolves — especially as sandboxing and proxy approaches mature. But when you’re handling real credentials for real services, a bit of architecture goes a long way.

Moving to a Proper Agent

Something like the sandbox described above is a better design for anything sensitive, and I want to experiment with it in the future. But for now, I took a simpler approach: I built a proper agent using Pydantic AI with a very limited toolset. The agent can only call specific Harvest operations: no file reading, no bash commands, and no access to environment variables or the broader system. Credentials flow through environment variables to the Harvest CLI subprocess, but the agent code never reads or exposes them. It’s not a full sandbox, but it’s good enough for now. Read more in the follow-up post: What I Learned Making a Local LLM Do Real Work.
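The core safety property is that the model can only invoke an explicit whitelist of operations. Illustrated here in plain Python (not Pydantic AI’s actual API — its tool decorators enforce the same idea), with hypothetical tool functions:

```python
# Whitelist dispatcher: the model may only request these named operations.
def add_entry(project: str, date: str, hours: float) -> str:
    # Would shell out to the Harvest CLI in the real agent.
    return f"logged {hours}h to {project} on {date}"

def list_entries(date: str) -> str:
    # Would shell out to the Harvest CLI in the real agent.
    return f"entries for {date}"

ALLOWED_TOOLS = {"add_entry": add_entry, "list_entries": list_entries}

def dispatch(tool: str, **kwargs):
    # Anything outside the whitelist is refused, no matter what the model asks.
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} is not in the allowed set")
    return ALLOWED_TOOLS[tool](**kwargs)
```

Contrast this with a skill running inside Claude Code, where the model always has file and shell access regardless of what the skill text says.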


Running a Local Coding Agent with Qwen3-Coder-Next

February 23, 2026


Open-source coding models have gotten seriously good. Qwen3-Coder-Next is an 80B parameter Mixture-of-Experts model that only activates 3B parameters per token, and according to Qwen’s own benchmarks, it scores competitively with Claude Sonnet 4.5 on SWE-Bench Verified. That’s a model you can run on consumer hardware matching (or at least approaching) one of the best proprietary coding models available. I wanted to see how far I could push it on my desktop — an RTX 3060 12GB, Ryzen 9 7950X, and 128GB of DDR5 RAM.

Here’s what I did, what worked, what didn’t, and what I learned along the way.

Step 1: Building llama.cpp with CUDA Support

The first step was getting llama.cpp built locally with CUDA support. llama.cpp is the go-to inference engine for running quantized models on consumer hardware, and it has native support for serving an Anthropic-compatible API — which matters a lot when you want to plug it into coding agent frontends.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Nothing too complicated, but you do need the CUDA toolkit installed. Once built, you get llama-server which can serve models over HTTP with both OpenAI and Anthropic-compatible API endpoints.

Step 2: First Run (and Disappointing Results)

I grabbed the Q4_K_M quantization of Qwen3-Coder-Next (~48GB on disk) and fired up the server with some basic settings. The initial results were… fine. Around 77 tokens/second for prompt processing and 23 tokens/second for generation. Usable, but I had a feeling the hardware could do better.

The key variable I needed to tune was -ncmoe — this controls how many of the model’s 48 MoE layers have their experts processed on CPU versus GPU. With only 12GB of VRAM and a 48GB model, most of the expert computation has to happen on CPU, but the question was: exactly how much can we push onto the GPU before running out of memory?

Step 3: The Parameter Sweep

Rather than guess, I wrote a series of benchmark scripts to systematically sweep through different configurations. I tested:

  • ncmoe values from 40 to 48 (how many MoE layers stay on CPU)
  • Batch and ubatch sizes (how the prompt gets chunked for processing)
  • Thread counts (8, 16, 32)
  • KV cache quantization (f16 vs q8_0)
  • Context sizes from 32K up to 256K
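Enumerating the sweep is the easy part; my actual benchmark scripts then timed one server run per combination. A sketch of the grid (parameter values mirror the list above, but the exact grid I ran differed slightly):

```python
import itertools

# Sweep grid: each combination becomes one timed llama-server run.
GRID = {
    "ncmoe": range(40, 49),            # MoE layers kept on CPU
    "ubatch": [512, 1024, 2048, 4096], # prompt chunking
    "threads": [8, 16, 32],
    "kv_type": ["f16", "q8_0"],        # KV cache quantization
}

configs = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
print(len(configs))  # 9 * 4 * 3 * 2 = 216 configurations
```

Even a grid this small makes exhaustive sweeping tedious by hand, which is why scripting it pays off immediately.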

Some highlights from the results:

| Config                       | Prompt Processing | Token Generation | Notes                        |
|------------------------------|-------------------|------------------|------------------------------|
| ncmoe=41 (best perf)         | 323 t/s           | 25.2 t/s         | +7% pp, +16% tg vs baseline  |
| ncmoe=48 (all CPU MoE)       | 302 t/s           | 21.8 t/s         | Safest, frees most VRAM      |
| ncmoe=48, ub=4096, pp8192    | 631 t/s           | 24.7 t/s         | Sweet spot for long prompts  |

The most interesting finding was about VRAM management. When you need to free up VRAM (say, for a large context window), you have two options: shrink the ubatch size or push more MoE layers to CPU. Pushing layers to CPU is dramatically better — bumping ncmoe from 42 to 48 freed nearly 6GB of VRAM with only a 3% performance hit, while cutting ubatch from 2048 to 512 freed less than 1GB and cost 45% of prompt processing speed.

I also confirmed that 128K context works on the RTX 3060 (using q8_0 KV cache quantization), and 160K is the practical maximum. 256K just OOMs.

Step 4: Plugging It into Claude Code

Here’s where it gets interesting. llama-server has native Anthropic API support, so you can point Claude Code directly at it:

claude-local() {
    ANTHROPIC_BASE_URL="http://localhost:8080" \
    ANTHROPIC_API_KEY="sk-no-key-required" \
    CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
    claude --model "Qwen3-Coder-Next" "$@"
}

But notice that CLAUDE_CODE_ATTRIBUTION_HEADER=0 — that was a critical discovery.

The Prefix Caching Problem

When running with a single slot (-np 1), llama-server keeps the KV/recurrent state from the previous request. If the next request starts with the same token sequence, it only needs to process the new tokens at the end. This is prefix caching, and for a coding agent conversation where the system prompt is 20K+ tokens, it’s the difference between a 40-second wait and a 200-millisecond one on every single turn.
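The cache reuse rule is just longest-common-prefix matching over token sequences. A toy illustration with made-up token IDs:

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    # Number of leading tokens the server can reuse from its cached state.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached  = [7, 7, 42, 1, 2, 3]   # tokens from the previous request
static  = [7, 7, 42, 1, 2, 9]   # same prompt, one changed token at the end
dynamic = [99, 7, 42, 1, 2, 3]  # a changing header at the front

print(shared_prefix_len(cached, static))   # 5 -> nearly everything reused
print(shared_prefix_len(cached, dynamic))  # 0 -> full re-processing
```

A single changed token at the *front* of the prompt invalidates everything after it — which is exactly what Claude Code’s per-request attribution header was doing.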

The problem: Claude Code injects a dynamic billing/attribution header into the beginning of the system prompt on every request. This header includes a content hash that changes each turn, which means the tokenized prefix is different every time and the server can never reuse its cached state.

The fix is simple — CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header when talking to a local server. This is only relevant for local use; the header is just analytics when talking to Anthropic’s actual API.

You can verify it’s working by watching the server logs:

# Cache hit — only new tokens to process:
prompt eval time = 193.48 ms / 14 tokens

# Cache miss — full re-processing:
prompt eval time = 43204.96 ms / 20637 tokens

Step 5: Trying the Pi Coding Agent

While Claude Code works, it has a heavy system prompt (~20K+ tokens) and relies on tool-calling patterns that aren’t always a natural fit for Qwen3-Coder-Next. Community users have reported the model getting stuck in loops with Claude Code’s XML-based tool format.

So I tried the Pi coding agent — a lighter-weight terminal-based coding agent. Pi’s system prompt is much smaller (~1.3K tokens), and prefix caching works out of the box with no configuration needed. Setup was straightforward:

npm install -g @mariozechner/pi-coding-agent

Configure a local model in ~/.pi/agent/models.json, add a bash alias, and you’re off:

alias pi-local='pi --provider llama-local --model Qwen3-Coder-Next'

Pi felt noticeably snappier than Claude Code on the local model — partly because of the smaller system prompt, and partly because it avoids the tool-calling translation layer that can trip up non-Claude models.

The Final Setup

After all the tuning, here’s what I landed on:

llama-server \
    -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M \
    -ngl 99 -ncmoe 48 -fa on -t 8 \
    -c 131072 -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 -np 1 \
    --jinja --host 0.0.0.0 --port 8080 \
    --temp 0

I set it up as a systemd user service with bash aliases for easy management (llama-start, llama-stop, llama-logs, etc.) so it’s on-demand — no resources wasted when I’m not using it.

Honest Assessment

A caveat worth mentioning: despite the strong benchmark numbers, community experience (including my own) suggests that Qwen3-Coder-Next in practice feels more like Claude Haiku than Sonnet for complex multi-step reasoning tasks. It’s great for quick prototyping, simpler coding tasks, and situations where you want local/private inference. But if you’re expecting Claude Sonnet-level performance on hard refactors, temper your expectations.

That said, the fact that you can run a model with competitive benchmark scores on a consumer desktop — for free, with no API costs, and with full privacy — is remarkable. And the inference speed (~25 tokens/second for generation, 460-630 tokens/second for prompt processing) is genuinely usable for real work.

Resources