I'm Adam Lewis


👋 Hi, I'm a PhD Data Scientist currently at OpenTeams (formerly at Quansight), where I perform a wide range of data science tasks to meet our clients' needs. I've created algorithms enabling targeted advertising based on 50+ TB of geospatial data, built training material demonstrating Dask, Dask-ML, and distributed hyperparameter tuning on out-of-memory datasets, and developed an optical character recognition pipeline that reduced manual entry of business data by 70%.

🎓 Before I joined Quansight, I got my PhD at The University of Texas at Austin where I analyzed terabytes of multimodal imaging data taken from an industrial 3D printing process using Python and Java to detect in-situ defects via aggregation, registration, and visualization.

🌟 Outside of data science I enjoy disc golf 🥏, hiking 🥾, tennis 🎾, traveling ✈️, and I'm always up for a good board game. Check out my social media for more info!

📝 Recent Posts

What I Learned Making a Local LLM Do Real Work

April 09, 2026


In my previous post, I described building an AI agent for Harvest time tracking using Pydantic AI — driven partly by security concerns with the skill-based approach. The agent worked perfectly with Claude. Then I tried running it locally.

It added a time entry on Tuesday instead of Monday. When I asked it to fix that, it added the Monday entry but forgot to delete the Tuesday one. This wasn’t a damning verdict on local models — but it was a useful lesson about where to invest your time when building agents.

Going Local

Why run locally? Partly practical — no API costs, no data leaving my machine. But honestly, partly just for fun. The dream of doing everything locally without worrying about privacy or third-party dependencies is appealing, even if it’s not strictly necessary for a time-tracking tool.

I’d already been experimenting with running models locally for coding tasks, which I wrote about previously. Since then, llama.cpp has added built-in router mode — you run a single server process that auto-loads and unloads models on demand based on the model field in your API request. With --models-max 1 it evicts the current model when you request a different one, which works well for my setup with an RTX 3060 12GB where only one large model fits in VRAM at a time.

My first test model was Qwen3-Coder-Next — 80B total parameters but only 3B active per token (it’s a Mixture-of-Experts model with 512 experts, 10 selected per token). I was using the Q4_K_M quantization, about 46GB on disk. It’s built for coding tasks, not calendar math and time-tracking. So it wasn’t exactly a fair fight from the start, and the quantization may have further degraded its reasoning on this kind of task.

Where Things Went Wrong

I want to be careful here — this isn’t a story about local models being bad. Local models can be very good, and they’re getting better fast. GLM-5.1, for example, is a 754B parameter model with reportedly near-Opus 4.6 capability under an MIT license. The fact that you can self-host something at that level at all is incredible, even if you need serious hardware to run it.

This was more of an exercise in seeing if I could make the agent robust to a poorly-performing model. The agent worked fine with Claude Sonnet 4.6 — I wanted to see what would break when I threw a much weaker model at it. Here’s what I ran into:

Date math: I asked it to log time for “next Monday.” It picked Tuesday. When dates crossed month boundaries, it got worse.

Day-of-week hallucination: Given the ISO date “2026-04-07,” the model confidently identified it as Monday. It’s Tuesday.

Silent substitution: I typed a project name with a slight typo. Instead of asking for clarification, the model quietly logged my time to a completely different real project.

These aren’t exotic edge cases — dates and project names are the entire job of a time-tracking agent. Could it have been the model, the quantization, something in llama.cpp, or just that 3B active parameters isn’t enough for this? Hard to say. Probably some combination.
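Day-of-week lookups are exactly the kind of thing that should never be left to a language model, because three lines of code get it right every time:

```python
import datetime

# The model confidently called 2026-04-07 a Monday; the standard library disagrees.
d = datetime.date(2026, 4, 7)
print(d.strftime("%A"))  # Tuesday
```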

Pushing Logic Out of the LLM

Rather than guessing at fixes, I started by building evals — 22 test cases covering tool selection, date parsing, shortcut resolution, hallucination detection, and project validation. Evals are valuable regardless of model quality. Model life cycles are short; you’re going to be swapping models regularly, and evals let you validate each swap quickly and catch regressions.
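The shape of such an eval suite is simple. Here is a minimal sketch; `run_agent` and the case data are hypothetical stand-ins for the real agent and my actual 22 cases:

```python
from dataclasses import dataclass

@dataclass
class Case:
    prompt: str
    expected_tool: str
    expected_args: dict

def run_agent(prompt: str):
    # Hypothetical stub standing in for the real agent loop; it returns
    # the tool the model chose and the arguments it supplied.
    if "log" in prompt:
        return ("add_entry", {"date": "2026-04-06", "hours": 2.0})
    return ("list_entries", {})

CASES = [
    Case("log 2 hours on Monday", "add_entry", {"date": "2026-04-06", "hours": 2.0}),
    Case("show my entries", "list_entries", {}),
]

def run_evals(cases):
    # Collect the prompts whose tool call didn't match expectations.
    failures = []
    for case in cases:
        tool, args = run_agent(case.prompt)
        if (tool, args) != (case.expected_tool, case.expected_args):
            failures.append(case.prompt)
    return failures

print(f"{len(CASES) - len(run_evals(CASES))}/{len(CASES)} passed")
```

Swapping in a new model means rerunning one script, which is what makes regular model swaps cheap.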

With the evals showing me exactly where the model was failing, I systematically moved deterministic work out of the LLM and into Python:

Date parsing: Instead of asking the model to calculate “next Friday,” the tool code parses relative dates in Python. The system prompt includes a three-week calendar table — last week through next week — so the model just reads dates off the table instead of doing arithmetic.
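Generating that calendar table is deterministic work. A minimal sketch, assuming Monday-start weeks (the exact formatting of my real prompt table differs):

```python
import datetime

def calendar_table(today: datetime.date) -> str:
    # Monday of last week through Sunday of next week: 21 rows the model
    # can read dates from instead of doing arithmetic.
    start = today - datetime.timedelta(days=today.weekday() + 7)
    rows = []
    for i in range(21):
        d = start + datetime.timedelta(days=i)
        marker = "  <-- today" if d == today else ""
        rows.append(f"{d.strftime('%A'):<9} {d.isoformat()}{marker}")
    return "\n".join(rows)

print(calendar_table(datetime.date(2026, 4, 9)))
```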

Project validation: At startup, the agent builds an index of real Harvest projects. Every tool call validates the project name against this index before hitting the API. Typos get fuzzy-matched with suggestions (“did you mean Deep Learning?”). The model is explicitly told: never substitute a project the user didn’t ask for.
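The validation step can be as small as a lookup plus `difflib` for the fuzzy suggestions. A sketch with hypothetical project names:

```python
import difflib

# Index of real Harvest projects, built once at agent startup (names hypothetical).
PROJECTS = ["Deep Learning", "Data Engineering", "Internal Tools"]

def validate_project(name: str) -> str:
    # Exact matches pass through; anything else raises with a suggestion
    # instead of letting the model silently substitute another project.
    if name in PROJECTS:
        return name
    suggestions = difflib.get_close_matches(name, PROJECTS, n=1, cutoff=0.6)
    hint = f' Did you mean "{suggestions[0]}"?' if suggestions else ""
    raise ValueError(f'Unknown project "{name}".{hint}')
```

Because this runs inside the tool, it catches typos even when the model doesn't.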

Shortcut resolution and hour rounding: Lookup tables and rounding logic in Python, not left to the model.
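Both of these are a few lines each. A sketch, with a hypothetical shortcut table and quarter-hour rounding:

```python
# Hypothetical shorthand-to-project table, loaded from the preferences file.
SHORTCUTS = {"dl": "Deep Learning", "int": "Internal Tools"}

def resolve_shortcut(name: str) -> str:
    # Unknown names pass through unchanged for validation downstream.
    return SHORTCUTS.get(name.lower(), name)

def round_hours(hours: float, increment: float = 0.25) -> float:
    # Round to the nearest quarter hour in code, not in the prompt.
    return round(hours / increment) * increment

print(resolve_shortcut("dl"))  # Deep Learning
print(round_hours(2.6))        # 2.5
```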

The pattern behind all of these: the LLM handles intent — understanding what the user wants. Code handles precision — getting the details exactly right. Anything deterministic belongs in code, not in the prompt.

After these changes, Qwen3-Coder-Next passed 100% of my eval cases. But in real-world usage, it still had rough edges not covered by my (admittedly quick) test suite. A 100% pass rate means your test suite isn’t comprehensive enough yet — not that you’re production-ready.

A Better Model Changes Everything

I tried a few other local options. Gemma 4 E4B (roughly 4B parameters, fits entirely in VRAM) was fast but just wasn’t reliable enough for agent tasks.

Then I loaded Gemma 4 26B-A4B — a Mixture-of-Experts model with about 4B active parameters out of 26B total. It fits on my 12GB GPU by offloading the MoE expert layers to CPU. And it just… worked. The rough edges I’d been fighting with Qwen largely disappeared. Not a frontier model, not even a huge model — but a better fit for this task, still running entirely on my hardware.

I’d spent a lot of time engineering around Qwen’s weaknesses. A better model — still local, still on the same GPU — solved most of those problems without extra effort.

To be fair, the engineering work wasn’t wasted. Moving deterministic logic to code made the agent better for all models, including Claude. That kind of improvement is worth making regardless. But the hours spent debugging model-specific failures? Those were mostly absorbed by a better model.

Practical Takeaways

These are suggestions from my experience, not hard rules — your mileage will vary.

Default to the best model you can use, then scale back with measurement. Don’t start by trying to make a weak model work. Get the product right first, validate the concept, then optimize if needed. In my case, going from a coding-specialized 3B-active model to a general-purpose 4B-active model made a dramatic difference — and both ran locally on the same hardware.

Invest in evals early. Even if you’re using a frontier model, evals give you a regression safety net for when you swap models, update prompts, or change tool implementations. They’re not just for debugging weak models.

Think carefully before fine-tuning a smaller model. I’ve done fine-tuning work on a separate project — training a small model for GitHub issue classification — and the maintenance burden is real. Adding a new label means regenerating your training dataset and retraining. Any requirement change means re-doing that work. For most use cases, I suspect the maintenance cost exceeds the inference savings. It might make sense at industrial scale — a company running chatbots for hundreds of clients, where a stable fine-tuned model is amortized across huge volume. But for most teams building agents, a better base model plus good engineering is probably the more practical path.

Before fine-tuning, consider the alternatives: better prompts, moving more logic to code (as I did here), structured output constraints, or honestly just waiting — small models are getting better fast. By the time you finish a fine-tuning pipeline, the next generation of base models might have closed the gap.

The Bottom Line

The best agent architecture is one that doesn’t depend on the model being brilliant. Push precision into code, let the LLM handle intent, invest in evals, and start with the best model available to you. You can always optimize later — and by the time you need to, there might be a better small model anyway.


From Skill to Agent: When a Text File Isn't Enough

April 08, 2026


A coworker of mine built a Go CLI for the Harvest time-tracking API. It’s a solid tool, and I wanted to make it even easier to use from Claude Code. So I wrote a skill — essentially a markdown file with instructions, examples, and patterns — and in about an hour I had a working integration. Claude could log time, view entries, edit hours, and delete entries. It just worked.

What surprised me was how much it could do with so little. The skill handled first-time onboarding — prompting new users to install the CLI, verifying their credentials, pulling their recent time entries to learn their billing patterns, and creating a preferences file mapping shorthand names to Harvest projects. It also walked them through setting up their API token. All of this from a text file describing the flow in natural language.

I sent it to my coworkers; they dropped it into their skills directory and it worked for them too.

Software in a Text File

A Claude Code skill is a structured text file that tells an LLM what tools exist, how to call them, and what patterns to follow. There’s no compilation, no packaging, no dependency management. You write a markdown file describing the interface, and the LLM figures out the rest. Anyone with Claude Code can install the skill and use it immediately.

I keep seeing this pattern show up in different forms. Andrej Karpathy recently shared his LLM Knowledge Base concept — an “idea file” that you paste into an LLM agent, and it builds you a personal wiki. The gist got over 2,100 stars in under 12 hours. OpenClaw has built an entire ecosystem around this — over 13,000 community skills, essentially text files that extend what an agent can do. Same underlying pattern: text as software.

There’s a meaningful difference in accessibility, though. Karpathy’s idea file is a starting point — you’re expected to spend an hour or two customizing it, adapting it to your needs. It’s still a developer tool. A skill is closer to an app. You install it and it works. My Harvest skill didn’t require the user to understand the Harvest API or write any code. They just talk to Claude and their time gets logged.

It’s not perfect, and there are certainly flaws. But it’s remarkable how fast you can get something that looks and feels like working software: what used to take days of development now takes an hour with a markdown file, if you have a good model backing it. I’m not saying this is the future of software distribution, but it’s already starting to take hold in some niches. It’s part of why ecosystems like OpenClaw have grown so fast: the barrier to creating and sharing useful integrations has dropped dramatically.

That said, this approach has real limitations, the biggest of which is security.

The Security Problem

When you use a skill with Claude Code, the LLM operates in your full environment. My Harvest API token was sitting in an environment variable. Nothing in the skill architecture stops the LLM from reading it. We’re trusting the LLM to remember and follow instructions — and that workflow has limits.

If you need a reminder of how that can go wrong, look at what happened with OpenClaw and Meta’s AI Safety Director in February 2026. She connected OpenClaw to her work email with a clear instruction: “don’t do anything without my approval.” When the context window filled up and the agent compacted its memory, that safety constraint got dropped from the summary. The agent then deleted over 200 emails, ignoring her repeated commands to stop. The instruction was there — the agent just forgot it.

Anthropic’s own Claude 4 System Card documents that Opus “seems more willing than prior models to take initiative on its own in agentic contexts.” The Opus 4.6 Risk Report goes further, flagging “overeager agentic behavior” including “aggressively acquiring authentication tokens” in coding and GUI settings.

Here’s a concrete scenario: the agent calls the Harvest CLI and gets an authentication error. A capable, initiative-taking model might decide to debug by reading your .env file or checking your shell configuration to verify the token. Now your secret is part of the conversation context, sent to Anthropic’s servers. The model wasn’t being malicious — it was being helpful. But the result is the same: your credential has left your machine.

You can (and I did) write “never read credentials” in the skill instructions. But that’s a suggestion to the model, not a guardrail. There’s no enforcement mechanism — and as the OpenClaw incident showed, even an agent that’s trying to comply can lose explicit instructions when its context gets compacted.

What You Can Do About It

A more robust approach is separating the agent from the credentials via something like a credential-injecting proxy. The agent never sees the secret — a network proxy intercepts outgoing HTTP requests and attaches the authorization header before forwarding them. NVIDIA’s guidance on sandboxing agentic workflows covers this pattern well, including credential brokers and short-lived tokens.
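To make the injection point concrete, here is a minimal sketch of the idea (not a production proxy — a real one would intercept arbitrary outbound HTTP, e.g. via mitmproxy). The base URL matches Harvest’s v2 API; the function names are my own:

```python
import os
import urllib.request

HARVEST_API = "https://api.harvestapp.com/v2"

def build_request(path: str, token: str) -> urllib.request.Request:
    # The proxy attaches the Authorization header itself; the agent's
    # request never contained, and never sees, the token.
    req = urllib.request.Request(f"{HARVEST_API}{path}")
    req.add_header("Authorization", f"Bearer {token}")
    return req

def forward(path: str) -> bytes:
    # The token lives only in the proxy process's environment, not the agent's.
    token = os.environ["HARVEST_TOKEN"]
    with urllib.request.urlopen(build_request(path, token)) as resp:
        return resp.read()
```

Even if the agent is fully compromised by a prompt injection, the worst it can do is ask the proxy to make Harvest API calls — it cannot exfiltrate the credential itself.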

I think skills and idea files are a thought-provoking new pattern, and I’m curious to see how that evolves — especially as sandboxing and proxy approaches mature. But when you’re handling real credentials for real services, a bit of architecture goes a long way.

Moving to a Proper Agent

Something like the sandbox described above is a better design for anything sensitive, and I want to experiment with it in the future. But for now, I took a simpler approach: I built a proper agent using Pydantic AI with a very limited toolset. The agent can only call specific Harvest operations: no file reading, no bash commands, and no access to environment variables or the broader system. Credentials flow through environment variables to the Harvest CLI subprocess, but the agent code never reads or exposes them. It’s not a full sandbox, but it’s good enough for now. Read more in the follow-up post: What I Learned Making a Local LLM Do Real Work.
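The core safety property is that the model can only invoke an explicit whitelist of operations. Illustrated here in plain Python (not Pydantic AI’s actual API — its tool decorators enforce the same idea), with hypothetical tool functions:

```python
# Whitelist dispatcher: the model may only request these named operations.
def add_entry(project: str, date: str, hours: float) -> str:
    # Would shell out to the Harvest CLI in the real agent.
    return f"logged {hours}h to {project} on {date}"

def list_entries(date: str) -> str:
    # Would shell out to the Harvest CLI in the real agent.
    return f"entries for {date}"

ALLOWED_TOOLS = {"add_entry": add_entry, "list_entries": list_entries}

def dispatch(tool: str, **kwargs):
    # Anything outside the whitelist is refused, no matter what the model asks.
    if tool not in ALLOWED_TOOLS:
        raise PermissionError(f"tool {tool!r} is not in the allowed set")
    return ALLOWED_TOOLS[tool](**kwargs)
```

Contrast this with a skill running inside Claude Code, where the model always has file and shell access regardless of what the skill text says.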


Running a Local Coding Agent with Qwen3-Coder-Next

February 23, 2026


Open-source coding models have gotten seriously good. Qwen3-Coder-Next is an 80B parameter Mixture-of-Experts model that only activates 3B parameters per token, and according to Qwen’s own benchmarks, it scores competitively with Claude Sonnet 4.5 on SWE-Bench Verified. That’s a model you can run on consumer hardware matching (or at least approaching) one of the best proprietary coding models available. I wanted to see how far I could push it on my desktop — an RTX 3060 12GB, Ryzen 9 7950X, and 128GB of DDR5 RAM.

Here’s what I did, what worked, what didn’t, and what I learned along the way.

Step 1: Building llama.cpp with CUDA Support

The first step was getting llama.cpp built locally with CUDA support. llama.cpp is the go-to inference engine for running quantized models on consumer hardware, and it has native support for serving an Anthropic-compatible API — which matters a lot when you want to plug it into coding agent frontends.

git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j$(nproc)

Nothing too complicated, but you do need the CUDA toolkit installed. Once built, you get llama-server which can serve models over HTTP with both OpenAI and Anthropic-compatible API endpoints.

Step 2: First Run (and Disappointing Results)

I grabbed the Q4_K_M quantization of Qwen3-Coder-Next (~48GB on disk) and fired up the server with some basic settings. The initial results were… fine. Around 77 tokens/second for prompt processing and 23 tokens/second for generation. Usable, but I had a feeling the hardware could do better.

The key variable I needed to tune was -ncmoe — this controls how many of the model’s 48 MoE layers have their experts processed on CPU versus GPU. With only 12GB of VRAM and a 48GB model, most of the expert computation has to happen on CPU, but the question was: exactly how much can we push onto the GPU before running out of memory?

Step 3: The Parameter Sweep

Rather than guess, I wrote a series of benchmark scripts to systematically sweep through different configurations. I tested:

  • ncmoe values from 40 to 48 (how many MoE layers stay on CPU)
  • Batch and ubatch sizes (how the prompt gets chunked for processing)
  • Thread counts (8, 16, 32)
  • KV cache quantization (f16 vs q8_0)
  • Context sizes from 32K up to 256K
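Enumerating the sweep is the easy part; my actual benchmark scripts then timed one server run per combination. A sketch of the grid (parameter values mirror the list above, but the exact grid I ran differed slightly):

```python
import itertools

# Sweep grid: each combination becomes one timed llama-server run.
GRID = {
    "ncmoe": range(40, 49),            # MoE layers kept on CPU
    "ubatch": [512, 1024, 2048, 4096], # prompt chunking
    "threads": [8, 16, 32],
    "kv_type": ["f16", "q8_0"],        # KV cache quantization
}

configs = [dict(zip(GRID, values)) for values in itertools.product(*GRID.values())]
print(len(configs))  # 9 * 4 * 3 * 2 = 216 configurations
```

Even a grid this small makes exhaustive sweeping tedious by hand, which is why scripting it pays off immediately.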

Some highlights from the results:

| Config                       | Prompt Processing | Token Generation | Notes                        |
|------------------------------|-------------------|------------------|------------------------------|
| ncmoe=41 (best perf)         | 323 t/s           | 25.2 t/s         | +7% pp, +16% tg vs baseline  |
| ncmoe=48 (all CPU MoE)       | 302 t/s           | 21.8 t/s         | Safest, frees most VRAM      |
| ncmoe=48, ub=4096, pp8192    | 631 t/s           | 24.7 t/s         | Sweet spot for long prompts  |

The most interesting finding was about VRAM management. When you need to free up VRAM (say, for a large context window), you have two options: shrink the ubatch size or push more MoE layers to CPU. Pushing layers to CPU is dramatically better — bumping ncmoe from 42 to 48 freed nearly 6GB of VRAM with only a 3% performance hit, while cutting ubatch from 2048 to 512 freed less than 1GB and cost 45% of prompt processing speed.

I also confirmed that 128K context works on the RTX 3060 (using q8_0 KV cache quantization), and 160K is the practical maximum. 256K just OOMs.

Step 4: Plugging It into Claude Code

Here’s where it gets interesting. llama-server has native Anthropic API support, so you can point Claude Code directly at it:

claude-local() {
    ANTHROPIC_BASE_URL="http://localhost:8080" \
    ANTHROPIC_API_KEY="sk-no-key-required" \
    CLAUDE_CODE_ATTRIBUTION_HEADER=0 \
    claude --model "Qwen3-Coder-Next" "$@"
}

But notice that CLAUDE_CODE_ATTRIBUTION_HEADER=0 — that was a critical discovery.

The Prefix Caching Problem

When running with a single slot (-np 1), llama-server keeps the KV/recurrent state from the previous request. If the next request starts with the same token sequence, it only needs to process the new tokens at the end. This is prefix caching, and for a coding agent conversation where the system prompt is 20K+ tokens, it’s the difference between a 40-second wait and a 200-millisecond one on every single turn.
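The cache reuse rule is just longest-common-prefix matching over token sequences. A toy illustration with made-up token IDs:

```python
def shared_prefix_len(a: list[int], b: list[int]) -> int:
    # Number of leading tokens the server can reuse from its cached state.
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

cached  = [7, 7, 42, 1, 2, 3]   # tokens from the previous request
static  = [7, 7, 42, 1, 2, 9]   # same prompt, one changed token at the end
dynamic = [99, 7, 42, 1, 2, 3]  # a changing header at the front

print(shared_prefix_len(cached, static))   # 5 -> nearly everything reused
print(shared_prefix_len(cached, dynamic))  # 0 -> full re-processing
```

A single changed token at the *front* of the prompt invalidates everything after it — which is exactly what Claude Code’s per-request attribution header was doing.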

The problem: Claude Code injects a dynamic billing/attribution header into the beginning of the system prompt on every request. This header includes a content hash that changes each turn, which means the tokenized prefix is different every time and the server can never reuse its cached state.

The fix is simple — CLAUDE_CODE_ATTRIBUTION_HEADER=0 disables the header when talking to a local server. This is only relevant for local use; the header is just analytics when talking to Anthropic’s actual API.

You can verify it’s working by watching the server logs:

# Cache hit — only new tokens to process:
prompt eval time = 193.48 ms / 14 tokens

# Cache miss — full re-processing:
prompt eval time = 43204.96 ms / 20637 tokens

Step 5: Trying the Pi Coding Agent

While Claude Code works, it has a heavy system prompt (~20K+ tokens) and relies on tool-calling patterns that aren’t always a natural fit for Qwen3-Coder-Next. Community users have reported the model getting stuck in loops with Claude Code’s XML-based tool format.

So I tried the Pi coding agent — a lighter-weight terminal-based coding agent. Pi’s system prompt is much smaller (~1.3K tokens), and prefix caching works out of the box with no configuration needed. Setup was straightforward:

npm install -g @mariozechner/pi-coding-agent

Configure a local model in ~/.pi/agent/models.json, add a bash alias, and you’re off:

alias pi-local='pi --provider llama-local --model Qwen3-Coder-Next'

Pi felt noticeably snappier than Claude Code on the local model — partly because of the smaller system prompt, and partly because it avoids the tool-calling translation layer that can trip up non-Claude models.

The Final Setup

After all the tuning, here’s what I landed on:

llama-server \
    -hf Qwen/Qwen3-Coder-Next-GGUF:Q4_K_M \
    -ngl 99 -ncmoe 48 -fa on -t 8 \
    -c 131072 -ctk q8_0 -ctv q8_0 \
    -b 4096 -ub 4096 -np 1 \
    --jinja --host 0.0.0.0 --port 8080 \
    --temp 0

I set it up as a systemd user service with bash aliases for easy management (llama-start, llama-stop, llama-logs, etc.) so it’s on-demand — no resources wasted when I’m not using it.

Honest Assessment

A caveat worth mentioning: despite the strong benchmark numbers, community experience (including my own) suggests that Qwen3-Coder-Next in practice feels more like Claude Haiku than Sonnet for complex multi-step reasoning tasks. It’s great for quick prototyping, simpler coding tasks, and situations where you want local/private inference. But if you’re expecting Claude Sonnet-level performance on hard refactors, temper your expectations.

That said, the fact that you can run a model with competitive benchmark scores on a consumer desktop — for free, with no API costs, and with full privacy — is remarkable. And the inference speed (~25 tokens/second for generation, 460-630 tokens/second for prompt processing) is genuinely usable for real work.

Resources