- What it does
- Why I built it instead of using a managed API
- Surprise-bill paranoia
- The tradeoff with Cloudflare Tunnel
- Who this is for
I’ve written previously about running LLMs locally on my RTX 3060. Local is great when the model fits, but my 12GB ceiling rules out a lot of interesting models. The one that finally pushed me to do something about it was Qwen 3.6 27B — a dense model posting near-frontier benchmark numbers, and too big to run on my 12GB card (you’d want a 3090 or bigger). The obvious answer is renting a cloud GPU, but a 24/7 instance is wasteful when I only want to poke at a model for an hour at a time. So I built skyllm.
What it does
One skyllm up command spins up a 24GB+ NVIDIA GPU on RunPod, starts vLLM (for safetensors/AWQ/GPTQ) or llama.cpp (for GGUFs) with the model you picked, and exposes it through a Cloudflare Tunnel at a hostname you control. Clients point at https://llm.yourdomain.com/v1 forever — the GPU comes and goes, the URL stays. skyllm down tears it all back down. You pay cents for the hour.
Because the endpoint is OpenAI-compatible, anything that speaks that protocol just works — Open WebUI, Cherry Studio, your own scripts, whatever.
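For example, with the official openai Python package (the hostname, API key, and model name below are placeholders for whatever you actually deployed):

```python
# Minimal client sketch: any OpenAI-compatible client works; shown here with the
# official `openai` package. Hostname, key, and model name are placeholders.
from openai import OpenAI

client = OpenAI(
    base_url="https://llm.yourdomain.com/v1",
    api_key="the-key-you-configured-for-the-endpoint",
)

resp = client.chat.completions.create(
    model="your-served-model-name",  # whatever model skyllm launched vLLM/llama.cpp with
    messages=[{"role": "user", "content": "Hello from the rented GPU!"}],
)
print(resp.choices[0].message.content)
```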
Why I built it instead of using a managed API
Managed APIs (Together, Fireworks, Groq) are honestly fine for most low-volume hobbyist use, and I’d recommend them first if none of the points below matter to you. They’re easier — one signup, no infrastructure. But there are a few things skyllm gives you that they don’t:
- You choose who sees your data. With a managed API it’s take-it-or-leave-it: you get whoever runs it, and their policies on logging or training-on-prompts can be fuzzy. RunPod is a raw GPU rental — they’re not in the inference-data business. To be clear, only RunPod is wired up today, but SkyPilot itself supports AWS, GCP, Lambda, Vast, and more, so adding another provider should be fairly straightforward if needed. There’s also a documented migration path off Cloudflare Tunnel to your own VPS via FRP, if you want to cut Cloudflare out of the plaintext path entirely.
- Reproducibility. I pick the exact model, exact quantization, exact engine flags. No silent provider-side swaps, no system prompts injected under me.
- No vendor lock-in. Same SkyPilot YAMLs work across providers — flip one line if RunPod gets expensive or you want to move.
To be honest about the privacy tradeoff: in the default Cloudflare-Tunnel setup, your prompts pass through both Cloudflare and RunPod, which is more hands than a managed API, not fewer. The advantage isn’t fewer hands by default — it’s that you get to choose which hands, and there’s a clear path to fewer if you want to do the work.
Surprise-bill paranoia
The thing I was most nervous about when building this was accidentally leaving a GPU running overnight. So there are five layers of protection:
- Idle auto-shutdown — watches vLLM’s token-generation metric, exits after 15 idle minutes (sketched after this list).
- Wall-clock cap — `shutdown -h +240` runs at launch (1 hour on the 80GB tier since H100s are several × the cost).
- SkyPilot autostop — terminates the cluster after 30 idle minutes regardless.
- Monthly budget check — cron-able script that runs `sky down` if the month’s spend crosses a threshold.
- RunPod’s own monthly spend limit — the real backstop. The other four protect against my mistakes; this one protects against bugs in the other four.
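To give a flavor of the first layer, here’s a rough sketch of the idle watcher (not the exact script in the repo). It assumes vLLM’s Prometheus endpoint at http://localhost:8000/metrics and a counter named vllm:generation_tokens_total; check both against your vLLM version.

```python
# Rough sketch of the idle auto-shutdown layer (not the exact script in the repo).
# Assumes vLLM's /metrics endpoint on localhost:8000 and a counter named
# vllm:generation_tokens_total; verify both against your vLLM version.
import re
import subprocess
import time
import urllib.request

METRICS_URL = "http://localhost:8000/metrics"
IDLE_LIMIT_S = 15 * 60   # shut down after 15 idle minutes
POLL_S = 60

def generated_tokens() -> float:
    """Sum every sample of the token-generation counter from the Prometheus text."""
    text = urllib.request.urlopen(METRICS_URL, timeout=10).read().decode()
    samples = re.findall(
        r"^vllm:generation_tokens_total(?:\{[^}]*\})?\s+([0-9.eE+-]+)", text, re.M)
    return sum(float(s) for s in samples)

last_count = generated_tokens()
last_active = time.monotonic()

while True:
    time.sleep(POLL_S)
    count = generated_tokens()
    if count != last_count:            # tokens were generated since the last poll
        last_count, last_active = count, time.monotonic()
    elif time.monotonic() - last_active > IDLE_LIMIT_S:
        subprocess.run(["shutdown", "-h", "now"])   # power off; billing stops
        break
```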
Belt, suspenders, and a third belt. Probably overkill — but the cost of one wedged H100 overnight is enough that I sleep better with all five.
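The monthly budget check is nothing fancy either: a small script cron can run, sketched below with the spend lookup left as a placeholder (the RunPod billing query isn’t shown) and a made-up cluster name and budget.

```python
# Sketch of the monthly budget check. get_month_spend_usd() is a placeholder for a
# query against your provider's billing API; the cluster name and budget are
# assumptions, not skyllm defaults.
import subprocess

BUDGET_USD = 50.0      # monthly ceiling you're willing to pay
CLUSTER = "skyllm"     # hypothetical SkyPilot cluster name

def get_month_spend_usd() -> float:
    """Placeholder: return this month's spend from the provider's billing API."""
    raise NotImplementedError

if __name__ == "__main__":
    if get_month_spend_usd() >= BUDGET_USD:
        # `sky down --yes <cluster>` tears the cluster down without prompting.
        subprocess.run(["sky", "down", "--yes", CLUSTER], check=True)
```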
The tradeoff with Cloudflare Tunnel
The v1 setup terminates TLS at Cloudflare’s edge, which means CF technically has plaintext access to every request. For a hobbyist LLM endpoint that’s a fine threat model. If your prompts are sensitive enough that you don’t want CF reading them, the README documents a migration path to FRP on a $5/mo VPS — same hostname, same API key, clients change nothing.
Who this is for
One person, one model at a time, occasional use. If you need multi-user serving with request queuing, the managed APIs above are a better fit. If you want to mess around with a 27B dense model or an 80B MoE for an hour without a $300/mo bill, this is the shape of the thing.
Repo’s at github.com/adam-d-lewis/skyllm. MIT licensed.
