Self-Hosted LLM Guide: Skip the Hype and Get It Running
Meta: A practical self-hosted LLM guide covering when it actually makes sense, which tools to pick (Ollama, LM Studio, vLLM), honest hardware needs, and the 5 mistakes that make first attempts fail.
You downloaded Ollama. You ran ollama pull llama3. It started downloading — 4.7 gigabytes, fine, that's normal — and then it finished and you typed your first prompt. And it responded. And it was... fine. A little slow. The kind of response that a year ago would have blown your mind, but after months of GPT-4 and Claude, just feels underwhelming.
So now you're wondering: was this worth it? And the honest answer is: it depends. On your hardware, your use case, your privacy requirements, and whether you picked the right model and tool for what you're actually trying to do.
This is the guide I wish existed when I started. No breathless "run GPT-4 on your laptop!" energy. No 3-hour setup walkthroughs for tools you'll never use. Just: what self-hosting actually is, when it makes sense, which tool to reach for, and the five mistakes that make most first attempts feel like a waste of time.
Why Self-Host an LLM in the First Place?
Before anything else, be honest with yourself about why you're doing this. The three genuinely good reasons — and the one most people actually start with.
Privacy is the strongest reason. If you're working with medical records, legal documents, customer support transcripts, or anything that shouldn't leave your infrastructure, sending it to OpenAI's API isn't just a policy concern — it's a compliance problem. HIPAA, GDPR, SOC 2, client NDAs. A local model processes your data on your hardware, end of story. This is why healthcare systems, law firms, and defense contractors are serious about self-hosting. If this is you, skip ahead to the decision framework — this guide is definitely for you.
Cost at scale is the second legitimate reason. At low volumes, API pricing is fine. At millions of tokens per day, the economics flip. A 10-person dev team spending $800K/year on API calls might spend $300K in total over three years self-hosting, once hardware and ops are amortized, versus roughly $2.4M in API bills over the same period. If your volume is high and predictable, do the math — you'll probably come out ahead.
Customization is the third reason, but it's narrower. Fine-tuning an open-source model on your codebase, your writing style, or your product's knowledge base can produce meaningfully better results for specific tasks than a general-purpose API model. This is real, but it's also the most work. Don't start here.
And then there's the honest reason most people try it: curiosity, wanting to reduce API dependency, or just not wanting to hand your data to a third party on principle. These are fine reasons. Just don't confuse them with the first three, because they'll lead you to spend more time than the reward justifies.
The Three Tools: Ollama, LM Studio, and vLLM
Here's the 90-second version so you can pick your tool and stop reading about tools.
| | Ollama | LM Studio | vLLM |
|---|---|---|---|
| Best for | Developers, fast iteration | Non-technical users, GUI exploration | Production servers, high throughput |
| Interface | CLI + REST API | Desktop GUI + local server | Python / Docker / API server |
| Setup complexity | Lowest | Low | High |
| Throughput | Moderate | Moderate | Highest |
| Multi-GPU | No | No | Yes |
| Speculative decoding | No | Limited | Yes |
| Can it serve an API? | Yes | Yes | Yes |
| Who made it | Ollama Inc. | LM Studio Inc. | UC Berkeley vLLM team |
Use Ollama if you want to get a model running in 5 minutes, iterate on prompts, or build an app on top of a local model. It's the default answer for "I just want to run something."
Use LM Studio if you prefer clicking to typing, want to hot-swap between models visually, or are exploring what's possible before committing to a setup. It has a beautiful GUI and handles quantization automatically.
Use vLLM if you're deploying a production API, running multiple GPUs, or need the highest possible throughput. It's what serious inference servers run on. The learning curve is steeper but the performance ceiling is significantly higher.
💡 Tip: Many teams in 2025 use Ollama for local development and prototyping, then migrate to vLLM when they move to production. These aren't mutually exclusive.
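That migration is cheap because both Ollama and vLLM expose OpenAI-compatible endpoints, so application code can stay identical and only the base URL (and perhaps the model tag) changes. A minimal sketch; the hostnames below are illustrative placeholders, though the ports are each tool's default:

```python
# Ollama serves an OpenAI-compatible API on port 11434 by default;
# vLLM's API server defaults to port 8000. Same paths, same request shape.
OLLAMA_DEV = "http://localhost:11434/v1"          # local dev box
VLLM_PROD = "http://inference.internal:8000/v1"   # hypothetical prod host

def chat_endpoint(base_url: str) -> str:
    """Both backends serve chat completions at the same OpenAI-style path."""
    return base_url.rstrip("/") + "/chat/completions"

print(chat_endpoint(OLLAMA_DEV))
```

Swapping environments then means changing one config value, not rewriting client code.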
What You Actually Need: Hardware, Models, and Expectations
This is where most first attempts go wrong: expectations about hardware don't match reality.
The VRAM Ladder
Model size is measured in parameters — billions of parameters, so "7B" means seven billion. The more parameters, the more capable the model generally is, and the more VRAM you need to run it at usable speed.
| Model size | Minimum VRAM | Realistic use |
|---|---|---|
| 7B | 6–8 GB | Chat, coding help, summarization — usable on a decent gaming GPU |
| 13B | 10–12 GB | Noticeably better reasoning, good for document analysis |
| 33B | 20–24 GB | Strong reasoning, some GPT-3.5-level capability |
| 70B | 40+ GB | GPT-4 tier in many benchmarks, requires serious hardware |
| 405B | 200+ GB | Datacenter-grade; not something you run at home |
If you have an NVIDIA RTX 3080, 3090, 4070, or 4080 — you're in 7B–13B territory, which is genuinely useful. An RTX 4090 or A100 opens up 33B. Anything below 6GB of VRAM and you're CPU-bound, which means slow.
The Quality Ladder
The model size story is straightforward, but the model name story is where people get lost. Here's a rough current ranking for open-source models:
For coding: CodeLlama, Qwen 2.5 Coder, and Mistral derivatives consistently outperform general-purpose models of equivalent size on code tasks.
For general reasoning: Llama 3.3, Mistral Small/Large, Qwen 2.5, and the DeepSeek-R1 distillations (reasoning models distilled into smaller Llama and Qwen checkpoints) cover most use cases well.
For multilingual: Qwen 2.5 and Mistral variants have strong non-English performance.
The biggest practical insight: a well-prompted 13B model from 2025 beats a 70B model from 2023. Use current releases, not the model size alone, as your decision variable.
Your First Local LLM in 20 Minutes
Here's the fastest path to a working local model using Ollama. If you want a GUI instead, skip to the note at the end of this section.
Step 1: Install Ollama
macOS / Linux:
```shell
# One-line install
curl -fsSL https://ollama.com/install.sh | sh
```

Windows: Download the app from ollama.com/download and run the installer. No WSL required.
Step 2: Pull and Run a Model
```shell
# Pull a model (Llama 3.3 — strong all-around, ~20GB download)
ollama pull llama3.3

# Or start smaller — Mistral 7B is fast and capable
ollama pull mistral

# Run it interactively
ollama run llama3.3
```

You'll see a prompt. Type your question. That's it — you have a local LLM running.
Step 3: Use It from an App or Script
Ollama runs a local REST API automatically. Any app that works with OpenAI's API can be pointed at your local instance:
```shell
# Example: chat completion via curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Explain quantization in one sentence"}]
  }'
```

💡 Tip: On first run, Ollama uses system RAM if you don't have a compatible GPU. It'll be slow but functional. Install NVIDIA's CUDA drivers if you have an NVIDIA GPU — Ollama detects and uses it automatically.
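The same call works from Python with nothing but the standard library. A sketch assuming Ollama is running on its default port; the network call is left commented out so the snippet is safe to read as-is:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/v1/chat/completions"  # Ollama's OpenAI-compatible endpoint

def build_payload(model: str, prompt: str) -> bytes:
    """Encode an OpenAI-style chat completion request body."""
    return json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()

def extract_reply(response_body: str) -> str:
    """OpenAI-style responses put the text under choices[0].message.content."""
    return json.loads(response_body)["choices"][0]["message"]["content"]

# With Ollama running, uncomment to make a real request:
# req = urllib.request.Request(OLLAMA_URL, data=build_payload("llama3.3", "hi"),
#                              headers={"Content-Type": "application/json"})
# print(extract_reply(urllib.request.urlopen(req).read().decode()))
```

Any OpenAI client library works the same way: point its base URL at localhost:11434/v1.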
Prefer a GUI? Download LM Studio — it handles model downloading, hot-swapping, and has a built-in chat interface. Point it at the same models Ollama uses. Many people run both and pick whichever matches their mood.
Quantization Explained: What Q4_K_M Actually Means
You've seen filenames like llama3.3-Q4_K_M.gguf and wondered what the letters mean. Here's the honest version.
GGUF is a file format for sharing quantized LLM weights. It bundles the model weights and the quantization metadata together. If you're downloading a model for Ollama or LM Studio, you'll almost always download a .gguf file.
Quantization is the process of compressing the model weights from 16-bit floating point (FP16) to a smaller format. A 7B model at FP16 takes ~14GB. The same model at Q4 takes ~3.5GB — roughly 4x smaller, with surprisingly little quality loss.
The letter-number codes:
- Q8_0 — 8-bit quantization. Near-lossless, but still large. Use when you have plenty of VRAM and want maximum quality.
- Q5_K_M — 5-bit with improved calibration. Good middle ground — you lose a small amount of quality but save significant VRAM.
- Q4_K_M — 4-bit with improved calibration. The sweet spot for most people. Significant VRAM savings, quality loss is imperceptible for general use.
- Q3_K_M — 3-bit. Noticeable quality degradation on complex reasoning tasks. Fine for simpler use cases.
- Q2_K — 2-bit. Significant quality loss. Only use when hardware is severely constrained.
💡 Practical rule: If you have enough VRAM for Q8_0, use Q8_0. Otherwise, Q4_K_M is the recommended default — it's what most benchmark comparisons use as the baseline.
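The size arithmetic behind these choices is simple enough to do in your head, or in a few lines. A sketch using nominal bits per weight; real GGUF files run slightly larger because each block also stores scale metadata:

```python
# Nominal bits per weight for common GGUF quantization levels.
QUANT_BITS = {"FP16": 16, "Q8_0": 8, "Q5_K_M": 5, "Q4_K_M": 4, "Q3_K_M": 3, "Q2_K": 2}

def approx_file_gb(params_billion: float, quant: str) -> float:
    """Approximate weight file size: parameters x bits per weight / 8 bits per byte."""
    return round(params_billion * QUANT_BITS[quant] / 8, 1)

print(approx_file_gb(7, "FP16"))    # ~14 GB, as quoted above
print(approx_file_gb(7, "Q4_K_M"))  # ~3.5 GB, the 4x shrink
```

The same formula explains the VRAM ladder: a 70B model at Q4 is still ~35 GB of weights before the KV cache, which is why it needs 40+ GB of VRAM.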
The 5 Mistakes That Make First Attempts Fail
After reading hundreds of "why does local keep failing me?" posts, five mistakes appear over and over.
1. Picking the Wrong Model for Your Hardware
Trying to run a 70B model on 8GB of VRAM will technically work — it'll load, it'll respond — but it'll be so slow as to be useless. A well-tuned 7B model on a fast GPU will outperform a struggling 70B model on insufficient VRAM every time.
The fix: Match your model size to your VRAM using the ladder above. Upgrade the model size once you know what your hardware handles comfortably.
2. No System Prompt Tuning
You downloaded the model, you type your questions, and the responses are generic and unhelpful. The default behavior of most models is heavily influenced by the system prompt. Without tuning it, you're getting the model's best guess at "what should I say" rather than "what should I say for your use case."
The fix: Start with an explicit system prompt. Ollama supports this in the CLI (/set system) and via API. Something as simple as "You are a senior software engineer reviewing pull requests. Focus on security, performance, and clarity." transforms output quality.
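Via the API, the system prompt is just the first entry in the messages array. A minimal sketch, reusing the code-review prompt from above; the helper name is ours:

```python
def with_system_prompt(system: str, user: str) -> list[dict]:
    """Build an OpenAI-style messages array with the system role up front."""
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": user},
    ]

messages = with_system_prompt(
    "You are a senior software engineer reviewing pull requests. "
    "Focus on security, performance, and clarity.",
    "Review this diff: ...",
)
```

Send that `messages` list in the request body and every turn carries the role context, no CLI `/set system` required.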
3. Ignoring Context Window Limits
Every model has a context window — the total amount of text it can "see" in a single conversation. Push past it and the model either errors or silently forgets the beginning of the conversation. Note that the window you actually get is often smaller than the model's maximum: local runtimes like Ollama default to a modest context size (a few thousand tokens) to save memory unless you raise it.
The fix: Know your context window. If you're working with long documents, explicitly check the model's supported context length before starting. Ollama lets you set context size in the Modelfile.
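A Modelfile that raises the context window is only two lines; a minimal sketch (the 16384 value and the llama3.3-16k name are arbitrary examples):

```
# Modelfile: derive a variant with a larger context window
FROM llama3.3
PARAMETER num_ctx 16384
```

Build it with `ollama create llama3.3-16k -f Modelfile` and run it like any other model. Keep in mind a larger window costs more VRAM for the KV cache.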
4. Treating Local Like GPT-4
Local models are genuinely impressive, but the best open-source models still lag GPT-4o and Claude on complex multi-step reasoning, code generation at scale, and nuanced instruction following. If you evaluate a local model by GPT-4 benchmarks, you'll be disappointed.
The fix: Evaluate local models on tasks where they excel — high-volume, repetitive, domain-specific tasks where a 90% quality model running locally beats a 98% quality model you have to pay per token for. The use case matters enormously.
5. Skipping Temperature and Sampling Settings
The default temperature (creativity/randomness setting) isn't always right. For factual Q&A, you want low temperature (deterministic, factual). For creative writing, you want higher temperature. Most people never touch this and get responses that are either too random or too stale.
The fix: Set temperature: 0.3 for factual tasks, 0.7–0.9 for creative tasks. Ollama supports this via the API. LM Studio has a slider in the GUI.
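In an OpenAI-compatible request body, temperature is just another top-level field. A sketch that maps task types to defaults; the specific values are the rules of thumb from above, not anything the servers mandate:

```python
# Illustrative per-task defaults; tune to taste.
TASK_TEMPERATURE = {
    "factual": 0.3,   # deterministic, grounded answers
    "code": 0.2,      # even less randomness for code generation
    "creative": 0.8,  # more variety for writing tasks
}

def request_options(task: str) -> dict:
    """Sampling options to merge into an OpenAI-style request body."""
    return {"temperature": TASK_TEMPERATURE.get(task, 0.7)}

print(request_options("factual"))
```

Merge the returned dict into the same JSON body as your `model` and `messages` fields.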
Is Self-Hosting Worth It? A Real Decision Framework
Here's the honest framework, skipping the hype on both sides.
The cost break-even
| Your OpenAI API spend | Recommendation |
|---|---|
| Under $500/month | Don't self-host yet. The ops overhead exceeds the savings. |
| $500–$5,000/month | Run the numbers for your specific volume and model sizes. Evaluate managed options (Groq, Replicate) first. |
| $5,000+/month | Strong case for self-hosting, especially with data privacy needs. Do the 3-year TCO analysis. |
These are rough numbers — your hardware costs, model mix, and team capacity all shift the answer. But the pattern is consistent: at low volumes, API wins. At high volumes, self-hosting wins.
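To run the numbers yourself, the core of the comparison is a one-liner: months until the upfront hardware spend is recouped by monthly savings. The figures below are illustrative, not quotes:

```python
def breakeven_months(hardware_cost: float, monthly_ops: float, monthly_api_spend: float):
    """Months until self-hosting's upfront cost is paid back.
    Returns None when ops alone cost more than the API did (never pays off)."""
    monthly_savings = monthly_api_spend - monthly_ops
    if monthly_savings <= 0:
        return None
    return hardware_cost / monthly_savings

# e.g. $36K of GPUs, $2K/month in ops, replacing a $5K/month API bill:
print(breakeven_months(36_000, 2_000, 5_000))  # 12.0 (months)
```

A real TCO analysis adds depreciation, power, and engineer time to `monthly_ops`, which is exactly why the low-volume rows above tilt toward the API.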
The privacy check
Ask yourself: would I be comfortable if this data were stored on OpenAI's servers? If the answer is no — legal documents, medical records, PII, customer conversations — then privacy outweighs the cost argument and you should self-host regardless of volume.
The hybrid path
If self-hosting for production feels like too much commitment, there's a middle ground:
- Groq — Hardware-accelerated inference via API, with a free tier. Extremely fast, not cheap at scale, but no ops required.
- Replicate — Run open-source models via API without managing servers. Pay per prediction.
- Modal, Banana, Beam — Serverless GPU infrastructure; you bring your own model code.
These aren't self-hosting, but they're also not OpenAI. For teams that need control without ops burden, they're worth evaluating.
Where to Go Next
Once you have local inference working, most people want one of three things next:
RAG (Retrieval-Augmented Generation) — Connect your local model to your own documents. Instead of the model knowing only what it was trained on, you give it a vector database of your knowledge base and it retrieves relevant context at query time. Tools like AnythingLLM, RAGFlow, and MaxKB make this approachable.
Fine-tuning — Train a base model on your specific data to get domain-specialized behavior. This is the highest-effort option but produces the most dramatic quality improvements for narrow, repetitive tasks. Tools like Axolotl and LlamaFactory have made fine-tuning much more accessible.
Embedding models — Smaller models (typically 100M–700M parameters) optimized for converting text into vector embeddings. Used for RAG, semantic search, and similarity matching. Run them locally alongside your main model for a fully private AI stack.
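The retrieval step at the heart of RAG and semantic search is nearest-neighbor lookup over those embedding vectors, usually by cosine similarity. A toy sketch with made-up 3-dimensional vectors; real embeddings have hundreds of dimensions and come from an embedding model:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: 1.0 means same direction, 0.0 means unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy "document embeddings"; in practice these come from an embedding model.
docs = {
    "refund policy": [0.9, 0.1, 0.0],
    "api reference": [0.1, 0.9, 0.2],
}
query = [0.8, 0.2, 0.1]  # embedding of the user's question

best = max(docs, key=lambda name: cosine(query, docs[name]))
print(best)  # the chunk you'd feed to the model as context
```

Tools like AnythingLLM wrap this loop (embed, store, retrieve, prompt) behind a UI, but the mechanism is this simple.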
Each of these is a full guide unto itself. The fact that you're here — having read through this one — puts you in a good position to evaluate which one is worth your time next.
The Bottom Line
Self-hosting an LLM isn't magic and it isn't for everyone. The people who get the most from it are those with real privacy requirements, high inference volumes, or specific customization needs. Everyone else is often better served by the API and a good prompt.
But if you've got a reason — and now you know what a real reason looks like — the tools have matured enough that it's genuinely accessible. Ollama for getting started. LM Studio for exploration. vLLM when you're serious. Q4_K_M as your default. Don't try to run a 70B model on hardware that can't handle it. And tune your system prompt before you decide the model is disappointing.
The technology is real. The use cases are real. The hype is mostly just hype.
Explore more: Ollama | LM Studio | vLLM | r/LocalLLaMA
Related: RAG explained | fine-tuning LLMs guide | local AI privacy