Self-Hosted LLM Guide: Skip the Hype and Get It Running


You downloaded Ollama. You ran ollama pull llama3. It started downloading — 4.7 gigabytes, fine, that's normal — and then it finished and you typed your first prompt. And it responded. And it was... fine. A little slow. The kind of response that a year ago would have blown your mind, but after months of GPT-4 and Claude, just feels underwhelming.

So now you're wondering: was this worth it? And the honest answer is: it depends. On your hardware, your use case, your privacy requirements, and whether you picked the right model and tool for what you're actually trying to do.

This is the guide I wish existed when I started. No breathless "run GPT-4 on your laptop!" energy. No 3-hour setup walkthroughs for tools you'll never use. Just: what self-hosting actually is, when it makes sense, which tool to reach for, and the five mistakes that make most first attempts feel like a waste of time.


Why Self-Host an LLM in the First Place?

Before anything else, be honest with yourself about why you're doing this. The three genuinely good reasons — and the one most people actually start with.

Privacy is the strongest reason. If you're working with medical records, legal documents, customer support transcripts, or anything that shouldn't leave your infrastructure, sending it to OpenAI's API isn't just a policy concern — it's a compliance problem. HIPAA, GDPR, SOC 2, client NDAs. A local model processes your data on your hardware, end of story. This is why healthcare systems, law firms, and defense contractors are serious about self-hosting. If this is you, skip ahead to the decision framework — this guide is definitely for you.

Cost at scale is the second legitimate reason. At low volumes, API pricing is fine. At millions of tokens per day, the economics flip. A 10-person dev team spending $800K/year on API calls (roughly $2.4M over three years) might spend $300K self-hosting over the same period, once hardware and ops are amortized. If your volume is high and predictable, do the math — you'll probably come out ahead.

Customization is the third reason, but it's narrower. Fine-tuning an open-source model on your codebase, your writing style, or your product's knowledge base can produce meaningfully better results for specific tasks than a general-purpose API model. This is real, but it's also the most work. Don't start here.

And then there's the honest reason most people try it: curiosity, wanting to reduce API dependency, or just not wanting to hand your data to a third party on principle. These are fine reasons. Just don't confuse them with the first three, because they'll lead you to spend more time than the reward justifies.


The Three Tools: Ollama, LM Studio, and vLLM

Here's the 90-second version so you can pick your tool and stop reading about tools.

| | Ollama | LM Studio | vLLM |
|---|---|---|---|
| Best for | Developers, fast iteration | Non-technical users, GUI exploration | Production servers, high throughput |
| Interface | CLI + REST API | Desktop GUI + local server | Python / Docker / API server |
| Setup complexity | Lowest | Low | High |
| Throughput | Moderate | Moderate | Highest |
| Multi-GPU | No | No | Yes |
| Speculative decoding | No | Limited | Yes |
| Can it serve an API? | Yes | Yes | Yes |
| Who made it | Ollama Inc. | LM Studio Inc. | UC Berkeley vLLM team |

Use Ollama if you want to get a model running in 5 minutes, iterate on prompts, or build an app on top of a local model. It's the default answer for "I just want to run something."

Use LM Studio if you prefer clicking to typing, want to hot-swap between models visually, or are exploring what's possible before committing to a setup. It has a beautiful GUI and handles quantization automatically.

Use vLLM if you're deploying a production API, running multiple GPUs, or need the highest possible throughput. It's what serious inference servers run on. The learning curve is steeper but the performance ceiling is significantly higher.

💡 Tip: Many teams in 2025 use Ollama for local development and prototyping, then migrate to vLLM when they move to production. These aren't mutually exclusive.
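
Part of why that migration is low-friction: both servers speak the OpenAI chat-completions dialect, so client code can often stay identical and only the base URL changes between environments. A minimal sketch (the ports are the common defaults, 11434 for Ollama and 8000 for vLLM, not guarantees):

```python
# Both servers expose an OpenAI-compatible /v1/chat/completions route,
# so switching backends is mostly a base-URL swap.
BACKENDS = {
    "dev": "http://localhost:11434/v1",   # Ollama during development
    "prod": "http://localhost:8000/v1",   # vLLM's OpenAI-compatible server
}

def chat_endpoint(env: str) -> str:
    """Return the chat-completions URL for a given environment."""
    return BACKENDS[env] + "/chat/completions"
```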


What You Actually Need: Hardware, Models, and Expectations

This is where most first attempts go wrong: expectations about hardware don't match reality.

The VRAM Ladder

Model size is measured in parameters — billions of parameters, so "7B" means seven billion. The more parameters, the more capable the model generally is, and the more VRAM you need to run it at usable speed.

| Model size | Minimum VRAM | Realistic use |
|---|---|---|
| 7B | 6–8 GB | Chat, coding help, summarization — usable on a decent gaming GPU |
| 13B | 10–12 GB | Noticeably better reasoning, good for document analysis |
| 33B | 20–24 GB | Strong reasoning, some GPT-3.5-level capability |
| 70B | 40+ GB | GPT-4 tier in many benchmarks, requires serious hardware |
| 405B | 200+ GB | Datacenter-grade; not something you run at home |

If you have an NVIDIA RTX 3080, 3090, 4070, or 4080 — you're in 7B–13B territory, which is genuinely useful. An RTX 4090 or A100 opens up 33B. Anything below 6GB of VRAM and you're CPU-bound, which means slow.
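
If you want to sanity-check the ladder against your own card, the arithmetic is simple: parameter count times bytes per weight, plus headroom for the KV cache and activations. A rough sketch, where the 20% overhead factor is a working assumption rather than a spec:

```python
# Rough VRAM estimate: parameters x bytes per weight, plus ~20% headroom
# for KV cache and activations. The 1.2 factor is an assumption; long
# contexts need more.
def estimate_vram_gb(params_billions: float, bits_per_weight: int,
                     overhead: float = 1.2) -> float:
    weight_gb = params_billions * bits_per_weight / 8  # 1e9 params cancels 1e9 bytes/GB
    return round(weight_gb * overhead, 1)

print(estimate_vram_gb(7, 16))   # FP16 7B   -> 16.8
print(estimate_vram_gb(7, 4))    # 4-bit 7B  -> 4.2
print(estimate_vram_gb(70, 4))   # 4-bit 70B -> 42.0, matching the 40+ GB row
```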

The Quality Ladder

The model size story is straightforward, but the model name story is where people get lost. Here's a rough current ranking for open-source models:

For coding: CodeLlama, Qwen 2.5 Coder, and Mistral derivatives consistently outperform general-purpose models of equivalent size on code tasks.

For general reasoning: Llama 3.3, Mistral Small/Large, Qwen 2.5, and DeepSeek's R1 distillations (reasoning behavior distilled into smaller Qwen and Llama bases) cover most use cases well.

For multilingual: Qwen 2.5 and Mistral variants have strong non-English performance.

The biggest practical insight: a well-prompted 13B model from 2025 beats a 70B model from 2023. Make release recency, not parameter count alone, your decision variable.


Your First Local LLM in 20 Minutes

Here's the fastest path to a working local model using Ollama. If you want a GUI instead, skip to the note at the end of this section.

Step 1: Install Ollama

macOS / Linux:

# One-line install
curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the app from ollama.com/download and run the installer. No WSL required.

Step 2: Pull and Run a Model

# Pull a model (Llama 3.3 is strong all-around, but note it's a 70B model and a ~40GB download)
ollama pull llama3.3

# Or start smaller — Mistral 7B is fast and capable
ollama pull mistral

# Run it interactively
ollama run llama3.3

You'll see a prompt. Type your question. That's it — you have a local LLM running.

Step 3: Use It from an App or Script

Ollama runs a local REST API automatically. Any app that works with OpenAI's API can be pointed at your local instance:

# Example: chat completion via curl
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.3",
    "messages": [{"role": "user", "content": "Explain quantization in one sentence"}]
  }'
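
The same request works from a script. Here's a standard-library Python sketch of the curl call above, assuming Ollama is running on its default port with the model already pulled:

```python
import json
import urllib.request

# Build the same chat-completions request as the curl example.
# Assumes a local Ollama instance with llama3.3 pulled.
def ask(prompt: str, model: str = "llama3.3") -> urllib.request.Request:
    body = json.dumps({
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }).encode()
    return urllib.request.Request(
        "http://localhost:11434/v1/chat/completions",
        data=body,
        headers={"Content-Type": "application/json"},
    )

# To actually send it (needs a running Ollama instance):
# with urllib.request.urlopen(ask("Explain quantization in one sentence")) as r:
#     print(json.load(r)["choices"][0]["message"]["content"])
```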

💡 Tip: On first run, Ollama uses system RAM if you don't have a compatible GPU. It'll be slow but functional. Install NVIDIA's CUDA drivers if you have an NVIDIA GPU — Ollama detects and uses it automatically.

Prefer a GUI? Download LM Studio — it handles model downloading, hot-swapping, and has a built-in chat interface. It runs the same open GGUF models, though it keeps its own model folder rather than sharing Ollama's. Many people run both and pick whichever matches their mood.


Quantization Explained: What Q4_K_M Actually Means

You've seen filenames like llama3.3-Q4_K_M.gguf and wondered what the letters mean. Here's the honest version.

GGUF is a file format for sharing quantized LLM weights. It bundles the model weights and the quantization metadata together. If you're downloading a model for Ollama or LM Studio, you'll almost always download a .gguf file.

Quantization is the process of compressing the model weights from 16-bit floating point (FP16) to a smaller format. A 7B model at FP16 takes ~14GB. The same model at Q4 takes ~3.5GB — roughly 4x smaller, with surprisingly little quality loss.

The letter-number codes:

  • Q8_0 — 8-bit quantization. Near-lossless, but still large. Use when you have plenty of VRAM and want maximum quality.
  • Q5_K_M — 5-bit with improved calibration. Good middle ground — you lose a small amount of quality but save significant VRAM.
  • Q4_K_M — 4-bit with improved calibration. The sweet spot for most people. Significant VRAM savings, quality loss is imperceptible for general use.
  • Q3_K_M — 3-bit. Noticeable quality degradation on complex reasoning tasks. Fine for simpler use cases.
  • Q2_K — 2-bit. Significant quality loss. Only use when hardware is severely constrained.

💡 Practical rule: If you have enough VRAM for Q8_0, use Q8_0. Otherwise, Q4_K_M is the recommended default — it's what most benchmark comparisons use as the baseline.
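
The size arithmetic behind these trade-offs is just bits per weight. A small sketch for a 7B model; note that K-quants mix precisions internally, so real .gguf files run somewhat larger than these nominal numbers:

```python
# Nominal weight-file size at each bit width for a 7B model.
# Real .gguf files are somewhat larger because K-quants mix precisions.
def nominal_size_gb(params_billions: float, bits: float) -> float:
    return round(params_billions * bits / 8, 2)

for name, bits in [("FP16", 16), ("Q8_0", 8), ("Q5_K_M", 5),
                   ("Q4_K_M", 4), ("Q3_K_M", 3), ("Q2_K", 2)]:
    print(f"{name}: ~{nominal_size_gb(7, bits)} GB")
```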


The 5 Mistakes That Make First Attempts Fail

After reading hundreds of "why does local keep failing me?" posts, five mistakes appear over and over.

1. Picking the Wrong Model for Your Hardware

Trying to run a 70B model on 8GB of VRAM will technically work — it'll load, it'll respond — but it'll be so slow as to be useless. A well-tuned 7B model on a fast GPU will outperform a struggling 70B model on insufficient VRAM every time.

The fix: Match your model size to your VRAM using the ladder above. Upgrade the model size once you know what your hardware handles comfortably.

2. No System Prompt Tuning

You downloaded the model, you type your questions, and the responses are generic and unhelpful. The default behavior of most models is heavily influenced by the system prompt. Without tuning it, you're getting the model's best guess at "what should I say" rather than "what should I say for your use case."

The fix: Start with an explicit system prompt. Ollama supports this in the CLI (/set system) and via API. Something as simple as "You are a senior software engineer reviewing pull requests. Focus on security, performance, and clarity." transforms output quality.
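
One convenient way to make this stick in Ollama is to bake the system prompt into a custom model variant with a Modelfile (the `pr-reviewer` name below is just an example):

```text
# Modelfile
FROM llama3.3
SYSTEM """You are a senior software engineer reviewing pull requests.
Focus on security, performance, and clarity."""
```

Then `ollama create pr-reviewer -f Modelfile` followed by `ollama run pr-reviewer` gives you a variant that always starts from that instruction.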

3. Ignoring Context Window Limits

Every model has a context window — the total amount of text it can "see" in a single conversation. Push past it and the model either errors or silently forgets the beginning of the conversation. Be aware that Ollama also defaults to a small active window (a few thousand tokens), even for models that advertise far longer limits, until you raise it.

The fix: Know your context window. If you're working with long documents, explicitly check the model's supported context length before starting. Ollama lets you set context size in the Modelfile.
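
For example, a Modelfile that raises the active window to 8K tokens (assuming the underlying model actually supports that length):

```text
FROM llama3.3
PARAMETER num_ctx 8192
```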

4. Treating Local Like GPT-4

Local models are genuinely impressive, but the best open-source models still lag GPT-4o and Claude on complex multi-step reasoning, code generation at scale, and nuanced instruction following. If you evaluate a local model by GPT-4 benchmarks, you'll be disappointed.

The fix: Evaluate local models on tasks where they excel — high-volume, repetitive, domain-specific tasks where a 90% quality model running locally beats a 98% quality model you have to pay per token for. The use case matters enormously.

5. Skipping Temperature and Sampling Settings

The default temperature (creativity/randomness setting) isn't always right. For factual Q&A, you want low temperature (deterministic, factual). For creative writing, you want higher temperature. Most people never touch this and get responses that are either too random or too stale.

The fix: Set temperature: 0.3 for factual tasks, 0.7–0.9 for creative tasks. Ollama supports this via the API. LM Studio has a slider in the GUI.
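
In a script, this can be as simple as picking the value per task type before building the request body for the OpenAI-compatible endpoint. A sketch, where the 0.3/0.8 split just mirrors the rule of thumb above:

```python
# Choose temperature by task type before building the chat request body.
# 0.3 for factual, 0.8 for creative: heuristics, not hard requirements.
def chat_payload(prompt: str, task: str = "factual",
                 model: str = "llama3.3") -> dict:
    temperature = 0.3 if task == "factual" else 0.8
    return {
        "model": model,
        "temperature": temperature,
        "messages": [{"role": "user", "content": prompt}],
    }
```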


Is Self-Hosting Worth It? A Real Decision Framework

Here's the honest framework, skipping the hype on both sides.

The cost break-even

| Your OpenAI API spend | Recommendation |
|---|---|
| Under $500/month | Don't self-host yet. The ops overhead exceeds the savings. |
| $500–$5,000/month | Run the numbers for your specific volume and model sizes. Evaluate managed options (Groq, Replicate) first. |
| $5,000+/month | Strong case for self-hosting, especially with data privacy needs. Do the 3-year TCO analysis. |

These are rough numbers — your hardware costs, model mix, and team capacity all shift the answer. But the pattern is consistent: at low volumes, API wins. At high volumes, self-hosting wins.
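
The pattern is easy to check against your own numbers. A back-of-envelope sketch, where every figure is an illustrative assumption to be replaced with your real API spend, hardware cost, and ops estimates:

```python
# Back-of-envelope 3-year cost comparison. All inputs are illustrative
# assumptions: substitute your own spend, hardware, and ops numbers.
def three_year_cost_api(monthly_spend: float) -> float:
    return monthly_spend * 36

def three_year_cost_selfhost(hardware: float, monthly_ops: float) -> float:
    return hardware + monthly_ops * 36

api = three_year_cost_api(5_000)                 # $5K/month in API calls
local = three_year_cost_selfhost(60_000, 2_500)  # GPU server + power/ops (assumed)
print(f"API: ${api:,.0f}  Self-host: ${local:,.0f}  self-host cheaper: {local < api}")
```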

The privacy check

Ask yourself: would I be comfortable if this data were stored on OpenAI's servers? If the answer is no — legal documents, medical records, PII, customer conversations — then privacy outweighs the cost argument and you should self-host regardless of volume.

The hybrid path

If self-hosting for production feels like too much commitment, there's a middle ground:

  • Groq — Hardware-accelerated inference via API, with a free tier. Extremely fast, not cheap at scale, but no ops required.
  • Replicate — Run open-source models via API without managing servers. Pay per prediction.
  • Modal, Banana, Beam — Serverless GPU infrastructure; you bring your own model code.

These aren't self-hosting, but they're also not OpenAI. For teams that need control without ops burden, they're worth evaluating.


Where to Go Next

Once you have local inference working, most people want one of three things next:

RAG (Retrieval-Augmented Generation) — Connect your local model to your own documents. Instead of the model knowing only what it was trained on, you give it a vector database of your knowledge base and it retrieves relevant context at query time. Tools like AnythingLLM, RAGFlow, and MaxKB make this approachable.
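
The core retrieval step is simpler than the tooling suggests: embed the query, rank your chunks by similarity, and prepend the best matches to the prompt. A toy sketch, with hand-made 2-D vectors standing in for a real embedding model:

```python
import math

# Toy retrieval step of RAG: rank chunks by cosine similarity to the
# query embedding. Real setups use a local embedding model and a vector DB.
def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

def top_k(query_vec, chunks, k=2):
    """chunks: list of (text, embedding) pairs; returns the k closest texts."""
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, c[1]), reverse=True)
    return [text for text, _ in ranked[:k]]

docs = [("refund policy", [0.9, 0.1]), ("gpu setup guide", [0.1, 0.9])]
print(top_k([0.2, 0.8], docs, k=1))  # the GPU doc is closest to this query
```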

Fine-tuning — Train a base model on your specific data to get domain-specialized behavior. This is the highest-effort option but produces the most dramatic quality improvements for narrow, repetitive tasks. Tools like Axolotl and LlamaFactory have made fine-tuning much more accessible.

Embedding models — Smaller models (typically 100M–700M parameters) optimized for converting text into vector embeddings. Used for RAG, semantic search, and similarity matching. Run them locally alongside your main model for a fully private AI stack.

Each of these is a full guide unto itself. The fact that you're here — having read through this one — puts you in a good position to evaluate which one is worth your time next.


The Bottom Line

Self-hosting an LLM isn't magic and it isn't for everyone. The people who get the most from it are those with real privacy requirements, high inference volumes, or specific customization needs. Everyone else is often better served by the API and a good prompt.

But if you've got a reason — and now you know what a real reason looks like — the tools have matured enough that it's genuinely accessible. Ollama for getting started. LM Studio for exploration. vLLM when you're serious. Q4_K_M as your default. Don't try to run a 70B model on hardware that can't handle it. And tune your system prompt before you decide the model is disappointing.

The technology is real. The use cases are real. The hype is mostly just hype.


Explore more: Ollama | LM Studio | vLLM | r/LocalLLaMA


