How to Run Open-Source LLMs Locally
You don't need a cloud account to build with large language models. With Ollama, LM Studio, or GPT4All, you can run open-source models on your own machine—privately, cheaply, and offline.
Why Run LLMs Locally?
Privacy & security — Your prompts and files never leave your device. Great for regulated data, proprietary code, or teams with strict compliance requirements.
Cost control — No metered API bills. After your hardware purchase, usage is effectively free.
Offline access — Keep working on planes, in datacenters with no internet, or during outages.
Customization & control — Choose the exact model, tweak behavior, and avoid third-party filters/rate limits.
Learning & speed — Try new models the day they drop and understand how they behave under the hood.
Hardware Requirements
| Model Size | CPU | RAM | GPU | VRAM | Storage | Examples |
|---|---|---|---|---|---|---|
| Small (1B–3B) | i5/Ryzen 5 | 16 GB | Optional | 4–8 GB | 10–30 GB | Phi-2, Gemma 2B, LLaMA 3.2-1B |
| Medium (7B–13B) | i7/Ryzen 7 | 32–64 GB | Recommended | 12–24 GB | 50–100 GB | LLaMA 3-8B, Mistral 7B, Gemma 7B |
| Large (30B–70B) | i9/Threadripper | 128 GB+ | Required | 48–80 GB | 150–300 GB | LLaMA 3-70B, Falcon 40B |
Good GPU picks (2025): RTX 4090 (24 GB) for 13B–30B, RTX 4080 (16 GB) for quantized 13B, RTX 3090 (24 GB) as a value option.
Apple Silicon: M1/M2/M3 Max or Ultra run GGUF models very well.
CPU-only: Works, but 10–50× slower—use 4-bit or 5-bit quantized models for acceptable speed.
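As a rough rule of thumb, a quantized model's weights take about (parameters × bits per weight) ÷ 8 bytes, plus extra room for the KV cache and runtime buffers. A quick back-of-envelope sketch (the 20% overhead margin is an assumption, not a measured figure):

```python
def approx_model_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough memory footprint in GB for a quantized model.
    overhead=1.2 is an assumed ~20% margin for KV cache and runtime buffers."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 7B model at 4-bit fits in ~4 GB; the same model at FP16 needs ~17 GB,
# which is why quantized builds are the default for consumer hardware.
print(f"7B @ Q4:   {approx_model_gb(7, 4):.1f} GB")
print(f"7B @ FP16: {approx_model_gb(7, 16):.1f} GB")
```

This is why the table above pairs 7B–13B models with 12–24 GB of VRAM: at 4–5 bits per weight the whole model fits on the GPU with room left for context.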
The Three Ways to Run LLMs Locally
A) Ollama (fastest path for developers)
A CLI that feels like "Docker for LLMs." Pull a model, run it, call a local API.
Why you'd pick it:
- Quick setup, large model library, great streaming speed
- Built-in, OpenAI-compatible API on localhost:11434
- macOS, Windows, and Linux
Install:
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS/Windows: download the installer from https://ollama.com
Pull & run a model:
ollama pull llama3 # Meta LLaMA 3 (8B)
ollama pull mistral # Mistral 7B
ollama run llama3
Useful commands:
| Command | Purpose |
|---|---|
| ollama list | Show downloaded models |
| ollama pull <model> | Download a model |
| ollama run <model> | Chat in the terminal |
| ollama serve | Start the local API server |
Call the API (Python):
import requests
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(r.json()["response"])
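The example above waits for the full response. Ollama can also stream tokens as they are generated: with "stream": true, the server sends one JSON object per line, each carrying a "response" fragment and a "done" flag. A minimal sketch (assumes `ollama serve` is running locally):

```python
import json

import requests

def stream_generate(prompt, model="llama3",
                    url="http://localhost:11434/api/generate"):
    """Yield response fragments from Ollama's streaming endpoint.
    With stream=True, the server returns newline-delimited JSON objects."""
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            yield chunk.get("response", "")
            if chunk.get("done"):
                break

# Print tokens as they arrive (requires a running Ollama server):
# for token in stream_generate("Why is the sky blue?"):
#     print(token, end="", flush=True)
```

Streaming makes the terminal feel responsive even on slower hardware, since the first tokens appear long before the full answer is done.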
B) LM Studio (best desktop GUI)
A polished desktop app with a built-in chat UI and local API. Works with GGUF models (llama.cpp).
Why you'd pick it:
- Point-and-click downloads with hardware compatibility hints
- OpenAI-compatible local API
- Load multiple models and compare responses
Install & run:
- Download the installer from https://lmstudio.ai (Mac/Win/Linux)
- Open Discover → search for "LLaMA 3" or "Mistral 7B"
- Click Download (LM Studio picks a suitable quant)
- Select the model and start chatting
Use the local API (Python):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
chat = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}],
)
print(chat.choices[0].message.content)
C) GPT4All (most beginner-friendly)
A simple desktop chatbot with curated models and excellent CPU-only performance. Includes document chat.
Why you'd pick it:
- No terminal required
- Works well without a GPU
- Upload PDFs/Word/text and ask questions locally
Install & run:
- Download the installer from https://gpt4all.io
- First launch → choose a model (e.g., "LLaMA 3 Instruct" or "Mistral 7B")
- Start chatting
- For document chat, upload files and ask questions—everything stays offline
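GPT4All also ships Python bindings (the gpt4all package) if you outgrow the GUI. A minimal sketch; the model filename below is an example, so substitute any GGUF model you've downloaded:

```python
def ask_locally(prompt, model_file="Meta-Llama-3-8B-Instruct.Q4_0.gguf",
                max_tokens=200):
    """Answer one prompt with the gpt4all Python bindings.
    model_file is an example name; pass any GGUF model you have."""
    from gpt4all import GPT4All  # local import: only needed if you use the bindings

    model = GPT4All(model_file)  # downloads the model on first use
    with model.chat_session():   # keeps multi-turn context across generate() calls
        return model.generate(prompt, max_tokens=max_tokens)

# print(ask_locally("Explain quantization in one sentence."))
```

This runs entirely on CPU by default, which matches GPT4All's strength: no GPU, no server process, just a library call.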
Quick Comparison
| Feature | Ollama | LM Studio | GPT4All |
|---|---|---|---|
| Setup | CLI (fast for devs) | GUI (easy) | GUI (easiest) |
| Interface | Terminal + API | Desktop UI + API | Desktop UI |
| Performance | Very fast | Fast | Good (great on CPU) |
| Model formats | Many (GGUF via llama.cpp backends) | GGUF | GGUF |
| API | Built-in | Built-in | Python bindings |
| Doc chat | Via libraries | — | Built-in |
| Best for | Developers | Visual users | Beginners/CPU |
Troubleshooting
Out-of-memory (OOM)
- Switch to a smaller model (llama3:8b)
- Use heavier quantization (Q8 → Q5 → Q4)
- Reduce generation length (num_predict in Ollama, max_new_tokens in Transformers)
- On LM Studio, lower GPU layers (offload to CPU)
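You can also cap memory use per request: Ollama's API accepts an options object, where num_predict limits how many tokens are generated and num_ctx shrinks the context window. A sketch (assumes a running Ollama server; the specific values are illustrative):

```python
import requests

def generate_capped(prompt, model="llama3", num_predict=256, num_ctx=2048):
    """Call Ollama with a capped output length and a smaller context
    window, both of which reduce memory pressure during generation."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": num_predict, "num_ctx": num_ctx},
        },
    )
    r.raise_for_status()
    return r.json()["response"]
```

Shrinking num_ctx is often the bigger win, since the KV cache grows with the context window rather than with the output length.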
Slow Responses
- Check GPU with nvidia-smi (Linux/Windows)
- Close memory-hungry apps
- Prefer quantized builds (Q4/Q5)
- Ensure the GPU is actually being used (Ollama auto-detects CUDA/Metal)
"Model not found" (Ollama)
ollama list
ollama pull llama3 # verify exact spelling
Best Practices
Start small — 7B–8B models first; scale up when stable
Default to quantized — Q4_K_M is the sweet spot for most setups
Watch resources:
nvidia-smi # GPU
htop # CPU/RAM
df -h # Disk
Keep models tidy:
~/models/
├── llama3-8b-q4/
├── mistral-7b-q5/
└── gemma-2b-q4/
Update tools regularly:
brew upgrade ollama (Mac) or grab the latest installers; keep LM Studio / GPT4All current.
Resources
- Ollama — https://ollama.com • https://github.com/ollama/ollama
- LM Studio — https://lmstudio.ai
- GPT4All — https://gpt4all.io
- Meta LLaMA — https://llama.com • https://github.com/meta-llama/llama
Wrap-Up
Use Ollama if you're comfortable in a terminal and want an API. Use LM Studio if you prefer a desktop app with a slick UI. Use GPT4All if you're new to this (or CPU-only) and want simple document chat.
That's it—private, low-cost, and ready to build with.


