How to Run Open-Source LLMs Locally
You don't need a cloud account to build with large language models. With Ollama, LM Studio, or GPT4All, you can run open-source models on your own machine—privately, cheaply, and offline.
Why Run LLMs Locally?
Privacy & security — Your prompts and files never leave your device. Great for regulated data, proprietary code, or teams with strict compliance requirements.
Cost control — No metered API bills. After your hardware purchase, usage is effectively free.
Offline access — Keep working on planes, in datacenters with no internet, or during outages.
Customization & control — Choose the exact model, tweak behavior, and avoid third-party filters/rate limits.
Learning & speed — Try new models the day they drop and understand how they behave under the hood.
Hardware Requirements
| Model Size | CPU | RAM | GPU | VRAM | Storage | Examples |
|---|---|---|---|---|---|---|
| Small (1B–3B) | i5/Ryzen 5 | 16 GB | Optional | 4–8 GB | 10–30 GB | Phi-2, Gemma 2B, LLaMA 3.2-1B |
| Medium (7B–13B) | i7/Ryzen 7 | 32–64 GB | Recommended | 12–24 GB | 50–100 GB | LLaMA 3-8B, Mistral 7B, Gemma 7B |
| Large (30B–70B) | i9/Threadripper | 128 GB+ | Required | 48–80 GB | 150–300 GB | LLaMA 3-70B, Falcon 40B |
Good GPU picks (2025): RTX 4090 (24 GB) for 13B–30B, RTX 4080 (16 GB) for quantized 13B, RTX 3090 (24 GB) as a value option.
Apple Silicon: M1/M2/M3 Max or Ultra run GGUF models very well.
CPU-only: Works, but 10–50× slower—use 4-bit or 5-bit quantized models for acceptable speed.
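As a rough rule of thumb, a quantized model's weights take about (parameters × bits per weight) ÷ 8 bytes, plus extra room for the KV cache and runtime buffers. A quick back-of-envelope sketch (the 20% overhead margin is an assumption, not a measured figure):

```python
def approx_model_gb(n_params_billion, bits_per_weight, overhead=1.2):
    """Rough memory footprint in GB for a quantized model.
    overhead=1.2 is an assumed ~20% margin for KV cache and runtime buffers."""
    bytes_total = n_params_billion * 1e9 * bits_per_weight / 8
    return bytes_total * overhead / 1e9

# A 7B model at 4-bit fits in ~4 GB; the same model at FP16 needs ~17 GB,
# which is why quantized builds are the default for consumer hardware.
print(f"7B @ Q4:   {approx_model_gb(7, 4):.1f} GB")
print(f"7B @ FP16: {approx_model_gb(7, 16):.1f} GB")
```

This is why the table above pairs 7B–13B models with 12–24 GB of VRAM: at 4–5 bits per weight the whole model fits on the GPU with room left for context.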
The Three Ways to Run LLMs Locally
A) Ollama (fastest path for developers)
A CLI that feels like "Docker for LLMs." Pull a model, run it, call a local API.
Why you'd pick it:
- Quick setup, large model library, great streaming speed
- Built-in, OpenAI-compatible API on localhost:11434
- macOS, Windows, and Linux
Install:
# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS/Windows: download the installer from https://ollama.com
Pull & run a model:
ollama pull llama3 # Meta LLaMA 3 (8B)
ollama pull mistral # Mistral 7B
ollama run llama3
Useful commands:
| Command | Purpose |
|---|---|
| ollama list | Show downloaded models |
| ollama pull <model> | Download a model |
| ollama run <model> | Chat in the terminal |
| ollama serve | Start the local API server |
Call the API (Python):
import requests
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(r.json()["response"])
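The example above waits for the full response. Ollama can also stream tokens as they are generated: with "stream": true, the server sends one JSON object per line, each carrying a "response" fragment and a "done" flag. A minimal sketch (assumes `ollama serve` is running locally):

```python
import json

import requests

def stream_generate(prompt, model="llama3",
                    url="http://localhost:11434/api/generate"):
    """Yield response fragments from Ollama's streaming endpoint.
    With stream=True, the server returns newline-delimited JSON objects."""
    payload = {"model": model, "prompt": prompt, "stream": True}
    with requests.post(url, json=payload, stream=True) as r:
        r.raise_for_status()
        for line in r.iter_lines():
            if not line:
                continue
            chunk = json.loads(line)
            yield chunk.get("response", "")
            if chunk.get("done"):
                break

# Print tokens as they arrive (requires a running Ollama server):
# for token in stream_generate("Why is the sky blue?"):
#     print(token, end="", flush=True)
```

Streaming makes the terminal feel responsive even on slower hardware, since the first tokens appear long before the full answer is done.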
B) LM Studio (best desktop GUI)
A polished desktop app with a built-in chat UI and local API. Works with GGUF models (llama.cpp).
Why you'd pick it:
- Point-and-click downloads with hardware compatibility hints
- OpenAI-compatible local API
- Load multiple models and compare responses
Install & run:
- Download the installer from https://lmstudio.ai (Mac/Win/Linux)
- Open Discover → search for "LLaMA 3" or "Mistral 7B"
- Click Download (LM Studio picks a suitable quant)
- Select the model and start chatting
Use the local API (Python):
from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")
chat = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}],
)
print(chat.choices[0].message.content)
C) GPT4All (most beginner-friendly)
A simple desktop chatbot with curated models and excellent CPU-only performance. Includes document chat.
Why you'd pick it:
- No terminal required
- Works well without a GPU
- Upload PDFs/Word/text and ask questions locally
Install & run:
- Download the installer from https://gpt4all.io
- First launch → choose a model (e.g., "LLaMA 3 Instruct" or "Mistral 7B")
- Start chatting
- For document chat, upload files and ask questions—everything stays offline
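GPT4All also ships Python bindings (the gpt4all package) if you outgrow the GUI. A minimal sketch; the model filename below is an example, so substitute any GGUF model you've downloaded:

```python
def ask_locally(prompt, model_file="Meta-Llama-3-8B-Instruct.Q4_0.gguf",
                max_tokens=200):
    """Answer one prompt with the gpt4all Python bindings.
    model_file is an example name; pass any GGUF model you have."""
    from gpt4all import GPT4All  # local import: only needed if you use the bindings

    model = GPT4All(model_file)  # downloads the model on first use
    with model.chat_session():   # keeps multi-turn context across generate() calls
        return model.generate(prompt, max_tokens=max_tokens)

# print(ask_locally("Explain quantization in one sentence."))
```

This runs entirely on CPU by default, which matches GPT4All's strength: no GPU, no server process, just a library call.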
Quick Comparison
| Feature | Ollama | LM Studio | GPT4All |
|---|---|---|---|
| Setup | CLI (fast for devs) | GUI (easy) | GUI (easiest) |
| Interface | Terminal + API | Desktop UI + API | Desktop UI |
| Performance | Very fast | Fast | Good (great on CPU) |
| Model formats | Many (GGUF via llama.cpp backends) | GGUF | GGUF |
| API | Built-in | Built-in | Python bindings |
| Doc chat | Via libraries | — | Built-in |
| Best for | Developers | Visual users | Beginners/CPU |
Troubleshooting
Out-of-memory (OOM)
- Switch to a smaller model (llama3:8b)
- Use heavier quantization (Q8 → Q5 → Q4)
- Reduce generation length (num_predict in Ollama, max_new_tokens in Transformers)
- On LM Studio, lower GPU layers (offload to CPU)
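You can also cap memory use per request: Ollama's API accepts an options object, where num_predict limits how many tokens are generated and num_ctx shrinks the context window. A sketch (assumes a running Ollama server; the specific values are illustrative):

```python
import requests

def generate_capped(prompt, model="llama3", num_predict=256, num_ctx=2048):
    """Call Ollama with a capped output length and a smaller context
    window, both of which reduce memory pressure during generation."""
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False,
            "options": {"num_predict": num_predict, "num_ctx": num_ctx},
        },
    )
    r.raise_for_status()
    return r.json()["response"]
```

Shrinking num_ctx is often the bigger win, since the KV cache grows with the context window rather than with the output length.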
Slow Responses
- Check GPU with nvidia-smi (Linux/Windows)
- Close memory-hungry apps
- Prefer quantized builds (Q4/Q5)
- Ensure the GPU is actually being used (Ollama auto-detects CUDA/Metal)
"Model not found" (Ollama)
ollama list
ollama pull llama3 # verify exact spelling
Best Practices
Start small — 7B–8B models first; scale up when stable
Default to quantized — Q4_K_M is the sweet spot for most setups
Watch resources:
nvidia-smi # GPU
htop # CPU/RAM
df -h # Disk
Keep models tidy:
~/models/
├── llama3-8b-q4/
├── mistral-7b-q5/
└── gemma-2b-q4/
Update tools regularly:
brew upgrade ollama (Mac) or grab the latest installers; keep LM Studio / GPT4All current.
Resources
- Ollama — https://ollama.com • https://github.com/ollama/ollama
- LM Studio — https://lmstudio.ai
- GPT4All — https://gpt4all.io
- Meta LLaMA — https://llama.com • https://github.com/meta-llama/llama
Wrap-Up
Use Ollama if you're comfortable in a terminal and want an API. Use LM Studio if you prefer a desktop app with a slick UI. Use GPT4All if you're new to this (or CPU-only) and want simple document chat.
That's it—private, low-cost, and ready to build with.


