How to Run Open-Source LLMs Locally

You don’t need a cloud account to build with large language models. With Ollama, LM Studio, or GPT4All, you can run open-source models on your own machine—privately, cheaply, and offline.

Why Run LLMs Locally?

Privacy & security — Your prompts and files never leave your device. Great for regulated data, proprietary code, or teams with strict compliance requirements.

Cost control — No metered API bills. After your hardware purchase, usage is effectively free.

Offline access — Keep working on planes, in datacenters with no internet, or during outages.

Customization & control — Choose the exact model, tweak behavior, and avoid third-party filters/rate limits.

Learning & speed — Try new models the day they drop and understand how they behave under the hood.

Hardware Requirements

| Model Size | CPU | RAM | GPU VRAM | Storage | Examples |
|---|---|---|---|---|---|
| Small (1B–3B) | i5 / Ryzen 5 | 16 GB | Optional (4–8 GB) | 10–30 GB | Phi-2, Gemma 2B, LLaMA 3.2-1B |
| Medium (7B–13B) | i7 / Ryzen 7 | 32–64 GB | Recommended (12–24 GB) | 50–100 GB | LLaMA 3-8B, Mistral 7B, Gemma 7B |
| Large (30B–70B) | i9 / Threadripper | 128 GB+ | Required (48–80 GB) | 150–300 GB | LLaMA 3-70B, Falcon 40B |

Good GPU picks (2025): RTX 4090 (24 GB) for 13B–30B models, RTX 4080 (16 GB) for quantized 13B, RTX 3090 (24 GB) as a value option.

Apple Silicon: M1/M2/M3 Max or Ultra run GGUF models very well.

CPU-only: Works, but 10–50× slower—use 4-bit or 5-bit quantized models for acceptable speed.
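
A quick back-of-the-envelope check before you download anything (a rough sketch: it counts weight bytes plus an assumed ~20% runtime overhead, and ignores the KV cache, which grows with context length):

def estimate_weights_gb(params_billion: float, bits_per_weight: float, overhead: float = 1.2) -> float:
    """Rough RAM/VRAM estimate: weight bytes plus ~20% runtime overhead (assumed)."""
    return params_billion * (bits_per_weight / 8) * overhead

# An 8B model at 4-bit quantization: 8 * 0.5 * 1.2 ≈ 4.8 GB
print(f"{estimate_weights_gb(8, 4):.1f} GB")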

The Three Ways to Run LLMs Locally

A) Ollama (fastest path for developers)

A CLI that feels like “Docker for LLMs.” Pull a model, run it, call a local API.

Why you’d pick it:

  - Fast, scriptable CLI workflow: pull a model, run it, done
  - Built-in local API server (on port 11434) that your apps can call
  - Large model library with one-line downloads

Install:

# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS/Windows: download the installer from https://ollama.com

Pull & run a model:

ollama pull llama3        # Meta LLaMA 3 (8B)
ollama pull mistral       # Mistral 7B
ollama run llama3

Useful commands:

| Command | Purpose |
|---|---|
| ollama list | Show downloaded models |
| ollama pull <model> | Download a model |
| ollama run <model> | Chat in the terminal |
| ollama serve | Start the local API server |

Call the API (Python):

import requests
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(r.json()["response"])
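
If you want tokens as they are generated instead of one blocking reply, set stream to true; Ollama then emits one JSON object per line, each carrying a response chunk and a final done flag:

import json
import requests

# Stream the reply; Ollama sends newline-delimited JSON objects.
with requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": True},
    stream=True,
) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break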

B) LM Studio (best desktop GUI)

A polished desktop app with a built-in chat UI and local API. Works with GGUF models (llama.cpp).

Why you’d pick it:

  - Polished chat UI, no terminal required
  - Built-in model browser that picks a suitable quant for your hardware
  - OpenAI-compatible local API, so existing client code works unchanged

Install & run:

  1. Download LM Studio from https://lmstudio.ai (macOS/Windows/Linux)
  2. Open Discover → search for “LLaMA 3” or “Mistral 7B”
  3. Click Download (LM Studio picks a suitable quant)
  4. Select the model and start chatting

Use the local API (Python):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

chat = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}],
)
print(chat.choices[0].message.content)
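
Streaming works the same way it does against the hosted OpenAI API; a minimal sketch (the model name must match whatever LM Studio shows for the model you actually loaded):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

stream = client.chat.completions.create(
    model="llama-3-8b-instruct",  # use the name LM Studio displays for your model
    messages=[{"role": "user", "content": "Write a haiku about GPUs."}],
    stream=True,
)
for chunk in stream:
    delta = chunk.choices[0].delta.content  # incremental piece of the reply, may be None
    if delta:
        print(delta, end="", flush=True)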

C) GPT4All (most beginner-friendly)

A simple desktop chatbot with curated models and excellent CPU-only performance. Includes document chat.

Why you’d pick it:

  - Easiest setup of the three, with curated model choices
  - Strong CPU-only performance out of the box
  - Built-in document chat that stays fully offline

Install & run:

  1. Download GPT4All from https://gpt4all.io
  2. First launch → choose a model (e.g., “LLaMA 3 Instruct” or “Mistral 7B”)
  3. Start chatting
  4. For document chat, upload files and ask questions—everything stays offline
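
If you'd rather script it, GPT4All also has Python bindings (pip install gpt4all). A minimal sketch; the model filename below is an example, and the library downloads it on first use if it isn't already in your models folder:

from gpt4all import GPT4All

# Example model file name; GPT4All fetches it on first run if missing.
model = GPT4All("Meta-Llama-3-8B-Instruct.Q4_0.gguf")
with model.chat_session():
    print(model.generate("Why is the sky blue?", max_tokens=200))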

Quick Comparison

| Feature | Ollama | LM Studio | GPT4All |
|---|---|---|---|
| Setup | CLI (fast for devs) | GUI (easy) | GUI (easiest) |
| Interface | Terminal + API | Desktop UI + API | Desktop UI |
| Performance | Very fast | Fast | Good (great on CPU) |
| Model formats | Many (GGUF via llama.cpp backends) | GGUF | GGUF |
| API | Built-in | Built-in | Python bindings |
| Doc chat | Via libraries | — | Built-in |
| Best for | Developers | Visual users | Beginners / CPU-only |

Troubleshooting

Out-of-memory (OOM)

  - Switch to a smaller model or a more aggressive quant (Q4 instead of Q8)
  - Reduce the context window size
  - Close other GPU-heavy applications before loading the model

Slow Responses

  - Check that the GPU is actually in use (watch nvidia-smi while generating)
  - On CPU, stick to small 4-bit or 5-bit quantized models
  - Keep the server running so the model stays loaded between requests

“Model not found” (Ollama)

ollama list
ollama pull llama3   # verify exact spelling

Best Practices

Start small — 7B–8B models first; scale up when stable

Default to quantized — Q4_K_M is the sweet spot for most setups
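
With Ollama, for example, you can pick a specific quant through the model tag (the tag below is illustrative; check the model's page on ollama.com for the tags that actually exist):

ollama pull llama3:8b-instruct-q4_K_M   # explicit 4-bit quant instead of the default tag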

Watch resources:

nvidia-smi   # GPU
htop         # CPU/RAM
df -h        # Disk

Keep models tidy:

~/models/
├── llama3-8b-q4/
├── mistral-7b-q5/
└── gemma-2b-q4/

Update tools regularly:

Run brew upgrade ollama on macOS (if installed via Homebrew) or grab the latest installers; keep LM Studio and GPT4All current.

Resources

Ollama: https://ollama.com
LM Studio: https://lmstudio.ai
GPT4All: https://gpt4all.io

Wrap-Up

Use Ollama if you’re comfortable in a terminal and want an API. Use LM Studio if you prefer a desktop app with a slick UI. Use GPT4All if you’re new to this (or CPU-only) and want simple document chat.

That’s it—private, low-cost, and ready to build with.


If you’re looking at how AI tools can speed up your development workflow, see how I used NotebookLM to learn RevenueCat and fix a broken subscription in a day.

