Manish Saraan
Nov 2, 2024

How to Run Open-Source LLMs Locally

You don't need a cloud account to build with large language models. With Ollama, LM Studio, or GPT4All, you can run open-source models on your own machine—privately, cheaply, and offline.

Why Run LLMs Locally?

Privacy & security — Your prompts and files never leave your device. Great for regulated data, proprietary code, or teams with strict compliance requirements.

Cost control — No metered API bills. After your hardware purchase, usage is effectively free.

Offline access — Keep working on planes, in datacenters with no internet, or during outages.

Customization & control — Choose the exact model, tweak behavior, and avoid third-party filters/rate limits.

Learning & speed — Try new models the day they drop and understand how they behave under the hood.

Hardware Requirements

| Model size | CPU | RAM | GPU | VRAM | Storage | Examples |
|---|---|---|---|---|---|---|
| Small (1B–3B) | i5/Ryzen 5 | 16 GB | Optional | 4–8 GB | 10–30 GB | Phi-2, Gemma 2B, LLaMA 3.2-1B |
| Medium (7B–13B) | i7/Ryzen 7 | 32–64 GB | Recommended | 12–24 GB | 50–100 GB | LLaMA 3-8B, Mistral 7B, Gemma 7B |
| Large (30B–70B) | i9/Threadripper | 128 GB+ | Required | 48–80 GB | 150–300 GB | LLaMA 3-70B, Falcon 40B |

Good GPU picks: the RTX 4090 (24 GB) for 13B–30B models, the RTX 4080 (16 GB) for quantized 13B, and the RTX 3090 (24 GB) as a value option.

Apple Silicon: M1/M2/M3 chips (especially the Max and Ultra variants) run GGUF models very well, since unified memory lets the GPU draw on system RAM.

CPU-only: Works, but 10–50× slower—use 4-bit or 5-bit quantized models for acceptable speed.
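To sanity-check a model against the table above, a back-of-the-envelope estimate helps: weight memory is roughly parameter count times bits per weight divided by 8. The ~20% overhead factor below for KV cache and activations is an assumption for illustration, not a measured figure:

```python
def estimate_memory_gb(params_billion: float, bits_per_weight: int,
                       overhead: float = 1.2) -> float:
    """Rough memory footprint: weights at the given quantization,
    plus ~20% assumed headroom for KV cache and activations."""
    weight_bytes = params_billion * 1e9 * bits_per_weight / 8
    return round(weight_bytes * overhead / 1e9, 1)

# A 7B model: ~4.2 GB at 4-bit vs. ~16.8 GB at fp16.
print(estimate_memory_gb(7, 4))    # 4.2
print(estimate_memory_gb(7, 16))   # 16.8
```

This is why quantization matters so much: the same 7B model that needs a 24 GB card at fp16 fits comfortably in 6–8 GB at 4-bit.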

The Three Ways to Run LLMs Locally

A) Ollama (fastest path for developers)

A CLI that feels like "Docker for LLMs." Pull a model, run it, call a local API.

Why you'd pick it:

  • Quick setup, large model library, great streaming speed
  • Built-in, OpenAI-compatible API on localhost:11434
  • macOS, Windows, and Linux

Install:

# Linux
curl -fsSL https://ollama.com/install.sh | sh
# macOS/Windows: download the installer from https://ollama.com

Pull & run a model:

ollama pull llama3        # Meta LLaMA 3 (8B)
ollama pull mistral       # Mistral 7B
ollama run llama3

Useful commands:

| Command | Purpose |
|---|---|
| ollama list | Show downloaded models |
| ollama pull <model> | Download a model |
| ollama run <model> | Chat in the terminal |
| ollama serve | Start the local API server |

Call the API (Python):

import requests
r = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": False},
)
print(r.json()["response"])
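The example above disables streaming. With "stream": True, Ollama instead returns newline-delimited JSON chunks, each carrying a response fragment and a done flag; a minimal sketch of assembling them (field names per Ollama's generate API):

```python
import json

def assemble_stream(lines):
    """Join the token fragments from Ollama's newline-delimited
    JSON stream into one string, stopping at the final chunk."""
    parts = []
    for line in lines:
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break
    return "".join(parts)

# Against a live server (assumes Ollama is running on the default port):
if __name__ == "__main__":
    import requests
    r = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": "llama3", "prompt": "Why is the sky blue?", "stream": True},
        stream=True,
    )
    print(assemble_stream(r.iter_lines()))
```

Streaming makes long generations feel much faster, since tokens appear as they are produced instead of after the whole response is done.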

B) LM Studio (best desktop GUI)

A polished desktop app with a built-in chat UI and local API. Works with GGUF models (llama.cpp).

Why you'd pick it:

  • Point-and-click downloads with hardware compatibility hints
  • OpenAI-compatible local API
  • Load multiple models and compare responses

Install & run:

  1. Download LM Studio from https://lmstudio.ai (macOS/Windows/Linux)
  2. Open Discover → search for "LLaMA 3" or "Mistral 7B"
  3. Click Download (LM Studio picks a suitable quant)
  4. Select the model and start chatting

Use the local API (Python):

from openai import OpenAI
client = OpenAI(base_url="http://localhost:1234/v1", api_key="not-needed")

chat = client.chat.completions.create(
    model="llama-3-8b-instruct",
    messages=[{"role": "user", "content": "Explain MXFP4 quantization in one paragraph."}],
)
print(chat.choices[0].message.content)

C) GPT4All (most beginner-friendly)

A simple desktop chatbot with curated models and excellent CPU-only performance. Includes document chat.

Why you'd pick it:

  • No terminal required
  • Works well without a GPU
  • Upload PDFs/Word/text and ask questions locally

Install & run:

  1. Download GPT4All from https://gpt4all.io
  2. First launch → choose a model (e.g., "LLaMA 3 Instruct" or "Mistral 7B")
  3. Start chatting
  4. For document chat, upload files and ask questions—everything stays offline

Quick Comparison

| Feature | Ollama | LM Studio | GPT4All |
|---|---|---|---|
| Setup | CLI (fast for devs) | GUI (easy) | GUI (easiest) |
| Interface | Terminal + API | Desktop UI + API | Desktop UI |
| Performance | Very fast | Fast | Good (great on CPU) |
| Model formats | Many (GGUF via llama.cpp backends) | GGUF | GGUF |
| API | Built-in | Built-in | Python bindings |
| Doc chat | Via libraries | | Built-in |
| Best for | Developers | Visual users | Beginners/CPU |

Troubleshooting

Out-of-memory (OOM)

  • Switch to a smaller model (llama3:8b)
  • Use heavier quantization (Q8 → Q5 → Q4)
  • Reduce generation length (num_predict in Ollama; max_new_tokens in other stacks)
  • In LM Studio, lower the number of GPU layers to offload more work to the CPU
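For the generation-length fix, Ollama's name for the cap is num_predict, passed in the request's options object alongside num_ctx for the context window. A minimal sketch of a trimmed-down request body (the default values here are illustrative, not Ollama's defaults):

```python
def generate_request(model: str, prompt: str,
                     max_tokens: int = 256, ctx: int = 2048) -> dict:
    """Request body for Ollama's /api/generate with a capped
    generation length (num_predict) and a smaller context window
    (num_ctx) to reduce memory pressure."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"num_predict": max_tokens, "num_ctx": ctx},
    }

# requests.post("http://localhost:11434/api/generate",
#               json=generate_request("llama3", "Hi"))
```

A smaller num_ctx shrinks the KV cache, which is often the difference between an OOM and a working setup on 8 GB cards.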

Slow Responses

  • Check GPU with nvidia-smi (Linux/Windows)
  • Close memory-hungry apps
  • Prefer quantized builds (Q4/Q5)
  • Ensure the GPU is actually being used (Ollama auto-detects CUDA/Metal)

"Model not found" (Ollama)

ollama list
ollama pull llama3   # verify exact spelling

Best Practices

Start small — 7B–8B models first; scale up when stable

Default to quantized — Q4_K_M is the sweet spot for most setups

Watch resources:

nvidia-smi   # GPU
htop         # CPU/RAM
df -h        # Disk

Keep models tidy:

~/models/
├── llama3-8b-q4/
├── mistral-7b-q5/
└── gemma-2b-q4/

Update tools regularly:

Run brew upgrade ollama (macOS) or grab the latest installers, and keep LM Studio and GPT4All current.

Resources

  • Ollama — https://ollama.com • https://github.com/ollama/ollama
  • LM Studio — https://lmstudio.ai
  • GPT4All — https://gpt4all.io
  • Meta LLaMA — https://llama.com • https://github.com/meta-llama/llama

Wrap-Up

Use Ollama if you're comfortable in a terminal and want an API. Use LM Studio if you prefer a desktop app with a slick UI. Use GPT4All if you're new to this (or CPU-only) and want simple document chat.

That's it—private, low-cost, and ready to build with.