
How to Run AI Models Locally in 2026: Complete Ollama & llama.cpp Guide

Step-by-step guide to running AI models locally with Ollama and llama.cpp. Save API costs, protect privacy, and run LLMs offline on Mac, Linux, or Windows.

March 13, 2026 · 11 min read · 2,004 words

Disclosure: This post may contain affiliate links. We earn a commission if you purchase — at no extra cost to you. Our opinions are always our own.


Running AI models locally has gone from a niche hobbyist pursuit to a legitimate engineering choice in 2026. The models are better, the tooling is mature, and the hardware requirements have dropped to the point where a decent laptop can run a capable 7–8B parameter model comfortably.

This guide covers the two most practical paths: Ollama (the easiest way to get started) and llama.cpp (the most flexible and performant option for those who want lower-level control). By the end, you'll have a working local AI setup and a clear sense of when local beats cloud.

Why Run AI Models Locally?

Before diving in, it's worth being specific about the reasons — because they affect which setup makes sense for you.

Privacy and Data Control

If you're processing sensitive data — customer PII, proprietary code, medical records, legal documents — sending it to a third-party API creates compliance and liability exposure. Running locally means your data never leaves your machine or your network.

Cost at Scale

API costs compound quickly. Claude Pro and Perplexity Pro are excellent for interactive use, but if you're building a pipeline that processes thousands of documents daily, the math often favors a one-time hardware investment. A single A100 GPU rental or a Mac Studio can pay for itself within months at scale.
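To make that math concrete, here is a back-of-envelope break-even calculation. The dollar figures are illustrative assumptions, not quotes for any real hardware or API plan:

```python
def breakeven_months(hardware_cost: float, monthly_api_spend: float,
                     monthly_running_cost: float = 0.0) -> float:
    """Months until a one-time hardware purchase beats recurring API spend."""
    savings_per_month = monthly_api_spend - monthly_running_cost
    return hardware_cost / savings_per_month

# Hypothetical numbers: a $2,000 Mac Studio replacing $400/month of API
# usage, minus ~$20/month of extra electricity.
print(round(breakeven_months(2000, 400, 20), 1))  # 5.3
```

Plug in your own volume and pricing; the point is that the break-even horizon at pipeline scale is often measured in months, not years.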

Latency and Offline Operation

Local inference has no network round-trip. For real-time applications — autocomplete, code assistance, interactive agents — this matters. And if you're deploying to environments without reliable internet (edge devices, air-gapped systems), local is the only option.

Experimentation Without Rate Limits

Cloud APIs impose rate limits. Local models don't. This is valuable when you're running evals, fine-tuning experiments, or stress-testing a pipeline.


Hardware Requirements

Be realistic about what your hardware can actually run well.

RAM (for CPU-only inference)

The rule of thumb: you need roughly 1 GB of RAM per billion parameters at 4-bit quantization. Practical minimums:

  • 8 GB RAM: Phi-3 Mini (3.8B), Gemma 2B — usable but slow
  • 16 GB RAM: Mistral 7B, Llama 3 8B — comfortable
  • 32 GB RAM: mid-size models (13–34B) at 4-bit quantization — workable
  • 64+ GB RAM: Llama 3 70B at 4-bit quantization (roughly 40 GB for the weights alone)
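The rule of thumb above has headroom built in. The raw arithmetic is params × bits/8 bytes for the weights, plus runtime overhead; a rough sketch, where the 20% overhead factor is my assumption:

```python
def estimate_ram_gb(params_billion: float, bits: int = 4,
                    overhead: float = 1.2) -> float:
    """Rough memory estimate: weight bytes plus ~20% for the KV cache
    and runtime buffers. A heuristic, not a guarantee."""
    weight_gb = params_billion * bits / 8
    return weight_gb * overhead

print(round(estimate_ram_gb(8), 1))   # Llama 3 8B at 4-bit -> 4.8
print(round(estimate_ram_gb(70), 1))  # Llama 3 70B at 4-bit -> 42.0
```

The estimate is for the model alone; leave room for the OS and whatever else you are running.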

GPU

GPU inference is 5–20x faster than CPU for most workloads:

  • NVIDIA GPU (CUDA): Best support across all tools. 8 GB VRAM handles 7B models; 24 GB handles models up to roughly 30B, while 70B at 4-bit (~40 GB) needs multiple GPUs or partial CPU offload.
  • Apple Silicon (Metal): Excellent for Mac users. Unified memory architecture means 16 GB M2/M3 handles 7B models very well.
  • AMD GPU (ROCm): Supported but requires more setup. llama.cpp has ROCm builds.

Storage

Models range from 2 GB (small quantized models) to 40+ GB (70B full precision). Keep at least 50 GB free if you plan to experiment.

Ollama: The Easiest Path

Ollama wraps model management, inference, and an HTTP API into a single tool. It handles downloading models, managing quantization variants, and running a local server.

Installation

macOS:

brew install ollama

Or download the desktop app from ollama.com.

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download the installer from the Ollama website. WSL2 is not required.

Pulling and Running Models

Once installed, pull a model:

# Pull Llama 3 8B (recommended starting point)
ollama pull llama3

# Pull Mistral 7B
ollama pull mistral

# Pull Phi-3 Mini (fast, lightweight)
ollama pull phi3

# Pull Gemma 2 9B
ollama pull gemma2

Start a chat immediately:

ollama run llama3

That's it. You're now running a local LLM.

The Ollama API

Ollama exposes an HTTP API on localhost:11434, including OpenAI-compatible endpoints under /v1. This is the key feature for integrating local models into applications.

Generate a completion:

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Explain the difference between RAG and fine-tuning in two sentences.",
  "stream": false
}'

Chat endpoint (OpenAI-compatible):

curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3",
    "messages": [{"role": "user", "content": "Write a Python function to parse JSON"}]
  }'

Because it speaks the OpenAI API format, you can drop Ollama into any app that uses the OpenAI Python SDK by changing the client's base URL:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:11434/v1",
    api_key="ollama",  # required but unused
)

response = client.chat.completions.create(
    model="llama3",
    messages=[{"role": "user", "content": "Hello"}]
)

Managing Models

ollama list          # Show downloaded models
ollama rm mistral    # Remove a model
ollama show llama3   # Show model details and parameters

llama.cpp: For Power Users

llama.cpp is a C/C++ inference library for running LLMs in GGUF format. It's more complex to set up than Ollama, but gives you more control over quantization, hardware backends, and inference parameters.

What is GGUF?

GGUF (GPT-Generated Unified Format) is the standard model format for llama.cpp. Models on Hugging Face are distributed in GGUF with different quantization levels:

  • Q4_K_M — 4-bit quantization, good balance of quality and speed (recommended)
  • Q5_K_M — 5-bit, slightly better quality
  • Q8_0 — 8-bit, near-full quality but twice the size of Q4
  • F16 — Half precision, full quality, large

For most use cases, Q4_K_M is the right choice.
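You can estimate the download size of a given quantization from its effective bits per weight. The bits-per-weight figures below are approximations of what llama.cpp typically reports, so treat the results as estimates rather than exact file sizes:

```python
# Approximate effective bits per weight for common GGUF quantizations.
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.7, "Q8_0": 8.5, "F16": 16.0}

def gguf_size_gb(params_billion: float, quant: str) -> float:
    """Estimated GGUF file size in GB for a given quantization level."""
    return params_billion * 1e9 * BITS_PER_WEIGHT[quant] / 8 / 1e9

# Mistral 7B has ~7.25B parameters.
for quant in BITS_PER_WEIGHT:
    print(f"{quant}: ~{gguf_size_gb(7.25, quant):.1f} GB")
```

This lines up with what you see on Hugging Face: a 7B model is roughly 4.4 GB at Q4_K_M and about 14.5 GB at F16.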

Building llama.cpp

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

# CPU-only build (llama.cpp uses CMake; the old Makefile build is gone)
cmake -B build
cmake --build build --config Release

# With Metal (Apple Silicon): enabled by default on macOS
cmake -B build -DGGML_METAL=ON
cmake --build build --config Release

# With CUDA (NVIDIA)
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

The compiled binaries (llama-cli, llama-server) land in build/bin/.

Downloading a GGUF Model

Find models on Hugging Face in GGUF format. bartowski and other community maintainers publish well-tested quantized builds (TheBloke's older uploads are also still widely used).

# Example: download Mistral 7B Q4_K_M
wget https://huggingface.co/bartowski/Mistral-7B-Instruct-v0.3-GGUF/resolve/main/Mistral-7B-Instruct-v0.3-Q4_K_M.gguf

Running Inference

# Interactive chat
./build/bin/llama-cli -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
  --chat-template mistral \
  -n 512 \
  -i

# Single completion
./build/bin/llama-cli -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
  -p "Explain transformer attention in plain English." \
  -n 256

# Run as an HTTP server
./build/bin/llama-server -m Mistral-7B-Instruct-v0.3-Q4_K_M.gguf \
  --port 8080 \
  --ctx-size 4096

The llama-server command starts an OpenAI-compatible server, similar to Ollama.

Which Models Should You Run?

Llama 3 8B

Meta's flagship small model. Strong at reasoning, coding, and instruction following. The best all-rounder at the 7–8B class. Start here.

Mistral 7B / Mistral Nemo

Mistral's models punch above their weight class. Mistral Nemo (12B) is particularly good at coding and multi-step reasoning.

Phi-3 / Phi-4 Mini

Microsoft's Phi series is designed for efficiency. Phi-3 Mini (3.8B) runs on hardware with 8 GB RAM and is surprisingly capable for coding tasks. Good for resource-constrained deployments.

Gemma 2

Google's Gemma 2 (9B) has strong benchmark performance and good instruction following. Worth trying if you do a lot of document-level tasks.

CodeLlama / DeepSeek Coder

If your primary use case is code, these specialized models often outperform general-purpose models of the same size on coding benchmarks.

Local vs Cloud: When Each Makes Sense

Scenario                             | Use Local           | Use Cloud
Sensitive data (PII, medical, legal) | Yes                 | Avoid
High-volume batch processing         | Yes (at scale)      | Expensive
Best possible quality                | No                  | Yes
Real-time low latency                | Depends on hardware | No
Experimental/eval runs               | Yes                 | Rate limited
Customer-facing product              | Risky               | Yes
Offline / edge deployment            | Yes                 | No
Latest model capabilities            | No                  | Yes

For most developers, the answer is both. Use Claude Pro or Perplexity Pro for complex reasoning tasks and interactive work, and local models for bulk processing, private data, and experimentation. See our Claude Code review for a sense of what cloud models can do that local models still can't match.

Performance Tips

Quantization

Use Q4_K_M as your default. Don't use Q2 unless you have no choice — quality degrades significantly. If you have the VRAM for Q5 or Q8, the quality improvement is worth it for tasks where accuracy matters.

Metal on Apple Silicon

Make sure you're using Metal acceleration. In Ollama, it's automatic. In llama.cpp, Metal is enabled by default in macOS builds (or pass -DGGML_METAL=ON explicitly). You can verify it's active by watching GPU usage in Activity Monitor during inference — it should spike.

Context Window Size

Larger context windows require more memory. If you're seeing slowdowns or OOM errors, reduce --ctx-size in llama.cpp or set OLLAMA_MAX_LOADED_MODELS=1 to prevent multiple models loading simultaneously.
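To see why context size matters, you can estimate the KV-cache footprint from the model architecture. A sketch using Llama 3 8B's published shape (32 layers, 8 KV heads, head dimension 128) with a 16-bit cache; treat the numbers as illustrative, since runtimes differ in how they allocate the cache:

```python
def kv_cache_bytes(ctx_tokens: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * bytes, per token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return ctx_tokens * per_token

print(kv_cache_bytes(4096) / 2**20)   # 4K context  -> 512.0 MiB
print(kv_cache_bytes(32768) / 2**20)  # 32K context -> 4096.0 MiB
```

In other words, going from a 4K to a 32K context on this model costs several extra gigabytes before a single token is generated, which is why shrinking --ctx-size is often the quickest OOM fix.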

Threading

For CPU inference, match thread count to your physical cores (not logical):

./llama-cli -m model.gguf --threads 8 -p "..."

Use Cases for Local Models

Local Code Assistant: Point Cursor or Continue.dev at your Ollama server as a custom model. You get code completions and chat without sending proprietary code to the cloud. The quality gap vs GPT-4 is real but acceptable for many tasks. See our Cursor vs Claude Code comparison for context.

Private Document Q&A: Run a local RAG pipeline with Ollama + ChromaDB. Ingest PDFs, contracts, or internal wikis and query them without data leaving your network.

Offline Agent Pipelines: Local models let you build agents that run in air-gapped or intermittently connected environments. Useful for IoT, edge AI, or regulated industries.

Evaluation Runs: Run thousands of prompt evaluations without worrying about cost or rate limits. This is increasingly important as developers take LLM testing more seriously.


Tools We Recommend

  • Ollama — the easiest way to run local LLMs: one-command install, auto-manages models (free, open source)
  • LM Studio — the best GUI for running local models, great for non-developers (free)
  • Claude Pro — the best cloud AI for tasks requiring frontier-model quality ($20/mo)
  • Hugging Face Hub — the largest repository of open-source models to download (free tier available)

FAQ

What's the minimum hardware to run a useful LLM locally?

16 GB of unified RAM on Apple Silicon (M2 or newer) or 16 GB system RAM with a discrete GPU is the practical minimum for a good experience. On this hardware, Llama 3 8B at Q4 quantization runs at 20–40 tokens/second, which is fast enough for interactive use. With 8 GB RAM, you're limited to smaller models like Phi-3 Mini, which are less capable but still useful for specific tasks.

How does local model quality compare to GPT-4 or Claude?

There's still a meaningful gap for complex reasoning, multi-step problem solving, and tasks requiring broad world knowledge. A local Llama 3 8B model is roughly GPT-3.5 class. Llama 3 70B at Q4 quantization is closer to GPT-4, but requires 40+ GB VRAM or very fast RAM to run at acceptable speed. For straightforward tasks — summarization, code generation, classification — 7–8B models are often good enough.

Is Ollama or llama.cpp better?

Ollama for most people. It handles model management, updates, and API serving automatically. Use llama.cpp if you need maximum control over quantization choices, want to compile with specific backends, or are deploying in an environment where you can't run Ollama's daemon.

Can I run local models on Windows?

Yes. Ollama has a native Windows installer. llama.cpp can be compiled on Windows with Visual Studio or MinGW, though it's more involved. CUDA acceleration works on Windows with NVIDIA GPUs. Metal (Apple Silicon) is Mac-only.

How do I use a local model as a drop-in replacement for OpenAI in my app?

Both Ollama and llama.cpp server expose an OpenAI-compatible API. Change your base_url to http://localhost:11434/v1 (Ollama) or http://localhost:8080/v1 (llama.cpp server) and set api_key to any non-empty string. The chat completions endpoint is compatible with the OpenAI Python SDK.

What about fine-tuning local models?

Fine-tuning is separate from inference. Tools like Unsloth, Axolotl, and the Hugging Face TRL library handle fine-tuning on consumer hardware using LoRA or QLoRA. After fine-tuning, you convert the result to GGUF and run it with llama.cpp or import it into Ollama. It's feasible on a 24 GB GPU but not for the faint of heart.

Are local models private if I'm using Ollama?

Yes, if you're not using any external services. Ollama itself doesn't send data anywhere — it's running entirely on your machine. The models are downloaded once and cached locally. Your prompts and outputs never leave your hardware.
