Ollama Guide: Run Llama, Mistral & Qwen LLMs Locally

2026-06-16 Ollama Large Language Models Local AI Llama Mistral

What is Ollama? An open-source local large language model runtime that lets you download, run, and manage open-source AI models like Llama, Mistral, and Qwen on your personal computer with a single command, with no GPU cluster or cloud API required.

What is Ollama?

Ollama is an open-source local large language model (LLM) runtime, created by Jeffrey Morgan and Michael Chiang and released under the MIT License. Its core mission: make it easy for anyone to run open-source AI models locally.

Core Features

Feature	Description
One-Click Run	Start a model with `ollama run llama3`
Model Library	Dozens of popular models built in, with custom Modelfile support
REST API	Native REST API plus an OpenAI-compatible `/v1` endpoint for easy integration
Cross-Platform	Full support for macOS, Linux, and Windows
GPU Acceleration	Automatically detects and uses NVIDIA and AMD GPUs, and Metal on Apple Silicon
Open & Free	MIT License for Ollama itself; each model ships under its own license

Why Choose Ollama?

Before Ollama, running large language models locally required complex setup: manually downloading weight files, installing PyTorch, resolving dependency conflicts, and writing inference code. Ollama wraps all of that into a single command line.

Comparison with Other Local LLM Tools:

Tool	Ease of Use	Model Count	API Compatibility	GPU Support
Ollama	⭐⭐⭐⭐⭐	50+	OpenAI Compatible	NVIDIA/AMD/Apple Silicon
LM Studio	⭐⭐⭐⭐	30+	OpenAI Compatible	NVIDIA/AMD/Apple Silicon
GPT4All	⭐⭐⭐	20+	Custom	NVIDIA/AMD/Apple Silicon
Manual Deploy	⭐	Unlimited	Custom	Manual setup required

Ollama's strengths lie in its minimal-friction usage and rich model ecosystem, making it especially well-suited for developers doing rapid prototyping and everyday AI-assisted coding.

Installing Ollama

macOS

# Method 1: One-click install (recommended)
curl -fsSL https://ollama.com/install.sh | sh

# Method 2: Install via Homebrew
brew install ollama

# Verify installation
ollama --version

The macOS version automatically leverages Apple Silicon (M1/M2/M3/M4) Metal GPU acceleration, delivering excellent performance.

Linux

# One-click install (includes systemd service)
curl -fsSL https://ollama.com/install.sh | sh

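# Manual install (download and extract the release archive; check the official Linux docs for the current archive name)
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz | sudo tar zx -C /usr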

# Manage with systemd
sudo systemctl start ollama
sudo systemctl enable ollama

# Verify
ollama --version

The Linux installer automatically detects NVIDIA GPUs and installs CUDA driver support.

Windows

# Download Windows installer via winget
winget install Ollama.Ollama

# Or download from the official site
# https://ollama.com/download/OllamaSetup.exe

# Verify
ollama --version

The Windows version supports NVIDIA CUDA and AMD ROCm GPU acceleration.

Docker Deployment

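# Pull the official image and start the server (--gpus=all requires the NVIDIA Container Toolkit; omit it for CPU-only hosts)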
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

# Verify
docker exec ollama ollama list

Docker deployment is ideal for server environments and CI/CD pipelines.
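
Whichever install path you choose, it can be handy to confirm from code that the server is reachable. The following is a minimal sketch using only the Python standard library. It assumes the default address http://localhost:11434 and the server's /api/version endpoint; the helper names `parse_version` and `server_version` are illustrative, not part of any Ollama library.

```python
import json
import urllib.error
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def parse_version(body: bytes) -> str:
    """Extract the version string from an /api/version response body."""
    return json.loads(body)["version"]


def server_version(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return parse_version(resp.read())
    except (urllib.error.URLError, OSError):
        return None


# Offline demonstration with a sample response body (placeholder version number)
print(parse_version(b'{"version": "0.0.0"}'))
```

Calling `server_version()` against a running install returns its version string; a `None` result usually means the service or container is not started or the port is not published.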

Quick Start: Run Your First Model

One Step to Run

ollama run llama3.2

That's it! Ollama will automatically:
1. Check if the model already exists locally
2. Download it from the official model library if missing
3. Launch an interactive chat interface

The first run downloads the model (around 2–4 GB), and subsequent launches take just a few seconds.
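
The check-then-download decision described above can be sketched in a few lines. This is an illustration of the logic only, not Ollama's actual implementation; the helper names are hypothetical, and it assumes a model name without a tag refers to the `latest` tag.

```python
def normalize(name: str) -> str:
    """Add the implicit ':latest' tag when a model name has none."""
    return name if ":" in name else f"{name}:latest"


def needs_pull(model: str, installed: list[str]) -> bool:
    """A download is needed only if the model is not already local."""
    wanted = normalize(model)
    return wanted not in {normalize(n) for n in installed}


# Sample data, shaped like the NAME column of `ollama list`
installed = ["llama3.2:latest", "mistral:7b"]
for model in ["llama3.2", "mistral:7b", "qwen2.5:14b"]:
    action = "download, then chat" if needs_pull(model, installed) else "chat immediately"
    print(f"{model}: {action}")
```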

Interactive Chat

After starting a model, you'll see an interactive terminal:

>>> Hi! Please introduce yourself.

Hello! I'm an AI assistant running on the Llama 3.2 model. I can help
you answer questions, write code, translate text, brainstorm ideas, and more.
What can I do for you?

>>> Help me write a Python quicksort implementation

Sure! Here's a Python implementation of quicksort:

def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)

# Test
print(quick_sort([3, 6, 8, 10, 1, 2, 1]))
# Output: [1, 1, 2, 3, 6, 8, 10]

Common Keyboard Shortcuts

Shortcut	Function
`Enter`	Send message
`Ctrl+C`	Interrupt current response
`Ctrl+D` or `/bye`	Exit the conversation
`Ctrl+L`	Clear the screen
`/set parameter`	Adjust generation parameters

Model Management: Download, Switch, Delete

View Available Models

# List locally installed models
ollama list

# Example output:
# NAME                    ID              SIZE    MODIFIED
# llama3.2:latest         a1b2c3d4e5f6    2.0 GB  2 hours ago
# mistral:7b              f6e5d4c3b2a1    4.1 GB  1 day ago
# qwen2.5:14b             1a2b3c4d5e6f    8.9 GB  3 days ago

Browse the Official Model Library

Visit ollama.com/library to see all available models. Here are the most popular models recommended in 2026:

Model	Parameters	Size	Use Cases
Llama 3.2	1B/3B	1.3–2.0 GB	General chat, code generation
Mistral 7B	7B	4.1 GB	General tasks, multilingual
Qwen 2.5	7B/14B/32B	4–18 GB	Chinese optimization, math reasoning
DeepSeek Coder V2	16B/236B	9–133 GB	Code generation, coding assistant
Phi-4	14B	8.4 GB	Lightweight, strong reasoning
Gemma 2	2B/9B/27B	1.6–15 GB	By Google, efficient inference
Yi 1.5	6B/9B/34B	3.5–20 GB	Chinese-English bilingual optimization

Download Models

# Download a specific model
ollama pull llama3.2
ollama pull mistral:7b
ollama pull qwen2.5:14b

# Download a specific version
ollama pull llama3.2:3b

Switch Models

In an interactive chat:

>>> /load mistral:7b
Loading model 'mistral:7b'

Or specify directly on the command line:

ollama run mistral:7b

Delete Models

# Remove a single model
ollama rm llama3.2

# Check disk usage for all models (the Linux systemd service stores them under /usr/share/ollama/.ollama/models instead)
du -sh ~/.ollama/models/*
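
Per-model sizes can also be read programmatically. The sketch below assumes the server's /api/tags endpoint, which returns a `models` list with `name` and `size` (in bytes) fields; the payload here is made-up sample data so the example runs offline, and `sizes_gb` is a hypothetical helper name.

```python
import json

# Sample payload shaped like an /api/tags response (values are made up)
SAMPLE = json.loads("""
{"models": [
  {"name": "llama3.2:latest", "size": 2000000000},
  {"name": "mistral:7b", "size": 4100000000}
]}
""")


def sizes_gb(payload: dict) -> dict[str, float]:
    """Map each model name to its size in GB (10^9 bytes), largest first."""
    models = sorted(payload.get("models", []), key=lambda m: m["size"], reverse=True)
    return {m["name"]: round(m["size"] / 1e9, 1) for m in models}


report = sizes_gb(SAMPLE)
for name, gb in report.items():
    print(f"{name:<20} {gb:>5.1f} GB")
print(f"{'total':<20} {sum(report.values()):>5.1f} GB")
```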

Custom Modelfile

Ollama supports custom model configurations. Create a Modelfile:

FROM llama3.2:3b

# Set system prompt
SYSTEM """
You are a professional Python development assistant.
When responding, please:
1. Always provide runnable code examples first
2. Explain the key logic of the code
3. Point out potential pitfalls and optimizations
"""

# Adjust generation parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096

Then create your custom model:

# Create the model
ollama create my-python-assistant -f Modelfile

# Run the custom model
ollama run my-python-assistant
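
If you maintain several assistants, generating the Modelfile from code can keep them consistent. The following sketch renders the same three instructions used above (FROM, SYSTEM, PARAMETER); `render_modelfile` is a hypothetical helper, and it assumes the system prompt contains no triple quotes.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}", "", f'SYSTEM """\n{system.strip()}\n"""', ""]
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


text = render_modelfile(
    base="llama3.2:3b",
    system="You are a professional Python development assistant.",
    params={"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096},
)
print(text)
```

Writing `text` to a file named Modelfile and running the `ollama create` command shown above would then register the model.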

Ollama API: Integrate into Your Apps

REST API Basics

Ollama serves a REST API at http://localhost:11434 by default. The native endpoints live under /api; an OpenAI-compatible interface is also available under /v1.

# Generate text (non-streaming)
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Write a bubble sort in Python",
  "stream": false
}'

# Chat format (multi-turn conversation)
curl http://localhost:11434/api/chat -d '{
  "model": "llama3.2",
  "messages": [
    {"role": "system", "content": "You are a code expert"},
    {"role": "user", "content": "Explain async/await"}
  ],
  "stream": false
}'
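
Both requests above set "stream": false. When streaming is left on, the chat endpoint sends newline-delimited JSON: one object per line, each carrying a fragment of the reply, with the last one marked "done": true. A minimal sketch of reassembling such a stream follows; the sample lines are made up for illustration and `collect_stream` is a hypothetical helper.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the content fragments of a streamed /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines
        chunk = json.loads(raw)
        parts.append(chunk.get("message", {}).get("content", ""))
        if chunk.get("done"):
            break
    return "".join(parts)


# Sample lines shaped like a streamed response (content is made up)
sample = [
    b'{"message": {"role": "assistant", "content": "async "}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": "marks a coroutine."}, "done": false}\n',
    b'{"message": {"role": "assistant", "content": ""}, "done": true}\n',
]
print(collect_stream(sample))
```

The response object returned by `urllib.request.urlopen` can be iterated line by line, so it could be passed to `collect_stream` directly.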

OpenAI-Compatible Endpoint

Ollama provides an OpenAI-compatible endpoint, making it easy to migrate existing applications:

# OpenAI-compatible /v1/chat/completions
curl http://localhost:11434/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer ollama" \
  -d '{
    "model": "llama3.2",
    "messages": [{"role": "user", "content": "Hello"}]
  }'

Python Integration

Using the official ollama Python library:

# Install
# pip install ollama

import ollama

# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='Explain quantum computing in one sentence'
)
print(response['response'])

# Chat format
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a code review expert'},
        {'role': 'user', 'content': "What's wrong with this code?\n\nfor i in range(len(arr)):\n    if arr[i] == x: return i"}
    ]
)
print(response['message']['content'])

# Streaming output
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a poem about programming'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)

Calling Ollama with the OpenAI SDK

from openai import OpenAI

# Point to Ollama's local endpoint
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # any value works
)

response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain what RAG is'}
    ]
)
print(response.choices[0].message.content)

JavaScript/TypeScript Integration

// npm install ollama
import { Ollama } from 'ollama';

const ollama = new Ollama();

// Simple generation
const response = await ollama.generate({
  model: 'llama3.2',
  prompt: 'What is Docker?',
});
console.log(response.response);

// Chat format
const chat = await ollama.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Recommend 5 resources for learning Rust' }
  ],
});
console.log(chat.message.content);

Real-World Cases: Build a Local AI Assistant

Case 1: Local Code Review Assistant

Create an AI assistant specialized in reviewing code:

import ollama
import sys

def review_code(code: str, language: str = "Python"):
    """Local code review"""
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'''You are a senior {language} code reviewer.
Please review the code from the following perspectives:
1. Code style (idiomatic {language} conventions and best practices)
2. Potential bugs and edge cases
3. Performance issues
4. Security vulnerabilities
5. Improvement suggestions

Output format:
- Severity: 🔴 Critical / 🟡 Warning / 🟢 Suggestion
- Issue description
- Fix with code example'''},
            {'role': 'user', 'content': f'Please review the following code:\n\n```{language}\n{code}\n```'}
        ]
    )
    return response['message']['content']

if __name__ == '__main__':
    if len(sys.argv) != 2:
        sys.exit('Usage: python review.py <source_file>')
    # Read code from file
    with open(sys.argv[1], 'r', encoding='utf-8') as f:
        code = f.read()
    print(review_code(code))

Usage:

python review.py my_script.py

Case 2: Local RAG Knowledge Base

Build a local RAG system combined with a vector database:

# pip install ollama chromadb sentence-transformers

import ollama
import chromadb
from chromadb.utils import embedding_functions

# Initialize Chroma client
chroma_client = chromadb.Client()

# Use a local embedding model
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)

collection = chroma_client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embed_fn
)

# Add a document
def add_document(text: str, metadata: dict = None):
    collection.add(
        documents=[text],
        # Chroma can reject empty metadata dicts, so pass None when there is none
        metadatas=[metadata] if metadata else None,
        ids=[f"doc_{collection.count()}"]
    )

# Retrieve relevant documents
def retrieve(query: str, n_results: int = 3):
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results['documents'][0]

# Generate an answer
def rag_answer(question: str) -> str:
    # 1. Retrieve relevant documents
    context_docs = retrieve(question)
    context = "\n\n".join(context_docs)

    # 2. Call Ollama to generate the answer
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer the question based on the following context:\n\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']

# Usage example
add_document("Ollama supports open-source models like Llama, Mistral, and Qwen")
add_document("Ollama provides an OpenAI-compatible REST API")
print(rag_answer("Which models does Ollama support?"))
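
The example above stores each document whole. Longer documents are usually split into overlapping chunks before indexing so that retrieval returns focused passages. A minimal character-based sketch follows; the function name and the default sizes are assumptions chosen for illustration, and real pipelines often split on sentences or tokens instead.

```python
def chunk_text(text: str, size: int = 500, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `size` characters that share `overlap` characters."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    # Stop once the remaining text is already covered by the previous chunk
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


doc = "abcdefghij"
print(chunk_text(doc, size=4, overlap=1))  # ['abcd', 'defg', 'ghij']
```

Each chunk could then be passed to `add_document` in place of the full text.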

Case 3: Batch Text Processing

import ollama
import json

def extract_entities(text: str) -> dict:
    """Extract entities from text"""
    response = ollama.chat(
        model='llama3.2',
        messages=[{
            'role': 'user',
            'content': f'''Extract entities from the following text and return in JSON format:
{text}

Return format:
{{
  "people": [],
  "organizations": [],
  "locations": [],
  "dates": [],
  "events": []
}}'''
        }],
        format='json'  # ask Ollama to constrain the reply to valid JSON
    )
    try:
        return json.loads(response['message']['content'])
    except json.JSONDecodeError:
        return {}

# Batch processing
texts = [
    "Apple CEO Tim Cook visited Beijing in June 2026",
    "OpenAI released GPT-5 in San Francisco"
]

for text in texts:
    entities = extract_entities(text)
    print(json.dumps(entities, ensure_ascii=False, indent=2))
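
Models sometimes wrap the JSON in extra prose, in which case a plain `json.loads` on the whole reply fails and the function above returns an empty dict. One possible fallback is to scan the reply for the first parseable JSON object. This is a sketch using the standard library's `json.JSONDecoder.raw_decode`; `extract_json` is a hypothetical helper and the sample reply is made up.

```python
import json


def extract_json(reply: str) -> dict:
    """Parse the first JSON object found in a model reply, or return {}."""
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            obj, _ = decoder.raw_decode(reply, start)
            if isinstance(obj, dict):
                return obj
        except json.JSONDecodeError:
            pass  # not valid JSON from this brace; try the next one
        start = reply.find("{", start + 1)
    return {}


reply = 'Here is the JSON:\n{"people": ["Tim Cook"], "locations": ["Beijing"]}\nAnything else?'
print(extract_json(reply))
```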

Performance Optimization & GPU Acceleration

GPU Selection

Ollama automatically selects the best available GPU. You can also control it manually:

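# Restrict Ollama to specific GPUs (Linux). Set these for the server process, not the client.
export CUDA_VISIBLE_DEVICES=0    # NVIDIA
export ROCR_VISIBLE_DEVICES=0    # AMD ROCm
# For AMD GPUs without official ROCm support, also override the GFX version
export HSA_OVERRIDE_GFX_VERSION=10.3.0
ollama serve

# Check loaded models and whether they run on CPU or GPU
ollama ps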

Context Window Tuning

# Increase context window for long documents
response = ollama.chat(
    model='llama3.2',
    messages=[...],  # placeholder: your conversation messages
    options={
        'num_ctx': 8192  # the default depends on the Ollama version (2048 in older releases)
    }
)
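
To pick a `num_ctx` value for a given document, a rough estimate can help. The sketch below assumes about four characters per token, which is only a rule of thumb for English text, adds a budget for the reply, and rounds up to a power of two. The function name, the 512-token reply budget, and the 2048 floor are all illustrative assumptions.

```python
import math


def suggest_num_ctx(prompt_chars: int, reply_tokens: int = 512,
                    chars_per_token: float = 4.0, minimum: int = 2048) -> int:
    """Estimate num_ctx: prompt tokens plus reply budget, rounded up to a power of two."""
    needed = max(1, math.ceil(prompt_chars / chars_per_token) + reply_tokens)
    power_of_two = 1 << (needed - 1).bit_length()
    return max(minimum, power_of_two)


print(suggest_num_ctx(20_000))  # 5000 prompt tokens + 512 reply tokens -> 8192
print(suggest_num_ctx(1_000))   # small prompt stays at the 2048 floor
```

Larger context windows use more memory, so it is worth choosing the smallest value that fits the workload.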

Quantization

Ollama automatically downloads quantized models (typically Q4_K_M). You can choose a different quantization level by tag; the exact tag names are listed on each model's Tags page in the library:

# Higher precision (larger file, better quality)
ollama pull llama3.2:3b-instruct-fp16

# Lower precision (smaller file, less memory, lower quality)
ollama pull llama3.2:3b-instruct-q2_K
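
The trade-off between precision and file size follows from simple arithmetic: weight size is roughly parameters times bits per weight divided by eight. The sketch below works this out for a 7B model. The bits-per-weight figures are approximations (assumptions for illustration), and real files differ somewhat because they include metadata and mix tensor types.

```python
def estimate_size_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB: parameters x bits per weight / 8 bits per byte."""
    return round(params_billions * bits_per_weight / 8, 1)


# Rough bits-per-weight figures (assumptions, not exact values)
BITS = {"fp16": 16.0, "q8_0": 8.5, "q4_K_M": 4.85}

for name, bits in BITS.items():
    print(f"7B at {name}: ~{estimate_size_gb(7, bits)} GB")
```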

Frequently Asked Questions

Does Ollama require an internet connection?

No. After downloading the model, everything runs fully offline. This is one of Ollama's biggest advantages — complete data privacy.

How much RAM do I need?

Model Size	Recommended RAM
1–3B	4 GB+
7B	8 GB+
14B	16 GB+
32B+	32 GB+

GPU memory (VRAM) is even more important if you want GPU acceleration.
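
The RAM table above can be turned into a small lookup for scripts that choose a model based on the host. This sketch encodes the table's rows directly; rounding sizes that fall between rows up to the next tier is an assumption, and `recommended_ram_gb` is a hypothetical helper name.

```python
# (largest parameter count in billions, recommended RAM in GB), from the table above
RAM_TIERS = [(3, 4), (7, 8), (14, 16)]


def recommended_ram_gb(params_billions: float) -> int:
    """Return the table's RAM recommendation for the smallest tier that fits the model."""
    for max_params, ram_gb in RAM_TIERS:
        if params_billions <= max_params:
            return ram_gb
    return 32  # the table's top tier, a minimum for 32B and larger


for size in (3, 7, 14, 32):
    print(f"{size}B -> {recommended_ram_gb(size)} GB+")
```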

Can I use Ollama in production?

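Yes. Ollama itself is MIT-licensed and commercially usable, though each model carries its own license terms, so check them before shipping. It is a good fit for internal tools and privacy-sensitive applications. Note that the API has no built-in authentication, so put it behind a reverse proxy or firewall before exposing it beyond localhost.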

How do I update Ollama?

# macOS (Homebrew)
brew upgrade ollama

# Linux (re-run installer)
curl -fsSL https://ollama.com/install.sh | sh

# Windows
winget upgrade Ollama.Ollama

Summary

Ollama has quickly become the go-to tool for running open-source LLMs locally. Its one-command simplicity, rich model ecosystem, and OpenAI-compatible API make it ideal for developers who want AI power without cloud dependencies or data privacy concerns.

Whether you're building a local coding assistant, a RAG knowledge base, or just experimenting with the latest open-source models, Ollama lowers the barrier to entry dramatically. Give it a try: run `ollama run llama3.2` and you'll be chatting with your own local AI in minutes.
