What is Ollama? An open-source local large language model runtime that lets you download, run, and manage open-source AI models like Llama, Mistral, and Qwen on your personal computer with a single command. No GPU cluster is needed and no cloud API is required.
Table of Contents
- What is Ollama?
- Installing Ollama
- Quick Start: Run Your First Model
- Model Management: Download, Switch, Delete
- Ollama API: Integrate into Your Apps
- Real-World Cases: Build a Local AI Assistant
- Performance Optimization & GPU Acceleration
- Frequently Asked Questions
- Summary
What is Ollama?
Ollama is an open-source local large language model (LLM) runtime, created by Jeffrey Morgan and Michael Chiang and released under the MIT License. Its core mission: make it easy for anyone to run open-source AI models locally.
Core Features
| Feature | Description |
|---|---|
| One-Click Run | Start a model with ollama run llama3 |
| Model Library | Dozens of popular models built in, with custom Modelfile support |
| REST API | Native HTTP API plus an OpenAI-compatible endpoint for easy integration |
| Cross-Platform | Full support for macOS, Linux, and Windows |
| GPU Acceleration | Automatically detects and leverages NVIDIA and AMD GPUs, plus Apple Silicon via Metal |
| Open & Free | MIT License, commercially usable |
Why Choose Ollama?
Before Ollama, running large language models locally required complex setup: manually downloading weight files, installing PyTorch, resolving dependency conflicts, and writing inference code. Ollama wraps all of that into a single command line.
Comparison with Other Local LLM Tools:
| Tool | Ease of Use | Model Count | API Compatibility | GPU Support |
|---|---|---|---|---|
| Ollama | ⭐⭐⭐⭐⭐ | 50+ | OpenAI Compatible | NVIDIA/AMD/Apple Silicon |
| LM Studio | ⭐⭐⭐⭐ | 30+ | OpenAI Compatible | NVIDIA/AMD |
| GPT4All | ⭐⭐⭐ | 20+ | Custom | NVIDIA/AMD |
| Manual Deploy | ⭐ | Unlimited | Custom | Manual setup required |
Ollama's strengths lie in its minimal-friction usage and rich model ecosystem, making it especially well-suited for developers doing rapid prototyping and everyday AI-assisted coding.
Installing Ollama
macOS
# Method 1: One-click install (recommended)
curl -fsSL https://ollama.com/install.sh | sh
# Method 2: Install via Homebrew
brew install ollama
# Verify installation
ollama --version
The macOS version automatically leverages Apple Silicon (M1/M2/M3/M4) Metal GPU acceleration, delivering excellent performance.
Linux
# One-click install (includes systemd service)
curl -fsSL https://ollama.com/install.sh | sh
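# Manual install (binary approach): download and extract the release archive
# (the archive name may differ by release; check the Linux install docs)
curl -fsSL https://ollama.com/download/ollama-linux-amd64.tgz | sudo tar zx -C /usr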
# Manage with systemd
sudo systemctl start ollama
sudo systemctl enable ollama
# Verify
ollama --version
The Linux installer automatically detects NVIDIA GPUs and installs CUDA driver support.
Windows
# Download Windows installer via winget
winget install Ollama.Ollama
# Or download from the official site
# https://ollama.com/download/OllamaSetup.exe
# Verify
ollama --version
The Windows version supports NVIDIA CUDA and AMD ROCm GPU acceleration.
Docker Deployment
# Pull and start the official image (--gpus=all requires the NVIDIA Container Toolkit)
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Verify
docker exec ollama ollama list
Docker deployment is ideal for server environments and CI/CD pipelines.
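Beyond `ollama --version`, a running server can also be checked over HTTP. The sketch below assumes the default address `http://localhost:11434` (the port published by the Docker command above) and the `/api/version` endpoint; the function name `server_version` is illustrative, not part of Ollama.

```python
import json
import urllib.request
from typing import Optional

# Default address; matches the -p 11434:11434 mapping used above.
OLLAMA_URL = "http://localhost:11434"


def server_version(base_url: str = OLLAMA_URL, timeout: float = 2.0) -> Optional[str]:
    """Return the server's version string, or None if it cannot be reached."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/version", timeout=timeout) as resp:
            return json.load(resp).get("version")
    except (OSError, ValueError):
        # OSError covers refused connections and timeouts; ValueError covers bad JSON.
        return None


if __name__ == "__main__":
    version = server_version()
    print(f"Ollama {version} is running" if version else "Ollama is not reachable")
```

A check like this is handy in CI/CD pipelines, where a job may need to wait for the container before sending requests.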
Quick Start: Run Your First Model
One Step to Run
ollama run llama3.2
That's it! Ollama will automatically:
1. Check if the model already exists locally
2. Download it from the official model library if missing
3. Launch an interactive chat interface
The first run downloads the model (around 2–4 GB), and subsequent launches take just a few seconds.
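The check-then-download behavior of `ollama run` can be approximated against the HTTP API. This is a rough sketch, not Ollama's actual implementation: it assumes the `/api/tags` and `/api/pull` endpoints on the default port, and the helper names (`normalize`, `is_installed`, `ensure_model`) are made up for illustration. Older server versions may expect `name` instead of `model` in the pull request body.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # default local address


def normalize(name: str) -> str:
    """Add the implicit ':latest' tag, so 'llama3.2' becomes 'llama3.2:latest'."""
    return name if ":" in name else f"{name}:latest"


def is_installed(name: str, tags_response: dict) -> bool:
    """Step 1: look for the model in a parsed /api/tags response."""
    installed = {m["name"] for m in tags_response.get("models", [])}
    return normalize(name) in installed


def ensure_model(name: str, base_url: str = OLLAMA_URL) -> None:
    """Steps 1 and 2: download the model only if it is missing locally."""
    with urllib.request.urlopen(f"{base_url}/api/tags") as resp:
        tags = json.load(resp)
    if is_installed(name, tags):
        return
    body = json.dumps({"model": name, "stream": False}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/pull", data=body,
        headers={"Content-Type": "application/json"},
    )
    urllib.request.urlopen(req).close()  # blocks until the pull finishes
```

Calling `ensure_model("llama3.2")` before the first chat request lets an application trigger the download itself.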
Interactive Chat
After starting a model, you'll see an interactive terminal:
>>> Hi! Please introduce yourself.
Hello! I'm an AI assistant running on the Llama 3.2 model. I can help
you answer questions, write code, translate text, brainstorm ideas, and more.
What can I do for you?
>>> Help me write a Python quicksort implementation
Sure! Here's a Python implementation of quicksort:
def quick_sort(arr):
    if len(arr) <= 1:
        return arr
    pivot = arr[len(arr) // 2]
    left = [x for x in arr if x < pivot]
    middle = [x for x in arr if x == pivot]
    right = [x for x in arr if x > pivot]
    return quick_sort(left) + middle + quick_sort(right)
# Test
print(quick_sort([3, 6, 8, 10, 1, 2, 1]))
# Output: [1, 1, 2, 3, 6, 8, 10]
Common Keyboard Shortcuts
| Shortcut | Function |
|---|---|
| Enter | Send message |
| Ctrl+C | Interrupt current response |
| Ctrl+D or /bye | Exit the conversation |
| Ctrl+L | Clear the screen |
| /set parameter | Adjust generation parameters |
Model Management: Download, Switch, Delete
View Available Models
# List locally installed models
ollama list
# Example output:
# NAME ID SIZE MODIFIED
# llama3.2:latest a1b2c3d4e5f6 2.0 GB 2 hours ago
# mistral:7b f6e5d4c3b2a1 4.1 GB 1 day ago
# qwen2.5:14b 1a2b3c4d5e6f 8.9 GB 3 days ago
Browse the Official Model Library
Visit ollama.com/library to see all available models. Here are the most popular models recommended in 2026:
| Model | Parameters | Size | Use Cases |
|---|---|---|---|
| Llama 3.2 | 1B/3B | 1.3–2.0 GB | General chat, code generation |
| Mistral 7B | 7B | 4.1 GB | General tasks, multilingual |
| Qwen 2.5 | 7B/14B/32B | 4–18 GB | Chinese optimization, math reasoning |
| DeepSeek Coder V2 | 16B/236B | 9–133 GB | Code generation, coding assistant |
| Phi-4 | 14B | 8.4 GB | Lightweight, strong reasoning |
| Gemma 2 | 2B/9B/27B | 1.6–15 GB | By Google, efficient inference |
| Yi 1.5 | 6B/9B/34B | 3.5–20 GB | Chinese-English bilingual optimization |
Download Models
# Download a specific model
ollama pull llama3.2
ollama pull mistral:7b
ollama pull qwen2.5:14b
# Download a specific tag (here, the 3B size)
ollama pull llama3.2:3b
Switch Models
In an interactive chat:
>>> /load mistral:7b
Loading model 'mistral:7b'
Or specify directly on the command line:
ollama run mistral:7b
Delete Models
# Remove a single model
ollama rm llama3.2
# Check disk usage for all models (a Linux systemd install stores them under /usr/share/ollama/.ollama/models instead)
du -sh ~/.ollama/models/*
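The same disk-usage check can be done from Python, which also works where `du` is unavailable. This sketch assumes the model directory is `~/.ollama/models` unless the `OLLAMA_MODELS` environment variable points elsewhere; the function names are illustrative.

```python
import os
from pathlib import Path


def models_dir() -> Path:
    """Model directory: OLLAMA_MODELS if set, else ~/.ollama/models."""
    return Path(os.environ.get("OLLAMA_MODELS", Path.home() / ".ollama" / "models"))


def dir_size_bytes(root: Path) -> int:
    """Sum the sizes of all regular files under root (0 if root is missing)."""
    if not root.is_dir():
        return 0
    return sum(p.stat().st_size for p in root.rglob("*") if p.is_file())


if __name__ == "__main__":
    root = models_dir()
    print(f"{root}: {dir_size_bytes(root) / 1e9:.1f} GB")
```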
Custom Modelfile
Ollama supports custom model configurations. Create a Modelfile:
FROM llama3.2:3b
# Set system prompt
SYSTEM """
You are a professional Python development assistant.
When responding, please:
1. Always provide runnable code examples first
2. Explain the key logic of the code
3. Point out potential pitfalls and optimizations
"""
# Adjust generation parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER num_ctx 4096
Then create your custom model:
# Create the model
ollama create my-python-assistant -f Modelfile
# Run the custom model
ollama run my-python-assistant
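When several assistants differ only in their prompt or parameters, the Modelfile text can be generated rather than hand-written. A minimal sketch, assuming only the `FROM`, `SYSTEM`, and `PARAMETER` instructions shown above; `render_modelfile` is a hypothetical helper, not part of Ollama.

```python
def render_modelfile(base: str, system: str, params: dict) -> str:
    """Build Modelfile text from a base model, a system prompt, and parameters."""
    lines = [f"FROM {base}", f'SYSTEM """\n{system.strip()}\n"""']
    lines += [f"PARAMETER {key} {value}" for key, value in params.items()]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(render_modelfile(
        "llama3.2:3b",
        "You are a professional Python development assistant.",
        {"temperature": 0.7, "top_p": 0.9, "num_ctx": 4096},
    ))
```

Saving the returned text as `Modelfile` and running `ollama create` as above would then register the variant.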
Ollama API: Integrate into Your Apps
REST API Basics
Ollama serves its native REST API at http://localhost:11434 by default. An OpenAI-compatible interface is also available under the /v1 path.
# Generate text (non-streaming)
curl http://localhost:11434/api/generate -d '{
"model": "llama3.2",
"prompt": "Write a bubble sort in Python",
"stream": false
}'
# Chat format (multi-turn conversation)
curl http://localhost:11434/api/chat -d '{
"model": "llama3.2",
"messages": [
{"role": "system", "content": "You are a code expert"},
{"role": "user", "content": "Explain async/await"}
],
"stream": false
}'
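Both requests above set `"stream": false`. With streaming enabled, the native API returns newline-delimited JSON objects, each carrying a fragment of the reply and a `done` flag. The sketch below shows one way to reassemble such a stream using only the standard library; `iter_content` and `stream_chat` are illustrative names, and the default address is assumed.

```python
import json
import urllib.request
from typing import Iterable, Iterator


def iter_content(lines: Iterable[bytes]) -> Iterator[str]:
    """Yield text fragments from a streamed /api/chat response body."""
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between objects
        chunk = json.loads(raw)
        yield chunk.get("message", {}).get("content", "")
        if chunk.get("done"):
            break


def stream_chat(model: str, messages: list,
                base_url: str = "http://localhost:11434") -> str:
    """POST to /api/chat with streaming on; print fragments and return the full text."""
    body = json.dumps({"model": model, "messages": messages, "stream": True}).encode()
    req = urllib.request.Request(
        f"{base_url}/api/chat", data=body,
        headers={"Content-Type": "application/json"},
    )
    parts = []
    with urllib.request.urlopen(req) as resp:
        for piece in iter_content(resp):  # the response object iterates line by line
            print(piece, end="", flush=True)
            parts.append(piece)
    return "".join(parts)
```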
OpenAI-Compatible Endpoint
Ollama provides an OpenAI-compatible endpoint, making it easy to migrate existing applications:
# OpenAI-compatible /v1/chat/completions
curl http://localhost:11434/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer ollama" \
-d '{
"model": "llama3.2",
"messages": [{"role": "user", "content": "Hello"}]
}'
Python Integration
Using the official ollama Python library:
# Install
# pip install ollama
import ollama
# Simple generation
response = ollama.generate(
    model='llama3.2',
    prompt='Explain quantum computing in one sentence'
)
print(response['response'])
# Chat format
response = ollama.chat(
    model='llama3.2',
    messages=[
        {'role': 'system', 'content': 'You are a code review expert'},
        {'role': 'user', 'content': "What's wrong with this code?\n\nfor i in range(len(arr)):\n if arr[i] == x: return i"}
    ]
)
print(response['message']['content'])
# Streaming output
stream = ollama.chat(
    model='llama3.2',
    messages=[{'role': 'user', 'content': 'Write a poem about programming'}],
    stream=True
)
for chunk in stream:
    print(chunk['message']['content'], end='', flush=True)
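The chat endpoint is stateless, so a multi-turn conversation only works if the client resends the earlier messages on every call. A small sketch of that bookkeeping follows; the `Conversation` class and its `chat_fn` hook are hypothetical conveniences, with the default backend wrapping the `ollama.chat` call shown above.

```python
from typing import Callable, Optional


def _ollama_chat(model: str, messages: list) -> str:
    """Default backend: the ollama.chat call shown above (imported lazily)."""
    import ollama
    return ollama.chat(model=model, messages=messages)["message"]["content"]


class Conversation:
    """Keeps the message history so each turn sees the earlier ones."""

    def __init__(self, model: str, system: Optional[str] = None,
                 chat_fn: Callable[[str, list], str] = _ollama_chat):
        self.model = model
        self.chat_fn = chat_fn
        self.messages = [{"role": "system", "content": system}] if system else []

    def ask(self, text: str) -> str:
        """Send one user turn along with the full history and record the reply."""
        self.messages.append({"role": "user", "content": text})
        reply = self.chat_fn(self.model, self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

Passing a different `chat_fn` makes it possible to test the history handling without a running model.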
Calling Ollama with the OpenAI SDK
from openai import OpenAI
# Point to Ollama's local endpoint
client = OpenAI(
    base_url='http://localhost:11434/v1',
    api_key='ollama'  # any value works
)
response = client.chat.completions.create(
    model='llama3.2',
    messages=[
        {'role': 'user', 'content': 'Explain what RAG is'}
    ]
)
print(response.choices[0].message.content)
JavaScript/TypeScript Integration
// npm install ollama
import { Ollama } from 'ollama';
const ollama = new Ollama();
// Simple generation
const response = await ollama.generate({
  model: 'llama3.2',
  prompt: 'What is Docker?',
});
console.log(response.response);
// Chat format
const chat = await ollama.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Recommend 5 resources for learning Rust' }
  ],
});
console.log(chat.message.content);
Real-World Cases: Build a Local AI Assistant
Case 1: Local Code Review Assistant
Create an AI assistant specialized in reviewing code:
import ollama
import sys
def review_code(code: str, language: str = "Python"):
    """Local code review"""
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'''You are a senior {language} code reviewer.
Please review the code from the following perspectives:
1. Code style (PEP 8 / best practices)
2. Potential bugs and edge cases
3. Performance issues
4. Security vulnerabilities
5. Improvement suggestions
Output format:
- Severity: 🔴 Critical / 🟡 Warning / 🟢 Suggestion
- Issue description
- Fix with code example'''},
            {'role': 'user', 'content': f'Please review the following code:\n\n```{language}\n{code}\n```'}
        ]
    )
    return response['message']['content']
if __name__ == '__main__':
    # Require the path of the file to review
    if len(sys.argv) < 2:
        sys.exit('Usage: python review.py <file>')
    # Read code from file
    with open(sys.argv[1], 'r', encoding='utf-8') as f:
        code = f.read()
    print(review_code(code))
Usage:
python review.py my_script.py
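Because the system prompt asks for severity markers, the review text can be tallied afterwards, for example to fail a pre-commit hook on critical findings. This is a sketch under the assumption that the model actually follows the requested output format, which it may not always do; `count_severities` and `has_blockers` are hypothetical helpers.

```python
from collections import Counter

# Markers copied from the "Output format" part of the system prompt above.
SEVERITY_MARKERS = {"🔴": "critical", "🟡": "warning", "🟢": "suggestion"}


def count_severities(review: str) -> Counter:
    """Count the lines of a review that contain each severity marker."""
    counts = Counter()
    for line in review.splitlines():
        for marker, label in SEVERITY_MARKERS.items():
            if marker in line:
                counts[label] += 1
                break  # count each line at most once
    return counts


def has_blockers(review: str) -> bool:
    """True if the review reports at least one critical finding."""
    return count_severities(review)["critical"] > 0
```

For instance, `has_blockers(review_code(code))` could decide the exit status of the script.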
Case 2: Local RAG Knowledge Base
Build a local RAG system combined with a vector database:
# pip install ollama chromadb sentence-transformers
import ollama
import chromadb
from chromadb.utils import embedding_functions
# Initialize Chroma client
chroma_client = chromadb.Client()
# Use a local embedding model
embed_fn = embedding_functions.SentenceTransformerEmbeddingFunction(
    model_name="all-MiniLM-L6-v2"
)
collection = chroma_client.get_or_create_collection(
    name="knowledge_base",
    embedding_function=embed_fn
)
# Add a document
def add_document(text: str, metadata: dict | None = None):
    collection.add(
        documents=[text],
        # Chroma rejects empty metadata dicts, so pass None when there is no metadata
        metadatas=[metadata] if metadata else None,
        ids=[f"doc_{collection.count()}"]
    )
# Retrieve relevant documents
def retrieve(query: str, n_results: int = 3):
    results = collection.query(
        query_texts=[query],
        n_results=n_results
    )
    return results['documents'][0]
# Generate an answer
def rag_answer(question: str) -> str:
    # 1. Retrieve relevant documents
    context_docs = retrieve(question)
    context = "\n\n".join(context_docs)
    # 2. Call Ollama to generate the answer
    response = ollama.chat(
        model='llama3.2',
        messages=[
            {'role': 'system', 'content': f'Answer the question based on the following context:\n\n{context}'},
            {'role': 'user', 'content': question}
        ]
    )
    return response['message']['content']
# Usage example
add_document("Ollama supports open-source models like Llama, Mistral, and Qwen")
add_document("Ollama provides an OpenAI-compatible REST API")
print(rag_answer("Which models does Ollama support?"))
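To make the retrieval step less of a black box, here is a toy version of what a vector store does when queried: rank stored embedding vectors by similarity to the query vector. Cosine similarity is used purely for illustration (Chroma's default distance metric may differ), and the two-dimensional vectors in the usage note stand in for real embeddings.

```python
import math


def cosine(a: list, b: list) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list, doc_vecs: dict, k: int = 3) -> list:
    """Return the ids of the k documents most similar to the query, best first."""
    ranked = sorted(doc_vecs, key=lambda doc_id: cosine(query_vec, doc_vecs[doc_id]),
                    reverse=True)
    return ranked[:k]
```

With `docs = {"doc_0": [1.0, 0.0], "doc_1": [0.0, 1.0], "doc_2": [0.7, 0.7]}`, the call `top_k([1.0, 0.1], docs, k=2)` ranks `doc_0` first and `doc_2` second.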
Case 3: Batch Text Processing
import ollama
import json
def extract_entities(text: str) -> dict:
    """Extract entities from text"""
    response = ollama.chat(
        model='llama3.2',
        messages=[{
            'role': 'user',
            'content': f'''Extract entities from the following text and return in JSON format:
{text}
Return format:
{{
"people": [],
"organizations": [],
"locations": [],
"dates": [],
"events": []
}}'''
        }],
        format='json'  # ask Ollama to constrain the reply to valid JSON
    )
    try:
        return json.loads(response['message']['content'])
    except json.JSONDecodeError:
        # Fall back to an empty result if the reply is not valid JSON
        return {}
# Batch processing
texts = [
    "Apple CEO Tim Cook visited Beijing in June 2026",
    "OpenAI released GPT-5 in San Francisco"
]
for text in texts:
    entities = extract_entities(text)
    print(json.dumps(entities, ensure_ascii=False, indent=2))
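The `except` branch above discards any reply that is not pure JSON, which can happen when a model wraps its JSON in a Markdown fence or adds a sentence around it. A more forgiving parser might look like the sketch below; `parse_json_reply` is a hypothetical helper, and the heuristics (first fence, outermost braces) are assumptions that will not cover every reply.

```python
import json
import re

# Matches a Markdown code fence (three backticks, optional "json" label).
_FENCE = re.compile(r"`{3}(?:json)?\s*(.*?)`{3}", re.DOTALL)


def parse_json_reply(text: str) -> dict:
    """Best-effort extraction of a JSON object from a model reply; {} on failure."""
    match = _FENCE.search(text)
    candidate = match.group(1) if match else text
    # Keep only the span from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if start == -1 or end <= start:
        return {}
    try:
        parsed = json.loads(candidate[start:end + 1])
    except json.JSONDecodeError:
        return {}
    return parsed if isinstance(parsed, dict) else {}
```

Swapping `json.loads(...)` for `parse_json_reply(...)` in `extract_entities` would keep the same return type while recovering more replies.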
Performance Optimization & GPU Acceleration
GPU Selection
Ollama automatically selects the best available GPU. You can also control it manually:
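# Limit Ollama to specific GPUs (Linux); set these in the environment of the Ollama server process
export CUDA_VISIBLE_DEVICES=0   # NVIDIA
export ROCR_VISIBLE_DEVICES=0   # AMD
# AMD only: override the reported GFX version for GPUs that ROCm does not officially support
export HSA_OVERRIDE_GFX_VERSION=10.3.0
ollama serve
# Check whether loaded models are running on GPU or CPU
ollama ps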
Context Window Tuning
# Increase context window for long documents
response = ollama.chat(
    model='llama3.2',
    messages=[...],
    options={
        'num_ctx': 8192  # the default depends on the Ollama version (2048 in older releases)
    }
)
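A larger context window costs memory, because the key/value cache grows linearly with `num_ctx`. The back-of-the-envelope estimate below assumes a 16-bit cache and an illustrative architecture for a 3B Llama-style model (28 layers, 8 KV heads, head size 128); these numbers are assumptions, so check the model card of the model you run.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int, n_kv_heads: int,
                   head_dim: int, bytes_per_value: int = 2) -> int:
    """Bytes needed to cache keys and values for num_ctx tokens."""
    # The leading 2 accounts for storing both a key and a value per position.
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


# Assumed architecture: 28 layers, 8 KV heads, head size 128, 16-bit cache.
for ctx in (2048, 8192, 32768):
    gib = kv_cache_bytes(ctx, n_layers=28, n_kv_heads=8, head_dim=128) / 2**30
    print(f"num_ctx={ctx:>6}: {gib:.2f} GiB")
```

Under these assumptions, going from 2048 to 8192 tokens quadruples the cache size, which is worth weighing against the available VRAM.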
Quantization
Ollama automatically downloads quantized models (typically Q4_K_M). You can choose different quantization levels:
# Higher precision (larger file, better quality)
ollama pull llama3.2:3b-instruct-fp16
# Lower precision (smaller file, faster inference, lower quality)
ollama pull llama3.2:3b-instruct-q2_K
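The trade-off between quantization levels can be estimated with simple arithmetic: file size is roughly parameter count times bits per weight, divided by 8. The bits-per-weight figures below are approximations (quantized files mix tensor types and carry metadata), and the 3.2 billion parameter count is an assumed example.

```python
# Approximate bits per weight; treat these as rough, illustrative values.
BITS_PER_WEIGHT = {"fp16": 16.0, "q8_0": 8.5, "q4_K_M": 4.85}


def estimated_size_gb(n_params: float, quant: str) -> float:
    """Rough model file size in GB (10^9 bytes) for a parameter count and quantization."""
    return n_params * BITS_PER_WEIGHT[quant] / 8 / 1e9


for quant in BITS_PER_WEIGHT:
    print(f"{quant:>7}: ~{estimated_size_gb(3.2e9, quant):.1f} GB")
```

The Q4_K_M estimate for a model of this size lands near 2 GB, in line with the `llama3.2:latest` entry in the `ollama list` example earlier.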
Frequently Asked Questions
Does Ollama require an internet connection?
No. After downloading the model, everything runs fully offline. This is one of Ollama's biggest advantages — complete data privacy.
How much RAM do I need?
| Model Size | Recommended RAM |
|---|---|
| 1–3B | 4 GB+ |
| 7B | 8 GB+ |
| 14B | 16 GB+ |
| 32B+ | 32 GB+ |
GPU memory (VRAM) is even more important if you want GPU acceleration.
Can I use Ollama in production?
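Yes, with some care. Ollama itself is MIT-licensed and commercially usable, but each model carries its own license (Llama models, for example, ship under Meta's community license), so check the terms of the models you deploy. Many teams use it in production environments, especially for internal tools and privacy-sensitive applications. The API has no built-in authentication, so keep it bound to localhost or place it behind an authenticating reverse proxy.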
How do I update Ollama?
# macOS (Homebrew)
brew upgrade ollama
# Linux (re-run installer)
curl -fsSL https://ollama.com/install.sh | sh
# Windows
winget upgrade Ollama.Ollama
Summary
Ollama has quickly become the go-to tool for running open-source LLMs locally. Its one-command simplicity, rich model ecosystem, and OpenAI-compatible API make it ideal for developers who want AI power without cloud dependencies or data privacy concerns.
Whether you're building a local coding assistant, a RAG knowledge base, or just experimenting with the latest open-source models, Ollama lowers the barrier to entry dramatically. Give it a try: run ollama run llama3.2 and you'll be chatting with your own local AI in minutes.