Why Do You Need WhichLLM?
Running large language models (LLMs) locally has become routine for developers. Whether you use Ollama to run Qwen, llama.cpp with GGUF files, or LM Studio to load models, you face the same questions:
What models can my hardware handle? Which model performs best on my device?
HuggingFace hosts hundreds of thousands of models, ranging from 1.5B to 72B parameters, in quantization formats such as Q4_K_M, Q5_K_M, and Q8_0. Downloading a 70B model only to discover that it does not fit in your VRAM wastes time and bandwidth.
WhichLLM was born to solve this pain point. It automatically detects your hardware configuration, compares real benchmark data (not just parameter counts), and recommends the most suitable models for your hardware in seconds.
Core Features at a Glance
| Feature | Description |
|---|---|
| Auto-Detection | Automatically identifies NVIDIA/AMD/Apple Silicon/CPU configurations |
| Smart Ranking | Ranks by real benchmarks (LiveBench, Aider, Open LLM Leaderboard, etc.), not by parameter count |
| Time-Aware | Older models won't rank ahead of newer ones just because of historical scores |
| GPU Simulation | Predict performance before upgrading GPUs: whichllm --gpu "RTX 5090" |
| One-Click Chat | Download and start interactive conversations after recommendation |
| Code Snippets | Generate ready-to-run Python inference code |
| Upgrade Planning | Compare differences between current and candidate hardware |
Installation Methods
WhichLLM offers multiple installation options. We recommend using uv or Homebrew:
Method 1: uv (recommended; can run without installing)
# One-time run (no installation)
uvx whichllm@latest
# Install globally
uv tool install whichllm
# Update
uv tool upgrade whichllm
Method 2: Homebrew
brew install andyyyy64/whichllm/whichllm
Method 3: pip
pip install whichllm
After installation, simply run whichllm to get started.
Quick Start
Auto-Detect and Recommend Models
No configuration needed, just run:
whichllm
Example output:
#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s
WhichLLM detects your GPU, CPU, and RAM automatically and lists the most suitable models sorted by composite score. Note that #3 is a Mixture of Experts (MoE) model: only a fraction of its 30B total parameters is active for each token, so it generates text much faster than the dense models ranked above it.
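A rough back-of-the-envelope estimate shows why the MoE entry is so much faster. The sketch below assumes that token generation is limited by memory bandwidth, so every active weight is read once per token. This is a common rule of thumb, not WhichLLM's actual speed model. The bandwidth figure, the bits-per-weight figure, and the 3B active-parameter count (inferred from the "A3B" suffix in the model name) are illustrative assumptions.

```python
def decode_speed_upper_bound(active_params_b: float,
                             bits_per_weight: float,
                             bandwidth_gb_s: float) -> float:
    """Rough upper bound on tokens/s when decoding is memory-bandwidth bound.

    active_params_b: parameters read per token, in billions.
    bits_per_weight: average bits stored per weight after quantization.
    bandwidth_gb_s:  memory bandwidth in GB/s.
    """
    gb_read_per_token = active_params_b * bits_per_weight / 8
    return bandwidth_gb_s / gb_read_per_token


BANDWIDTH_GB_S = 1000.0  # assumption: roughly a high-end consumer GPU
Q5_K_M_BITS = 5.7        # assumption: approximate average for Q5_K_M

dense = decode_speed_upper_bound(27.8, Q5_K_M_BITS, BANDWIDTH_GB_S)
moe = decode_speed_upper_bound(3.0, Q5_K_M_BITS, BANDWIDTH_GB_S)

print(f"dense 27.8B:    at most {dense:.0f} t/s")
print(f"MoE, 3B active: at most {moe:.0f} t/s")
```

Real speeds are lower than these bounds, and the measured gap between MoE and dense models is smaller than the ratio of active parameters suggests, because attention, the KV cache, and expert routing add overhead that this sketch ignores.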
View Hardware Information
whichllm hardware
Example output:
GPU: NVIDIA RTX 4090 (24 GB VRAM)
CPU: AMD Ryzen 9 7950X (16 cores)
RAM: 64 GB
This is helpful for understanding your hardware configuration, especially when you're unsure about your GPU model and VRAM size.
Advanced Features
GPU Simulation: Test Before Buying
Considering an upgrade to RTX 5090 but want to know what models it can run? WhichLLM can simulate any GPU:
# Simulate RTX 5090
whichllm --gpu "RTX 5090"
# Simulate RTX 4060
whichllm --gpu "RTX 4060"
# Custom VRAM
whichllm --gpu "RTX 5060 16"
Example output:
#1 Qwen/Qwen3.6-27B 27.8B Q6_K score 94.7 ~40 t/s
#2 Qwen/Qwen3-32B 32.0B Q5_K_M score 88.0 ~38 t/s
This is very practical for making decisions before purchasing hardware.
Upgrade Planning: Compare Upgrade Benefits
Want to know how much you would gain by moving from an RTX 4090 to an RTX 5090, or even an H100?
whichllm upgrade "RTX 4090" "RTX 5090" "H100"
The output shows recommended models, scores, and speeds for each configuration at a glance.
Reverse Planning: What GPU Do You Need for a Specific Model?
# What GPU is needed to run Llama 3 70B?
whichllm plan "llama 3 70b"
# What is needed for the Q8_0 quantized version of Qwen2.5-72B?
whichllm plan "Qwen2.5-72B" --quant Q8_0
# Long context scenarios
whichllm plan "mistral 7b" --context-length 32768
The output tells you how much VRAM you need, what kind of GPU is recommended, and the achievable performance level.
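The VRAM figure in such a plan is dominated by two terms: the quantized weights and the key/value cache, which grows linearly with context length. The following is a minimal sketch of that arithmetic, not WhichLLM's own formula. The architecture numbers (32 layers, 8 KV heads, head dimension 128, about 7.2B parameters) are assumed values for a Mistral 7B-style model, and 4.85 bits per weight is an assumed approximation for Q4_K_M.

```python
def weights_gib(params_b: float, bits_per_weight: float) -> float:
    """Size of the quantized weights in GiB."""
    return params_b * 1e9 * bits_per_weight / 8 / 2**30


def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_length: int, bytes_per_value: int = 2) -> float:
    """Size of the key/value cache in GiB.

    The leading factor 2 accounts for storing both keys and values;
    bytes_per_value=2 corresponds to a 16-bit cache.
    """
    total_bytes = (2 * n_layers * n_kv_heads * head_dim
                   * context_length * bytes_per_value)
    return total_bytes / 2**30


# Assumed Mistral 7B-style configuration at a 32768-token context.
weights = weights_gib(params_b=7.2, bits_per_weight=4.85)
cache = kv_cache_gib(n_layers=32, n_kv_heads=8, head_dim=128,
                     context_length=32768)

print(f"weights  ~{weights:.1f} GiB")
print(f"KV cache ~{cache:.1f} GiB")
print(f"total    ~{weights + cache:.1f} GiB (before runtime overhead)")
```

Under these assumptions the long context roughly doubles the memory requirement compared with the weights alone, which is why the context-length option matters when planning. Activation buffers and framework overhead come on top and are not modeled here.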
One-Click Chat: Recommend and Use Immediately
After getting model recommendations, you can start a conversation directly:
# Download and chat with a specific model
whichllm run "qwen 2.5 1.5b gguf"
# Auto-select the best model and chat
whichllm run
WhichLLM automatically creates an isolated environment, downloads the model, and starts an interactive conversation.
Generate Python Code Snippets
If you want to use a model in your own application, WhichLLM can generate ready-to-run Python code:
whichllm snippet "qwen 7b"
Output:
from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])
Copy and paste it into your project; there is no need to dig through documentation first.
Advanced Configuration Options
Filter by Use Case
WhichLLM supports filtering models by task type:
# Select the best model for coding
whichllm --profile code
# General conversation
whichllm --profile general
# Mathematical reasoning
whichllm --profile math
# Vision/multimodal
whichllm --profile vision
Filter by Quantization, Result Count, and Speed
# Only recommend Q4_K_M quantization
whichllm --quant Q4_K_M
# Show more results
whichllm --top 20
# Minimum speed requirement
whichllm --min-speed 30
Evidence Level Control
WhichLLM labels evidence levels for each model's score, allowing you to control recommendation reliability:
# Strict mode: only recommend models with direct benchmark data
whichllm --evidence strict
# Base mode: allow cross-generation matching within the same series
whichllm --evidence base
JSON Output: Script Integration
Integrate WhichLLM into automated workflows:
# Get the best model's HuggingFace ID
whichllm --top 1 --json | jq -r '.models[0].model_id'
# Get full info for the top 3 coding models
whichllm --profile code --top 3 --json | jq '.models[] | {name: .model_id, score: .score, speed: .estimated_tok_per_sec}'
Example output:
{
  "models": [
    {
      "model_id": "Qwen/Qwen3.6-27B-GGUF",
      "score": 92.8,
      "estimated_tok_per_sec": 27.3,
      "speed_confidence": "high",
      "quant": "Q5_K_M"
    }
  ]
}
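The same JSON can be consumed from Python instead of jq. The sketch below assumes the output has exactly the shape of the example above (a top-level `models` list with `model_id`, `score`, `estimated_tok_per_sec`, and `quant` fields). The helper names `parse_recommendations` and `fetch_recommendations` are made up for this illustration and are not part of WhichLLM.

```python
import json
import subprocess


def parse_recommendations(raw_json: str, min_speed: float = 0.0) -> list[dict]:
    """Return models at or above min_speed (t/s), highest score first."""
    models = json.loads(raw_json).get("models", [])
    fast_enough = [m for m in models
                   if m.get("estimated_tok_per_sec", 0.0) >= min_speed]
    return sorted(fast_enough, key=lambda m: m["score"], reverse=True)


def fetch_recommendations(top: int = 3) -> str:
    """Run the CLI with the --top and --json flags and return its raw output."""
    result = subprocess.run(
        ["whichllm", "--top", str(top), "--json"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


# Sample payload copied from the example output, so this runs without the CLI.
SAMPLE = """
{
  "models": [
    {
      "model_id": "Qwen/Qwen3.6-27B-GGUF",
      "score": 92.8,
      "estimated_tok_per_sec": 27.3,
      "speed_confidence": "high",
      "quant": "Q5_K_M"
    }
  ]
}
"""

best = parse_recommendations(SAMPLE, min_speed=20)[0]
print(best["model_id"], best["quant"])
```

In a real script, replace `SAMPLE` with `fetch_recommendations()` once whichllm is installed and on the PATH.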
Scoring Mechanism Explained
Scoring is WhichLLM's central feature, so it is worth explaining separately.
Data Sources
WhichLLM aggregates benchmark data from:
- LiveBench: frequently refreshed evaluations
- Artificial Analysis: independent third-party testing platform
- Aider: AI coding assistant benchmarks
- Chatbot Arena Elo: LMSYS crowdsourced ratings
- Open LLM Leaderboard: HuggingFace's official leaderboard
- Multimodal/vision evaluations: when applicable
Scoring Rules
- Bigger isn't better: A 27B model with higher scores will rank ahead of a 32B model
- Temporal decay: A 2024 model will not outrank a newer 2026 model solely because of its historically high scores
- Confidence grading: Each score is labeled with an evidence level:
  - direct: Direct benchmark data for this model and quantization
  - variant: Data from different quantization versions of the same model
  - base: Base model data (cross-generation inference)
  - interpolated: Interpolated estimates
  - self-reported: Developer self-reported data
- Anti-fraud mechanism: Rejects unreliable fake upload data, preventing small modified models from borrowing large model scores
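To make the interaction between temporal decay and evidence levels concrete, here is a purely illustrative sketch. It is not WhichLLM's actual formula: the exponential decay, the one-year half-life, the per-level weights, and the two candidate models are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical trust weights per evidence level (not WhichLLM's real values).
EVIDENCE_WEIGHT = {
    "direct": 1.0,
    "variant": 0.95,
    "base": 0.85,
    "interpolated": 0.75,
    "self-reported": 0.6,
}
HALF_LIFE_DAYS = 365.0  # hypothetical: a score loses half its weight per year


@dataclass
class Candidate:
    name: str
    raw_score: float
    released: date
    evidence: str


def adjusted_score(c: Candidate, today: date) -> float:
    """Raw score scaled by an age-based decay and an evidence weight."""
    age_days = max((today - c.released).days, 0)
    decay = 0.5 ** (age_days / HALF_LIFE_DAYS)
    return c.raw_score * decay * EVIDENCE_WEIGHT[c.evidence]


today = date(2026, 6, 1)
older = Candidate("older-32b", 90.0, date(2024, 6, 1), "direct")
newer = Candidate("newer-27b", 85.0, date(2026, 3, 1), "variant")

ranked = sorted([older, newer],
                key=lambda c: adjusted_score(c, today), reverse=True)
for c in ranked:
    print(f"{c.name}: {adjusted_score(c, today):.1f}")
```

With these made-up numbers, the newer and smaller model ranks first even though its raw score is lower and its evidence is weaker, which mirrors the "bigger isn't better" and temporal decay rules.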
Real Tests: Best Choices for Different Hardware
According to WhichLLM's latest data (June 2026), recommendations for different hardware configurations are as follows:
| Hardware | VRAM | Best Recommendation | Speed |
|---|---|---|---|
| RTX 5090 | 32 GB | Qwen3.6-27B Q6_K (score 94.7) | ~40 t/s |
| RTX 4090 / 3090 | 24 GB | Qwen3.6-27B Q5_K_M (score 92.8) | ~27 t/s |
| RTX 4060 | 8 GB | Qwen3-14B Q3_K_M (score 71.0) | ~22 t/s |
| Apple M3 Max | 36 GB | Qwen3.6-27B Q5_K_M (score 89.4) | ~9 t/s |
| Pure CPU | — | gpt-oss-20b (MoE) Q4_K_M (score 45.2) | ~6 t/s |
💡 Note: MoE (Mixture of Experts) models have far fewer active parameters than total parameters, so they run faster on the same hardware. WhichLLM's scoring mechanism already accounts for this.
Comparison with Similar Tools
| Feature | WhichLLM | Ollama | LM Studio | LocalAI |
|---|---|---|---|---|
| Auto-recommend best model | ✅ | ❌ | ❌ | ❌ |
| GPU simulation | ✅ | ❌ | ❌ | ❌ |
| Benchmark-based ranking | ✅ | ❌ | ❌ | ❌ |
| Upgrade planning | ✅ | ❌ | ❌ | ❌ |
| Code snippet generation | ✅ | ❌ | ❌ | ❌ |
| One-click chat | ✅ | ✅ | ✅ | ✅ |
| JSON output | ✅ | ✅ | ❌ | ❌ |
| Lightweight CLI | ✅ | ✅ | ❌ | ✅ |
Ollama remains a great choice for running models, while WhichLLM solves the model selection problem. The two are complementary: use WhichLLM to find the best model, then run it with Ollama or another tool.
Practical Scenarios
Scenario 1: New GPU Purchase Decision
You want to upgrade from RTX 3060, with enough budget for either RTX 5070 or a used RTX 4090:
# Simulate current hardware
whichllm --gpu "RTX 3060"
# Simulate upgrade options
whichllm upgrade "RTX 3060" "RTX 5070" "RTX 4090"
Compare the output to see which models each card can run and at what speed before you commit to a purchase.
Scenario 2: Model Selection Advice for Your Project
Your team is developing a local RAG system and needs to choose an inference model:
# Test with CI server configuration
whichllm --gpu "RTX 4090" --profile code --json | jq
Use JSON output to integrate directly into decision documents.
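One way to get the JSON into a decision document is to convert it to a markdown table. The sketch below assumes the field names from the example JSON output in the "JSON Output" section (`model_id`, `quant`, `score`, `estimated_tok_per_sec`). The `to_markdown_table` helper is hypothetical and not part of WhichLLM.

```python
import json


def to_markdown_table(raw_json: str) -> str:
    """Render the recommended models as a markdown pipe table."""
    models = json.loads(raw_json)["models"]
    lines = [
        "| Model | Quant | Score | Speed (t/s) |",
        "|---|---|---|---|",
    ]
    for m in models:
        lines.append(
            f"| {m['model_id']} | {m['quant']} "
            f"| {m['score']:.1f} | {m['estimated_tok_per_sec']:.1f} |"
        )
    return "\n".join(lines)


# Sample payload in the documented shape; in practice, read the CLI output instead.
SAMPLE = """
{"models": [{"model_id": "Qwen/Qwen3.6-27B-GGUF", "score": 92.8,
             "estimated_tok_per_sec": 27.3, "speed_confidence": "high",
             "quant": "Q5_K_M"}]}
"""

table = to_markdown_table(SAMPLE)
print(table)
```

To use real data, save the CLI's JSON output to a file and pass the file's contents to `to_markdown_table`.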
Scenario 3: Edge Device Deployment
# CPU-only mode
whichllm --cpu-only --top 10
This finds small models suited to CPU inference, for example on IoT or embedded devices.
FAQ
Q: Does WhichLLM download models locally?
A: Not by default. whichllm only queries the HuggingFace API and locally cached data. Only the whichllm run command downloads models.
Q: How do I update model data?
A: Force a cache refresh to fetch the latest model data from HuggingFace:
whichllm --refresh
Q: What if I have no network access?
A: WhichLLM has a built-in caching mechanism and uses pre-cached snapshot data when offline.
Q: Does it support AMD graphics cards?
A: Yes, WhichLLM supports NVIDIA, AMD, Apple Silicon, and pure CPU modes.
Summary
WhichLLM solves a core pain point for local LLM developers: finding the right model for your hardware among a sea of options. Its highlights include:
- Evidence-based smart ranking: combines real benchmark data with recency instead of sorting by parameter count
- GPU simulation: check expected performance before buying hardware and avoid impulse purchases
- One-click integration: recommendation, chat, and code generation in a single tool
- Scriptable: JSON output for easy integration into automated workflows
Whether you're a developer just getting started with local LLMs or a team building production environments, WhichLLM can save you hours of trial and error.
Project Info:
- GitHub: github.com/Andyyyy64/whichllm
- License: MIT
- Install: pip install whichllm or brew install andyyyy64/whichllm/whichllm