Which LLM for My GPU? Quick Check 2026

2026-06-10 LLM AI Local Models Open Source Tools Benchmarking

Why Do You Need WhichLLM?

Running large language models (LLMs) locally has become routine for developers. Whether you use Ollama to run Qwen, llama.cpp with GGUF files, or LM Studio to load models, you run into the same two questions:

What models can my hardware handle? Which model performs best on my device?

HuggingFace hosts hundreds of thousands of models, ranging from 1.5B to 72B parameters, in quantization formats such as Q4_K_M, Q5_K_M, and Q8_0. Blindly downloading a 70B model only to discover that it does not fit in your VRAM wastes time and bandwidth.

WhichLLM was born to solve this pain point. It automatically detects your hardware configuration, compares real benchmark data (not just parameter counts), and recommends the most suitable models for your hardware in seconds.
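To make the VRAM problem concrete, here is a minimal back-of-the-envelope sketch. It is not WhichLLM's own logic: the bits-per-weight figures are rough approximations for GGUF quantizations, and the flat `overhead_gb` allowance is a hypothetical placeholder for the KV cache and runtime buffers.

```python
# Approximate bits per weight for common GGUF quantizations.
# Rough, illustrative figures (assumptions), not exact values.
APPROX_BITS_PER_WEIGHT = {
    "Q4_K_M": 4.85,
    "Q5_K_M": 5.7,
    "Q8_0": 8.5,
}


def estimate_weights_gb(params_billion: float, quant: str) -> float:
    """Rough size of the model weights in GB (1 GB = 1e9 bytes)."""
    bits = APPROX_BITS_PER_WEIGHT[quant]
    # params_billion * 1e9 weights * bits / 8 bytes, divided by 1e9 bytes per GB
    return params_billion * bits / 8


def fits_in_vram(params_billion: float, quant: str, vram_gb: float,
                 overhead_gb: float = 2.0) -> bool:
    """True if weights plus a flat overhead allowance fit in VRAM.

    overhead_gb is a hypothetical placeholder; real usage depends on
    context length and the inference runtime.
    """
    return estimate_weights_gb(params_billion, quant) + overhead_gb <= vram_gb


if __name__ == "__main__":
    for size in (7, 32, 70):
        gb = estimate_weights_gb(size, "Q4_K_M")
        verdict = fits_in_vram(size, "Q4_K_M", 24)
        print(f"{size}B @ Q4_K_M: ~{gb:.1f} GB of weights, fits in 24 GB: {verdict}")
```

Under these assumptions a 70B model at Q4_K_M needs roughly 42 GB for the weights alone, which is why it cannot run entirely on a 24 GB card.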

Core Features at a Glance

Feature	Description
Auto-Detection	Automatically identifies NVIDIA/AMD/Apple Silicon/CPU configurations
Smart Ranking	Ranks by real benchmarks (LiveBench, Aider, Open LLM Leaderboard, etc.), not raw parameter counts
Time-Aware	Older models won't rank ahead of newer ones just because of historical scores
GPU Simulation	Predict performance before upgrading GPUs: `whichllm --gpu "RTX 5090"`
One-Click Chat	Download and start interactive conversations after recommendation
Code Snippets	Generate ready-to-run Python inference code
Upgrade Planning	Compare differences between current and candidate hardware

Installation Methods

WhichLLM offers multiple installation options. We recommend using uv or Homebrew:

Method 1: uv (Recommended, zero-install direct run)

# One-time run (no installation)
uvx whichllm@latest

# Install globally
uv tool install whichllm

# Update
uv tool upgrade whichllm

Method 2: Homebrew

brew install andyyyy64/whichllm/whichllm

Method 3: pip

pip install whichllm

After installation, simply run whichllm to get started.

Quick Start

Auto-Detect and Recommend Models

No configuration is needed. Just run:

whichllm

Example output:

#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M  score 92.8  27 t/s
#2 Qwen/Qwen3-32B  32.0B Q4_K_M  score 83.0  31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M  score 82.7  102 t/s

WhichLLM detects your GPU, CPU, and RAM automatically and lists the most suitable models, sorted by composite score. Note that #3 is a MoE (Mixture of Experts) model: its total parameter count is comparable to the other two, but only a small fraction of those parameters is active for each token, which makes it much faster.
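The speed gap can be illustrated with a simple memory-bandwidth bound. This is a sketch under stated assumptions, not WhichLLM's speed model: it assumes decoding is limited by reading every active weight once per token, uses a hypothetical 1000 GB/s card, about 5.7 bits per weight for Q5_K_M, and about 3B active parameters for the MoE model (as the "A3B" suffix suggests).

```python
def max_tokens_per_sec(active_params_billion: float,
                       bits_per_weight: float,
                       bandwidth_gb_s: float) -> float:
    """Upper bound on decode speed when generation is memory-bandwidth bound.

    Each generated token requires reading every active weight once, so
    tokens/s <= bandwidth / size_of_active_weights.
    """
    active_gb = active_params_billion * bits_per_weight / 8
    return bandwidth_gb_s / active_gb


if __name__ == "__main__":
    bandwidth = 1000.0  # GB/s, hypothetical GPU
    dense = max_tokens_per_sec(27.8, 5.7, bandwidth)  # dense 27.8B model
    moe = max_tokens_per_sec(3.0, 5.7, bandwidth)     # MoE, ~3B active (assumed)
    print(f"dense ceiling: {dense:.0f} t/s")
    print(f"MoE ceiling:   {moe:.0f} t/s")
```

These figures are ceilings. Measured speeds, such as those in the example output above, are lower because of compute, expert routing, and other overheads, but the ordering holds.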

View Hardware Information

whichllm hardware

Example output:

GPU: NVIDIA RTX 4090 (24 GB VRAM)
CPU: AMD Ryzen 9 7950X (16 cores)
RAM: 64 GB

This is useful when you are unsure of your exact GPU model or how much VRAM it has.
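For readers curious how such a hardware check can work on NVIDIA cards, the sketch below queries `nvidia-smi` directly. It is only an illustration and is not taken from WhichLLM's source; the helper names `parse_nvidia_smi` and `detect_nvidia_gpus` are hypothetical, and AMD, Apple Silicon, and CPU detection are out of scope here.

```python
import subprocess


def parse_nvidia_smi(csv_text: str) -> list[dict]:
    """Parse 'name, memory.total' CSV rows (no header, no units) into dicts."""
    gpus = []
    for line in csv_text.strip().splitlines():
        # Split on the last comma so GPU names containing commas stay intact.
        name, mem_mib = (field.strip() for field in line.rsplit(",", 1))
        gpus.append({"name": name, "vram_gb": round(int(mem_mib) / 1024, 1)})
    return gpus


def detect_nvidia_gpus() -> list[dict]:
    """Query nvidia-smi; return [] if the tool is missing or fails."""
    cmd = ["nvidia-smi", "--query-gpu=name,memory.total",
           "--format=csv,noheader,nounits"]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return []
    return parse_nvidia_smi(result.stdout)


if __name__ == "__main__":
    print(detect_nvidia_gpus())
```

`nvidia-smi` reports memory in MiB, so the sketch divides by 1024 to get a figure comparable to the marketing VRAM size.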

Advanced Features

GPU Simulation: Test Before Buying

Considering an upgrade to RTX 5090 but want to know what models it can run? WhichLLM can simulate any GPU:

# Simulate RTX 5090
whichllm --gpu "RTX 5090"

# Simulate RTX 4060
whichllm --gpu "RTX 4060"

# Custom VRAM
whichllm --gpu "RTX 5060 16"

Example output:

#1 Qwen/Qwen3.6-27B 27.8B Q6_K    score 94.7  ~40 t/s
#2 Qwen/Qwen3-32B  32.0B Q5_K_M  score 88.0  ~38 t/s

This is very practical for making decisions before purchasing hardware.

Upgrade Planning: Compare Upgrade Benefits

Want to know how much you would gain by upgrading from an RTX 4090 to an RTX 5090 or an H100?

whichllm upgrade "RTX 4090" "RTX 5090" "H100"

The output shows recommended models, scores, and speeds for each configuration at a glance.

Reverse Planning: What GPU Do You Need for a Specific Model?

# What GPU is needed to run Llama 3 70B?
whichllm plan "llama 3 70b"

# Run Qwen2.5-72B Q8_0 quantized version
whichllm plan "Qwen2.5-72B" --quant Q8_0

# Long context scenarios
whichllm plan "mistral 7b" --context-length 32768

The output tells you how much VRAM you need, what kind of GPU is recommended, and the achievable performance level.
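Context length matters for planning because the key/value cache grows linearly with it. The sketch below uses the standard KV cache size formula; it is not WhichLLM's planner. It assumes an fp16 cache and the Mistral 7B (v0.1) shape of 32 layers, 8 KV heads, and a head dimension of 128, and it ignores optimizations such as sliding-window attention or cache quantization.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   context_length: int, bytes_per_value: int = 2) -> int:
    """Size of the key/value cache for a single sequence.

    The factor 2 covers keys and values; bytes_per_value=2 assumes fp16.
    """
    return 2 * n_layers * n_kv_heads * head_dim * context_length * bytes_per_value


if __name__ == "__main__":
    # Assumed Mistral 7B (v0.1) shape: 32 layers, 8 KV heads (GQA), head_dim 128.
    for ctx in (4096, 32768):
        gib = kv_cache_bytes(32, 8, 128, ctx) / 2**30
        print(f"context {ctx}: ~{gib:.1f} GiB KV cache")
```

Going from a 4096-token to a 32768-token context multiplies the cache by eight under these assumptions, which comes on top of the memory needed for the weights.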

One-Click Chat: Recommend and Use Immediately

After getting model recommendations, you can start a conversation directly:

# Download and chat with a specific model
whichllm run "qwen 2.5 1.5b gguf"

# Auto-select the best model and chat
whichllm run

WhichLLM automatically creates an isolated environment, downloads the model, and starts an interactive conversation.

Generate Python Code Snippets

If you want to use a model in your own application, WhichLLM can generate ready-to-run Python code:

whichllm snippet "qwen 7b"

Output:

from llama_cpp import Llama

llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)

output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])

Copy, paste, and run it. There is no need to dig through library documentation first.

Advanced Configuration Options

Filter by Use Case

WhichLLM supports filtering models by task type:

# Select the best model for coding
whichllm --profile code

# General conversation
whichllm --profile general

# Mathematical reasoning
whichllm --profile math

# Vision/multimodal
whichllm --profile vision

Filter by Quantization Precision

# Only recommend Q4_K_M quantization
whichllm --quant Q4_K_M

# Show more results
whichllm --top 20

# Minimum speed requirement
whichllm --min-speed 30

Evidence Level Control

WhichLLM labels each model's score with an evidence level, so you can control how reliable the recommendations must be:

# Strict mode: only recommend models with direct benchmark data
whichllm --evidence strict

# Base mode: allow cross-generation matching within the same series
whichllm --evidence base
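One way to picture what these modes do is as a threshold over the evidence labels listed under Scoring Rules. The sketch below is a guess for illustration only: the ordering of the labels, the `WEAKEST_ALLOWED` mapping, and the `evidence` field name are all assumptions, not WhichLLM's documented behavior.

```python
# Evidence labels as listed in this article, assumed to be ordered from
# most to least reliable.
EVIDENCE_ORDER = ["direct", "variant", "base", "interpolated", "self-reported"]

# Hypothetical mapping from an --evidence mode to the weakest label it accepts.
WEAKEST_ALLOWED = {"strict": "direct", "base": "base"}


def filter_by_evidence(models: list[dict], mode: str) -> list[dict]:
    """Keep models whose evidence label is at least as strong as the mode allows."""
    cutoff = EVIDENCE_ORDER.index(WEAKEST_ALLOWED[mode])
    return [m for m in models if EVIDENCE_ORDER.index(m["evidence"]) <= cutoff]


if __name__ == "__main__":
    # Made-up candidates, for illustration only.
    candidates = [
        {"name": "model-a", "evidence": "direct"},
        {"name": "model-b", "evidence": "variant"},
        {"name": "model-c", "evidence": "interpolated"},
    ]
    print([m["name"] for m in filter_by_evidence(candidates, "strict")])
    print([m["name"] for m in filter_by_evidence(candidates, "base")])
```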

JSON Output: Script Integration

Integrate WhichLLM into automated workflows:

# Get the best model's HuggingFace ID
whichllm --top 1 --json | jq -r '.models[0].model_id'

# Get full info for the top 3 coding models
whichllm --profile code --top 3 --json | jq '.models[] | {name: .model_id, score: .score, speed: .estimated_tok_per_sec}'

Example output:

{
  "models": [
    {
      "model_id": "Qwen/Qwen3.6-27B-GGUF",
      "score": 92.8,
      "estimated_tok_per_sec": 27.3,
      "speed_confidence": "high",
      "quant": "Q5_K_M"
    }
  ]
}
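If you prefer Python over jq, the same output can be consumed with the standard library. This is a minimal sketch that relies only on the fields visible in the example output above; the full schema is not documented in this article, and the `Recommendation` class and `fastest_above` helper are hypothetical names.

```python
import json
from dataclasses import dataclass
from typing import Optional

# Sample taken from the example output in this article.
SAMPLE = """
{
  "models": [
    {
      "model_id": "Qwen/Qwen3.6-27B-GGUF",
      "score": 92.8,
      "estimated_tok_per_sec": 27.3,
      "speed_confidence": "high",
      "quant": "Q5_K_M"
    }
  ]
}
"""


@dataclass
class Recommendation:
    model_id: str
    score: float
    tok_per_sec: float
    quant: str


def parse_recommendations(json_text: str) -> list[Recommendation]:
    """Parse the 'models' array, reading only fields shown in the example."""
    data = json.loads(json_text)
    return [
        Recommendation(
            model_id=m["model_id"],
            score=m["score"],
            tok_per_sec=m["estimated_tok_per_sec"],
            quant=m["quant"],
        )
        for m in data["models"]
    ]


def fastest_above(recs: list[Recommendation],
                  min_score: float) -> Optional[Recommendation]:
    """Fastest recommendation whose score is at least min_score, or None."""
    eligible = [r for r in recs if r.score >= min_score]
    return max(eligible, key=lambda r: r.tok_per_sec, default=None)


if __name__ == "__main__":
    recs = parse_recommendations(SAMPLE)
    print(fastest_above(recs, min_score=90))
```

In a real script the JSON text would come from the captured output of `whichllm --json` rather than from the embedded sample.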

Scoring Mechanism Explained

Scoring is WhichLLM's core feature and deserves a closer look.

Data Sources

WhichLLM aggregates benchmark data from:

LiveBench: regularly refreshed evaluations
Artificial Analysis: independent third-party testing platform
Aider: AI coding assistant benchmarks
Chatbot Arena ELO: crowdsourced ratings from LMSYS
Open LLM Leaderboard: HuggingFace's official leaderboard
Multimodal/vision evaluations: used when applicable

Scoring Rules

Bigger isn't better: a 27B model with higher benchmark scores ranks ahead of a 32B model.
Temporal decay: a 2024 model does not outrank a 2026 model purely on the strength of old high scores.
Confidence grading: each score is labeled with an evidence level:
- direct: benchmark data for this exact model and quantization
- variant: data from a different quantization of the same model
- base: base model data (cross-generation inference)
- interpolated: interpolated estimates
- self-reported: figures reported by the model's developer
Anti-fraud mechanism: rejects unreliable or fake uploads, so that small modified models cannot borrow a large model's scores.
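The temporal decay rule can be pictured as an age-based discount on raw benchmark scores. The article does not give WhichLLM's actual formula, so the sketch below is purely an assumption: it uses exponential decay with a hypothetical one-year half-life, and the scores and release dates in the demo are made up.

```python
from datetime import date


def decayed_score(raw_score: float, released: date, today: date,
                  half_life_days: float = 365.0) -> float:
    """Halve a benchmark score's weight for every half_life_days of model age.

    Exponential decay and the default half-life are illustrative assumptions.
    """
    age_days = max((today - released).days, 0)
    return raw_score * 0.5 ** (age_days / half_life_days)


if __name__ == "__main__":
    today = date(2026, 6, 10)
    old = decayed_score(95.0, date(2024, 6, 10), today)  # made-up older model
    new = decayed_score(85.0, date(2026, 3, 10), today)  # made-up newer model
    print(f"older model: {old:.1f}, newer model: {new:.1f}")
```

With this kind of discount, an older model needs a much higher raw score to stay ahead of a recent release, which matches the behavior described in the rule above.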

Real Tests: Best Choices for Different Hardware

According to WhichLLM's latest data (June 2026), recommendations for different hardware configurations are as follows:

Hardware	VRAM	Best Recommendation	Speed
RTX 5090	32 GB	Qwen3.6-27B Q6_K (score 94.7)	~40 t/s
RTX 4090 / 3090	24 GB	Qwen3.6-27B Q5_K_M (score 92.8)	~27 t/s
RTX 4060	8 GB	Qwen3-14B Q3_K_M (score 71.0)	~22 t/s
Apple M3 Max	36 GB	Qwen3.6-27B Q5_K_M (score 89.4)	~9 t/s
Pure CPU	—	gpt-oss-20b (MoE) Q4_K_M (score 45.2)	~6 t/s

💡 Note: MoE (Mixture of Experts) models have far fewer active parameters than total parameters, so they run faster on the same hardware. WhichLLM's scoring mechanism already accounts for this.

Comparison with Similar Tools

Feature	WhichLLM	Ollama	LM Studio	LocalAI
Auto-recommend best model	✅	❌	❌	❌
GPU simulation	✅	❌	❌	❌
Benchmark-based ranking	✅	❌	❌	❌
Upgrade planning	✅	❌	❌	❌
Code snippet generation	✅	❌	❌	❌
One-click chat	✅	✅	✅	✅
JSON output	✅	✅	❌	❌
Lightweight CLI	✅	✅	❌	✅

Ollama remains a great choice for running models, but WhichLLM solves the model selection problem—they're actually complementary: use WhichLLM to find the best model, then run it with Ollama or other tools.

Practical Scenarios

Scenario 1: New GPU Purchase Decision

You want to upgrade from RTX 3060, with enough budget for either RTX 5070 or a used RTX 4090:

# Simulate current hardware
whichllm --gpu "RTX 3060"

# Simulate upgrade options
whichllm upgrade "RTX 3060" "RTX 5070" "RTX 4090"

Compare the output to see what models each card can run and at what speed, enabling rational decision-making.

Scenario 2: Model Selection Advice for Your Project

Your team is developing a local RAG system and needs to choose an inference model:

# Test with CI server configuration
whichllm --gpu "RTX 4090" --profile code --json | jq

Use JSON output to integrate directly into decision documents.

Scenario 3: Edge Device Deployment

# CPU-only mode
whichllm --cpu-only --top 10

Find small models best suited for CPU inference, for IoT or embedded scenarios.

FAQ

Q: Does WhichLLM download models locally?

A: By default, whichllm only queries the HuggingFace API and local cached data—it does not download models. Only the whichllm run command downloads models.

Q: How do I update model data?

A: Force a cache refresh to fetch the latest model data from HuggingFace:

whichllm --refresh

Q: What if I have no network access?

A: WhichLLM has a built-in caching mechanism and uses pre-cached snapshot data when offline.

Q: Does it support AMD graphics cards?

A: Yes, WhichLLM supports NVIDIA, AMD, Apple Silicon, and pure CPU modes.

Summary

WhichLLM solves a core pain point for local LLM developers: finding the right model for your hardware among a sea of options. Its highlights include:

Evidence-based smart ranking: combines real benchmark data with recency instead of simply sorting by parameter count
GPU simulation: check what a card can run before you buy it
One-click integration: recommendation, chat, and code generation in a single tool
Scriptable: JSON output for easy integration into automated workflows

Whether you're a developer just getting started with local LLMs or a team building production environments, WhichLLM can save you hours of trial-and-error time.

Project Info:
- GitHub: github.com/Andyyyy64/whichllm
- License: MIT
- Install: pip install whichllm or brew install andyyyy64/whichllm/whichllm

