TL;DR
whichllm is a Python CLI that auto-detects your GPU/CPU/RAM and ranks the local LLMs that will actually run best on your hardware — judged by merged real benchmarks (LiveBench, Artificial Analysis, Aider, Vision, Chatbot Arena ELO, Open LLM Leaderboard v2), not by “biggest model that fits.” It blew up on Hacker News with a 144-point Show HN and is currently trending on GitHub with 4,592 stars (1,800 in the past week).
The pitch is sharp: every other “what can I run?” tool just checks if the weights fit in VRAM. That hands you the biggest model — which is almost never the smartest model. whichllm does it the other way around: it knows that on an RTX 4090, a Qwen3.6‑27B at Q5_K_M beats a Qwen3‑32B at Q4_K_M, because the newer 27B has a higher real benchmark score, even though both fit. That gap is the whole product.
Key facts:
- 4,592 GitHub stars, 1,800 added this week, currently in the top 10 trending Python repos
- 144 points on Show HN (“One command to find the best local LLM for your hardware”)
- Built by Andyyyy64 (with claude, devangpratap, hibiki333155555, justindotdevv)
- One command, scriptable: no TUI, no setup, JSON-pipe friendly
- Live HuggingFace data: merges current-tier (LiveBench/Artificial Analysis/Aider/Vision) and frozen-tier (OLM Leaderboard v2 / Arena ELO) benchmarks
- Recency-aware: stale leaderboard scores get demoted along each model’s lineage so a 2024 model can’t outrank a current-generation one on an outdated score
- GPU simulation: whichllm --gpu "RTX 5090" lets you plan an upgrade before you spend the money
- MIT licensed, PyPI / Homebrew / uvx
Why “biggest model that fits” is the wrong question
If you’ve spent any time in r/LocalLLaMA, you’ve seen the same loop: someone posts their rig, asks “what should I run?”, and the replies are some mix of “Qwen is good,” “just try llama3,” and personal anecdotes. Useful, but not exactly a system.
The other category of advice, “VRAM calculator” tools, answers a strictly easier question: do this model’s weights, KV cache, and activation budget fit? That’s a real check, but it’s not a recommendation. It will happily tell you that a 32B Q4 fits on your 4090 and stop there, even when a newer 27B at Q5 is meaningfully smarter at the same memory cost.
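To make the "fits, but so what?" point concrete, here is a minimal sketch of the arithmetic a VRAM calculator does. It is not taken from whichllm or any specific calculator: the bits-per-weight figures are approximate averages for llama.cpp quants, the layer and head counts are illustrative placeholders, and the 1 GB overhead is a guess.

```python
from dataclasses import dataclass

# Approximate average bits per weight for common llama.cpp quants (assumed values).
BITS_PER_WEIGHT = {"Q4_K_M": 4.85, "Q5_K_M": 5.69, "Q6_K": 6.56}


@dataclass
class ModelShape:
    params_b: float   # total parameters, in billions
    n_layers: int     # transformer layers
    n_kv_heads: int   # key/value heads
    head_dim: int     # dimension per head


def weights_gb(params_b: float, quant: str) -> float:
    """Weight memory in GB: parameters x bits per weight / 8."""
    return params_b * BITS_PER_WEIGHT[quant] / 8


def kv_cache_gb(shape: ModelShape, n_ctx: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GB: one K and one V vector per layer, per KV head, per token."""
    elems = 2 * shape.n_layers * shape.n_kv_heads * shape.head_dim * n_ctx
    return elems * bytes_per_elem / 1e9


def fits(shape: ModelShape, quant: str, n_ctx: int, vram_gb: float,
         overhead_gb: float = 1.0) -> bool:
    """True when weights + KV cache + a flat overhead guess fit in VRAM."""
    need = weights_gb(shape.params_b, quant) + kv_cache_gb(shape, n_ctx) + overhead_gb
    return need <= vram_gb


# Illustrative shapes only; the layer/head numbers are placeholders.
older_32b = ModelShape(params_b=32.0, n_layers=64, n_kv_heads=8, head_dim=128)
newer_27b = ModelShape(params_b=27.8, n_layers=64, n_kv_heads=8, head_dim=128)

print(fits(older_32b, "Q4_K_M", 4096, vram_gb=24))
print(fits(newer_27b, "Q5_K_M", 4096, vram_gb=24))
```

Under these assumptions both configurations need roughly the same memory and both pass the check, which is exactly where a fit-only tool runs out of things to say.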
whichllm’s author calls this “evidence-based ranking, not a size heuristic.” The merged benchmark map decides the score; runtime fit, speed, evidence confidence, and source trust scale it. Size is a factor (capped at +35 as a log-scaled world-knowledge proxy), not the driver.
The result, for an RTX 4090 / 3090 with 24 GB VRAM at the time of writing:
#1 Qwen/Qwen3.6-27B 27.8B Q5_K_M score 92.8 27 t/s
#2 Qwen/Qwen3-32B 32.0B Q4_K_M score 83.0 31 t/s
#3 Qwen/Qwen3-30B-A3B 30.0B Q5_K_M score 82.7 102 t/s
Both #1 and #2 fit your card. A VRAM-only tool would rank the 32B higher. whichllm ranks the 27B higher because its merged real benchmark score is genuinely better, and prints the published-date and benchmark snapshot under the table so you can sanity-check it. The #3 row is a 30B MoE running at ~102 t/s — speed is scored on active parameters, quality on total. That’s the kind of nuance most “what runs on my GPU” pages get wrong.
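For intuition on why the MoE row is so much faster, here is a back-of-the-envelope sketch of a bandwidth-bound decode estimate. This is not whichllm's speed model: the function name, the 0.5 efficiency factor, and the bits-per-weight value are my assumptions, and the only idea it encodes is that each generated token has to stream the active weights from memory once.

```python
def est_tok_per_sec(active_params_b: float, bits_per_weight: float,
                    bandwidth_gbs: float, efficiency: float = 0.5) -> float:
    """Idealized decode speed: memory bandwidth / bytes of active weights per token.

    `efficiency` is an assumed fudge factor for everything the ideal ignores.
    """
    gb_per_token = active_params_b * bits_per_weight / 8
    return efficiency * bandwidth_gbs / gb_per_token


RTX_4090_BANDWIDTH_GBS = 1008.0  # public spec-sheet figure
Q5_K_M_BPW = 5.69                # approximate, assumed

dense = est_tok_per_sec(27.8, Q5_K_M_BPW, RTX_4090_BANDWIDTH_GBS)  # all params active
moe = est_tok_per_sec(3.0, Q5_K_M_BPW, RTX_4090_BANDWIDTH_GBS)     # ~3B active of 30B

print(f"dense 27.8B: ~{dense:.0f} t/s, MoE with 3B active: ~{moe:.0f} t/s")
```

The ideal overstates the MoE advantage compared with the table above, since real runs also pay for attention, routing, and KV-cache reads. Treat it as a way to see the direction of the effect, not the size.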
Install in one line
The whole thing is a uvx-installable Python package, so you can run it without committing to anything:
# Run once, no install
uvx whichllm@latest
# Install when you use it often
uv tool install whichllm
uv tool upgrade whichllm
# Alternative install paths
brew install andyyyy64/whichllm/whichllm
pip install whichllm
Auto-detection covers NVIDIA (via nvidia-ml-py, fallback nvidia-smi), AMD (rocm-smi, fallback lspci), Intel iGPUs on Linux, Apple Silicon (system_profiler), CPU cores + AVX2/AVX‑512, RAM, and disk free. If detection fails, it returns an empty result instead of crashing — fail-safe by design.
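The fail-safe behaviour is easy to picture with a small sketch of the nvidia-smi fallback path. This is my own illustration, not whichllm's code: the function names and the returned dict shape are hypothetical, though the nvidia-smi query flags are the standard ones.

```python
import shutil
import subprocess


def parse_nvidia_smi(csv_text: str) -> list:
    """Parse 'name, memory.total' CSV rows (MiB, no header, no units)."""
    gpus = []
    for line in csv_text.strip().splitlines():
        parts = [p.strip() for p in line.rsplit(",", 1)]
        if len(parts) != 2 or not parts[1].isdigit():
            continue  # skip rows we cannot interpret rather than raising
        gpus.append({"name": parts[0], "vram_mib": int(parts[1])})
    return gpus


def detect_nvidia_gpus() -> list:
    """Return detected GPUs, or an empty list if anything goes wrong."""
    if shutil.which("nvidia-smi") is None:
        return []
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total",
             "--format=csv,noheader,nounits"],
            capture_output=True, text=True, timeout=10, check=True,
        ).stdout
    except (subprocess.SubprocessError, OSError):
        return []
    return parse_nvidia_smi(out)


print(detect_nvidia_gpus())
```

Keeping the parser separate from the subprocess call means the "empty result, never a crash" contract can be tested on machines with no GPU at all.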
Six commands, one tool
The CLI is intentionally small. Here are the six surfaces:
# 1) Best models for this machine
whichllm
# 2) Pretend you have a specific GPU (great for hardware planning)
whichllm --gpu "RTX 4090"
# 3) Compare upgrade candidates
whichllm upgrade "RTX 4090" "RTX 5090" "H100"
# 4) Find the GPU needed for a given model
whichllm plan "llama 3 70b"
# 5) Start a chat with a model immediately
whichllm run "qwen 2.5 1.5b gguf"
# 6) Print copy-paste Python you can drop into a script
whichllm snippet "qwen 7b"
The standouts here are plan and upgrade. If you’ve ever shopped for a GPU specifically to run a target model, you know “you need an H100” and “a 5090 will do” are very different conversations. plan gives you the floor; upgrade ranks candidate cards on the same scoring engine.
A peek at the scoring
This is the part that turns it from “another HN tool” into something I’d actually run before downloading 27 GB of weights:
| Factor | Effect | Description |
|---|---|---|
| Benchmark quality | core | Merged LiveBench / Artificial Analysis / Aider / Vision / Arena ELO / OLM Leaderboard v2, weighted by source confidence |
| Model size | up to +35 | log2-scaled world-knowledge proxy (MoE uses total params) |
| Quantization | × penalty | Lower-bit quants are discounted multiplicatively |
| Evidence confidence | ×0.55–1.0 | none / self-reported ×0.55, inherited ×0.78, direct full |
| Runtime fit | ×0.50–1.0 | partial-offload ×0.72, CPU-only ×0.50 |
| Speed | −8 to +8 | Usability gate vs a fit-dependent tok/s floor; reported with confidence and a range |
| Source trust | −5 to +5 | Official-org bonus, known-repackager penalty |
| Popularity | tie-breaker | Downloads + likes; weight shrinks as evidence strengthens |
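Here is one way the factors in this table could combine, as a rough sketch. The multipliers and caps come from the table; everything else is my assumption, including the order of operations, the dictionary keys, the slope of the size bonus, and the idea that the additive terms are applied after the multiplicative ones. Do not read it as the project's formula.

```python
import math

# Multipliers from the table; the key names are hypothetical.
EVIDENCE = {"direct": 1.0, "inherited": 0.78, "self_reported": 0.55, "none": 0.55}
RUNTIME_FIT = {"full_gpu": 1.0, "partial_offload": 0.72, "cpu_only": 0.50}


def clamp(x: float, lo: float, hi: float) -> float:
    return max(lo, min(hi, x))


def size_bonus(total_params_b: float, cap: float = 35.0) -> float:
    """log2-scaled size bonus clamped to [0, cap]. The slope of 7.0 is made up."""
    return clamp(7.0 * math.log2(max(total_params_b, 1.0)), 0.0, cap)


def score(benchmark: float, total_params_b: float, quant_factor: float,
          evidence: str, fit: str, speed_adj: float = 0.0,
          trust_adj: float = 0.0) -> float:
    """Assumed composition: (benchmark + size) x multipliers, then additive nudges."""
    base = benchmark + size_bonus(total_params_b)
    scaled = base * quant_factor * EVIDENCE[evidence] * RUNTIME_FIT[fit]
    return scaled + clamp(speed_adj, -8.0, 8.0) + clamp(trust_adj, -5.0, 5.0)


on_gpu = score(60.0, 8.0, 1.0, "direct", "full_gpu")
on_cpu = score(60.0, 8.0, 1.0, "direct", "cpu_only")
print(on_gpu, on_cpu)
```

Whatever the real ordering is, the shape matters more than the constants: multiplicative penalties can halve a strong benchmark score, while speed and trust can only nudge it by a few points.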
Two design choices stood out reading the source. First, evidence has five grades (direct, variant, base_model, line_interp, self_reported), and each is discounted. A model with only a self-reported benchmark gets ×0.55 — uploader claims are not free real estate. Second, inheritance is rejected when a model’s parameter count diverges more than 2× from its family’s dominant member. This catches small draft heads, MTP heads, and abliterated forks that would otherwise borrow the score of their much larger base. It’s the kind of leaderboard-gaming guard a lot of “AI tool roundup” pages have no opinion about.
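The 2× inheritance rule is simple enough to sketch in a few lines. This is an illustration of the rule as described above, not the project's code; the function name and the treatment of non-positive sizes are my assumptions.

```python
def can_inherit_score(params_b: float, dominant_params_b: float,
                      max_ratio: float = 2.0) -> bool:
    """Allow score inheritance only when the two sizes are within max_ratio of each other."""
    if params_b <= 0 or dominant_params_b <= 0:
        return False  # assumed: unknown or invalid sizes never inherit
    ratio = max(params_b, dominant_params_b) / min(params_b, dominant_params_b)
    return ratio <= max_ratio


print(can_inherit_score(27.8, 27.8))  # same size as the family's dominant member
print(can_inherit_score(0.6, 27.8))   # tiny draft head next to a 27.8B base
```

Taking the ratio in both directions means an oversized fork is rejected just like an undersized draft head.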
Score markers are surfaced inline:
- ~ (yellow): no direct benchmark; score is inherited/interpolated from the family
- !sr (bright yellow): uploader-reported only, not independently verified
- ? (red): no benchmark data available
So if the top pick is yellow-~, you know to take it with a grain of salt before you commit to a download.
Run a model in literally one command
The run subcommand is what makes this more than a recommender. It creates an isolated uv environment, installs the right runtime (llama-cpp-python for GGUF, transformers + autoawq/auto-gptq for AWQ/GPTQ, plain transformers for FP16/BF16), downloads the right GGUF variant for your VRAM, and drops you straight into a chat:
# Auto-pick the best model for your hardware and chat
whichllm run
# Specify a model and format
whichllm run "qwen 2.5 1.5b gguf"
# CPU-only
whichllm run "phi 3 mini gguf" --cpu-only
For people who just want a known-good model running in the next 90 seconds, this is the single most useful thing in the tool.
If you’d rather wire it into your own code, whichllm snippet prints ready-to-paste Python:
from llama_cpp import Llama
llm = Llama.from_pretrained(
    repo_id="Qwen/Qwen2.5-7B-Instruct-GGUF",
    filename="qwen2.5-7b-instruct-q4_k_m.gguf",
    n_ctx=4096,
    n_gpu_layers=-1,
    verbose=False,
)
output = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello!"}],
)
print(output["choices"][0]["message"]["content"])
Pipe-friendly: the Ollama trick
The bit that won me over for scripting: whichllm --json returns a structured object you can pipe straight into jq, including estimated_tok_per_sec, speed_confidence, and a speed_range_tok_per_sec planning range. That means you can build an alias that always returns “the best model ID for this machine, right now”:
# Add to .bashrc / .zshrc
alias bestllm='whichllm --top 1 --json | jq -r ".models[0].model_id"'
# Then
ollama run $(bestllm)
Caveat: Ollama model names don’t always match HuggingFace repo IDs, so you’ll usually want a small mapping step in the middle. Profiles available: general, coding, vision, math. For example, whichllm --profile coding --top 1 --json | jq -r '.models[0].model_id' returns the best coding model in one line.
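If you would rather do the mapping step in Python than in shell, a sketch might look like this. The JSON shape (.models[0].model_id) and the flags are the ones shown above; the mapping table is a hypothetical, hand-maintained example, and the hf.co/ fallback relies on Ollama's ability to pull GGUF repos straight from HuggingFace, so it will not help for non-GGUF repos.

```python
import json
import subprocess
from typing import Optional

# Hypothetical hand-maintained mapping from HuggingFace repo IDs to Ollama tags.
OLLAMA_NAMES = {"Qwen/Qwen2.5-7B-Instruct-GGUF": "qwen2.5:7b"}


def top_model_id(report_json: str) -> Optional[str]:
    """Extract .models[0].model_id, or None when the list is empty or missing."""
    models = json.loads(report_json).get("models") or []
    return models[0].get("model_id") if models else None


def to_ollama_name(model_id: str) -> str:
    """Prefer a known tag; otherwise fall back to the hf.co/<repo> form."""
    return OLLAMA_NAMES.get(model_id, f"hf.co/{model_id}")


def best_ollama_model(profile: str = "general") -> Optional[str]:
    """Run whichllm and translate its top pick into something `ollama run` accepts."""
    out = subprocess.run(
        ["whichllm", "--profile", profile, "--top", "1", "--json"],
        capture_output=True, text=True, check=True,
    ).stdout
    model_id = top_model_id(out)
    return to_ollama_name(model_id) if model_id else None


# Made-up sample payload, only to show the parsing path without running the CLI.
sample = '{"models": [{"model_id": "Qwen/Qwen2.5-7B-Instruct-GGUF"}]}'
print(to_ollama_name(top_model_id(sample)))
```

Verify any tag you add to the mapping against the Ollama library before relying on it; the one here is only an example.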
Sample top picks across hardware
This is from the README and reflects a 2026-05 snapshot — your results will track the live HuggingFace data, but it’s a useful directional read:
| Hardware | VRAM | Top pick | Speed |
|---|---|---|---|
| RTX 5090 | 32 GB | Qwen3.6‑27B · Q6_K · score 94.7 | ~40 t/s |
| RTX 4090 / 3090 | 24 GB | Qwen3.6‑27B · Q5_K_M · score 92.8 | ~27 t/s |
| RTX 4060 | 8 GB | Qwen3‑14B · Q3_K_M · score 71.0 | ~22 t/s |
| Apple M3 Max | 36 GB | Qwen3.6‑27B · Q5_K_M · score 89.4 | ~9 t/s |
| CPU only | — | gpt-oss-20b (MoE) · Q4_K_M · score 45.2 | ~6 t/s |
Two readings. First, Qwen3.6‑27B takes the top spot on every row with enough memory for it: a newer, smaller model at a higher-precision quant keeps beating bigger-but-older alternatives. Second, on CPU a well-chosen MoE (gpt-oss-20b) wins because the bandwidth-bound speed model counts only the active parameters.
What the community is saying
The Show HN thread (144 points) and r/LocalLLaMA discussion give you a fair read of where it works and where the rough edges still are.
The good: the most upvoted comments call out the recency-awareness and the runtime fit modeling specifically. The “biggest that fits” failure mode is the single most common complaint about VRAM calculators, and people noticed whichllm actually addresses it. The --gpu "RTX 5090" simulation got flagged repeatedly as the killer feature for pre-purchase planning: finally a way to answer “is upgrading worth it for this model?” without buying first.
The reality check: one r/LocalLLaMA thread asks “How accurate can whichllm be?” after a WSL user got reasonable picks but incorrect RAM and disk numbers. That’s an honest limitation of running detection inside WSL, which reports the Linux guest’s memory and disk counters rather than the Windows host’s. Recent releases widened the Windows detector (WMI + registry fields), but on WSL specifically, compare the output of whichllm hardware against your actual host specs before trusting speed estimates.
The other recurring note: surprise at gpt-oss-20b appearing high on CPU-only lists. That’s working as intended — CPU-only triggers a ×0.50 runtime-fit penalty, but MoE active-param speed modeling rescues a 20B model with ~3B active back into the “actually usable” zone.
Honest limitations
A few things to know before you trust the top row blindly:
- Benchmarks are not your workload. LiveBench, Aider, and Arena ELO are great signals, but they’re not your domain. If you have a private eval set, run the top three picks against it before committing.
- Speed estimates are planning ranges. Backend (llama.cpp build, CUDA version), context length, batching, and prompt shape all move real numbers. The ~ and ? markers tell you when to be skeptical.
- Detection on WSL/Windows is the edge case. Expect minor inaccuracies on RAM/disk if you’re running through WSL.
- MoE rankings assume active-param routing. If your inference stack doesn’t fully exploit MoE routing, the speed estimate will be optimistic.
- Benchmark snapshots can drift. Live sources fall back to curated frozen snapshots. The snapshot date is printed beneath the ranking — but only if you look at it.
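On the first point, checking the top picks against a private eval set does not need much machinery. Below is a minimal sketch of an exact-match harness; it is entirely my own illustration, the generate callables are stand-ins for whatever wraps your model, and exact match is only a sensible metric for short, unambiguous answers.

```python
from typing import Callable, Iterable, Tuple


def exact_match_accuracy(generate: Callable[[str], str],
                         cases: Iterable[Tuple[str, str]]) -> float:
    """Fraction of prompts whose output equals the expected answer (case-insensitive)."""
    cases = list(cases)
    if not cases:
        return 0.0
    hits = sum(
        generate(prompt).strip().lower() == expected.strip().lower()
        for prompt, expected in cases
    )
    return hits / len(cases)


# Stand-in "models"; in practice each would call a loaded model.
def candidate_a(prompt: str) -> str:
    return {"2+2?": "4", "Capital of France?": "Paris"}.get(prompt, "")


def candidate_b(prompt: str) -> str:
    return {"2+2?": "4"}.get(prompt, "")


eval_set = [("2+2?", "4"), ("Capital of France?", "paris")]
for name, fn in [("candidate_a", candidate_a), ("candidate_b", candidate_b)]:
    print(name, exact_match_accuracy(fn, eval_set))
```

Swap in a scoring function that fits your domain; the point is only that the ranking gives you a shortlist to test, not a verdict.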
How it compares
| Tool | What it answers | Recency-aware? | Runs the model for you? |
|---|---|---|---|
| whichllm | Best model for my hardware by benchmark | Yes (lineage-demoted frozen scores) | Yes (whichllm run) |
| VRAM calculators | Will this model fit | No (size-only) | No |
| HuggingFace leaderboards | Best model overall | Partially | No |
| Ollama library list | Popular models by downloads | No | Yes (ollama run) |
| LM Studio model browser | GUI-curated picks | Editor-curated | Yes (GUI) |
The unique seat is: evidence-based ranking + runtime fit + one-command execute. Each other tool gets one or two of those; nothing else seems to do all three.
FAQ
Is whichllm free?
Yes, under the MIT license. The tool itself costs nothing; what you spend is your own bandwidth and disk on model weights (downloaded from HuggingFace) and local compute when you use whichllm run. Sponsorships exist but the project will stay open-source either way.
Does it support AMD GPUs and Apple Silicon?
Yes. AMD is detected via rocm-smi with lspci fallback on Linux. Apple Silicon is detected via system_profiler on macOS. Both restrict the candidate set to GGUF for runtime stability. Linux + NVIDIA gets the widest support, including AWQ / GPTQ / FP16 / BF16.
How is this different from a VRAM calculator?
A VRAM calculator answers can this model fit. whichllm answers which model that fits will actually perform best, using merged real benchmarks plus runtime-fit and speed scaling. The README example — Qwen3.6‑27B ranked above Qwen3‑32B on the same RTX 4090 — is the canonical demo of the difference.
Can I use it offline?
Partially. Both the model and benchmark caches live under ~/.cache/whichllm/ (6h TTL for models, 24h for benchmarks). Curated frozen fallbacks ship with the package for offline / rate-limited use, so a cold offline run still gives you something — just don’t trust the recency markers on a stale cache.
How does GPU simulation work?
whichllm --gpu "RTX 5090" builds a synthetic GPUInfo from a curated table of memory bandwidth, VRAM size, and compute capability. The rest of the pipeline (VRAM fit, speed estimate, runtime fit factor) then runs as if that card were present. It’s a planning tool, not a guarantee: real-world performance also depends on cooling, PCIe generation, and the rest of your system.
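A sketch of what "synthetic GPUInfo from a curated table" could look like follows. The class fields, the table layout, and the lookup function are hypothetical and may not match the real GPUInfo; the numbers are public spec-sheet figures, with the H100 row taken from the 80 GB SXM variant.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class GPUInfo:  # hypothetical shape; the real class may carry different fields
    name: str
    vram_gb: int
    bandwidth_gbs: float
    compute_capability: str
    simulated: bool = True


# (VRAM in GB, memory bandwidth in GB/s, CUDA compute capability)
GPU_TABLE = {
    "rtx 4090": (24, 1008.0, "8.9"),
    "rtx 5090": (32, 1792.0, "12.0"),
    "h100": (80, 3350.0, "9.0"),
}


def simulate_gpu(name: str) -> Optional[GPUInfo]:
    """Look up a card by name (case and spacing insensitive); None if unknown."""
    key = " ".join(name.lower().split())
    spec = GPU_TABLE.get(key)
    if spec is None:
        return None
    return GPUInfo(name, *spec)


print(simulate_gpu("RTX 5090"))
```

Because the simulated object has the same shape as a detected one, everything downstream can stay unaware of whether the card is real.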
Will it pick a model I already have downloaded?
Today, ranking is over HuggingFace candidates, not your local cache. If you want “the best of what I already have,” pipe whichllm --json through jq and filter against your local model directory. A first-class “local-only” mode is a natural feature request — open an issue if you want it.
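As a stopgap, the filtering can be done in a few lines of Python instead of jq. This sketch assumes your weights live in the default HuggingFace hub cache, where each repo sits in a folder named models--org--name, and that the JSON has the .models[].model_id shape used earlier; the function names are mine.

```python
import json
from pathlib import Path


def cached_repo_ids(hub_dir: Path) -> set:
    """Repo IDs present in a HuggingFace hub cache (folders named models--org--name)."""
    if not hub_dir.is_dir():
        return set()
    ids = set()
    for entry in hub_dir.iterdir():
        if entry.is_dir() and entry.name.startswith("models--"):
            # Only the first "--" separates org from name.
            ids.add(entry.name[len("models--"):].replace("--", "/", 1))
    return ids


def local_only(report_json: str, hub_dir: Path) -> list:
    """Keep ranked models whose model_id is already in the local cache, in rank order."""
    have = cached_repo_ids(hub_dir)
    models = json.loads(report_json).get("models") or []
    return [m for m in models if m.get("model_id") in have]


default_hub = Path.home() / ".cache" / "huggingface" / "hub"
print(sorted(cached_repo_ids(default_hub)))
```

Feed local_only the output of whichllm --json with a generous --top value, and the first element is the best-ranked model you already have on disk.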
Bottom line
Local-LLM model selection has been a vibes-and-anecdotes problem for two years. whichllm is the first tool I’ve used that turns it into a defensible, auditable answer: “this is the top model for your machine, here’s its merged benchmark score, here’s the evidence grade, here’s the speed range, and here’s one command to actually run it.”
It won’t replace running your own eval on your own workload. Nothing should. But for the 90% of “which Qwen / Llama / Mistral should I download tonight?” decisions, uvx whichllm@latest is now the right first move.
Install: uv tool install whichllm. Star and report your top pick at github.com/Andyyyy64/whichllm.