LMCache Review: 3-10x Faster vLLM via KV Cache Reuse (2026)

TL;DR

LMCache is an open-source Key-Value caching layer that sits between vLLM (or SGLang) and your storage hierarchy, turning the KV cache from an in-GPU scratchpad into a persistent, reusable, vendor-neutral asset. The numbers are big: 3.7–6.8x lower TTFT (time-to-first-token), up to 15x throughput improvement in chatbot and RAG workloads, and a 10x boost on Mixture-of-Experts inference after the April 2026 multiprocess rearchitecture.

The project grew out of research at the University of Chicago, joined the PyTorch Foundation in October 2025, and is now integrated into NVIDIA Dynamo, IBM’s open-source LLM serving stack, CoreWeave’s production inference for Cohere, and the official vLLM production-stack. As of June 2026 it has 9.2K+ GitHub stars (709 of them added in the past week) and a growing list of enterprise adopters who needed somewhere to put 1–2 GB of KV cache that wasn’t expensive H100 HBM.

Key facts:

Apache 2.0, open source at LMCache/LMCache
Vendor-neutral by design — works with vLLM, SGLang, NVIDIA Dynamo, multiple hardware vendors (NVIDIA, AMD MI300X, Arm, Huawei Ascend), and storage backends (CPU RAM, local SSD, Redis/Valkey, Mooncake, InfiniStore, S3-compatible, NIXL, GDS)
Two deployment modes — Multiprocess (standalone daemon, recommended) and In-process (embedded in vLLM)
Non-prefix KV reuse via CacheBlend — reuse cached blocks at any position in the prompt, not just shared prefixes
PD disaggregation support — transfer KV cache from prefill workers to decode workers over NVLink, RDMA, or TCP
Production observability — Kubernetes-native metrics, request-level and token-level cache hit ratios
Engine-independent — cache survives even if the inference engine crashes
Day-1 support for gpt-oss 20B/120B, Qwen3 series, Llama 3, and most modern model families

What LMCache Actually Solves

The KV cache is the single biggest performance lever in LLM serving, and almost nobody outside of model-serving teams understands it. Quick refresher: when an LLM processes a prompt, every token’s attention computation produces Key and Value tensors. On a Llama-3-70B-style model with FP16 KV, that comes to roughly 1–2 GB of KV cache per request at a few thousand tokens, and it grows linearly with context length, to around 10 GB at 32K tokens. The cache lets the model decode subsequent tokens without recomputing the prompt. Lose it, recompute. Long prompt? Long recompute.
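The per-request size follows directly from the model shape. Here is a back-of-envelope sketch, assuming a Llama-3-70B-like layout (80 layers, 8 KV heads, head dimension 128) and 2-byte FP16 elements. Those numbers are assumptions for illustration, not values from the LMCache docs. Under them, the 1–2 GB range corresponds to a few thousand tokens, and the footprint scales linearly from there.

```python
def kv_cache_bytes(num_tokens, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache for one sequence: K and V, per layer, per KV head, per token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return num_tokens * per_token


# Assumed shape, similar to Llama-3-70B with grouped-query attention.
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

for tokens in (4_096, 32_768):
    gib = kv_cache_bytes(tokens, LAYERS, KV_HEADS, HEAD_DIM) / 2**30
    print(f"{tokens:>6} tokens -> {gib:.2f} GiB")
# prints:
#   4096 tokens -> 1.25 GiB
#  32768 tokens -> 10.00 GiB
```

Swap in your own model's layer count, KV head count, and KV dtype (1 byte for FP8 KV) to size the tiers before you pick an L1 budget.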

The default behavior of every modern inference engine is brutal: KV cache lives in GPU HBM, gets evicted when memory pressure hits, and dies when the engine restarts. If a user comes back 5 minutes later with a follow-up question on the same 50-page document, the engine redoes the entire prefill — that’s your TTFT spike.

LMCache fixes this with three big ideas:

1. Tiered offloading. Move KV blocks from GPU HBM → CPU RAM → local SSD → remote storage (Redis, S3, Mooncake, etc.). When you need them back, retrieve only the relevant blocks via a high-throughput connector.

2. Cross-engine sharing. Multiple vLLM instances can share one KV cache pool. The LMCache server runs as a standalone daemon, so engines come and go without losing cached state.

3. Non-prefix reuse via CacheBlend. Prefix caching (vLLM’s built-in feature) only helps when the exact prefix matches. CacheBlend reuses KV blocks at any position by selectively recomputing a small number of tokens to recover quality. This is the unlock for RAG, where you stitch retrieved chunks into a fresh prompt every time.

The first two are operational wins. The third is what makes LMCache fundamentally more powerful than vLLM’s stock prefix cache.
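To see why exact-prefix matching is so fragile, here is a toy model of prefix-chained chunk keys. It is a sketch of the general technique, not LMCache's or vLLM's actual key format: the `chunk_keys` and `prefix_hits` helpers, the tiny chunk size, and the integer "tokens" are all made up for illustration.

```python
import hashlib

CHUNK = 4  # tokens per chunk; tiny for readability (the server example below uses 256)


def chunk_keys(tokens, chunk=CHUNK):
    """One key per full chunk; each key hashes the chunk plus everything before it."""
    keys, prev = [], b""
    for i in range(0, len(tokens) - len(tokens) % chunk, chunk):
        digest = hashlib.sha256(prev + repr(tokens[i:i + chunk]).encode()).digest()
        keys.append(digest)
        prev = digest  # chain: a change anywhere earlier changes every later key
    return keys


def prefix_hits(cached_keys, tokens):
    """Count leading chunks of `tokens` whose keys are already cached."""
    hits = 0
    for key in chunk_keys(tokens):
        if key not in cached_keys:
            break
        hits += 1
    return hits


doc = list(range(100, 112))                  # 12 "document" tokens = 3 chunks
cache = set(chunk_keys(doc + [1, 2, 3, 4]))  # first request: document + question A

same_prefix = doc + [5, 6, 7, 8]  # same document first, new question
shifted = [9] + doc + [5, 6, 7]   # one extra token in front of the same document

print(prefix_hits(cache, same_prefix))  # 3
print(prefix_hits(cache, shifted))      # 0
```

A single token ahead of the document invalidates every chunk key, even though the document itself is unchanged. That is the gap non-prefix reuse is meant to close.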

Three concurrent waves drove LMCache’s adoption:

Agentic workloads broke prefix caching. Multi-turn agent loops generate prompts where the shared structure isn’t at the front anymore: tool outputs, intermediate reasoning, and retrieved context all interleave. Stock prefix cache hit rates collapsed below 20% on agentic traffic. LMCache’s April 2026 multiprocess rearchitecture and the May 2026 AMD MI300X work directly targeted this, with the MI300X benchmark showing a 10x throughput improvement on multi-turn agentic workloads.
NVIDIA Dynamo shipped with LMCache integration in September 2025, putting it in front of every team running NVIDIA’s reference inference stack.
Tensormesh launched in October 2025 as the commercial steward, providing enterprise support while keeping the project Apache 2.0. That removed the “who maintains this in production?” objection for risk-averse buyers.

The HN thread from July 2025 (Lossless LLM 3x Throughput Increase by LMCache) and the r/LocalLLaMA post (We built this project to increase LLM throughput by 3x) were both turning points — IBM’s adoption announcement in the comments gave it credibility, and the project has compounded since.

Install & First Run (60 Seconds)

The cleanest path is vLLM + LMCache via uv:

uv venv --python 3.12
source .venv/bin/activate
uv pip install lmcache vllm

Two deployment modes are available. Multiprocess (MP) mode is now the recommended default because the cache survives engine crashes and one server can feed multiple vLLM instances.

MP mode — start the LMCache server in one terminal:

lmcache server \
  --l1-size-gb 20 \
  --eviction-policy LRU \
  --chunk-size 256

The ZMQ port (default 5555) accepts engine connections; the HTTP frontend (default 8080) exposes Prometheus-compatible metrics and a management API.

Start vLLM with the MP connector in a second terminal:

vllm serve Qwen/Qwen3-8B \
  --port 8000 \
  --kv-transfer-config \
  '{"kv_connector":"LMCacheMPConnector",
    "kv_connector_module_path":"lmcache.integration.vllm.lmcache_mp_connector",
    "kv_role":"kv_both"}'

The kv_connector_module_path override is important: it pins the connector to the LMCache-shipped implementation rather than vLLM’s vendored copy, so you get the latest server protocol and fixes.
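Multi-line JSON inside shell single quotes is easy to break when you copy it between scripts. One option is to build the argument programmatically. The sketch below uses only the values from the command above; the wrapper script itself is an illustration, not something the project ships.

```python
import json
import shlex

# Same keys and values as the vllm serve command above.
kv_transfer_config = {
    "kv_connector": "LMCacheMPConnector",
    "kv_connector_module_path": "lmcache.integration.vllm.lmcache_mp_connector",
    "kv_role": "kv_both",
}

cmd = [
    "vllm", "serve", "Qwen/Qwen3-8B",
    "--port", "8000",
    "--kv-transfer-config", json.dumps(kv_transfer_config, separators=(",", ":")),
]

# Print a copy-pasteable command line with the JSON argument quoted correctly.
print(shlex.join(cmd))
```

The same `cmd` list can be passed to `subprocess.run` as-is, which sidesteps shell quoting entirely.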

In-process mode is one command if you just want a single-node setup:

vllm serve Qwen/Qwen3-8B \
  --kv-offloading-backend lmcache \
  --kv-offloading-size 20 \
  --disable-hybrid-kv-cache-manager

That last flag is mandatory in in-process mode — forgetting it is the single most common setup mistake reported in the GitHub issues.

Production Test — Long-Doc Q&A

The canonical benchmark workload looks like this. Two requests share a long document prefix; the second one should hit cache.

# First request — cache is cold, full prefill happens
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "<50-page document>... Summarize section 3.",
    "max_tokens": 200
  }'

# Second request — same document, different question
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen/Qwen3-8B",
    "prompt": "<50-page document>... What are the limitations?",
    "max_tokens": 200
  }'

You’ll see logs like this from the LMCache server on the cold pass:

[2026-04-22 19:49:56,316] LMCache INFO: Stored 256 tokens in 0.023 seconds
[2026-04-22 19:49:56,555] LMCache INFO: Stored 256 tokens in 0.005 seconds
...

And retrieval on the second request:

[2026-04-22 19:50:04,686] LMCache INFO: Retrieved 256 tokens in 0.003 seconds
[2026-04-22 19:50:04,968] LMCache INFO: Stored 256 tokens in 0.005 seconds

In the ceph.io benchmark with Qwen3-32B on a long-doc workload, TTFT dropped from ~6.5 seconds to ~0.4 seconds on the cached pass, roughly 16x faster on the cache-hit path. The LMCache paper (arXiv 2510.09665) reports a 3.7–6.8x TTFT reduction, plus a 19x inter-token-latency reduction on TriviaQA-style long-context workloads.
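To get the same cold-versus-warm comparison on your own server, you can time the first streamed chunk from the OpenAI-compatible endpoint. This is a minimal sketch using only the standard library. It assumes the server from the steps above is listening on localhost:8000 and streams server-sent `data:` lines when `"stream": true` is set. The helper names are hypothetical, and the first `data:` line is only an approximation of true TTFT.

```python
import json
import time
import urllib.request


def build_payload(model, prompt, max_tokens=200):
    """JSON body for /v1/completions with streaming enabled."""
    body = {"model": model, "prompt": prompt, "max_tokens": max_tokens, "stream": True}
    return json.dumps(body).encode()


def measure_ttft(url, payload, timeout=300):
    """Seconds from sending the request to the first streamed `data:` line."""
    req = urllib.request.Request(
        url, data=payload, headers={"Content-Type": "application/json"}
    )
    start = time.perf_counter()
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        for raw_line in resp:  # the response object iterates line by line
            if raw_line.startswith(b"data:"):
                return time.perf_counter() - start
    raise RuntimeError("stream ended before any token arrived")


def speedup(cold_s, warm_s):
    """How many times faster the warm pass was."""
    return cold_s / warm_s


def run_long_doc_test(url, model, document):
    """Cold pass, then a different question on the same document; returns (cold, warm)."""
    cold = measure_ttft(url, build_payload(model, document + "\nSummarize section 3."))
    warm = measure_ttft(url, build_payload(model, document + "\nWhat are the limitations?"))
    return cold, warm
```

Call `run_long_doc_test("http://localhost:8000/v1/completions", "Qwen/Qwen3-8B", document)` with your own long document, then pass the two timings to `speedup`. Run it a few times: a single pair of requests is noisy.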

Benchmarks That Matter

A few numbers from peer-reviewed and vendor benchmarks:

Workload	Engine	Before	After LMCache	Speedup
Multi-round chat (LMCache paper)	vLLM	baseline	3.7–6.8x lower TTFT	3.7–6.8x
TriviaQA long-context	vLLM	baseline	19x lower ITL	19x
RAG document QA (ceph.io)	vLLM + Ceph	6.5s TTFT	0.4s TTFT	~16x
Multi-turn agentic (AMD MI300X)	vLLM	baseline	10x throughput	10x
MoE inference (Qwen3-235B)	vLLM 0.18.1	baseline	10x throughput	10x
Chatbot + RAG (PyTorch blog)	vLLM	baseline	up to 15x throughput	15x

Two important caveats. First, these are cache-hit benchmarks. If your workload has near-zero prompt overlap (e.g., one-shot classification of unique inputs), LMCache won’t help — and can add small overhead. The open GitHub issue #1812 shows exactly this: a benchmark designed without prefix overlap measured higher latency with LMCache enabled, which is the expected behavior. Match the tool to the workload.

Second, on the Level Up Coding benchmark, a small-model single-node Colab test, vLLM’s built-in prefix cache was actually faster than LMCache for the common case where everything fits in HBM. LMCache’s win condition is when your working set exceeds GPU memory — exactly the case in production multi-tenant serving and long-context RAG.
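The "working set exceeds GPU memory" condition is easy to demonstrate with a toy simulation. The sketch below models a two-tier LRU cache where evictions from tier 1 are demoted to tier 2 instead of being dropped. It is an illustration of the general idea under simplifying assumptions (one slot holds one conversation's whole KV cache, and a tier-2 hit counts the same as a tier-1 hit), not LMCache's implementation.

```python
from collections import OrderedDict


class TieredLRU:
    """Toy two-tier cache: LRU victims from tier 1 are demoted to tier 2 (if any)."""

    def __init__(self, l1_slots, l2_slots=0):
        self.l1, self.l2 = OrderedDict(), OrderedDict()
        self.l1_slots, self.l2_slots = l1_slots, l2_slots

    def access(self, key):
        """Return 'l1', 'l2', or 'miss', then make `key` most recent in tier 1."""
        if key in self.l1:
            self.l1.move_to_end(key)
            return "l1"
        where = "l2" if self.l2.pop(key, None) else "miss"
        self.l1[key] = True
        if len(self.l1) > self.l1_slots:
            victim, _ = self.l1.popitem(last=False)  # least recently used entry
            if self.l2_slots:
                self.l2[victim] = True
                if len(self.l2) > self.l2_slots:
                    self.l2.popitem(last=False)
        return where


def hit_rate(cache, trace):
    """Fraction of accesses served from either tier."""
    results = [cache.access(key) for key in trace]
    return sum(r != "miss" for r in results) / len(results)


# 6 conversations revisited round-robin, but tier 1 only holds 4 of them.
trace = list(range(6)) * 10

print(hit_rate(TieredLRU(l1_slots=4), trace))              # 0.0
print(hit_rate(TieredLRU(l1_slots=4, l2_slots=8), trace))  # 0.9
```

With a single tier, round-robin access over a working set slightly larger than capacity thrashes completely. With a second tier, only the six cold first touches miss. In a real deployment a tier-2 hit is slower than an HBM hit, but it is still far cheaper than redoing the prefill.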

Community Reactions

From the r/LocalLLaMA launch thread:

“We built this project to increase LLM throughput by 3x. Now it has been adopted by IBM in their LLM serving stack.”

From the HN thread:

“In LLM serving, the input is computed into intermediate states called KV cache to further provide answers. These data are relatively large (~1-2GB for long context) and are often evicted when GPU memory is not enough.”

From r/mlops on production prefix-cache hit rates:

“Hey everyone, so I spent the last few weeks going down the KV cache rabbit hole. One thing which is most of what makes LLM inference expensive is the [recompute on prefix miss].”

The general sentiment in production inference circles: LMCache stops being optional once you hit two conditions: long contexts (>8K tokens routinely) and multi-tenant serving where cache eviction is constant. Below those thresholds, vLLM’s built-in prefix cache is enough and adding LMCache is over-engineering.

Honest Limitations

A frank list of things to weigh before adopting:

Not a magic bullet for stateless workloads. Classification, embeddings, one-shot translation — workloads with no prompt overlap — get no benefit and a small overhead penalty. Measure before deploying.
Operational complexity. MP mode means an additional daemon process, ZMQ ports, l1/l2 tiering, eviction policy tuning, and Prometheus scraping. In-process mode is simpler but loses the cross-engine sharing benefit.
Upstream moves fast. The connector interface changed between vLLM 0.18 and 0.20 (the kv_connector_module_path override exists precisely because of this). Pin versions in production and read release notes.
CacheBlend has a quality recovery step. Non-prefix reuse selectively recomputes tokens to repair quality, but this isn’t free — there’s a CPU/GPU cost. For high-quality-bar applications (legal, medical), validate accuracy on your golden set.
Storage backend complexity. S3-compatible and InfiniStore add the usual distributed-systems failure modes (network partitions, consistency edge cases). Start with CPU+SSD tiering and only move to remote storage when you genuinely need cross-node sharing.
No managed offering from the project itself. Tensormesh sells enterprise support, but if you want fully-managed serving, you’re combining LMCache with vLLM-on-Kubernetes yourself (or using the official production-stack Helm chart).

When to Choose LMCache vs Alternatives

vs vLLM’s built-in prefix cache — vLLM’s prefix cache is great until your working set exceeds HBM or your prompts share content that isn’t at the start. LMCache wins on both.
vs SGLang’s RadixAttention — Comparable in design philosophy. LMCache is more vendor-neutral and has stronger ecosystem integrations (Dynamo, IBM, PyTorch Foundation membership). SGLang ships RadixAttention as a built-in.
vs custom KV cache solutions — If you already built one (and many production inference teams have), LMCache’s value is the connector ecosystem and the published research backing CacheBlend.
vs not caching at all — If you’re running short-context, low-overlap workloads, you don’t need a KV cache layer. Don’t add LMCache as a default — measure first.

FAQ

Q: What’s the actual TTFT reduction I should expect for RAG? A: 3–16x depending on document overlap and context length. The high end (16x) requires real overlap across queries on the same documents; expect 3–5x on a typical mixed RAG workload with moderate reuse. The published LMCache paper reports 3.7–6.8x as the consistent range.
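A rough way to reason about where your workload lands in that range is to weight the cold and warm TTFT by hit ratio. The sketch below is a simplified model, assuming every request either fully hits or fully misses the cache. It plugs in the ~6.5 s and ~0.4 s figures from the ceph.io benchmark; the function name is hypothetical.

```python
def expected_speedup(hit_ratio, cold_ttft_s, warm_ttft_s):
    """Cold TTFT divided by the hit-ratio-weighted average TTFT."""
    blended = hit_ratio * warm_ttft_s + (1 - hit_ratio) * cold_ttft_s
    return cold_ttft_s / blended


for h in (0.0, 0.5, 0.75, 0.85, 1.0):
    print(f"hit ratio {h:.2f} -> {expected_speedup(h, 6.5, 0.4):.1f}x")
```

Under these assumptions a 75% hit ratio gives about 3.4x and 85% gives about 4.9x. The ~16x figure only appears when essentially every request hits. That matches the 3–5x expectation for mixed workloads with moderate reuse.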

Q: Does LMCache work with closed-source APIs like OpenAI or Claude? A: No. LMCache operates on KV tensors inside the inference engine, so it only works with engines you self-host (vLLM, SGLang, NVIDIA Dynamo). Closed APIs expose only the chat interface and handle caching internally.

Q: Can I run LMCache on Apple Silicon / CPU only? A: Not as a primary serving stack — LMCache is a layer for GPU-based inference engines. You can run the cache server on CPU and offload to it, but the actual model serving needs vLLM/SGLang which require CUDA, ROCm, or compatible accelerators. AMD MI300X, Arm, and Huawei Ascend are explicitly supported.

Q: Does it work with quantized models (FP8, INT4)? A: Yes. LMCache caches the KV tensors regardless of model weight quantization — the KV cache itself is independent of weight dtype. Day-1 support is shipped for gpt-oss 20B/120B FP8 and most quantized Qwen3 / Llama variants.

Q: How does the cost math work in production? A: The case study most cited is the CoreWeave + Cohere deployment, where LMCache let them serve the same throughput with less GPU memory pressure, deferring an HBM-bound capacity buy. The savings are highly workload-dependent — multi-turn chat and RAG get the biggest wins; agentic loops (the new hot workload) benefit if you use the MP mode with multi-engine sharing.

Q: Is the project actually production-ready, or still research? A: Production-ready for the supported configurations. It’s deployed at IBM, CoreWeave (for Cohere), inside NVIDIA Dynamo, and powers the official vLLM production-stack. PyTorch Foundation membership and the Tensormesh commercial support layer add the institutional backing that risk-averse buyers care about.

Bottom Line

If you self-host LLM inference at any meaningful scale, LMCache is now table stakes for long-context and multi-turn workloads. The combination of 3–10x TTFT/throughput wins, vendor-neutral storage backends, NVIDIA Dynamo integration, and PyTorch Foundation stewardship makes it the default open-source KV cache layer in 2026.

For one-shot, stateless, or short-context workloads, don’t bother — vLLM’s built-in prefix cache is fine. For everything else: install it, run the long-doc benchmark on your traffic, and you’ll see the hit rate justify the daemon.

Project: github.com/LMCache/LMCache — Apache 2.0, 9.2K+ stars, PyTorch Ecosystem.