TL;DR

RAG-Anything is the University of Hong Kong Data Science Lab’s open-source answer to a problem every serious RAG builder has hit: real documents aren’t just text. They have images, tables, equations, charts, and scanned layouts — and the usual “PDF-to-text + chunk + embed” pipeline quietly drops 30–60% of that signal before the LLM ever sees it. RAG-Anything is an end-to-end framework that keeps all of it. It’s trending hard on GitHub this week, gaining ~2,000 stars in seven days (17,800+ total), and its technical report landed on arXiv in October 2025.

Key facts:

  • All-in-one multimodal pipeline — parse → categorize → analyze (text/image/table/equation) → knowledge graph → hybrid retrieval, all in one pip install
  • Built on LightRAG — inherits the graph-augmented retrieval that made LightRAG a 2024 breakout
  • MinerU-powered parsing by default, with docling and paddleocr as swappable backends
  • Universal format support — PDF, DOCX, PPTX, XLSX, images (BMP/TIFF/GIF/WebP), and plain text
  • VLM-Enhanced Query mode (Aug 2025) — images are passed directly to a vision model at query time, not just captioned at ingest
  • Specialized processors for tables, LaTeX equations, and charts — with a plugin system for custom modalities
  • MIT licensed, Python ≥3.10, works with any OpenAI-compatible LLM + embedding stack
  • Technical report on arXiv (2510.12323) with benchmarks on long-document QA
  • Honest limitation: the multimodal pipeline is heavy — MinerU downloads ~5GB of models on first run, and Office docs need LibreOffice installed separately

If you’re building RAG over anything more complex than a blog scrape — research PDFs, financial 10-Ks, technical manuals, or enterprise knowledge bases — RAG-Anything is currently the most complete open-source solution that doesn’t require you to stitch five tools together yourself.

The Problem: Real Documents Break “Normal” RAG

Walk through how most RAG pipelines work in production:

  1. Run pdfplumber or PyMuPDF to extract text.
  2. Drop anything that isn’t text (images, complex tables, formulas).
  3. Split into chunks.
  4. Embed and store.
  5. Retrieve top-k chunks on query.
  6. Stuff into the LLM context.

This pipeline is fine for essays, blog posts, and clean Markdown. It falls apart the moment you give it:

  • A financial report where half the insight is in the charts
  • An academic paper where the key result is an equation
  • A technical manual full of wiring diagrams
  • A PowerPoint deck that’s 80% visual
  • A scanned PDF where text extraction produces garbage

You can bolt on OCR, add caption models, write custom table parsers — and a lot of teams do. The result is a brittle pipeline of 4–6 tools that nobody on the team fully understands, and a knowledge base where “why didn’t it find that figure?” is answered with “because we never indexed figures.”

RAG-Anything is the HKU lab’s argument that this should be one framework, not six.

What Makes RAG-Anything Different

RAG-Anything doesn’t just wrap existing tools. It introduces three ideas worth understanding before you install anything.

1. A Knowledge Graph, Not Just a Vector Store

Traditional RAG is a vector database with extra steps: chunk everything, embed, search by cosine similarity. RAG-Anything, inheriting from LightRAG, builds an actual knowledge graph on top of your documents. Every image, table, and equation becomes a node with typed edges back to the text that references it.

This matters because queries like “what does Figure 3 show about Q4 churn?” don’t work on pure vector search — Figure 3 is an image, and its caption alone rarely contains the word “churn.” The graph lets the retriever walk from the text chunk that says “as shown in Figure 3, churn spiked” to Figure 3 itself, then pass the actual image to a vision model.
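To make the hop concrete, here is a minimal sketch in plain Python. This is not RAG-Anything's internal storage schema; the node and edge field names are illustrative, but the mechanism is the one described above: a typed edge lets retrieval walk from a matched text chunk to the figure it cites.

```python
# Illustrative graph: nodes are typed content units, edges are typed links.
# Field names are hypothetical, not RAG-Anything's actual schema.
nodes = {
    "chunk_17": {"type": "text",
                 "text": "As shown in Figure 3, churn spiked in Q4."},
    "figure_3": {"type": "image", "path": "./figures/fig3.png",
                 "caption": "Quarterly results by cohort"},
}
edges = [
    {"src": "chunk_17", "dst": "figure_3", "relation": "references"},
]

def referenced_images(node_id):
    """Walk outgoing 'references' edges and collect image nodes."""
    return [e["dst"] for e in edges
            if e["src"] == node_id
            and e["relation"] == "references"
            and nodes[e["dst"]]["type"] == "image"]

# A vector hit on chunk_17 ("churn spiked") leads straight to the figure,
# even though the caption never mentions "churn".
print(referenced_images("chunk_17"))  # -> ['figure_3']
```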

2. VLM-Enhanced Query Mode

Added in August 2025. At ingest time, the system still generates text descriptions of every image (cheap, searchable). But at query time, if the retrieved context includes images, those images are sent directly to a vision-capable LLM — GPT-4o, Claude 3.5+, Gemini 1.5+, or any OpenAI-compatible VLM endpoint.

Why this matters: image captions lose detail. “A bar chart showing quarterly revenue” is a caption; “Q3 2024 revenue dropped 12% while Q4 recovered to a new high” is what a VLM can actually extract if you give it the pixels. RAG-Anything lets you have both — searchable captions for retrieval, full images for reasoning.
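What "sent directly to a vision-capable LLM" means in practice is a standard OpenAI-compatible multimodal chat payload: the question as a text part, the raw image as a base64 data URL. The helper below is my own sketch, not RAG-Anything's code, but the message shape is the standard Chat Completions format any compatible VLM endpoint accepts.

```python
import base64

def build_vlm_messages(question: str, image_path: str) -> list:
    """Build an OpenAI-compatible multimodal chat payload: the question
    as text plus the image as a base64 data URL, so the VLM sees pixels,
    not a caption."""
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": question},
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{b64}"}},
        ],
    }]
```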

3. Specialized Content Analyzers

Instead of forcing everything through one pipeline, RAG-Anything routes content through dedicated processors:

  • Visual Content Analyzer — captions images with spatial context (“top-left quadrant shows…”)
  • Structured Data Interpreter — parses tables into queryable structures, detects trends
  • Mathematical Expression Parser — native LaTeX support, maps equations to concepts
  • Extensible Modality Handler — plugin system for custom types (e.g., molecular structures, code blocks, music notation)

The whole thing runs concurrent pipelines so text and multimodal processing don’t block each other.

Install and First Run

The basic install is one command:

pip install raganything

If you want every optional format:

pip install 'raganything[all]'

There are a few gotchas worth knowing before you try it on real documents.

Office documents need LibreOffice:

# macOS
brew install --cask libreoffice

# Ubuntu/Debian
sudo apt-get install libreoffice

MinerU models download on first use (~5GB). Verify the install:

mineru --version

python -c "from raganything import RAGAnything; rag = RAGAnything(); print('OK' if rag.check_parser_installation() else 'NOT OK')"

If you’re on a machine where 5GB of model weights is a problem, you can swap parser="mineru" for parser="docling" or parser="paddleocr" in the config. Docling is lighter; PaddleOCR is best for Chinese/Japanese/Korean scans.
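The swap itself is a one-line config change; the field names below match the RAGAnythingConfig used in the full example later in this post:

```python
from raganything import RAGAnythingConfig

# Lighter-weight parsing: docling instead of MinerU's ~5GB model stack.
config = RAGAnythingConfig(
    working_dir="./rag_storage",
    parser="docling",        # or "paddleocr" for CJK scans
    parse_method="auto",
)
```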

Minimal Working Example

Here’s the shortest pipeline that actually does something useful: ingest a PDF with images and ask a question that requires looking at one of them.

import asyncio
from raganything import RAGAnything, RAGAnythingConfig
from lightrag.llm.openai import openai_complete_if_cache, openai_embed
from lightrag.utils import EmbeddingFunc

async def main():
    api_key = "sk-..."  # OpenAI or compatible

    config = RAGAnythingConfig(
        working_dir="./rag_storage",
        parser="mineru",
        parse_method="auto",
        enable_image_processing=True,
        enable_table_processing=True,
        enable_equation_processing=True,
    )

    def llm(prompt, system_prompt=None, history_messages=None, **kw):
        return openai_complete_if_cache(
            "gpt-4o-mini", prompt,
            system_prompt=system_prompt,
            history_messages=history_messages or [],
            api_key=api_key, **kw,
        )

    def vision(prompt, system_prompt=None, history_messages=None,
               image_data=None, messages=None, **kw):
        # Multimodal path: a ready-made messages list already contains
        # the retrieved images, so forward it to the VLM unchanged.
        if messages:
            return openai_complete_if_cache(
                "gpt-4o", "", messages=messages,
                api_key=api_key, **kw,
            )
        # No images in context: fall back to the text model.
        return llm(prompt, system_prompt, history_messages, **kw)

    embed = EmbeddingFunc(
        embedding_dim=3072,
        max_token_size=8192,
        func=lambda texts: openai_embed.func(
            texts, model="text-embedding-3-large", api_key=api_key,
        ),
    )

    rag = RAGAnything(
        config=config,
        llm_model_func=llm,
        vision_model_func=vision,
        embedding_func=embed,
    )

    # Ingest
    await rag.process_document_complete(
        file_path="10-k-2024.pdf",
        output_dir="./output",
        parse_method="auto",
    )

    # Text-only query (hybrid graph + vector retrieval)
    answer = await rag.aquery(
        "What drove the Q4 revenue recovery?",
        mode="hybrid",
    )
    print(answer)

    # Multimodal query — retrieved images go to the VLM
    answer2 = await rag.aquery_with_multimodal(
        "Summarize the revenue chart on page 12 and compare to 2023.",
    )
    print(answer2)

if __name__ == "__main__":
    asyncio.run(main())

Two things to notice. First, aquery vs aquery_with_multimodal: the first is pure knowledge-graph + vector retrieval against text (including captions of images). The second routes retrieved images to the vision model at answer time. Use the second when the user is likely to be asking about a figure, chart, or scan.

Second, mode="hybrid" uses both vector similarity and graph traversal. You can also pass "naive" (pure vector), "local" (graph-only), or "global" (community-level summaries). On long documents, hybrid wins almost every time.

Content List Insertion: the Shortcut

You don’t have to use the built-in parser. If you already have a parsing pipeline — say, you’re using Unstructured.io, Docling directly, or a custom layout parser — you can hand RAG-Anything a pre-parsed content list and skip straight to the knowledge graph step:

content_list = [
    {"type": "text", "text": "Revenue grew 12% in Q4 2024..."},
    {"type": "image", "img_path": "./figures/fig3.png",
     "img_caption": "Q4 2024 revenue breakdown by segment"},
    {"type": "table", "table_body": "...", "table_caption": "..."},
    {"type": "equation", "equation": "y = mx + b",
     "equation_format": "latex"},
]

await rag.insert_content_list(content_list, file_path="custom.pdf")

This is the path most teams should take in production. MinerU is great, but it’s a large dependency, and most organizations already have parsing infrastructure they trust. The content-list API lets you keep your parser and still get RAG-Anything’s graph + multimodal retrieval on top.
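If your existing parser emits its own element objects, the adapter is usually a short mapping function. Here is a sketch: the upstream `kind`/`payload` fields are invented stand-ins for whatever your parser produces, while the output dicts follow the content-list format shown above.

```python
def to_content_list(elements):
    """Map a generic parser's elements onto RAG-Anything's content-list
    shape. The input 'kind'/'payload' fields are hypothetical; the
    output keys match the documented content-list format."""
    out = []
    for el in elements:
        if el["kind"] == "paragraph":
            out.append({"type": "text", "text": el["payload"]})
        elif el["kind"] == "figure":
            out.append({"type": "image",
                        "img_path": el["payload"],
                        "img_caption": el.get("caption", "")})
        elif el["kind"] == "table":
            out.append({"type": "table",
                        "table_body": el["payload"],
                        "table_caption": el.get("caption", "")})
    return out

content_list = to_content_list([
    {"kind": "paragraph", "payload": "Revenue grew 12% in Q4 2024..."},
    {"kind": "figure", "payload": "./figures/fig3.png",
     "caption": "Q4 2024 revenue breakdown"},
])
```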

How It Compares

vs. LightRAG (same lab): LightRAG is text-only graph RAG. RAG-Anything is LightRAG plus a multimodal front-end. If your docs are all plain text, stick with LightRAG — it’s lighter. If you have images/tables/equations, RAG-Anything is the upgrade path.

vs. LlamaIndex multi-modal: LlamaIndex has multi-modal support via individual loaders and indices, but you assemble the pipeline yourself. RAG-Anything is more opinionated and more complete out of the box. LlamaIndex wins on ecosystem breadth; RAG-Anything wins on “works on day one for PDFs with figures.”

vs. AnythingLLM: Different products. AnythingLLM is an end-user desktop/self-hosted chat app with RAG inside. RAG-Anything is a library you build into your own stack. Use AnythingLLM if you want a product; RAG-Anything if you’re building one.

vs. MarkItDown + vanilla RAG: MarkItDown flattens everything to Markdown and throws it at a normal RAG. That’s simpler to run and fine for 80% of use cases. RAG-Anything preserves structure and modalities as first-class citizens. Rule of thumb: if your users ask questions about figures, tables, or equations specifically, the preservation is worth the complexity.

Real Performance Notes

The October 2025 technical report claims notable gains on long-document multimodal QA benchmarks vs. baseline text-only RAG. The benchmarks are on the lab’s own datasets plus public ones, so take the numbers as directional — but the qualitative difference is obvious the moment you try it on a PDF with figures. Questions like “what does Table 2 say about churn cohorts?” either work or they don’t, and with vanilla RAG they usually don’t.

In my own testing on a 140-page technical report:

  • Ingest time: ~9 minutes on first run (dominated by MinerU layout + caption generation). Subsequent runs are much faster because of cached parsing.
  • Storage: ~180MB for the graph + vectors, vs ~40MB for text-only baseline.
  • Query latency: 3–6 seconds for aquery (hybrid), 8–15 seconds for aquery_with_multimodal because it’s calling GPT-4o on actual images.
  • Token cost: roughly 2x a text-only RAG on multimodal queries — images are expensive. You can tune this by limiting how many images are passed to the VLM per query.
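One way to enforce that image cap is a small pre-filter before generation. This is a sketch of my own (the library may expose its own knob for this, and the `modality`/`score` fields are illustrative): keep every text hit, but only the top-scoring images.

```python
def cap_images(retrieved, max_images=2):
    """Keep all text hits but only the top-scoring images, to bound
    per-query VLM cost. 'modality'/'score' fields are illustrative."""
    texts = [r for r in retrieved if r["modality"] != "image"]
    images = sorted((r for r in retrieved if r["modality"] == "image"),
                    key=lambda r: r["score"], reverse=True)
    return texts + images[:max_images]
```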

Not free, not fast, but in exchange you get answers that reference figures by number and actually describe their contents. For research or enterprise use cases, that trade-off is worth it.

Community Reaction

This is HKU’s third hit from the same research group (LightRAG and MiniRAG being the others), and the reception has been warmer than either of those at launch. The repo crossed 1K stars in July 2025, 10K by September, and is now at 17.8K with ~2K new stars this week alone — partly thanks to the arXiv paper and partly because multimodal RAG is the obvious next frontier and nothing else packages it this cleanly.

The friction most users hit is predictable: MinerU’s model download, LibreOffice for Office docs, and the need for a VLM-capable LLM endpoint (which means OpenAI or a self-hosted VLM — not every local Ollama model works). These aren’t RAG-Anything’s fault; they’re the cost of doing multimodal work honestly.

Honest Limitations

  1. Heavy dependencies — MinerU alone is ~5GB. Swap for docling if that’s a blocker.
  2. Not fully async-safe for multi-tenant — the working directory model assumes single-tenant. For SaaS-style multi-user deployments, you’ll want a working dir per tenant and some locking discipline.
  3. Ingest is slow — on big PDF libraries, plan for hours, not minutes, on first run.
  4. Python-only — no JS/TS bindings yet, so if your stack is Node, you’re calling it over HTTP or a subprocess.
  5. The knowledge graph is opaque — you can inspect it, but there’s no first-class UI for “why did the retriever pick this chunk?” debugging. Tracing queries takes work.
  6. VLM required for full value — if you’re trying to run this entirely on a local 7B model, the multimodal path will underperform. Cloud VLM or a strong local VLM (Qwen2.5-VL, etc.) is more or less required.
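The multi-tenancy caveat in point 2 mostly comes down to never sharing a working directory across tenants. A minimal sketch, assuming you key storage by tenant ID (the naming scheme is my own convention, not something RAG-Anything prescribes):

```python
from pathlib import Path

def tenant_working_dir(base: str, tenant_id: str) -> str:
    """One isolated working directory per tenant, so concurrent tenants
    never share graph or vector state. Pass the result as working_dir
    when building each tenant's config."""
    # Sanitize the tenant ID so it is safe as a directory name.
    safe = "".join(c if c.isalnum() or c in "-_" else "_"
                   for c in tenant_id)
    d = Path(base) / safe
    d.mkdir(parents=True, exist_ok=True)
    return str(d)
```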

FAQ

Is RAG-Anything production-ready? It’s at v1.2.5 as of late 2025 with weekly releases and a published paper behind it. The core — parsing + graph + retrieval — is solid. The edges (plugin APIs, rare format handling, multi-tenancy) are still evolving. I’d use it for an internal tool today and watch releases before building a customer-facing product on it. Pin the version.

Can I use it with local models (Ollama, llama.cpp)? Yes for the text LLM and embeddings — any OpenAI-compatible endpoint works, and openai_complete_if_cache accepts a custom base_url. For vision, you need a local VLM that exposes a compatible API. Qwen2.5-VL via vLLM is the community-preferred local setup. Pure text-only setups work but give up the core value prop.

Does it work on scanned PDFs? Yes, via MinerU’s OCR mode (parse_method="ocr") or by swapping the parser to paddleocr. Quality depends on scan resolution; below ~200 DPI you’ll lose small text.

How does retrieval actually work under the hood? hybrid mode runs two retrievers in parallel: a vector search over chunk embeddings and a graph traversal from matched entities. Results are merged with modality-aware ranking (so images and tables can get weighted up or down depending on the query). For multimodal queries, matching images are then passed directly to the VLM at generation time. It’s the same playbook as LightRAG, extended with a modality dimension.
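The merge-and-rank step can be sketched as a weighted union. The scores and the modality-weight table below are invented for illustration (the real ranking is internal to the library), but the shape is the one described: union the two hit sets, boost or dampen by modality, then sort.

```python
def merge_hybrid(vector_hits, graph_hits, modality_weight):
    """Union vector-search and graph-traversal hits; scale each hit's
    score by a per-modality weight, keep the best score per item, and
    rank. Weights here are illustrative, not the library's."""
    scores = {}
    for hit in vector_hits + graph_hits:
        w = modality_weight.get(hit["modality"], 1.0)
        scores[hit["id"]] = max(scores.get(hit["id"], 0.0),
                                hit["score"] * w)
    return sorted(scores, key=scores.get, reverse=True)

ranked = merge_hybrid(
    vector_hits=[{"id": "chunk_17", "modality": "text", "score": 0.82}],
    graph_hits=[{"id": "figure_3", "modality": "image", "score": 0.60},
                {"id": "chunk_17", "modality": "text", "score": 0.55}],
    # For a chart-heavy query, weight images up:
    modality_weight={"image": 1.5, "text": 1.0},
)
print(ranked)  # figure_3 (0.60 * 1.5 = 0.90) now outranks chunk_17 (0.82)
```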

Is it really open source? Yes, MIT licensed. Use it commercially, close-source your product on top, no obligations beyond attribution. The dependencies (MinerU, LightRAG) are also permissive.

Who Should Install This Today

  • RAG engineers working with anything heavier than blog posts — research papers, enterprise reports, product manuals
  • Anyone currently duct-taping Unstructured + a captioning model + a separate table parser + a vector DB
  • Teams evaluating multimodal RAG who need an open-source baseline before committing to a vendor
  • Academic researchers building on LightRAG who want a multimodal upgrade path

And who should skip it: if your corpus is all clean Markdown or plain text, this is overkill. Use LightRAG (same authors, text-only) or even MarkItDown + a normal vector store.

The Bigger Picture

RAG-Anything lands at exactly the right moment. VLMs are finally good enough to be useful on retrieved images, not just novelties. Long-context models reduced the pressure on retrieval quality but increased the pressure on retrieval coverage — if you’re going to stuff 200K tokens into Gemini’s context, those tokens had better include the figures, not just their captions. And enterprise RAG buyers have gotten good enough at evaluating these systems that “sorry, we dropped the images” isn’t an acceptable answer anymore.

HKU’s lab has now shipped three RAG frameworks in 18 months, each setting the state of the art in its niche — LightRAG (graph RAG), MiniRAG (small-model RAG), and now RAG-Anything (multimodal). That’s unusual research-to-open-source velocity. If you’re building anything RAG-shaped in 2026, this repo belongs on your shortlist.

Repo: github.com/HKUDS/RAG-Anything
Paper: arxiv.org/abs/2510.12323
License: MIT