TL;DR
PageIndex is an open-source RAG framework from VectifyAI that throws out the entire vector database stack. Instead of chunking, embedding, and running cosine similarity, it builds a hierarchical “table of contents” tree from a document and asks an LLM to reason its way to the right section — the way a human analyst would flip to the right chapter.
The headline numbers are doing real work for the hype:
- 30K+ GitHub stars total, 4,250 added this week (currently #6 on GitHub Trending Python)
- State-of-the-art 98.7% accuracy on FinanceBench — a benchmark where typical vector RAG scores 30–50%
- No vector DB. No chunking. No embedding model. Just a tree of section summaries and a reasoning LLM
- Multi-LLM via LiteLLM — OpenAI, Anthropic, Gemini, Mistral, local models
- MIT-licensed, with an OpenAI Agents SDK example for a fully agentic vectorless RAG demo
But the HN and r/Rag threads are not entirely starry-eyed: it’s slow (30–120s per query without caching), token-expensive, single-document-shaped, and — critics argue — every bit as “vibe-ish” as the vector search it’s pitching against. This review walks through what PageIndex actually is, when it wins decisively over traditional RAG, and the caveats you’ll want to know before pointing it at your company’s PDFs.
What is PageIndex?
PageIndex is a reasoning-based document index that lets an LLM retrieve information from long PDFs by navigating the document, not by searching its vector neighborhood.
The mental model the team uses is AlphaGo. AlphaGo didn’t memorize positions; it searched a tree. PageIndex applies the same idea to documents: instead of compressing every page into 1,536-dim vectors and hoping cosine similarity surfaces relevance, it generates a structured tree (chapters → sections → subsections), summarizes each node, and lets the LLM walk down to the right leaf.
The pipeline is:
- Tree generation. PageIndex parses a PDF, detects (or generates) a table of contents, and produces a JSON tree where each node has a title, page span, summary, and node ID.
- Reasoning-based retrieval. At query time, an LLM is shown the tree (titles + summaries, not raw text) and asked to reason about which nodes likely contain the answer.
- Targeted extraction. Only the selected leaf nodes are pulled into context for final answer generation, with explicit page and section citations.
The two moves — navigate, then extract — mirror how a human analyst handles a 300-page 10-K: skim the TOC, jump to “Risk Factors,” read the relevant subsection, cite the page. No embedding model anywhere in this loop.
Why It’s Trending NOW
The PageIndex repo first surfaced on Hacker News on April 1, 2025, got a follow-up thread later that month, then a fresh "Show HN: PageIndex – Vectorless RAG" in September 2025 that pushed adoption hard. By May 2026 it's at 30K+ stars and trending again.
Three forces are driving the surge:
- Vector RAG complexity fatigue. Pinecone/Weaviate/Qdrant, embedding model selection, chunk size tuning, re-embedding on doc updates — a lot of moving parts for a system that often returns “close-ish but wrong” chunks.
- Long-context models got cheap. GPT-4o-mini, Gemini 2.0 Flash, and Claude 3.5 Haiku made multiple sequential LLM calls affordable. The economics that killed reasoning-based retrieval in 2023 don't hold in 2026.
- FinanceBench made the case undeniable. Mafin 2.5, VectifyAI’s commercial product built on PageIndex, hit 98.7% accuracy on FinanceBench versus 30–50% for vector RAG baselines. For finance, legal, and medical documents the gap is huge.
How the Architecture Works
1. Hierarchical Tree Index
The output of run_pageindex.py is a JSON tree that looks roughly like this (trimmed for readability):
{
  "doc_id": "annual_report_2025",
  "doc_description": "Acme Corp 2025 annual report covering...",
  "nodes": [
    {
      "node_id": "1",
      "title": "Item 1A. Risk Factors",
      "page_start": 12,
      "page_end": 38,
      "summary": "Risk factors covering supply chain, FX exposure, regulatory...",
      "children": [
        {
          "node_id": "1.1",
          "title": "Cybersecurity Risk",
          "page_start": 18,
          "page_end": 22,
          "summary": "Discusses Q3 2024 incident response..."
        }
      ]
    }
  ]
}
The key design choice: node summaries are LLM-generated, not extracted text. That’s how the tree fits in a reasoning prompt even for a 500-page document.
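To make the summarization step concrete, here's a minimal sketch of what per-node summary generation could look like. This is an illustration, not the repo's actual code: summarize_node and the get_text page-range lookup are hypothetical, and PageIndex's real prompts and batching differ.
from openai import OpenAI

client = OpenAI()

def summarize_node(node_text: str, title: str) -> str:
    # Hypothetical sketch; PageIndex's real prompts and batching differ.
    # Compress a section's raw text into a short summary so that the whole
    # tree (titles + summaries) fits in a single reasoning prompt.
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{
            "role": "user",
            "content": (
                f"Summarize the section '{title}' in 2-3 sentences, "
                f"focusing on what questions it could answer:\n\n{node_text}"
            ),
        }],
    )
    return response.choices[0].message.content

def add_summaries(node: dict, get_text) -> None:
    # Walk the tree depth-first, attaching a short LLM summary to every node.
    # get_text is an assumed helper mapping a page span to raw text.
    node["summary"] = summarize_node(
        get_text(node["page_start"], node["page_end"]), node["title"]
    )
    for child in node.get("children", []):
        add_summaries(child, get_text)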
2. Reasoning-Based Retrieval
At query time, you feed the LLM the tree (titles + summaries, no raw text) plus the user’s question, and ask it to pick the relevant node IDs. Only those leaves get loaded into the final answer prompt. The cookbook example uses a setup like:
from openai import OpenAI
import json

client = OpenAI()

# Assumed to already exist: the tree from run_pageindex.py with raw text
# stripped out, and the user's question.
# tree_without_text = json.load(open("annual_report_2025_tree.json"))
# user_question = "What drove the FX losses in Q3?"

retrieval_prompt = f"""You are a document navigator. Given the following
document tree and a user question, return the node_ids that are most
likely to contain the answer.
Document tree:
{json.dumps(tree_without_text, indent=2)}
Question: {user_question}
Return JSON: {{"relevant_node_ids": ["1.2", "3.4"]}}
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": retrieval_prompt}],
    # Force a JSON object so the node IDs parse reliably.
    response_format={"type": "json_object"},
)
selected_ids = json.loads(response.choices[0].message.content)["relevant_node_ids"]
Then you fetch the raw text for those node IDs and run a final answer generation pass. That’s the whole retrieval algorithm.
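The extraction pass, continuing the snippet above, might look like this (node_text_for, a lookup from node ID to raw page text, is a hypothetical helper, not a PageIndex API):
# Hypothetical continuation: fetch raw text for the selected nodes only,
# then run one final answer-generation pass with citations.
context = "\n\n".join(
    f"[Node {node_id}]\n{node_text_for(node_id)}" for node_id in selected_ids
)
answer = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": (
            "Answer the question using only the context below, citing page "
            "and section for every claim.\n\n"
            f"Context:\n{context}\n\nQuestion: {user_question}"
        ),
    }],
)
print(answer.choices[0].message.content)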
3. Agentic Vectorless RAG (with OpenAI Agents SDK)
The repo ships examples/agentic_vectorless_rag_demo.py, which wraps PageIndex as a tool inside the OpenAI Agents SDK. The agent decides on its own when to read a section, when to drill deeper, and when it has enough context to answer — closer to how a human researcher works through a long document.
This is the more interesting use case in practice. Instead of one-shot tree traversal, the agent can do multi-hop navigation: read section A, realize it needs to cross-reference section C, fetch C, then synthesize.
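A rough sketch of the shape of that wrapping, not the demo's actual code (view_tree, read_section, tree_json, and node_text_for are illustrative assumptions):
# Minimal sketch, assuming the openai-agents package: expose tree navigation
# and section reading as tools, and let the agent decide when to drill down.
from agents import Agent, Runner, function_tool

@function_tool
def view_tree() -> str:
    """Return the document tree (titles + summaries, no raw text)."""
    return tree_json  # assumed: the JSON produced by run_pageindex.py

@function_tool
def read_section(node_id: str) -> str:
    """Return the raw text of one section, identified by its node_id."""
    return node_text_for(node_id)  # hypothetical lookup helper

agent = Agent(
    name="document-navigator",
    instructions=(
        "Navigate the tree, read only the sections you need, "
        "then answer with page-level citations."
    ),
    tools=[view_tree, read_section],
)
result = Runner.run_sync(agent, "What drove the FX losses in Q3?")
print(result.final_output)
The multi-hop behavior falls out of the loop for free: if the answer in one section references another, the agent can call read_section again before synthesizing.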
Getting Started
You’ll need Python 3.9+ and an LLM API key. The README’s quickstart is genuinely a 5-minute path:
git clone https://github.com/VectifyAI/PageIndex
cd PageIndex
pip install --upgrade -r requirements.txt
Create a .env at the project root:
OPENAI_API_KEY=sk-...
Then generate a tree from any PDF:
python3 run_pageindex.py --pdf_path /path/to/document.pdf
Useful optional flags:
- --model gpt-4o-2024-11-20 — swap in any LiteLLM-supported model
- --toc-check-pages 20 — how many pages to scan for an existing TOC
- --max-pages-per-node 10 — splits large sections into multiple nodes
- --max-tokens-per-node 20000 — per-node token cap
- --if-add-node-summary yes — adds an LLM-generated summary at each node (highly recommended)
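A typical invocation combining a few of these might look like (illustrative, not copied from the README):
python3 run_pageindex.py --pdf_path ./annual_report_2025.pdf --model gpt-4o-mini --if-add-node-summary yes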
For Markdown input, use --md_path /path/to/doc.md. The README is honest that Markdown converted from PDF often loses heading hierarchy, so you’ll generally want to use VectifyAI’s hosted OCR (or a tool like Marker) before falling back to the markdown path.
To try the agentic example:
pip install openai-agents
python3 examples/agentic_vectorless_rag_demo.py
Real-World Use Cases
PageIndex is a near-perfect fit when the document is long, structured, and the answer needs to be auditable:
- Financial filings — 10-Ks, 10-Qs, S-1s, earnings transcripts (where Mafin 2.5 hit 98.7%).
- Regulatory and compliance — long policy documents where you cite the exact paragraph.
- Legal contracts — direct quotes, cross-references, inconsistencies (where embeddings struggle).
- Technical manuals — 800-page automotive or industrial manuals where chapter structure matters.
- Academic textbooks and long-form research papers with proper section hierarchy.
- Medical and patient records when well structured.
It’s a worse fit when:
- You have a corpus of thousands of small documents (think: customer support tickets, news articles, product reviews). PageIndex is currently document-shaped, not corpus-shaped — though VectifyAI’s PageIndex File System is trying to address this.
- You need sub-second latency. Reasoning-based retrieval typically runs 30–120s without aggressive caching.
- The documents are flat with no meaningful section structure. Without a useful TOC, the tree degenerates into roughly equal-sized chunks and the reasoning advantage shrinks.
First Impressions from the Community
HN threads and r/Rag posts are stress-testing the claims. A few honest themes:
“Embeddings have real limits” (mostly people working on legal/finance docs). One commenter on the September Show HN summed it up:
“Embeddings are great at basic conceptual similarity, but in quality maximalist fields they fall apart very quickly. ‘Find inconsistencies across N documents.’ There is no concept of an inconsistency in an embedding… ‘Where are Sarah or John directly quoted in this folder full of legal documents?’ Finding where they are directly quoted is nearly impossible even in a high dimensional vector.”
“Still vibe retrieval.” The most-cited critique is a top HN comment:
“How is this not precisely ‘vibe retrieval’ and much more approximate? Similarity with conversion to high-dimensional vectors and then something like kNN seems significantly less approximate, less ‘vibe’ based, than this.”
That’s fair: PageIndex replaces deterministic vector math with a stochastic LLM call. You’re trading one source of approximation for another.
“Just an expensive conversion script.” Several r/Rag users have observed that the indexing step is mostly LLM calls to summarize sections, and at runtime the system is “stuff the tree into an LLM and ask it to point at a node.” A few enthusiasts have built simpler versions achieving ~82% on FinanceBench with fewer LLM calls.
Cost/latency reality check. Early adopters consistently report 30–120 seconds per query without caching. For a chatbot that’s a non-starter; for an analyst tool, perfectly acceptable.
Honest Limitations
Going in with eyes open:
- Slow without caching. 30–120s/query is normal. Cache aggressively at the tree level (the tree is reusable across queries; a minimal sketch follows this list).
- Token-expensive at index time. Every section gets an LLM-generated summary. A 300-page report might cost $0.50–$2 to index. Frequent-update workflows need to budget for this.
- PDF-first. Word, HTML, EPUB, and arbitrary structured text need preprocessing.
- Single-document mindset. Out of the box, PageIndex reasons over one tree at a time. Multi-document corpora work but require extra glue (or VectifyAI’s commercial filesystem layer).
- Sensitive to TOC quality. Without a usable TOC, the LLM-generated tree is hit-or-miss. Enhanced OCR (the cloud product) helps; the open-source PDF parser is intentionally basic.
- Vectorless ≠ free. You’ll trade Pinecone bills for OpenAI/Anthropic bills. For high-QPS retrieval, vector search remains drastically cheaper.
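On that caching point: the tree is a plain JSON artifact, so the cheapest mitigation is to build it once per document and reload it on every subsequent query. A minimal sketch, assuming you wrap the indexing step in a build_tree function of your own (not part of the repo's API):
import hashlib
import json
from pathlib import Path

CACHE_DIR = Path(".pageindex_cache")
CACHE_DIR.mkdir(exist_ok=True)

def cached_tree(pdf_path: str, build_tree) -> dict:
    # Key the cache on the PDF's content hash so edits invalidate it.
    digest = hashlib.sha256(Path(pdf_path).read_bytes()).hexdigest()
    cache_file = CACHE_DIR / f"{digest}.json"
    if cache_file.exists():
        return json.loads(cache_file.read_text())
    tree = build_tree(pdf_path)  # the expensive LLM summarization pass
    cache_file.write_text(json.dumps(tree))
    return tree
Keying on the file's content hash means an edited PDF triggers a rebuild automatically, while repeat queries against the same document pay the indexing cost exactly once.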
How PageIndex Compares to Vector RAG
| Dimension | PageIndex (Vectorless) | Traditional Vector RAG |
|---|---|---|
| Index time | Slow (LLM summarization) | Fast (embedding) |
| Index cost | $$ (LLM calls) | $ (embeddings) |
| Query latency | 30–120s | < 1s |
| Per-query cost | $$ (multiple LLM calls) | $ (one embedding + DB lookup) |
| Accuracy on long structured docs | ⭐⭐⭐⭐⭐ (98.7% FinanceBench) | ⭐⭐ (30–50% FinanceBench) |
| Accuracy on short flat docs | ⭐⭐⭐ | ⭐⭐⭐⭐ |
| Multi-document corpus | ⭐⭐ | ⭐⭐⭐⭐⭐ |
| Citation/explainability | ⭐⭐⭐⭐⭐ (page + section) | ⭐⭐ (chunk-level) |
| Operational complexity | Low (no DB) | Medium (DB + embedder) |
The headline: vector RAG is still the right default for most search-shaped workloads. PageIndex wins decisively when you need precise, auditable answers from long, structured documents. They’re not really competitors as much as different tools for different jobs.
FAQ
Does PageIndex replace vector databases entirely?
No, and the team is careful not to claim that. It replaces vector retrieval for long, structured documents where reasoning helps. For product catalogs, semantic search over millions of short snippets, or recommendation pipelines, vector search is still better — faster, cheaper, and good enough.
What’s the actual cost to index a 300-page PDF?
Roughly $0.50–$2 with GPT-4o-mini, depending on how detailed the summaries are and whether you enable per-node summaries (--if-add-node-summary yes). With Claude Haiku or Gemini Flash you can drive this lower. The tree is reusable across queries, so amortized cost per query drops fast on heavily queried documents.
Can I run PageIndex with a local LLM like Llama 3 or Qwen?
Yes — anything LiteLLM supports works. The --model flag accepts any LiteLLM model identifier, so you can point at Ollama, vLLM, or LM Studio. Quality drops noticeably with smaller open models on the reasoning step (the navigation prompt), so 70B+ class models or strong 32B reasoning models are recommended for production. Smaller models are fine for the summary step.
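For example, a local Ollama model might be wired in like this (illustrative; the exact identifier depends on your LiteLLM/Ollama setup):
python3 run_pageindex.py --pdf_path /path/to/document.pdf --model ollama/llama3.1:70b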
How is this different from just dumping the whole PDF into a long-context model?
For a single 100-page document, just stuffing the PDF into a frontier long-context model (Gemini's 2M-token class) often works fine. PageIndex starts to win when (a) the document is too long even for long-context models, (b) you have many documents and only want to load relevant sections, or (c) you need the citations — page and section references — that PageIndex preserves natively but context-stuffing destroys.
Is PageIndex production-ready?
The open-source repo is solid for prototyping and lower-volume internal tools. For production, VectifyAI strongly nudges you toward their hosted API/MCP service, which has better OCR, faster tree building, and managed caching. That’s the standard “open core” play — workable but expect to pay if you’re at scale.
How does this compare to GraphRAG?
GraphRAG builds a knowledge graph across a corpus and uses graph traversal for retrieval. PageIndex builds a hierarchical tree per document and uses LLM reasoning over the tree. GraphRAG is corpus-shaped and great for “what’s the relationship between X and Y across all my docs”; PageIndex is document-shaped and great for “find the exact section in this 200-page report that answers my question.” They compose well — graph for cross-document, PageIndex for in-document depth.
Should You Use PageIndex?
Yes, if:
- You have long, structured documents (50+ pages with real chapter/section hierarchy)
- Citation and auditability matter — you need to point at the exact page that justified an answer
- You’re in a domain where vector RAG accuracy keeps disappointing you (finance, legal, regulatory, medical)
- 30–120s/query is acceptable for your UX (analyst tools, research assistants, async workflows)
Probably not, if:
- You have a large corpus of short, unstructured documents
- You need sub-second retrieval for a chatbot
- Your documents have no meaningful section structure
- You’re already happy with your vector RAG accuracy
PageIndex is the most interesting practical demonstration so far that “throw out the vector DB” can actually work — provided your problem looks like a long document and not a search index. The 98.7% FinanceBench score is the kind of benchmark gap that makes you take the architecture seriously, even if some of the HN critiques about it being “vibe retrieval with extra steps” are fair. For the right problem, the extra steps are exactly what you wanted.
The open-source repo is at github.com/VectifyAI/PageIndex — it’s a 5-minute install if you want to play with it on a PDF you already have.