TL;DR
MarkItDown is Microsoft’s open-source Python utility that converts PDFs, Office docs, HTML, images, audio, and ZIP archives into clean Markdown — the format LLMs natively “speak.” It’s trending on GitHub with 113,000+ stars (and ~7,000 new this week), making it one of the most popular document-ingestion tools in the AI ecosystem. Key facts:
- Converts 14+ formats — PDF, PowerPoint, Word, Excel, images, audio, HTML, CSV, JSON, XML, EPub, YouTube URLs, ZIP files, and more
- MIT licensed — free for commercial use, no AGPL strings attached
- CLI + Python API — `markitdown file.pdf > out.md` or `MarkItDown().convert("file.pdf")`
- Plugin architecture — `markitdown-ocr` adds LLM-powered OCR via OpenAI-compatible clients
- Azure Document Intelligence integration for high-fidelity PDFs
- Optional per-format installs — `pip install 'markitdown[pdf,docx,pptx]'` if you don’t want every dependency
- Python ≥3.10, Docker image available
- Run without installing via `uvx markitdown file.pdf`
- Honest limitation: Markdown tables are lossy for complex spreadsheets, and PDF extraction is heuristic-based unless you pay for Document Intelligence
If you’re building a RAG pipeline, feeding documents to an agent, or batch-processing a corpus for fine-tuning, MarkItDown is the simplest “boring tool that works” in a space full of overengineered startups.
The Problem MarkItDown Solves
Every serious LLM pipeline runs into the same bottleneck on day one: your data is not in a format the model likes.
You have PDFs from legal, PowerPoints from marketing, Excel files from finance, Word docs from HR, and a Confluence export full of HTML soup. LLMs want clean Markdown — structured enough to preserve headings, lists, and tables, but simple enough that every token carries signal instead of formatting noise. GPT-4o, Claude, and Gemini all demonstrably perform better on Markdown input than on raw HTML or PDF text, and Markdown is also the most token-efficient structured format you can feed them.
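A crude way to see the overhead: the same snippet of content as HTML and as Markdown, using character counts as a rough proxy for tokens (real tokenizers differ, so treat this as directional only):

```python
# Same content as HTML and as Markdown; character count is a crude
# proxy for token count, but the direction of the gap holds.
html = "<h1>Q3 Report</h1><ul><li><strong>Revenue</strong>: up 12%</li></ul>"
mdwn = "# Q3 Report\n\n- **Revenue**: up 12%\n"

print(len(html), len(mdwn))
```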
The existing options each have problems:
- `textract` — unmaintained, no Markdown output, loses structure
- `unstructured` — powerful but heavy, opinionated, and often overkill
- `docling` (IBM) — excellent for PDFs, but narrower format coverage
- LlamaParse / Azure Document Intelligence — cloud-only, paid, privacy-sensitive
- Homegrown scripts — everyone writes the same `pdfminer` + `python-docx` + `pandas` glue
MarkItDown is Microsoft’s answer: a small, single-purpose library that handles the long tail of formats with reasonable defaults, preserves document structure as Markdown, and stays out of your way. As one Hacker News commenter put it after reading the source: “I really hope the end state is a simple project like this, easy to understand and easy to deploy.”
Install in 30 Seconds
The fastest path, if you have uv installed, is zero-install:
```shell
# Run once, no virtualenv needed
uvx markitdown path-to-file.pdf > document.md
```
Traditional install:
```shell
python -m venv .venv
source .venv/bin/activate
pip install 'markitdown[all]'
```
`[all]` pulls in every optional dependency. If you know you only need a subset, target it:
```shell
# PDF + DOCX + PPTX only — smaller install, smaller attack surface
pip install 'markitdown[pdf,docx,pptx]'
```
Available extras: `pdf`, `docx`, `pptx`, `xlsx`, `xls`, `outlook`, `az-doc-intel`, `audio-transcription`, `youtube-transcription`, and `all`.
For production pipelines, the Docker route avoids dependency drift:
```shell
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
First Run: CLI
Basic conversion is exactly what you’d hope:
```shell
# Output to stdout
markitdown quarterly-report.pdf

# Write to a file
markitdown quarterly-report.pdf -o quarterly-report.md

# Pipe from stdin (works great in shell pipelines)
cat quarterly-report.pdf | markitdown > quarterly-report.md
```
List installed plugins:
```shell
markitdown --list-plugins
```
Enable plugins for a conversion:
```shell
markitdown --use-plugins scan-of-contract.pdf
```
Python API: The Part You’ll Actually Use
Most production usage happens via the Python API, because you typically want to chain conversion into chunking, embedding, and vector storage.
Basic conversion
```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)
```
`result.text_content` is a string of Markdown. No temp files, no surprise I/O.
With LLM-powered image descriptions
PowerPoint decks and images can be described in-context by an LLM — MarkItDown will call your client and insert the description as alt text:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image for a blind user in one sentence.",
)
result = md.convert("investor-deck.pptx")
print(result.text_content)
```
This is where MarkItDown quietly gets interesting: you can point `llm_client` at any OpenAI-compatible endpoint — including a local Ollama or vLLM server — and keep the whole pipeline on-premises.
With Azure Document Intelligence (high-fidelity PDFs)
When heuristic PDF extraction isn’t good enough (scanned documents, complex layouts, tables), swap in Azure DI:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("scanned-invoice.pdf")
print(result.text_content)
```
Building a small RAG ingestion script
Here’s a realistic end-to-end pattern — walk a folder, convert everything to Markdown, and write it to a parallel tree ready for chunking:
```python
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

SRC = Path("./raw-docs")
DST = Path("./markdown-docs")
EXTS = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".epub"}

for src_path in SRC.rglob("*"):
    if src_path.suffix.lower() not in EXTS:
        continue
    rel = src_path.relative_to(SRC)
    dst_path = (DST / rel).with_suffix(".md")
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        result = md.convert(str(src_path))
        dst_path.write_text(result.text_content, encoding="utf-8")
        print(f"✓ {rel}")
    except Exception as e:
        print(f"✗ {rel}: {e}")
```
Pipe that output into your chunker of choice (LangChain’s `MarkdownHeaderTextSplitter`, LlamaIndex’s `MarkdownNodeParser`, or a 30-line custom splitter), embed, and you’re done.
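The 30-line custom splitter is genuinely viable. Here’s a minimal sketch — naive by design (it doesn’t special-case `#` characters inside fenced code blocks), but enough for clean converter output:

```python
import re

def split_on_headers(markdown: str, max_level: int = 2) -> list[str]:
    """Split Markdown into chunks at headings of level <= max_level."""
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(markdown)] or [0]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    # Deeper headings (###, ####) stay attached to their parent chunk.
    return [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:]) if markdown[a:b].strip()]

doc = "# Intro\nhello\n## Details\nmore\n### Deep\nkept with parent\n"
chunks = split_on_headers(doc)
```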
Under the Hood
MarkItDown is a thin orchestration layer over battle-tested libraries — which is exactly the right design for this problem. According to InfoWorld’s review, it uses:
- `mammoth` for DOCX → HTML → Markdown
- `python-pptx` for PowerPoint slides
- `pandas` for Excel (tabular → pipe-style tables)
- `pdfminer.six` for PDF text extraction heuristics
- `BeautifulSoup` for HTML cleaning
- `speech_recognition` for audio transcription
- EXIF parsers for image metadata
The architecture is deliberately shallow: each format has a `convert_*` function, plugins register new converters, and everything returns a `DocumentConverterResult` with a `text_content` field. You can audit the whole pipeline in an afternoon.
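In sketch form, the converter contract looks roughly like this. Note this is a simplified stand-in, not MarkItDown’s actual classes — the names mirror the pattern described above, but check the repo’s sample plugin for the real interface:

```python
import io

class DocumentConverterResult:
    # Simplified stand-in for markitdown's result type; the real class
    # lives in the markitdown package and carries more fields.
    def __init__(self, text_content: str):
        self.text_content = text_content

class TxtConverter:
    """Toy converter in the plugin shape: accept a format, emit Markdown."""

    def accepts(self, filename: str) -> bool:
        # Real converters also inspect MIME types and extension hints.
        return filename.lower().endswith(".txt")

    def convert(self, stream: io.TextIOBase) -> DocumentConverterResult:
        # "Conversion" here is trivial; real converters rebuild headings,
        # lists, and tables from the source document's structure.
        return DocumentConverterResult(text_content=stream.read().strip())
```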
The Plugin Ecosystem
Microsoft exposed a simple plugin API, and the community has started filling gaps. The official `markitdown-ocr` plugin is the most interesting — it adds OCR to PDF, DOCX, PPTX, and XLSX converters by reusing the same `llm_client` / `llm_model` pattern:
```shell
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```
No Tesseract install, no new ML dependencies — the LLM does the OCR. If you don’t provide an `llm_client`, OCR is silently skipped and the standard converter runs. Search GitHub for `#markitdown-plugin` to find community plugins.
Community Reactions
The Hacker News thread on MarkItDown (500+ points) captures the tool’s reception well. A few representative comments from engineers who’d worked on similar in-house tooling:
> “I worked on an in-house version of this feature for my employer. After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.”
> “If you have uv installed you can run this against a file without first installing anything like this: `uvx markitdown path-to-file.pdf`. I’ve tried it against HTML and PDFs so far and it seems pretty decent.”
> “I really hope the end state is a simple project like this, easy to understand and easy to deploy. I do wish it had a knob to turn for ‘how much processing do you want me to do.’ For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR.”
On r/ObsidianMD, the note-taking community picked it up immediately as a bulk-import tool for moving away from proprietary formats. The InfoWorld write-up called it “Microsoft’s quiet but significant contribution to the open-source AI tooling stack.”
Honest Limitations
This isn’t magic, and the trade-offs are real:
- Markdown tables are lossy. Complex spreadsheets with merged cells, formulas, or multi-header rows lose information. The HN engineer’s advice — “pass structured data to a code interpreter instead” — is correct for any non-trivial tabular workload.
- PDF extraction is heuristic without Azure DI. Scanned PDFs, multi-column academic papers, and anything with footnotes come out messy. You’re one step away from OCR and it shows.
- No “processing intensity” knob. You get the default extraction path; you can’t ask for “try harder” without switching to Document Intelligence or `markitdown-ocr`.
- Image handling is weak by default. Without an `llm_client`, images become EXIF metadata and nothing else. With one, you’re paying tokens for every picture.
- Audio transcription uses `speech_recognition` by default — fine for clean speech, poor for noisy or accented audio. For production, route audio through Whisper separately.
- Security surface grows with `[all]`. Pulling in every dependency means pulling in every CVE. The per-format install option exists for a reason.
- No streaming API. Large files are loaded fully into memory. For 500-page PDFs, watch your RAM.
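The first limitation suggests its own workaround: for tabular data, skip the Markdown flattening entirely and keep rows structured. A stdlib-only sketch with inline sample data:

```python
import csv
import io
import json

# A Markdown pipe table would flatten this to text; keeping the rows as
# dicts/JSON preserves structure a code interpreter can actually use.
raw = "region,q1,q2\nEMEA,100,120\nAPAC,90,140\n"
rows = list(csv.DictReader(io.StringIO(raw)))
payload = json.dumps(rows)  # hand this to the model instead of a pipe table
```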
MarkItDown vs. The Alternatives
| Tool | Formats | License | Best For | Weakness |
|---|---|---|---|---|
| MarkItDown | 14+ | MIT | General LLM pipelines, CLI scripts | Heuristic PDF extraction |
| Unstructured.io | 20+ | Apache 2.0 | Enterprise RAG with chunking built-in | Heavy, opinionated |
| Docling (IBM) | 6 (PDF-focused) | MIT | High-fidelity PDF → Markdown | Narrow format coverage |
| LlamaParse | PDF + others | Proprietary | Best PDF quality, cloud-only | Paid, privacy-sensitive |
| Azure Document Intelligence | PDFs | Proprietary | Enterprise PDFs with forms/tables | Paid, Azure lock-in |
| textract | Many | MIT | Legacy scripts | Unmaintained, no Markdown |
Rule of thumb: start with MarkItDown. If PDF quality isn’t good enough, add `markitdown-ocr` or pipe through Docling. If you need enterprise chunking/metadata, graduate to Unstructured.
Who Should Use This
Good fit:
- RAG developers who need broad format coverage without vendor lock-in
- Teams building agent pipelines that ingest arbitrary user uploads
- Anyone migrating a document corpus to Obsidian, Logseq, or a static site
- Hobbyists who want `uvx markitdown file.pdf` and nothing else
- Privacy-conscious shops — the library is local unless you wire in an LLM
Not a fit:
- Scanned-PDF-heavy legal or medical workflows (use Azure DI or Docling)
- Tabular-first finance pipelines (use pandas → CSV → code interpreter)
- High-fidelity document-to-document conversion for human consumption
FAQ
How does MarkItDown compare to Unstructured.io?
MarkItDown is smaller and simpler — roughly 2,000 lines of orchestration code over mature libraries. Unstructured is a full ingestion framework with chunking, metadata extraction, and pluggable backends. If you’re writing a script, use MarkItDown. If you’re building a platform, consider Unstructured.
Can I run MarkItDown fully offline?
Yes. The default conversion path is 100% local. LLM-powered image descriptions and `markitdown-ocr` require an LLM client, but you can point that client at a local Ollama or vLLM server instead of the OpenAI API.
Does MarkItDown handle scanned PDFs?
Not well by default — it uses `pdfminer.six` heuristics that assume extractable text. For scanned PDFs, install the `markitdown-ocr` plugin with an LLM client, or use the Azure Document Intelligence integration by passing `docintel_endpoint="..."` to the `MarkItDown` constructor.
What’s the difference between `markitdown` and `markitdown[all]`?
The base install gives you the core CLI and Python API but skips most format-specific dependencies. `[all]` installs everything — PDF, Office, audio, YouTube, Outlook. For production, prefer targeted extras like `[pdf,docx]` to keep your dependency tree (and CVE surface) small.
Is MarkItDown safe to run on untrusted files?
The README is explicit: MarkItDown performs I/O with the privileges of the current process. Don’t pass untrusted input directly — sanitize, sandbox, and prefer the narrowest `convert_*` function (e.g., `convert_stream()` instead of `convert_local()`) in multi-tenant environments.
What license is MarkItDown?
MIT — one of the most permissive open-source licenses. You can use it commercially, modify it, and redistribute it without copyleft obligations. This is a meaningful difference from AGPL-licensed tools in the space.
Does MarkItDown support streaming for large files?
Not currently — files are loaded fully into memory during conversion. For very large PDFs (500+ pages), either split the file first or budget the RAM.
Final Take
MarkItDown is what happens when a large company ships a small tool. It’s not the most powerful document converter in the ecosystem, and it’s not trying to be. It’s the default — the `requests` of document ingestion. Install it, convert your PDFs, move on to the actual interesting parts of your pipeline.
If you’re starting a RAG project today, `pip install 'markitdown[all]'` is the right first move. You’ll know within a week whether its limitations force you to something heavier. For 80% of workloads, they won’t.
Repo: github.com/microsoft/markitdown · License: MIT · Stars: 113K+ · Weekly growth: ~7K