TL;DR
MarkItDown is Microsoft’s open-source Python utility that converts PDFs, Office docs, HTML, images, audio, and ZIP archives into clean Markdown — the format LLMs natively “speak.” It’s trending on GitHub with 113,000+ stars (and ~7,000 new this week), making it one of the most popular document-ingestion tools in the AI ecosystem. Key facts:
- Converts 14+ formats — PDF, PowerPoint, Word, Excel, images, audio, HTML, CSV, JSON, XML, EPub, YouTube URLs, ZIP files, and more
- MIT licensed — free for commercial use, no AGPL strings attached
- CLI + Python API — `markitdown file.pdf > out.md` or `MarkItDown().convert("file.pdf")`
- Plugin architecture — `markitdown-ocr` adds LLM-powered OCR via OpenAI-compatible clients
- Azure Document Intelligence integration for high-fidelity PDFs
- Optional per-format installs — `pip install 'markitdown[pdf,docx,pptx]'` if you don’t want every dependency
- Python ≥3.10, Docker image available
- Run without installing via `uvx markitdown file.pdf`
- Honest limitation: Markdown tables are lossy for complex spreadsheets, and PDF extraction is heuristic-based unless you pay for Document Intelligence
If you’re building a RAG pipeline, feeding documents to an agent, or batch-processing a corpus for fine-tuning, MarkItDown is the simplest “boring tool that works” in a space full of overengineered startups.
The Problem MarkItDown Solves
Every serious LLM pipeline runs into the same bottleneck on day one: your data is not in a format the model likes.
You have PDFs from legal, PowerPoints from marketing, Excel files from finance, Word docs from HR, and a Confluence export full of HTML soup. LLMs want clean Markdown — structured enough to preserve headings, lists, and tables, but simple enough that every token carries signal instead of formatting noise. GPT-4o, Claude, and Gemini all demonstrably perform better on Markdown input than on raw HTML or PDF text, and Markdown is also the most token-efficient structured format you can feed them.
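A crude way to see the overhead: the same snippet of content as HTML and as Markdown, using character counts as a rough proxy for tokens (real tokenizers differ, so treat this as directional only):

```python
# Same content as HTML and as Markdown; character count is a crude
# proxy for token count, but the direction of the gap holds.
html = "<h1>Q3 Report</h1><ul><li><strong>Revenue</strong>: up 12%</li></ul>"
mdwn = "# Q3 Report\n\n- **Revenue**: up 12%\n"

print(len(html), len(mdwn))
```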
The existing options each have problems:
- `textract` — unmaintained, no Markdown output, loses structure
- `unstructured` — powerful but heavy, opinionated, and often overkill
- `docling` (IBM) — excellent for PDFs, but narrower format coverage
- LlamaParse / Azure Document Intelligence — cloud-only, paid, privacy-sensitive
- Homegrown scripts — everyone writes the same `pdfminer` + `python-docx` + `pandas` glue
MarkItDown is Microsoft’s answer: a small, single-purpose library that handles the long tail of formats with reasonable defaults, preserves document structure as Markdown, and stays out of your way. As one Hacker News commenter put it after reading the source: “I really hope the end state is a simple project like this, easy to understand and easy to deploy.”
Install in 30 Seconds
The fastest path, if you have uv installed, is zero-install:
```shell
# Run once, no virtualenv needed
uvx markitdown path-to-file.pdf > document.md
```
Traditional install:
```shell
python -m venv .venv
source .venv/bin/activate
pip install 'markitdown[all]'
```
`[all]` pulls in every optional dependency. If you know you only need a subset, target it:
```shell
# PDF + DOCX + PPTX only — smaller install, smaller attack surface
pip install 'markitdown[pdf,docx,pptx]'
```
Available extras: `pdf`, `docx`, `pptx`, `xlsx`, `xls`, `outlook`, `az-doc-intel`, `audio-transcription`, `youtube-transcription`, and `all`.
For production pipelines, the Docker route avoids dependency drift:
```shell
docker build -t markitdown:latest .
docker run --rm -i markitdown:latest < ~/your-file.pdf > output.md
```
First Run: CLI
Basic conversion is exactly what you’d hope:
```shell
# Output to stdout
markitdown quarterly-report.pdf

# Write to a file
markitdown quarterly-report.pdf -o quarterly-report.md

# Pipe from stdin (works great in shell pipelines)
cat quarterly-report.pdf | markitdown > quarterly-report.md
```
List installed plugins:
```shell
markitdown --list-plugins
```
Enable plugins for a conversion:
```shell
markitdown --use-plugins scan-of-contract.pdf
```
Python API: The Part You’ll Actually Use
Most production usage happens via the Python API, because you typically want to chain conversion into chunking, embedding, and vector storage.
Basic conversion
```python
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=False)
result = md.convert("test.xlsx")
print(result.text_content)
```
`result.text_content` is a string of Markdown. No temp files, no surprise I/O.
With LLM-powered image descriptions
PowerPoint decks and images can be described in-context by an LLM — MarkItDown will call your client and insert the description as alt text:
```python
from markitdown import MarkItDown
from openai import OpenAI

client = OpenAI()
md = MarkItDown(
    llm_client=client,
    llm_model="gpt-4o",
    llm_prompt="Describe this image for a blind user in one sentence.",
)
result = md.convert("investor-deck.pptx")
print(result.text_content)
```
This is where MarkItDown quietly gets interesting: you can point `llm_client` at any OpenAI-compatible endpoint — including a local Ollama or vLLM server — and keep the whole pipeline on-premises.
With Azure Document Intelligence (high-fidelity PDFs)
When heuristic PDF extraction isn’t good enough (scanned documents, complex layouts, tables), swap in Azure DI:
```python
from markitdown import MarkItDown

md = MarkItDown(docintel_endpoint="https://<your-resource>.cognitiveservices.azure.com/")
result = md.convert("scanned-invoice.pdf")
print(result.text_content)
```
Building a small RAG ingestion script
Here’s a realistic end-to-end pattern — walk a folder, convert everything to Markdown, and write it to a parallel tree ready for chunking:
```python
from pathlib import Path
from markitdown import MarkItDown

md = MarkItDown(enable_plugins=True)

SRC = Path("./raw-docs")
DST = Path("./markdown-docs")
EXTS = {".pdf", ".docx", ".pptx", ".xlsx", ".html", ".epub"}

for src_path in SRC.rglob("*"):
    if src_path.suffix.lower() not in EXTS:
        continue
    rel = src_path.relative_to(SRC)
    dst_path = (DST / rel).with_suffix(".md")
    dst_path.parent.mkdir(parents=True, exist_ok=True)
    try:
        result = md.convert(str(src_path))
        dst_path.write_text(result.text_content, encoding="utf-8")
        print(f"✓ {rel}")
    except Exception as e:
        print(f"✗ {rel}: {e}")
```
Pipe that output into your chunker of choice (LangChain’s `MarkdownHeaderTextSplitter`, LlamaIndex’s `MarkdownNodeParser`, or a 30-line custom splitter), embed, and you’re done.
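The 30-line custom splitter is genuinely viable. Here’s a minimal sketch — naive by design (it doesn’t special-case `#` characters inside fenced code blocks), but enough for clean converter output:

```python
import re

def split_on_headers(markdown: str, max_level: int = 2) -> list[str]:
    """Split Markdown into chunks at headings of level <= max_level."""
    pattern = re.compile(rf"^(#{{1,{max_level}}})\s", re.MULTILINE)
    starts = [m.start() for m in pattern.finditer(markdown)] or [0]
    if starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble before the first heading
    bounds = starts + [len(markdown)]
    # Deeper headings (###, ####) stay attached to their parent chunk.
    return [markdown[a:b].strip() for a, b in zip(bounds, bounds[1:]) if markdown[a:b].strip()]

doc = "# Intro\nhello\n## Details\nmore\n### Deep\nkept with parent\n"
chunks = split_on_headers(doc)
```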
Under the Hood
MarkItDown is a thin orchestration layer over battle-tested libraries — which is exactly the right design for this problem. According to InfoWorld’s review, it uses:
- `mammoth` for DOCX → HTML → Markdown
- `python-pptx` for PowerPoint slides
- `pandas` for Excel (tabular → pipe-style tables)
- `pdfminer.six` for PDF text extraction heuristics
- `BeautifulSoup` for HTML cleaning
- `speech_recognition` for audio transcription
- EXIF parsers for image metadata
The architecture is deliberately shallow: each format has a `convert_*` function, plugins register new converters, and everything returns a `DocumentConverterResult` with a `text_content` field. You can audit the whole pipeline in an afternoon.
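In sketch form, the converter contract looks roughly like this. Note this is a simplified stand-in, not MarkItDown’s actual classes — the names mirror the pattern described above, but check the repo’s sample plugin for the real interface:

```python
import io

class DocumentConverterResult:
    # Simplified stand-in for markitdown's result type; the real class
    # lives in the markitdown package and carries more fields.
    def __init__(self, text_content: str):
        self.text_content = text_content

class TxtConverter:
    """Toy converter in the plugin shape: accept a format, emit Markdown."""

    def accepts(self, filename: str) -> bool:
        # Real converters also inspect MIME types and extension hints.
        return filename.lower().endswith(".txt")

    def convert(self, stream: io.TextIOBase) -> DocumentConverterResult:
        # "Conversion" here is trivial; real converters rebuild headings,
        # lists, and tables from the source document's structure.
        return DocumentConverterResult(text_content=stream.read().strip())
```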
The Plugin Ecosystem
Microsoft exposed a simple plugin API, and the community has started filling gaps. The official `markitdown-ocr` plugin is the most interesting — it adds OCR to PDF, DOCX, PPTX, and XLSX converters by reusing the same `llm_client` / `llm_model` pattern:
```shell
pip install markitdown-ocr
pip install openai  # or any OpenAI-compatible client
```

```python
from markitdown import MarkItDown
from openai import OpenAI

md = MarkItDown(
    enable_plugins=True,
    llm_client=OpenAI(),
    llm_model="gpt-4o",
)

result = md.convert("document_with_images.pdf")
print(result.text_content)
```
No Tesseract install, no new ML dependencies — the LLM does the OCR. If you don’t provide an `llm_client`, OCR is silently skipped and the standard converter runs. Search GitHub for `#markitdown-plugin` to find community plugins.
Community Reactions
The Hacker News thread on MarkItDown (500+ points) captures the tool’s reception well. A few representative comments from engineers who’d worked on similar in-house tooling:
> “I worked on an in-house version of this feature for my employer. After reading the source code, I can say this is a pretty reasonable implementation of this type of thing. But I would avoid using it for images, since the LLM providers let you just pass images directly, and I would also avoid using it for spreadsheets, since LLMs are very bad at interpreting Markdown tables.”
> “If you have uv installed you can run this against a file without first installing anything like this: `uvx markitdown path-to-file.pdf`. I’ve tried it against HTML and PDFs so far and it seems pretty decent.”
> “I really hope the end state is a simple project like this, easy to understand and easy to deploy. I do wish it had a knob to turn for ‘how much processing do you want me to do.’ For PDF specifically, you either have to get a crappy version of the plain text using heuristics in a way that is very sensitive to how the PDF is exported, or you have to go full OCR.”
On r/ObsidianMD, the note-taking community picked it up immediately as a bulk-import tool for moving away from proprietary formats. The InfoWorld write-up called it “Microsoft’s quiet but significant contribution to the open-source AI tooling stack.”
Honest Limitations
This isn’t magic, and the trade-offs are real:
- Markdown tables are lossy. Complex spreadsheets with merged cells, formulas, or multi-header rows lose information. The HN engineer’s advice — “pass structured data to a code interpreter instead” — is correct for any non-trivial tabular workload.
- PDF extraction is heuristic without Azure DI. Scanned PDFs, multi-column academic papers, and anything with footnotes come out messy. You’re one step away from OCR and it shows.
- No “processing intensity” knob. You get the default extraction path; you can’t ask for “try harder” without switching to Document Intelligence or `markitdown-ocr`.
- Image handling is weak by default. Without an `llm_client`, images become EXIF metadata and nothing else. With one, you’re paying tokens for every picture.
- Audio transcription uses `speech_recognition` by default — fine for clean speech, poor for noisy or accented audio. For production, route audio through Whisper separately.
- Security surface grows with `[all]`. Pulling in every dependency means pulling in every CVE. The per-format install option exists for a reason.
- No streaming API. Large files are loaded fully into memory. For 500-page PDFs, watch your RAM.
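The first limitation suggests its own workaround: for tabular data, skip the Markdown flattening entirely and keep rows structured. A stdlib-only sketch with inline sample data:

```python
import csv
import io
import json

# A Markdown pipe table would flatten this to text; keeping the rows as
# dicts/JSON preserves structure a code interpreter can actually use.
raw = "region,q1,q2\nEMEA,100,120\nAPAC,90,140\n"
rows = list(csv.DictReader(io.StringIO(raw)))
payload = json.dumps(rows)  # hand this to the model instead of a pipe table
```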
MarkItDown vs. The Alternatives
| Tool | Formats | License | Best For | Weakness |
|---|---|---|---|---|
| MarkItDown | 14+ | MIT | General LLM pipelines, CLI scripts | Heuristic PDF extraction |
| Unstructured.io | 20+ | Apache 2.0 | Enterprise RAG with chunking built-in | Heavy, opinionated |
| Docling (IBM) | 6 (PDF-focused) | MIT | High-fidelity PDF → Markdown | Narrow format coverage |
| LlamaParse | PDF + others | Proprietary | Best PDF quality, cloud-only | Paid, privacy-sensitive |
| Azure Document Intelligence | PDFs | Proprietary | Enterprise PDFs with forms/tables | Paid, Azure lock-in |
| textract | Many | MIT | Legacy scripts | Unmaintained, no Markdown |
Rule of thumb: start with MarkItDown. If PDF quality isn’t good enough, add `markitdown-ocr` or pipe through Docling. If you need enterprise chunking/metadata, graduate to Unstructured.
Who Should Use This
Good fit:
- RAG developers who need broad format coverage without vendor lock-in
- Teams building agent pipelines that ingest arbitrary user uploads
- Anyone migrating a document corpus to Obsidian, Logseq, or a static site
- Hobbyists who want `uvx markitdown file.pdf` and nothing else
- Privacy-conscious shops — the library is local unless you wire in an LLM
Not a fit:
- Scanned-PDF-heavy legal or medical workflows (use Azure DI or Docling)
- Tabular-first finance pipelines (use pandas → CSV → code interpreter)
- High-fidelity document-to-document conversion for human consumption
FAQ
How does MarkItDown compare to Unstructured.io?
MarkItDown is smaller and simpler — roughly 2,000 lines of orchestration code over mature libraries. Unstructured is a full ingestion framework with chunking, metadata extraction, and pluggable backends. If you’re writing a script, use MarkItDown. If you’re building a platform, consider Unstructured.
Can I run MarkItDown fully offline?
Yes. The default conversion path is 100% local. LLM-powered image descriptions and `markitdown-ocr` require an LLM client, but you can point that client at a local Ollama or vLLM server instead of the OpenAI API.
Does MarkItDown handle scanned PDFs?
Not well by default — it uses `pdfminer.six` heuristics that assume extractable text. For scanned PDFs, install the `markitdown-ocr` plugin with an LLM client, or use the Azure Document Intelligence integration by passing `docintel_endpoint="..."` to the `MarkItDown` constructor.
What’s the difference between `markitdown` and `markitdown[all]`?
The base install gives you the core CLI and Python API but skips most format-specific dependencies. `[all]` installs everything — PDF, Office, audio, YouTube, Outlook. For production, prefer targeted extras like `[pdf,docx]` to keep your dependency tree (and CVE surface) small.
Is MarkItDown safe to run on untrusted files?
The README is explicit: MarkItDown performs I/O with the privileges of the current process. Don’t pass untrusted input directly — sanitize, sandbox, and prefer the narrowest `convert_*` function (e.g., `convert_stream()` instead of `convert_local()`) in multi-tenant environments.
What license is MarkItDown?
MIT — one of the most permissive open-source licenses. You can use it commercially, modify it, and redistribute it without copyleft obligations. This is a meaningful difference from AGPL-licensed tools in the space.
Does MarkItDown support streaming for large files?
Not currently — files are loaded fully into memory during conversion. For very large PDFs (500+ pages), either split the file first or budget the RAM.
Final Take
MarkItDown is what happens when a large company ships a small tool. It’s not the most powerful document converter in the ecosystem, and it’s not trying to be. It’s the default — the `requests` of document ingestion. Install it, convert your PDFs, move on to the actual interesting parts of your pipeline.
If you’re starting a RAG project today, `pip install 'markitdown[all]'` is the right first move. You’ll know within a week whether its limitations force you to something heavier. For 80% of workloads, they won’t.
Repo: github.com/microsoft/markitdown · License: MIT · Stars: 113K+ · Weekly growth: ~7K