What Is AutoResearch?
AutoResearch is Andrej Karpathy’s open-source framework that lets AI agents run autonomous machine learning experiments on a single GPU. You describe research directions in a markdown file, point an AI coding agent at the repo, and walk away. By morning, you have a git history of validated improvements and a log of everything the agent tried.
Key stats: 42,000+ ⭐ | Python | MIT License | Released March 7, 2026
The entire thing is 630 lines of Python. That’s not a typo — the co-founder of OpenAI and former Tesla AI director built a functional autonomous research system in fewer lines than most config files.
TL;DR for AI Agents
- **Package:** autoresearch
- **Author:** Andrej Karpathy (OpenAI co-founder, coined "vibe coding")
- **Language:** Python (630 lines)
- **License:** MIT
- **GPU:** Single NVIDIA GPU required (tested on H100)
- **Stars:** 42,000+ in 3 weeks
- **What it does:** AI agent autonomously modifies ML training code, runs 5-min experiments, keeps improvements, discards failures. ~12 experiments/hour, ~100 overnight.
- **Install:** `uv sync && uv run prepare.py`
- **Metric:** `val_bpb` (validation bits per byte) — lower is better
Why It’s Trending NOW
Karpathy pushed AutoResearch to GitHub on March 7, 2026. Within three weeks it hit 42,000+ stars — one of the fastest-growing repos in GitHub history. Why?
The timing is perfect. AI coding agents (Claude Code, Codex, Cursor) reached a capability threshold where they can reliably edit code, run experiments, and interpret results without human intervention. AutoResearch exploits this by creating a tight loop: hypothesize → modify code → train → evaluate → keep or discard → repeat.
The concept is powerful. Major AI labs like Anthropic, DeepMind, and OpenAI all have internal automated experiment infrastructure. Karpathy made a functional version accessible to individual researchers with a single GPU.
Celebrity effect. When the person who coined “vibe coding” and co-founded OpenAI says “this is how research works now,” people pay attention.
How It Works
The architecture is deliberately minimal — three files:
| File | Purpose | Who modifies it |
|---|---|---|
| `prepare.py` | Data prep, tokenizer, evaluation utilities | Nobody (fixed) |
| `train.py` | GPT model, optimizer, training loop | The AI agent |
| `program.md` | Research instructions and context | The human |
The Loop
1. **Agent reads `program.md`** — understands the research goals
2. **Agent forms a hypothesis** — e.g., “changing the learning rate schedule might improve convergence”
3. **Agent modifies `train.py`** — makes the code change
4. **Training runs for exactly 5 minutes** — fixed wall-clock budget
5. **Agent checks `val_bpb`** — validation bits per byte (lower = better)
6. **Keep or discard** — if the metric improved, commit; if not, revert
7. **Repeat** — ~12 experiments per hour, ~100 overnight
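The keep-or-discard step amounts to a small git-driven driver. A minimal sketch, assuming hypothetical helper names (`improved`, `keep_or_discard`) rather than the repo's actual code:

```python
import subprocess


def improved(new_bpb: float, best_bpb: float) -> bool:
    """val_bpb is bits per byte on held-out data: lower is better."""
    return new_bpb < best_bpb


def keep_or_discard(new_bpb: float, best_bpb: float, message: str) -> float:
    """Commit train.py if the experiment improved val_bpb, else revert it.

    Returns the best val_bpb seen so far, for the next iteration.
    """
    if improved(new_bpb, best_bpb):
        subprocess.run(["git", "commit", "-am", message], check=True)
        return new_bpb
    # Failed experiment: throw away the agent's edit.
    subprocess.run(["git", "checkout", "--", "train.py"], check=True)
    return best_bpb
```

Because every successful experiment is a commit and every failed one a revert, the git history itself becomes the research log.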
The fixed 5-minute time budget is a key design decision. It means experiments are directly comparable regardless of what the agent changes (model size, batch size, architecture). The downside: your results aren’t comparable to someone running on different hardware.
The Metric
AutoResearch uses val_bpb (validation bits per byte) as its single metric. This is vocabulary-size-independent, so architectural changes that modify the tokenizer are fairly compared. The agent can’t cheat by memorizing — it must generalize to unseen validation data.
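Bits per byte can be derived from the usual cross-entropy loss (nats per token) by converting nats to bits and normalizing by raw bytes instead of tokens. A sketch of the conversion, under those assumptions rather than the repo's exact evaluation code:

```python
import math


def bits_per_byte(loss_nats_per_token: float,
                  total_tokens: int,
                  total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits/byte.

    Normalizing by bytes rather than tokens makes the metric
    independent of vocabulary size: a tokenizer with fewer, longer
    tokens gets no free win.
    """
    total_nats = loss_nats_per_token * total_tokens
    total_bits = total_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes
```

For example, a mean loss of ln(2) nats/token with one token per byte works out to exactly 1.0 bits/byte.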
Real-World Results
The results have been striking:
- Karpathy’s first run: 50 experiments overnight, meaningful improvements to the training code
- Shopify CEO Tobi Lütke: Ran it overnight, got 37 experiments and a 19% performance gain — despite having no ML background
- Community reports: Users on r/LocalLLaMA report 12 experiments/hour reliably, with agents discovering non-obvious optimizations like novel learning rate schedules
The agents don’t just tune hyperparameters. They’ve been observed modifying model architecture, changing attention mechanisms, adjusting optimizer implementations, and restructuring training loops.
Getting Started
Prerequisites
- Single NVIDIA GPU (tested on H100, works on consumer GPUs too)
- Python 3.10+
- `uv` package manager
Setup
```bash
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

# Download data and train tokenizer (~2 min)
uv run prepare.py

# Verify with a manual training run (~5 min)
uv run train.py
```
Running Autonomous Research
Once setup works, point your AI coding agent at the repo:
> Hi, have a look at program.md and let's kick off a new experiment!
> Let's do the setup first.
This works with Claude Code, OpenAI Codex, or any agent that can edit files and run commands. Disable all permission prompts so the agent can run autonomously.
Cost
A night of autonomous research costs approximately $25 in API fees (for the AI agent’s token usage) plus GPU compute. For researchers with access to university or cloud GPUs, this is transformatively cheap.
Community Reactions
Reddit (r/LocalLLaMA, 236 upvotes):
“Validating against val_bpb is the key detail — the loop can’t cheat by memorizing, it actually has to generalize. Karpathy built an AI that does honest homework.”
Reddit (r/singularity):
“The key limitation: each run starts from zero. The agent has no long-term memory across research sessions.”
Reddit (r/artificial, on Lütke’s results):
“A non-ML person got a 19% performance gain overnight. Think about what this means for the thousands of ML researchers doing manual hyperparameter sweeps.”
Hacker News: The repo triggered intense debate about whether this constitutes “real” research or sophisticated hyperparameter tuning. The consensus: it’s currently closer to automated optimization, but the architecture generalizes to more creative research as agents improve.
Who Should Use AutoResearch
Great for:
- ML researchers running repetitive experiment cycles
- Graduate students exploring hyperparameter spaces
- Hobby ML practitioners with a spare GPU
- Teams that want overnight experiment throughput
Not ideal for:
- Researchers without NVIDIA GPUs (no CPU/MPS support yet)
- Those needing novel architectural innovations (agents optimize, not invent)
- Production ML workflows (this is a research tool)
- Teams needing cross-run memory (each experiment starts fresh)
Limitations (Honest Assessment)
- NVIDIA-only — No CPU, Apple MPS, or AMD GPU support currently
- No long-term memory — Agent doesn’t learn across sessions, each night starts fresh
- Optimization, not invention — Agents modify existing code patterns, they don’t propose fundamentally new architectures
- Single GPU scope — Can’t explore distributed training strategies
- Narrow benchmark — Optimizes for one metric (val_bpb) on one task (nanochat training)
AutoResearch vs Alternatives
| Feature | AutoResearch | Optuna | Ray Tune | Weights & Biases Sweeps |
|---|---|---|---|---|
| Agent-driven | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Code modification | ✅ Architecture + hyperparams | ❌ Hyperparams only | ❌ Hyperparams only | ❌ Hyperparams only |
| Setup complexity | Low (630 lines) | Medium | High | Medium |
| Human oversight | None needed | Config required | Config required | Config required |
| Cost | ~$25/night (API) | Free | Free | Free tier |
| Creativity | Medium (agents explore) | Low (grid/random) | Low-Medium | Low (Bayesian) |
The key differentiator: traditional AutoML tools tune hyperparameters within a fixed code structure. AutoResearch agents modify the actual code — changing architectures, optimizers, and training strategies.
FAQ
Can I use AutoResearch without an H100?
Yes. It works on consumer NVIDIA GPUs like the RTX 4090 or 3090. The 5-minute time budget adjusts to your hardware — you’ll get smaller models but the optimization loop works the same.
Does it work with Claude Code?
Yes. AutoResearch was designed to work with any AI coding agent. Claude Code and OpenAI Codex are the most commonly used. Simply point the agent at program.md and let it run.
Is this real research or just hyperparameter tuning?
Currently it’s closer to sophisticated automated optimization — agents modify code structure and hyperparameters within existing paradigms. They don’t propose fundamentally new ideas. But the framework generalizes, and as agents improve, the boundary will blur.
How much does a night of autonomous research cost?
Approximately $25 in AI agent API fees (Claude/Codex token costs) plus your GPU compute cost. If you have a local GPU, it’s just the API fees.
Can I customize the research directions?
Yes — that’s what program.md is for. You write the “research program” in markdown, defining what the agent should explore, what constraints to follow, and what metrics matter. Karpathy calls this “programming the program.”
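An illustrative fragment of what such a file might contain (a hypothetical example, not from the repo):

```markdown
# Research Program

Goal: minimize val_bpb within the fixed 5-minute training budget.

Constraints:
- Only edit train.py; never touch prepare.py.
- One change per experiment; commit improvements, revert regressions.

Directions to explore:
- Learning rate schedules (warmup, cosine decay)
- Attention variants and weight initialization
- Optimizer settings (betas, weight decay)
```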
AutoResearch is open source under the MIT license. Star it on GitHub.