What Is AutoResearch?
AutoResearch is Andrej Karpathy’s open-source framework that lets AI agents run autonomous machine learning experiments on a single GPU. You describe research directions in a markdown file, point an AI coding agent at the repo, and walk away. By morning, you have a git history of validated improvements and a log of everything the agent tried.
Key stats: 42,000+ ⭐ | Python | MIT License | Released March 7, 2026
The entire thing is 630 lines of Python. That’s not a typo — the co-founder of OpenAI and former Tesla AI director built a functional autonomous research system in fewer lines than most config files.
TL;DR for AI Agents
- **Package:** autoresearch
- **Author:** Andrej Karpathy (OpenAI co-founder, coined "vibe coding")
- **Language:** Python (630 lines)
- **License:** MIT
- **GPU:** Single NVIDIA GPU required (tested on H100)
- **Stars:** 42,000+ in 3 weeks
- **What it does:** AI agent autonomously modifies ML training code, runs 5-min experiments, keeps improvements, discards failures. ~12 experiments/hour, ~100 overnight.
- **Install:** `uv sync && uv run prepare.py`
- **Metric:** `val_bpb` (validation bits per byte) — lower is better
Why It’s Trending NOW
Karpathy pushed AutoResearch to GitHub on March 7, 2026. Within three weeks it hit 42,000+ stars — one of the fastest-growing repos in GitHub history. Why?
The timing is perfect. AI coding agents (Claude Code, Codex, Cursor) reached a capability threshold where they can reliably edit code, run experiments, and interpret results without human intervention. AutoResearch exploits this by creating a tight loop: hypothesize → modify code → train → evaluate → keep or discard → repeat.
The concept is powerful. Major AI labs like Anthropic, DeepMind, and OpenAI all have internal automated experiment infrastructure. Karpathy made a functional version accessible to individual researchers with a single GPU.
Celebrity effect. When the person who coined “vibe coding” and co-founded OpenAI says “this is how research works now,” people pay attention.
How It Works
The architecture is deliberately minimal — three files:
| File | Purpose | Who modifies it |
|---|---|---|
| `prepare.py` | Data prep, tokenizer, evaluation utilities | Nobody (fixed) |
| `train.py` | GPT model, optimizer, training loop | The AI agent |
| `program.md` | Research instructions and context | The human |
The Loop
1. **Agent reads `program.md`** — understands the research goals
2. **Agent forms a hypothesis** — e.g., “changing the learning rate schedule might improve convergence”
3. **Agent modifies `train.py`** — makes the code change
4. **Training runs for exactly 5 minutes** — fixed wall-clock budget
5. **Agent checks `val_bpb`** — validation bits per byte (lower = better)
6. **Keep or discard** — if the metric improved, commit; if not, revert
7. **Repeat** — ~12 experiments per hour, ~100 overnight
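The keep-or-discard step amounts to a small git-driven driver. A minimal sketch, assuming hypothetical helper names (`improved`, `keep_or_discard`) rather than the repo's actual code:

```python
import subprocess


def improved(new_bpb: float, best_bpb: float) -> bool:
    """val_bpb is bits per byte on held-out data: lower is better."""
    return new_bpb < best_bpb


def keep_or_discard(new_bpb: float, best_bpb: float, message: str) -> float:
    """Commit train.py if the experiment improved val_bpb, else revert it.

    Returns the best val_bpb seen so far, for the next iteration.
    """
    if improved(new_bpb, best_bpb):
        subprocess.run(["git", "commit", "-am", message], check=True)
        return new_bpb
    # Failed experiment: throw away the agent's edit.
    subprocess.run(["git", "checkout", "--", "train.py"], check=True)
    return best_bpb
```

Because every successful experiment is a commit and every failed one a revert, the git history itself becomes the research log.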
The fixed 5-minute time budget is a key design decision. It means experiments are directly comparable regardless of what the agent changes (model size, batch size, architecture). The downside: your results aren’t comparable to someone running on different hardware.
The Metric
AutoResearch uses val_bpb (validation bits per byte) as its single metric. This is vocabulary-size-independent, so architectural changes that modify the tokenizer are fairly compared. The agent can’t cheat by memorizing — it must generalize to unseen validation data.
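Bits per byte can be derived from the usual cross-entropy loss (nats per token) by converting nats to bits and normalizing by raw bytes instead of tokens. A sketch of the conversion, under those assumptions rather than the repo's exact evaluation code:

```python
import math


def bits_per_byte(loss_nats_per_token: float,
                  total_tokens: int,
                  total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits/byte.

    Normalizing by bytes rather than tokens makes the metric
    independent of vocabulary size: a tokenizer with fewer, longer
    tokens gets no free win.
    """
    total_nats = loss_nats_per_token * total_tokens
    total_bits = total_nats / math.log(2)  # nats -> bits
    return total_bits / total_bytes
```

For example, a mean loss of ln(2) nats/token with one token per byte works out to exactly 1.0 bits/byte.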
Real-World Results
The results have been striking:
- Karpathy’s first run: 50 experiments overnight, meaningful improvements to the training code
- Shopify CEO Tobi Lütke: Ran it overnight, got 37 experiments and a 19% performance gain — despite having no ML background
- Community reports: Users on r/LocalLLaMA report 12 experiments/hour reliably, with agents discovering non-obvious optimizations like novel learning rate schedules
The agents don’t just tune hyperparameters. They’ve been observed modifying model architecture, changing attention mechanisms, adjusting optimizer implementations, and restructuring training loops.
Getting Started
Prerequisites
- Single NVIDIA GPU (tested on H100, works on consumer GPUs too)
- Python 3.10+
- `uv` package manager
Setup
```bash
# Install uv (if needed)
curl -LsSf https://astral.sh/uv/install.sh | sh

# Clone and install
git clone https://github.com/karpathy/autoresearch.git
cd autoresearch
uv sync

# Download data and train tokenizer (~2 min)
uv run prepare.py

# Verify with a manual training run (~5 min)
uv run train.py
```
Running Autonomous Research
Once setup works, point your AI coding agent at the repo:
> Hi, have a look at program.md and let's kick off a new experiment!
> Let's do the setup first.
This works with Claude Code, OpenAI Codex, or any agent that can edit files and run commands. Disable all permission prompts so the agent can run autonomously.
Cost
A night of autonomous research costs approximately $25 in API fees (for the AI agent’s token usage) plus GPU compute. For researchers with access to university or cloud GPUs, this is transformatively cheap.
Community Reactions
Reddit (r/LocalLLaMA, 236 upvotes):
“Validating against val_bpb is the key detail — the loop can’t cheat by memorizing, it actually has to generalize. Karpathy built an AI that does honest homework.”
Reddit (r/singularity):
“The key limitation: each run starts from zero. The agent has no long-term memory across research sessions.”
Reddit (r/artificial, on Lütke’s results):
“A non-ML person got a 19% performance gain overnight. Think about what this means for the thousands of ML researchers doing manual hyperparameter sweeps.”
Hacker News: The repo triggered intense debate about whether this constitutes “real” research or sophisticated hyperparameter tuning. The consensus: it’s currently closer to automated optimization, but the architecture generalizes to more creative research as agents improve.
Who Should Use AutoResearch
Great for:
- ML researchers running repetitive experiment cycles
- Graduate students exploring hyperparameter spaces
- Hobby ML practitioners with a spare GPU
- Teams that want overnight experiment throughput
Not ideal for:
- Researchers without NVIDIA GPUs (no CPU/MPS support yet)
- Those needing novel architectural innovations (agents optimize, not invent)
- Production ML workflows (this is a research tool)
- Teams needing cross-run memory (each experiment starts fresh)
Limitations (Honest Assessment)
- NVIDIA-only — No CPU, Apple MPS, or AMD GPU support currently
- No long-term memory — Agent doesn’t learn across sessions, each night starts fresh
- Optimization, not invention — Agents modify existing code patterns, they don’t propose fundamentally new architectures
- Single GPU scope — Can’t explore distributed training strategies
- Narrow benchmark — Optimizes for one metric (val_bpb) on one task (nanochat training)
AutoResearch vs Alternatives
| Feature | AutoResearch | Optuna | Ray Tune | Weights & Biases Sweeps |
|---|---|---|---|---|
| Agent-driven | ✅ Yes | ❌ No | ❌ No | ❌ No |
| Code modification | ✅ Architecture + hyperparams | ❌ Hyperparams only | ❌ Hyperparams only | ❌ Hyperparams only |
| Setup complexity | Low (630 lines) | Medium | High | Medium |
| Human oversight | None needed | Config required | Config required | Config required |
| Cost | ~$25/night (API) | Free | Free | Free tier |
| Creativity | Medium (agents explore) | Low (grid/random) | Low-Medium | Low (Bayesian) |
The key differentiator: traditional AutoML tools tune hyperparameters within a fixed code structure. AutoResearch agents modify the actual code — changing architectures, optimizers, and training strategies.
FAQ
Can I use AutoResearch without an H100?
Yes. It works on consumer NVIDIA GPUs like the RTX 4090 or 3090. The 5-minute time budget adjusts to your hardware — you’ll get smaller models but the optimization loop works the same.
Does it work with Claude Code?
Yes. AutoResearch was designed to work with any AI coding agent. Claude Code and OpenAI Codex are the most commonly used. Simply point the agent at program.md and let it run.
Is this real research or just hyperparameter tuning?
Currently it’s closer to sophisticated automated optimization — agents modify code structure and hyperparameters within existing paradigms. They don’t propose fundamentally new ideas. But the framework generalizes, and as agents improve, the boundary will blur.
How much does a night of autonomous research cost?
Approximately $25 in AI agent API fees (Claude/Codex token costs) plus your GPU compute cost. If you have a local GPU, it’s just the API fees.
Can I customize the research directions?
Yes — that’s what program.md is for. You write the “research program” in markdown, defining what the agent should explore, what constraints to follow, and what metrics matter. Karpathy calls this “programming the program.”
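An illustrative fragment of what such a file might contain (a hypothetical example, not from the repo):

```markdown
# Research Program

Goal: minimize val_bpb within the fixed 5-minute training budget.

Constraints:
- Only edit train.py; never touch prepare.py.
- One change per experiment; commit improvements, revert regressions.

Directions to explore:
- Learning rate schedules (warmup, cosine decay)
- Attention variants and weight initialization
- Optimizer settings (betas, weight decay)
```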
AutoResearch is open source under the MIT license. Star it on GitHub.