What is OpenAI GeneBench-Pro?

GeneBench-Pro is a research-grade benchmark released by OpenAI on June 30, 2026. It consists of 129 synthetic computational biology problems that test whether AI agents can navigate messy biological data, choose the right analytical path, and make judgment calls — not just recognize patterns. It covers genomics, quantitative biology, and translational biomedicine.

How did models score on GeneBench-Pro?

OpenAI's best model, GPT-5.6 Sol, achieved a pass rate of 28.7% at the highest reasoning setting, increasing to 31.5% in Pro mode. For context, the strongest model available when the original GeneBench was introduced scored below 5%. These scores reveal that even frontier AI struggles significantly with open-ended scientific research tasks that require human-level judgment.

What makes GeneBench-Pro different from other AI benchmarks?

Most benchmarks test pattern matching or factual recall. GeneBench-Pro tests 'research taste' — the ability to choose the right analysis, handle messy real-world data, iterate when an approach fails, and make judgment calls about methodology. Each problem is designed to take a human expert 20-40 hours to solve. The benchmark explicitly requires multi-step reasoning where the agent must decide what to do next, not just answer a question.

Why does GeneBench-Pro matter for AI progress?

GeneBench-Pro targets a harder kind of intelligence than existing benchmarks. If AI can assist meaningfully with computational biology research that currently consumes hundreds of hours of PhD-level human labor, it could accelerate drug discovery, genetic diagnostics, and personalized medicine. The low scores (sub-32%) show how far there is to go, but they also establish a baseline for measuring whether future models are genuinely improving at scientific judgment.

Is GeneBench-Pro publicly available?

Yes. OpenAI published GeneBench-Pro alongside a bioRxiv preprint describing the benchmark design, methodology, and external review process. The benchmark problems, datasets, and evaluation framework are accessible. Researchers and AI labs can use it to evaluate their own models. It's hosted through OpenAI's standard research-paper and GitHub channels.

OpenAI GeneBench-Pro: AI Genomics Benchmark Explained (July 2026)

Published: July 5, 2026

On June 30, 2026, OpenAI released GeneBench-Pro — a benchmark that asks a fundamentally harder question than most AI evaluations: can AI models do real computational biology research?

Not multiple choice. Not pattern matching. Not retrieving a known fact from training data. GeneBench-Pro gives an AI agent messy biological data and asks it to navigate the same kind of ambiguous, judgment-heavy analytical work that computational biologists perform every day.

The headline result: GPT-5.6 Sol, OpenAI’s most capable model, scored 28.7% (31.5% in Pro mode). The strongest model available when the original GeneBench launched scored below 5%.

What GeneBench-Pro Actually Tests

Most AI benchmarks work like a standardized test: here’s a question, pick the right answer. GeneBench-Pro works more like a PhD qualifying exam crossed with a real research project.

Key traits of GeneBench-Pro problems:

129 multi-stage problems across genomics, quantitative biology, and translational biomedicine
Each problem maps to tasks that take human experts 20–40 hours to complete
Problems require iterative analysis — the agent must decide what to do when initial approaches fail
Messy real-world data — not clean, pre-processed datasets but the kind of noisy, incomplete biological data researchers actually deal with
Tests “research taste” — choosing the right analytical path among many plausible options

The problems span 10 detailed case studies covering real-world challenges in genetic association studies, single-cell RNA sequencing analysis, epigenetics, population genetics, and clinical genomics.

How Models Performed

Model	Pass Rate (highest reasoning)	Pro Mode
GPT-5.6 Sol	28.7%	31.5%
GPT-5.5	~22%	~25%
GPT-5.4	~16%	~19%
Original GeneBench best model	<5%	—

Note: Results for non-OpenAI models (Claude, Gemini) on GeneBench-Pro are not yet publicly available as of July 5, 2026.

The low absolute scores aren’t surprising: they reflect that the benchmark is genuinely hard for machines. Human computational biologists also find these problems demanding, though they succeed at a higher rate.
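To make the pass rates concrete, the short sketch below converts them into approximate problem counts. It assumes a pass rate is simply solved problems divided by the 129 problems in the benchmark; the article does not say how the rate is aggregated, and the helper name `approx_solved` is made up for illustration.

```python
# Convert reported pass rates into approximate problem counts.
# ASSUMPTION: pass rate = solved problems / total problems. How the
# benchmark actually aggregates results (e.g. over repeated runs) is
# not stated in the article.
TOTAL_PROBLEMS = 129

reported_pass_rates = {
    "GPT-5.6 Sol (highest reasoning)": 0.287,
    "GPT-5.6 Sol (Pro mode)": 0.315,
}


def approx_solved(pass_rate: float, total: int = TOTAL_PROBLEMS) -> float:
    """Return the implied number of solved problems (may be fractional)."""
    return pass_rate * total


for label, rate in reported_pass_rates.items():
    print(f"{label}: ~{approx_solved(rate):.1f} of {TOTAL_PROBLEMS} problems")
```

Under that assumption, 28.7% corresponds to about 37 of the 129 problems and 31.5% to about 41. The fractional value for Pro mode hints that the reported rate may be averaged over several attempts, but that is a guess rather than something the article confirms.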

Why This Matters for AI Progress

GeneBench-Pro targets a capability gap that most existing benchmarks miss:

Scientific judgment vs. pattern recognition — Can an AI tell the difference between a correct but insignificant result and an incorrect but statistically significant one?
Multi-step research reasoning — When a GWAS analysis returns no significant associations, should the model try a different imputation method, adjust for population stratification, or question the study design?
Data quality assessment — Can the model identify that a batch effect is driving the signal before proceeding to downstream analysis?
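As an illustration of the kind of check behind the population-stratification question above, here is a minimal sketch of the genomic inflation factor, a standard GWAS diagnostic. It is not taken from GeneBench-Pro itself: the function name and the toy p-values are assumptions made for this example, and it uses only the Python standard library.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Genomic inflation factor (lambda): the median observed 1-df
    chi-square statistic divided by its expected median under the null.

    Expects two-sided p-values in the interval (0, 1].
    """
    nd = NormalDist()
    # A two-sided p-value maps to a 1-df chi-square statistic via
    # chi2 = (z_{1 - p/2})^2, where z is the standard normal quantile.
    chi2 = [nd.inv_cdf(1 - p / 2) ** 2 for p in p_values]
    # The median of a 1-df chi-square is (z_{0.75})^2, roughly 0.455.
    expected_median = nd.inv_cdf(0.75) ** 2
    return median(chi2) / expected_median


# Toy data (hypothetical): evenly spaced p-values mimic a well-behaved
# null distribution; squaring them shifts mass toward small p-values.
null_like = [(i + 0.5) / 1000 for i in range(1000)]
inflated = [p ** 2 for p in null_like]

print(f"null-like lambda: {genomic_inflation(null_like):.2f}")
print(f"inflated lambda:  {genomic_inflation(inflated):.2f}")
```

A value close to 1 suggests the test statistics behave as expected under the null, while values well above 1 are commonly read as a sign of systematic inflation, for example from uncorrected population stratification. Deciding what to do about that is the judgment call the benchmark is described as probing.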

The 28.7% score means GPT-5.6 Sol can handle some research-stage work but still needs human supervision for anything that requires genuine scientific judgment.

Implications for Drug Discovery and Genomics

If GeneBench-Pro scores improve significantly in the next generation of models, the implications are substantial:

Automated hypothesis generation from GWAS and sequencing data
Accelerated variant interpretation for rare disease diagnosis
Multi-omics data integration that currently requires specialized bioinformatics teams
Reproducibility checking — having AI verify that analysis methods are appropriate for the data

For biotech and pharma companies, GeneBench-Pro provides a concrete way to evaluate whether AI models are ready for research-stage work vs. just literature retrieval.

How It Compares to Other Benchmarks

Benchmark	What It Tests	Top Score (July 2026)	Human Performance Gap
GeneBench-Pro	Computational biology research judgment	31.5% (GPT-5.6 Sol Pro)	Large — humans far exceed this
SWE-bench Verified	Software engineering task completion	~58% (Opus 4.8)	Closing — AI approaching junior dev level
Humanity’s Last Exam	Expert-level knowledge across domains	~18% (Opus 4.8)	Very large
GPQA Diamond	Graduate-level science Q&A	~85% (Mythos Preview)	Near-human
FrontierMath	Advanced mathematical problem-solving	~32% (Opus 4.8)	Large

The Bottom Line

GeneBench-Pro is arguably the most important AI benchmark released in mid-2026 because it measures something that matters for real scientific progress: can AI help with actual research, not just literature search?

The 28.7% score is simultaneously encouraging and humbling. It shows that frontier models can make meaningful progress on tasks that were nearly out of reach when the original GeneBench launched (<5%), while confirming that genuine scientific judgment remains a human strength, for now.

Published July 5, 2026. GeneBench-Pro was released June 30, 2026 by OpenAI. Paper and benchmark available via openai.com and bioRxiv. Non-OpenAI model scores were not publicly available at time of writing.