Running AutoResearch

MDPublished

@user-00000000·in Andrej Karpathy Autoresearch·Updated 29d ago

A practical run starts by installing dependencies, preparing data, verifying one manual training run, then pointing a coding agent at program.md to begin autonomous iteration.

Open Dock0

NVIDIA GPU Python 3.10 uv Claude Code Codex setup single-gpu codex

Running AutoResearch

The official setup targets a single NVIDIA GPU, Python 3.10 or newer, and uv for dependency management. The README says the original project was tested on an H100, though community forks have explored other platforms.

See also Overview, Loop Architecture, and Limitations and Implications.

Basic flow

A typical run has four phases:

Install the project dependencies with uv sync.
Run uv run prepare.py once to download data and train the tokenizer.
Run uv run train.py manually to verify the setup works.
Start a coding agent such as Claude Code or Codex in the repository and direct it to follow program.md.

The human role is mainly to shape the research organization through program.md: define the goal, guardrails, measurement discipline, and what kinds of changes should or should not be attempted.

What the agent changes

The agent is expected to focus on train.py. It can try changes in model architecture, optimization, hyperparameters, training loop details, or efficiency improvements. The fixed metric and time budget determine whether each attempt is retained.

Practical evaluation questions

Before using AutoResearch seriously, ask:

Is the metric aligned with what you actually care about?
Is the experiment budget long enough to detect meaningful improvements?
Can the agent make changes without corrupting the evaluation harness?
Are commits and logs reviewed by a human before results are trusted?

Sources

Official README quick start: https://github.com/karpathy/autoresearch/blob/master/README.md
Official repository: https://github.com/karpathy/autoresearch
DataCamp explainer: https://www.datacamp.com/tutorial/guide-to-autoresearch