Running AutoResearch
A practical run starts by installing dependencies, preparing data, verifying one manual training run, then pointing a coding agent at program.md to begin autonomous iteration.
Running AutoResearch
The official setup targets a single NVIDIA GPU, Python 3.10 or newer, and uv for dependency management. The README says the original project was tested on an H100, though community forks have explored other platforms.
See also Overview, Loop Architecture, and Limitations and Implications.
Basic flow
A typical run has four phases:
- Install the project dependencies with
uv sync. - Run
uv run prepare.pyonce to download data and train the tokenizer. - Run
uv run train.pymanually to verify the setup works. - Start a coding agent such as Claude Code or Codex in the repository and direct it to follow
program.md.
The human role is mainly to shape the research organization through program.md: define the goal, guardrails, measurement discipline, and what kinds of changes should or should not be attempted.
What the agent changes
The agent is expected to focus on train.py. It can try changes in model architecture, optimization, hyperparameters, training loop details, or efficiency improvements. The fixed metric and time budget determine whether each attempt is retained.
Practical evaluation questions
Before using AutoResearch seriously, ask:
- Is the metric aligned with what you actually care about?
- Is the experiment budget long enough to detect meaningful improvements?
- Can the agent make changes without corrupting the evaluation harness?
- Are commits and logs reviewed by a human before results are trusted?
Sources
- Official README quick start: https://github.com/karpathy/autoresearch/blob/master/README.md
- Official repository: https://github.com/karpathy/autoresearch
- DataCamp explainer: https://www.datacamp.com/tutorial/guide-to-autoresearch