Creative worktable mixing marketing templates, conversion charts, and AI-automated test results, illustrating the business applications of the Karpathy Loop

Karpathy’s Autoresearch: the self-improving AI (and how to apply it to your business)

Artificial Intelligence
Nicolas
12 min read

Andrej Karpathy, OpenAI co-founder and former head of AI at Tesla, released a 630-line script that lets an artificial intelligence run its own research experiments all night long, with zero human intervention.

The project is called Autoresearch, and it blew up: 33,000 GitHub stars in a single week, 8.6 million views on X in 48 hours.

The tech community didn’t just watch from the sidelines: freelancers, marketers, and traders have already “hacked” the concept to apply it to their own fields.

Here’s how this small framework is changing the way we think about automation, and more importantly, how you can put it to work for your business.

What exactly is Autoresearch?

Autoresearch is a minimalist machine learning research framework built on top of nanochat, Karpathy’s language model training system.

The core idea is almost provocatively simple: you give an AI agent a GPU, an editable code file, a metric to beat, and Git as its memory.

The agent does the rest.

It tests hypotheses, modifies the code, runs 5-minute training sessions, measures the results, and starts over.

A tireless AI researcher that works while you sleep: that’s exactly what Autoresearch does.

The initial goal: improve a small language model (around 11-12 million parameters) by finding the best combinations of architecture, hyperparameters, and training techniques.

The real brilliance of the project is that the underlying pattern works far beyond machine learning.

How does the “Karpathy Loop” work?

Humanoid robot in an infinite corridor of mirrors reflecting improved versions of itself, symbolizing Autoresearch's recursive self-improvement loop

The 3 files that make up the system

The entire system runs on 3 files, and it’s this simplicity that makes it so powerful.

prepare.py is the fixed file: it handles data, tokenization (BPE with an 8,192-token vocabulary), and evaluation.

Nobody touches it, not the human, not the agent.

train.py is the agent’s playground: it contains the GPT architecture, the optimizers (Muon and AdamW), and the training loop.

It’s the only file the AI is allowed to modify.

program.md is the human file: you write your research instructions in plain language, like a brief to an assistant.

Karpathy calls it “programming research in Markdown”: you stop coding and start orchestrating.
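To make that concrete, here is what a brief of that kind might look like. This is a hypothetical sketch, not Karpathy's actual program.md; the headings and wording are invented, but every rule comes from the loop described in this article:

```markdown
# Research program

Goal: minimize val_bpb on the held-out set. Current best: 0.9979.

Rules:
- Only edit train.py; never touch prepare.py.
- Each run is capped at 5 minutes of wall-clock training.
- Before each run, write a prediction block with the expected val_bpb.
- Commit on improvement, reset otherwise. NEVER STOP.

Ideas worth exploring: learning-rate warmup, attention scaling, weight decay.
```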

The autonomous loop in 8 steps

The agent follows a precise cycle, repeated endlessly until you stop it (spoiler: the default instruction is “NEVER STOP”).

  1. The agent reads the context: program.md, the Git history, previous results
  2. It formulates a hypothesis and predicts the expected outcome (the “prediction block”)
  3. It modifies train.py based on its hypothesis
  4. It runs a 5-minute training session (wall-clock time, not step count)
  5. It measures the val_bpb metric (validation bits per byte: lower is better)
  6. It compares prediction vs. reality: if the gap is large, the agent learns from its own mistake
  7. If the result improves, Git commit; otherwise, git reset and move on
  8. It starts over from step 1
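The 8 steps above can be sketched as code. Here is a minimal Python simulation of the loop, assuming `run_experiment` as a stand-in for a real 5-minute training run and a plain list in place of the Git log; all names and numbers are illustrative:

```python
import random

random.seed(0)

def run_experiment(params):
    """Stand-in for a 5-minute training run: returns a simulated val_bpb
    (validation bits per byte; lower is better)."""
    return 1.0 - 0.01 * params["tweak"] + random.gauss(0, 0.005)

best_bpb = float("inf")
history = []  # plays the role of the Git log: the agent's memory

for step in range(20):
    hypothesis = {"tweak": random.randint(0, 5)}   # 2. formulate a hypothesis
    predicted = 1.0 - 0.01 * hypothesis["tweak"]   #    and a prediction block
    measured = run_experiment(hypothesis)          # 4-5. run and measure
    surprise = abs(predicted - measured)           # 6. prediction vs. reality
    if measured < best_bpb:                        # 7. improvement -> "commit"
        best_bpb = measured
        history.append(("commit", hypothesis, round(measured, 4)))
    else:                                          #    regression -> "revert"
        history.append(("revert", hypothesis, round(measured, 4)))

print(round(best_bpb, 4))
```

The real agent does the same thing, except each iteration edits train.py and the history lives in Git commits rather than a Python list.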

This cycle runs at about 12 experiments per hour, or 100 experiments per night on a single GPU.

Craig Huitt, tech analyst, sums it up: “This is the best example of the agentic loop that will eat everything.”

The detail that makes all the difference: Git serves as memory.

Every gain is a commit, every failure is a revert.

The agent can read through its full experiment history to avoid repeating the same mistakes.

The concrete results that got the community talking

The numbers speak for themselves.

Karpathy himself ran an overnight session: 126 experiments, with val_bpb dropping from 0.9979 to 0.9697.
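For context, val_bpb can be derived from the evaluation loss. A minimal sketch, assuming the loss is mean cross-entropy in nats per token over a text of known byte length (the numbers in the example are made up):

```python
import math

def val_bpb(mean_loss_nats, total_tokens, total_bytes):
    """Convert mean cross-entropy (nats per token) to bits per byte.

    Illustrative formula: the evaluation text is total_bytes long and
    was tokenized into total_tokens tokens.
    """
    bits_per_token = mean_loss_nats / math.log(2)
    return bits_per_token * total_tokens / total_bytes

# Example: 4.6 nats/token over 1,000 tokens covering 4,200 bytes
print(round(val_bpb(4.6, 1000, 4200), 3))
```

Measuring in bits per byte rather than loss per token keeps the metric comparable even when the agent changes the tokenizer.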

Over a 2-day run with a deeper model (depth=12), the agent made roughly 700 modifications, about 20 of which yielded gains transferable to other projects.

The most striking result: the agent uncovered bugs in attention scaling and regularization that Karpathy, with 20 years of deep learning experience, had missed.

Tobi Lutke, CEO of Shopify and engineer by training, tested the framework on his own models.

The outcome: a 19% performance boost, with a smaller model outperforming a larger one that had been configured manually.

When a 630-line script finds bugs that a 20-year expert missed, the message is clear: automated iteration beats human intuition on repetitive tasks.

| Metric | Before Autoresearch | After Autoresearch |
| --- | --- | --- |
| val_bpb (Karpathy, 1 night) | 0.9979 | 0.9697 |
| Experiments (1 night) | 2-3 manual | 126 automated |
| Performance (Tobi Lutke) | Manual baseline | +19% with a smaller model |
| Bugs found | 0 (in 20 years) | Scaling + regularization errors |

Claude Code + Autoresearch: AI that improves on its own

The most exciting combo around Autoresearch comes from pairing it with Claude Code, Anthropic’s development assistant.

The idea: instead of manually configuring the agent, you open Claude Code in the repo folder, and it takes over.

Claude Code reads program.md, understands the goals, and starts experimenting on its own.

It modifies train.py, launches training runs, analyzes results, commits or reverts, and starts again.

What changes at a deep level: Claude is no longer just following instructions.

It experiments, measures, and learns from its mistakes in real time.

The key insight: you stop coding and start orchestrating an agent that codes, tests, and iterates for you.

Setup is fast: clone the repo, open Claude Code in the folder, and within minutes you have a working self-improving pipeline.

For those who want to go further, the system also works with GPT-5, local models via Ollama, or even on Google Colab with a free T4 GPU.

The Karpathy Loop for SMBs and freelancers: 7 practical applications

A fox scientist surrounded by holographic screens displaying code experiments and progress charts, illustrating the Autoresearch concept by Karpathy

This is where it gets exciting for non-engineers.

The genius of Autoresearch is that the pattern is universal.

You don’t need to know how to code or own a $30,000 GPU.

You just need to swap out 3 elements:

  • The editable file (train.py becomes your template, your page, your email)
  • The metric (val_bpb becomes your business KPI: open rate, conversion, revenue)
  • The program (program.md becomes your strategy in plain language)

1. Fine-tuning your outreach emails while you sleep

You’re a freelancer or a sales rep at an SMB: you send cold emails to find clients.

The agent tweaks the subject line, the opening hook, the CTA, the length, sends to a small segment, measures the reply rate, and keeps what works.

Instead of 2-3 A/B tests per month done by hand, the agent runs dozens per day.

Accessible tools: Instantly, Lemlist, or Mailchimp combined with Claude Code.

2. Continuously improving your landing page

You have a website that sells a service, but you never know which headline or CTA performs best.

The agent tests variations of the headline, subheadline, call-to-action button, and social proof.

Every night, it proposes a new version, deploys it, measures the conversion rate, and only keeps what converts better.

Eric Siu (Single Grain agency) estimates that marketing teams run about 30 to 50 tests per year manually.

With this pattern: up to 36,500 tests per year (100 automated tests a day, every day of the year).

3. Writing SEO content that ranks higher

The agent writes variations of titles, meta descriptions, introductions, and article structures.

The metric: SEO score (via Surfer or Clearscope), click-through rate in Search Console, average position.

The loop: the agent analyzes existing content, identifies weaknesses, proposes a rewrite, measures the impact, and iterates.

4. Testing your LinkedIn posts or newsletters

A freelancer publishing on LinkedIn never knows which format performs best.

The agent generates variations: length, tone, opening hook, hook type, storytelling vs. listicle.

It tests 5 hook variations, publishes, measures after 24-48 hours, and learns for the next post.

The agent maintains a “logbook” of what works for your specific audience.

5. Qualifying and scoring leads automatically

You have a CRM full of contacts, but you waste time reaching out to the wrong ones.

The agent modifies scoring criteria, tests against conversion history, measures accuracy, and refines.
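A toy sketch of that inner step. The fields, weights, and the tiny conversion history are invented for illustration and are not tied to any real CRM schema:

```python
# Hypothetical conversion history the agent scores against (step: measure).
LEADS = [
    {"opened_emails": 5, "company_size": 120, "demo_request": True,  "converted": True},
    {"opened_emails": 1, "company_size": 8,   "demo_request": False, "converted": False},
    {"opened_emails": 3, "company_size": 45,  "demo_request": False, "converted": True},
    {"opened_emails": 0, "company_size": 300, "demo_request": False, "converted": False},
]

def score(lead, weights):
    """A scoring rule the agent is allowed to mutate."""
    return (weights["opens"] * lead["opened_emails"]
            + weights["size"] * (lead["company_size"] > 50)
            + weights["demo"] * lead["demo_request"])

def accuracy(weights, threshold=5):
    """Fraction of past leads this rule classifies correctly (the metric)."""
    hits = sum((score(l, weights) >= threshold) == l["converted"] for l in LEADS)
    return hits / len(LEADS)

# The agent would tweak these weights, re-measure, and keep the best set.
baseline = {"opens": 1, "size": 3, "demo": 5}
print(accuracy(baseline))
```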

The result: your sales team only contacts the hottest leads, while the rest are nurtured automatically.

Accessible tools: HubSpot, Pipedrive, or even a structured Google Sheet combined with Claude Code.

6. Supercharging your ad campaigns

The natural A/B testing pattern, but in turbo mode and running 24/7.

The agent generates title, description, and CTA variations for Google Ads or Meta campaigns.

The loop: it creates a variation, deploys it via the API, measures cost per acquisition (CPA) or ROAS after 24-72 hours, keeps or discards, and starts again.

Pro tip: start with a single ad group to keep the test clean.

7. Refining your internal processes

An SMB can apply the pattern to its own processes: quote templates, call scripts, customer response templates.

The agent modifies a template, tests it on the next batch of interactions, and measures the quote acceptance rate or resolution time.

Concrete example: testing your sales proposals by measuring how many quotes get accepted over 30 days.

The universal pattern in 5 steps

| Step | What it means in practice |
| --- | --- |
| 1. Define "better" | Pick ONE number to beat (reply rate, conversion, revenue) |
| 2. The AI proposes | It creates a variation of your email, page, ad, or process |
| 3. Test it | The variation is deployed on a small segment |
| 4. Measure it | The AI compares: did it improve or not? |
| 5. Keep or discard | If it's better, it becomes the new baseline. If not, trash it. |
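A minimal sketch of this 5-step pattern applied to cold-email subject lines. Everything here is invented for illustration: the subject lines, the "true" reply rates, and `send_batch`, which stands in for a real email-tool API:

```python
import random

random.seed(1)

# Real reply rates are unknown in practice; these are invented so the
# simulation runs.
TRUE_RATES = {"Intro": 0.05, "Quick question": 0.08, "Idea for {company}": 0.12}

def send_batch(subject, n=200):
    """Step 3: deploy the variation to a small segment and count replies."""
    return sum(random.random() < TRUE_RATES[subject] for _ in range(n))

baseline = "Intro"
best_rate = send_batch(baseline) / 200                     # step 1: ONE number

for candidate in ("Quick question", "Idea for {company}"):  # step 2: propose
    rate = send_batch(candidate) / 200                      # step 4: measure
    if rate > best_rate:                                    # step 5: keep...
        baseline, best_rate = candidate, rate               # ...or discard

print(baseline, round(best_rate, 3))
```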

The key: the same principle that helps Karpathy improve an AI model overnight helps a freelancer fine-tune outreach emails while they sleep.

Tech stack and how to get started

For the original ML use case, here’s what you need:

  • Python 3.10+ and PyTorch with CUDA
  • An NVIDIA GPU (H100 is ideal, but it works on an RTX 4090)
  • An LLM API: Claude, GPT-5, or a local model via Ollama
  • Git installed on the machine

For Apple Silicon users, an autoresearch-mlx variant exists.

On Google Colab, a free T4 GPU is enough to test the concept.

Getting started with Claude Code is even simpler: clone the repo, open Claude Code in the folder, and go.

For business applications (emails, landing pages, ads), you don’t need a GPU.

Access to Claude Code or a no-code tool like MindStudio is enough to replicate the loop.

What Autoresearch can and cannot do

Strengths of the system

  • Ablation studies on small and medium models: testing what happens when you remove or add a component
  • Hyperparameter sensitivity analysis: finding the optimal settings
  • Architecture comparisons: pitting multiple approaches against each other in a single night
  • Continuous iteration of any measurable system

Limitations to be aware of

  • No multi-GPU or multi-node support in the base version
  • The agent doesn’t create new mathematical theory: it tests hypotheses, it doesn’t invent
  • Without full history, the agent can generate repetitive experiments
  • It doesn’t replace the researcher’s intuition for choosing the right questions to ask
  • Noisy data (finance, marketing) requires additional guardrails to prevent false positives
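One simple guardrail of that last kind, sketched here as a two-proportion z-test before accepting a variant; the counts and threshold are illustrative, and real pipelines might prefer sequential tests:

```python
import math

def significant_improvement(conv_a, n_a, conv_b, n_b, z_crit=1.96):
    """Guardrail against noisy-metric false positives: only accept variant B
    over baseline A if the lift clears a two-proportion z-test."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se > z_crit

# 30/1000 -> 38/1000 looks better but does not clear the bar...
print(significant_improvement(30, 1000, 38, 1000))
# ...while 30/1000 -> 55/1000 does.
print(significant_improvement(30, 1000, 55, 1000))
```

Without a check of this kind, an agent iterating on a noisy KPI will happily "commit" lucky fluctuations as improvements.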

Autoresearch is an extraordinary hammer, but it doesn’t choose the nails for you.

Why this is a turning point for AI and business

The shift from “I code it myself” to “I orchestrate agents that code” is happening right in front of us.

With Autoresearch, the professional’s role evolves: you become a designer of experimental arenas, not an executor.

The bottleneck is no longer the ability to code.

It shifts to 3 skills:

  1. Asking the right questions (what to test?)
  2. Defining the right metrics (how to measure “better”?)
  3. Designing the right guardrails (what limits to set for the agent?)

An interesting parallel: Sakana AI's AI Scientist-v2 became the first AI system to have a fully AI-generated paper accepted through peer review, at an ICLR workshop.

A 2-person team equipped with Autoresearch can produce at the pace of 20 engineers.

The safety questions are real: an AI that improves its own AI (recursive self-improvement) raises concerns about control and oversight.

Autonomous agents like Manus and Autoresearch show that the line between tool and collaborator blurs a little more each month.

Protocols like the Model Context Protocol (MCP) are standardizing communication between agents and tools, further accelerating this convergence.

Conclusion

Autoresearch is not a finished product: it’s a pattern.

And this pattern applies anywhere there’s a metric to beat and a process to iterate on.

The real skill of tomorrow won’t be knowing how to code train.py.

It will be knowing how to write a great program.md: defining what you’re looking for, setting the rules of the game, and letting the agent explore.

If you run an SMB, manage marketing campaigns, or want to scale your processes, the Karpathy Loop is a mental model worth adopting now.

The repo is open source on GitHub: explore it, fork it, and apply it to your own reality.

FAQ

What is Karpathy’s Autoresearch?

Autoresearch is an open-source framework created by Andrej Karpathy that lets an AI agent run machine learning experiments autonomously, 24/7, on a single GPU.

Do you need to know how to code to use the Karpathy Loop?

For the original ML use case, basic knowledge of Python and PyTorch is required.

For business applications (emails, landing pages, ads), Claude Code or a no-code tool like MindStudio is enough.

How many experiments can Autoresearch run per night?

About 100 experiments per night on a single GPU, roughly 12 per hour, thanks to fixed 5-minute training runs.

What GPU do you need to run Autoresearch?

The ideal option is an NVIDIA H100, but the framework runs on an RTX 4090, Apple Silicon (via MLX), and even a free Google Colab T4.

What’s the difference between Autoresearch and a simple A/B test?

A traditional A/B test compares 2 variations by hand.

Autoresearch automates the entire loop: hypothesis, modification, test, measurement, decision, and starts over with no human intervention.

Can you use Autoresearch for outreach emails?

Yes, by replacing train.py with your email template, val_bpb with your reply rate, and program.md with your outreach strategy.

Tools like Instantly or Lemlist can be plugged into the loop.

Does Autoresearch work with Claude Code?

Yes, and it’s one of the most popular combinations.

Claude Code reads program.md, modifies train.py, runs experiments, and handles Git commits automatically.

What are the limitations of Autoresearch?

The system doesn’t create new mathematical theory, doesn’t natively support multi-GPU setups, and can generate repetitive experiments without a complete history.

It requires additional guardrails when dealing with noisy data (finance, marketing).

Can Autoresearch replace a data scientist?

No.

It excels at rapid iteration and hyperparameter testing, but human intuition remains essential for choosing the right questions and interpreting results.

Where can you find the Autoresearch source code?

The official repo is on GitHub at github.com/karpathy/autoresearch, with over 33,000 stars and an active community.
