Autoresearch

12 sources · Updated April 5, 2026

Autoresearch is a hill-climbing optimization loop originated by Andrej Karpathy: make one small change, test against a binary checklist, keep if improved, revert if not, repeat. The method has been adapted from ML research to Claude skill tuning (56% → 92% pass rate in 4 rounds), parallel GPU experiments (910 experiments in 8 hours, 9x speedup), and even distributed swarm optimization (Hyperspace). The key insight: quality scoring must use binary yes/no checklists (3-6 questions), not vague ratings. The most valuable artifact is the changelog — institutional knowledge about what works. The pattern generalizes to anything measurable: prompts, page load times, ML hyperparameters, trading strategies.

Insights

Core Method

  • Karpathy's autoresearch method adapted for Claude skills: make one small change to a skill prompt, test against a binary checklist, keep if score improves, revert if not — this hill-climbing loop took a landing page skill from 56% to 92% pass rate in 4 rounds (from ole lehmann 10x claude skills autoresearch)
  • Quality scoring for AI outputs should use specific yes/no checklist questions, not vague 1-10 ratings — binary criteria like "Does the headline include a specific number?" produce consistent, automatable evaluation (from ole lehmann 10x claude skills autoresearch)
  • The sweet spot for skill evaluation checklists is 3-6 questions; more than that and the skill starts gaming the checklist (from ole lehmann 10x claude skills autoresearch)
  • Effective skill prompt improvements are specific and concrete: banned buzzword lists, requiring specific numbers in headlines, worked examples of good output — not abstract instructions like "write better copy" (from ole lehmann 10x claude skills autoresearch)
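The loop described above can be sketched in a few lines. This is a minimal illustration, not Karpathy's or Lehmann's actual harness: the checklist questions, the `generate` callable (whatever produces output from a prompt), and the `propose_change` callable (whatever produces the next one-small-change candidate) are all hypothetical placeholders.

```python
# Hypothetical checklist: a handful of binary yes/no predicates on the output,
# e.g. "Does the headline include a specific number?" — not 1-10 ratings.
CHECKLIST = [
    lambda out: any(ch.isdigit() for ch in out["headline"]),  # headline has a number?
    lambda out: len(out["body"].split()) <= 150,              # body under 150 words?
    lambda out: "buzzword" not in out["body"].lower(),        # banned words absent?
]

def score(output: dict) -> float:
    """Fraction of checklist questions answered 'yes'."""
    return sum(q(output) for q in CHECKLIST) / len(CHECKLIST)

def hill_climb(prompt, propose_change, generate, rounds: int = 4):
    """One small change per round; keep it if the score improves, revert if not."""
    best = score(generate(prompt))
    changelog = []
    for _ in range(rounds):
        candidate = propose_change(prompt)   # e.g. add a banned-buzzword list
        s = score(generate(candidate))
        kept = s > best
        changelog.append({"change": candidate, "score": s, "kept": kept})
        if kept:
            prompt, best = candidate, s      # keep the improvement
        # reverting is just declining to adopt the candidate
    return prompt, best, changelog
```

Note that the changelog records every attempt, kept or reverted, which is the artifact the next section highlights.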

Artifacts and Feedback Loops

  • The most valuable output of autoresearch is the changelog documenting every attempted change and its result — institutional knowledge that lets future models pick up where the last one left off (from ole lehmann 10x claude skills autoresearch)
  • The system catches changes that improve individual checklist items but hurt overall output quality — a tighter word count improved conciseness but degraded CTA quality, so the system reverted it (from ole lehmann 10x claude skills autoresearch)
  • Running autoresearch autonomously with a live dashboard that auto-refreshes every 10 seconds lets you walk away while the agent iterates — stops when hitting 95%+ three times in a row (from ole lehmann 10x claude skills autoresearch)
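Two of these mechanics are easy to pin down concretely: a changelog entry that records per-question results (so regressions like the word-count/CTA trade-off are visible), and the "95%+ three times in a row" stop rule. A minimal sketch, with field names of my own choosing:

```python
from dataclasses import dataclass, field

@dataclass
class ChangelogEntry:
    """Institutional memory: what was tried, how it scored, whether it was kept."""
    change: str
    per_item: dict = field(default_factory=dict)  # each checklist question's yes/no
    total: float = 0.0                            # overall pass rate
    kept: bool = False
    note: str = ""  # e.g. "tighter word count helped conciseness, hurt CTA"

def should_stop(history, target: float = 0.95, streak: int = 3) -> bool:
    """Stop once the last `streak` rounds all hit the target pass rate."""
    return len(history) >= streak and all(s >= target for s in history[-streak:])
```

Recording per-item results, not just the total, is what lets the loop notice that a change improved one checklist question while degrading another.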

Scaling: Automated Skill Tuning

  • Combining autoresearch with Hamel Husain's evals-skills framework creates an automated pipeline where skills are continuously evaluated and improved — "auto-evals" for agent skill quality (from autoresearch skill tuning evals)
  • Tuning 190+ skills requires automated pipelines rather than manual prompt engineering — running autoresearch "all day in the background" as a continuous improvement loop (from autoresearch skill tuning evals)

Scaling: Parallel Execution

  • Parallel experiment execution (910 experiments in 8 hours) achieved 9x speedup over sequential autoresearch, at ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot)
  • The Claude Code agent developed an emergent optimization strategy: screening candidates on cheaper H100s and promoting winners to H200s — autonomous resource allocation it was never instructed to perform (from parallel gpu autoresearch skypilot)
  • Key ML finding: scaling model width mattered more than every hyperparameter trick combined — a result sequential search likely would have missed due to limited exploration (from parallel gpu autoresearch skypilot)
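The screen-then-promote pattern the agent converged on can be sketched as a two-tier parallel search. This is an illustrative skeleton, not the SkyPilot setup from the source: `cheap_run` and `expensive_run` stand in for launching an experiment on the cheap and expensive GPU tiers, respectively.

```python
from concurrent.futures import ThreadPoolExecutor

def screen_then_promote(configs, cheap_run, expensive_run, top_k=3, workers=8):
    """Two-tier parallel search: screen every config on the cheap tier
    (think H100s), then promote only the top scorers to the expensive
    tier (H200s) for full evaluation."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cheap_scores = list(pool.map(cheap_run, configs))  # fan out screening runs
    ranked = sorted(zip(configs, cheap_scores), key=lambda cs: cs[1], reverse=True)
    winners = [c for c, _ in ranked[:top_k]]               # promote the winners
    with ThreadPoolExecutor(max_workers=workers) as pool:
        final_scores = list(pool.map(expensive_run, winners))
    return dict(zip(map(str, winners), final_scores))
```

The breadth this buys is the point: screening hundreds of configs cheaply in parallel is what surfaces findings (like model width dominating hyperparameter tricks) that a sequential search would likely miss.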

Scaling: Distributed Swarms

  • Hyperspace generalizes autoresearch into a platform where users describe optimization problems in plain English and a distributed swarm solves them — zero code (from hyperspace agi autoswarms)
  • 237 agents with zero human intervention ran 14,832 experiments across 5 domains: ML agents drove validation loss down 75%, search agents evolved 21 scoring strategies, and finance agents achieved a Sharpe ratio of 1.32 (from hyperspace agi autoswarms)
  • A "playbook curator" distills why winning mutations work into reusable patterns, so new agents bootstrap from accumulated wisdom rather than starting cold (from hyperspace agi autoswarms)

Generalization

  • The autoresearch pattern generalizes beyond prompts to anything measurable: one person optimized page load from 1100ms to 67ms in 67 rounds using the same try-measure-keep/revert loop (from ole lehmann 10x claude skills autoresearch)

Full-Stack Research Agents

  • Feynman (Claude-based agent) generates cited meta-analyses in 30 minutes, replicates experiments on Runpod, audits claims against code, and simulates peer review — a full research workflow automated (from feynman claude code for research)
  • GPT-5.4 Pro is "an order of magnitude better" (per Terence Tao) for deep/architectural questions — use @steipete's 'oracle' tool to invoke it at the planning stage (from optimizing academic work with gpt 5.4 pro and coding agents)

Research-as-Agent-Workflow

  • Power users are investing 1,200+ hours into Claude-based research workflows across AI papers, market analysis, and competitive intelligence, indicating Claude is becoming a primary research tool for knowledge workers (from claude research assistant prompts)

  • Specialized prompt libraries for domain-specific research (papers, markets, competitive intel) are a key unlock for research agent productivity (from claude research assistant prompts)

  • At ~100 articles and ~400K words, LLMs can handle complex Q&A against a personal wiki without fancy RAG — auto-maintained index files and brief document summaries are sufficient for the model to navigate the space (from llm powered personal knowledge bases)

  • File query results back into the wiki after each research session — explorations and questions "add up" in the knowledge base, so every LLM interaction enhances future queries rather than disappearing into chat history (from llm powered personal knowledge bases)

  • As a knowledge base repo grows, the natural evolution is synthetic data generation + finetuning to have the LLM "know" the data in its weights rather than relying on context windows — a long-term trajectory for personal knowledge systems (from llm personal knowledge base workflow)

  • Monthly wiki health check: "flag contradictions between articles, find topics mentioned but never explained, list claims not backed by a source in raw/, suggest 3 new articles for gaps" — prevents error compounding when LLM outputs get filed back into the knowledge base (from nick spisak shared link)

  • LLM Council method (Karpathy-inspired): instead of trusting one model's answer, 5 advisors with different thinking styles argue the same question, then anonymously peer-review each other — catches blind spots no individual perspective reveals (from link share without context)
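The council mechanic reduces to two phases: independent answers, then anonymous peer review of the pooled answers. A minimal sketch, assuming hypothetical `ask(persona, question)` and `review(persona, answers)` wrappers around whatever model API you use:

```python
def llm_council(question, ask, personas, review):
    """Karpathy-inspired council: several advisors with different thinking
    styles answer the same question independently, then each anonymously
    peer-reviews the pooled answers."""
    answers = {p: ask(p, question) for p in personas}
    # Anonymize: reviewers see the answers without attribution, so critiques
    # target the reasoning rather than the advisor.
    anonymous = list(answers.values())
    critiques = {p: review(p, anonymous) for p in personas}
    return answers, critiques
```

The anonymization step is the design choice that matters: stripping attribution before review is what keeps advisors from deferring to (or dismissing) a particular voice.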

Voices

14 contributors

Andrej Karpathy

@karpathy

I like to train large deep neural nets. Previously Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford.

2.0M followers 2 tweets

Ole Lehmann

@itsolelehmann

I help non-technical people make more money with AI agents. AI connoisseur, robotics maxi, eu/acc supporter, dad, techno optimist

135.5K followers 2 tweets

Alex Prompter

@alex_prompter

Marketing + AI = $$$ 🔑 @godofprompt (co-founder) 🎥 https://t.co/IodiF1Ra5f (co-founder)

91.0K followers 1 tweet

Cody Schneider

@codyschneiderxx

folllow for shiposting about the growth tactics i'm using to grow my startup building @graphed with @maxchehab Get Started Free - https://t.co/stXlkQBlSj

59.9K followers 1 tweet

Nick Spisak

@NickSpisak_

| AI Transformation Engineer | Seven Figure E-Commerce Business Owner

9.5K followers 1 tweet

Advait Paliwal

@advaitpaliwal

disciple of experience

12.6K followers 1 tweet

Aniket Panjwani

@aniketapanjwani

I teach agentic coding to economists || PhD Economics Northwestern || Director of AI/ML @ Payslice || ex-MLOps @ Zelle

4.9K followers 1 tweet

Griffin Hilly

@GriffinHilly

Bond trader by day, pro-humanism/nuclear on my own time. @MadiHilly’s biggest fan

637 followers 1 tweet

James Bedford

@jameesy

engineering @zerion, building https://t.co/3RQYuZAcrt

5.2K followers 1 tweet

Zhanghao Wu

@Michaelvll1

Building SkyPilot @skypilot_org | Co-creator of @lmsysorg, PhD @Berkeley_EECS @ucbrise. Prev: @MIT, @sjtu1896

1.8K followers 1 tweet

George from 🕹prodmgmt.world

@nurijanian

Can I make everyone a great product manager? I will do my best | Get my product management OS + AI skills for Claude Code/Cursor: https://t.co/ngCnvp77SD

43.8K followers 1 tweet

rahul

@rahulgs

head of applied ai @ ramp

13.1K followers 1 tweet

Shiv

@shivsakhuja

Pontificating... / Vibe GTM-ing / Making Claude Code do non-coding things building a team of AI coworkers @ Gooseworks / prev @AthinaAI /@google / @ycombinator

52.2K followers 1 tweet

Varun

@varun_mathur

Agentic General Intelligence @HyperspaceAI (Co-founder and CEO)

34.8K followers 1 tweet