Autoresearch

12 sources · Updated April 5, 2026

Autoresearch is a hill-climbing optimization loop originated by Andrej Karpathy: make one small change, test against a binary checklist, keep if improved, revert if not, repeat. The method has been adapted from ML research to Claude skill tuning (56% → 92% pass rate in 4 rounds), parallel GPU experiments (910 experiments in 8 hours, 9x speedup), and even distributed swarm optimization (Hyperspace). The key insight: quality scoring must use binary yes/no checklists (3-6 questions), not vague ratings. The most valuable artifact is the changelog — institutional knowledge about what works. The pattern generalizes to anything measurable: prompts, page load times, ML hyperparameters, trading strategies.

Insights

Core Method

  • Karpathy's autoresearch method adapted for Claude skills: make one small change to a skill prompt, test against a binary checklist, keep if score improves, revert if not — this hill-climbing loop took a landing page skill from 56% to 92% pass rate in 4 rounds (from ole lehmann 10x claude skills autoresearch)
  • Quality scoring for AI outputs should use specific yes/no checklist questions, not vague 1-10 ratings — binary criteria like "Does the headline include a specific number?" produce consistent, automatable evaluation (from ole lehmann 10x claude skills autoresearch)
  • The sweet spot for skill evaluation checklists is 3-6 questions; more than that and the skill starts gaming the checklist (from ole lehmann 10x claude skills autoresearch)
  • Effective skill prompt improvements are specific and concrete: banned buzzword lists, requiring specific numbers in headlines, worked examples of good output — not abstract instructions like "write better copy" (from ole lehmann 10x claude skills autoresearch)
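The loop described above can be sketched in a few lines. This is a minimal illustration, not Karpathy's or Lehmann's actual harness: the checklist questions, the `generate` callable (whatever produces output from a prompt), and the `propose_change` callable (whatever produces the next one-small-change candidate) are all hypothetical placeholders.

```python
# Hypothetical checklist: a handful of binary yes/no predicates on the output,
# e.g. "Does the headline include a specific number?" — not 1-10 ratings.
CHECKLIST = [
    lambda out: any(ch.isdigit() for ch in out["headline"]),  # headline has a number?
    lambda out: len(out["body"].split()) <= 150,              # body under 150 words?
    lambda out: "buzzword" not in out["body"].lower(),        # banned words absent?
]

def score(output: dict) -> float:
    """Fraction of checklist questions answered 'yes'."""
    return sum(q(output) for q in CHECKLIST) / len(CHECKLIST)

def hill_climb(prompt, propose_change, generate, rounds: int = 4):
    """One small change per round; keep it if the score improves, revert if not."""
    best = score(generate(prompt))
    changelog = []
    for _ in range(rounds):
        candidate = propose_change(prompt)   # e.g. add a banned-buzzword list
        s = score(generate(candidate))
        kept = s > best
        changelog.append({"change": candidate, "score": s, "kept": kept})
        if kept:
            prompt, best = candidate, s      # keep the improvement
        # reverting is just declining to adopt the candidate
    return prompt, best, changelog
```

Note that the changelog records every attempt, kept or reverted, which is the artifact the next section highlights.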

Artifacts and Feedback Loops

  • The most valuable output of autoresearch is the changelog documenting every attempted change and its result — institutional knowledge that lets future models pick up where the last one left off (from ole lehmann 10x claude skills autoresearch)
  • The system catches changes that improve individual checklist items but hurt overall output quality — a tighter word count improved conciseness but degraded CTA quality, so the system reverted it (from ole lehmann 10x claude skills autoresearch)
  • Running autoresearch autonomously with a live dashboard that auto-refreshes every 10 seconds lets you walk away while the agent iterates — stops when hitting 95%+ three times in a row (from ole lehmann 10x claude skills autoresearch)
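Two of these mechanics are easy to pin down concretely: a changelog entry that records per-question results (so regressions like the word-count/CTA trade-off are visible), and the "95%+ three times in a row" stop rule. A minimal sketch, with field names of my own choosing:

```python
from dataclasses import dataclass, field

@dataclass
class ChangelogEntry:
    """Institutional memory: what was tried, how it scored, whether it was kept."""
    change: str
    per_item: dict = field(default_factory=dict)  # each checklist question's yes/no
    total: float = 0.0                            # overall pass rate
    kept: bool = False
    note: str = ""  # e.g. "tighter word count helped conciseness, hurt CTA"

def should_stop(history, target: float = 0.95, streak: int = 3) -> bool:
    """Stop once the last `streak` rounds all hit the target pass rate."""
    return len(history) >= streak and all(s >= target for s in history[-streak:])
```

Recording per-item results, not just the total, is what lets the loop notice that a change improved one checklist question while degrading another.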

Scaling: Automated Skill Tuning

  • Combining autoresearch with Hamel Husain's evals-skills framework creates an automated pipeline where skills are continuously evaluated and improved — "auto-evals" for agent skill quality (from autoresearch skill tuning evals)
  • Tuning 190+ skills requires automated pipelines rather than manual prompt engineering — running autoresearch "all day in the background" as a continuous improvement loop (from autoresearch skill tuning evals)

Scaling: Parallel Execution

  • Parallel experiment execution (910 experiments in 8 hours) achieved 9x speedup over sequential autoresearch, at ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot)
  • The Claude Code agent developed an emergent optimization strategy: screening candidates on cheaper H100s and promoting winners to H200s — autonomous resource allocation it was never instructed to perform (from parallel gpu autoresearch skypilot)
  • Key ML finding: scaling model width mattered more than every hyperparameter trick combined — a result sequential search likely would have missed due to limited exploration (from parallel gpu autoresearch skypilot)
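The screen-then-promote pattern the agent converged on can be sketched as a two-tier parallel search. This is an illustrative skeleton, not the SkyPilot setup from the source: `cheap_run` and `expensive_run` stand in for launching an experiment on the cheap and expensive GPU tiers, respectively.

```python
from concurrent.futures import ThreadPoolExecutor

def screen_then_promote(configs, cheap_run, expensive_run, top_k=3, workers=8):
    """Two-tier parallel search: screen every config on the cheap tier
    (think H100s), then promote only the top scorers to the expensive
    tier (H200s) for full evaluation."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        cheap_scores = list(pool.map(cheap_run, configs))  # fan out screening runs
    ranked = sorted(zip(configs, cheap_scores), key=lambda cs: cs[1], reverse=True)
    winners = [c for c, _ in ranked[:top_k]]               # promote the winners
    with ThreadPoolExecutor(max_workers=workers) as pool:
        final_scores = list(pool.map(expensive_run, winners))
    return dict(zip(map(str, winners), final_scores))
```

The breadth this buys is the point: screening hundreds of configs cheaply in parallel is what surfaces findings (like model width dominating hyperparameter tricks) that a sequential search would likely miss.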

Scaling: Distributed Swarms

  • Hyperspace generalizes autoresearch into a platform where users describe optimization problems in plain English and a distributed swarm solves them — zero code (from hyperspace agi autoswarms)
  • 237 agents with zero human intervention ran 14,832 experiments across 5 domains: ML agents drove validation loss down 75%, search agents evolved 21 scoring strategies, and finance agents achieved a Sharpe ratio of 1.32 (from hyperspace agi autoswarms)
  • A "playbook curator" distills why winning mutations work into reusable patterns, so new agents bootstrap from accumulated wisdom rather than starting cold (from hyperspace agi autoswarms)

Generalization

  • The autoresearch pattern generalizes beyond prompts to anything measurable: one person optimized page load from 1100ms to 67ms in 67 rounds using the same try-measure-keep/revert loop (from ole lehmann 10x claude skills autoresearch)

Full-Stack Research Agents

  • Feynman (Claude-based agent) generates cited meta-analyses in 30 minutes, replicates experiments on Runpod, audits claims against code, and simulates peer review — a full research workflow automated (from feynman claude code for research)
  • GPT-5.4 Pro is "an order of magnitude better" (per Terence Tao) for deep/architectural questions — use @steipete's 'oracle' tool to invoke it at the planning stage (from optimizing academic work with gpt 5.4 pro and coding agents)

Research-as-Agent-Workflow

  • Power users are investing 1,200+ hours into Claude-based research workflows across AI papers, market analysis, and competitive intelligence, indicating Claude is becoming a primary research tool for knowledge workers (from claude research assistant prompts)

  • Specialized prompt libraries for domain-specific research (papers, markets, competitive intel) are a key unlock for research agent productivity (from claude research assistant prompts)

  • At ~100 articles and ~400K words, LLMs can handle complex Q&A against a personal wiki without fancy RAG — auto-maintained index files and brief document summaries are sufficient for the model to navigate the space (from llm powered personal knowledge bases)

  • File query results back into the wiki after each research session — explorations and questions "add up" in the knowledge base, so every LLM interaction enhances future queries rather than disappearing into chat history (from llm powered personal knowledge bases)

  • As a knowledge base repo grows, the natural evolution is synthetic data generation + finetuning to have the LLM "know" the data in its weights rather than relying on context windows — a long-term trajectory for personal knowledge systems (from llm personal knowledge base workflow)

  • Monthly wiki health check: "flag contradictions between articles, find topics mentioned but never explained, list claims not backed by a source in raw/, suggest 3 new articles for gaps" — prevents error compounding when LLM outputs get filed back into the knowledge base (from nick spisak shared link)

  • LLM Council method (Karpathy-inspired): instead of trusting one model's answer, 5 advisors with different thinking styles argue the same question, then anonymously peer-review each other — catches blind spots no individual perspective reveals (from link share without context)
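The council mechanic reduces to two phases: independent answers, then anonymous peer review of the pooled answers. A minimal sketch, assuming hypothetical `ask(persona, question)` and `review(persona, answers)` wrappers around whatever model API you use:

```python
def llm_council(question, ask, personas, review):
    """Karpathy-inspired council: several advisors with different thinking
    styles answer the same question independently, then each anonymously
    peer-reviews the pooled answers."""
    answers = {p: ask(p, question) for p in personas}
    # Anonymize: reviewers see the answers without attribution, so critiques
    # target the reasoning rather than the advisor.
    anonymous = list(answers.values())
    critiques = {p: review(p, anonymous) for p in personas}
    return answers, critiques
```

The anonymization step is the design choice that matters: stripping attribution before review is what keeps advisors from deferring to (or dismissing) a particular voice.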

Voices

14 contributors

Andrej Karpathy

@karpathy

I like to train large deep neural nets. Previously Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford.

2.0M followers 2 tweets

Ole Lehmann

@itsolelehmann

I help non-technical people make more money with AI agents. AI connoisseur, robotics maxi, eu/acc supporter, dad, techno optimist

135.5K followers 2 tweets

Alex Prompter

@alex_prompter

Marketing + AI = $$$ 🔑 @godofprompt (co-founder) 🎥 https://t.co/IodiF1Ra5f (co-founder)

91.0K followers 1 tweet

Cody Schneider

@codyschneiderxx

folllow for shiposting about the growth tactics i'm using to grow my startup building @graphed with @maxchehab Get Started Free - https://t.co/stXlkQBlSj

59.9K followers 1 tweet

Nick Spisak

@NickSpisak_

| AI Transformation Engineer | Seven Figure E-Commerce Business Owner

9.5K followers 1 tweet

Advait Paliwal

@advaitpaliwal

disciple of experience

12.6K followers 1 tweet

Aniket Panjwani

@aniketapanjwani

I teach agentic coding to economists || PhD Economics Northwestern || Director of AI/ML @ Payslice || ex-MLOps @ Zelle

4.9K followers 1 tweet

Griffin Hilly

@GriffinHilly

Bond trader by day, pro-humanism/nuclear on my own time. @MadiHilly’s biggest fan

637 followers 1 tweet

James Bedford

@jameesy

engineering @zerion, building https://t.co/3RQYuZAcrt

5.2K followers 1 tweet

Zhanghao Wu

@Michaelvll1

Building SkyPilot @skypilot_org | Co-creator of @lmsysorg, PhD @Berkeley_EECS @ucbrise. Prev: @MIT, @sjtu1896

1.8K followers 1 tweet

George from 🕹prodmgmt.world

@nurijanian

Can I make everyone a great product manager? I will do my best | Get my product management OS + AI skills for Claude Code/Cursor: https://t.co/ngCnvp77SD

43.8K followers 1 tweet

rahul

@rahulgs

head of applied ai @ ramp

13.1K followers 1 tweet

Shiv

@shivsakhuja

Pontificating... / Vibe GTM-ing / Making Claude Code do non-coding things building a team of AI coworkers @ Gooseworks / prev @AthinaAI /@google / @ycombinator

52.2K followers 1 tweet

Varun

@varun_mathur

Agentic General Intelligence @HyperspaceAI (Co-founder and CEO)

34.8K followers 1 tweet