What Is Autoresearch and How to Use It
Autoresearch is Karpathy's hill-climbing optimization loop — make one change, test against a binary checklist, keep or revert, repeat. This guide covers the core method, practical implementation with Claude skills, scaling to parallel GPU clusters and distributed agent swarms, and the emerging pattern of full-stack AI research agents.
Insights
Core Method
- Karpathy's autoresearch method adapted for Claude skills: make one small change to a skill prompt, test against a binary checklist, keep if score improves, revert if not — this hill-climbing loop took a landing page skill from 56% to 92% pass rate in 4 rounds (from ole lehmann 10x claude skills autoresearch)
- Quality scoring for AI outputs should use specific yes/no checklist questions, not vague 1-10 ratings — binary criteria like "Does the headline include a specific number?" produce consistent, automatable evaluation (from ole lehmann 10x claude skills autoresearch)
- The sweet spot for skill evaluation checklists is 3-6 questions; more than that and the skill starts gaming the checklist (from ole lehmann 10x claude skills autoresearch)
- Effective skill prompt improvements are specific and concrete: banned buzzword lists, requiring specific numbers in headlines, worked examples of good output — not abstract instructions like "write better copy" (from ole lehmann 10x claude skills autoresearch)
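The binary-checklist scoring described above can be sketched as a list of yes/no predicates over the generated output. A minimal sketch; the three criteria here are illustrative stand-ins, not the ones from the original landing-page skill:

```python
import re

# Hypothetical binary checklist for a landing-page skill: each criterion
# is a yes/no question about the copy, never a vague 1-10 rating.
CHECKLIST = [
    ("headline includes a specific number",
     lambda t: bool(re.search(r"\d", t.splitlines()[0]))),
    ("no banned buzzwords",
     lambda t: not any(w in t.lower() for w in ("synergy", "leverage", "game-changer"))),
    ("has a call to action",
     lambda t: "sign up" in t.lower() or "get started" in t.lower()),
]

def score(output: str) -> float:
    """Fraction of binary criteria the output passes (the 'pass rate')."""
    return sum(check(output) for _, check in CHECKLIST) / len(CHECKLIST)
```

Because every criterion is a hard yes/no, two runs over the same output always produce the same score, which is what makes the evaluation automatable.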
Artifacts and Feedback Loops
- The most valuable output of autoresearch is the changelog documenting every attempted change and its result — institutional knowledge that lets future models pick up where the last left off (from ole lehmann 10x claude skills autoresearch)
- The system catches changes that improve individual checklist items but hurt overall output quality — a tighter word count improved conciseness but degraded CTA quality, so the system reverted it (from ole lehmann 10x claude skills autoresearch)
- Running autoresearch autonomously with a live dashboard that auto-refreshes every 10 seconds lets you walk away while the agent iterates — stops when hitting 95%+ three times in a row (from ole lehmann 10x claude skills autoresearch)
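Putting the pieces above together, here is a minimal sketch of the keep-or-revert loop with a changelog and the 95%-streak stop rule; `propose` and `evaluate` are hypothetical stand-ins for "make one small change" and "score against the checklist":

```python
def autoresearch(prompt, propose, evaluate,
                 max_rounds=50, target=0.95, streak_needed=3):
    """Hill-climb: try one change, keep it if the checklist score improves,
    revert if not; stop after hitting the target three times in a row.
    Returns the best prompt plus a changelog of every attempt."""
    best_score = evaluate(prompt)
    changelog, streak = [], 0
    for round_no in range(1, max_rounds + 1):
        candidate = propose(prompt)            # one small change
        candidate_score = evaluate(candidate)
        kept = candidate_score > best_score
        changelog.append({"round": round_no, "change": candidate,
                          "score": candidate_score, "kept": kept})
        if kept:
            prompt, best_score = candidate, candidate_score
        # Count consecutive rounds at or above the target pass rate.
        streak = streak + 1 if candidate_score >= target else 0
        if streak >= streak_needed:            # e.g. 95%+ three times in a row
            break
    return prompt, best_score, changelog
```

The changelog records reverted attempts alongside kept ones, so a future session (or model) can see which directions were already tried and failed.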
Scaling: Automated Skill Tuning
- Combining autoresearch with Hamel Husain's evals-skills framework creates an automated pipeline where skills are continuously evaluated and improved — "auto-evals" for agent skill quality (from autoresearch skill tuning evals)
- Tuning 190+ skills requires automated pipelines rather than manual prompt engineering — running autoresearch "all day in the background" as a continuous improvement loop (from autoresearch skill tuning evals)
Scaling: Parallel Execution
- Parallel experiment execution (910 experiments in 8 hours) achieved 9x speedup over sequential autoresearch, at ~$300 compute + $9 Claude API (from parallel gpu autoresearch skypilot)
- Claude Code agent spontaneously developed emergent optimization: screening on cheaper H100s, promoting winners to H200s — autonomous resource allocation without instruction (from parallel gpu autoresearch skypilot)
- Key ML finding: scaling model width mattered more than every hyperparameter trick combined — a result sequential search likely would have missed due to limited exploration (from parallel gpu autoresearch skypilot)
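The screen-then-promote behavior can be sketched with a thread pool standing in for the two GPU tiers; all names here are illustrative, and `cheap_eval`/`full_eval` are assumptions standing in for the H100 screening runs and H200 finals:

```python
from concurrent.futures import ThreadPoolExecutor

def two_tier_search(configs, cheap_eval, full_eval, promote_top=3, workers=8):
    """Screen every config in parallel on the cheap tier, then re-run only
    the top few on the expensive tier before picking a winner."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        screened = list(pool.map(cheap_eval, configs))
    ranked = sorted(zip(configs, screened), key=lambda cs: cs[1], reverse=True)
    finalists = [cfg for cfg, _ in ranked[:promote_top]]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        finals = list(pool.map(full_eval, finalists))
    return max(zip(finalists, finals), key=lambda cf: cf[1])
```

The point of the pattern is budget allocation: most of the 910 experiments only ever touch the cheap tier, so the expensive tier is spent exclusively on candidates that already look promising.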
Scaling: Distributed Swarms
- Hyperspace generalizes autoresearch into a platform where users describe optimization problems in plain English and a distributed swarm solves them — zero code (from hyperspace agi autoswarms)
- 237 agents with zero human intervention ran 14,832 experiments across 5 domains: ML agents drove validation loss down 75%, search agents evolved 21 scoring strategies, finance agents achieved Sharpe 1.32 (from hyperspace agi autoswarms)
- A "playbook curator" distills why winning mutations work into reusable patterns, so new agents bootstrap from accumulated wisdom rather than starting cold (from hyperspace agi autoswarms)
Generalization
- The autoresearch pattern generalizes beyond prompts to anything measurable: one person optimized page load from 1100ms to 67ms in 67 rounds using the same try-measure-keep/revert loop (from ole lehmann 10x claude skills autoresearch)
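Because the loop only needs a scalar metric, it minimizes a latency as readily as it maximizes a pass rate. A minimal sketch, assuming only a `mutate` and `measure` pair supplied by the caller:

```python
def optimize(state, mutate, measure, rounds, lower_is_better=True):
    """Generic try-measure-keep/revert loop: works for checklist pass
    rates, page-load times, cost, or any other single number."""
    best = measure(state)
    for _ in range(rounds):
        candidate = mutate(state)
        m = measure(candidate)
        improved = m < best if lower_is_better else m > best
        if improved:                 # keep the change
            state, best = candidate, m
        # otherwise: implicit revert, candidate is discarded
    return state, best
```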
Full-Stack Research Agents
- Feynman (Claude-based agent) generates cited meta-analyses in 30 minutes, replicates experiments on Runpod, audits claims against code, simulates peer review — full research workflow automated (from feynman claude code for research)
- GPT-5.4 Pro is "an order of magnitude better" (per Terence Tao) for deep/architectural questions — use @steipete's 'oracle' tool to invoke it at the planning stage (from optimizing academic work with gpt 5.4 pro and coding agents)
Research-as-Agent-Workflow
- Power users are investing 1,200+ hours into Claude-based research workflows across AI papers, market analysis, and competitive intelligence, indicating Claude is becoming a primary research tool for knowledge workers (from claude research assistant prompts)
- Specialized prompt libraries for domain-specific research (papers, markets, competitive intel) are a key unlock for research agent productivity (from claude research assistant prompts)
- At ~100 articles and ~400K words, LLMs can handle complex Q&A against a personal wiki without fancy RAG — auto-maintained index files and brief document summaries are sufficient for the model to navigate the space (from llm powered personal knowledge bases)
- File query results back into the wiki after each research session — explorations and questions "add up" in the knowledge base, so every LLM interaction enhances future queries rather than disappearing into chat history (from llm powered personal knowledge bases)
- As a knowledge base repo grows, the natural evolution is synthetic data generation + finetuning to have the LLM "know" the data in its weights rather than relying on context windows — a long-term trajectory for personal knowledge systems (from llm personal knowledge base workflow)
- Monthly wiki health check: "flag contradictions between articles, find topics mentioned but never explained, list claims not backed by a source in raw/, suggest 3 new articles for gaps" — prevents error compounding when LLM outputs get filed back into the knowledge base (from nick spisak shared link)
- LLM Council method (Karpathy-inspired): instead of trusting one model's answer, 5 advisors with different thinking styles argue the same question, then anonymously peer-review each other — catches blind spots no individual perspective reveals (from link share without context)
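One way to sketch the council pattern: fan the question out to differently primed advisors, anonymize the answers, and let every advisor score every answer blind. `ask` and `rate` are hypothetical model-call stand-ins, not a real API:

```python
import random

PERSONAS = ["skeptic", "optimist", "historian", "engineer", "contrarian"]

def llm_council(question, ask, rate, personas=PERSONAS):
    """Fan one question out to differently primed advisors, then have
    every advisor peer-review every answer without attribution."""
    answers = [ask(p, question) for p in personas]
    random.shuffle(answers)  # reviewers never learn which persona wrote what
    scores = {a: sum(rate(r, a) for r in personas) for a in answers}
    return max(scores, key=scores.get)  # highest peer-rated answer wins
```

The anonymization step is what makes the peer review honest: advisors rate the content of each answer, not the reputation of its author.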
Voices
14 contributors
Andrej Karpathy
@karpathy
I like to train large deep neural nets. Previously Director of AI @ Tesla, founding team @ OpenAI, PhD @ Stanford.
Ole Lehmann
@itsolelehmann
I help non-technical people make more money with AI agents. AI connoisseur, robotics maxi, eu/acc supporter, dad, techno optimist
Alex Prompter
@alex_prompter
Marketing + AI = $$$ 🔑 @godofprompt (co-founder) 🎥 https://t.co/IodiF1Ra5f (co-founder)
Cody Schneider
@codyschneiderxx
follow for shitposting about the growth tactics i'm using to grow my startup building @graphed with @maxchehab Get Started Free - https://t.co/stXlkQBlSj
Nick Spisak
@NickSpisak_
| AI Transformation Engineer | Seven Figure E-Commerce Business Owner
Advait Paliwal
@advaitpaliwal
disciple of experience
Aniket Panjwani
@aniketapanjwani
I teach agentic coding to economists || PhD Economics Northwestern || Director of AI/ML @ Payslice || ex-MLOps @ Zelle
Griffin Hilly
@GriffinHilly
Bond trader by day, pro-humanism/nuclear on my own time. @MadiHilly’s biggest fan
James Bedford
@jameesy
engineering @zerion, building https://t.co/3RQYuZAcrt
Zhanghao Wu
@Michaelvll1
Building SkyPilot @skypilot_org | Co-creator of @lmsysorg, PhD @Berkeley_EECS @ucbrise. Prev: @MIT, @sjtu1896
George from 🕹prodmgmt.world
@nurijanian
Can I make everyone a great product manager? I will do my best | Get my product management OS + AI skills for Claude Code/Cursor: https://t.co/ngCnvp77SD
rahul
@rahulgs
head of applied ai @ ramp
Shiv
@shivsakhuja
Pontificating... / Vibe GTM-ing / Making Claude Code do non-coding things building a team of AI coworkers @ Gooseworks / prev @AthinaAI /@google / @ycombinator
Varun
@varun_mathur
Agentic General Intelligence @HyperspaceAI (Co-founder and CEO)