Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster

▲

Scaling Karpathy's Autoresearch: What Happens When the Agent Gets a GPU Cluster(blog.skypilot.co)

53 points byhopechong3 hours ago |7 comments

I feel like most of this recent Autoresearch trend boils down to reinventing hyper-parameter tuning. Is the SOTA still Bayesian optimization when given a small cluster? It was ~3 years ago when I was doing this kind of work, haven't kept up since then.

Also, shoutout SkyPilot! It's been a huge help for going multi-cloud with our training and inference jobs (getting GPUs is still a nightmare...)!

▲karpathy1 hour ago

Wrong and short-sighted take given that the LLM explores serially learning along the way, and can tool use and change code arbitrarily. It seems to currently default to something resembling hyperparameter tuning in absence of more specific instructions. I briefly considered calling the project “autotune” at first but I think “autoresearch” will prove to be the significantly more appropriate name.

▲achierius21 minutes ago

Out of curiosity, what sort of things have you seen it do that better fit 'autoresearch' than 'autotune' thus far? Optimizations it made that wouldn't be been surfaced by an autotune system, I suppose.

▲kraddypatties45 minutes ago

I can believe that in the long run.

Does the agent have access to arxiv (a brief skim of the README didn't have an answer)? If not, it could be that the current approach of relying on the model's weights only is resulting in the perceived local optimum of hyperparameter tuning.

Anecdotally, we built a little MCP for arxiv to help with our internal research, noticed a significant boost in the diversity of methods (architecture or otherwise) Claude and friends were able to reference.

▲corndoge46 minutes ago

Would you say it's fair to describe autoresearch as a form of neural architecture search? I am curious what you think the core differences are between them.

▲westurner29 minutes ago

Is there a cost to converge? And how much does it vary with the random seed?

Re: OpenCogPrime:EconomicAttentionAllocation https://news.ycombinator.com/item?id=45518074 and something about eWASM (edit) https://news.ycombinator.com/item?id=47171887 .. from https://news.ycombinator.com/item?id=46825026 re: eWASM and costed opcodes for agent efficiency

▲ipsum21 hour ago

Hyperparam tuning that has better intuition and can incorporate architecture changes automatically. It won't invent something completely new though.

▲kraddypatties1 hour ago

Hm, that's fair. It does feel like there's low hanging fruit in combining "old school" methods for conducting a hyperparameter sweep efficiently _with_ the higher level architecture edit ability of Autoresearch.

Probably would cut the number of runs down by a significant number (as far as I can tell it's doing a grid search once it decides to mess with a knob or section of the architecture).

▲zhwu1 hour ago

The most surprising part: the agent had access to both H100s and H200s. Without being told, it noticed H200s scored better and started screening ideas on H100s, then promoting winners to H200s for validation. That strategy emerged entirely on its own.

▲rogerrogerr1 hour ago

Why do we think this emerged “on its own”? Surely this technique has been discussed in research papers that are in the training set.

▲hhh1 hour ago

Why?… The experiment.yaml shows that it is calling h100/200 explicitly, it’s pretty common for humans to say “number bigger more gooder” for anything… Lie and reverse the values and see what happens. I would put money on a rabbit hole of complaining about it being misconfigured.

▲ed1 hour ago

Models are familiar with H100’s. They even predate ChatGPT.

▲Aboutplants1 hour ago

Yeah I thought that was a particularly neat part

▲fabmilo55 minutes ago

I am fascinated by this example of using AI to improve AI. I won a small prize using this technique on helion kernels at a pytorch hackathon in SF.

The next step are: - give the agent the whole deep learning literature research and do tree search over the various ideas that have been proposed in the past. - have some distributed notepad that any of these agents can read and improve upon.

▲covi1 hour ago

This feels like the chimpanzee with a power drill. An agent is honestly just brute-force search, but guided.

▲chaos_emergent1 hour ago

Human-driven research is also brute-force but with a more efficient search strategy. One can think of a parameter that represents research-search-space-navigation efficiency. RL-trained agents will inevitably optimize for that parameter. I agree with your statement insomuch as the value of that efficiency parameter is lower for agents than humans today.

It's really hard to imagine that they __won't__ exceed the human value for that efficiency parameter rather soon given that 1. there are plenty of scalar value functions that can represent research efficiency, of which a subset will result in robust training, and 2. that AI labs have a massive incentive to increase their research efficiency overall, along with billions of dollars and really good human researchers working on the problem.

▲groby_b47 minutes ago

Is there anything in the research space that doesn't fit "brute-force search, but guided"?

All of science is "gather inputs, make hypothesis, test, analyse" on repeat.

There's plenty to critique in the particular guidance approach, but the overall method is the same.

▲ipsum21 hour ago

A cluster is 2 nodes? That's technically true, but not very exciting.

▲saberience7 minutes ago

Wait, "Karpathy's Autoresearch", you mean a loop that prompts the agent to improve a thing given a benchmark?

People have been doing this for a year or more, Ralph loops etc.

I hate the weird strange Twitter world of hero-worship for folks that seems to arise just out of large followings.

Joe no-followers does this six months ago, nobody cares. Karpathy writes a really basic loop and it's now a kind of AI miracle prompting tons of grifters, copy-cats, weird hype.

I do wonder if LLMs have just made everyone seriously, seriously dumber all of a sudden. Most of the "Autoresearch" posts I see are completely rubbish, with AI optimizing for nonsense benchmarks and people failing to understand the graphs they are looking at. So yes, the AI made itself better at a useless benchmark while also making the code worse in 10 other ways you don't actually understand.