LLM Agents vs. Classical HPO: The Search Space Is the Whole Question
When Andrej Karpathy released his autoresearch repo earlier this year, the reaction was predictable. Posts declaring "AI does science now." Screenshots of LLM agents iterating on training loops. And one real question from data scientists: does this beat classical hyperparameter optimization?
The skeptical read is that autoresearch is just a new hyperparameter tuning algorithm dressed up in agent clothing. If that's true, classical HPO already has it beat on cost and sample efficiency. If it isn't, something more interesting is going on. This article works through what the evidence actually shows, and the answer is more specific than either side of the debate has landed on.
What the Benchmarks Show
Ferreira et al. ran the experiment most of us wanted. They defined a fixed hyperparameter search space, the kind you'd set up with ranges for learning rate, weight decay, batch size, and dropout. They pitted LLM agents against CMA-ES, TPE, random search, and two Bayesian optimization variants.
Classical methods beat LLM agents on sample efficiency and cost. This held across vision, tabular, and language benchmarks. The margins were wide, not a rounding error.
This matches the mechanics of why TPE and CMA-ES win here:
- TPE models P(x | y < y*) against P(x | y ≥ y*) and proposes the point that maximizes the ratio of the two densities. It learns the shape of the good region from every evaluation.
- CMA-ES adapts a covariance matrix that encodes the local geometry of the loss landscape.
- Both exploit the fact that hyperparameter response surfaces, while noisy, have structure.
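The TPE bullet above can be sketched in a few lines. This is a toy on a 1-D quadratic, not a production optimizer; the quantile `gamma`, the candidate count, and the KDE bandwidth defaults are arbitrary choices for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)

def toy_objective(x):
    # Noisy 1-D response surface with a minimum near x = 0.3
    return (x - 0.3) ** 2 + 0.01 * rng.normal()

# Seed with random evaluations
xs = list(rng.uniform(0, 1, 20))
ys = [toy_objective(x) for x in xs]

for _ in range(30):
    # Split observations at the gamma-quantile of the losses
    gamma = 0.25
    y_star = np.quantile(ys, gamma)
    good = np.array([x for x, y in zip(xs, ys) if y < y_star])
    bad = np.array([x for x, y in zip(xs, ys) if y >= y_star])

    # l(x) ~ P(x | y < y*), g(x) ~ P(x | y >= y*)
    l = gaussian_kde(good)
    g = gaussian_kde(bad)

    # Propose the candidate maximizing the density ratio l/g
    candidates = rng.uniform(0, 1, 100)
    x_next = candidates[np.argmax(l(candidates) / (g(candidates) + 1e-12))]
    xs.append(x_next)
    ys.append(toy_objective(x_next))

best_x = xs[int(np.argmin(ys))]
print(f"best x ~ {best_x:.3f}")  # concentrates near the true optimum at 0.3
```

Every evaluation sharpens the two density estimates, which is exactly the structure-exploitation the bullets describe and the part an LLM has no mechanism for.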
LLMs reason about your search space the way a well-read grad student would: priors from the literature, heuristics about what usually works, occasional useful insights. That helps, but it isn't geometric exploration. It also costs orders of magnitude more per evaluation in tokens and latency.
A decade of Bayesian optimization research did not become obsolete because a frontier model can read your train.py. Within a bounded, numeric, continuous search space, classical methods exploit structure the LLM must rediscover from scratch on every run.
Then Why Use LLM Agents at All?
Because of the second result, which cuts the other way. A convergence analysis of roughly 10,000 LLM-guided experiments (the "Auto Researching" paper) found that architectural choices explain 94% of performance variance across runs. Not learning rate. Not weight decay. Not the knobs in a standard HPO study.
That number matters. If 94% of what separates a good run from a bad one lives in the model's structure (layer count, attention variant, normalization scheme, tokenizer choice, auxiliary losses), then a hyperparameter sweep optimizes the remaining 6%. You can run the most sample-efficient TPE study in the world over a search space that excludes the change that matters. You'll get a tight, converged answer to the wrong question.
This is where LLM agents help. Classical HPO cannot propose:
- "Add a residual connection between the third and fifth blocks."
- "Swap the loss to InfoNCE and add a temperature parameter."
- "Rewrite the data loader to use a different tokenization scheme."
These aren't points in a search space. They change what the search space is. An LLM agent operates on the code, not a parametrized study, so it can propose them. It can also be wrong, expensively and confidently. That's why the benchmarks matter.
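One way to make the distinction concrete: a numeric trial is a point inside a fixed schema, while a structural proposal rewrites the schema and brings new knobs with it. A hypothetical sketch of the two object shapes; the class and field names are illustrative, not from any real tool.

```python
from dataclasses import dataclass, field

@dataclass
class NumericTrial:
    # A point inside a fixed, parametrized search space.
    params: dict          # e.g. {"lr": 3e-4, "dropout": 0.2}
    search_space_id: str  # which space these params belong to

@dataclass
class StructuralProposal:
    # A code-level change that *creates* a new search space.
    description: str   # "Swap the loss to InfoNCE"
    diff: str          # patch against train.py
    new_knobs: dict = field(default_factory=dict)  # knob -> (low, high)

# An HPO study can only emit NumericTrial objects; an agent can emit both.
t = NumericTrial(params={"lr": 3e-4, "dropout": 0.2}, search_space_id="space-v1")
p = StructuralProposal(
    description="Swap the loss to InfoNCE and add a temperature parameter",
    diff="--- a/train.py\n+++ b/train.py\n...",
    new_knobs={"temperature": (0.01, 1.0)},
)
print(p.new_knobs)
```

The `new_knobs` field is the point: accepting the proposal means the next numeric study runs over a space that did not exist before.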
The Reframe: It's a Search-Space Question
Combine the two results and the framing follows. Classical HPO excels at finding the best point in a space you've defined. LLM agents help when the space itself is wrong and the biggest wins sit outside your current knobs.
The decision isn't "which tool is better." Ask this first:
| Scenario | Recommendation |
|---|---|
| Architecture is fixed (hardware constraints, compliance, latency budget) | Use CMA-ES or TPE. An LLM agent will propose changes you can't ship. |
| Well-explored problem with strong priors (standard fine-tuning, known architectures) | Classical HPO. Numeric refinement is what these tools do best. |
| New domain, no community consensus yet (novel modality, unusual objective) | LLM agent. Sample inefficiency is the price of searching an unbounded space. |
| You're stuck: clean training, suspicious plateau after a respectable sweep | Use an agent that can propose structural changes outside your parametrization. |
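The table collapses to a short decision rule. A sketch only; the boolean flags are judgment calls you make about your project, not benchmark-derived constants.

```python
def choose_optimizer(architecture_fixed: bool,
                     strong_priors: bool,
                     plateaued_after_sweep: bool = False) -> str:
    # Fixed architecture or well-explored problem: classical HPO wins on
    # cost and sample efficiency inside the bounded space.
    if architecture_fixed or (strong_priors and not plateaued_after_sweep):
        return "classical (TPE / CMA-ES)"
    # New domain, or a suspicious plateau after a respectable sweep:
    # the win probably sits outside the current knobs.
    return "llm-agent"

print(choose_optimizer(architecture_fixed=True, strong_priors=True))
print(choose_optimizer(architecture_fixed=False, strong_priors=False))
```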
The Hybrid Pattern Is Probably the Durable Answer
The systems emerging from this debate (Centaur is the most discussed, though it's more a pattern than a specific tool) don't pick a side. They split the problem along the axis the benchmarks reveal: LLMs handle categorical and structural reasoning; classical optimizers handle numeric refinement.
Here's how it works:
- The agent proposes a structural change: new architecture, different loss, new data augmentation pipeline.
- That change defines a fresh search space, the numeric knobs within the new structure.
- A classical optimizer runs a focused, sample-efficient study inside that space.
- The result feeds back to the agent, which decides whether to keep the structural change and what to try next.
Neither component does the other's job.
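The four steps above can be sketched as an outer/inner loop. Everything here is schematic: `agent_propose`, `apply_patch`, and `train_and_eval` are stubs standing in for an LLM call, a code-patching step, and a real training run, and the inner study is plain random search for brevity where a real system would use TPE or CMA-ES.

```python
import random

# --- hypothetical stubs; a real system wires these to an LLM and a trainer ---
def agent_propose(code_state, history):
    return {"description": f"structural change #{len(history)}",
            "knobs": {"lr": (1e-4, 1e-1), "temperature": (0.01, 1.0)}}

def apply_patch(code_state, proposal):
    return code_state + " + " + proposal["description"]

def train_and_eval(code_state, params):
    # Toy surrogate for a validation loss.
    return (params["lr"] - 0.01) ** 2 + (params["temperature"] - 0.2) ** 2
# ----------------------------------------------------------------------------

def hybrid_search(n_structural=5, n_inner_trials=20, seed=0):
    rng = random.Random(seed)
    history = []                       # (description, best_params, best_loss)
    code_state = "baseline train.py"   # stand-in for the actual repo

    for _ in range(n_structural):
        # 1. Agent proposes a structural change plus the knobs it introduces.
        proposal = agent_propose(code_state, history)
        code_state = apply_patch(code_state, proposal)

        # 2-3. The new knobs define a fresh space; run a focused numeric study.
        best_params, best_loss = None, float("inf")
        for _ in range(n_inner_trials):
            params = {k: rng.uniform(lo, hi)
                      for k, (lo, hi) in proposal["knobs"].items()}
            loss = train_and_eval(code_state, params)
            if loss < best_loss:
                best_params, best_loss = params, loss

        # 4. Feed the result back; the agent keeps or reverts the change.
        history.append((proposal["description"], best_params, best_loss))
    return history

results = hybrid_search()
```

The division of labor is visible in the structure: the agent only ever touches `code_state` and the knob schema; the inner loop only ever touches numbers.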
This changes the MLOps picture for production teams. The question is no longer "where do I run my HPO job." It's "how do I orchestrate a loop that alternates between structural proposal and numeric optimization, tracks which search space each trial belongs to, and doesn't burn a week of GPU time on an agent that stopped making progress."
Platforms that handle this well treat both optimizer types as interchangeable components of a single search. The agent is another suggester. Tag its proposals so you can tell later whether a given win came from architectural reasoning or numeric refinement.
What to Do Monday Morning
Before running any optimization, classical or otherwise, decide whether your architecture is in play. A lot of wasted compute sweeps the 6% slice of variance because the 94% slice felt risky to touch. If you're in an unfamiliar domain, let the architecture move before you tune the learning rate.
If your architecture is fixed, don't reach for an LLM agent because it's new. The benchmarks are clear: you'll pay more, wait longer, and get a worse answer. TPE and CMA-ES earned their place, and the reason hasn't changed.
If you're building infrastructure for yourself or your team, design for the hybrid case from the start:
- Tag which optimizer proposed each trial.
- Track structural changes separately from numeric ones.
- Record which search space a trial belongs to.
Knowing why a run won has high diagnostic value. It separates a workflow that improves over time from one where every new problem starts from zero.
The "AI does science now" framing will keep generating headlines, and some will be accurate. The working version of this story is quieter. Classical HPO is still the right tool for most of what you do. LLM agents are a new capability for the part you used to leave alone. The teams that win will stop asking which tool is better and start asking which search space they're in.
Further Reading
- Karpathy, A. autoresearch. The primary source for the current wave. Read it as code, not as a paper. github.com/karpathy/autoresearch
- Ferreira et al. (2026). Can LLMs Beat Classical HPO Algorithms? The contrarian benchmark. arxiv.org/abs/2603.24647
- Auto Researching, not hyperparameter tuning (2026). The 10,000-experiment convergence analysis and the 94% variance finding. arxiv.org/html/2603.15916

