What if LLMs formulated optimization problems instead of solving them? FormulationEvolve is an LLM-powered optimization router that classifies problems and routes them to classical solvers — Bayesian optimization, MILP, CMA-ES, Thompson Sampling — instead of synthesizing code directly. 2 LLM calls where AlphaEvolve needs hundreds.
AlphaEvolve and similar systems use LLMs to synthesize and mutate code — hundreds of LLM calls generating, evaluating, and evolving candidate programs. FormulationEvolve takes a different approach: the LLM formulates the problem (classifies type, defines variables, sets bounds, chooses constraints), then a classical solver does the actual optimization.
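The formulate-then-route flow can be sketched as follows. This is a minimal illustration, not FormulationEvolve's actual API: the function names, the JSON schema, and the random-search stand-in for a real solver backend are all assumptions, and the second LLM call (summarizing results) is omitted.

```python
# Sketch: one LLM call produces a structured formulation; a classical
# solver (here a random-search stand-in) does all the optimization.
import json
import random

def formulate(problem_text, llm):
    """LLM call 1: classify the problem and emit a formulation as JSON."""
    prompt = ("Classify this problem (black_box/milp/evolutionary/bandit) "
              "and define variables with bounds as JSON:\n" + problem_text)
    return json.loads(llm(prompt))

def random_search(spec, evaluation_fn, trials=50):
    """Stand-in for a real backend (BO, MILP, CMA-ES, Thompson Sampling)."""
    best_x, best_y = None, float("inf")
    for _ in range(trials):
        x = {v["name"]: random.uniform(v["low"], v["high"])
             for v in spec["variables"]}
        y = evaluation_fn(x)
        if y < best_y:
            best_x, best_y = x, y
    return best_x, best_y

def solve(problem_text, llm, evaluation_fn):
    spec = formulate(problem_text, llm)        # the only LLM involvement
    return random_search(spec, evaluation_fn)  # pure classical search

# Toy usage: minimize (x - 3)^2 with a canned "LLM" response.
random.seed(0)
fake_llm = lambda _: json.dumps(
    {"type": "black_box",
     "variables": [{"name": "x", "low": 0.0, "high": 10.0}]})
x, y = solve("pick x in [0,10] minimizing (x-3)^2", fake_llm,
             lambda p: (p["x"] - 3) ** 2)
```

The point of the structure: once `spec` exists, the search loop never touches the LLM again, which is where the 2-calls-vs-hundreds cost difference comes from.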
The analogy: LLMs are good at understanding problems and terrible at optimization. Solvers are good at optimization and terrible at understanding problems. Let each do what it's good at.
Across 9 experiments, no problem required Level 3. Most resolved at Level 1 (2 LLM calls). Circle packing needed Level 2 (iterative penalty weight refinement, 4 LLM calls).
Five problems validating each solver backend. All resolve at Level 1 except circle packing (Level 2).
Scheduler Policy Optimization: Instead of solving one scheduling instance, optimize parameters of a heuristic across 20 sampled instances per trial. Result: a reusable scheduling policy achieving 1.05× MILP optimal on 50 held-out instances. 2 LLM calls + 0.38s of BO produced a general-purpose heuristic, not a one-shot solution.
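The instance-sampling setup can be illustrated with a toy version: each optimizer trial scores a parameterized heuristic on a fresh batch of sampled instances, so the tuned result is a policy, not a one-shot schedule. The instance generator, the one-parameter priority heuristic, and the grid search standing in for BO are all illustrative assumptions, not the experiment's actual scheduler.

```python
# Sketch: optimize a heuristic's parameter over sampled instances,
# not a single instance.
import random

def sample_instance(rng):
    """A toy scheduling instance: 8 jobs with (duration, weight)."""
    return [(rng.uniform(1, 10), rng.uniform(1, 5)) for _ in range(8)]

def heuristic_cost(instance, alpha):
    """Order jobs by a weighted priority score; return weighted completion time."""
    order = sorted(instance, key=lambda j: j[0] - alpha * j[1])
    t = cost = 0.0
    for dur, w in order:
        t += dur
        cost += w * t
    return cost

def policy_objective(alpha, n_instances=20, seed=0):
    """One optimizer trial = average cost over a batch of sampled instances."""
    rng = random.Random(seed)  # same batch for every alpha, fair comparison
    return sum(heuristic_cost(sample_instance(rng), alpha)
               for _ in range(n_instances)) / n_instances

# A coarse grid stands in for BO over the single policy parameter alpha.
best_alpha = min((a / 10 for a in range(0, 31)), key=policy_objective)
```

Because the objective averages over instances, the winning `alpha` generalizes to held-out instances in the way the 1.05× result describes, rather than overfitting one problem.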
BLIS (Blackbox LLM Inference Simulator) is a Go CLI tool simulating LLM serving clusters. It takes configuration flags and returns JSON metrics (TTFT, throughput, latency percentiles). The optimizer has no analytical model — BLIS is a pure black box.
Integration required only ~15 lines of solver code changes (an evaluation_fn callback).
The LLM classifies the problem as black_box, formulates variable definitions, and BO calls BLIS for each candidate evaluation.
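An `evaluation_fn` bridge of roughly this shape is all the integration requires: run the CLI with candidate parameters, parse the JSON it prints, return one metric. The flag convention and metric name below are assumptions about BLIS's interface, not its documented API; the stubbed runner stands in for the real binary.

```python
# Sketch of the evaluation_fn callback: shell out to the BLIS CLI and
# read one metric from its JSON output.
import json
import subprocess

def make_evaluation_fn(binary, metric, runner=subprocess.run):
    """Return a callback the solver can score candidates with."""
    def evaluation_fn(params):
        # Turn {"batch_size": 32} into ["--batch_size", "32"] (assumed flag style).
        flags = [f for k, v in sorted(params.items())
                 for f in (f"--{k}", str(v))]
        proc = runner([binary, *flags], capture_output=True, text=True,
                      check=True)
        return json.loads(proc.stdout)[metric]
    return evaluation_fn

# Usage with a stubbed runner standing in for the real BLIS binary.
class _FakeProc:
    stdout = json.dumps({"ttft_p99_ms": 33.38})

fake_runner = lambda cmd, **kw: _FakeProc()
score = make_evaluation_fn("blis", "ttft_p99_ms", runner=fake_runner)
value = score({"batch_size": 32, "scheduler": "fifo"})
```

Because the callback treats the simulator as an opaque `params -> metric` function, the same bridge works unchanged for BO, CMA-ES, or any other black-box backend.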
| Objective | LLM+BO | Pure BO | CMA-ES |
|---|---|---|---|
| E2E P99 latency | 7,558ms | 7,558ms | 9,760ms |
| Throughput | Failed (bug) | 2,098 tok/s | 698 tok/s |
| TTFT P99 | 33.38ms | 33.38ms | 38.68ms |
On 2 of 3 objectives, the LLM independently chose the same variables, bounds, and types as a human expert would — and BO converged to identical solutions. The LLM's formulation was as good as expert-specified bounds. CMA-ES was fastest but 15-29% worse — missing categorical variables (scheduler, routing policy) that proved decisive.
The throughput failure exposed a direction-handling bug (maximize vs minimize mismatch in the pipeline) — a known issue, not a formulation problem.
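This class of bug comes from minimizers and maximizers disagreeing on sign conventions; normalizing every objective to "minimize" at one boundary prevents it. A minimal guard, with an illustrative toy objective (not the pipeline's actual fix):

```python
# Normalize direction once, at the solver boundary, so the inner
# optimizer always minimizes.
def as_minimization(objective_fn, direction):
    """Wrap an objective so the inner solver always minimizes."""
    if direction == "minimize":
        return objective_fn
    if direction == "maximize":
        return lambda x: -objective_fn(x)  # flip the sign in one place only
    raise ValueError(f"unknown direction: {direction!r}")

throughput = lambda x: 2098.0 - (x - 4) ** 2   # toy objective, peak at x = 4
loss = as_minimization(throughput, "maximize")
# Minimizing `loss` over a grid recovers the throughput-maximizing x.
best_x = min(range(0, 9), key=loss)
```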
The conceptual leap: applying the framework from Deep Multi-User Reinforcement Learning for Distributed Dynamic Spectrum Access (Naparstek & Cohen, IEEE JSAC 2017) to LLM serving policy optimization.
| DQSA (2017) | FormulationEvolve (2026) |
|---|---|
| DQN architecture = policy family | LLM selects scorers, scheduler, ranges = policy family |
| State = sensing history + actions | State = queue_depth, kv_utilization, in_flight_requests |
| RL training finds θ* | BO/CMA-ES finds θ* (scorer weights, infra params) |
The key separation: LLM formulates WHAT to optimize (state features, policy structure, parameter ranges). Classical optimizer finds HOW (optimal parameter values within that structure).
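For the serving-policy case, the split looks roughly like this: the LLM fixes the policy structure (which state features feed a scorer), and the optimizer only searches the weights. The feature names follow the table above; the linear scorer form and the best-of-candidates "optimizer" are illustrative assumptions.

```python
# WHAT (LLM): the state features and scorer structure are fixed up front.
FEATURES = ("queue_depth", "kv_utilization", "in_flight_requests")

def make_scorer(theta):
    """Policy family: a linear score over the LLM-chosen state features."""
    def score(state):
        return sum(w * state[f] for w, f in zip(theta, FEATURES))
    return score

# HOW (classical optimizer): search only within that structure.
def tune(evaluate, candidates):
    return min(candidates, key=lambda th: evaluate(make_scorer(th)))

# Toy tuning run: the hidden "true" cost depends only on queue depth,
# so the best scorer weights queue_depth by 3 and ignores the rest.
states = [{"queue_depth": q, "kv_utilization": 0.5, "in_flight_requests": 2}
          for q in (1, 5, 9)]
target = lambda s: 3.0 * s["queue_depth"]
evaluate = lambda scorer: sum((scorer(s) - target(s)) ** 2 for s in states)
best = tune(evaluate, [(1.0, 0.0, 0.0), (3.0, 0.0, 0.0), (5.0, 0.0, 0.0)])
```

Swapping `tune` for BO or CMA-ES changes the HOW; the WHAT (features and scorer shape) stays fixed, exactly as the DQN architecture stays fixed during RL training in DQSA.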
| Tier | Aggregate P99 | Best Workload Win | Variables |
|---|---|---|---|
| Joint ⭐ | 4,881ms | bursty_heavy: 2,586ms | ~12 mixed |
| Scenario | 6,831ms | mixed_medium: 6,566ms | ~21 (7×3) |
| Linear | 6,881ms | steady_light: 5,409ms | ~7 continuous |
| BLIS defaults | ~9,900ms | — | — |
BO discovered that rate-limiting during bursts (capacity=564, refill=3969) reduces bursty_heavy P99 from 8,640ms to 2,586ms. This is load shedding: protecting admitted requests at the cost of rejecting some. The LLM's contribution was including token-bucket as an option in the policy family; BO found the optimal parameters within that structure.
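The mechanism BO tuned is a standard token bucket: admit a request only when a token is available, shedding the excess during bursts. A minimal sketch, assuming the refill parameter is tokens per second (the text does not specify the unit):

```python
# Token-bucket admission control: capacity bounds burst headroom,
# refill_rate bounds sustained admission rate.
class TokenBucket:
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity          # burst headroom (e.g. 564)
        self.refill_rate = refill_rate    # tokens added per second (e.g. 3969)
        self.tokens = float(capacity)
        self.last = 0.0

    def admit(self, now):
        """Return True to admit the request, False to shed it."""
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False

# An instantaneous burst of 1000 requests: only `capacity` are admitted;
# the rest are shed, protecting tail latency for the admitted ones.
bucket = TokenBucket(capacity=564, refill_rate=3969)
admitted = sum(bucket.admit(now=0.0) for _ in range(1000))
```

The two tuned parameters trade off directly against each other: capacity sets how deep a burst gets through before shedding starts, refill_rate sets the steady-state ceiling.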
Scenario tier had 21 variables with only 30 BO trials — severely under-sampled. Joint tier: one compact but expressive config (12 variables) that generalizes across all workloads. The DQSA paper observed the same thing: a well-designed compact architecture outperforms larger but poorly structured alternatives.
Separating policy formulation (LLM) from parameter optimization (BO) produces better results than either alone. The LLM designs the policy family structure; BO searches within it. This is the exact pattern from DQSA — DQN architecture = policy family, RL training = parameter optimization — applied without neural networks.
The LLM's contribution is choosing what to optimize (variable selection, bounds, types, policy structure), not how to optimize. When the LLM's formulation matches expert knowledge, results are identical to hand-crafted setups. For novel problems, the LLM designs adaptive strategies with genuine domain reasoning — different strategies per objective, with explanations.
A compact but expressive joint policy (12 variables) beat per-regime configs (21 variables) with limited trial budget. The right policy family reduces the effective search space. Same observation as DQSA: architecture design (policy family) matters more than brute-force parameter search.
Parameter tuning: 2 calls vs ~100+ in evolve-style synthesis. Request scheduling: 2 calls + 50ms MILP. Policy optimization: 2 calls + 90 BLIS evaluations → 25.8% improvement, under 2 minutes total. The cost advantage scales with problem complexity.
9 experiments · 4 solver backends · BLIS simulator integration · March 2026
LLM: Claude Sonnet 4 → Claude Opus 4.6 via LiteLLM gateway