
FormulationEvolve

What if LLMs formulated optimization problems instead of solving them? FormulationEvolve is an LLM-powered optimization router that classifies problems and routes them to classical solvers — Bayesian optimization, MILP, CMA-ES, Thompson Sampling — instead of synthesizing code directly. 2 LLM calls where AlphaEvolve needs hundreds.

Tags: optimization · LLM-as-formulator · BLIS simulator · DQSA analogy · policy learning
Contents
The Thesis · Three-Level Hierarchy · Showcase: 5 Problems, 4 Solvers · Black-Box Optimization (BLIS) · DQSA-Inspired Policy Learning · Results & Discoveries · Key Takeaways

The Thesis

AlphaEvolve and similar systems use LLMs to synthesize and mutate code — hundreds of LLM calls generating, evaluating, and evolving candidate programs. FormulationEvolve takes a different approach: the LLM formulates the problem (classifies type, defines variables, sets bounds, chooses constraints), then a classical solver does the actual optimization.

2 LLM calls per problem (classify + formulate)
4 solver backends (BO, MILP, CMA-ES, Bandit)
1.05× of MILP optimal (reusable heuristic)

The analogy: LLMs are good at understanding problems and terrible at optimization. Solvers are good at optimization and terrible at understanding problems. Let each do what it's good at.
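That division of labor can be sketched as a tiny router: one LLM call classifies, a second formulates, and a lookup dispatches to a solver backend. Everything below (the type names, `SOLVERS`, the `classify_fn` stand-in) is illustrative, not the project's actual API.

```python
# Map LLM-assigned problem types to solver backends. The type names are
# illustrative stand-ins for whatever taxonomy the classifier uses.
SOLVERS = {
    "black_box": "bayesian_optimization",
    "assignment": "milp",
    "continuous_nonconvex": "cma_es",
    "noisy_selection": "thompson_sampling",
}

def route(problem_description, classify_fn):
    """classify_fn stands in for the first LLM call; the returned string
    names the backend that does the actual optimization."""
    problem_type = classify_fn(problem_description)
    # Unknown types fall back to the most general backend.
    return SOLVERS.get(problem_type, "bayesian_optimization")
```

The second LLM call would then produce the formulation (variables, bounds, constraints) that the chosen backend consumes.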


Three-Level Hierarchy

Level 1 — Direct: Classify & Solve
LLM classifies the problem type and generates a formulation; the solver runs once. 2 LLM calls. Used by most problems.
Level 2 — Formulation: Iterative Refinement
LLM iterates on the formulation (penalty weights, constraint encoding) based on solver feedback. 3-4 LLM calls.
Level 3 — Synthesis: Code Generation
Full code synthesis for truly open-ended problems. The AlphaEvolve approach, reserved for when formulation isn't enough.

Across 9 experiments, no problem required Level 3. Most resolved at Level 1 (2 LLM calls). Circle packing needed Level 2 (iterative penalty weight refinement, 4 LLM calls).
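The escalation logic itself is a short loop: try the cheapest level, escalate only on failure. A minimal sketch, assuming each level is a callable and acceptance is a user-supplied check (both names are illustrative):

```python
def solve_with_escalation(problem, levels, is_acceptable):
    """Run Level 1 (direct) first, then Level 2 (refinement), then
    Level 3 (synthesis), stopping at the first acceptable result.
    `levels` is an ordered list of callables."""
    result = None
    for level, attempt in enumerate(levels, start=1):
        result = attempt(problem)
        if is_acceptable(result):
            return level, result
    # Exhausted all levels: return the last attempt anyway.
    return len(levels), result
```

In the experiments above, `is_acceptable` fired at Level 1 for most problems and at Level 2 for circle packing.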


Exp 1–5 · Showcase: 5 Problems, 4 Solvers

Five problems validating each solver backend. All resolve at Level 1 except circle packing (Level 2).

Exp 1 — Bayesian Optimization: LLM Scheduler Parameter Tuning
Tune batch size, max tokens, priority weight, and prefill chunk for throughput under a latency constraint.
2 LLM calls · 100 trials · 2.3s
Exp 2 — MILP: Request Scheduling
Assign 10 requests to 4 workers. The LLM generates a binary assignment MILP; the solver finds the optimum in 50ms.
2 LLM calls · 0.05s solve
Exp 3 — CMA-ES (Level 2): Circle Packing
Pack 5 circles into the smallest square. The LLM iterates on penalty weights across 3 formulation rounds.
4 LLM calls · 3 iterations · 1s
Exp 4 — Thompson Sampling: KV Cache Eviction Policy
Select the best eviction policy from 6 candidates under noisy observations. ARC correctly identified in 1,000 rounds.
2 LLM calls · 0.03s · 81% exploit
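Exp 4's bandit backend is easy to illustrate with a self-contained Bernoulli Thompson Sampler; the reward model and success rates below are made up for the sketch and are not the experiment's actual data.

```python
import random

def thompson_select(true_rates, rounds=1000, seed=0):
    """Pick the best of k candidate policies under noisy (Bernoulli)
    observations: sample each arm's Beta posterior, pull the argmax,
    update. Returns the most-pulled arm and the pull counts."""
    rng = random.Random(seed)
    k = len(true_rates)
    wins, losses, pulls = [1] * k, [1] * k, [0] * k   # Beta(1, 1) priors
    for _ in range(rounds):
        samples = [rng.betavariate(wins[i], losses[i]) for i in range(k)]
        arm = max(range(k), key=samples.__getitem__)
        reward = rng.random() < true_rates[arm]   # simulated noisy observation
        wins[arm] += reward
        losses[arm] += 1 - reward
        pulls[arm] += 1
    return max(range(k), key=pulls.__getitem__), pulls
```

With well-separated rates, the sampler spends most of its pulls on the best arm, which is the "81% exploit" behavior the card reports.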
Exp 5 — The key differentiator

Scheduler Policy Optimization: Instead of solving one scheduling instance, optimize parameters of a heuristic across 20 sampled instances per trial. Result: a reusable scheduling policy achieving 1.05× MILP optimal on 50 held-out instances. 2 LLM calls + 0.38s of BO produced a general-purpose heuristic, not a one-shot solution.
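The Exp 5 pattern — score each candidate parameter on a batch of sampled instances rather than one instance — can be sketched with random search standing in for BO. Every name here is illustrative, not the project's API.

```python
import random

def tune_policy(evaluate, sample_instance, n_trials=50, n_instances=20, seed=0):
    """Random-search stand-in for the BO loop: score each candidate
    heuristic weight on a fresh batch of sampled scheduling instances
    and keep the best average cost."""
    rng = random.Random(seed)
    best_w, best_cost = None, float("inf")
    for _ in range(n_trials):
        w = rng.uniform(0.0, 1.0)                     # one heuristic parameter
        batch = [sample_instance(rng) for _ in range(n_instances)]
        cost = sum(evaluate(w, inst) for inst in batch) / n_instances
        if cost < best_cost:
            best_w, best_cost = w, cost
    return best_w, best_cost
```

Because each trial averages over fresh instances, the winner is a policy that generalizes, not a solution overfit to one instance — the reason the result holds up on held-out instances.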


Exp 6–7 · Black-Box Optimization — BLIS Simulator

BLIS (Blackbox LLM Inference Simulator) is a Go CLI tool simulating LLM serving clusters. It takes configuration flags and returns JSON metrics (TTFT, throughput, latency percentiles). The optimizer has no analytical model — BLIS is a pure black box.

Integration required only ~15 lines of solver code changes (an evaluation_fn callback). The LLM classifies the problem as black_box, formulates variable definitions, and BO calls BLIS for each candidate evaluation.
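An evaluation_fn wrapper around a CLI black box could look roughly like this; the flag spelling and the JSON metric key are assumptions, since BLIS's real interface isn't shown here.

```python
import json
import subprocess

def build_cmd(binary, params):
    """Turn a candidate point into CLI flags (flag names are assumed)."""
    return [binary] + [f"--{k}={v}" for k, v in sorted(params.items())]

def make_evaluation_fn(binary, metric):
    """Wrap the simulator as an evaluation_fn callback: run the CLI,
    parse its JSON metrics, return a single number to the solver."""
    def evaluation_fn(params):
        out = subprocess.run(build_cmd(binary, params),
                             capture_output=True, text=True, check=True)
        return json.loads(out.stdout)[metric]
    return evaluation_fn
```

The solver never sees anything but params in, scalar out — which is exactly why the swap cost only ~15 lines.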

Benchmark: FormulationEvolve vs Pure BO vs CMA-ES

Objective        | LLM+BO       | Pure BO     | CMA-ES
E2E P99 latency  | 7,558ms      | 7,558ms     | 9,760ms
Throughput       | Failed (bug) | 2,098 tok/s | 698 tok/s
TTFT P99         | 33.38ms      | 33.38ms     | 38.68ms
Insight

On 2 of 3 objectives, the LLM independently chose the same variables, bounds, and types a human expert would, and BO converged to identical solutions: the LLM's formulation was as good as expert-specified bounds. CMA-ES was fastest but 15–29% worse, since its search space lacked the categorical variables (scheduler, routing policy) that proved decisive.

The throughput failure exposed a direction-handling bug (maximize vs minimize mismatch in the pipeline) — a known issue, not a formulation problem.
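One way to rule out that class of bug is to normalize direction once, at the solver boundary, so the pipeline always minimizes internally. A sketch, not the project's actual fix:

```python
def as_minimization(evaluation_fn, direction):
    """Negate a 'maximize' objective exactly once, at the boundary,
    so every downstream solver can assume minimization."""
    sign = -1.0 if direction == "maximize" else 1.0
    return lambda params: sign * evaluation_fn(params)
```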


Exp 8–9 · DQSA-Inspired Policy Learning

The conceptual leap: applying the framework from Deep Multi-User Reinforcement Learning for Distributed Dynamic Spectrum Access (Naparstek & Cohen, IEEE JSAC 2017) to LLM serving policy optimization.

DQSA (2017)                       | FormulationEvolve (2026)
DQN architecture = policy family  | LLM selects scorers, scheduler, ranges = policy family
State = sensing history + actions | State = queue_depth, kv_utilization, in_flight_requests
RL training finds θ*              | BO/CMA-ES finds θ* (scorer weights, infra params)

The key separation: LLM formulates WHAT to optimize (state features, policy structure, parameter ranges). Classical optimizer finds HOW (optimal parameter values within that structure).
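A hypothetical shape for the WHAT side of that split: the LLM emits structure and ranges, never values. Every field name below is an assumption for illustration, not the project's real schema.

```python
# What the LLM produces: a policy family (features, variables, ranges).
formulation = {
    "state_features": ["queue_depth", "kv_utilization", "in_flight_requests"],
    "variables": [
        {"name": "w_queue", "type": "continuous", "bounds": [0.0, 5.0]},
        {"name": "scheduler", "type": "categorical",
         "choices": ["fcfs", "sjf", "priority"]},
    ],
    "objective": {"metric": "e2e_p99_ms", "direction": "minimize"},
}
# The HOW (concrete values for w_queue and scheduler) is left to BO/CMA-ES.
```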

Three Policy Family Tiers (Exp 9)

Linear — Weighted Scoring
score = Σ_j w_j × scorer_j(state). Fixed categoricals; BO learns the weights. ~7 variables.
Joint ⭐ — Categorical + Numeric
BO explores scheduler × priority × admission combinations jointly with scorer weights. ~12 variables.
Scenario — Per-Regime Meta-Policy
Different configs per workload regime. ~21 variables (7 per regime × 3 regimes).
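The linear tier's scoring rule is simple enough to show directly; the two scorers below are illustrative stand-ins for whatever state features the LLM selects.

```python
def linear_score(weights, scorers, state):
    """Linear tier: score = Σ_j w_j × scorer_j(state)."""
    return sum(w * s(state) for w, s in zip(weights, scorers))

# Illustrative scorers over the state features named above.
scorers = [lambda st: st["queue_depth"], lambda st: st["kv_utilization"]]
state = {"queue_depth": 4, "kv_utilization": 0.8}
linear_score([0.5, 2.0], scorers, state)   # 0.5*4 + 2.0*0.8 = 3.6
```

BO's job in this tier is just the weight vector; the joint tier additionally searches the categorical choices around it.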

Results & Discoveries

E2E P99 Latency — Cross-Workload (3 profiles)

Tier          | Aggregate P99 | Best Workload Win     | Variables
Joint ⭐      | 4,881ms       | bursty_heavy: 2,586ms | ~12 mixed
Scenario      | 6,831ms       | mixed_medium: 6,566ms | ~21 (7×3)
Linear        | 6,881ms       | steady_light: 5,409ms | ~7 continuous
BLIS defaults | ~9,900ms      |                       |
25.8% improvement over previous best (Exp 8)
3.3× bursty_heavy P99 reduction via token-bucket admission
2 LLM calls total (formulate + refine)
The breakthrough: token-bucket admission

BO discovered that rate-limiting during bursts (capacity=564, refill=3969) reduces bursty_heavy P99 from 8,640ms → 2,586ms. This is load shedding — protecting admitted requests at the cost of rejecting some. The LLM's contribution was including token-bucket as an option in the policy family. BO found the optimal parameters within that structure.
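The mechanism itself is the classic token bucket; this sketch shows where the two BO-found parameters plug in, without claiming to match BLIS's internal implementation or units.

```python
class TokenBucket:
    """Token-bucket admission control: admit while tokens remain, refill
    continuously, shed load during bursts. The BO-found values
    (capacity=564, refill=3969) would be the constructor arguments."""
    def __init__(self, capacity, refill_rate):
        self.capacity = capacity
        self.refill_rate = refill_rate   # tokens added per second
        self.tokens = capacity
        self.last = 0.0

    def admit(self, now, cost=1.0):
        # Refill proportionally to elapsed time, capped at capacity.
        self.tokens = min(self.capacity,
                          self.tokens + (now - self.last) * self.refill_rate)
        self.last = now
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False   # reject: shed load to protect admitted requests
```

Rejections during a burst are what keep the tail of the admitted requests short — the load-shedding trade described above.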

Why joint beat scenario

The scenario tier had 21 variables but only 30 BO trials, leaving it severely under-sampled. The joint tier's single compact yet expressive config (12 variables) generalizes across all workloads. The DQSA paper observed the same thing: a well-designed compact architecture outperforms larger but poorly structured alternatives.

BLIS Configuration Insights

SJF is a P99 killer
Shortest Job First systematically delays long requests. For P99 (tail latency), those starved requests are exactly what the metric measures. The LLM correctly diagnosed this after seeing per-workload results.
SLO-based priority rescues stragglers
Combined with FCFS, SLO-based priority promotes long-waiting requests. Directly targets the tail — the requests that would otherwise inflate P99.
Fewer instances can be better
Joint tier used 7 instances vs linear's 12. With token-bucket controlling admission, fewer instances with better per-instance utilization achieve lower tail latency.
Categorical search is essential
The winning combo (FCFS + SLO-based + token-bucket) would never be found by continuous-only optimization. BO's categorical exploration was decisive.
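The FCFS + SLO-priority interaction can be sketched as a sort key: rank by SLO-normalized waiting time so stragglers get promoted, with arrival order breaking ties. Field names are illustrative, not BLIS's schema.

```python
def slo_order(now, requests):
    """Serve requests by SLO-normalized waiting time (highest first),
    promoting long-waiting requests before they inflate P99; ties fall
    back to arrival order (FCFS)."""
    return sorted(requests,
                  key=lambda r: (-(now - r["arrival"]) / r["slo"], r["arrival"]))
```

A pure-SJF key, by contrast, would sort on predicted length alone and push exactly these stragglers to the back.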

Key Takeaways

1. The DQSA analogy works

Separating policy formulation (LLM) from parameter optimization (BO) produces better results than either alone. The LLM designs the policy family structure; BO searches within it. This is the exact pattern from DQSA — DQN architecture = policy family, RL training = parameter optimization — applied without neural networks.

2. LLMs add value through problem understanding

The LLM's contribution is choosing what to optimize (variable selection, bounds, types, policy structure), not how to optimize. When the LLM's formulation matches expert knowledge, results are identical to hand-crafted setups. For novel problems, the LLM designs adaptive strategies with genuine domain reasoning — different strategies per objective, with explanations.

3. Policy family richness > per-regime specialization

A compact but expressive joint policy (12 variables) beat per-regime configs (21 variables) with limited trial budget. The right policy family reduces the effective search space. Same observation as DQSA: architecture design (policy family) matters more than brute-force parameter search.

4. 2 LLM calls, not hundreds

Parameter tuning: 2 calls vs ~100+ in evolve-style synthesis. Request scheduling: 2 calls + 50ms MILP. Policy optimization: 2 calls + 90 BLIS evaluations → 25.8% improvement, under 2 minutes total. The cost advantage scales with problem complexity.


9 experiments · 4 solver backends · BLIS simulator integration · March 2026
LLM: Claude Sonnet 4 → Claude Opus 4.6 via LiteLLM gateway
