The Synthesis Gap
Most benchmarks for LLM-generated hardware descriptions (Verilog, VHDL) test whether the code passes functional simulation: given a set of inputs, does it produce the expected outputs? This is necessary but not sufficient. Hardware must also synthesize: a synthesis tool must be able to convert the description into an actual circuit. Simulation-correct code that fails synthesis is useless in practice.
Benchmarking 32 LLMs on 202 Verilog tasks through actual hardware synthesis reveals a systematic divergence: proprietary models fail late and open-weight models fail early.
Proprietary models (GPT-4, Claude) produce code that compiles and simulates correctly but fails during elaboration or synthesis: the tools can parse the code but cannot map it onto hardware. The failure mode is subtle: constructs that are valid in simulation (delays, dynamic memory allocation, non-static loop bounds) but have no hardware equivalent. The code is functionally correct software pretending to be hardware.
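A minimal sketch of this failure mode (a constructed example, not taken from the benchmark): the module below simulates correctly in any event-driven simulator, but the `#5` delay has no hardware equivalent, so a synthesis tool either rejects it or silently ignores the timing.

```verilog
// Hypothetical example of simulation-correct, synthesis-hostile code.
module add_delayed (
    input  wire [7:0] a,
    input  wire [7:0] b,
    output reg  [7:0] sum
);
    always @(a or b) begin
        // The delay models propagation time in simulation,
        // but no circuit element corresponds to "#5".
        #5 sum = a + b;
    end
endmodule
```

A simulator treats the delay as part of the event schedule; a synthesis tool has nothing to map it to, which is exactly the late-stage failure described above.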
Open-weight models fail earlier: missing module wrappers, non-synthesizable constructs, syntax errors. The code often does not even compile. But when it does compile, it is more likely to synthesize successfully, because the training data for open models includes more synthesis-grade RTL and less simulation-grade testbench code.
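A hypothetical illustration of the early failure (again constructed for this sketch, not drawn from the benchmark): the first fragment is rejected at compile time because procedural blocks cannot exist outside a module; the second is the minimal synthesizable form of the same counter.

```verilog
// Fails to compile: an always block with no enclosing module.
always @(posedge clk)
    count <= count + 1;
```

```verilog
// Compiles and synthesizes: the same logic wrapped in a module,
// with ports and widths declared.
module counter (
    input  wire       clk,
    output reg  [7:0] count
);
    always @(posedge clk)
        count <= count + 1;
endmodule
```

Note the asymmetry: the fix here is purely structural, whereas repairing the proprietary-model failure requires rethinking the design in hardware terms.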
The through-line: the gap between simulation and synthesis is a gap between two different notions of “correct.” LLMs trained on simulation-oriented code learn to write programs that behave correctly. LLMs trained on synthesis-oriented code learn to write descriptions that can become circuits. The distinction between a program and a description is precisely the distinction that hardware design education spends years teaching. LLMs reproduce whichever notion of correctness their training data embodies.