The Single Square
Multiplying two numbers requires a multiplier circuit. Squaring a number requires a squarer. Squarers use roughly half the gate count of multipliers — the partial product matrix is symmetric, so half the terms are redundant.
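The symmetry argument can be made concrete with a small sketch. The function name `square_folded` and the AND-gate counter are illustrative assumptions, not taken from any cited design. It models only partial-product generation for an unsigned n-bit input, not the adder tree that sums the products.

```python
def square_folded(x, nbits):
    """Square an unsigned nbits-wide integer using folded partial products.

    Returns (x squared, number of AND gates used to form partial products).
    Hypothetical model for illustration; the adder tree is not modelled.
    """
    bits = [(x >> i) & 1 for i in range(nbits)]
    total = 0
    and_gates = 0
    for i in range(nbits):
        # Diagonal term: a_i AND a_i is just a_i, so no gate is needed.
        total += bits[i] << (2 * i)
        for j in range(i + 1, nbits):
            # Off-diagonal terms a_i*a_j and a_j*a_i are equal, so form the
            # product once and shift it left by one extra position.
            total += (bits[i] & bits[j]) << (i + j + 1)
            and_gates += 1
    return total, and_gates


if __name__ == "__main__":
    value, gates = square_folded(13, 8)
    print(value, gates)
```

For an 8-bit input this forms 28 partial products, against the 64 that a general 8-by-8 multiplier generates. That is a little under half, consistent with the rough figure quoted above.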
Tenca and Zhang (2026, arXiv:2603.08732) prove that matrix multiplications and convolutions can asymptotically replace each real multiplication with a single squaring operation. Complex multiplication requires three squarings instead of the usual four real multiplications or three multiplications plus extra additions.
The technique is not approximate. It’s algebraically exact: (a+b)² - (a-b)² = 4ab, so ab is the difference of two squares divided by four. This identity is ancient. What’s new is proving that it composes through the structured computation patterns of systolic arrays and tensor cores without asymptotic overhead: the squaring substitution preserves the dataflow parallelism that makes hardware accelerators fast.
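One way the per-product cost can fall toward a single squaring is sketched below. It assumes a related identity, ab = ((a+b)² - a² - b²)/2, rather than the difference-of-squares form quoted above, and it is not necessarily the construction used in the paper. The a² and b² terms depend on only one operand each, so they can be computed once per row of A and once per column of B and then reused. The function name `matmul_by_squaring` and the squaring counter are hypothetical, and the inputs are assumed to be integers so that the final halving is exact.

```python
def matmul_by_squaring(A, B):
    """Multiply integer matrices using only squarings, additions and halving.

    Returns (product matrix, number of squarings performed).
    Illustrative sketch under the assumptions stated above.
    """
    count = 0

    def sq(x):
        # Stand-in for a hardware squarer; counts each use.
        nonlocal count
        count += 1
        return x * x

    n, m, p = len(A), len(B), len(B[0])
    # Sums of squares per row of A and per column of B, computed once.
    row = [sum(sq(A[i][k]) for k in range(m)) for i in range(n)]
    col = [sum(sq(B[k][j]) for k in range(m)) for j in range(p)]
    C = [
        [
            # sum_k (a+b)^2 - sum_k a^2 - sum_k b^2 equals 2 * sum_k a*b.
            (sum(sq(A[i][k] + B[k][j]) for k in range(m)) - row[i] - col[j]) // 2
            for j in range(p)
        ]
        for i in range(n)
    ]
    return C, count


if __name__ == "__main__":
    product, squarings = matmul_by_squaring([[1, 2], [3, 4]], [[5, 6], [7, 8]])
    print(product, squarings)
```

For n-by-n inputs this sketch performs n³ + 2n² squarings where the standard algorithm performs n³ multiplications, so the ratio is 1 + 2/n and approaches one squaring per product as n grows.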
The central claim: the cost of multiplication has been overstated because the baseline assumed multiplier circuits were the atomic unit. They’re not. Squaring is simpler, and squaring is sufficient. The identity that converts multiplication to squaring is a first-year algebra exercise. The engineering contribution is showing that this algebraic simplification survives the transition from scalar arithmetic to the highly structured parallel computation patterns of modern accelerators. The theory was always available. The architecture just needed to take it seriously.
Half the gates. Same throughput. The expensive operation was hiding a cheaper one inside it.