bf16 ULP demo

bf16 absorbs RL-scale updates

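bf16 has 7 explicit mantissa bits, so consecutive representable values around |w| are spaced by 2^(floor(log2 |w|) - 7), which falls between |w| · 2^-8 and |w| · 2^-7; the readout below uses the lower estimate |w| · 2^-8. When the Adam update is smaller than half that spacing, the post-step weight rounds back to the same bit pattern. Try it.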

weight magnitude: 0.050

learning rate η: 3e-6

gradient magnitude: 1.000

Absorbed by bf16 rounding: |Δw| = 3.00e-6, ULP = 1.95e-4
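The readout above can be checked offline with a minimal NumPy sketch. It assumes round-to-nearest-even conversion from float32 to bf16 and a plain η · g step standing in for the Adam update; the helper names (bf16_bits, bf16_value, bf16_ulp) are hypothetical and not part of any library. It prints both the |w| · 2^-8 estimate shown in the readout and the exact spacing for the binade that contains 0.05.

```python
import math
import numpy as np

def bf16_bits(x):
    """Round finite float32 values to bf16 (nearest, ties to even); return uint16 patterns."""
    u = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    # Add 0x7FFF plus the lowest kept bit, then drop the low 16 bits.
    bias = ((u >> 16) & 1) + 0x7FFF
    return ((u + bias) >> 16).astype(np.uint16)

def bf16_value(bits):
    """Decode uint16 bf16 patterns back to float32."""
    return (np.asarray(bits, dtype=np.uint32) << 16).view(np.float32)

def bf16_ulp(w):
    """Exact spacing of bf16 values in the binade containing w (nonzero, normal)."""
    _, e = math.frexp(abs(w))   # abs(w) = m * 2**e with 0.5 <= m < 1
    return 2.0 ** (e - 8)       # equals 2**(floor(log2|w|) - 7)

# Demo settings from the readout: |w| = 0.05, lr = 3e-6, |g| = 1.0
w_bits = bf16_bits(0.05)
w = bf16_value(w_bits)                  # the weight as actually stored in bf16
delta = np.float32(3e-6 * 1.0)          # assumed update: lr * gradient
new_bits = bf16_bits(w - delta)         # re-round after the step

print(f"stored weight   : {float(w[0]):.9f} (bits 0x{int(w_bits[0]):04X})")
print(f"|w| * 2^-8      : {0.05 * 2.0 ** -8:.3e}")
print(f"exact bf16 ULP  : {bf16_ulp(0.05):.3e}")
print(f"|delta|         : {float(delta):.3e}")
print("absorbed" if new_bits[0] == w_bits[0] else "flipped")
```

The update is absorbed because it is far below half the spacing under either estimate. Raising the step to 3e-4 in this sketch exceeds one full ULP and changes the stored bits.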

256 weights, one optimizer step

0 / 256 flipped, 100.00% bit-identical
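The flip count can be approximated with a short NumPy sketch. How the widget samples its 256 weights is not shown, so the choices here are assumptions: magnitudes drawn uniformly from [0.04, 0.06] with random signs, a plain η · g step in place of Adam, and round-to-nearest-even bf16 conversion. The helpers bf16_bits and flipped_count are hypothetical names.

```python
import numpy as np

def bf16_bits(x):
    """Round finite float32 values to bf16 (nearest, ties to even); return uint16 patterns."""
    u = np.atleast_1d(np.asarray(x, dtype=np.float32)).view(np.uint32)
    bias = ((u >> 16) & 1) + 0x7FFF
    return ((u + bias) >> 16).astype(np.uint16)

def flipped_count(weights, lr, grad=1.0):
    """Count weights whose bf16 bit pattern changes after one step w <- w - lr * grad."""
    before = bf16_bits(weights)
    after = bf16_bits(weights - np.float32(lr * grad))
    return int(np.count_nonzero(before != after))

rng = np.random.default_rng(0)
# Assumed sampling: 256 weights near magnitude 0.05, random sign.
raw = rng.uniform(0.04, 0.06, size=256) * rng.choice([-1.0, 1.0], size=256)
# Store the weights in bf16 first, then decode to float32 for the arithmetic.
weights = (bf16_bits(raw).astype(np.uint32) << 16).view(np.float32)

for lr in (3e-6, 3e-5, 3e-4, 3e-3):
    n = flipped_count(weights, lr)
    print(f"lr = {lr:.0e}: {n} / {weights.size} flipped, "
          f"{100.0 * (1 - n / weights.size):.2f}% bit-identical")
```

Because every sampled weight sits in the same binade, [2^-5, 2^-4), they all share one ULP, and the result in this sketch is all-or-nothing: no weight changes until the step exceeds half that ULP, then every weight changes.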

What this implies per step

Full bf16 payload: 1.20 GB

Delta payload: 0.0 MB

Reduction: ∞×

For a 0.6B-parameter model in bf16 (~1.2 GB), at RL learning rates the change mask covers about 1% of elements. With the sparse {indices, values} encoding (a 4-byte int32 index plus a 2-byte bf16 value, so 6 bytes per changed element), a delta payload lands at 20 to 35 MB. Crank η up and watch that number explode.
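The payload arithmetic is easy to work through as a sketch. The function name payload_bytes and the list of change fractions are illustrative assumptions; the byte sizes come from the encoding described above. At exactly 1% changed the delta comes to 36 MB, so the quoted 20 to 35 MB range corresponds to change fractions a little under 1% (roughly 0.55% to 0.97%).

```python
def payload_bytes(n_params, changed_fraction, index_bytes=4, value_bytes=2):
    """Return (full_payload, sparse_delta_payload) in bytes.

    full  : every parameter shipped as a bf16 value
    delta : one (int32 index, bf16 value) pair per changed parameter
    """
    n_changed = round(n_params * changed_fraction)
    full = n_params * value_bytes
    delta = n_changed * (index_bytes + value_bytes)
    return full, delta

N_PARAMS = 600_000_000  # the 0.6B model from the text

for frac in (0.0, 0.005, 0.01, 0.10, 0.50):
    full, delta = payload_bytes(N_PARAMS, frac)
    reduction = float("inf") if delta == 0 else full / delta
    print(f"changed {frac:6.1%}: full {full / 1e9:.2f} GB, "
          f"delta {delta / 1e6:8.1f} MB, reduction {reduction:.1f}x")
```

With 6 bytes per changed element against 2 bytes per element in the full payload, the sparse encoding stops paying off once more than one third of the elements change. That break-even point is what a larger η pushes toward.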