bf16 absorbs RL-scale updates

bf16 has 7 mantissa bits, so consecutive representable values around |w| are spaced by roughly |w| · 2⁻⁸ (one ULP). When the Adam update is well below that spacing (under half an ULP with round-to-nearest), the post-step weight rounds back to the same bit pattern and the update is lost.
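The rounding behaviour can be checked without any ML library. What follows is a minimal sketch, not a reference implementation: `to_bf16_bits`, `from_bf16_bits`, and `bf16_ulp` are hypothetical helpers that emulate bf16 by keeping the top 16 bits of a float32 with round-to-nearest-even, and w = 0.05 is an assumed example weight (chosen because |w| · 2⁻⁸ ≈ 1.95e-4). The exact spacing at that value is 2⁻¹² ≈ 2.44e-4, within a factor of two of the rough estimate.

```python
import struct

def to_bf16_bits(x: float) -> int:
    """Round x to bf16 (round-to-nearest-even) and return its 16-bit pattern.

    Goes through float32 first; NaN/inf handling is omitted for brevity.
    """
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)  # rounding bias, ties go to even
    return (bits >> 16) & 0xFFFF

def from_bf16_bits(b: int) -> float:
    """Expand a 16-bit bf16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def bf16_ulp(b: int) -> float:
    """Gap between the value with pattern b and the next larger magnitude."""
    return abs(from_bf16_bits(b + 1) - from_bf16_bits(b))

w_bits = to_bf16_bits(0.05)      # assumed example weight
w = from_bf16_bits(w_bits)
ulp = bf16_ulp(w_bits)

small = to_bf16_bits(w - 3e-6)   # step far below half an ULP
large = to_bf16_bits(w - 3e-4)   # step larger than one ULP

print(f"w = {w!r}, ULP = {ulp:.3e}")
print("3e-6 step absorbed:", small == w_bits)
print("3e-4 step absorbed:", large == w_bits)
```

The small step leaves the 16-bit pattern untouched, while the larger one moves the weight to a neighbouring representable value.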

Example: 256 weights, one optimizer step, |Δw| = 3.00e-6, ULP = 1.95e-4.
Result: 0 / 256 flipped (100.00% bit-identical). The update is absorbed by bf16 rounding.
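A rough way to reproduce the 256-weight experiment offline is sketched below. It makes several assumptions that are not from the original: the weights are drawn with magnitudes in [0.02, 0.08] and a random sign, the optimizer step is modelled as a fixed-magnitude ±|Δw| nudge per element (roughly how an Adam step of size η behaves when the normalized gradient is near ±1), and `count_flips` is a hypothetical helper name.

```python
import random
import struct

def to_bf16_bits(x: float) -> int:
    """Emulated bf16 rounding (round-to-nearest-even) via float32 bits."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    return (bits >> 16) & 0xFFFF

def from_bf16_bits(b: int) -> float:
    """Expand a 16-bit bf16 pattern back to a Python float."""
    return struct.unpack("<f", struct.pack("<I", b << 16))[0]

def count_flips(weight_bits, step, rng):
    """Apply one +/-step update per weight; count changed bf16 patterns."""
    flipped = 0
    for b in weight_bits:
        w = from_bf16_bits(b)
        new_b = to_bf16_bits(w + rng.choice((-1.0, 1.0)) * step)
        flipped += new_b != b
    return flipped

rng = random.Random(0)
# Hypothetical weights: magnitudes in [0.02, 0.08], random sign, stored as bf16.
weights = [
    to_bf16_bits(rng.choice((-1, 1)) * rng.uniform(0.02, 0.08))
    for _ in range(256)
]

for step in (3e-6, 3e-4):
    n = count_flips(weights, step, rng)
    print(f"|dw| = {step:.2e}: {n} / {len(weights)} flipped")
```

Under these assumptions the 3e-6 step flips none of the 256 weights, because it is far below half the smallest spacing in that range. The 3e-4 step exceeds half the largest spacing, so it flips all of them.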

What this implies per step

Full bf16 payload: 1.20 GB
Delta payload: 0.0 MB
Reduction: ∞×
For a 0.6B-parameter model in bf16 (~1.2 GB), the change mask at RL learning rates covers about 1% of elements. With a sparse {indices, values} encoding (int32 index + bf16 value = 6 bytes per changed element), the delta payload lands at 20 to 35 MB; a full 1% mask would be 6M elements × 6 bytes ≈ 36 MB. Raise η and that number grows quickly.
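The encoding and the size arithmetic can be sketched in a few lines. This is an illustration under assumptions, not the actual wire format: the layout (all int32 indices, then all 16-bit values, little-endian) and the names `encode_delta`, `apply_delta`, and `payload_sizes` are hypothetical, and weights are represented as raw 16-bit bf16 patterns.

```python
import struct

def encode_delta(old_bits, new_bits):
    """Pack changed positions as int32 indices followed by uint16 bf16 values."""
    idx = [i for i, (a, b) in enumerate(zip(old_bits, new_bits)) if a != b]
    vals = [new_bits[i] for i in idx]
    return struct.pack(f"<{len(idx)}i", *idx) + struct.pack(f"<{len(vals)}H", *vals)

def apply_delta(old_bits, payload):
    """Rebuild the new weights from the old ones plus an encoded delta."""
    n = len(payload) // 6            # 4-byte index + 2-byte value per change
    idx = struct.unpack(f"<{n}i", payload[: 4 * n])
    vals = struct.unpack(f"<{n}H", payload[4 * n:])
    out = list(old_bits)
    for i, v in zip(idx, vals):
        out[i] = v
    return out

def payload_sizes(n_params, changed_fraction):
    """Return (full_bytes, delta_bytes): 2 bytes per weight vs 6 per change."""
    full = 2 * n_params
    delta = 6 * round(n_params * changed_fraction)
    return full, delta

# Toy example: two of four bf16 patterns change.
old = [0x3D4D, 0x3D4C, 0xBD4D, 0x3C00]
new = [0x3D4D, 0x3D4B, 0xBD4D, 0x3C01]
blob = encode_delta(old, new)
assert apply_delta(old, blob) == new
print(len(blob), "bytes for 2 changed elements")

full, delta = payload_sizes(600_000_000, 0.01)
print(f"full = {full / 1e9:.2f} GB, delta = {delta / 1e6:.1f} MB, "
      f"reduction = {full / delta:.1f}x")
```

At 6 bytes per changed element versus 2 bytes per dense element, this encoding only pays off while fewer than one third of the weights change. Past that point the full payload is smaller.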