* This works and TG is decent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>