Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL to be the same as fp16 (but the previous
version with 4 bits for the exponent and mantissa was
good enough for any practical purposes).
Just scaler and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzn-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.