It is better to process one "row" at a time and to use
4 accumulators. My guess is that this allows better interleaving
of load and fmadd instructions. We get ~10% better performance
with 1 thread, and fully saturate memory bandwidth at 2 threads
with ~3.5% better performance (4.4 vs 4.25 t/s for L3-8B).
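
For illustration only, here is a minimal scalar sketch of the
"one row at a time, 4 accumulators" pattern. It is not the kernel
changed in this commit: the function names (row_dot_4acc,
matvec_rowwise) and the plain float row-major layout are
assumptions made for the example, and the real code presumably
works on SIMD registers with fmadd. The point is only that the
four sums do not depend on each other, so consecutive
multiply-adds need not wait on the previous result.

```c
#include <stddef.h>

/* Hypothetical helper: dot product of one matrix row with x.
 * Four independent accumulators break the dependency chain that a
 * single running sum would create between consecutive multiply-adds. */
static float row_dot_4acc(const float *row, const float *x, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: 4 elements per iteration, one per accumulator. */
    for (; i + 4 <= n; i += 4) {
        acc0 += row[i + 0] * x[i + 0];
        acc1 += row[i + 1] * x[i + 1];
        acc2 += row[i + 2] * x[i + 2];
        acc3 += row[i + 3] * x[i + 3];
    }

    /* Tail: fewer than 4 elements left. */
    for (; i < n; ++i) {
        acc0 += row[i] * x[i];
    }

    /* Combine the partial sums once, at the end of the row. */
    return (acc0 + acc1) + (acc2 + acc3);
}

/* Hypothetical driver: y = A * x for a row-major nrows x ncols matrix,
 * finishing each row completely before moving to the next one. */
static void matvec_rowwise(const float *a, const float *x, float *y,
                           size_t nrows, size_t ncols) {
    for (size_t r = 0; r < nrows; ++r) {
        y[r] = row_dot_4acc(a + r * ncols, x, ncols);
    }
}
```

Note that splitting the sum across accumulators changes the order
of floating-point additions, so results can differ in the last
bits from a single-accumulator loop.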