mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-06 03:50:08 +00:00
It is better to process one "row" at a time and to use 4 accumulators. My guess is that this allows better interleaving of load and fmadd instructions. We get ~10% better performance with 1 thread, and fully saturate memory bandwidth at 2 threads with ~3.5% better performance (4.4 vs 4.25 t/s for L3-8B).
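
To illustrate the idea, here is a minimal scalar sketch of a row-at-a-time matrix-vector product with 4 independent accumulators. It is not the kernel from this commit: the actual code presumably works on quantized blocks with SIMD intrinsics, while this sketch assumes plain `float` data. The names `dot_row_4acc` and `matvec_row_at_a_time` are hypothetical and chosen only for illustration.

```c
#include <stddef.h>

/* Hypothetical helper (not from the repository): dot product of one
 * matrix row with x, using 4 independent accumulators so that
 * consecutive multiply-adds do not depend on each other. */
static float dot_row_4acc(const float *row, const float *x, size_t n) {
    float acc0 = 0.0f, acc1 = 0.0f, acc2 = 0.0f, acc3 = 0.0f;
    size_t i = 0;

    /* Main loop: each accumulator forms its own dependency chain, which
     * gives the compiler and CPU room to overlap loads with multiply-adds. */
    for (; i + 4 <= n; i += 4) {
        acc0 += row[i + 0] * x[i + 0];
        acc1 += row[i + 1] * x[i + 1];
        acc2 += row[i + 2] * x[i + 2];
        acc3 += row[i + 3] * x[i + 3];
    }

    /* Combine the partial sums once, after the loop. */
    float sum = (acc0 + acc1) + (acc2 + acc3);

    /* Tail: handle the remaining elements when n is not a multiple of 4. */
    for (; i < n; ++i) {
        sum += row[i] * x[i];
    }
    return sum;
}

/* Hypothetical driver: y = W * x, one row at a time.
 * W is assumed to be row-major with n_rows x n_cols elements. */
static void matvec_row_at_a_time(const float *w, const float *x, float *y,
                                 size_t n_rows, size_t n_cols) {
    for (size_t r = 0; r < n_rows; ++r) {
        y[r] = dot_row_4acc(w + r * n_cols, x, n_cols);
    }
}
```

With a single accumulator every multiply-add has to wait for the previous one to finish. Splitting the sum over several accumulators removes that serial dependency, which is one plausible reason for the single-thread gain reported above.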