ik_llama.cpp/ggml
Iwan Kawrakow 5bbdfc8bac q5_0_r4: NEON
We get PP-512(LLaMA-3.1-8B) = 99.6 t/s on M2-Max,
up from 71.0 t/s for Q5_0. The difference to mainline llama.cpp
is no longer funny: they get 26.5 t/s for Q5_0.

For TG, we are not able to fully saturate memory bandwidth
and arrive at 22.1 t/s @ 8 threads. Mainline llama.cpp gets
20.6 t/s for Q5_0.
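
The "not fully saturating memory bandwidth" observation can be sanity-checked with back-of-the-envelope arithmetic. A sketch, with assumed numbers that are not from the commit: ~8e9 weights for LLaMA-3.1-8B, and the Q5_0 block layout of 22 bytes per 32 weights (2-byte fp16 scale, 16 bytes of low nibbles, 4 bytes of high bits), ignoring activations and KV cache:

```python
# Rough estimate of effective memory bandwidth implied by the reported TG speed.
# Assumptions (mine, not the commit's): ~8e9 weights, Q5_0 = 22 bytes / 32 weights.
weights = 8.0e9
bytes_per_weight = 22 / 32                 # Q5_0: 5.5 bits/weight stored as bytes
model_bytes = weights * bytes_per_weight   # bytes streamed per generated token

tg = 22.1                                  # t/s reported for q5_0_r4 on M2-Max
effective_bw = tg * model_bytes            # bytes/s actually read from memory

print(f"model size          ~{model_bytes / 1e9:.1f} GB")
print(f"effective bandwidth ~{effective_bw / 1e9:.0f} GB/s")
```

Under these assumptions TG streams about 5.5 GB per token, so 22.1 t/s corresponds to roughly 120 GB/s of effective bandwidth, well below the M2-Max's nominal peak, which is consistent with the claim that TG is not bandwidth-saturated.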
2024-12-03 11:09:34 +01:00