ik_llama.cpp/iqk-quantize.cpp at 49bacf2288d934b3974ac0bc9b25502f894fb7e9

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-05 19:40:19 +00:00

Files

Iwan Kawrakow aa14a06b44 Bitnet: trying an alternative iq1_bn grid

Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertaintiers of 1-2 t/s. Same for PP, results are
jumping back-and-fort between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.

2024-06-25 11:32:48 +03:00

12 KiB

Raw Blame History

View Raw

12 KiB Raw Blame History

12 KiB

Raw Blame History