Files
ik_llama.cpp/iqk-quantize.cpp
Iwan Kawrakow aa14a06b44 Bitnet: trying an alternative iq1_bn grid
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertaintiers of 1-2 t/s. Same for PP, results are
jumping back-and-fort between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
2024-06-25 11:32:48 +03:00

12 KiB