ik_llama.cpp/ggml-cuda
commit ef16135920
Author: Kawrakow
Date:   2024-06-25 11:32:48 +03:00

    Bitnet: trying an alternative iq1_bn grid

    Faster on CUDA. The scalar version is faster too.

    The issue with CUDA is that I now see wild performance
    fluctuations. Running llama-bench I can get 220 t/s for
    TG-128 one time and 190 t/s another time, with reported
    uncertainties of only 1-2 t/s. The same goes for PP:
    results jump back and forth between ~9500 t/s and
    ~8900 t/s. So there is basically no reliable measurement
    at this point, but it is for sure faster than the
    previous version, which was at around 170-180 t/s.
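
A hedged sketch of how such numbers are gathered, assuming stock
llama-bench options (the model filename and repetition counts below are
placeholders, not taken from this commit): -n 128 runs the TG-128 test,
-p 512 a prompt-processing (PP) test, and -r sets the number of
in-process repetitions. Raising -r tightens the reported 1-2 t/s
uncertainty, but the run-to-run drift described above only shows up
across separate launches, hence the outer loop.

    # Hypothetical invocation; the model path is a placeholder.
    # -p 512: prompt-processing (PP) test
    # -n 128: token-generation (TG-128) test
    # -r 10:  repetitions per test within one launch
    for i in 1 2 3; do
        ./llama-bench -m bitnet-iq1_bn.gguf -p 512 -n 128 -r 10
    done

Comparing the three launches against each other, rather than trusting a
single run's reported uncertainty, is what exposes the ~190 vs ~220 t/s
fluctuation the commit message describes.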