Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-04-28 02:11:50 +00:00
Bitnet: trying an alternative iq1_bn grid
Faster on CUDA. The scalar version is faster too. The issue with CUDA is that I now see wild performance fluctuations: running llama-bench, I get 220 t/s for TG-128 on one run and 190 t/s on another, with per-run uncertainties of only 1-2 t/s. Same for PP, where results jump back and forth between ~9500 t/s and ~8900 t/s. So there is basically no reliable measurement at this point, but it is definitely faster than the previous version, which was at around 170-180 t/s.
```diff
@@ -2788,6 +2788,7 @@ bool MulMat::prepare(int typeA, int typeB, int ne00, MulMat& mm, int Ny) {
             MulMat::set_functions<DequantizerIQ2XXS>(mm);
             break;
         case GGML_TYPE_IQ1_BN:
+            return false;
             assert (ne00 % QK_IQ1BN == 0);
             mm.funcs[0] = mul_mat_iq1bn_q8_K64<1>;
             mm.funcs[1] = mul_mat_iq1bn_q8_K64<2>;
```
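For context, below is a minimal, self-contained sketch (not the actual ik_llama.cpp code) of the dispatch pattern this hunk touches: `prepare()` switches on the quantization type and fills a table of row-batched kernel function pointers, returning false when no specialized kernel applies so the caller can fall back to the generic path. The struct layout, the `QK_IQ1BN` value, and the kernel signature are simplifications assumed for illustration; only the names `mul_mat_iq1bn_q8_K64`, `QK_IQ1BN`, and `GGML_TYPE_IQ1_BN` come from the diff itself.

```cpp
// Illustrative sketch of the MulMat::prepare() dispatch pattern.
// NOT the real ik_llama.cpp implementation; types and values are assumed.
#include <cassert>
#include <cstdio>

constexpr int QK_IQ1BN = 64; // assumed block size for iq1_bn

enum ggml_type { GGML_TYPE_IQ2_XXS, GGML_TYPE_IQ1_BN, GGML_TYPE_OTHER };

// Hypothetical kernel standing in for mul_mat_iq1bn_q8_K64<nrc_y> from the
// diff: the template parameter is the number of right-hand-side rows
// processed per call.
template <int nrc_y>
static void mul_mat_iq1bn_q8_K64(int n /* real arguments elided */) {
    std::printf("iq1_bn kernel, %d row(s), n = %d\n", nrc_y, n);
}

struct MulMat {
    using kernel_t = void (*)(int);
    kernel_t funcs[2] = {};

    // Fill the kernel table for the given quantization type. Returning
    // false tells the caller no specialized kernel exists, so it should
    // fall back to the generic matmul path.
    static bool prepare(int typeA, int ne00, MulMat& mm) {
        switch (typeA) {
            case GGML_TYPE_IQ1_BN:
                assert(ne00 % QK_IQ1BN == 0); // row length must be a whole number of blocks
                mm.funcs[0] = mul_mat_iq1bn_q8_K64<1>;
                mm.funcs[1] = mul_mat_iq1bn_q8_K64<2>;
                return true;
            default:
                return false;
        }
    }
};

int main() {
    MulMat mm;
    if (MulMat::prepare(GGML_TYPE_IQ1_BN, 128, mm)) {
        mm.funcs[1](128); // dispatch the 2-row kernel
    }
}
```

Templating the kernel on the number of right-hand-side rows lets the compiler generate a specialized inner loop per batch size instead of branching at run time, which is presumably why the table holds one function pointer per row count.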