Commit Graph

4 Commits

Iwan Kawrakow
86d94862ae iqk_soft_max
With this ggml_mul_mat_ext, we hit PP-512 = 209 t/s (iq1_bn) and
PP-512 = 246 t/s (iq2_bn) on the M2 Max CPU.
On the Ryzen-7950X we are at PP-512 = 447 t/s (iq1_bn, 32 threads)
and PP-512 = 530 t/s (iq2_bn, 16 threads).
2024-07-22 16:34:42 +02:00
Iwan Kawrakow
412bc31c75 Extended mul mat: C = alpha * A * B + beta * C
i.e., as in your typical GEMM interface.
For Bitnet this gives ~1% speedup for PP, no effect for TG.
Very easy to implement for the CPU using iqk_mul_mat.
But given that every other backend would require a lot of changes,
and given the mere ~1% speedup (which only applies to Bitnet),
it does not look worth the effort.
2024-07-22 09:26:55 +03:00
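The "typical GEMM interface" the commit refers to is the standard C = alpha * A * B + beta * C update. A minimal sketch of that operation, assuming contiguous row-major float matrices (the names and layout here are illustrative, not the actual ggml_mul_mat_ext API):

```c
#include <stddef.h>

/* Plain GEMM-style update C = alpha * A * B + beta * C.
 * A is m x k, B is k x n, C is m x n, all row-major and contiguous.
 * Illustrative only; the real extended mul mat works on ggml tensors. */
static void gemm_ext(size_t m, size_t n, size_t k,
                     float alpha, const float *A, const float *B,
                     float beta, float *C) {
    for (size_t i = 0; i < m; ++i) {
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                acc += A[i*k + p] * B[p*n + j];  /* accumulate (A*B)[i][j] */
            }
            C[i*n + j] = alpha * acc + beta * C[i*n + j];
        }
    }
}
```

With alpha = 1 and beta = 0 this reduces to the ordinary mul mat, which is why the extension is cheap to bolt onto an existing CPU matmul kernel.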
Iwan Kawrakow
ad53eabf87 iqk_mul_mat: be independent of llamafile_sgemm (WIP)
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat

It is marked WIP because it has only been tested on __aarch64__.
2024-06-22 12:02:50 +03:00
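Passing strides explicitly is what lets a matmul kernel stop assuming any particular memory layout (the assumption llamafile_sgemm previously supplied). A sketch of the idea, using byte strides in the style of ggml's nb fields; the function name and signature here are hypothetical, not iqk_mul_mat's actual interface:

```c
#include <stddef.h>
#include <stdint.h>

/* Matmul over float matrices with explicit per-row byte strides,
 * so rows need not be contiguous. A is m x k, B is k x n, C is m x n.
 * Hypothetical signature, illustrating why strides are passed in. */
static void mul_mat_strided(size_t m, size_t n, size_t k,
                            const float *A, size_t a_row_stride,
                            const float *B, size_t b_row_stride,
                            float *C, size_t c_row_stride) {
    for (size_t i = 0; i < m; ++i) {
        const float *a_row = (const float *)((const uint8_t *)A + i*a_row_stride);
        float *c_row = (float *)((uint8_t *)C + i*c_row_stride);
        for (size_t j = 0; j < n; ++j) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; ++p) {
                /* locate row p of B via its byte stride */
                const float *b_row = (const float *)((const uint8_t *)B + p*b_row_stride);
                acc += a_row[p] * b_row[j];
            }
            c_row[j] = acc;
        }
    }
}
```

For contiguous data the stride is just row_length * sizeof(float), but the same kernel then also handles padded or transposed views without change.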
Iwan Kawrakow
667bd4759c iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00