ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-04-30 11:21:56 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	86d94862ae	iqk_soft_max With this ggml_mul_mat_ext, we hit PP-512 = 209 t/s (iq1_bn) and PP-512 = 246 t/s (iq2_bn) on the M2 Max CPU. On the Ryzen-7950X we are at PP-512 = 447 t/s (iq1_bn, 32 threads) and PP-512 = 530 t/s (iq2_bn, 16 threads).	2024-07-22 16:34:42 +02:00
Iwan Kawrakow	412bc31c75	Extended mul mat: C = alpha * A * B + beta, i.e., as in your typical GEMM interface. For Bitnet this gives ~1% speedup for PP, no effect for TG. Very easy to implement for the CPU using iqk_mul_mat. But given that every other backend would require a lot of changes, and given that the speedup is just 1% (and only applies to Bitnet), it does not look like it is worth putting in the effort.	2024-07-22 09:26:55 +03:00
Iwan Kawrakow	ad53eabf87	iqk_mul_mat: be independent of llamafile_sgemm (WIP) * Remove iqk_mul_mat from llamafile_sgemm * Pass tensor types and strides to iqk_mul_mat. It is marked WIP because it has only been tested on __aarch64__.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	667bd4759c	iqk_mul_mat: make it independent of sgemm	2024-06-22 12:02:50 +03:00
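
The "Extended mul mat" entry above writes the operation as C = alpha * A * B + beta. As a rough illustration only, the sketch below takes that formula literally and treats beta as a scalar added to every output element; this is an assumption, since a classic BLAS GEMM instead scales the existing C by beta, and the log does not say which form the commit used. The function name `mul_mat_ext_ref`, the row-major layout, and the plain float types are all hypothetical and are not taken from ik_llama.cpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference routine, not the ik_llama.cpp API.
// Computes C = alpha * A * B + beta, with beta assumed to be a scalar
// offset added to every element of the product.
// A is m x k, B is k x n, C is m x n, all stored row-major.
std::vector<float> mul_mat_ext_ref(const std::vector<float>& A,
                                   const std::vector<float>& B,
                                   std::size_t m, std::size_t k, std::size_t n,
                                   float alpha, float beta) {
    std::vector<float> C(m * n, 0.0f);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            float acc = 0.0f;  // dot product of row i of A and column j of B
            for (std::size_t l = 0; l < k; ++l) {
                acc += A[i * k + l] * B[l * n + j];
            }
            // Scale and offset are fused into the final store, so no
            // separate pass over C is needed.
            C[i * n + j] = alpha * acc + beta;
        }
    }
    return C;
}
```

Fusing the scale and offset into the store avoids extra element-wise passes over the result, which is presumably where the small prompt-processing gain reported for Bitnet comes from.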