Commit Graph

22 Commits

Author SHA1 Message Date
Kawrakow
2c8d3dad1f iqk_mul_mat: experimenting with zen4 (iq2_xs) 2024-06-22 12:02:49 +03:00
Kawrakow
0d9027fe74 iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m) 2024-06-22 12:02:49 +03:00
Kawrakow
ed8f1fe490 iqk_mul_mat: small improvement for iq3_s
The same as in llamafile. We get
PP-512 = 96.6 t/s
TG-128 = 7.77 t/s @  4 threads
         14.4 t/s @  8 threads
         16.3 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
01d55dcbf0 iqk_mul_mat: better AVX2 implementation for iq2_xxs
From here on switching to GCC 12.

PP-512 is now 139.3 t/s.
TG-128 is 13.5 t/s @  4 threads
          23.0 t/s @  8 threads
          25.1 t/s @ 16 threads
2024-06-22 12:02:49 +03:00
Kawrakow
d4e9e595f9 iqk_mul_mat: better AVX2 implementation for iq2_xxs
2.41X for PP-512 (120.5 t/s).
Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s).
But somehow slower at 16 threads (22.65 t/s vs 26.3 t/s).
Very strange.
2024-06-22 12:02:49 +03:00
Kawrakow
41391ff4b0 iqk_mul_mat: AVX2 implementation for iq2_xxs
2.09X for PP-512 (104.7 t/s), worse than mainline for TG.
I think it needs more work.
2024-06-22 12:02:49 +03:00
Kawrakow
be132341f5 iqk_mul_mat: AVX2 implementation for iq2_xs
We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK
(slightly better @ 4 threads, slightly worse @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
3c448906bf iqk_mul_mat: AVX2 implementation for iq2_s
We get 2.04X for PP-512 (107 t/s). TG again suffers
a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads).
2024-06-22 12:02:49 +03:00
Kawrakow
f31200bde1 Separate templates for TG and PP for i-quants on AVX2 2024-06-22 12:02:49 +03:00
Kawrakow
3f90520d1f iqk_mul_mat: AVX2 implementation for iq3_xxs
We get 2.3X for PP-512 (87 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.
2024-06-22 12:02:49 +03:00
Kawrakow
24ccf42a4f iqk_mul_mat: AVX2 implementation for iq3_s
We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use
the original implementation in llama.cpp because the template is not able
to match the performance of the special-purpose implementation.
2024-06-22 12:02:49 +03:00
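The two commits above describe the same pattern for iq3_xxs and iq3_s: prompt processing (PP) goes through the new templated iqk_mul_mat kernel, while token generation (TG) falls back to the original llama.cpp implementation, which the template cannot match. A minimal sketch of that dispatch decision — all names here (`select_kernel`, the path labels, the type set) are illustrative assumptions, not the actual iqk_mul_mat API:

```python
# Hypothetical dispatch sketch: route PP (many rows, GEMM-like) through the
# templated kernel, and TG (a single row, GEMV-like) through the original
# llama.cpp dot-product path, for the types where the commits say the
# template loses to the special-purpose TG code.
TEMPLATED_PP_ONLY_TYPES = {"iq3_xxs", "iq3_s"}  # per the commits above

def select_kernel(quant_type: str, n_batch_rows: int) -> str:
    """Pick a matmul path: templated iqk kernel vs. llama.cpp vec_dot."""
    is_tg = n_batch_rows == 1  # token generation multiplies by one row at a time
    if quant_type in TEMPLATED_PP_ONLY_TYPES and is_tg:
        return "llama_cpp_vec_dot"
    return "iqk_template"
```

For other quant types in this log (e.g. iq2_xxs, iq2_xs) the template handles both PP and TG, with the mixed TG results noted in their commit messages.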
Kawrakow
32f20a1b9b Cleanup - Arm i-quants should be good now
Still missing iq1_s and iq1_m, but I don't think I'll do those.
2024-06-22 12:02:49 +03:00
Kawrakow
7235135c3e iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version)
Here we get 3.65X (!) for PP-512 (53 t/s).
2024-06-22 12:02:49 +03:00
Kawrakow
482dd30382 Simplify 2024-06-22 12:02:49 +03:00
Kawrakow
6aa7ac9cd3 iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version)
We get 2.66X for PP-512 (42.35 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
d041c81b1d iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version)
We get 2.2X for PP-512 (52 t/s)
2024-06-22 12:02:49 +03:00
Kawrakow
3fe4e1b27c iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version)
We get only 2.07X for PP-512, reaching 31 t/s,
so iq2_s remains slow.
2024-06-22 12:02:49 +03:00
Kawrakow
4c0920cb1b Add Q8_0 2024-06-22 12:02:49 +03:00
Kawrakow
62122c1950 Cosmetics 2024-06-22 12:02:49 +03:00
Kawrakow
fb8bc26dc5 iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version)
We get a ~5% speedup for TG-128 and 3X for PP-512.
2024-06-22 12:02:49 +03:00
Kawrakow
a18a564e54 iqk_mul_mat: faster q3_K TG
We get 31 t/s, up from 26 t/s, but we need to treat
PP differently from TG, else we get a ~10% drop in
PP performance.
2024-06-22 12:02:49 +03:00
Kawrakow
d434b4751a iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00
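A quick sanity check on the PP-512 numbers quoted throughout this log: assuming each "NX" factor is measured against mainline llama.cpp on the same machine and thread count (an assumption — the commits do not state the baseline explicitly), the implied mainline throughput is simply the new t/s divided by the factor:

```python
# Implied mainline PP-512 baselines from the (t/s, speedup) pairs quoted
# in the AVX2 commits above. The "relative to mainline" interpretation of
# the NX factors is an assumption.
reported = {
    "iq2_xxs": (120.5, 2.41),
    "iq2_xs":  (118.9, 2.19),
    "iq2_s":   (107.0, 2.04),
    "iq3_xxs": (87.0,  2.3),
    "iq3_s":   (96.6,  3.14),
}
for name, (tps, factor) in reported.items():
    print(f"{name}: implied mainline PP-512 ~ {tps / factor:.1f} t/s")
```

For example, iq2_xxs at 120.5 t/s with a 2.41X factor implies a ~50 t/s baseline, and iq3_s at 96.6 t/s with 3.14X implies ~31 t/s.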