ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-08 07:20:12 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	f0a52f2fbb	iqk_mul_mat: delete unused stuff	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	74b711c8fd	iqk_mul_mat: add q8_0 It was actually ready but not turned on. Having forgotten, I made a new implementation along the lines of the fp16 implementation (i.e., using tiling). That matched tinyBLAS performance. But the existing implementation that I now turned on is faster: PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS, TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	29164263f4	iqk_mul_mat: fp16 tweaks Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers). This works best for the Ryzen-5975WX.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	36c3f57b0a	iqk_mul_mat: fp16 implementation cleanup It turns out on my Ryzen-7950X CPU using AVX512 is slower.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	bc659e7de1	iqk_mul_mat: fp16 implementation for AVX2 This simple implementation beats jart's tinyBLAS by a small margin (143 t/s vs 137 t/s for PP-512, TG is 4.75 t/s, so exactly the same as ggml).	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	667bd4759c	iqk_mul_mat: make it independent of sgemm	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	2ee56b4f0d	iqk_mul_mat: minor improvements Current performance: \| model \| size \| threads \| test \| t/s \| \| ----------------- \| ---------: \| -------: \| ------: \| ---------------: \| \| llama 7B IQ3_S \| 2.75 GiB \| 16 \| pp512 \| 100.21 ± 0.32 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 16 \| pp512 \| 105.25 ± 0.75 \| \| llama 7B IQ2_M \| 2.20 GiB \| 16 \| pp512 \| 117.88 ± 0.15 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 16 \| pp512 \| 136.38 ± 0.24 \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 16 \| pp512 \| 128.47 ± 0.39 \| mean: 117.64 \| ----------------- \| ---------: \| -------: \| ------: \| ---------------: \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 8 \| tg128 \| 23.94 ± 0.04 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 8 \| tg128 \| 23.27 ± 0.03 \| \| llama 7B IQ2_M \| 2.20 GiB \| 8 \| tg128 \| 18.88 ± 0.03 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 8 \| tg128 \| 19.07 ± 0.04 \| \| llama 7B IQ3_S \| 2.75 GiB \| 8 \| tg128 \| 15.44 ± 0.05 \| mean: 20.12	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	0ad646b9f0	iqk_mul_mat: no more templates in the IQ dequantizers Also moved the quant specific code from the EvenSignHelper into the corresponding dequantizers. These two changes had a tiny performance benefit (much too small compared to what I was expecting/hoping for).	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	e35a14ff5f	iqk_mul_mat: remove template on one of the prepare() functions	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	e67626533c	iqk_mul_mat: experimenting with zen4 Nope, we cannot have good performance for iq2_xxs and iq3_xxs at the same time. If I don't force inline the sign functions, I get better performance for iq2_xxs and bad performance for iq3_xxs. If I force inline them, it is the other way around. Anyway, this is what we have now on Zen4 for all quants with forced inline EvenSignHelper methods: \| model \| size \| threads \| test \| t/s \| \| -----------------\| ---------: \| ------: \| -----: \| ------------: \| \| llama 7B IQ3_S \| 2.75 GiB \| 16 \| pp512 \| 100.91 ± 0.26 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 16 \| pp512 \| 106.08 ± 0.78 \| \| llama 7B IQ2_M \| 2.20 GiB \| 16 \| pp512 \| 116.41 ± 0.25 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 16 \| pp512 \| 132.54 ± 1.07 \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 16 \| pp512 \| 125.53 ± 0.06 \| arithmetic mean: 116.29 geometric mean: 115.70 \| -----------------\| ---------: \| ------: \| -----: \| ------------: \| \| llama 7B IQ3_S \| 2.75 GiB \| 8 \| tg128 \| 15.69 ± 0.04 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 8 \| tg128 \| 18.02 ± 0.04 \| \| llama 7B IQ2_M \| 2.20 GiB \| 8 \| tg128 \| 18.94 ± 0.03 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 8 \| tg128 \| 23.29 ± 0.02 \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 8 \| tg128 \| 22.96 ± 0.09 \| arithmetic mean: 19.78 geometric mean: 19.56 Without force-inlining, PP(iq3_xxs) drops to 98 t/s while PP(iq2_xxs) increases to 137 t/s.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	47ae12bbec	iqk_mul_mat: experimenting with zen4 (iq2_xxs) Observing again the weirdness of a performance drop in one quant because of a change in another quant. After I added FANCY_SIMD implementations for iq3_s, iq2_s and iq2_xs, I'm observing that iq2_xxs PP performance dropped to 130 t/s from 139 t/s. Adding a FANCY_SIMD implementation for applying the signs brings it back to 137 t/s and gives a small boost for TG as well (23.4 vs 23.0 t/s)	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	dc96d5484f	iqk_mul_mat: experimenting with zen4 (iq2_xs)	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	cb063a2a20	iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m)	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	61b8cc1ff6	iqk_mul_mat: small improvement for iq3_s The same as in llamafile. We get PP-512 = 96.6 t/s, TG-128 = 7.77 t/s @ 4 threads, 14.4 t/s @ 8 threads, 16.3 t/s @ 16 threads	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	2a72d9f978	iqk_mul_mat: better AVX2 implementation for iq2_xxs From here on switching to GCC 12. PP-512 is now 139.3 t/s. TG-128 is 13.5 t/s @ 4 threads, 23.0 t/s @ 8 threads, 25.1 t/s @ 16 threads	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	3a6e3943a8	iqk_mul_mat: better AVX2 implementation for iq2_xxs 2.41X for PP-512 (120.5 t/s). Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s). But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s. Very strange.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	60f050d610	iqk_mul_mat: AVX2 implementation for iq2_xxs 2.09X for PP-512 (104.7 t/s), worse than mainline for TG. I think it needs more work.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	309e32405f	iqk_mul_mat: AVX2 implementation for iq2_xs We get 2.19X for PP-512 (118.9 t/s). TG is mostly OK (slightly better @ 4 threads, slightly worse @ 16 threads).	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	8015edb3cc	iqk_mul_mat: AVX2 implementation for iq2_s We get 2.04X for PP-512 (107 t/s). TG again suffers a small loss in performance (19.9 t/s vs 21.4 t/s @ 16 threads)	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	b0071de081	Separate templates for TG and PP for i-quants on AVX2	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	2c8c0d0a68	iqk_mul_mat: AVX2 implementation for iq3_xxs We get 2.3X for PP-512 (87 t/s). But for TG, we need to use the original implementation in llama.cpp because the template is not able to match the performance of the special-purpose implementation. Also, 87 t/s is significantly lower than the 111 t/s I have in iquants.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	34befcaf67	iqk_mul_mat: AVX2 implementation for iq3_s We get 3.14X for PP-512 (96.6 t/s). But for TG, we need to use the original implementation in llama.cpp because the template is not able to match the performance of the special-purpose implementation.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	4f53915dcb	Cleanup - Arm i-quants should be good now Still missing iq1_s and iq1_m, but I don't think I'll do those.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	4b27ade2fb	iqk_mul_mat: Arm implementation for iq3_s (llama.cpp version) Here we get 3.65X (!) for PP-512 (53 t/s).	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	221a2c3807	Simplify	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	7dcca6aea7	iqk_mul_mat: Arm implementation for iq3_xxs (llama.cpp version) We get 2.66X for PP-512 (42.35 t/s)	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	effa4448d6	iqk_mul_mat: Arm implementation for iq2_xs (llama.cpp version) We get 2.2X for PP-512 (52 t/s)	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	d2ee9ab95e	iqk_mul_mat: Arm implementation for iq2_s (llama.cpp version) We get only a 2.07X for PP-512 to get up to 31 t/s, so iq2_s remains slow.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	9ac9e928d5	Add Q8_0	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	3f996d0c70	Cosmetics	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	d7ab97149f	iqk_mul_mat: Arm implementation for iq2_xxs (llama.cpp version) We get ~5% speedup for TG-128, 3X for PP-512	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	b51922530f	iqk_mul_mat: faster q3_K TG We get 31 t/s up from 26 t/s, but we need to treat PP differently from TG, else we get a ~10% drop in PP performance.	2024-06-22 12:02:49 +03:00
Iwan Kawrakow	19c578b413	iqk_mul_mat for llama.cpp	2024-06-22 12:02:49 +03:00

33 Commits
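
Commit e67626533c summarizes its Zen4 table with an arithmetic and a geometric mean. As a rough cross-check, the sketch below recomputes both from the t/s values listed in that commit message. The dictionary names and the `summarize` helper are hypothetical, introduced only for this illustration; they are not part of ik_llama.cpp, and the commit does not say how the author computed the figures.

```python
from statistics import geometric_mean, mean

# t/s values copied from the Zen4 table in commit e67626533c
# (the ± uncertainties are ignored in this sketch).
pp512 = {  # prompt processing, 16 threads
    "IQ3_S": 100.91,
    "IQ3_XXS": 106.08,
    "IQ2_M": 116.41,
    "IQ2_XS": 132.54,
    "IQ2_XXS": 125.53,
}
tg128 = {  # token generation, 8 threads
    "IQ3_S": 15.69,
    "IQ3_XXS": 18.02,
    "IQ2_M": 18.94,
    "IQ2_XS": 23.29,
    "IQ2_XXS": 22.96,
}


def summarize(results):
    """Return (arithmetic mean, geometric mean) of the t/s values."""
    values = list(results.values())
    return mean(values), geometric_mean(values)


pp_arith, pp_geo = summarize(pp512)
tg_arith, tg_geo = summarize(tg128)

print(f"pp512: arithmetic {pp_arith:.2f}, geometric {pp_geo:.2f}")
print(f"tg128: arithmetic {tg_arith:.2f}, geometric {tg_geo:.2f}")
```

The geometric mean is less sensitive to a single fast quant than the arithmetic mean, which may be why the commit reports both; that reading is an assumption, not something the commit states.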