ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-09 07:50:10 +00:00

Author	SHA1	Message	Date
Kawrakow	32ec107237	iqk_mul_mat: add IQ4_NL I never use it, so I had completely forgotten about it.	2024-06-22 12:02:52 +03:00
Kawrakow	912d6d9ce1	bitnet(scale in a separate tensor): CPU tweaks A somewhat nicer iq2_bn implementation on AVX2.	2024-06-22 12:02:52 +03:00
Kawrakow	f53d89dd53	bitnet(scale in a separate tensor): CPU tweaks I had ruined TG performance on AVX2 with the last commit. Was just testing at 8 threads and there we are totally memory bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950. Back to 51 t/s with this commit.	2024-06-22 12:02:52 +03:00
Kawrakow	52ad5764dd	bitnet(scale in a separate tensor): more CPU improvements It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance: System \| quant \| PP-512 \| TG-128 \| quant \| PP-512 \| TG-128 \| M2 Max \| iq2bn 229.02 ± 0.37 78.75 ± 0.61 \| iq1bn \| 146.67 ± 2.85 33.12 ± 0.03 Ryzen7950\| iq2bn 379.36 ± 1.03 49.08 ± 0.18 \| iq1bn \| 247.12 ± 1.53 32.80 ± 0.02 Ryzen5975\| iq2bn 465.28 ± 0.57 39.17 ± 0.02 \| iq1bn \| 325.86 ± 0.46 26.60 ± 0.10	2024-06-22 12:02:52 +03:00
Kawrakow	167489ef6c	bitnet(scale in a separate tensor): CPU improvements Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speed by a few percent.	2024-06-22 12:02:52 +03:00
Kawrakow	785cac7ee5	bitnet: put the scale in a separate tensor and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Committing so I can go and test on other platforms.	2024-06-22 12:02:52 +03:00
Kawrakow	1f9541172f	Bitnet(1.75 bpw): higher precision fp8 scale Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL the same as fp16 (but the previous version with 4 bits for the exponent and mantissa was good enough for any practical purposes).	2024-06-22 12:02:52 +03:00
Kawrakow	766975ecfa	Bitnet(2.25 bpw): NEON We get PP-512 = 192 t/s, TG-128 = 72 t/s	2024-06-22 12:02:52 +03:00
Kawrakow	39982764d7	Bitnet: 2.25 bpw version Just scalar and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.	2024-06-22 12:02:52 +03:00
Kawrakow	68741281e5	bitnet 2 bpw: NEON implementation We get PP-512 = 190 t/s and TG-128 = 75 t/s. 2 bpw TG on the CPU beats 1.75 bpw on the GPU!	2024-06-22 12:02:52 +03:00
Kawrakow	a8521b73d7	Removed extra column	2024-06-22 12:02:52 +03:00
Kawrakow	8ca1bdebe4	bitnet 2 bpw: AVX2 implementation We get PP-512 = 322 t/s. TG is already 51.6 t/s at 4 threads, then it saturates and starts going down for more than 8 threads.	2024-06-22 12:02:52 +03:00
Kawrakow	0efd620d01	iqk_mul_mat(bitnet): fix typo With the last change (which added the typo), I'm now getting PP-512 = 300 t/s on the Ryzen-5975WX.	2024-06-22 12:02:51 +03:00
Kawrakow	7b3cb2b96c	iqk_mul_mat(bitnet): slightly faster AVX2 We now get 214 t/s on the Ryzen-7950X	2024-06-22 12:02:51 +03:00
Kawrakow	e6d8441397	iq1_bn: better NEON implementation PP is decent with 131 t/s (q4_0 has 150 t/s). TG is better than last commit but still bad at 33.1 t/s (in comparison q4_0 gets 52.3 t/s). I had to go to the (0, 1, 2) table. Apple Silicon clearly does not like operations with signs.	2024-06-22 12:02:51 +03:00
Kawrakow	3686304e03	iq1_bn(NEON): works now, but very slow Basically 2X slower than q4_0.	2024-06-22 12:02:51 +03:00
Kawrakow	4b64224645	iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working	2024-06-22 12:02:51 +03:00
Kawrakow	77d8637925	iqk_mul_mat(iq1_bn): WIP NEON (not working)	2024-06-22 12:02:51 +03:00
Kawrakow	dfdc4dbee6	iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2 I now get PP-512 = 270 t/s on the Ryzen-5975WX	2024-06-22 12:02:51 +03:00
Kawrakow	dff96fb5f8	iqk_mul_mat: improve iq1_bn (bitnet) on AVX2 We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Kawrakow	88e98260bf	bitnet: scale is per row, not per tensor	2024-06-22 12:02:51 +03:00
Kawrakow	077270395b	iqk_mul_mat: add iq1_bn (bitnet) We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Kawrakow	f9490aea46	iqk_mul_mat: cleanup	2024-06-22 12:02:50 +03:00
Kawrakow	389e6220e9	iqk_mul_mat: be independent of llamafile_sgemm Verified that it works on AVX2. Also turned on any combination of f16 and f32 (i.e., added f16 x f16 and f32 x f32).	2024-06-22 12:02:50 +03:00
Kawrakow	915a1b2665	iqk_mul_mat: be independent of llamafile_sgemm (WIP) * Remove iqk_mul_mat from llamafile_sgemm * Pass tensor types and strides to iqk_mul_mat It is marked WIP because only tested on __aarch64__	2024-06-22 12:02:50 +03:00
Kawrakow	154f56a8de	iqk_mul_mat: be able to handle any f16/f32 combination on AVX2 But only turning on f16 x f32 and f32 x f16 for now.	2024-06-22 12:02:50 +03:00
Kawrakow	1211a4b5d0	iqk_mul_mat: turn on AVX512 It makes no difference on my Ryzen-7950X, but perhaps it will be beneficial for CPUs with real AVX512.	2024-06-22 12:02:50 +03:00
Kawrakow	dfcb8bebc5	iqk_mul_mat: slightly better fp16 with 16 vector registers 2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX up from 138 t/s. We use Nx registers to preload the fp16 weights, so total registers required is Nx * (Ny + 1), so 15 in the case of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare register helps. But maybe it is just a matter of how things get loaded into the cache. On the 7950X I did try 3 x 8 and it did not perform as well as 5 x 5.	2024-06-22 12:02:50 +03:00
Kawrakow	9dba81ddf2	iqk_mul_mat: better fp16 for AVX2 Basically use what I did for Arm. Improves PP performance to 141.7 t/s up from 136 t/s on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling). This is now 10% faster than tinyBLAS. There is a minor improvement also on the Ryzen-5975WX (16 vector registers, so we use 4x3 tiling): we get 138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.	2024-06-22 12:02:50 +03:00
Kawrakow	baf6aaa31b	iqk_mul_mat: fp16 for Arm ~2% slower than tinyBLAS - not sure why.	2024-06-22 12:02:50 +03:00
Kawrakow	6ec0fcc5c7	iqk_mul_mat: slightly faster FANCY_SIMD dot product About 2% faster for q4_K.	2024-06-22 12:02:50 +03:00
Kawrakow	5812618409	iqk_mul_mat: fix q8_0 I was happily using _mm256_packs_epi32() to pack the q8_0 x q8_0 dot products back to int16_t, and getting useful results. But theoretically this can overflow, so it is better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine the 4 dot products using int32_t additions. This is (almost) as fast, unlike _mm256_hadd_epi32(), which seems excessively slow on the Ryzen-7950X.	2024-06-22 12:02:50 +03:00
Kawrakow	c7870afaad	iqk_mul_mat: use block_q8_1_x4 also for AVX2 Here the performance gain is more significant. E.g., for q4_1, PP-512 becomes 168 t/s up from 137 t/s. Now the performance gap to q4_0 is so significant that I wonder if I should change to using Q8_1 also for the qX_0 legacy quants.	2024-06-22 12:02:50 +03:00
Kawrakow	5b19e5e4a9	iqk_mul_mat: use block_q8_0_x4 also for AVX2	2024-06-22 12:02:50 +03:00
Kawrakow	30a0bf30fa	iqk_mul_mat: delete unused stuff	2024-06-22 12:02:50 +03:00
Kawrakow	64da6f7a97	iqk_mul_mat: add q8_0 It was actually ready but not turned on. Having forgotten, I made a new implementation along the lines of the fp16 implementation (i.e., using tiling). That matched tinyBLAS performance. But the existing implementation that I now turned on is faster: PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)	2024-06-22 12:02:50 +03:00
Kawrakow	f2ced256b4	iqk_mul_mat: fp16 tweaks Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers). This works best for the Ryzen-5975WX.	2024-06-22 12:02:50 +03:00
Kawrakow	b4ecd2dce6	iqk_mul_mat: fp16 implementation cleanup It turns out on my Ryzen-7950X CPU using AVX512 is slower.	2024-06-22 12:02:50 +03:00
Kawrakow	e0b52e14a6	iqk_mul_mat: fp16 implementation for AVX2 This simple implementation beats jart's tinyBLAS by a small margin (143 t/s vs 137 t/s for PP-512, TG is 4.75 t/s, so exactly the same as ggml).	2024-06-22 12:02:50 +03:00
Kawrakow	ea239f8572	iqk_mul_mat: make it independent of sgemm	2024-06-22 12:02:50 +03:00
Kawrakow	5039ea8930	iqk_mul_mat: minor improvements Current performance: \| model \| size \| threads \| test \| t/s \| \| ----------------- \| ---------: \| -------: \| ------: \| ---------------: \| \| llama 7B IQ3_S \| 2.75 GiB \| 16 \| pp512 \| 100.21 ± 0.32 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 16 \| pp512 \| 105.25 ± 0.75 \| \| llama 7B IQ2_M \| 2.20 GiB \| 16 \| pp512 \| 117.88 ± 0.15 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 16 \| pp512 \| 136.38 ± 0.24 \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 16 \| pp512 \| 128.47 ± 0.39 \| mean: 117.64 \| ----------------- \| ---------: \| -------: \| ------: \| ---------------: \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 8 \| tg128 \| 23.94 ± 0.04 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 8 \| tg128 \| 23.27 ± 0.03 \| \| llama 7B IQ2_M \| 2.20 GiB \| 8 \| tg128 \| 18.88 ± 0.03 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 8 \| tg128 \| 19.07 ± 0.04 \| \| llama 7B IQ3_S \| 2.75 GiB \| 8 \| tg128 \| 15.44 ± 0.05 \| mean: 20.12	2024-06-22 12:02:50 +03:00
Kawrakow	e85753e1ad	iqk_mul_mat: no more templates in the IQ dequantizers Also moved the quant specific code from the EvenSignHelper into the corresponding dequantizers. These two changes had a tiny performance benefit (much too small compared to what I was expecting/hoping for).	2024-06-22 12:02:50 +03:00
Kawrakow	b8556267cd	iqk_mul_mat: remove template on one of the prepare() functions	2024-06-22 12:02:49 +03:00
Kawrakow	44b1b4fb97	iqk_mul_mat: experimenting with zen4 Nope, we cannot have good performance for iq2_xxs and iq3_xxs at the same time. If I don't force inline the sign functions, I get better performance for iq2_xxs and bad performance for iq3_xxs. If I force inline them, it is the other way around. Anyway, this is what we have now on Zen4 for all quants with forced inline EvenSignHelper methods: \| model \| size \| threads \| test \| t/s \| \| -----------------\| ---------: \| ------: \| -----: \| ------------: \| \| llama 7B IQ3_S \| 2.75 GiB \| 16 \| pp512 \| 100.91 ± 0.26 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 16 \| pp512 \| 106.08 ± 0.78 \| \| llama 7B IQ2_M \| 2.20 GiB \| 16 \| pp512 \| 116.41 ± 0.25 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 16 \| pp512 \| 132.54 ± 1.07 \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 16 \| pp512 \| 125.53 ± 0.06 \| arithmetic mean: 116.29 geometric mean: 115.70 \| -----------------\| ---------: \| ------: \| -----: \| ------------: \| \| llama 7B IQ3_S \| 2.75 GiB \| 8 \| tg128 \| 15.69 ± 0.04 \| \| llama 7B IQ3_XXS \| 2.41 GiB \| 8 \| tg128 \| 18.02 ± 0.04 \| \| llama 7B IQ2_M \| 2.20 GiB \| 8 \| tg128 \| 18.94 ± 0.03 \| \| llama 7B IQ2_XS \| 1.89 GiB \| 8 \| tg128 \| 23.29 ± 0.02 \| \| llama 7B IQ2_XXS \| 1.73 GiB \| 8 \| tg128 \| 22.96 ± 0.09 \| arithmetic mean: 19.78 geometric mean: 19.56 Without force-inlining, PP(iq3_xxs) drops to 98 t/s while PP(iq2_xxs) increases to 137 t/s.	2024-06-22 12:02:49 +03:00
Kawrakow	eb9e2b628a	iqk_mul_mat: experimenting with zen4 (iq2_xxs) Observing again the weirdness of performance drop in a quant because of a change in another quant. After I added FANCY_SIMD implementations for iq3_s, iq2_s and iq2_xs, I'm observing that iq2_xxs PP performance dropped to 130 t/s from 139 t/s. Adding FANCY_SIMD implementation for applying the signs brings it back to 137 t/s and gives a small boost for TG as well (23.4 vs 23.0 t/s)	2024-06-22 12:02:49 +03:00
Kawrakow	2c8d3dad1f	iqk_mul_mat: experimenting with zen4 (iq2_xs)	2024-06-22 12:02:49 +03:00
Kawrakow	0d9027fe74	iqk_mul_mat: experimenting with zen4 (iq3_s and iq2_m)	2024-06-22 12:02:49 +03:00
Kawrakow	ed8f1fe490	iqk_mul_mat: small improvement for iq3_s The same as in llamafile. We get PP-512 = 96.6 t/s TG-128 = 7.77 t/s @ 4 threads 14.4 t/s @ 8 threads 16.3 t/s @ 16 threads	2024-06-22 12:02:49 +03:00
Kawrakow	01d55dcbf0	iqk_mul_mat: better AVX2 implementation for iq2_xxs From here on switching to GCC 12. PP-512 is now 139.3 t/s. TG-128 is 13.5 t/s @ 4 threads 23.0 t/s @ 8 threads 25.1 t/s @ 16 threads	2024-06-22 12:02:49 +03:00
Kawrakow	d4e9e595f9	iqk_mul_mat: better AVX2 implementation for iq2_xxs 2.41X for PP-512 (120.5 t/s). Slightly faster for TG @ 4 threads (12.2 t/s vs 11.9 t/s). But somehow slower at 16 threads - 22.65 t/s vs 26.3 t/s. Very strange.	2024-06-22 12:02:49 +03:00

67 Commits