ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-05 14:00:10 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	c2158c15d9	iqk_mul_mat(NEON): adding forgotten fp16 matrix x vector implementation	2024-07-25 08:37:13 +02:00
Iwan Kawrakow	770f3585c2	Add copyright notices Only on the files where I have contributed in a significant way, or the files I wrote myself.	2024-07-24 20:11:42 +03:00
Iwan Kawrakow	9eee03f4ee	Remove unused file	2024-07-24 19:33:19 +03:00
Iwan Kawrakow	6b4167164c	iqk_mul_mat(NEON): special case for n not divisible by 8 Else fp16 PP performance drops by nearly a factor of 2 compared to what we had before.	2024-07-24 08:04:47 +02:00
Iwan Kawrakow	abb740c9a4	Fix "make it work for row sizes that are multiple of 4 on NEON"	2024-07-24 08:04:47 +02:00
Iwan Kawrakow	30b8bcf1a3	iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON Here the performance gain is more modest compared to AVX2: we get PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B running on M2 Max.	2024-07-18 13:55:51 +02:00
Iwan Kawrakow	744eb9ffa9	iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2 I was trying to understand where the Bitnet bottleneck is, and at some point noticed the Q*K matrixt multiplication where Q and K have the shape of 100 x n_token x 32 x 1. The existing iqk_mul_mat for floats rerquiers that the row size is a multiple of the SIMD vector size (so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this matrix multiiplication was getting done with ggml. Changing the iqk_mul_mat float kernel to handle row sizes that are a multiple of 4 (via __m128 for the last values in a row) resulted in nearly a 20% performance boost for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance increases by nearly 70%!	2024-07-18 11:39:32 +03:00
Iwan Kawrakow	7024ecfeb4	iq1bn: faster AVX2 Instead of shuffling quant data into a 128-bit register containing 8-bit ints, and then converting to 16 bit, we directly shuffle into a 256-bit register containing 16 bit ints. TG-128 @ 2 threads goes from 18.3 to 21.6 t/s. TG-128 performance now saturates already at 8 threads getting 60.4 t/s. There is almost no impact on PP-512 (322 -> 323 t/s). I guess, we amortize dequantization cost pretty well, so we don't gain much there. We get close to 100 GB/s single-threaded float32 throuput: ./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn iq1_bn vec_dot_q 4096 values (0.02 MB) min cycles/32 vals : 3.87 avg cycles/32 vals : 4.40 float32 throughput : 98.27 GB/s quantized throughput : 4.99 GB/s	2024-07-17 10:17:05 +03:00
Iwan Kawrakow	873a790ee2	iq1bn(no lookup): better version We have 4 groups of 16 in a block of 64 quants. For each group of 16 we have 3 groups of 5, each using 8 bits. The remaining 16'th quants of the 4 groups of 16 are encoded with 8 bits using the same encoding as the groups of 5. The only kernel where we have complications is the CUDA dequantize kernel (because we are dequantizing 8 quants there, and we have different encoding for the 1st and 2nd group of 8 in a group of 16). Ths achieves better performance on all tested platforms than any previous 1.625 bpw attempt. We have: \| model \| size \| params \| backend \| threads \| test \| t/s \| \| ---------------- \| ---------: \| ---------: \| ---------- \| ------: \| ------------: \| ---------------: \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| CUDA \| 8 \| pp512 \| 9613.02 ± 24.54 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| CUDA \| 8 \| tg128 \| 229.85 ± 0.33 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 16 \| pp512 \| 322.59 ± 1.00 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 16 \| tg128 \| 59.79 ± 0.03 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 8 \| tg128 \| 57.62 ± 0.21 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 4 \| tg128 \| 33.66 ± 0.29 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 2 \| tg128 \| 18.30 ± 0.01 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| Metal \| 8 \| pp512 \| 698.13 ± 0.21 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| Metal \| 8 \| tg128 \| 68.88 ± 0.24 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 8 \| pp512 \| 196.80 ± 0.50 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 8 \| tg128 \| 51.58 ± 0.41 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 4 \| tg128 \| 30.80 ± 0.03 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 2 \| tg128 \| 16.89 ± 0.01 \| It is still slower than 2 bpw Bitnet, but the difference now is not as dramatic.	2024-07-17 08:54:11 +03:00
Iwan Kawrakow	6393e26827	iq1bn(no lookup): NEON attempts We are at TG-128 = 25.7 t/s, which is quite a bit worse than lookup.	2024-07-16 08:32:15 +02:00
Iwan Kawrakow	26a1a689c6	iq1bn(no lookup): NEON Pretty bad.	2024-07-15 20:40:14 +02:00
Iwan Kawrakow	e4dc3babb5	iq1bn(no lookup): somewhat better We now have for Bitnet-3B: \| threads \| test \| t/s \| \| ------: \| ------------: \| ---------------: \| \| 16 \| pp512 \| 308.97 ± 1.89 \| \| 16 \| tg128 \| 58.80 ± 0.07 \| \| 8 \| tg128 \| 49.79 ± 1.23 \| \| 4 \| tg128 \| 28.85 ± 0.02 \| \| 2 \| tg128 \| 15.39 ± 0.01 \|	2024-07-15 13:46:07 +03:00
Iwan Kawrakow	a4bbd36905	iq1bn: attempt without a lookup table	2024-07-15 11:02:41 +03:00
Iwan Kawrakow	753dbaeeb0	bitnet: remove iq1_bn lookup table storing +/- signs The AVX2 implementation was the only one left using it, so I decided to see if we can get a performant implementation using the 0,1,2 lookup table. Turns out we can, and it is even slightly faster than the sign based table. We now get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads on the Ryzen-7950X. With only one lookup table left for iq1_bn, I renamed it to iq1bn_grid_u16.	2024-06-25 18:19:11 +03:00
Iwan Kawrakow	8b436a84c5	bitnet: simdify q8_K64 quantization on AVX Doesn't make a real difference in performance.	2024-06-25 17:20:34 +03:00
Iwan Kawrakow	c906c4c4fe	bitnet: NEON improvements for iq1_bn With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.	2024-06-25 13:48:29 +02:00
Iwan Kawrakow	7de9559cf2	Bitnet: adapt NEON and Metal to the alternative grid	2024-06-25 11:16:13 +02:00
Iwan Kawrakow	aa14a06b44	Bitnet: trying an alternative iq1_bn grid Faster on CUDA. The scalar version is faster too. The issue with CUDA is that now I see wild performance fluctuations. Running llama-bench I can get 220 t/s for TG-128 one time, and 190 t/s another time, with uncertaintiers of 1-2 t/s. Same for PP, results are jumping back-and-fort between ~9500 t/s and ~8900 t/s. So, basically no reliable measurement at this point, but for sure faster than the previous version, which was at around 170-180 t/s.	2024-06-25 11:32:48 +03:00
Iwan Kawrakow	3d61866f0a	Bitnet: slightly faster 1.625 bpw variant for AVX512VL	2024-06-25 08:33:00 +03:00
Iwan Kawrakow	fc04994ebf	iqk_mul_mat: add IQ4_NL also on NEON PPL seems somewhat higher? For llama-v2-7B iwe are still ~0.04 higher compared to hat we expect after ~30 batches.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	caa42ccc56	iqk_mul_mat: add IQ4_NL I never use it, so I had completely forgotten about it.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	86dc8e5f8b	bitnet(scale in a separate tensor): CPU tweaks A somewhat nicer iq2_bn implementation on AVX2.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	729ba46f77	bitnet(scale in a separate tensor): CPU tweaks I had ruined TG performance on AVX2 with the last commit. Was just testing at 8 threads and there we are totally memory bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950. Back to 51 t/s with this commit.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	f0325c5826	bitnet(scale in a separate tensor): more CPU improvements It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance: Systema \| quant \| PP-512 \| TG-128a \| quant \| PP-512 \| TG-12s \| M2 Max \| iq2bn 229.02 ± 0.37 78.75 ± 0.61 \| iq1bn \| 146.67 ± 2.85 33.12 ± 0.03 Ryzen7950\| iq2bn 379.36 ± 1.03 49.08 ± 0.18 \| iq1bn \| 247.12 ± 1.53 32.80 ± 0.02 Ryzen5975\| iq2bn 465.28 ± 0.57 39.17 ± 0.02 \| iq1bn \| 325.86 ± 0.46 26.60 ± 0.10	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	e05cca9ef6	bitnet(scale in a separate tensor): CPU improvements Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speef by a few percent.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	58d9e8f1d2	bitnet: put the scale in a separate tensor and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Commiting so I can go and test on othe platforms.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	927e251a12	Bitnet(1.75 bpw): higher precision fp8 scale Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL to be the same as fp16 (but the previous version with 4 bits for the exponent and mantissa was good enough for any practical purposes).	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	2998ca9b14	Bitnet(2.25 bpw): NEON We get PP-512 = 192 t/s, TG-128 = 72 t/s	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	8c6276f6a1	Bitnet: 2.25 bpw version Just scaler and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzn-7950X, 404 t/s on Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	1de6476d75	bitnet 2 bpw: NEON implementation We get PP-512 = 190 t/s and TG-128 = 75 t/s. 2 bpw TG on the CPU beats 1.75 bpw on the GPU!	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	f97a329638	Removed extra column	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	6616985135	bitnet 2 bpw: AVX2 implementation We get PP-512 = 322 t/s. TG is already 51.6 t/s at 4 threads, then it saturates and starts going down for more than 8 threads.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	d82e5db6e5	iqk_mul_mat(bitnet): fix typo With the last change (which added the typo), I'm now getting PP-512 = 300 t/s on the Ryzen-5975WX.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	ddea72453b	iqk_mul_mat(bitnet): slightly faster AVX2 We now get 214 t/s on the Ryzen-7950X	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	30a771bd6b	iq1_bn: better NEON implementation PP is decent with 131 t/s (q4_0 has 150 t/s). TG is better than last commit but still bad at 33.1 t/s (in comparison q4_0 gets 52.3 t/s). I had to go to the (0, 1, 2) table. Apple Silicon clearly does not like operations with signs.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	8222c9f3d1	iq1_bn(NEON): works now, but very slow Basically 2X slower tan q4_0.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	0c5a353ebd	iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	bf22b701f4	iqk_mul_mat(iq1_bn): WIP NEON (not working)	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	29d9bf65f3	iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2 I now get PP-512 = 270 t/s on the Ryzen-5975WX	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	91ec824f2d	iqk_mul_mat: improve iq1_bn (bitnet) on AVX2 We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	4fcfcd05d1	bitnet: scale is per row, not per tensor	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	7f8901dca1	iqk_mul_mat: add iq1_bn (bitnet) We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	58756ef03f	iqk_mul_mat: cleanup	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	7501184eb4	iqk_mul_mat: be independent of llamafile_sgemm Verified that it works on AVX2. Also turned on any combination of f16 and f32 (i.e., added f16 x 16 and f32 x f32).	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	ad53eabf87	iqk_mul_mat: be independent of llamafile_sgemm (WIP) * Remove iqk_mul_mat from llamafile_sgemm * Pass tensor types and strides to iqk_mul_mat It is marked WIP because only tested on __aarch64__	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	81cf6990f5	iqk_mul_mat: be able to handle any f16/f32 combination on AVX2 But only turning on f16 x f32 and f32 x f16 for now.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	b2acd81c75	iqk_mul_mat: turn on AVX512 It makes no difference on my Ryzen-7950X, but perhaps it will be beneficial for CPU's with real AVX512.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	9e3dc8c432	iqk_mul_mat: slightly better fp16 with 16 vector registers 2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX up from 138 t/s. We use Nx registers to preload the fp16 weights, so total registers required is Nx * (Ny + 1), so 15 in the case of of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess, the one spare register helps. But maybe it is just a matter of how things get loaded into the cache. On the 7950X I did try 3 x 8 and it did not perform as well as 5 x 5.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	ae1e77c5de	iqk_mul_mat: better fp16 for AVX2 Basically use what I did for Arm. Improves PP performance to 141.7 t/s up from 136 t/s on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling). This is now 10% faster than tinyBLAS. There is a minor improvement also on the Ryzen-5975WX (16 vector registers, so we use 4x3 tiling): we get 138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	9386b49918	iqk_mul_mat: fp16 for Arm ~2% slower than tinyBLAS - not sure why.	2024-06-22 12:02:50 +03:00

1 2

87 Commits