Commit Graph

23 Commits

Author SHA1 Message Date
Iwan Kawrakow
770f3585c2 Add copyright notices
Only on the files where I have contributed in a significant way,
or the files I wrote myself.
2024-07-24 20:11:42 +03:00
Iwan Kawrakow
9eee03f4ee Remove unused file 2024-07-24 19:33:19 +03:00
Iwan Kawrakow
6a132862fd Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize 2024-07-17 16:51:34 +03:00
Iwan Kawrakow
a4017cc047 iq1bn: faster scalar dot product
At the end of the day, lookup is still better when not using SIMD.
This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X
with 16 threads (up from 10.5 t/s).
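
A scalar kernel cannot hide decode latency the way SIMD code can, which is why a precomputed lookup table wins here. Below is a minimal illustration of the trade-off, assuming a base-3 style packing of 5 ternary digits per byte; the actual iq1_bn packing and table differ, this only contrasts arithmetic decode with a table load.

```c
#include <stdint.h>

// Illustrative only: 5 ternary digits (0/1/2) packed base-3 into one byte.
// Decoding them arithmetically costs a divide/modulo per quant:
static void decode_arith(uint8_t b, uint8_t q[5]) {
    for (int j = 0; j < 5; ++j) { q[j] = b % 3; b /= 3; }
}

// A 256-entry table computed once turns the same decode into a single load,
// which is what makes the lookup version win in scalar (non-SIMD) code:
static uint8_t lut[256][5];
static void init_lut(void) {
    for (int b = 0; b < 256; ++b) decode_arith((uint8_t)b, lut[b]);
}
static inline const uint8_t *decode_lut(uint8_t b) { return lut[b]; }
```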
2024-07-17 16:09:01 +03:00
Iwan Kawrakow
a0df4002fc iq1bn: fix scalar dot product
The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s)
but slower on the M2 (6.8 t/s vs 8.6 t/s before).
2024-07-17 13:37:18 +03:00
Iwan Kawrakow
ba00f23ea1 iq1bn: adjust scalar dot product and some cleanup 2024-07-17 08:44:46 +02:00
Iwan Kawrakow
873a790ee2 iq1bn(no lookup): better version
We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).
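
For reference, a sketch of the resulting block layout; the field names and the base-3 style packing are chosen here for illustration only and may differ from the actual encoding in the commit:

```c
#include <stdint.h>

// Illustrative 1.625 bpw block of 64 ternary quants (13 bytes total):
//  - ql[4][3]: for each of the 4 groups of 16, three bytes, each packing
//    5 ternary values (3^5 = 243 <= 256), i.e. quants 0..14 of that group
//  - extra: one byte packing the 16th quant of each of the 4 groups,
//    using the same packing with the 5th slot unused
typedef struct {
    uint8_t ql[4][3];
    uint8_t extra;
} block_iq1_bn_sketch;   // 13 bytes * 8 / 64 quants = 1.625 bpw
```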

This achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:

| model            |       size |     params | backend    | threads |          test |              t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         pp512 |  9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | CUDA       |       8 |         tg128 |    229.85 ± 0.33 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         pp512 |    322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |      16 |         tg128 |     59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       8 |         tg128 |     57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       4 |         tg128 |     33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | AVX2       |       2 |         tg128 |     18.30 ± 0.01 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         pp512 |    698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | Metal      |       8 |         tg128 |     68.88 ± 0.24 |

| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         pp512 |    196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       8 |         tg128 |     51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       4 |         tg128 |     30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB |     3.32 B | NEON       |       2 |         tg128 |     16.89 ± 0.01 |

It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
2024-07-17 08:54:11 +03:00
Iwan Kawrakow
a4bbd36905 iq1bn: attempt without a lookup table 2024-07-15 11:02:41 +03:00
Iwan Kawrakow
753dbaeeb0 bitnet: remove iq1_bn lookup table storing +/- signs
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign-based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.
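
Why the 0,1,2 table can replace the signed table, in a minimal scalar sketch (the real AVX2 kernel in iqk_mul_mat is of course vectorized; the names here are hypothetical):

```c
#include <stdint.h>

// Quants q[i] in {0,1,2} stand for weights q[i]-1 in {-1,0,+1}, so
//   sum_i (q[i]-1)*y[i] = sum_i q[i]*y[i] - sum_i y[i]
// and one extra accumulator for the activation sum replaces all sign handling.
static int32_t dot_012(const uint8_t *q, const int8_t *y, int n) {
    int32_t sum_qy = 0, sum_y = 0;
    for (int i = 0; i < n; ++i) {
        sum_qy += (int32_t)q[i] * y[i];
        sum_y  += y[i];
    }
    return sum_qy - sum_y;
}
```

On AVX2 the unsigned-times-signed products map naturally onto _mm256_maddubs_epi16, which is presumably why the 0,1,2 variant ends up slightly faster.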

With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
2024-06-25 18:19:11 +03:00
Iwan Kawrakow
8b436a84c5 bitnet: simdify q8_K64 quantization on AVX
Doesn't make a real difference in performance.
2024-06-25 17:20:34 +03:00
Iwan Kawrakow
c906c4c4fe bitnet: NEON improvements for iq1_bn
With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.
2024-06-25 13:48:29 +02:00
Iwan Kawrakow
aa14a06b44 Bitnet: trying an alternative iq1_bn grid
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertainties of 1-2 t/s. Same for PP, results are
jumping back and forth between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
2024-06-25 11:32:48 +03:00
Iwan Kawrakow
cc44d4a5c3 bitnet: fix scalar dot product for 1.625 bpw
I had not adjusted after going to 4 q8 scales per row.
2024-06-25 08:31:12 +02:00
Iwan Kawrakow
b747093582 bitnet: qnfs tests
Q8_0 fails because, by design, the reference quantization
is different from the vecdot quantization.
2024-06-22 12:02:53 +03:00
Iwan Kawrakow
f0325c5826 bitnet(scale in a separate tensor): more CPU improvements
It seems it is enough to have 4 scales per row for Q8.
I get PPL = 8.5470 with this, which is slightly higher than
the 8.5430 we get with 1 scale per 128 activations, but still
OK, I think.
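
Roughly, "4 scales per row" for the Q8 activations means splitting each row into 4 equal chunks and giving each its own fp32 scale; a plain-C sketch with hypothetical names (the actual code lives in the Q8_K64 path):

```c
#include <math.h>
#include <stdint.h>

// Hypothetical sketch: quantize one activation row of n floats to int8
// with 4 scales, one per quarter of the row (assumes n % 4 == 0).
static void quantize_row_q8_4scales(const float *x, int8_t *q, float d[4], int n) {
    const int chunk = n / 4;
    for (int s = 0; s < 4; ++s) {
        const float *xs = x + s*chunk;
        float amax = 0.0f;
        for (int i = 0; i < chunk; ++i) amax = fmaxf(amax, fabsf(xs[i]));
        d[s] = amax / 127.0f;
        const float id = amax > 0.0f ? 127.0f/amax : 0.0f;
        for (int i = 0; i < chunk; ++i) q[s*chunk + i] = (int8_t)lroundf(xs[i]*id);
    }
}
```
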
With this, we get the following performance:

| System       | quant |        PP-512 |       TG-128 | quant |        PP-512 |       TG-128 |
| ------------ | ----- | ------------: | -----------: | ----- | ------------: | -----------: |
| M2 Max       | iq2bn | 229.02 ± 0.37 | 78.75 ± 0.61 | iq1bn | 146.67 ± 2.85 | 33.12 ± 0.03 |
| Ryzen-7950X  | iq2bn | 379.36 ± 1.03 | 49.08 ± 0.18 | iq1bn | 247.12 ± 1.53 | 32.80 ± 0.02 |
| Ryzen-5975WX | iq2bn | 465.28 ± 0.57 | 39.17 ± 0.02 | iq1bn | 325.86 ± 0.46 | 26.60 ± 0.10 |
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
e05cca9ef6 bitnet(scale in a separate tensor): CPU improvements
Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat
to deal with that. This improves PP speed by a few percent.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
58d9e8f1d2 bitnet: put the scale in a separate tensor
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
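
The equivalence being exploited, in a tiny standalone example (illustrative only, not the ggml graph code): scaling after the matmul gives the same result as scaling the weights before it, so the quantized matmul can run on the raw ternary weights and the scale is applied by an extra op afterwards.

```c
#include <stdio.h>

// For a scalar s: (s*W) @ x == s * (W @ x), so the scale can live in its
// own tensor and be applied after the quantized matmul.
int main(void) {
    const float W[2][3] = {{1, -1, 0}, {0, 1, 1}};   // ternary weights
    const float x[3]    = {0.5f, 2.0f, -1.0f};       // activations
    const float s       = 0.037f;                    // separate scale tensor

    for (int r = 0; r < 2; ++r) {
        float dot = 0.0f, fused = 0.0f;
        for (int c = 0; c < 3; ++c) {
            dot   += W[r][c] * x[c];
            fused += (s * W[r][c]) * x[c];
        }
        printf("row %d: scale-after = %g, scale-fused = %g\n", r, s*dot, fused);
    }
    return 0;
}
```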
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
927e251a12 Bitnet(1.75 bpw): higher precision fp8 scale
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (but the previous
version with 4 bits for the exponent and mantissa was
good enough for any practical purposes).
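
A sketch of decoding such a scale, assuming an unsigned layout (no sign bit, since scales are non-negative), 3 exponent bits, 5 mantissa bits, and an illustrative bias; the commit does not spell out the bias or the treatment of e = 0:

```c
#include <math.h>
#include <stdint.h>

// Hypothetical e3m5 decode: value = (1 + m/32) * 2^(e - BIAS).
#define E3M5_BIAS 4   // assumed; not specified in the commit

static float e3m5_to_float(uint8_t v) {
    const int e = v >> 5;     // top 3 bits: exponent
    const int m = v & 0x1f;   // low 5 bits: mantissa
    return ldexpf(1.0f + m/32.0f, e - E3M5_BIAS);
}
```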
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
8c6276f6a1 Bitnet: 2.25 bpw version
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
2024-06-22 12:02:52 +03:00
Iwan Kawrakow
f6863cfa1b bitnet: add 2 bpw quantization
The scalar dot product already achieves 37 t/s for TG!
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
765622ff8f Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice 2024-06-22 12:02:51 +03:00
Iwan Kawrakow
d1c40ff7e2 bitnet: fix scalar dot product
I had forgotten to adjust for the change to q8_K64.
On the M2 I'm getting 10.8 t/s with the scalar version!
2024-06-22 12:02:51 +03:00
Iwan Kawrakow
f20b28558b bitnet: python + llama 2024-06-22 12:02:51 +03:00