ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-04 21:40:10 +00:00

Author	SHA1	Message	Date
Kawrakow	32ec107237	iqk_mul_mat: add IQ4_NL I never use it, so I had completely forgotten about it.	2024-06-22 12:02:52 +03:00
Kawrakow	912d6d9ce1	bitnet(scale in a separate tensor): CPU tweaks A somewhat nicer iq2_bn implementation on AVX2.	2024-06-22 12:02:52 +03:00
Kawrakow	f53d89dd53	bitnet(scale in a separate tensor): CPU tweaks I had ruined TG performance on AVX2 with the last commit. Was just testing at 8 threads and there we are totally memory bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950. Back to 51 t/s with this commit.	2024-06-22 12:02:52 +03:00
Kawrakow	52ad5764dd	bitnet(scale in a separate tensor): more CPU improvements It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance (System: quant PP-512 / TG-128): M2 Max: iq2bn 229.02 ± 0.37 / 78.75 ± 0.61, iq1bn 146.67 ± 2.85 / 33.12 ± 0.03; Ryzen7950: iq2bn 379.36 ± 1.03 / 49.08 ± 0.18, iq1bn 247.12 ± 1.53 / 32.80 ± 0.02; Ryzen5975: iq2bn 465.28 ± 0.57 / 39.17 ± 0.02, iq1bn 325.86 ± 0.46 / 26.60 ± 0.10	2024-06-22 12:02:52 +03:00
Kawrakow	167489ef6c	bitnet(scale in a separate tensor): CPU improvements Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speed by a few percent.	2024-06-22 12:02:52 +03:00
Kawrakow	8b31c14e0d	bitnet(scale in a separate tensor): mul -> scale on the CPU	2024-06-22 12:02:52 +03:00
Kawrakow	e423af855f	bitnet(scale in a separate tensor): mul -> scale on CUDA On CUDA we do not have access to the tensor data until we hit the kernel. That's why this hack. In any case, iq2_bn goes back up to 228 t/s, which is close to the 234 t/s we have without the extra scale operation. PP is 9400 t/s, down from 9600 t/s, but better than the 9200 t/s we get without making the mul -> scale replacement.	2024-06-22 12:02:52 +03:00
Kawrakow	f72db4769b	bitnet(scale in a separate tensor): mul -> scale on Metal Do the mul -> scale replacement on the fly in the Metal backend. This recovers the PP performance and cuts the TG performance degradation in half.	2024-06-22 12:02:52 +03:00
Kawrakow	30fc9b5753	Revert "bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale" This reverts commit f83381371b61e0863b55c60e5f5df139126a496d. When using CUDA, the tensor contents have not been loaded yet, so we crash when trying to access the scale when building the graph. There must be a better way.	2024-06-22 12:02:52 +03:00
Kawrakow	f024804b9a	bitnet(scale in a separate tensor): replace ggml_mul with ggml_scale This recovers part of the performance loss. On Metal TG-128 is now 92 t/s, still short of the ~100 t/s with scales applied on the fly.	2024-06-22 12:02:52 +03:00
Kawrakow	3c5cd34a05	bitnet(scale in a separate tensor): Metal iq2_bn TG-128 drops to 84 t/s, while I see in the logs that we had 97 t/s. If true, that's a pretty massive performance penalty for TG. Let me guess: ggml_mul is not exactly the most performant operation on Metal.	2024-06-22 12:02:52 +03:00
Kawrakow	14081ee2ef	bitnet(scale in a separate tensor): CUDA	2024-06-22 12:02:52 +03:00
Kawrakow	785cac7ee5	bitnet: put the scale in a separate tensor and correspondingly add an extra ggml_mul_mat operation. As per @ggerganov, this is how things should be done. It seems to be working, but as far as I can tell this results in a ~15% performance penalty for prompt processing. Committing so I can go and test on other platforms.	2024-06-22 12:02:52 +03:00
Kawrakow	1f9541172f	Bitnet(1.75 bpw): higher precision fp8 scale Use 3 bits for the exponent and 5 bits for the mantissa. This makes PPL the same as fp16 (but the previous version with 4 bits for the exponent and mantissa was good enough for any practical purposes).	2024-06-22 12:02:52 +03:00
Kawrakow	9d38a61be7	Bitnet(1.75 bpw): slightly faster CUDA dot product We get 205 t/s, so ~13% slower than 2 bit.	2024-06-22 12:02:52 +03:00
Kawrakow	f6bfdce911	Bitnet(2.25 bpw): faster Metal dot product With this we get TG-128 = 97 t/s.	2024-06-22 12:02:52 +03:00
Kawrakow	f200d36a7f	Bitnet(2.25 bpw): Metal We get PP-512 = 702 t/s, TG-128 = 84 t/s. This is almost on par with q4_0, which is rare on Metal (to not say it does not exist). For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet. TG is kind of funny because we hit 72 t/s on the CPU.	2024-06-22 12:02:52 +03:00
Kawrakow	ff718c2dc1	Bitnet(2.25 bpw): CUDA We get PP-512 = 9600 t/s, TG-128 = 234 t/s (but we need to use 8 CPU threads, else results are lower, so clearly there is something being computed on the CPU). PP-512 is very close to PP-512(fp16) = 9800 t/s	2024-06-22 12:02:52 +03:00
Kawrakow	766975ecfa	Bitnet(2.25 bpw): NEON We get PP-512 = 192 t/s, TG-128 = 72 t/s	2024-06-22 12:02:52 +03:00
Kawrakow	39982764d7	Bitnet: 2.25 bpw version Just scalar and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.	2024-06-22 12:02:52 +03:00
Kawrakow	68741281e5	bitnet 2 bpw: NEON implementation We get PP-512 = 190 t/s and TG-128 = 75 t/s. 2 bpw TG on the CPU beats 1.75 bpw on the GPU!	2024-06-22 12:02:52 +03:00
Kawrakow	a8521b73d7	Removed extra column	2024-06-22 12:02:52 +03:00
Kawrakow	8ca1bdebe4	bitnet 2 bpw: AVX2 implementation We get PP-512 = 322 t/s. TG is already 51.6 t/s at 4 threads, then it saturates and starts going down for more than 8 threads.	2024-06-22 12:02:52 +03:00
Kawrakow	318899c8b7	bitnet: add 2 bpw quantization The scalar dot product already achieves 37 t/s for TG!	2024-06-22 12:02:51 +03:00
Kawrakow	f9ba085ef7	Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice	2024-06-22 12:02:51 +03:00
Kawrakow	0efd620d01	iqk_mul_mat(bitnet): fix typo With the last change (which added the typo), I'm now getting PP-512 = 300 t/s on the Ryzen-5975WX.	2024-06-22 12:02:51 +03:00
Kawrakow	7b3cb2b96c	iqk_mul_mat(bitnet): slightly faster AVX2 We now get 214 t/s on the Ryzen-7950X	2024-06-22 12:02:51 +03:00
Kawrakow	e6d8441397	iq1_bn: better NEON implementation PP is decent with 131 t/s (q4_0 has 150 t/s). TG is better than last commit but still bad at 33.1 t/s (in comparison q4_0 gets 52.3 t/s). I had to go to the (0, 1, 2) table. Apple Silicon clearly does not like operations with signs.	2024-06-22 12:02:51 +03:00
Kawrakow	3686304e03	iq1_bn(NEON): works now, but very slow Basically 2X slower than q4_0.	2024-06-22 12:02:51 +03:00
Kawrakow	798697a6ff	iq1_bn(Metal): 66.2 -> 67.1 t/s	2024-06-22 12:02:51 +03:00
Kawrakow	bd266036b6	iq1_bn(Metal): 64 -> 66.2 t/s for TG This should be good enough. One cannot ask Apple Silicon to do too much work.	2024-06-22 12:02:51 +03:00
Kawrakow	7cb77d7a67	iq1_bn(Metal): 64 -> 66.2 t/s for TG	2024-06-22 12:02:51 +03:00
Kawrakow	04fed5cd9f	iq1_bn(Metal): 60 -> 64 t/s for TG	2024-06-22 12:02:51 +03:00
Kawrakow	5d14a2243e	iq1_bn: very slightly better Metal dot product	2024-06-22 12:02:51 +03:00
Kawrakow	15e1aec7a5	iq1_bn: Metal now works PP performance is decent (668 t/s v 724 t/s for q4_0), but TG is kind of low (60 t/s vs 81 t/s for q4_0).	2024-06-22 12:02:51 +03:00
Kawrakow	4b64224645	iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working	2024-06-22 12:02:51 +03:00
Kawrakow	77d8637925	iqk_mul_mat(iq1_bn): WIP NEON (not working)	2024-06-22 12:02:51 +03:00
Kawrakow	dfdc4dbee6	iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2 I now get PP-512 = 270 t/s on the Ryzen-5975WX	2024-06-22 12:02:51 +03:00
Kawrakow	dff96fb5f8	iqk_mul_mat: improve iq1_bn (bitnet) on AVX2 We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Kawrakow	b0967ffa79	bitnet: fix scalar dot product I had forgotten to adjust for the change to q8_K64. On the M2 I'm getting 10.8 t/s with the scalar version!	2024-06-22 12:02:51 +03:00
Kawrakow	88e98260bf	bitnet: scale is per row, not per tensor	2024-06-22 12:02:51 +03:00
Kawrakow	077270395b	iqk_mul_mat: add iq1_bn (bitnet) We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Kawrakow	eecd48eab5	bitnet: CUDA, scalar, AVX2	2024-06-22 12:02:51 +03:00
Kawrakow	81576cdcac	bitnet: python + llama	2024-06-22 12:02:51 +03:00
Kawrakow	f9490aea46	iqk_mul_mat: cleanup	2024-06-22 12:02:50 +03:00
Kawrakow	389e6220e9	iqk_mul_mat: be independent of llamafile_sgemm Verified that it works on AVX2. Also turned on any combination of f16 and f32 (i.e., added f16 x f16 and f32 x f32).	2024-06-22 12:02:50 +03:00
Kawrakow	915a1b2665	iqk_mul_mat: be independent of llamafile_sgemm (WIP) * Remove iqk_mul_mat from llamafile_sgemm * Pass tensor types and strides to iqk_mul_mat It is marked WIP because only tested on __aarch64__	2024-06-22 12:02:50 +03:00
Kawrakow	cc628b2e39	Fix nb4	2024-06-22 12:02:50 +03:00
Kawrakow	d41aef5418	iqk_mul_mat: add ability to disable it	2024-06-22 12:02:50 +03:00
Kawrakow	154f56a8de	iqk_mul_mat: be able to handle any f16/f32 combination on AVX2 But only turning on f16 x f32 and f32 x f16 for now.	2024-06-22 12:02:50 +03:00
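
Several of the bitnet commits above mention a 2 bpw format, a per-row scale, and a "(0, 1, 2) table" that avoids signed operations. The following is a minimal sketch of that general idea, not the repository's implementation: ternary weights (-1, 0, +1) are stored as unsigned 2-bit codes (0, 1, 2), and the offset is corrected once per row using the identity sum((q - 1) * a) = sum(q * a) - sum(a). The function names `pack_ternary` and `dot_packed`, the four-codes-per-byte layout, and the bit order are all assumptions made for illustration.

```python
def pack_ternary(weights):
    """Pack ternary weights (-1, 0, +1) as 2-bit codes (0, 1, 2), four per byte.

    Hypothetical layout: code j of each group occupies bits 2*j .. 2*j+1.
    """
    assert len(weights) % 4 == 0
    packed = bytearray()
    for i in range(0, len(weights), 4):
        byte = 0
        for j, w in enumerate(weights[i:i + 4]):
            assert w in (-1, 0, 1)
            byte |= (w + 1) << (2 * j)  # shift -1/0/+1 to unsigned 0/1/2
        packed.append(byte)
    return bytes(packed)


def dot_packed(packed, activations, row_scale):
    """Dot product of one packed ternary row with integer activations.

    Accumulates with the unsigned codes only, then subtracts sum(activations)
    once to undo the +1 offset, and finally applies the per-row scale.
    """
    acc = 0
    for i, a in enumerate(activations):
        q = (packed[i // 4] >> (2 * (i % 4))) & 3  # extract the 2-bit code
        acc += q * a
    return row_scale * (acc - sum(activations))


weights = [1, -1, 0, 1, 0, 0, -1, 1]
activations = [3, -2, 5, 7, 1, 4, -6, 2]
packed = pack_ternary(weights)
result = dot_packed(packed, activations, row_scale=0.5)
reference = 0.5 * sum(w * a for w, a in zip(weights, activations))
```

Here `result` and `reference` agree, and the row occupies 2 bytes for 8 weights, i.e. 2 bits per weight before any scale storage is counted. A SIMD kernel would process many codes per instruction, but the offset correction works the same way.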

3292 Commits