This reverts commit f83381371b61e0863b55c60e5f5df139126a496d.
When using CUDA, the tensor contents have not been loaded yet,
so we crash when trying to access the scale while building the
graph. There must be a better way.
iq2_bn TG-128 drops to 84 t/s, while I see in the logs
that we had 97 t/s. If true, that's a pretty massive
performance penalty for TG. Let me guess: ggml_mul is not
exactly the most performant operation on Metal.
and correspondingly add an extra ggml_mul_mat operation.
As per @ggerganov, this is how things should be done.
It seems to be working, but as far as I can tell this
results in a ~15% performance penalty for prompt processing.
Committing so I can go and test on other platforms.
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL the same as fp16 (but the previous
version with 4 bits each for the exponent and the mantissa was
good enough for any practical purpose).
We get PP-512 = 702 t/s, TG-128 = 84 t/s.
This is almost on par with q4_0, which is rare on Metal
(not to say unheard of).
For reference, q4_0 gives 726 t/s / 86 t/s for Bitnet.
TG is kind of funny because we hit 72 t/s on the CPU.
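
For illustration of the 3-bit-exponent / 5-bit-mantissa format above, here is a
minimal sketch of such an 8-bit float (assumptions: the value is non-negative so
there is no sign bit, and the exponent bias is 3; the helper names are
hypothetical, not the actual code):

    #include <cstdint>
    #include <cstdio>
    #include <cmath>

    // Hypothetical e3m5 helpers: 3 exponent bits, 5 mantissa bits, no sign bit.
    constexpr int kExpBits  = 3;
    constexpr int kMantBits = 5;
    constexpr int kBias     = (1 << (kExpBits - 1)) - 1;   // 3

    static uint8_t encode_e3m5(float x) {
        if (x <= 0.f) return 0;
        int e;
        float m = std::frexp(x, &e);                        // x = m * 2^e, m in [0.5, 1)
        m *= 2.f; e -= 1;                                   // now m in [1, 2)
        int be = e + kBias;
        if (be < 0) return 0;                               // underflow -> 0
        int mant = (int)std::lround((m - 1.f) * (1 << kMantBits));
        if (mant == (1 << kMantBits)) { mant = 0; ++be; }   // rounding carried into the exponent
        if (be >= (1 << kExpBits)) {                        // clamp overflow to the largest code
            be = (1 << kExpBits) - 1; mant = (1 << kMantBits) - 1;
        }
        if (be == 0 && mant == 0) mant = 1;                 // keep positive inputs distinct from code 0
        return (uint8_t)((be << kMantBits) | mant);
    }

    static float decode_e3m5(uint8_t q) {
        if (q == 0) return 0.f;
        int be   = q >> kMantBits;
        int mant = q & ((1 << kMantBits) - 1);
        return std::ldexp(1.f + mant / float(1 << kMantBits), be - kBias);
    }

    int main() {
        for (float x : {0.37f, 1.0f, 2.5f, 6.0f}) {
            uint8_t q = encode_e3m5(x);
            std::printf("%g -> 0x%02x -> %g\n", x, q, decode_e3m5(x));
        }
        return 0;
    }

With 5 mantissa bits the worst-case relative rounding error of the scale is
about 1.6%, versus ~3% with 4 bits, which is consistent with PPL now matching fp16.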
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
Just scalar and AVX2 for now.
PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on
Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and
the model being 10% larger.
PP is decent with 131 t/s (q4_0 has 150 t/s).
TG is better than last commit but still bad at 33.1 t/s
(in comparison q4_0 gets 52.3 t/s).
I had to go to the (0, 1, 2) table. Apple Silicon clearly
does not like operations with signs.
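
If the (0, 1, 2) table works the way I assume, it is just the identity
sum_i w_i*x_i = sum_i (w_i+1)*x_i - sum_i x_i, so the kernel only ever touches
unsigned values and subtracts the precomputed row sum once. A tiny sketch
(hypothetical, not the actual kernel):

    #include <cstdio>

    // Sketch: with q_i = w_i + 1 in {0, 1, 2} we have
    //   sum_i w_i * x_i = sum_i q_i * x_i - sum_i x_i,
    // so the dot product needs no signed weight values at all.
    int main() {
        const int   w[8] = {-1, 0, 1, 1, -1, 0, 1, -1};                 // ternary weights
        const float x[8] = {0.5f, -1.f, 2.f, 0.25f, 1.f, -0.5f, 3.f, 0.75f};
        float signed_dot = 0.f, unsigned_dot = 0.f, xsum = 0.f;
        for (int i = 0; i < 8; ++i) {
            signed_dot   += w[i] * x[i];
            unsigned_dot += (w[i] + 1) * x[i];                          // only values 0, 1, 2
            xsum         += x[i];
        }
        std::printf("signed = %g, unsigned - row_sum = %g\n", signed_dot, unsigned_dot - xsum);
        return 0;
    }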
2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX
up from 138 t/s. We use Nx registers to preload the fp16 weights,
so total registers required is Nx * (Ny + 1), so 15 in the case
of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare
register helps. But maybe it is just a matter of how things get
loaded into the cache. On the 7950X I did try 3 x 8 and it did
not perform as well as 5 x 5.
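
For the record, the register accounting looks roughly like this (a hypothetical
2x6 micro-kernel sketch, not the actual iqk code; it assumes K is a multiple of 8
and fp16 weights converted via F16C, compiled with -mavx2 -mfma -mf16c):

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>
    #include <vector>

    // Hypothetical Nx x Ny tile: Nx weight vectors are preloaded per k-step and
    // Nx*Ny accumulators stay live, i.e. Nx*(Ny+1) registers -> 14 for 2x6, 15 for 3x4.
    constexpr int Nx = 2, Ny = 6;

    static void gemm_tile_f16_f32(int K,
                                  const uint16_t * A,   // fp16 weights, Nx rows, row stride K
                                  const float    * B,   // fp32 activations, Ny columns, stride K
                                  float          * C) { // Nx*Ny dot products
        __m256 acc[Nx][Ny];
        for (int ix = 0; ix < Nx; ++ix)
            for (int iy = 0; iy < Ny; ++iy) acc[ix][iy] = _mm256_setzero_ps();
        for (int k = 0; k < K; k += 8) {
            __m256 w[Nx];                                   // Nx preloaded weight registers
            for (int ix = 0; ix < Nx; ++ix)
                w[ix] = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(A + ix*K + k)));
            for (int iy = 0; iy < Ny; ++iy) {
                __m256 b = _mm256_loadu_ps(B + iy*K + k);   // one transient activation register
                for (int ix = 0; ix < Nx; ++ix)
                    acc[ix][iy] = _mm256_fmadd_ps(w[ix], b, acc[ix][iy]);
            }
        }
        for (int ix = 0; ix < Nx; ++ix)
            for (int iy = 0; iy < Ny; ++iy) {
                __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc[ix][iy]),
                                      _mm256_extractf128_ps(acc[ix][iy], 1));
                s = _mm_hadd_ps(s, s); s = _mm_hadd_ps(s, s);
                C[ix*Ny + iy] = _mm_cvtss_f32(s);
            }
    }

    int main() {
        constexpr int K = 32;
        std::vector<uint16_t> A(Nx*K, 0x3C00);              // fp16 bit pattern for 1.0
        std::vector<float>    B(Ny*K, 0.5f), C(Nx*Ny);      // all activations = 0.5
        gemm_tile_f16_f32(K, A.data(), B.data(), C.data());
        std::printf("C[0] = %g (expect %g)\n", C[0], 0.5 * K);
        return 0;
    }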
Basically use what I did for Arm.
Improves PP performance to 141.7 t/s up from 136 t/s
on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling).
This is now 10% faster than tinyBLAS.
There is also a minor improvement on the Ryzen-5975WX
(16 vector registers, so we use 4x3 tiling): we get
138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.
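(With the Nx*(Ny+1) register count from the earlier commit, 5x5 needs 5*6 = 30
of the 32 available vector registers, and 4x3 needs 4*4 = 16 of the 16.)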
I was happily using _mm256_packs_epi32() to pack the
q8_0 x q8_0 dot products back to int16_t, and getting useful
results. But theoretically this can overflow, so it is
better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine
the 4 dot products using int32_t additions. This is (almost)
as fast, unlike _mm256_hadd_epi32(), which seems excessively
slow on the Ryzen-7950X.
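
For reference, the unpack-based combine I mean is something like this (a sketch
with hypothetical names, not the actual kernel): four int32 accumulators are
reduced to their four horizontal sums using only 32-bit additions, so nothing
can saturate the way the int16_t pack can.

    #include <immintrin.h>
    #include <cstdint>
    #include <cstdio>

    // Reduce four __m256i vectors of int32 partial sums to one __m128i holding
    // [hsum(v0), hsum(v1), hsum(v2), hsum(v3)], using int32 additions only.
    static inline __m128i hsum4_epi32(__m256i v0, __m256i v1, __m256i v2, __m256i v3) {
        __m256i s01 = _mm256_add_epi32(_mm256_unpacklo_epi32(v0, v1), _mm256_unpackhi_epi32(v0, v1));
        __m256i s23 = _mm256_add_epi32(_mm256_unpacklo_epi32(v2, v3), _mm256_unpackhi_epi32(v2, v3));
        __m256i s   = _mm256_add_epi32(_mm256_unpacklo_epi64(s01, s23), _mm256_unpackhi_epi64(s01, s23));
        // Fold the two 128-bit halves.
        return _mm_add_epi32(_mm256_castsi256_si128(s), _mm256_extracti128_si256(s, 1));
    }

    int main() {
        // Values large enough that packing them to int16_t would saturate.
        __m256i v0 = _mm256_set1_epi32(30000);
        __m256i v1 = _mm256_set1_epi32(-25000);
        __m256i v2 = _mm256_set1_epi32(12345);
        __m256i v3 = _mm256_set1_epi32(40000);
        alignas(16) int32_t out[4];
        _mm_store_si128((__m128i *)out, hsum4_epi32(v0, v1, v2, v3));
        std::printf("%d %d %d %d\n", out[0], out[1], out[2], out[3]);  // 240000 -200000 98760 320000
        return 0;
    }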