ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-01 17:40:25 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	01ea9a862d	Bitnet(2.25 bpw): CUDA We get PP-512 = 9600 t/s, TG-128 = 234 t/s (but we need to use 8 CPU threads, else results are lower, so clearly there is something being computed on the CPU). PP-512 is very close to PP-512(fp16) = 9800 t/s	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	2998ca9b14	Bitnet(2.25 bpw): NEON We get PP-512 = 192 t/s, TG-128 = 72 t/s	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	8c6276f6a1	Bitnet: 2.25 bpw version Just scalar and AVX2 for now. PP-512 is even faster (325 t/s on the Ryzen-7950X, 404 t/s on Ryzen-5975WX). We lose ~6-7% for TG due to being memory bound and the model being 10% larger.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	1de6476d75	bitnet 2 bpw: NEON implementation We get PP-512 = 190 t/s and TG-128 = 75 t/s. 2 bpw TG on the CPU beats 1.75 bpw on the GPU!	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	f97a329638	Removed extra column	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	6616985135	bitnet 2 bpw: AVX2 implementation We get PP-512 = 322 t/s. TG is already 51.6 t/s at 4 threads, then it saturates and starts going down for more than 8 threads.	2024-06-22 12:02:52 +03:00
Iwan Kawrakow	f6863cfa1b	bitnet: add 2 bpw quantization The scalar dot product already achieves 37 t/s for TG!	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	765622ff8f	Move Q8_K64 quantization to iqk-quantize.cpp and add copyright notice	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	d82e5db6e5	iqk_mul_mat(bitnet): fix typo With the last change (which added the typo), I'm now getting PP-512 = 300 t/s on the Ryzen-5975WX.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	ddea72453b	iqk_mul_mat(bitnet): slightly faster AVX2 We now get 214 t/s on the Ryzen-7950X	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	30a771bd6b	iq1_bn: better NEON implementation PP is decent with 131 t/s (q4_0 has 150 t/s). TG is better than last commit but still bad at 33.1 t/s (in comparison q4_0 gets 52.3 t/s). I had to go to the (0, 1, 2) table. Apple Silicon clearly does not like operations with signs.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	8222c9f3d1	iq1_bn(NEON): works now, but very slow Basically 2X slower than q4_0.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	2f403d4c93	iq1_bn(Metal): 66.2 -> 67.1 t/s	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	d42e9e2922	iq1_bn(Metal): 64 -> 66.2 t/s for TG This should be good enough. One cannot ask Apple Silicon to do too much work.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	9d58489c33	iq1_bn(Metal): 64 -> 66.2 t/s for TG	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	f1d9c42f77	iq1_bn(Metal): 60 -> 64 t/s for TG	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	a35330eb5c	iq1_bn: very slightly better Metal dot product	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	d9fb92b710	iq1_bn: Metal now works PP performance is decent (668 t/s vs 724 t/s for q4_0), but TG is kind of low (60 t/s vs 81 t/s for q4_0).	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	0c5a353ebd	iqk_mul_mat(iq1_bn): WIP NEON - don't see why it is not working	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	bf22b701f4	iqk_mul_mat(iq1_bn): WIP NEON (not working)	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	29d9bf65f3	iqk_mul_mat: improve iq1_bn (bitnet) on vanilla AVX2 I now get PP-512 = 270 t/s on the Ryzen-5975WX	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	91ec824f2d	iqk_mul_mat: improve iq1_bn (bitnet) on AVX2 We now get 207 t/s for PP-512 and 51 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	d1c40ff7e2	bitnet: fix scalar dot product I had forgotten to adjust for the change to q8_K64. On the M2 I'm getting 10.8 t/s with the scalar version!	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	4fcfcd05d1	bitnet: scale is per row, not per tensor	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	7f8901dca1	iqk_mul_mat: add iq1_bn (bitnet) We get 174 t/s for PP-512 and 49 t/s for TG-128 using 16 threads.	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	0f53bc30bb	bitnet: CUDA, scalar, AVX2	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	f20b28558b	bitnet: python + llama	2024-06-22 12:02:51 +03:00
Iwan Kawrakow	58756ef03f	iqk_mul_mat: cleanup	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	7501184eb4	iqk_mul_mat: be independent of llamafile_sgemm Verified that it works on AVX2. Also turned on any combination of f16 and f32 (i.e., added f16 x f16 and f32 x f32).	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	ad53eabf87	iqk_mul_mat: be independent of llamafile_sgemm (WIP) * Remove iqk_mul_mat from llamafile_sgemm * Pass tensor types and strides to iqk_mul_mat It is marked WIP because only tested on __aarch64__	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	3593891f39	Fix nb4	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	9593e163db	iqk_mul_mat: add ability to disable it	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	81cf6990f5	iqk_mul_mat: be able to handle any f16/f32 combination on AVX2 But only turning on f16 x f32 and f32 x f16 for now.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	b2acd81c75	iqk_mul_mat: turn on AVX512 It makes no difference on my Ryzen-7950X, but perhaps it will be beneficial for CPUs with real AVX512.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	9e3dc8c432	iqk_mul_mat: slightly better fp16 with 16 vector registers 2x6 (Nx x Ny) tiles instead of 3x4. We get 142.7 t/s on the Ryzen-5975WX up from 138 t/s. We use Nx registers to preload the fp16 weights, so total registers required is Nx * (Ny + 1), so 15 in the case of 3 x 4 tiles and 14 for 2 x 6 tiles. I guess the one spare register helps. But maybe it is just a matter of how things get loaded into the cache. On the 7950X I did try 3 x 8 and it did not perform as well as 5 x 5.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	ae1e77c5de	iqk_mul_mat: better fp16 for AVX2 Basically use what I did for Arm. Improves PP performance to 141.7 t/s up from 136 t/s on the Ryzen-7950X (32 vector registers, so we use 5x5 tiling). This is now 10% faster than tinyBLAS. There is a minor improvement also on the Ryzen-5975WX (16 vector registers, so we use 4x3 tiling): we get 138 t/s up from 136 t/s. tinyBLAS is at 132 t/s.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	9386b49918	iqk_mul_mat: fp16 for Arm ~2% slower than tinyBLAS - not sure why.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	09d86e5876	iqk_mul_mat: slightly faster FANCY_SIMD dot product About 2% faster for q4_K.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	8a80a31ddd	iqk_mul_mat: fix q8_0 I was happily using _mm256_packs_epi32() to pack the q8_0 x q8_0 dot products back to int16_t, and getting useful results. But theoretically this can overflow, so it is better to use _mm256_unpacklo_ and _mm256_unpackhi_ to combine the 4 dot products using int32_t additions. This is (almost) as fast, unlike _mm256_hadd_epi32(), which seems excessively slow on the Ryzen-7950X.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	81409a02f3	iqk_mul_mat: decouple from llamafile also in cmake	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	8b95156e83	iqk_mul_mat: make it build with the Makefile	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	cd3d8ae0e7	iqk_mul_mat: use block_q8_1_x4 also for AVX2 Here the performance gain is more significant. E.g., for q4_1, PP-512 becomes 168 t/s up from 137 t/s. Now the performance gap to q4_0 is so significant that I wonder if I should change to using Q8_1 also for the qX_0 legacy quants.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	299c7f6e89	iqk_mul_mat: use block_q8_0_x4 also for AVX2	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	f0a52f2fbb	iqk_mul_mat: delete unused stuff	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	74b711c8fd	iqk_mul_mat: add q8_0 It was actually ready but not turned on. Having forgotten, I made a new implementation along the lines of the fp16 implementation (i.e., using tiling). That matched tinyBLAS performance. But the existing implementation that I now turned on is faster: PP-512 = 134 t/s vs 128.3 t/s for tinyBLAS; TG-128 = 8.7 t/s vs 8.3 t/s for tinyBLAS (@ 4 threads)	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	29164263f4	iqk_mul_mat: fp16 tweaks Use 4x3 tiling on a real AVX2 CPU (with only 16 vector registers). This works best for the Ryzen-5975WX.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	36c3f57b0a	iqk_mul_mat: fp16 implementation cleanup It turns out on my Ryzen-7950X CPU using AVX512 is slower.	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	bc659e7de1	iqk_mul_mat: fp16 implementation for AVX2 This simple implementation beats jart's tinyBLAS by a small margin (143 t/s vs 137 t/s for PP-512, TG is 4.75 t/s, so exactly the same as ggml).	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	8e072bbba3	iqk_mul_mat: multi-thread quantization also for MoE models	2024-06-22 12:02:50 +03:00
Iwan Kawrakow	667bd4759c	iqk_mul_mat: make it independent of sgemm	2024-06-22 12:02:50 +03:00
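
The register budget described in commit 9e3dc8c432 (Nx registers preload the fp16 weights, so an Nx x Ny tile needs Nx * (Ny + 1) vector registers) can be checked with a few lines of arithmetic. This is only an illustrative sketch: the function names below are hypothetical and are not part of ik_llama.cpp, and the 16/32 register counts are the ones quoted in the commit messages above.

```python
def registers_needed(nx: int, ny: int) -> int:
    """Vector registers for an Nx x Ny tile: Nx * Ny accumulators
    plus Nx registers holding the preloaded fp16 weights."""
    return nx * (ny + 1)


def fits(nx: int, ny: int, available: int) -> bool:
    """True if the tile fits in `available` vector registers."""
    return registers_needed(nx, ny) <= available


if __name__ == "__main__":
    # Tile shapes mentioned in the commit messages (Nx, Ny).
    for nx, ny in [(3, 4), (2, 6), (5, 5), (3, 8)]:
        need = registers_needed(nx, ny)
        print(f"{nx} x {ny}: {need} registers, "
              f"fits in 16: {fits(nx, ny, 16)}, fits in 32: {fits(nx, ny, 32)}")
```

The formula only says whether a tile fits; as the commit message notes, which fitting tile is actually fastest still has to be measured.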


3275 Commits