ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-23 14:44:09 +00:00

Author	SHA1	Message	Date
Kawrakow	28b4229295	Correct spelling in README	2024-07-24 19:22:43 +03:00
Kawrakow	b84d0c1744	Update README.md Adding some more details	2024-07-24 17:38:37 +02:00
Kawrakow	de43999de5	Update README.md Adding MoE and Bitnet performance tables	2024-07-24 16:49:00 +02:00
Kawrakow	cd77618324	Update README.md I hate it when tables look fine in the Preview but then end up with columns split into 2 lines when committed. That's what is happening here, so removed test column from the performance tables.	2024-07-24 11:18:50 +02:00
Kawrakow	4bb58ea8f8	Update README.md Added performance comparison tables	2024-07-24 11:01:16 +02:00
Kawrakow	73b94e5c3f	iqk_mul_mat(NEON): special case for n not divisible by 8 Else fp16 PP performance drops by nearly a factor of 2 compared to what we had before.	2024-07-24 08:04:47 +02:00
Kawrakow	5992d2652b	ggml: thread synchronization on Arm For x86 slaren was generous enough to add _mm_pause() to the busy spin wait loop in ggml_barrier(), but everything else just busy spins, loading an atomic int on every iteration, thus forcing cache sync between the cores. This results in a massive drop in performance on my M2-Max laptop when using 8 threads. The closest approximation to _mm_pause() on Arm seems to be __asm__ __volatile__("isb\n"); After adding this to the busy spin loop, performance for 8 threads recovers back to expected levels.	2024-07-24 08:04:47 +02:00
Kawrakow	005674cecc	Fix "make it work for row sizes that are multiple of 4 on NEON"	2024-07-24 08:04:47 +02:00
Kawrakow	847588cc92	Update README.md	2024-07-23 18:05:05 +02:00
Kawrakow	97680f602c	Update README.md	2024-07-23 12:23:06 +02:00
Kawrakow	8bf126c1d6	When tokenizer info is missing in the model, use llama3 by default	2024-07-19 12:29:01 +03:00
Kawrakow	6a94ca46ad	iqk_mul_mat(f16): make it work for row sizes that are multiple of 4 on NEON Here the performance gain is more modest compared to AVX2: we get PP-512 = 200 t/s up from 190 t/s for iq1_bn-quantized Bitnet-3B running on M2 Max.	2024-07-18 13:55:51 +02:00
Kawrakow	4d1e83f8b8	iqk_mul_mat: attentions matrix multiplications KQ and KQV are n_kv_embed x n_token x n_head matrix multiplications. Before this PR, this meant n_head calls to iqk_mul_mat to perform n_kv_embed x n_token 2D multiplications, each using nth threads. Instead, in this PR, if n_head is a multiple of nth, each thread does n_head/nth multiplications of the n_kv_embed x n_token 2D matrices. This improves PP-512(32 threads) for Bitnet-3B to 433 t/s up from 409 t/s. It is beneficial in other cases too. E.g., for LLaMA-7B, we go to 201 t/s up from 193 t/s for q4_K_S, and to 144 t/s up from 139 t/s for fp16. All these numbers are for the Ryzen-7950X CPU.	2024-07-18 14:00:56 +03:00
Kawrakow	c14a6a6862	iqk_mul_mat(float): make it work for row sizes that are multiple of 4 on AVX2 I was trying to understand where the Bitnet bottleneck is, and at some point noticed the Q*K matrix multiplication where Q and K have the shape of 100 x n_token x 32 x 1. The existing iqk_mul_mat for floats requires that the row size is a multiple of the SIMD vector size (so, 16 on the Ryzen-7950X, 8 on the Ryzen-5975), and hence this matrix multiplication was getting done with ggml. Changing the iqk_mul_mat float kernel to handle row sizes that are a multiple of 4 (via __m128 for the last values in a row) resulted in nearly a 20% performance boost for PP-512 and ~3% for TG-128! If I go to a context of 2048, PP performance increases by nearly 70%!	2024-07-18 11:39:32 +03:00
Kawrakow	d556b1d809	Fix Makefile, add GGML_USE_IQK_MULMAT ifdefs to iqk-quantize	2024-07-17 16:51:34 +03:00
Kawrakow	6f0805a3c7	iq1bn: faster scalar dot product At the end of the day, lookup is still better when not using simd. This scalar dot product version gets us 14.7 t/s on a Ryzen-7950X with 16 threads (up from 10.5 t/s).	2024-07-17 16:09:01 +03:00
Kawrakow	02dc036187	iq1bn: fix scalar dot product The fix makes it faster on the Ryzen-7950X (10.5 t/s vs 8.2 t/s) but slower on the M2 (6.8 t/s vs 8.6 t/s before).	2024-07-17 13:37:18 +03:00
Kawrakow	04decf3fc5	iq1bn: faster AVX2 Instead of shuffling quant data into a 128-bit register containing 8-bit ints, and then converting to 16 bit, we directly shuffle into a 256-bit register containing 16 bit ints. TG-128 @ 2 threads goes from 18.3 to 21.6 t/s. TG-128 performance now saturates already at 8 threads getting 60.4 t/s. There is almost no impact on PP-512 (322 -> 323 t/s). I guess, we amortize dequantization cost pretty well, so we don't gain much there. We get close to 100 GB/s single-threaded float32 throughput: ./bin/test-quantize-perf --op vec_dot_q -i 10000000 --type iq1_bn iq1_bn vec_dot_q 4096 values (0.02 MB) min cycles/32 vals : 3.87 avg cycles/32 vals : 4.40 float32 throughput : 98.27 GB/s quantized throughput : 4.99 GB/s	2024-07-17 10:17:05 +03:00
Kawrakow	2d4fee2312	Remove the no longer used iq1bn_grid_u16	2024-07-17 10:16:50 +03:00
Kawrakow	0194639b6b	iq1bn: adjust scalar dot product and some cleanup	2024-07-17 08:44:46 +02:00
Kawrakow	2881bdf220	iq1bn(no lookup): better version We have 4 groups of 16 in a block of 64 quants. For each group of 16 we have 3 groups of 5, each using 8 bits. The remaining 16'th quants of the 4 groups of 16 are encoded with 8 bits using the same encoding as the groups of 5. The only kernel where we have complications is the CUDA dequantize kernel (because we are dequantizing 8 quants there, and we have different encoding for the 1st and 2nd group of 8 in a group of 16). This achieves better performance on all tested platforms than any previous 1.625 bpw attempt. We have: \| model \| size \| params \| backend \| threads \| test \| t/s \| \| ---------------- \| ---------: \| ---------: \| ---------- \| ------: \| ------------: \| ---------------: \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| CUDA \| 8 \| pp512 \| 9613.02 ± 24.54 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| CUDA \| 8 \| tg128 \| 229.85 ± 0.33 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 16 \| pp512 \| 322.59 ± 1.00 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 16 \| tg128 \| 59.79 ± 0.03 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 8 \| tg128 \| 57.62 ± 0.21 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 4 \| tg128 \| 33.66 ± 0.29 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| AVX2 \| 2 \| tg128 \| 18.30 ± 0.01 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| Metal \| 8 \| pp512 \| 698.13 ± 0.21 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| Metal \| 8 \| tg128 \| 68.88 ± 0.24 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 8 \| pp512 \| 196.80 ± 0.50 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 8 \| tg128 \| 51.58 ± 0.41 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 4 \| tg128 \| 30.80 ± 0.03 \| \| 1.625 bpw Bitnet \| 729.64 MiB \| 3.32 B \| NEON \| 2 \| tg128 \| 16.89 ± 0.01 \| It is still slower than 2 bpw Bitnet, but the difference now is not as dramatic.	2024-07-17 08:54:11 +03:00
Kawrakow	d84748b71b	iq1bn(no lookup): Metal In summary, compared to lookup, the multiplication based approach is * Much better on AVX2 * Slightly better on CUDA * Slightly worse on Metal * Much worse on NEON	2024-07-16 09:12:15 +02:00
Kawrakow	d0f9d146b8	iq1bn(no lookup): NEON attempts We are at TG-128 = 25.7 t/s, which is quite a bit worse than lookup.	2024-07-16 08:32:15 +02:00
Kawrakow	597ea12970	iq1bn(no lookup): NEON Pretty bad.	2024-07-15 20:40:14 +02:00
Kawrakow	cd8fffc3cd	iq1bn(no lookup): CUDA Not good. We only get ~160 t/s.	2024-07-15 19:56:51 +03:00
Kawrakow	1f3dbbcc19	iq1bn(no lookup): somewhat better We now have for Bitnet-3B: \| threads \| test \| t/s \| \| ------: \| ------------: \| ---------------: \| \| 16 \| pp512 \| 308.97 ± 1.89 \| \| 16 \| tg128 \| 58.80 ± 0.07 \| \| 8 \| tg128 \| 49.79 ± 1.23 \| \| 4 \| tg128 \| 28.85 ± 0.02 \| \| 2 \| tg128 \| 15.39 ± 0.01 \|	2024-07-15 13:46:07 +03:00
Kawrakow	98be184c23	iq1bn: attempt without a lookup table	2024-07-15 11:02:41 +03:00
Kawrakow	43f4c58376	Remove all workflows	2024-06-27 09:45:56 +03:00
Kawrakow	aaec3c1f60	imatrix: be able to specify the name of the output tensor For some models the same tensor is used for token embeddings and output. This tensor tends to be named token_embedding.weight rather than output.weight, which prevents us from collecting imatrix data for this tensor. With this commit we can tell the name of the output tensor to the imatrix tool.	2024-06-26 17:38:18 +03:00
Kawrakow	be36ca872f	bitnet: fold V scale into rms_norm	2024-06-26 12:05:57 +02:00
Kawrakow	6467358fd4	RoPE(Neox, Metal): don't use power functions in a loop Speeds up Bitnet by ~2% on Metal.	2024-06-26 11:22:47 +02:00
Kawrakow	d280bf30c4	Typo	2024-06-25 19:17:14 +03:00
Kawrakow	9918542658	bitnet: remove iq1_bn lookup table storing +/- signs The AVX2 implementation was the only one left using it, so I decided to see if we can get a performant implementation using the 0,1,2 lookup table. Turns out we can, and it is even slightly faster than the sign based table. We now get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads on the Ryzen-7950X. With only one lookup table left for iq1_bn, I renamed it to iq1bn_grid_u16.	2024-06-25 18:19:11 +03:00
Kawrakow	12e97f1f1f	bitnet: simdify q8_K64 quantization on AVX Doesn't make a real difference in performance.	2024-06-25 17:20:34 +03:00
Kawrakow	cb12b6f253	bitnet: NEON improvements for iq1_bn With these changes we get to TG-128 = 34 t/s, PP-512 = 153 t/s.	2024-06-25 13:48:29 +02:00
Kawrakow	636dbd03c5	bitnet: remove the now unused iq1bn_grid_u16	2024-06-25 12:41:43 +02:00
Kawrakow	cd2f60c89a	Bitnet: adapt NEON and Metal to the alternative grid	2024-06-25 11:16:13 +02:00
Kawrakow	ef16135920	Bitnet: trying an alternative iq1_bn grid Faster on CUDA. The scalar version is faster too. The issue with CUDA is that now I see wild performance fluctuations. Running llama-bench I can get 220 t/s for TG-128 one time, and 190 t/s another time, with uncertainties of 1-2 t/s. Same for PP, results are jumping back-and-forth between ~9500 t/s and ~8900 t/s. So, basically no reliable measurement at this point, but for sure faster than the previous version, which was at around 170-180 t/s.	2024-06-25 11:32:48 +03:00
Kawrakow	90a6071a93	bitnet: fix scalar dot product for 1.625 bpw I had not adjusted after going to 4 q8 scales per row.	2024-06-25 08:31:12 +02:00
Kawrakow	ee6565fdeb	Bitnet: slightly faster 1.625 bpw variant for AVX512VL	2024-06-25 08:33:00 +03:00
Kawrakow	8542b4f359	Bitnet: tiny bit faster 1.625 bpw variant on Metal We get 70.7 t/s for TG-128 vs 69.5 t/s before.	2024-06-24 16:42:30 +02:00
Kawrakow	f2a82090df	Adding add_4, mul_4, div_4 kernels to Metal This gives ~2% speedup for Bitnet on Metal	2024-06-24 10:22:10 +02:00
Kawrakow	c9ddaf2fa3	bitnet: qnfs tests Q8_0 fails because as per design the reference quantization is different from the vecdot quantization.	2024-06-22 12:02:53 +03:00
Kawrakow	b1fb7df6a5	bitnet: replace ggml_mul with ggml_scale to apply the scales Also save one scale operation in the ffn network by adjusting rms_eps. We gain up to 3% in performance by doing this, but it is a bit of a hack (we store the tensor scales in op_params while loading the model).	2024-06-22 12:02:52 +03:00
Kawrakow	0fe0d54be6	iqk_mul_mat: add IQ4_NL also on NEON PPL seems somewhat higher? For llama-v2-7B we are still ~0.04 higher compared to what we expect after ~30 batches.	2024-06-22 12:02:52 +03:00
Kawrakow	32ec107237	iqk_mul_mat: add IQ4_NL I never use it, so I had completely forgotten about it.	2024-06-22 12:02:52 +03:00
Kawrakow	912d6d9ce1	bitnet(scale in a separate tensor): CPU tweaks A somewhat nicer iq2_bn implementation on AVX2.	2024-06-22 12:02:52 +03:00
Kawrakow	f53d89dd53	bitnet(scale in a separate tensor): CPU tweaks I had ruined TG performance on AVX2 with the last commit. Was just testing at 8 threads and there we are totally memory bound. But at 4 threads we had regressed to 41 t/s on the Ryzen7950. Back to 51 t/s with this commit.	2024-06-22 12:02:52 +03:00
Kawrakow	52ad5764dd	bitnet(scale in a separate tensor): more CPU improvements It seems it is enough to have 4 scales per row for Q8. I get PPL = 8.5470 with this, which is slightly higher than the 8.5430 we get with 1 scale per 128 activations, but still OK, I think. With this, we get the following performance: \| System \| quant \| PP-512 \| TG-128 \| quant \| PP-512 \| TG-128 \| \| M2 Max \| iq2bn \| 229.02 ± 0.37 \| 78.75 ± 0.61 \| iq1bn \| 146.67 ± 2.85 \| 33.12 ± 0.03 \| \| Ryzen7950 \| iq2bn \| 379.36 ± 1.03 \| 49.08 ± 0.18 \| iq1bn \| 247.12 ± 1.53 \| 32.80 ± 0.02 \| \| Ryzen5975 \| iq2bn \| 465.28 ± 0.57 \| 39.17 ± 0.02 \| iq1bn \| 325.86 ± 0.46 \| 26.60 ± 0.10 \|	2024-06-22 12:02:52 +03:00
Kawrakow	167489ef6c	bitnet(scale in a separate tensor): CPU improvements Arrange Q8 quants in blocks of 128 and adapt iqk_mul_mat to deal with that. This improves PP speed by a few percent.	2024-06-22 12:02:52 +03:00
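
The 1.625 bpw figure follows from the block layout described in commit 2881bdf220. Five ternary values have 3^5 = 243 combinations, which fit in one byte. A block of 64 quants then needs 4 x 3 bytes for the groups of five plus one byte for the four leftover 16th quants: 13 bytes = 104 bits, and 104 / 64 = 1.625 bits per weight. The sketch below only illustrates that arithmetic. It assumes the ternary values are stored as 0, 1, 2 and uses plain base-3 packing; the byte encoding actually used in the repository may differ, and `pack5`, `unpack5`, `pack_block` and `unpack_block` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

constexpr int kBlockSize = 64;      // quants per block
constexpr int kBytesPerBlock = 13;  // 4 groups * 3 bytes + 1 byte for the 16th quants

// 13 bytes * 8 bits / 64 quants = 1.625 bits per weight.
static_assert(kBytesPerBlock * 8 * 1000 / kBlockSize == 1625, "1.625 bpw");

// Pack n <= 5 ternary digits (each 0, 1 or 2) into one byte as a base-3 number.
// Assumption: plain base-3, not necessarily the repository's encoding.
inline uint8_t pack5(const uint8_t* q, int n) {
    int v = 0;
    for (int i = n - 1; i >= 0; --i) {
        assert(q[i] < 3);
        v = v * 3 + q[i];
    }
    return static_cast<uint8_t>(v);  // at most 242, so it fits in 8 bits
}

// Inverse of pack5: recover n ternary digits from one byte.
inline void unpack5(uint8_t v, uint8_t* q, int n) {
    for (int i = 0; i < n; ++i) {
        q[i] = v % 3;
        v /= 3;
    }
}

// Pack 64 ternary quants: per group of 16, three bytes hold quants 0..14;
// the 16th quants of the four groups share the final byte.
std::array<uint8_t, kBytesPerBlock> pack_block(const std::array<uint8_t, kBlockSize>& q) {
    std::array<uint8_t, kBytesPerBlock> out{};
    uint8_t last[4];
    for (int g = 0; g < 4; ++g) {
        const uint8_t* grp = q.data() + 16 * g;
        for (int k = 0; k < 3; ++k) out[3 * g + k] = pack5(grp + 5 * k, 5);
        last[g] = grp[15];
    }
    out[12] = pack5(last, 4);
    return out;
}

std::array<uint8_t, kBlockSize> unpack_block(const std::array<uint8_t, kBytesPerBlock>& in) {
    std::array<uint8_t, kBlockSize> q{};
    uint8_t last[4];
    unpack5(in[12], last, 4);
    for (int g = 0; g < 4; ++g) {
        uint8_t* grp = q.data() + 16 * g;
        for (int k = 0; k < 3; ++k) unpack5(in[3 * g + k], grp + 5 * k, 5);
        grp[15] = last[g];
    }
    return q;
}
```

Packing any block of 64 values in {0, 1, 2} with `pack_block` and passing the result to `unpack_block` should return the original block.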

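The spin-wait change described in commit 5992d2652b can be sketched as a portable "relax" hint placed inside a busy-wait loop. This is a simplified illustration and not the ggml_barrier() code: `cpu_relax` and `SpinBarrier` are hypothetical names, and the barrier is a generic generation-counting spin barrier. Only `_mm_pause()` on x86 and the `isb` instruction on Arm come from the commit message.

```cpp
#include <atomic>
#if defined(__x86_64__) || defined(_M_X64)
#include <immintrin.h>
#endif

// Hint to the CPU that we are in a spin-wait loop.
inline void cpu_relax() {
#if defined(__x86_64__) || defined(_M_X64)
    _mm_pause();
#elif defined(__aarch64__)
    __asm__ __volatile__("isb\n");  // closest approximation to _mm_pause() per the commit
#endif
    // Other targets: no hint, plain busy spin.
}

// Hypothetical spin barrier for a fixed number of threads.
struct SpinBarrier {
    std::atomic<int> arrived{0};
    std::atomic<int> generation{0};
    const int n_threads;

    explicit SpinBarrier(int n) : n_threads(n) {}

    void wait() {
        const int gen = generation.load();
        if (arrived.fetch_add(1) + 1 == n_threads) {
            // Last thread to arrive: reset the counter, then release the waiters.
            arrived.store(0);
            generation.fetch_add(1);
        } else {
            // Without cpu_relax() this loop reloads the atomic on every iteration.
            while (generation.load() == gen) cpu_relax();
        }
    }
};
```

Each worker thread would call `wait()` on a shared `SpinBarrier` constructed with the thread count; the call returns once all threads have arrived.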

3337 Commits