Commit Graph

3413 Commits

Author SHA1 Message Date
Kawrakow
d5aa49b93b Adding fused rms_norm (#42)
* Fused rms_norm: works on the CPU

* Fused rms_norm WIP

* Fused rms_norm WIP

* Fused rms_norm WIP

* Fused rms_norm WIP

* Fused rms_norm WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-08 10:19:21 +03:00
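For context, "fused rms_norm" presumably combines the RMS normalization with the subsequent multiplication by the norm weights in a single pass over each row, instead of two separate ggml ops. A minimal scalar sketch of that idea (illustrative, not the actual implementation):

    #include <cmath>
    #include <cstddef>

    // Normalize a row by its RMS and apply the per-channel norm weights in one
    // sweep, instead of a separate rms_norm op followed by a multiply.
    static void fused_rms_norm(const float * x, const float * w, float * y,
                               size_t n, float eps = 1e-6f) {
        double sum = 0.0;
        for (size_t i = 0; i < n; ++i) sum += (double)x[i] * x[i];
        const float scale = 1.0f / std::sqrt((float)(sum / n) + eps);
        for (size_t i = 0; i < n; ++i) y[i] = w[i] * (x[i] * scale);
    }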
Kawrakow
18f5bb47d8 Add support for bf16 to iqk_mul_mat (#39)
* WIP: adding BF16 support to iqk_mul_mat

* Minor

* Improve TG speed (when not memory bound)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05 07:48:27 +03:00
Kawrakow
02e4cc0c18 Zen4 Flash Attention - bf16 support (#38)
* Zen4 Flash Attention: WIP bf16

* Zen4 Flash Attention: bf16 seems to be working

* Zen4 Flash Attention: improving bf16

* Zen4 Flash Attention: improving bf16

It is better (slightly faster) to first convert Q
to bf16 before processing each block of q_step rows.
This requires D*q_step*sizeof(bf16) bytes, which is at
most 4 kB for the head sizes we support, so we can
just allocate the buffer on the stack instead of reserving
and passing a work buffer in ggml.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-05 07:46:47 +03:00
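The stack-buffer trick from #38 above, sketched in scalar form. The conversion helper and the D_MAX and Q_STEP constants are assumptions chosen to match the "at most 4 kB" bound from the commit message (256 * 8 * 2 bytes = 4096), not the actual iqk code:

    #include <cstdint>
    #include <cstring>

    constexpr int D_MAX  = 256;  // assumed largest supported head size
    constexpr int Q_STEP = 8;    // assumed number of Q rows processed per block

    // Round-to-nearest-even f32 -> bf16 conversion (NaN handling omitted).
    static inline uint16_t f32_to_bf16(float f) {
        uint32_t u; std::memcpy(&u, &f, sizeof(u));
        return (uint16_t)((u + 0x7fff + ((u >> 16) & 1)) >> 16);
    }

    // Convert one block of Q rows into a small stack buffer before processing,
    // so no ggml work buffer needs to be reserved and passed around.
    static void process_q_block(const float * q, int D, int n_rows) {
        uint16_t q_bf16[D_MAX * Q_STEP];            // <= 4 kB, lives on the stack
        for (int r = 0; r < n_rows; ++r)
            for (int i = 0; i < D; ++i)
                q_bf16[r*D + i] = f32_to_bf16(q[r*D + i]);
        // ... multiply q_bf16 against the bf16 K/V blocks here ...
    }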
Kawrakow
d47e1c63b3 Performance improvements for legacy quants on ARM_NEON (#37)
* WIP: trying to improve legacy quants

* WIP: trying to improve legacy quants

With this commit PP-512 for LlaMA-3.1-8B goes from
72 t/s to 87.2 t/s for q4_0, and from 61.5 t/s to 73.9 t/s
for q4_1, so 20+% improvement for both.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-04 07:24:04 +03:00
Kawrakow
0449090ae8 Zen4 Flash Attention 2 (#36)
* Zen4 Flash Attention: WIP generalize to other types

Now loading of data from K and V is done via a template parameter,
so this should make it easy to generalize to types other than
F16 for the K and V cache.

* Zen4 Flash Attention: it works for q4_0 and q8_0

* Zen4 Flash Attention: small q8_0 performance improvement

* Zen4 Flash Attention: add q4_1

* Delete unused stuff

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-04 07:20:55 +03:00
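On the template parameter mentioned in #36: the point is that the flash-attention kernel body stays the same while a small loader type handles how K/V rows are decoded from their storage format (f16, q8_0, q4_0, q4_1, ...). A hypothetical illustration of that pattern, with made-up names:

    #include <cstddef>
    #include <cstdint>
    #include <cstring>
    #include <vector>

    struct LoaderF32 {              // trivial loader: data already stored as f32
        const float * data; int D;
        void load_row(int row, float * dst) const {
            std::memcpy(dst, data + (size_t)row * D, D * sizeof(float));
        }
    };

    struct LoaderQ8 {               // e.g. 8-bit rows with one scale per row
        const int8_t * q; const float * scale; int D;
        void load_row(int row, float * dst) const {
            const int8_t * p = q + (size_t)row * D;
            for (int i = 0; i < D; ++i) dst[i] = scale[row] * p[i];
        }
    };

    // The kernel only sees load_row(); switching the K/V cache type means
    // switching the template argument, not rewriting the kernel.
    template <typename Loader>
    float dot_q_with_k_row(const float * q_row, const Loader & K, int row, int D) {
        std::vector<float> k(D);
        K.load_row(row, k.data());
        float sum = 0.0f;
        for (int i = 0; i < D; ++i) sum += q_row[i] * k[i];
        return sum;
    }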
Kawrakow
724854e7db Fix Zen4 Flash Attention (#35)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-02 15:54:24 +03:00
Kawrakow
2db35edf71 Do not process prompts containing binary data for escapes (#33)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-02 09:18:48 +03:00
Kawrakow
59c2e77869 Zen4 Flash Attention (#32)
* Zen4 flash attention: moving useful parts from the kq_fused_softmax branch

* Add flash attention with soft-cap and fix D = 256 case

* Flash attention refinements

* Update FlashAttn comment

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-01 16:08:21 +03:00
Kawrakow
5ff997021e Fix build when iqk_mul_mat is disabled (#31)
Ref #29

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-31 09:11:42 +03:00
Kawrakow
3f7899c250 Faster Gemma2 (#27)
* soft_cap_max: initial CPU version of fused softcap + soft_max

With this vanilla CPU implementation I'm already getting a ~3% speedup
for Gemma-2-9b and a prompt of 8192 tokens.

* soft_cap_max: WIP - something is wrong with CUDA

* soft_cap_max: looks good on CPU and CUDA

* Add softcap to flash attention

Just CPU and CUDA for now (but, as we know, flash attention
on the CPU is useless in llama.cpp).

On CUDA this improves PP performance quite a bit, especially for
long contexts. E.g., for PP-16384, I now get 3777 t/s.
Without this change, one cannot use FA, and one gets 2300 t/s
(after fusing softcap and softmax), or 2000 t/s without the
fused softcap+softmax.

In comparison, mainline llama.cpp has PP-16384 = 1549 t/s before
PR-8542 (where Johannes Gaessler has also added softcap to FA),
and PP-16384 = 3097 t/s after this PR.

* soft_cap_max: Metal

* Flash attention with softcap: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-27 17:40:59 +03:00
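What "fused softcap + soft_max" boils down to on the CPU: instead of running scale, tanh, scale, and soft_max as separate ops that each re-read the K*Q matrix, the cap and the max/exp/normalize steps are done in one pass over each row. A minimal scalar sketch (illustrative only, not the actual ggml code):

    #include <cmath>
    #include <cstddef>

    // One pass over a row of logits: soft-cap with s_after * tanh(s_before * x)
    // (for Gemma-2 style capping, s_before = 1/cap and s_after = cap) and then
    // softmax the capped values in place.
    static void softcap_softmax_row(const float * x, float * y, size_t n,
                                    float s_before, float s_after) {
        float maxv = -INFINITY;
        for (size_t i = 0; i < n; ++i) {
            y[i] = s_after * std::tanh(s_before * x[i]);
            if (y[i] > maxv) maxv = y[i];
        }
        float sum = 0.0f;
        for (size_t i = 0; i < n; ++i) { y[i] = std::exp(y[i] - maxv); sum += y[i]; }
        const float inv = 1.0f / sum;
        for (size_t i = 0; i < n; ++i) y[i] *= inv;
    }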
Kawrakow
1268ef9430 softcap: minor improvement (#24)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-21 13:00:09 +03:00
Kawrakow
8a10467990 Fused soft cap and SIMD-ified GeLU (#9)
* Softcap: WIP

Fuses scale + tanh + scale as used for softcapping in some
models.

Just CPU for now. ~1.4% speedup for PP-512 on Gemma2-9b, no effect on TG.

Somewhat surprisingly the improvement does not increase as I
go to longer contexts. Gemma2 does softcap on K*Q, which grows
quadratically with context length, so I would have thought
the benefit from fusing scale, tanh, scale would increase.
But no, no luck.

* softcap: CUDA

* softcap: CUDA

~1% speedup for Gemma2-9b

* softcap: Metal and NEON

About 1% speedup.

* Simdified gelu

Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2.
It looks like the gelu operation is memory bound on my CPUs
after SIMD-ifying it. By not using the 128 kB gelu lookup table
we gain a small advantage.
On the M2-Max the lookup table is slightly faster than the SIMD
version, so I left the lookup table in place for ARM_NEON.

* softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512)

Not that I have encountered this in practice, but just to be sure.
This does it for AVX512 and AVX2, still need a guard for ARM_NEON.

* llama-bench: add ability to turn off warmup runs

So we don't need to wait forever on, e.g., benchmarks involving
long contexts.

* softcap, tanh: avoid NaNs for large arguments (NEON)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-20 17:15:47 +03:00
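The SIMD-ified GeLU mentioned above replaces the 128 kB lookup table with direct evaluation of the usual tanh approximation; the scalar form of what gets vectorized looks roughly like this (sketch):

    #include <cmath>
    #include <cstddef>

    // GeLU via the standard tanh approximation; a SIMD version performs the same
    // arithmetic on 8 (AVX2) or 16 (AVX512) floats per iteration instead of
    // looking each value up in the 128 kB table.
    static inline float gelu_tanh(float x) {
        const float c = 0.79788456080286535588f;   // sqrt(2/pi)
        return 0.5f * x * (1.0f + std::tanh(c * (x + 0.044715f * x * x * x)));
    }

    static void gelu_row(const float * x, float * y, size_t n) {
        for (size_t i = 0; i < n; ++i) y[i] = gelu_tanh(x[i]);
    }

The NaN guard in the later commits presumably addresses the vectorized tanh path, where computing tanh through exp(2x) overflows for large |x|; clamping the argument (or returning ±1 beyond a threshold) avoids that.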
Kawrakow
38dcba95fe iq4_k: use iq5_k also when n_gqa = 2 (#23)
This improves size vs quality balance for Gemma-2 models.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-20 17:15:06 +03:00
Kawrakow
0a3b725e60 AVX2 quantization for Q8_K (#22)
It has been there for a while, but I forgot to add it here.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 15:33:27 +03:00
Kawrakow
fad55b735e quantize_stats: print rmse and max error as fraction of <x> (#21)
This allows for a better comparison between different models
or different tensors of the same model where the magnitude of
the model weights may differ.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 13:49:28 +03:00
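A small sketch of the relative-error idea, assuming <x> denotes the mean absolute value of the reference weights (names here are illustrative, not the quantize_stats code):

    #include <cmath>
    #include <cstddef>

    struct ErrStats { double rmse, max_err, mean_abs; };

    // Errors reported as rmse/mean_abs and max_err/mean_abs are comparable
    // across tensors and models whose weight magnitudes differ.
    static ErrStats error_stats(const float * x, const float * x_q, size_t n) {
        double se = 0.0, maxe = 0.0, sabs = 0.0;
        for (size_t i = 0; i < n; ++i) {
            const double e = (double)x[i] - x_q[i];
            se   += e * e;
            sabs += std::fabs(x[i]);
            if (std::fabs(e) > maxe) maxe = std::fabs(e);
        }
        return { std::sqrt(se / n), maxe, sabs / n };
    }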
Kawrakow
041d79925c iq2_k: slightly better bpw - accuracy compromise (#20)
For LLaMA-3.1 models:
* It is better to quantize all of attn_v with iq3_k instead of
  half of attn_v with iq4_k
* Quantizing attn_output with iq3_k results in a larger PPL decrease
  compared to what one expects from the added bpw.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-19 13:36:51 +03:00
Kawrakow
a58853bf5e Skip barriers of noops (#19)
GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE,
along with GGML_OP_NONE, are all noops, i.e., nothing happens.
But ggml still has a barrier after them, which wastes time.
The waste is not too bad for large models where computations are
long compared to the time taken for thread synchronization.
But for small models skipping those unnecessary waits makes
a significant difference. E.g., for the 99M TriLM model,
TG-500 goes up to 1426 t/s from 1240 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-14 10:40:09 +02:00
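The gist of #19 in code form: check whether a graph node's op actually does any work before paying for a thread barrier. A sketch of the kind of check involved (the real ggml compute loop differs in detail):

    #include "ggml.h"

    // View-creating ops do no work in the compute pass, so there is nothing the
    // other threads need to wait for after them.
    static bool op_is_noop(enum ggml_op op) {
        switch (op) {
            case GGML_OP_NONE:
            case GGML_OP_RESHAPE:
            case GGML_OP_VIEW:
            case GGML_OP_PERMUTE:
            case GGML_OP_TRANSPOSE:
                return true;
            default:
                return false;
        }
    }

    // In the per-thread graph loop (pseudocode):
    //     compute_node(node);
    //     if (!op_is_noop(node->op)) barrier(shared_state);  // skip sync for noops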
Kawrakow
25ade24526 Update README.md 2024-08-12 15:16:00 +02:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
5ed6d94cb5 Fix Makefile
I always use cmake, so I had forgotten to pay attention to the
Makefile.
2024-08-09 16:31:04 +02:00
Kawrakow
af2bb96de5 Fix Zen4 implementation of iq3_k, iq4_k, iq5_k
See comments in f3a823ce72
2024-08-09 16:00:31 +02:00
Kawrakow
3f67708b91 iq6_k: AVX2 2024-08-09 16:00:31 +02:00
Kawrakow
fa668c7dcb iq6_k: Metal
About 4% slower than Q6_K for PP-512, but 10% faster for TG-128.
Someone has screwed up Q6_K TG performance on Metal? With the
continuous "improvements" in ggml I wouldn't be surprised.
Need to look into it later.
2024-08-09 16:00:31 +02:00
Kawrakow
ed462a512a iq6_k: NEON
Respectable performance, only slightly slower than Q6_K.
2024-08-09 16:00:31 +02:00
Kawrakow
ef32a01c2a iq6_k: slightly better Zen4 iqk_mul_mat
We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B.
TG-128 is 9.5 t/s. This is better than last commit,
but still kind of slow compared to Q6_K.

My last commit message is wrong: iq3_k also needs a fix
for overflow.
2024-08-09 16:00:31 +02:00
Kawrakow
0bee1c0c0a iq6_k: Zen4 iqk_mul_mat
We need to do 4 shuffles to get the non-uniform values, so this
makes it slower than other iqX_k quants.

And then I realized that I was using the standard Zen4 template for
all iqX_k quants. The standard template converts the 32-bit integers
obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies
with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and
iq6_k. I guess I did not notice with iq4_k and iq5_k because the
PPL difference to CUDA was relatively small, and I attributed it to
Q8_K not being accurate enough for the activations. But for iq6_k
the PPL difference was much too big to be attributable to Q8_K
inaccuracies, so that's when I realized that I cannot be packing
the _mm512_dpbusds_epi32 result into 16 bits for 4-, 5-, and 6-bit iqX_k
quants.

For now I fixed it for iq6_k, but the outcome is that it is
significantly slower than Q6_K: I get PP-512 = 125 t/s for
LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better
approach.
2024-08-09 16:00:31 +02:00
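For intuition on the overflow: with sub-blocks of, say, 32 weights, the integer dot product of unsigned 6-bit quants with int8 activations can reach roughly 32 * 63 * 127 ≈ 256k, far outside the ±32767 range of a 16-bit lane, which is why the packed-to-16-bit path saturates for the 4-, 5- and 6-bit iqX_k types. A scalar illustration of keeping the accumulation in 32 bits (block size and offset handling are assumptions, not the exact format):

    #include <cstdint>

    // Dot product of one sub-block of unsigned 6-bit quants (values 0..63, with
    // an implicit offset) against int8 activations. The 32-bit accumulators are
    // the point: a 16-bit intermediate would saturate for 4/5/6-bit quants.
    static float block_dot_q6(const uint8_t * q, const int8_t * a, int n,
                              float d /* block scale */, int offset) {
        int32_t sum_qa = 0, sum_a = 0;
        for (int i = 0; i < n; ++i) {
            sum_qa += (int32_t)q[i] * a[i];
            sum_a  += a[i];
        }
        // e.g. n = 32: |sum_qa| can reach 32*63*127 = 256032, so keep 32 bits
        // and apply the block scale only after the accumulation.
        return d * (float)(sum_qa - offset * sum_a);
    }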
Kawrakow
1593acd09a iq6_k: CUDA dot product
90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.
2024-08-09 16:00:31 +02:00
Kawrakow
4fda827258 iq6_k: CUDA dequantize
We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K
(0.14% vs 0.26% quantization error).
2024-08-09 16:00:31 +02:00
Kawrakow
4b2c94618f iq6_k: WIP (quantize/dequantize) 2024-08-09 16:00:31 +02:00
Kawrakow
81266c22d6 iq6_k: WIP (nothing works) 2024-08-09 16:00:31 +02:00
Kawrakow
58a323f585 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM-specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and the 3.3B Bitnet model.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s; it seems we still have something
that is not quite optimal in iq2_tn.

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
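On the 2.0625 bpw figure: it is consistent with 2 bits per weight plus one f16 scale per block of 256 weights (64 + 2 = 66 bytes for 256 weights, and 66*8/256 = 2.0625). A hypothetical sketch of such a ternary block and its dequantization (layout and names are guesses for illustration, not the actual iq2_tn format):

    #include <cstdint>

    constexpr int QK_TN = 256;            // assumed block size

    struct block_tn {
        float   d;                        // block scale (f16 in a real 2.0625 bpw layout)
        uint8_t qs[QK_TN / 4];            // 4 weights per byte, 2 bits each, codes 0/1/2
    };

    // Decode codes 0/1/2 to ternary values -1/0/+1 and apply the block scale.
    static void dequantize_tn(const block_tn & b, float * y) {
        for (int i = 0; i < QK_TN; ++i) {
            const int code = (b.qs[i / 4] >> (2 * (i % 4))) & 3;
            y[i] = b.d * (float)(code - 1);
        }
    }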
Kawrakow
695c7eef49 q2_K: allow it to detect ternary nets and quantize accordingly 2024-08-05 11:39:10 +02:00
Kawrakow
74f2f50abf Update README.md
There have been a few minor improvements here and there, so I updated the AVX2 Bitnet performance values to the current main branch.
2024-08-05 07:35:30 +02:00
Kawrakow
8db9c5b1fd iq3_k, iq5_k: faster quantization
Just use the same trick as iq4_k
2024-08-05 07:18:18 +02:00
Kawrakow
5b06d7999c iq4_k: speedup quantization by a factor of ~2 2024-08-03 18:38:39 +02:00
Kawrakow
2af0d6fbac Add copyright notice 2024-08-01 09:38:06 +02:00
Kawrakow
904fdbcfb7 iq2/3_k: tiny bit faster Metal dot products 2024-08-01 09:38:06 +02:00
Kawrakow
088a8360a1 iq3_k: slightly faster Metal dequantize kernel
PP-512 goes to 473 t/s up from 452 t/s.
2024-08-01 09:38:06 +02:00
Kawrakow
606f02ae89 iq3_k: Metal dot product
Quite slow: 43 t/s for a 7B model
2024-08-01 09:38:06 +02:00
Kawrakow
95a6820d79 iq2_k: Metal dot product finally works
It is slow: 45.4 t/s for a 7B model vs 50 t/s for iq2_xs,
or 63.3 t/s for q2_K_S.
2024-08-01 09:38:06 +02:00
Kawrakow
033299c9f9 iq3_k: Metal dequantize 2024-08-01 09:38:06 +02:00
Kawrakow
2927d4f841 iq3_k: NEON 2024-08-01 09:38:06 +02:00
Kawrakow
9c1eea6048 iq3_k: AVX2 iqk_mul_mat
We get PP-512 = 196 t/s for LLaMA-3.1-8B on the Ryzen-5975WX.
2024-08-01 09:38:06 +02:00
Kawrakow
a9fa3b1563 iq3_k: AVX512 iqk_mul_mat
We get PP-512 = 180 t/s, TG-128(4 threads) = 16.35 on the Ryzen-7950X
for LLaMA-3.1-8B.
In comparison, iq3_s has PP-512 = 96 t/s, TG-128 = 7.6 t/s with
iqk_mul_mat, and PP-512 = 28 t/s, TG-128 = 6.8 t/s in mainline llama.cpp
2024-08-01 09:38:06 +02:00
Kawrakow
a4371b7842 iq3_k: faster CUDA dot product
138 t/s for LLaMA-3.1-8B, which is almost on par with iq3_s.
2024-08-01 09:38:06 +02:00
Kawrakow
81f15c0ba8 iq3_k: CUDA dot product
Slightly slower than iq3_s - 132 t/s vs 138 t/s for
LLaMA-3.1-8B.
2024-08-01 09:38:06 +02:00
Kawrakow
fb4cff3458 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Kawrakow
7dcd64c9bd iq2_k: very slightly better CUDA dot product
169.2 t/s vs 167.8 t/s before.
2024-08-01 09:38:06 +02:00
Kawrakow
0c1d7383a5 iq2_k: better CUDA dot product
Almost on par with iq2_xs (168 t/s vs 172 t/s).
2024-08-01 09:38:06 +02:00
Kawrakow
f30bcc1e17 iq2_k: CUDA dot product finally works
Performance is pathetic: 140 t/s for LLaMA-3.1-8B vs
172 t/s for iq2_xs.
2024-08-01 09:38:06 +02:00