ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-02 04:29:53 +00:00

Author	SHA1	Message	Date
Kawrakow	12bbdb8ce7	Fix compiler warnings (#58 ) * Fix C++ compilation warnings caused by ggml-common.h * Disable c99-extensions warning I get tons of those on macOS due to the arm_neon.h header. * Disable c99-extensions warning only for APPLE * Fix warnings in iqk_quantize.cpp Also add GGML_ABORT when implementation is missing. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-17 14:31:29 +03:00
Kawrakow	4ee889f158	BF16 support on Metal (#56 ) * BF16 support on Metal * Faster BF16 Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-17 10:54:42 +03:00
Kawrakow	2874b98400	iqk_mul_mat(ARM_NEON): adding bf16 support (#41 ) It looks like ArmV8 ISA has support for bf16, but my M2 Max does not have it, so resorting to bf16 -> f32 conversion and computations in f32. This is 2x slower than f16, but 8x better compared to what I get if I try to run a bf16 model on the M2 (NEON and Metal). Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-16 16:47:36 +03:00
Iwan Kawrakow	20f3e6fd2d	Minor	2024-09-15 12:59:14 +03:00
Kawrakow	6f11c95994	Adding bf16 support to CUDA (#40 ) * Adding bf16 support to CUDA - matrix multiplications * Adding bf16 support to CUDA - cleanup * Adapt to latest master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-14 20:02:32 +03:00
Kawrakow	76be98fdec	Improve Q5_0 performance (#55 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-14 19:47:26 +03:00
Kawrakow	064b99365c	Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-14 13:53:50 +03:00
Kawrakow	43b934b19f	Quantization mixes tweaks (#53 ) * Some tweaks for i-quants Improve Gemma2 PPL while reducing size * Some tweaks for iq2_k and iq3_k --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-14 10:29:44 +03:00
Iwan Kawrakow	ec1cbc8884	Minor	2024-09-13 15:46:36 +03:00
Kawrakow	f853f6c6a5	Fix bug and D < 128 case for Q8_0 k-cache (#52 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-13 07:19:47 +03:00
Kawrakow	5017f8b3f0	Quantized Flash Attention for all supported CPU platforms (#51 ) * NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1 * NEON Flash Attention: quantized KQ for q4_0 I could finally take advantage of the matrix multiplication templates. We get quite a bit of speedup that way for q4_0: For Gemma-2b using mul_mat_qX_0_q8_0<DequantizerQ40, q_step> results in PP-2048 = 287 t/s vs 268 t/s when converting the q4_0 k-cache and Q to fp16 and using fp16 multiplication. NEON Flash Attention: quantized KQ for q4_1 NEON Flash Attention: quantized KQ for q8_0 This makes quite a bit of difference: For Gemma2-2b PP-8192 is 228 t/s with quantized KQ vs 178 t/s when converting things to fp16 and using fp16 matrix multiplication. We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the performance of PP-512. In contrast, llama.cpp with Q8_0 cache is 38% of PP-512. * Zen4 Flash Attention: quantized KQ for q4_0, q4_1, q8_0 AVX2 Flash Attention: quantized KQ for q4_0, q4_1, q8_0 Tidy up FlashMS * Delete no longer used stuff With the usage of quantized matrix multiplications for quantized k- and/or v-cache, we no longer need the helper methods loading entire rows. * Disallow mixing bf16 with other types for kv caches --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-12 19:03:20 +03:00
Kawrakow	c920195edd	AVX2 Flash Attention 2 (#50 ) * AVX2 Flash Attention: add ability to use Q8_0 for kv-cache * AVX2 Flash Attention: add ability to use Q4_0 for kv-cache * AVX2 Flash Attention: add ability to use Q4_1 for kv-cache * Fix Zen4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-11 19:55:42 +03:00
Kawrakow	d98a6753a6	ARM_NEON Flash Attention (#49 ) * NEON Flash Attention - first working version Simply reuse the Zen4/AVX2 implementation, but use f16 for the KQ multiplication and V*softmax(KQ) accumulation. This makes the FlashMS portion somewhat awkward because we do not have fast f16 implementations for expf (and tanh when softcap is enabled), so we need to convert back and forth to f32. FA is slightly faster than no-FA for the 4B TriLM model, but slightly slower for Gemma-2b. NEON Flash Attention - convert Q to f16 before computing QK NEON Flash Attention - use fp32 for KQ operations Else I get wrong results for LLaMA-3.1-8B (but it works for Gemma-2b). Delete commented out stuff --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-11 10:26:49 +03:00
Kawrakow	72f5dfe12a	AVX2 Flash Attention (#48 ) * First version of AVX2 Flash attention I simply took the Zen4 implementation and converted platform specific stuff to methods of a struct providing data loading/storing, conversions, multiply, add, etc. Most likely not optimal as the Zen4 strategy has been designed based on having 32 512-bit registers, so basically we can have 4X more data stored in vector registers compared to AVX2 with 16 x 256-bit. It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b. * Fix Zen4 parts broken via the AVX2 change * Try smaller q_step - no improvement * Fix ARM_NEON I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__ --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-10 19:17:04 +03:00
Kawrakow	d17d0c4426	iq2_tn: slightly better performance on AVX2 (#47 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-10 16:21:57 +03:00
Kawrakow	a1f7a03f50	IQ1_TN Metal implementation (#46 ) * iq1_tn: Metal implementation Requires to change the get_rows and matrix multiplication kernels to use a dequantizer type rather than a dequantization function. But once this is done, we can simply reuse the iq1_bn implementation. This change will also allow to add other quantization types that have meta data (such as a row scale) stored at the beginning of a row (or change existing quantization types to row-wise scales). * Some cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-10 09:43:05 +03:00
Kawrakow	918ada20fa	Add CUDA support for IQ1_TN (#45 ) * iq1_tn: adding CUDA dequantize * iq1_tn: adding CUDA dot product * Delete commented out stuff * Delete forgotten TODO --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-09 21:17:17 +03:00
Kawrakow	8c86231f93	Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44 ) * Adding iq1_tn - 1.6875 bpw for TriLM ternary models * iq1_tn: NEON * iq1_tn: faster NEON * iq2_bn: improve performance on NEON We now get TG-128 = 100 t/s for Bitnet-3B-1.58b! * iq1_tn: improve AVX2 PP-512 goes to 533 t/s up from 455. TG-128 @ 2 threads goes to 16.6 t/s up from 14.2. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq1_tn: improve Zen4 PP-512 goes to 485 t/s up from 352. With FA we get 545 t/s up from 380. TG-128 @ 1 thread goes to 12.4 t/s up from 10.4. However, we seem to have a bottleneck somewhere as TG saturates at 8 threads. * iq2_bn: improve on Zen4 We now get PP-512 = 614 t/s up from 542 t/s * iq2_bn: improve AVX2 implementation We now get PP-512 = 753 t/s up from 680 t/s. * Remove unnecessary barrier in ggml_compute_forward_mul_mat --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-09 14:56:34 +03:00
Kawrakow	bf4b19b474	iq2_tn: slightly faster PP (#43 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-08 12:41:44 +03:00
Kawrakow	6136a4b803	Adding fused rms_norm (#42 ) * Fused rms_norm: works on the CPU * Fused rms_norm WIP * Fused rms_norm WIP * Fused rms_norm WIP * Fused rms_norm WIP * Fused rms_norm WIP --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-08 10:19:21 +03:00
Kawrakow	0087008d29	Add support for bf16 to iqk_mul_mat (#39 ) * WIP: adding BF16 support to iqk_mul_mat * Minor * Improve TG speed (when not memory bound) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-05 07:48:27 +03:00
Kawrakow	7b1b2b2c06	Zen4 Flash Attention - bf16 support (#38 ) * Zen4 Flash Attention: WIP bf16 * Zen4 Flash Attention: bf16 seems to be working * Zen4 Flash Attention: improving bf16 * Zen4 Flash Attention: improving bf16 It is better (slightly faster) to first convert Q to bf16 before processing each block of q_step rows. This requires D*q_step*sizeof(bf16) bytes, so at most 4 kb for the head sizes we support, so we can just allocate on the stack instead of reserving and passing a work buffer in ggml. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-05 07:46:47 +03:00
Kawrakow	f17d0d72f5	Performance improvements for legacy quants on ARM_NEON (#37 ) * WIP: trying to improve legacy quants * WIP: trying to improve legacy quants With this commit PP-512 for LlaMA-3.1-8B goes from 72 t/s to 87.2 t/s for q4_0, and from 61.5 t/s to 73.9 t/s for q4_1, so 20+% improvement for both. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-04 07:24:04 +03:00
Kawrakow	8c94dcd433	Zen4 Flash Attention 2 (#36 ) * Zen4 Flash Attention: WIP generalize to other types Now loading of data from K and V is done via a template parameter, so this should make it easy to generalize to types other than F16 for the K and V cache. * Zen4 Flash Attention: it works for q4_0 and q8_0 * Zen4 Flash Attention: small q8_0 performance improvement * Zen4 Flash Attention: add q4_1 * Delete unused stuff --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-04 07:20:55 +03:00
Kawrakow	9b53c2533f	Fix Zen4 Flash Attention (#35 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-02 15:54:24 +03:00
Kawrakow	5518e24be8	Do not process prompts containing binary data for escapes (#33 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-02 09:18:48 +03:00
Kawrakow	dc023bc3be	Zen4 Flash Attention (#32 ) * Zen4 flash attention: moving useful parts from the kq_fused_softmax branch * Add flash attention with soft-cap and fix D = 256 case * Flash attention refinements * Update FlashAttn comment --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-09-01 16:08:21 +03:00
Kawrakow	dbb1db9899	Fix build when iqk_mul_mat is disabled (#31 ) Ref #29 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-31 09:11:42 +03:00
Kawrakow	c7e99c88a2	Faster Gemma2 (#27 ) * soft_cap_max: initial CPU version of fused softcap + soft_max With this vanilla CPU implementation I'm already getting a ~3% speedup for Gemma-2-9b and a prompt of 8192 tokens. * soft_cap_max: WIP - something is wrong with CUDA * soft_cap_max: looks good on CPU and CUDA * Add softcap to flash attention Just CPU and CUDA for now (but, as we know, flash attention on the CPU is useless in llama.cpp). On CUDA this improves PP performance quite a bit, especially for long contexts. E.g., for PP-16384, I now get 3777 t/s. Without this change, one cannot use FA, and one gets 2300 t/s (after fusing softcap and softmax), or 2000 t/s without the fused softcap+softmax. In comparison, mainline llama.cpp has PP-16384 = 1549 t/s before PR-8542 (where Johannes Gaessler has also added softcap to FA), and PP-16384 = 3097 t/s after this PR. * soft_cap_max: Metal * Flash attention with softcap: Metal --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-27 17:40:59 +03:00
Kawrakow	bd99ed7d0a	softcap: minor improvement (#24 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-21 13:00:09 +03:00
Kawrakow	d259a50ca6	Fused soft cap and SIMD-ified GeLU (#9 ) * Softcap: WIP Fuses scale + tanh + scale as used for softcapping in some models. Just CPU for now. ~1.4% for PP-512 on Gemma2-9b, no effect on TG. Somewhat surprisingly the improvement does not increase as I go to longer contexts. Gemma2 does softcap on KQ, which grows quadratically with context length, so I would have thought the benefit from fusing scale, tanh, scale would increase. But no, no luck. softcap: CUDA * softcap: CUDA ~1% speedup for Gemma2-9b * softcap: Metal and NEON About 1% speedup. * Simdified gelu Gives ~1% speedup for Gemma2-9b prompt processing on AVX512/AVX2. It looks like the gelu operation is memory bound on my CPUs after SIMD-ifying it. By not using the 128 kb gelu lookup table we gain a small advantage. On the M2-Max the lookup table is slightly faster than the SIMD version, so left the lookup table for ARM_NEON. * softcap, tanh: avoid NaNs for large arguments (AVX2, AVX512) Not that I have encountered this in practice, but just to be sure. This does it for AVX512 and AVX2, still need a guard for ARM_NEON. * llama-bench: add ability to turn off warmup runs So we don't need to wait forever on, e.g., benchmarks involving long contexts. * softcap, tanh: avoid NaNs for large arguments (NEON) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-20 17:15:47 +03:00
Kawrakow	a325745000	iq4_k: use iq5_k also when n_gqa = 2 (#23 ) This improves size vs quality balance for Gemma-2 models. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-20 17:15:06 +03:00
Kawrakow	a73702d93b	AVX2 quantization for Q8_K (#22 ) It has been there for a while, but forgot to add here. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-19 15:33:27 +03:00
Kawrakow	5652100afc	quantize_stats: print rmse and max error as fraction of <x> (#21 ) This allows for a better comparison between different models or different tensors of the same model where the magnitude of the model weights may differ. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-19 13:49:28 +03:00
Kawrakow	c7b47fc67f	iq2_k: slightly better bpw - accuracy compromise (#20 ) For LLaMA-3.1 models: * It is better to quantize all of attn_v with iq3_k instead of half of attn_v with iq4_k * Quantizing attn_output with iq3_k results in a larger PPL decrease compared to what one expects from the added bpw. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-19 13:36:51 +03:00
Kawrakow	6c5384f20e	Skip barriers of noops (#19 ) GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE, along with GGML_OP_NONE, are all noops. I.e., nothing happens. But ggml still has a barrier after them, which wastes time. The waste is not too bad for large models where computations are long compared to the time taken for thread synchronization. But for small models skipping those unnecessary waits makes a significant difference. E.g., for the 99M TriLM model, TG-500 goes up to 1426 t/s from 1240 t/s. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-14 10:40:09 +02:00
Kawrakow	bb5ff6fade	Update README.md	2024-08-12 15:16:00 +02:00
Kawrakow	8f43e55103	Merge mainline - Aug 12 2024 (#17 ) * Merge mainline * Fix after merge * Remove CI check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-12 15:14:32 +02:00
Iwan Kawrakow	f5d1af61d7	Fix Makefile I always use cmake, so had forgotten to pay attention to the Makefile.	2024-08-09 16:31:04 +02:00
Iwan Kawrakow	f0d7a0d53b	Fix Zen4 implementation of iq3_k, iq4_k, iq5_k See comments in `f3a823ce72`	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	c77dba5273	iq6_k: AVX2	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	a829cb7794	iq6_k: Metal About 4% slower than Q6_K for PP-512, but 10% faster for TG-128. Someone has screwed up Q6_K TG performance on Metal? With the continuous "improvements" in ggml I wouldn't be surprised. Need to look into it later.	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	48c4389e3d	iq6_k: NEON Respectable performance, only slightly slower than Q6_K.	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	595d2ae32d	iq6_k: slightly better Zen4 iqk_mul_mat We now arrive at pp-512 = 147 t/s for LLaMA-3.1-8B. TG-128 is 9.5 t/s. This is better than last commit, but still kind of slow compared to Q6_K. My last commit message is wrong: also iq3_k needs a fix for overflow.	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	849476acc7	iq6_k: Zen4 iqk_mul_mat We need to do 4 shuffles to get the non-uniform values, so this makes it slower than other iqX_k quants. And then I realized that I was using the standard Zen4 template for all iqX_k quants. The standard template converts the 32-bit integers obtained after _mm512_dpbusds_epi32 back to 16 bits, and then multiplies with 16-bit block scales. But this can overflow for iq4_k, iq5_k, and iq6_k. I guess I did not notice with iq4_k and iq5_k because the PPL difference to CUDA was relatively small, and I attributed it to Q8_K not being accurate enough for the activations. But for iq6_k the PPL difference was much too big to be attributable to Q8_K inaccuracies, so that's when I realized that I cannot be packing the _mm512_dpbusds_epi32 result into 16 bit for 4-,5-,6-bit iqX_k quants. For now I fixed it for iq6_k, but the outcome is that it is significantly slower than Q6_K: I get PP-512 = 125 t/s for LLaMA-3.1-8B vs 180 t/s for Q6_K, so I need to look for a better approach.	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	050bdfa101	iq6_k: CUDA dot product 90.2 t/s for LLaMA-3.1-8B. Q6_K gives 91.2 t/s, so we are good.	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	c3f5e4d9a7	iq6_k: CUDA dequantize We get a slightly better PPL for LLaMA-3.1-8B compared to q6_K (0.14% vs 0.26% quantization error).	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	a9b3f4a54b	iq6_k: WIP (quantize/dequantize)	2024-08-09 16:00:31 +02:00
Iwan Kawrakow	cfb0410067	iq6_k: WIP (nothing works)	2024-08-09 16:00:31 +02:00
Kawrakow	a9f302ebe2	Adding IQ2_TN for use with ternary models (#13 ) * iq2_tn: TriLM specific 2.0625 bpw quantization Quantize/dequantize/scale dot product. I get 46 t/s for the TriLM-3.9B without any SIMD! Finally a compiler doing a decent job auto-vectorizing the scalar implementation. * iq2_tn: AVX512 Just reusing the k-quants template gets us to PP-512 = 376 t/s, TG-128 = 47.6 t/s for TriLM-3.9B. * iq2_tn: AVX512 With this tweak we get to PP-512 = 431 t/s. * iq2_tn: AVX512 With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads. At 4 threads we saturate at 48.41 t/s, and then performance slowly degrades with increasing number of threads. * iq2_tn: AVX2 PP-512 = 440 t/s on the Ryzen-5975WX. We should be able to do better. * iq2_tn: initial NEON version * iq2_tn: NEON For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s, TG-128 = 75.5 t/s. This is in line with what we have for iq2_bn and 3.3B Bitnet. * iq2_tn: Metal For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s, TG-128 = 98.5 t/s. * iq2_tn: CUDA For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s, TG-128 = 299.2 t/s. * iq2_tn: AVX2 PP improvement We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX. We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn. Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something that is not quite optimal in iq2_tn. * iq2_tn: small NEON improvement For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-08-07 07:56:09 +02:00
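
The "Fused soft cap" commit (#9) in the log above describes fusing scale + tanh + scale into a single operation. The following is a minimal scalar sketch of that idea, not the repository's implementation: the function names `softcap_unfused` and `softcap_fused` and the parameter names `s_before` and `s_after` are hypothetical, and the actual kernels are SIMD, CUDA and Metal code that is not shown in this log.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Unfused reference (hypothetical helper): three separate passes over the
// data, one per operation, as a graph of scale -> tanh -> scale would do.
inline void softcap_unfused(std::vector<float>& x, float s_before, float s_after) {
    for (float& v : x) v *= s_before;     // pass 1: scale
    for (float& v : x) v = std::tanh(v);  // pass 2: tanh
    for (float& v : x) v *= s_after;      // pass 3: scale
}

// Fused version (hypothetical helper): the same arithmetic in one pass,
// so each element is read and written only once.
inline void softcap_fused(float* x, std::size_t n, float s_before, float s_after) {
    assert(x != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        x[i] = s_after * std::tanh(s_before * x[i]);
    }
}
```

Calling `softcap_fused(data, n, 1.0f / cap, cap)` bounds every output to the range [-cap, cap], which is the usual way a soft cap is parameterized; whether the repository passes the two scales in exactly this form is an assumption.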

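The "iqk_mul_mat(ARM_NEON): adding bf16 support" commit (#41) in the log above falls back to converting bf16 to f32 and computing in f32. The sketch below illustrates that fallback in portable scalar C++, assuming the standard bf16 layout (the upper 16 bits of an IEEE-754 binary32 value). The names `bf16_to_f32` and `dot_bf16_f32` are hypothetical and are not taken from the repository, whose NEON code is not shown in this log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Widen one bf16 value (stored as raw 16 bits) to f32 by placing it in the
// upper half of a 32-bit word; the lower 16 mantissa bits become zero.
inline float bf16_to_f32(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof(f));  // bit cast without aliasing issues
    return f;
}

// Dot product of a bf16 row with an f32 row: convert on the fly and
// accumulate in f32 (hypothetical helper for illustration).
inline float dot_bf16_f32(const std::uint16_t* x, const float* y, std::size_t n) {
    assert((x != nullptr && y != nullptr) || n == 0);
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        sum += bf16_to_f32(x[i]) * y[i];
    }
    return sum;
}
```

Because the conversion is a plain shift, it is exact; the cost reported in the commit message comes from the extra widening work and from doing the arithmetic in f32 rather than f16.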

3432 Commits