Commit Graph

3470 Commits

Kawrakow
0f3a424166 Enable q6_0 for flash attention (#101)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-22 11:34:49 +02:00
Kawrakow
7c5a91daf1 Enable IQ4_NL for KV-cache in token generation using Flash Attention (#99)
* Enable IQ4_NL for V-cache in token generation

* We don't need these

* Update printout of allowed quantized KV-cache combinations

* Add IQ4_NL + IQ4_NL to FA

This is a better alternative than Q4_0 + Q4_0 for the VRAM poor.

* Remove file added by mistake

* Fix typo, which is not really a bug

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-21 12:16:54 +02:00
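
A minimal sketch of how one might request IQ4_NL for both caches through the llama.cpp C API of this period; it assumes the fork keeps mainline's llama_context_params fields (type_k, type_v, flash_attn) and that GGML_TYPE_IQ4_NL is an accepted cache type here.

```cpp
// Sketch only: field and function names are assumed to match mainline llama.cpp of late 2024.
#include "llama.h"
#include <cstdio>

int main(int argc, char ** argv) {
    if (argc < 2) { std::fprintf(stderr, "usage: %s model.gguf\n", argv[0]); return 1; }

    llama_backend_init();

    llama_model_params mparams = llama_model_default_params();
    llama_model * model = llama_load_model_from_file(argv[1], mparams);
    if (!model) return 1;

    llama_context_params cparams = llama_context_default_params();
    cparams.flash_attn = true;               // quantized KV cache goes through flash attention
    cparams.type_k     = GGML_TYPE_IQ4_NL;   // K cache
    cparams.type_v     = GGML_TYPE_IQ4_NL;   // V cache (the combination added by this PR)

    llama_context * ctx = llama_new_context_with_model(model, cparams);
    std::printf("context %s\n", ctx ? "created" : "rejected (combination not allowed)");

    llama_free(ctx);
    llama_free_model(model);
    llama_backend_free();
    return 0;
}
```

On the command line this presumably corresponds to enabling flash attention and setting the K/V cache types to iq4_nl, assuming the mainline cache-type flags carry over to this fork.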
agray3
d336410509 Avoid rebuild of GGML graph for each token (#98)
Introduces caching of GGML graph to avoid unnecessary full rebuild between each token.
KV cache parameters, which change with each token, are updated directly in cached GGML
graph. Can be disabled with GGML_DISABLE_GRAPH_CACHING environment variable.
2024-10-20 08:36:16 +02:00
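
A rough sketch of the control flow this commit describes. The helpers below (build_full_graph, update_kv_params) are placeholders, not the fork's actual functions; only the GGML_DISABLE_GRAPH_CACHING check mirrors the commit.

```cpp
#include <cstdlib>
#include <cstdio>

// Placeholder stand-ins for the real work: building a ggml graph vs. patching
// the KV-cache parameters inside an already-built one.
struct fake_graph { int n_built = 0; };

static fake_graph * build_full_graph(fake_graph & g) { g.n_built++; return &g; }  // expensive full rebuild
static void         update_kv_params(fake_graph *)   { /* cheap per-token patch */ }

static fake_graph * graph_for_next_token(fake_graph & storage, fake_graph *& cached) {
    const bool disabled = std::getenv("GGML_DISABLE_GRAPH_CACHING") != nullptr;
    if (disabled || cached == nullptr) {
        cached = build_full_graph(storage);   // old behaviour, or caching explicitly disabled
    } else {
        update_kv_params(cached);             // reuse the cached graph; only KV params change per token
    }
    return cached;
}

int main() {
    fake_graph storage;
    fake_graph * cached = nullptr;
    for (int t = 0; t < 4; ++t) graph_for_next_token(storage, cached);
    std::printf("graph built %d time(s) for 4 tokens\n", storage.n_built);
}
```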
Kawrakow
b091a3513e Bitnet: make the scale tensors optional (#97)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-19 18:52:58 +02:00
Nexes the Elder
b94179b741 Quant strategies: attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S (#96)
* attn_q Q4 & attn_v Q6 for Llama 3.1 Q5_K_S

A pattern worth testing on more quants and on L3 8B.
PPL-512: -0.024 for 70B; -0.005 for 8B
Size: -640 MiB for 70B; -64 MiB for 8B

70B Q5_K_S now beats Q5_K_M by 0.012 PPL.

I suspect the same holds for L3, which was quite insensitive to attn_q quantization.

* indent
2024-10-19 17:24:43 +02:00
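
A toy illustration of the tensor-type override this commit adds. The enum and selection helper below are illustrative only; the real logic lives in llama.cpp's quantization-mix code.

```cpp
#include <string>
#include <cstdio>

// Illustrative quant-mix rule from this commit: when producing a LLaMA-3.1
// Q5_K_S model, use Q4_K for attn_q.weight and Q6_K for attn_v.weight,
// Q5_K everywhere else.
enum class qtype { Q4_K, Q5_K, Q6_K };

qtype pick_type_q5_k_s(const std::string & tensor_name) {
    if (tensor_name.find("attn_q.weight") != std::string::npos) return qtype::Q4_K; // fewer bits, PPL barely moves
    if (tensor_name.find("attn_v.weight") != std::string::npos) return qtype::Q6_K; // more bits where it matters
    return qtype::Q5_K;                                                             // default for Q5_K_S
}

int main() {
    const char * names[] = { "blk.0.attn_q.weight", "blk.0.attn_v.weight", "blk.0.ffn_down.weight" };
    for (const char * n : names) {
        std::printf("%-24s -> %d\n", n, (int) pick_type_q5_k_s(n));
    }
}
```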
Kawrakow
a049537904 Attempt to blindly fix Windows build failure (#93)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-19 11:43:04 +02:00
Nexes the Elder
2b1af6bade CLI - Specify GGML_TYPE to quantize for the main tensors. (#91)
To complement token_embd.weight and output.weight:

attn_v.weight
attn_k.weight
attn_q.weight
attn_output.weight
attn_qkv.weight
ffn_gate
ffn_down
ffn_up
2024-10-18 09:48:15 +02:00
Kawrakow
f369c6f921 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00
Kawrakow
a09de6eaef iq4_ks: faster dot product on Metal (#90)
TG-128(LLaMA-3.1-8B) goes to 52.5 t/s up from 48.4 t/s.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 14:13:03 +03:00
Kawrakow
1882040c70 Minor iq3_k tweak 2024-10-14 18:13:11 +03:00
Kawrakow
250c325e7e iq3_k: fix and optimize Metal dot product (#87)
* iq3_k: fix Metal dot product

I was accessing the scales as 4-byte aligned, but iq3_k is
not 4-byte aligned. Instead of throwing an error (as it happens
on CUDA when one makes this mistake), Metal silently accepts
and we get garbage.

* iq3_k: slightly faster Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-14 10:46:41 +03:00
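
The bug class behind this fix, reduced to plain C++ for illustration (the actual code is a Metal kernel): iq3_k data is not 4-byte aligned, so loading its scales through an aligned 4-byte pointer is invalid, and Metal silently returned garbage instead of erroring out like CUDA does.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

// Reading 4 bytes of scale data from an arbitrary byte offset.
uint32_t load_scales_wrong(const uint8_t * p) {
    return *reinterpret_cast<const uint32_t *>(p);   // assumes 4-byte alignment that iq3_k does not provide
}

uint32_t load_scales_safe(const uint8_t * p) {
    uint32_t s;
    std::memcpy(&s, p, sizeof(s));                   // byte-wise copy, no alignment requirement
    return s;
}

int main() {
    uint8_t buf[8] = { 0, 0x78, 0x56, 0x34, 0x12, 0, 0, 0 };
    std::printf("safe unaligned load: 0x%08x\n", load_scales_safe(buf + 1));  // 0x12345678
}
```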
Kawrakow
f61bd33a04 Fix and optimize iq2k Metal implementation (#86)
* I somehow broke iq2_k on Metal? - fix dequantize

* I somehow broke iq2_k on Metal? - fix dot product

* iq2_k: optimize Metal dot product

42.6 t/s -> 46.2 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 14:30:30 +03:00
Kawrakow
67817fb5b9 IQ2_KS: 2.1875 bpw non-linear quantization (#85)
* Experimenting

* iq2k: Try make_qx_quants for the scale

Slightly better for LLaMA-3.1 and Gemma-2, slightly worse for
Qwen2.5.

* iq2k with make_qx_quants: adjust scale

* iq2ks: basics

* iq2_ks: CUDA works

* iq2_ks: WIP

* iq2_ks: WIP

* iq2_ks: Zen4

* iq2_ks: AVX2

* iq2_ks: scalar dot product

* iq2_ks: ARM_NEON

* iq2_ks: Metal

* iq2_ks: faster Metal

LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 =  45.32 ± 0.03 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 13:34:30 +03:00
Kawrakow
c4c70af543 Minor: printf -> LLAMA_LOG_INFO 2024-10-11 12:49:47 +03:00
Kawrakow
6a16fe2f4e Better model info (#84)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-10 18:21:24 +03:00
Kawrakow
a10ccd65f3 New SOTA quantization: 4.25 bpw IQ4_KS (#83)
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused in the dedicated work-buffer
quantization path when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, it is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09 12:54:40 +03:00
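
The Zen4/AVX2 overflow mentioned in this message, reduced to scalar form for illustration (the actual code uses SIMD pack instructions, which saturate rather than wrap, but the value is equally wrong either way):

```cpp
#include <cstdint>
#include <cstdio>

// Packing a 32-bit partial dot-product sum back into int16_t only overflows
// occasionally, which is why the resulting perplexity looks "almost right"
// instead of obviously broken.
int main() {
    int32_t partial = 40000;                          // a perfectly valid int32 partial sum
    int16_t packed  = static_cast<int16_t>(partial);  // wraps: 40000 -> -25536
    std::printf("int32 %d packed into int16 becomes %d\n", partial, packed);
}
```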
Kawrakow
6648952ed8 Fix compiler warnings 2024-10-04 16:17:36 +03:00
Kawrakow
65575488d9 Move scale fudge factors to quantization (#81)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 16:16:01 +03:00
Kawrakow
d2b53228f5 Move to C++17 project-wide (#80)
* Slightly better

* Make the entire project c++17

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 14:43:26 +03:00
Kawrakow
f1066edc4e Do not quantize activations if not necessary (#79)
* Do not quantize activations if not necessary

* Do not quantize activations if not necessary also for MoE models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-04 11:22:57 +03:00
Kawrakow
b44d05dbe0 q6_0: Slightly faster Zen4/AVX2 (#78)
* Faster q6_0 on AVX2

PP-512 goes up by 3.4%.

* q6_0: this is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 18:09:47 +03:00
Kawrakow
4390096212 Fused unary(x)*y (#70)
* Adding fused y*unary(x) op

* Fused y*unary(x) op: CUDA

* Fused y*unary(x) op: dedicated CPU implementation for silu and gelu

* Fused y*unary(x) op: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 17:05:56 +03:00
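
A scalar sketch of what the fused op computes (the real kernels are the CPU/CUDA/Metal implementations added here): unary(x) is applied and multiplied by y in a single pass, with no intermediate tensor.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

static inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

void fused_mul_silu(const float * x, const float * y, float * out, int n) {
    for (int i = 0; i < n; ++i) {
        out[i] = y[i] * silu(x[i]);   // one pass, one store, no temporary silu(x) tensor
    }
}

int main() {
    std::vector<float> x = { -1.0f, 0.0f, 1.0f, 2.0f };
    std::vector<float> y = {  0.5f, 1.0f, 2.0f, 4.0f };
    std::vector<float> out(x.size());
    fused_mul_silu(x.data(), y.data(), out.data(), (int) x.size());
    for (float v : out) std::printf("%.4f\n", v);
}
```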
Kawrakow
104e7e26c4 Adding Q6_0 (#77)
* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LLaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 15:22:13 +03:00
Kawrakow
6dec4112a1 iq4_nl: faster quantization (#76)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 08:17:00 +03:00
Kawrakow
d2c74a369b Fix Q5_0 flash attention (#75)
When I changed iqk_mul_mat to use type-1 dot products for type-0
legacy quants, I forgot to also change the vec_dot_type when
the dot product is done via ggml as in flash attention.
This commit fixes it.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 15:52:35 +03:00
Kawrakow
5e0614f9e5 Fix last commit
Did not re-check on AVX2/Zen4 after the NEON-related changes and,
sure enough, I broke AVX2/Zen4.
2024-10-01 14:48:44 +03:00
Kawrakow
6123e1700b IQ4_NL kv-cache on the CPU (Zen4/AVX2/ARM_NEON) (#74)
* Be able to use IQ4_NL for KV cache on AVX2/Zen4

* Be able to use IQ4_NL for KV cache on ARM_NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 14:46:40 +03:00
Kawrakow
7d9d275fdd CUDA: faster float -> iq4_nl conversion (#73)
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up
from 133.2 t/s.

* Speed up float -> iq4_nl conversion on CUDA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 12:28:29 +03:00
Kawrakow
ee274a4148 iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2 (#72)
* iqk_mul_mat: better iq4_nl implementation on Zen4/AVX2

PP-512 performance for LLaMA-3.1-8B goes to 162.6 t/s up
from 133.2 t/s.

* Fix AVX2

In addition to fixing iq4_nl, it seems I never adjusted the AVX2
implementation of iq2_tn after the block-scale removal.
This commit also fixes that.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 10:56:50 +03:00
Kawrakow
480a405a9c iqk_mul_mat: better strategy when nrc_y not divisible by ny (#71)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-01 08:57:34 +03:00
Kawrakow
cd7e7b6bbc Allow bf16 kv-cache (#69)
On the CPU I get the exact same PPL with and without FA
using bf16 for kv-cache. But on CUDA the bf16 kv-cache
result is about the same as the fp16 kv-cache CPU result,
so I'm missing some conversion somewhere.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-29 09:03:52 +03:00
Kawrakow
f55789e50a Time to fix replace_all (#68)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 17:59:47 +03:00
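
A minimal replace_all in the spirit of this fix, written as a sketch rather than the repository's exact helper; the key point is advancing past each insertion so a replacement that contains the pattern cannot loop forever.

```cpp
#include <string>
#include <cstdio>

static void replace_all(std::string & s, const std::string & from, const std::string & to) {
    if (from.empty()) return;
    size_t pos = 0;
    while ((pos = s.find(from, pos)) != std::string::npos) {
        s.replace(pos, from.size(), to);
        pos += to.size();   // continue after the inserted text
    }
}

int main() {
    std::string s = "a_b_c";
    replace_all(s, "_", "__");   // would loop forever if pos were not advanced
    std::printf("%s\n", s.c_str());
}
```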
Kawrakow
54b1c97878 CUDA non-contiguous RoPE (#66)
In this way we can avoid the Q, K, V copies that are made
after the multiplication with the QKV tensor in, e.g., Phi-3.5-mini.
This results in a 6-7% speedup of PP-512(Phi-3.5-mini)
on CUDA (RTX-4080).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 17:41:21 +03:00
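
A scalar illustration of the idea (the commit itself is a CUDA kernel change): a RoPE that accepts a row stride can operate directly on the Q or K slice of the fused QKV output, so no contiguous copy is needed first.

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

// Illustrative strided RoPE; not the actual ggml/CUDA implementation.
void rope_strided(float * data, int n_rows, int row_len, int row_stride, int pos, float theta_base = 10000.0f) {
    for (int r = 0; r < n_rows; ++r) {
        float * row = data + (size_t) r * row_stride;   // stride may exceed row_len for a QKV slice
        for (int i = 0; i < row_len; i += 2) {
            const float theta = pos * std::pow(theta_base, -(float) i / row_len);
            const float c = std::cos(theta), s = std::sin(theta);
            const float x0 = row[i], x1 = row[i + 1];
            row[i]     = x0 * c - x1 * s;
            row[i + 1] = x0 * s + x1 * c;
        }
    }
}

int main() {
    // Two fused "QKV" rows of width 12; Q occupies the first 4 floats of each row.
    std::vector<float> qkv(2 * 12, 1.0f);
    rope_strided(qkv.data(), /*n_rows=*/2, /*row_len=*/4, /*row_stride=*/12, /*pos=*/3);
    std::printf("%f %f\n", qkv[0], qkv[1]);
}
```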
Kawrakow
947a348990 Adding SWIGLU unary op (#65)
* Adding GGML_UNARY_OP_SWIGLU

This commit implements the ggml op and CPU compute
forward. I see ~3-4% speedup of PP-512 for Phi-3.5-mini.

* GGML_UNARY_OP_SWIGLU: CUDA implementation

I observe ~12% speedup for PP-512(Phi-3.5-mini).

* GGML_UNARY_OP_SWIGLU: Metal implementation

We get ~2% speedup for PP-512(Phi-3.5-mini).

* GGML_UNARY_OP_SWIGLU: minor improvement on Metal

* GGML_UNARY_OP_SWIGLU: cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 13:37:25 +03:00
Kawrakow
843de005d6 Better sub-3-bit quantization mixes with a qkv tensor (#64)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-28 08:17:19 +03:00
Kawrakow
733660accd Adding ability to have meta data per tensor row (#61)
* POC: per row scale

This is a POC of how to work around opinionated ggml to
have scales per row rather than per block.
Only implemented for Zen4, and only for iq2_tn.

* POC per row scale: iq2_tn on NEON

* POC per row scale: iq2_tn on Metal

* Per row scale Metal templates

* iq1_tn: shrink to 1.625 bpw (NEON and Metal)

* POC per row scale: CUDA

* POC per row scale: add CUDA TODOs

There are two places in ggml-cuda.cu left where it is assumed
that type_size * n_per_row / block_size is the way to compute
and handle row sizes. This does not affect simple usage,
but will lead to issues when tensors are split between GPUs.

* Per row scales - CUDA

The only place left where there are unnecessary assumptions being made
is in the Flash Attention code. As we are not using any quants that
use per row scales for quantized KV cache, it should be OK for now.

* Update IQ1_TN and IQ2_TN bpw shown to user

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-27 08:16:06 +03:00
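
Illustrative layout arithmetic only (the constants and names below are made up, not ggml's): the point of the CUDA TODOs above is that with per-row metadata, row size is no longer type_size * n_per_row / block_size.

```cpp
#include <cstddef>
#include <cstdio>

constexpr size_t block_size_bytes = 54;   // bytes per quantized block (made-up value)
constexpr size_t block_elements   = 256;  // elements covered by one block
constexpr size_t row_scale_bytes  = 2;    // one fp16 scale per row

size_t row_size_per_block_scales(size_t n_per_row) {
    return (n_per_row / block_elements) * block_size_bytes;                    // classic ggml assumption
}

size_t row_size_per_row_scale(size_t n_per_row) {
    return row_scale_bytes + (n_per_row / block_elements) * block_size_bytes;  // extra metadata up front
}

int main() {
    const size_t n_per_row = 4096;
    std::printf("per-block-scale row: %zu bytes\n", row_size_per_block_scales(n_per_row));
    std::printf("per-row-scale row:   %zu bytes\n", row_size_per_row_scale(n_per_row));
}
```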
Kawrakow
1cdb6993ee Use fp32 for K*Q in Metal FA implementation (#62)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-25 13:08:55 +03:00
Kawrakow
15b231cb74 Minor 2024-09-19 11:14:53 +03:00
Kawrakow
d373900e99 Fix compiler warnings (#58)
* Fix C++ compilation warnings caused by ggml-common.h

* Disable c99-extensions warning

I get tons of those on macOS due to the arm_neon.h header.

* Disable c99-extensions warning only for APPLE

* Fix warnings in iqk_quantize.cpp

Also add GGML_ABORT when implementation is missing.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 14:31:29 +03:00
Kawrakow
bd4243bfbf BF16 support on Metal (#56)
* BF16 support on Metal

* Faster BF16 Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-17 10:54:42 +03:00
Kawrakow
fc8920282f iqk_mul_mat(ARM_NEON): adding bf16 support (#41)
It looks like the ARMv8 ISA has support for bf16, but my M2 Max
does not have it, so I resort to bf16 -> f32 conversion and
computation in f32. This is 2x slower than f16, but 8x better
than what I get if I try to run a bf16 model as-is on the M2
(NEON and Metal).

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-16 16:47:36 +03:00
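
The fallback path described here, in scalar form (a sketch; the actual code is NEON): bf16 is the upper 16 bits of an IEEE-754 f32, so the conversion is a 16-bit shift.

```cpp
#include <cstdint>
#include <cstring>
#include <cstdio>

static inline float bf16_to_f32(uint16_t h) {
    uint32_t bits = (uint32_t) h << 16;   // bf16 occupies the top 16 bits of an f32
    float f;
    std::memcpy(&f, &bits, sizeof(f));
    return f;
}

int main() {
    uint16_t one_bf16 = 0x3f80;   // bf16 encoding of 1.0f
    std::printf("%f\n", bf16_to_f32(one_bf16));
}
```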
Kawrakow
2d532f85d6 Minor 2024-09-15 12:59:14 +03:00
Kawrakow
ba291cbaed Adding bf16 support to CUDA (#40)
* Adding bf16 support to CUDA - matrix multiplications

* Adding bf16 support to CUDA - cleanup

* Adapt to latest master

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 20:02:32 +03:00
Kawrakow
2a7623ffc6 Improve Q5_0 performance (#55)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 19:47:26 +03:00
Kawrakow
e833fa76a1 Improve Q4_0 and Q8_0 performance on AVX2/Zen4 (#54)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 13:53:50 +03:00
Kawrakow
a972f41adb Quantization mixes tweaks (#53)
* Some tweaks for i-quants

Improve Gemma2 PPL while reducing size

* Some tweaks for iq2_k and iq3_k

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-14 10:29:44 +03:00
Kawrakow
e23dce7a51 Minor 2024-09-13 15:46:36 +03:00
Kawrakow
2bafb03aac Fix bug and D < 128 case for Q8_0 k-cache (#52)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-13 07:19:47 +03:00
Kawrakow
e25c2e7ec2 Quantized Flash Attention for all supported CPU platforms (#51)
* NEON Flash Attention: add support for Q8_0, Q4_0, Q4_1

* NEON Flash Attention: quantized K*Q for q4_0

I could finally take advantage of the matrix multiplication
templates. We get quite a bit of speedup that way for q4_0:
For Gemma-2b using mul_mat_qX_0_q8_0<DequantizerQ40, q_step>
results in PP-2048 = 287 t/s vs 268 t/s when converting the
q4_0 k-cache and Q to fp16 and using fp16 multiplication.

* NEON Flash Attention: quantized K*Q for q4_1

* NEON Flash Attention: quantized K*Q for q8_0

This makes quite a bit of difference:
For Gemma2-2b PP-8192 is 228 t/s with quantized K*Q vs
178 t/s when converting things to fp16 and using fp16
matrix multiplication.
We have PP-512 = 307 t/s, so PP-8192 is now ~75% of the
performance of PP-512. In contrast, llama.cpp with Q8_0
cache is 38% of PP-512.

* Zen4 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0

* AVX2 Flash Attention: quantized K*Q for q4_0, q4_1, q8_0

* Tidy up FlashMS

* Delete no longer used stuff

Now that quantized matrix multiplications are used for the
quantized K- and/or V-cache, we no longer need the
helper methods that load entire rows.

* Disallow mixing bf16 with other types for kv caches

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-12 19:03:20 +03:00
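
A scalar sketch of the quantized K*Q idea (the real code goes through the NEON/AVX mul_mat templates, and ggml stores the block scales as fp16 rather than the fp32 used here for brevity): Q is quantized to q8_0 and multiplied directly against the q4_0 K-cache blocks instead of converting both to fp16.

```cpp
#include <cstdint>
#include <cstdio>

struct block_q4_0 { float d; uint8_t qs[16]; };   // 32 weights, 4 bits each (simplified: fp32 scale)
struct block_q8_0 { float d; int8_t  qs[32]; };   // 32 activations, 8 bits each

float vec_dot_q4_0_q8_0(const block_q4_0 & x, const block_q8_0 & y) {
    int32_t sum = 0;
    for (int i = 0; i < 16; ++i) {
        const int lo = (x.qs[i] & 0x0f) - 8;       // elements 0..15
        const int hi = (x.qs[i] >>   4) - 8;       // elements 16..31
        sum += lo * y.qs[i] + hi * y.qs[i + 16];
    }
    return x.d * y.d * sum;                        // integer work plus two scale multiplies
}

int main() {
    block_q4_0 k{0.1f, {}};
    block_q8_0 q{0.2f, {}};
    for (int i = 0; i < 16; ++i) k.qs[i] = 0x99;   // every 4-bit quant is 9 -> (9 - 8) = 1
    for (int i = 0; i < 32; ++i) q.qs[i] = 10;
    std::printf("%f\n", vec_dot_q4_0_q8_0(k, q));  // 0.1 * 0.2 * (32 * 1 * 10) = 6.4
}
```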
Kawrakow
7874e4425f AVX2 Flash Attention 2 (#50)
* AVX2 Flash Attention: add ability to use Q8_0 for kv-cache

* AVX2 Flash Attention: add ability to use Q4_0 for kv-cache

* AVX2 Flash Attention: add ability to use Q4_1 for kv-cache

* Fix Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-11 19:55:42 +03:00