ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-22 14:14:32 +00:00

Author	SHA1	Message	Date
Nexes the Elder	ec4563221e	Streamline a bit the quant strategies (#443 ) * Streamline a bit the quant strategies No change over the existing patterns, except for the bump for attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, and which also use GQA). The rest is applying the existing patterns to the new IQ_K quants. Also, a Q8_0 for attn_q slipped into the MOEs 8 experts rule, I removed it, because that tensor is much bigger than attn_k or attn_v. * remove <=8 experts condition.	2025-05-22 18:04:47 +03:00
Kawrakow	b94cd3b632	Refactor iqk_mul_mat.cpp (#435 ) * Refactor iqk: WIP * Refactor iqk: Factor out float GEMM (AVX2/AVX512) * Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512) * Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512) * Refactor iqk: fix AVX2 * Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4 * Refactor iqk: Factor out GEMM for repacked legacy quants * Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV * Refactor iqk: Factor out GEMM for repacked i-quants * Refactor iqk: GEMM kernels are refactored on AVX2/AVX512 * Refactor iqk: factor out 1-bit quants (NEON) * Refactor iqk: factor out k-quants (NEON) * Refactor iqk: factor out floats (NEON) * Also iq4_xs belongs to k-quants * Refactor iqk: factor out iqk quants (NEON) * Refactor iqk: factor out legacy quants (NEON) * Refactor iqk: factor out repacked legacy quants (NEON) * Refactor iqk: factor out repacked k-quants (NEON) * Refactor iqk: factor out repacked iqk quants (NEON) * Refactor iqk: GEMM kernels are refactored on NEON * Refactor iqk: FA compiles If it works is a different story. Current compile time: 107.3 seconds on the Ryzen-7950X * Refactor iqk: FA refactored (Zen4) Compile time for the FA files is now ~21 seconds on my Ryzen-7950X, so still slightly too long for my taste but much better than the 142 seconds we had before. * Adding forgotten file * Most helpers don't need to be templates Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS. Compilation time drops to 14 seconds on the Ryzen-5975WX * Fix bf16 * Refactor iqk: FA refactored (NEON) * Forgotten MMQ ref and typo (#431) * Adding forgotten iq5_k_r4 * Fix iq4_k_r4 on NEON * Fix iq4_ks on NEON It was broken before the refactoring (the shifts were not correctly applied). * Fix q8_0 on NEON * Fix q6_0 K cache --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>	2025-05-22 10:05:51 +03:00
Kawrakow	a2b5057a0c	Bug fixes from mainline (#439 ) * Add __syncthreads() to the new FA kernel * Clearing padding --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-20 17:03:14 +03:00
Nexes the Elder	2ec2229f2e	Forgotten MMQ ref and typo (#431 )	2025-05-18 17:36:41 +03:00
Kawrakow	b3036a872f	Option to enable disable the IQK CPU FA kernels (#429 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-17 11:21:58 +03:00
Kawrakow	c35a383bcd	Zen4: Faster PP for IQ2_KS, IQ4_KS, IQ5_KS (#428 ) * Zen4: faster PP for iq4_ks and iq5_ks * Zen4: faster PP for iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-17 10:42:33 +03:00
Kawrakow	7abdf2b099	IQ5_KS_R4: row-interleaved IQ5_KS (#426 ) * iq5_ks_r4: basics * iq5_ks_r4: Zen4 works * iq5_ks_r4: AVX2 works * iq5_ks_r4: NEON * Fix iq5_ks on NEON --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-17 08:57:26 +03:00
Kawrakow	134d548173	Fix AVX2 implementation of IQ4_K, IQ4_KS, IQ5_K, IQ6_K (#427 ) * Fix IQ4_K on AVX2 * Fix IQ4_KS on AVX2 * Fix IQ5_K on AVX2 * Fix IQ6_K on AVX2 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-16 17:25:15 +03:00
Kawrakow	34ae71c4d7	Adding forgotten template instance for iq5_ks (#424 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-15 16:50:15 +03:00
Kawrakow	3d92d7f802	Adding IQ5_KS - 5.25 bpw quants (#422 ) * iq5_ks: basics * iq5_ks: quantize * iq5_ks: CUDA dequantize works * iq5_ks: dot product works on CUDA * iq5_ks: MMQ works * iq5_ks: Zen4 * iq5_ks: AVX2 But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks. All these need fixing on AVX2. * iq5_ks: NEON * iq5_ks: Metal dequantize * iq5_ks: Metal dot product --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-15 16:02:39 +03:00
Kawrakow	3f8c865b92	Fix standard attention on the CPU (#421 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-15 08:43:39 +03:00
Kawrakow	14ed9fb44d	CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418 ) * MMQ for iq2_k * This works * MMQ for iq3_k * MMQ for iq2_ks * Fix iq2_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-15 08:15:08 +03:00
Kawrakow	0435b68e6d	CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417 ) * MMQ for iq4_k: WIP (not working) * MMQ for iq4_k: working now * MMQ for iq5_k * Cleanup * MMQ for iq5_k: slightly faster * MMQ for iq6_k --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-14 14:04:11 +03:00
Kawrakow	b90d6ede2e	Fix SER (CUDA) (#416 ) * Fixing SER bugs * Cleanup * This seems to fix it. * This seems to work --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-14 07:29:28 +03:00
Kawrakow	13740622e9	Fix SER (CPU) (#415 ) * Fixing SER bugs * Cleanup --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-13 17:55:04 +03:00
Kawrakow	0c57f84dc4	Fix imatrix calculation for MLA models (#411 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-13 17:53:38 +03:00
Kawrakow	553c08b6b4	Better CPU FA performance for DeepSeek-Lite (#410 ) * Better CPU FA performance for DeepSeek-Lite * It must be like this --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-13 17:53:20 +03:00
Kawrakow	4ba6bbb44a	Update README.md	2025-05-12 15:48:37 +03:00
Kawrakow	627f406437	Fix new CUDA FA on Turing (#413 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-12 15:09:33 +03:00
Kawrakow	1d2da7feae	Add batch warmup to sweep-bench (#375 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-12 07:50:26 +03:00
Kawrakow	f27cd40542	Enable faster prompt processing with mainline llama.cpp GGUFs (#409 ) * Enable MLA-3 in crippled GGUFs: WIP * Enable MLA-3 in crippled GGUFs: seems to work * Add newly created tensors to model.tensors_by_name Else they don't get run-time repacked. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-12 07:49:51 +03:00
Kawrakow	465569dff8	Faster DeepSeek FA on CUDA (#408 ) * New DeepSeek FlashMLA Does not work because the RoPE portion is stored at the end in our case, while in mainline it is stored at the beginning, and the FA kernel assumes that. * Rearrange MLA K cache so it fits the new CUDA FA implementation * constexpr and minor changes --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-12 07:49:00 +03:00
Kawrakow	8669c3db2b	GPU offload policy (#405 ) * Adding GPU offload policy * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-12 07:47:46 +03:00
Iwan Kawrakow	504fb890d9	Revert "Fix race in the CUDA DeepSeek FA kernel (#406 )" This reverts commit `36e6e888b7`. I should have tested. We get NaNs.	2025-05-11 12:22:19 +03:00
Kawrakow	36e6e888b7	Fix race in the CUDA DeepSeek FA kernel (#406 ) Reference: https://github.com/ggml-org/llama.cpp/pull/13438 Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-11 08:12:47 +03:00
Kawrakow	a2d24c97e5	TG improvements for MoE models (#404 ) * cuda: Remove unnecessary device to host copy of row ids We get 3-4% TG speed improvement for DeepSeek-Lite just from that. * CPU: fix get_rows when SER is used With smart experts reduction (SER), one potentially uses fewer experts than specified by the model. This is accomplished by setting the ID of the not selected tensors to -1. Most of the necessary stuff was implemented when I added the SER option, but I forgot to update get_rows() for not quantized tensors. As a result, we get random garbage for the weights of the not-selected experts, which leads to garbage output. This commit fixes it on the CPU. I'm not quite sure yet why the GPU is not working. * CUDA: fix TG with SER --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-10 18:52:54 +03:00
Kawrakow	43a154d8b8	Handle incompatible DeepSeek GGUFs (#394 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-09 22:00:40 +03:00
saood06	967a2e1860	Fix missing rope_freqs with convert_hf_to_gguf (#402 ) * lora : fix llama conversion script with ROPE_FREQS * convert : refactor rope_freqs generation This should also fix vocab-only conversion for Phi-3. * convert : adapt MiniCPM3 to separate rope_freqs insertion MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid having to run its custom Python code which mixes tokenization in the same file as tool calls. gguf-py : add long and short RoPE factors to tensor mappings Empty, but the key names are used to populate the mappings. --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Co-authored-by: Francis Couture-Harpin <git@compilade.net>	2025-05-09 09:17:41 -05:00
Kawrakow	e5a4a3ce78	Update README.md @saood06 Thanks!	2025-05-09 11:16:36 +03:00
Kawrakow	8777fc4855	Fix CUDA FlashMLA-3 with quantized KV cache (#400 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-09 10:22:48 +03:00
Kawrakow	496451a1d4	Update README.md	2025-05-09 10:13:25 +03:00
saood06	bc6ae515ce	Support for Llama-3-Nemotron models (#377 ) * conflict resolution * Changes to make work and add longrope support * Changes to n_attention_wv rule * Untested support of 253B * DeciLMCausalModel now reads rope_theta from config.json properly * Remove errant Granite mentions * Better n_attention_vw rule * Update vocab.py --------- Co-authored-by: Yee Man Chan <ymchan@gmail.com> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-09 10:09:59 +03:00
Kawrakow	4084ca7331	Update README.md	2025-05-07 18:59:01 +03:00
Kawrakow	30536ee369	FlashMLA-3 for DeepSeek models on CUDA (#386 ) * CUDA WIP: support for FlashMLA-3 * Much better The issue was that I did not change the number of warps used for 3D matrix multiplications (wk_b * kv_cache, MoE), so we ended up using 4 warps for TG. By going to 1 warp in these cases, we get a significant boost in TG performance (tested with DeepSeek-Lite) * Sadly, the previous commit was wrong * Finalizing * Also add these * Minor * Minor tweak --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-07 17:38:22 +03:00
Gaolingx	17c6fc6b73	fix some MSVC build problem. (#392 ) * cmake: force MSVC compiler charset to utf-8 * build: apply MSVC /bigobj option to c/cpp files only * Update CMakeLists.txt	2025-05-07 17:04:39 +03:00
Kawrakow	8a5c0410e1	Fix DeepSeek q8_0 cache (#391 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-07 12:06:49 +03:00
Kawrakow	090eae4d69	Fix build for Xeon Gold 6226R (#390 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-07 10:33:27 +03:00
Kawrakow	6c23618ca5	Update README.md	2025-05-06 08:48:11 +03:00
Kawrakow	e3fec17347	Fix DeepSeek FA (#382 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-05 08:39:10 +03:00
Kawrakow	f7c9a0f036	CUDA: MMQ for IQ4_KS (#374 ) * WIP * WIP: still getting illegal memory access * CUDA: MMQ for iq4_ks now works ~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-04 12:45:00 +03:00
Kawrakow	1328128298	Update README.md	2025-05-04 12:06:47 +03:00
Kawrakow	7cb6a76cd0	Update README.md	2025-05-04 11:49:29 +03:00
Kawrakow	ce2b0292e1	CUDA: faster FA TG for GQA models (#370 ) * cuda: WIP MMA FA * Use MMA for TG also when quantized --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-04 09:17:44 +03:00
Kawrakow	b890e01238	Another attempt to fix #367 (#371 ) * Another attempt to fix #367 * Yet another --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-04 09:02:12 +03:00
Gaolingx	ab7f694b71	cmake: force MSVC compiler charset to utf-8 (#369 )	2025-05-03 15:56:29 +03:00
Kawrakow	afcfa85756	Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-03 14:43:55 +03:00
Kawrakow	1ea1df4b2d	Fix FA bug on AVX2 (#364 ) * Fix FA bug on AVX2 * Also this was wrong --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-05-02 07:09:09 +02:00
saood06	d37add8b39	Fix model architecture name (#366 ) Co-authored-by: junhuihe <junhui-he@outlook.com>	2025-05-02 07:07:24 +02:00
Kawrakow	98d1626469	Update README.md (#352 ) * Update README.md * Edits * Updates --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-30 15:11:29 +02:00
Kawrakow	4c2bee0bed	Fix IQK_FA_ALL_QUANTS on AVX2 (#360 ) * Fix IQK_FA_ALL_QUANTS on AVX2 * Make it also work, not just compile --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-04-30 10:45:43 +02:00


3705 Commits