Commit Graph

3695 Commits

Author SHA1 Message Date
Kawrakow
17d721820a Fix standard attention on the CPU (#421)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:43:39 +03:00
Kawrakow
5e31a7df43 CUDA: quantized GEMM for IQ2_KS, IQ2_K, IQ3_K (#418)
* MMQ for iq2_k

* This works

* MMQ for iq3_k

* MMQ for iq2_ks

* Fix iq2_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 08:15:08 +03:00
Kawrakow
51db1bf2d2 CUDA: quantized GEMM for IQ4_K, IQ5_K, IQ6_K (#417)
* MMQ for iq4_k: WIP (not working)

* MMQ for iq4_k: working now

* MMQ for iq5_k

* Cleanup

* MMQ for iq5_k: slightly faster

* MMQ for iq6_k

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 14:04:11 +03:00
Kawrakow
fba62d61c0 Fix SER (CUDA) (#416)
* Fixing SER bugs

* Cleanup

* This seems to fix it.

* This seems to work

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-14 07:29:28 +03:00
Kawrakow
d002b9b4a0 Fix SER (CPU) (#415)
* Fixing SER bugs

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:55:04 +03:00
Kawrakow
4071472bdc Fix imatrix calculation for MLA models (#411)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:38 +03:00
Kawrakow
86dbdea6fc Better CPU FA performance for DeepSeek-Lite (#410)
* Better CPU FA performance for DeepSeek-Lite

* It must be like this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-13 17:53:20 +03:00
Kawrakow
537f72f9cc Update README.md 2025-05-12 15:48:37 +03:00
Kawrakow
be1d5c4b7e Fix new CUDA FA on Turing (#413)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 15:09:33 +03:00
Kawrakow
ceb8f513e4 Add batch warmup to sweep-bench (#375)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:50:26 +03:00
Kawrakow
2e585d4508 Enable faster prompt processing with mainline llama.cpp GGUFs (#409)
* Enable MLA-3 in crippled GGUFs: WIP

* Enable MLA-3 in crippled GGUFs: seems to work

* Add newly created tensors to model.tensors_by_name

Else they don't get run-time repacked.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:51 +03:00
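For context on the last bullet of the commit above: run-time repacking in this fork walks the loader's tensors-by-name list, so tensors created on the fly (such as the split MLA tensors derived from a mainline GGUF) must be registered there as well. A minimal sketch of that bookkeeping, with an illustrative model type and a hypothetical register_created_tensor() helper rather than the actual loader code:

```cpp
#include <string>
#include <utility>
#include <vector>

struct ggml_tensor;  // opaque stand-in for ggml's tensor type

struct model_like {
    // Every tensor the loader knows about; the run-time repack pass iterates this list.
    std::vector<std::pair<std::string, ggml_tensor *>> tensors_by_name;
};

// Hypothetical helper: after synthesizing a tensor at load time, register it so
// the repack pass treats it like any tensor read directly from the GGUF.
static void register_created_tensor(model_like & model, const std::string & name, ggml_tensor * t) {
    model.tensors_by_name.emplace_back(name, t);  // skipping this is why repacking was missed
}
```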
Kawrakow
0c02e16a39 Faster DeepSeek FA on CUDA (#408)
* New DeepSeek FlashMLA

Does not work because the RoPE portion is stored at the end
in our case, while in mainline it is stored at the beginning,
and the FA kernel assumes that.

* Rearrange MLA K cache so it fits the new CUDA FA implementation

* constexpr and minor changes

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:49:00 +03:00
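The layout mismatch described in the commit above is about where the RoPE'd sub-vector sits inside each cached K row: this fork kept it at the end, while the mainline-derived FA kernel expects it at the front. A toy sketch of the per-row rearrangement, with illustrative DeepSeek-like sizes (512 latent + 64 RoPE values) rather than the real cache code:

```cpp
#include <cstring>
#include <vector>

// Each MLA K-cache row = compressed latent ("NoPE") part + small RoPE'd part.
// Old layout: [nope | rope]; the new CUDA FA kernel assumes [rope | nope],
// so rows are written in the expected order when the cache is filled.
static void rearrange_row(const float * src, float * dst, int n_nope, int n_rope) {
    std::memcpy(dst,          src + n_nope, n_rope * sizeof(float)); // RoPE part first
    std::memcpy(dst + n_rope, src,          n_nope * sizeof(float)); // latent part after it
}

int main() {
    const int n_nope = 512, n_rope = 64;   // illustrative DeepSeek-like dimensions
    std::vector<float> src(n_nope + n_rope, 1.0f), dst(src.size());
    rearrange_row(src.data(), dst.data(), n_nope, n_rope);
}
```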
Kawrakow
aa8ec5dfa6 GPU offload policy (#405)
* Adding GPU offload policy

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-12 07:47:46 +03:00
Kawrakow
8f7bd74afb Revert "Fix race in the CUDA DeepSeek FA kernel (#406)"
This reverts commit 36e6e888b7.
I should have tested. We get NaNs.
2025-05-11 12:22:19 +03:00
Kawrakow
0abcf0749e Fix race in the CUDA DeepSeek FA kernel (#406)
Reference: https://github.com/ggml-org/llama.cpp/pull/13438

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-11 08:12:47 +03:00
Kawrakow
a961f41762 TG improvements for MoE models (#404)
* cuda: Remove unnecessary device to host copy of row ids

We get 3-4% TG speed improvement for DeepSeek-Lite just from that.

* CPU: fix get_rows when SER is used

With smart experts reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the IDs of the not-selected experts to -1. Most of the necessary
stuff was implemented when I added the SER option, but I forgot
to update get_rows() for non-quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU.
I'm not quite sure yet why the GPU is not working.

* CUDA: fix TG with SER

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-10 18:52:54 +03:00
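To make the SER explanation above concrete: a row id of -1 marks an expert that was dropped, and get_rows() must skip it instead of gathering whatever memory a negative index would point at. A simplified, standalone sketch of that guard for float rows (not the actual ggml kernel):

```cpp
#include <cstring>
#include <vector>

// Gather expert weight rows by id. With smart experts reduction (SER),
// dropped experts carry id == -1; copying a row for them is exactly the
// source of the random garbage the commit describes.
static void get_rows_f32(const float * src, int row_size,
                         const std::vector<int> & row_ids, float * dst) {
    for (size_t i = 0; i < row_ids.size(); ++i) {
        float * out = dst + i * row_size;
        if (row_ids[i] < 0) {
            std::memset(out, 0, row_size * sizeof(float)); // not selected: contribute nothing
            continue;
        }
        std::memcpy(out, src + (size_t) row_ids[i] * row_size, row_size * sizeof(float));
    }
}
```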
Kawrakow
47fa8380c6 Handle incompatible DeepSeek GGUFs (#394)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 22:00:40 +03:00
saood06
a7e5b01540 Fix missing rope_freqs with convert_hf_to_gguf (#402)
* lora : fix llama conversion script with ROPE_FREQS

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code which mixes tokenization
in the same file as tool calls.

* gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09 09:17:41 -05:00
Kawrakow
b64cb29713 Update README.md
@saood06 Thanks!
2025-05-09 11:16:36 +03:00
Kawrakow
dd2014a853 Fix CUDA FlashMLA-3 with quantized KV cache (#400)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:22:48 +03:00
Kawrakow
957a6e7911 Update README.md 2025-05-09 10:13:25 +03:00
saood06
87bfad8437 Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make work and add longrope support

* Changes to n_attention_wv rule

* Untested support of 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_wv rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00
Kawrakow
828758ec0d Update README.md 2025-05-07 18:59:01 +03:00
Kawrakow
92ceda1d06 FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3

* Much better

The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite).

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 17:38:22 +03:00
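The warp-count remark in the commit above is a launch-configuration detail: with a single token per step (TG) there is too little work to keep four warps busy in the wk_b * kv_cache and MoE 3D mat-muls. A hedged sketch of the kind of selection meant, with illustrative names and values rather than the actual kernel dispatch:

```cpp
// Choose how many warps the 3D mat-mul launch uses. Token generation
// processes a single token per step, so one warp avoids idle lanes;
// prompt processing keeps the wider configuration.
static int choose_nwarps(int n_tokens) {
    return n_tokens == 1 ? 1 : 4;  // illustrative: 1 warp for TG, 4 otherwise
}

// Illustrative use when configuring the launch:
//   const int  nwarps = choose_nwarps(n_tokens);
//   const dim3 block(32 * nwarps);  // 32 threads per warp
```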
Gaolingx
5436acdb6c Fix some MSVC build problems (#392)
* cmake: force MSVC compiler charset to utf-8

* build: apply MSVC /bigobj option to c/cpp files only

* Update CMakeLists.txt
2025-05-07 17:04:39 +03:00
Kawrakow
8a2d611083 Fix DeepSeek q8_0 cache (#391)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 12:06:49 +03:00
Kawrakow
6104bf5296 Fix build for Xeon Gold 6226R (#390)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 10:33:27 +03:00
Kawrakow
6e7b28f7b0 Update README.md 2025-05-06 08:48:11 +03:00
Kawrakow
b08471f717 Fix DeepSeek FA (#382)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05 08:39:10 +03:00
Kawrakow
45cd1bcd59 CUDA: MMQ for IQ4_KS (#374)
* WIP

* WIP: still getting illegal memory access

* CUDA: MMQ for iq4_ks now works

~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 12:45:00 +03:00
Kawrakow
db0ed280f1 Update README.md 2025-05-04 12:06:47 +03:00
Kawrakow
7cb99f8078 Update README.md 2025-05-04 11:49:29 +03:00
Kawrakow
711ba7e8f4 CUDA: faster FA TG for GQA models (#370)
* cuda: WIP MMA FA

* Use MMA for TG also when quantized

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
fdbdb5310a Another attempt to fix #367 (#371)
* Another attempt to fix #367

* Yet another

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:02:12 +03:00
Gaolingx
8db70379ae cmake: force MSVC compiler charset to utf-8 (#369) 2025-05-03 15:56:29 +03:00
Kawrakow
758ca617cd Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03 14:43:55 +03:00
Kawrakow
892e96be53 Fix FA bug on AVX2 (#364)
* Fix FA bug on AVX2

* Also this was wrong

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
saood06
aca68016d8 Fix model architecture name (#366)
Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-05-02 07:07:24 +02:00
Kawrakow
9303df7450 Update README.md (#352)
* Update README.md

* Edits

* Updates

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 15:11:29 +02:00
Kawrakow
1ea49001f3 Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
* Fix IQK_FA_ALL_QUANTS on AVX2

* Make it also work, not just compile

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
71bc74d738 Add missing enum values for qwen3 and qwen3moe (#356)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 10:05:38 +02:00
Ben Harris
8b62ee32ca Apply Qwen3 PR from llama.cpp (#355) 2025-04-29 10:02:08 +02:00
Kawrakow
2f2803a1d7 Update AUTHORS
Add @ubergarm
2025-04-29 07:22:06 +02:00
Kawrakow
9d9f9f96b2 CPU FA improvements (#351)
* FA: provide work buffer for K repacking

* Add header to avoid compiler warnings

* WIP

* WIP

* WIP

* WIP

* Slightly better

* WIP (Zen4)

* WIP

* Try to improve for unusual number of heads/number of threads

* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA

* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA

* Use Sum4q4 for q4_0

* WIP

* WIP

* Much better FA TG with q8_0 KV cache

Just repack it even for TG. But do the repacking for k_step rows,
not the whole K tensor.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 07:19:43 +02:00
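The last bullet of the commit above is about chunked repacking: rather than converting the entire q8_0 K cache up front, only the k_step rows the FA kernel is about to consume are repacked into a small work buffer. A rough standalone sketch of that loop structure, with placeholder types instead of the real iqk block layouts:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// Placeholder stand-ins: a q8_0-like row and its "repacked" form.
using packed_row = std::vector<int8_t>;

static void repack_rows(const packed_row * src, packed_row * dst, int n_rows) {
    for (int i = 0; i < n_rows; ++i) dst[i] = src[i];  // placeholder for the real repack
}

static void fa_over_k_cache(const std::vector<packed_row> & k_cache, int k_step) {
    std::vector<packed_row> work(k_step);              // work buffer sized for one step only
    for (size_t i0 = 0; i0 < k_cache.size(); i0 += k_step) {
        const int n = (int) std::min<size_t>(k_step, k_cache.size() - i0);
        repack_rows(k_cache.data() + i0, work.data(), n); // repack just this chunk ...
        // ... then run the FA step over `work` instead of the raw cache rows.
    }
}
```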
ubergarm
42d7e58a96 Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support

Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.

Still some issues where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention

* Remove seemingly unused llm_tensor enums

Both of these seem unused, and LLM_TENSOR_ATTN_POST_NORM already
existed and seems pretty similar. They don't appear to be used in the
Python code either...

So removed these as possibly just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM

* Set flash attention precision to f32 on GLM4 arch

* Set non flash attention precision to f32 on GLM4

* Remove reshape_3d() for Vcur in build_glm4()

This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26 17:34:04 +02:00
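Forcing f32 precision for the attention computation, as done for GLM-4 above (and for Cohere2/Command-A further down), goes through ggml's precision hints. A minimal sketch assuming the tensors have already been built in the model's graph builder; ggml_mul_mat_set_prec and ggml_flash_attn_ext_set_prec are existing ggml calls, while the helper wrapping them is only illustrative:

```cpp
#include "ggml.h"  // ggml_mul_mat_set_prec, ggml_flash_attn_ext_set_prec, GGML_PREC_F32

// Illustrative helper: request f32 accumulation for the attention scores of a
// numerically sensitive architecture. `kq` is the K*Q mat-mul result used on
// the non-flash-attention path, `fa` the ggml_flash_attn_ext result.
static void force_f32_attention(struct ggml_tensor * kq, struct ggml_tensor * fa) {
    if (kq) ggml_mul_mat_set_prec       (kq, GGML_PREC_F32);
    if (fa) ggml_flash_attn_ext_set_prec(fa, GGML_PREC_F32);
}
```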
Kawrakow
815307d3bd Fix division by zero bug (#349)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
86be28d5bd Add support for Cohere2 (#341)
* Add support for Cohere2

* Fix IQ4_NL on AVX2

* Command-A needs fp32 precision for K*Q

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
4413f17b58 Fix q4_1 and q5_1 on Arm (#348)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
fb98619852 Add ability to manually set arch flags (#347)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
542351d088 Fix FA on ARM (#346)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00