Commit Graph

3682 Commits

Author SHA1 Message Date
Iwan Kawrakow
154a195f75 Minor 2025-05-10 19:07:02 +03:00
Iwan Kawrakow
3a671301f8 Adding GPU offload policy 2025-05-10 19:01:21 +03:00
Kawrakow
a2d24c97e5 TG improvements for MoE models (#404)
* cuda: Remove unnecessary device-to-host copy of row ids

We get 3-4% TG speed improvement for DeepSeek-Lite just from that.

* CPU: fix get_rows when SER is used

With smart experts reduction (SER), one potentially uses fewer
experts than specified by the model. This is accomplished by setting
the IDs of the not-selected experts to -1. Most of the necessary
machinery was implemented when I added the SER option, but I forgot
to update get_rows() for non-quantized tensors. As a result, we
get random garbage for the weights of the not-selected experts,
which leads to garbage output. This commit fixes it on the CPU
(see the sketch after this entry); I'm not quite sure yet why the
GPU is not working.

* CUDA: fix TG with SER

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-10 18:52:54 +03:00
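The get_rows part of this fix comes down to skipping rows whose expert ID was set to -1 by SER. A minimal sketch of that idea, with hypothetical names rather than the actual ik_llama.cpp code:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Sketch: copy selected expert rows of a float matrix into dst.
// row_ids[i] == -1 marks an expert dropped by smart experts reduction
// (SER); that row is skipped instead of being read from an invalid index.
static void get_rows_f32(const float * src, const int32_t * row_ids,
                         int n_ids, int row_size, float * dst) {
    for (int i = 0; i < n_ids; ++i) {
        if (row_ids[i] < 0) {
            continue; // not-selected expert: leave the destination row alone
        }
        std::memcpy(dst + (std::size_t)i * row_size,
                    src + (std::size_t)row_ids[i] * row_size,
                    row_size * sizeof(float));
    }
}
```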
Kawrakow
43a154d8b8 Handle incompatible DeepSeek GGUFs (#394)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 22:00:40 +03:00
saood06
967a2e1860 Fix missing rope_freqs with convert_hf_to_gguf (#402)
* lora : fix llama conversion script with ROPE_FREQS

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code, which mixes tokenization and
tool calls in the same file.

* gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings.

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09 09:17:41 -05:00
Kawrakow
e5a4a3ce78 Update README.md
@saood06 Thanks!
2025-05-09 11:16:36 +03:00
Kawrakow
8777fc4855 Fix CUDA FlashMLA-3 with quantized KV cache (#400)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:22:48 +03:00
Kawrakow
496451a1d4 Update README.md 2025-05-09 10:13:25 +03:00
saood06
bc6ae515ce Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make it work and add longrope support

* Changes to n_attention_wv rule

* Untested support for 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_wv rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00
Kawrakow
4084ca7331 Update README.md 2025-05-07 18:59:01 +03:00
Kawrakow
30536ee369 FlashMLA-3 for DeepSeek models on CUDA (#386)
* CUDA WIP: support for FlashMLA-3

* Much better

The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite). A sketch of this heuristic follows
this entry.

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 17:38:22 +03:00
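The warp-count observation translates into a simple launch heuristic. The following host-side sketch is illustrative only (pick_n_warps and the constants are assumptions, not names from the codebase): a single-token TG batch gets 1 warp per block, larger batches keep 4.

```cpp
#include <cstdio>
#include <initializer_list>

constexpr int WARP_SIZE = 32;

// With a batch of one token there is not enough parallel work to keep
// 4 warps busy, so the extra warps only add scheduling overhead.
static int pick_n_warps(int n_tokens) {
    return n_tokens == 1 ? 1 : 4;
}

int main() {
    for (int n_tokens : {1, 512}) {
        std::printf("n_tokens=%3d -> block of %d threads\n",
                    n_tokens, WARP_SIZE * pick_n_warps(n_tokens));
    }
}
```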
Gaolingx
17c6fc6b73 Fix some MSVC build problems (#392)
* cmake: force MSVC compiler charset to utf-8

* build: apply MSVC /bigobj option to c/cpp files only

* Update CMakeLists.txt
2025-05-07 17:04:39 +03:00
Kawrakow
8a5c0410e1 Fix DeepSeek q8_0 cache (#391)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 12:06:49 +03:00
Kawrakow
090eae4d69 Fix build for Xeon Gold 6226R (#390)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-07 10:33:27 +03:00
Kawrakow
6c23618ca5 Update README.md 2025-05-06 08:48:11 +03:00
Kawrakow
e3fec17347 Fix DeepSeek FA (#382)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-05 08:39:10 +03:00
Kawrakow
f7c9a0f036 CUDA: MMQ for IQ4_KS (#374)
* WIP

* WIP: still getting illegal memory access

* CUDA: MMQ for iq4_ks now works

~25% faster than dequantize+cuBLAS, ~10% slower than Q4_0 MMQ.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 12:45:00 +03:00
Kawrakow
1328128298 Update README.md 2025-05-04 12:06:47 +03:00
Kawrakow
7cb6a76cd0 Update README.md 2025-05-04 11:49:29 +03:00
Kawrakow
ce2b0292e1 CUDA: faster FA TG for GQA models (#370)
* cuda: WIP MMA FA

* Use MMA for TG also when quantized

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
b890e01238 Another attempt to fix #367 (#371)
* Another attempt to fix #367

* Yet another

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:02:12 +03:00
Gaolingx
ab7f694b71 cmake: force MSVC compiler charset to utf-8 (#369) 2025-05-03 15:56:29 +03:00
Kawrakow
afcfa85756 Trying to fix iq1_s_r4/iq1_m_r4 quantization failure (#368)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-03 14:43:55 +03:00
Kawrakow
1ea1df4b2d Fix FA bug on AVX2 (#364)
* Fix FA bug on AVX2

* Also this was wrong

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-02 07:09:09 +02:00
saood06
d37add8b39 Fix model architecture name (#366)
Co-authored-by: junhuihe <junhui-he@outlook.com>
2025-05-02 07:07:24 +02:00
Kawrakow
98d1626469 Update README.md (#352)
* Update README.md

* Edits

* Updates

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 15:11:29 +02:00
Kawrakow
4c2bee0bed Fix IQK_FA_ALL_QUANTS on AVX2 (#360)
* Fix IQK_FA_ALL_QUANTS on AVX2

* Make it also work, not just compile

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-30 10:45:43 +02:00
Kawrakow
9ba362706c Add missing enum values for qwen3 and qwen3moe (#356)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 10:05:38 +02:00
Ben Harris
1064f5bc31 Apply Qwen3 PR from llama.cpp (#355) 2025-04-29 10:02:08 +02:00
Kawrakow
99b87a375f Update AUTHORS
Add @ubergarm
2025-04-29 07:22:06 +02:00
Kawrakow
cda24b58cb CPU FA improvements (#351)
* FA: provide work buffer for K repacking

* Add header to avoid compiler warnings

* WIP

* WIP

* WIP

* WIP

* Slightly better

* WIP (Zen4)

* WIP

* Try to improve for unusual number of heads/number of threads

* Use mul_mat_qX_0_q8_2_Tx for q6_0 in FA

* Use mul_mat_qX_0_q8_2_Tx for q4_0 in FA

* Use Sum4q4 for q4_0

* WIP

* WIP

* Much better FA TG with q8_0 KV cache

Just repack it even for TG, but do the repacking for k_step rows,
not the whole K tensor (see the sketch after this entry).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-29 07:19:43 +02:00
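The final item describes chunked repacking. A sketch of the control flow under assumed types and callbacks (none of these names come from the actual source; real ggml q8_0 blocks use an fp16 scale):

```cpp
#include <algorithm>
#include <vector>

// Stand-in for a q8_0 block.
struct block_q8_0_sketch { float d; signed char qs[32]; };

// Repack K in k_step-row chunks right before each chunk is consumed,
// instead of repacking the whole tensor up front. The work buffer stays
// small, so TG benefits from repacking without a large one-off cost.
static void flash_attn_over_k(const block_q8_0_sketch * K, int n_rows, int k_step,
        void (*repack_rows)(const block_q8_0_sketch *, int, void *),
        void (*process_chunk)(const void *, int)) {
    std::vector<char> work(k_step * sizeof(block_q8_0_sketch));
    for (int row = 0; row < n_rows; row += k_step) {
        const int n = std::min(k_step, n_rows - row);
        repack_rows(K + row, n, work.data());  // repack only this chunk
        process_chunk(work.data(), n);         // FA consumes repacked rows
    }
}
```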
ubergarm
baeefb4731 Add GLM-4-0414 Model Support (#344)
* Add GLM-4-0414 model support

Based on zRzRzRzRzRzRzR's PR on mainline llama.cpp.

Still some issues where it doesn't work:
* offloading >=60 layers to GPU
* no flash attention

* Remove seemingly unused llm_tensor enums

Both of these seem unused, and LLM_TENSOR_ATTN_POST_NORM already
exists and seems pretty similar. They don't appear to be used in the
Python code either...

So removed these as possibly just cruft:
* LLM_TENSOR_POST_ATTN_NORM
* LLM_TENSOR_POST_MLP_NORM

* Set flash attention precision to f32 on GLM4 arch

* Set non-flash-attention precision to f32 on GLM4 (see the precision sketch after this entry)

* Remove reshape_3d() for Vcur in build_glm4()

This fixes the non-flash-attention inferencing on both CPU and CUDA.
2025-04-26 17:34:04 +02:00
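The two precision changes above are the standard ggml idiom of forcing f32 accumulation on the K*Q product. ggml_mul_mat_set_prec() and GGML_PREC_F32 are the real ggml API; the surrounding function and variable names here are illustrative:

```cpp
#include "ggml.h"

// Illustrative: build attention scores with f32 accumulation, for
// architectures like GLM4 (and, per the Cohere2 commit below, Command-A)
// that need fp32 precision for K*Q.
static struct ggml_tensor * build_kq_f32(struct ggml_context * ctx,
                                         struct ggml_tensor * k,
                                         struct ggml_tensor * q) {
    struct ggml_tensor * kq = ggml_mul_mat(ctx, k, q);
    ggml_mul_mat_set_prec(kq, GGML_PREC_F32); // request full-precision K*Q
    return kq;
}
```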
Kawrakow
9e846f0eb1 Fix division by zero bug (#349)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 09:19:43 +02:00
Kawrakow
715fc552ad Add support for Cohere2 (#341)
* Add support for Cohere2

* Fix IQ4_NL on AVX2

* Command-A needs fp32 precision for K*Q

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-26 08:13:25 +02:00
Kawrakow
770892086c Fix q4_1 and q5_1 on Arm (#348)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 19:48:08 +02:00
Kawrakow
c817160d03 Add ability to manually set arch flags (#347)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 13:24:18 +02:00
Kawrakow
25d1a0dca8 Fix FA on ARM (#346)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 11:01:08 +02:00
Kawrakow
f176122a3d Fix LLaMA-4 attention (#342)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-25 09:21:03 +02:00
Kawrakow
c9eec1729f cuda: use switch in constexpr funcs (#343)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-24 17:37:12 +02:00
saood06
222a195743 Update gguf-py constants (#298)
* Update GGMLQuantizationType

* Update LlamaFileType

* Update GGML_QUANT_SIZES
2025-04-24 00:34:10 -05:00
Kawrakow
9dac3edf2f BitNet adjustments (#338)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-22 08:46:31 +02:00
saood06
cc39800723 Add support for bitnet2b_2501 model (#337)
* add support for bitnet2b_2501 model

* Fixes

* Support both model names

---------

Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
saood06
93cd77b655 Fix termux/android build (#336)
* Attempt fix

* Attempt fix 2

* Attempt fix 3

* Attempt fix 4

* Attempt fix 5

* Attempt fix 6

* Attempt fix 7

* Attempt fix 8

* Attempt fix 9

* Attempt fix 10

* Attempt fix 11

* Attempt fix 12

* Attempt fix 13
2025-04-21 09:13:46 +02:00
Kawrakow
3bb64d9330 Better TG performance for GQA models (CPU) (#332)
* Slightly better CPU TG performance for GQA

* Better CPU FA implementation for TG when GQA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-17 08:08:40 +02:00
Kawrakow
f7c5a94e75 Better gemm/gemv on AVX2 for q4_0_r8 (#331)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:18:50 +02:00
Kawrakow
1bbb143eb3 Allow q8_0 KV cache for head size 256 (#330)
* Allow q8_0 KV cache for head size 256

* We need also these

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-15 17:05:31 +02:00
Kawrakow
05dbbeaf14 imatrix: collect layer influence statistics (#328)
* imatrix: collect layer influence statistics

* imatrix: collect layer influence statistics also for the last layer

For the last layer we need to use the input for the output.weight
tensor. The last layer tends to be important, so it is useful to also
have its influence metric.

* imatrix: separate metric for attention and ffn importance

* Use the stripped tensor name, not src0->name (a generic sketch of the statistics collection follows this entry)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:43:19 +02:00
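The mechanics are generic activation-statistics collection keyed by a stripped tensor name; the sketch below shows only that shape (the PR's actual influence metric is not reproduced here, and all names are hypothetical):

```cpp
#include <map>
#include <string>
#include <vector>

struct LayerStats { std::vector<double> sum_sq; long n_calls = 0; };
static std::map<std::string, LayerStats> g_stats;

// Strip everything after '#' so duplicated graph-node names collapse to
// one entry (the last commit item: use the stripped name, not src0->name).
static std::string strip_name(const std::string & name) {
    auto pos = name.find('#');
    return pos == std::string::npos ? name : name.substr(0, pos);
}

// Accumulate per-element squared activations for one tensor evaluation.
static void collect(const std::string & raw_name, const float * x, int n) {
    auto & st = g_stats[strip_name(raw_name)];
    st.sum_sq.resize(n, 0.0);
    for (int i = 0; i < n; ++i) st.sum_sq[i] += (double)x[i] * x[i];
    ++st.n_calls;
}
```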
Kawrakow
028e0cfa19 Add ability to hide imatrix details in llama-quantize (#329)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-14 19:41:31 +02:00
Kawrakow
d210661c91 Improved IQ1_M quantization (#327)
* Much faster and seemingly better iq1_m quantization

* Cleanup

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-13 10:37:55 +02:00
Kawrakow
c01449a478 Fix KLD precision (#325)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-12 16:17:50 +02:00