ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-12 23:10:01 +00:00

Author	SHA1	Message	Date
Kawrakow	22c20fcd6d	Fix flash attention long argument for mainline compatibility	2025-11-13 19:22:16 +02:00
Kawrakow	ce3ce97a29	Fix repacked legacy quants (#951 ) * Fix q5_0_r4 The issue was in the tail part. As almost all models have tensor rows that are a multiple of 128, that part was never triggered in testing. But the gpt-oss models have an embedding size of 2880, so we end up there and trigger the bug. * Fix q6_0_r4 Same fix as q5_0_r4 * Fix q4_0_r8 * Fix q5_0_r4 and q6_0_r4 also on Zen4 * Fix q4_0_r8 also on Zen4 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 15:35:37 +02:00
Kawrakow	88c02fa108	Set default MLA to 3 also in llama-bench (#949 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 09:52:06 +02:00
Kawrakow	874926800f	Add mainline compatible FA command line option (#944 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 08:55:33 +02:00
Kawrakow	32edcb4b74	Fix rope_norm_fast_cuda (#945 ) * Fix rope_norm_fast_cuda * One more * Also fix mrope and vision --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-13 08:54:37 +02:00
Kawrakow	ddc88bac17	Set mla=3 by default (#943 ) so more recent users that haven't followed the history of FlashMLA evolution and hence don't know about the MLA options get the best setting without having to add -mla 3 on the command line. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-12 11:00:58 +02:00
Kawrakow	0d97b9c0bf	Enable fusion by default (#939 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 10:35:48 +02:00
Kawrakow	219fe93973	Opt from #880 also for iqk cuda gemv (#938 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 10:01:34 +02:00
Kawrakow	25cd985c9b	Add --n-cpu-moe to llama_bench (#937 ) * Add --n-cpu-moe to llama_bench * Add usage --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 08:44:59 +02:00
Kawrakow	121ed91165	Add rcache to llama-bench (#936 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-11 08:06:18 +02:00
Kawrakow	1223bc63b8	Minor: remove unnecessary calls to build_inp_out_ids (#935 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 17:38:46 +02:00
Kawrakow	263be6670b	Add support for SmolLM3 (#934 ) * Convert from HF * Model loading and compute graph --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 15:40:12 +02:00
Kawrakow	86e2bec04e	DeepSeek FA optimizations (#929 ) * Use new-new-mma also for MLA=3, and use mask bounds This gives us ~25% better PP at 32k tokens compared to main * This seems better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 09:55:30 +02:00
Kawrakow	adba641347	DeepSeek TG optimizations (#928 ) * Fuse concat and copy into K cache * Avoid ggml_cont() when n_token = 1 Combined effect: about +2% in TG performance with full GPU offload Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 09:52:07 +02:00
firecoperana	eea6cc4433	Server: Add --draft-params to set draft model parameter via command line args (#932 ) * Add command line argument for draft model * Remove second context of draft model * Format print * print usage if parsing -draft fails --------- Co-authored-by: firecoperana <firecoperana>	2025-11-10 09:51:07 +02:00
Kawrakow	bf474e9bff	Use fused gemv+add only for TG (#933 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-10 08:34:24 +02:00
Kawrakow	56ee303254	Make biased gemv fusion optional (#931 ) * Make biased gemv fusion optional * Fix one path through gemv fusion * Remove forgotten printf --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-09 19:09:47 +02:00
Kawrakow	7df9947923	Fix compiler warning	2025-11-09 14:35:59 +02:00
Lennart Lopin	fd37776584	Add ARM Grace Blackwell (NVIDIA DGX Spark) support (#922 ) This commit enables IQK quantization operations on ARM-based systems, specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell. Changes: - Enable IQK_IMPLEMENT macro for ARM NEON operations - Add arm_neon.h header include for ARM SIMD intrinsics - Fix compilation errors related to missing NEON types and functions Build requirements for ARM: cmake .. -DGGML_CUDA=ON \ -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \ -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16" Tested on: - Platform: NVIDIA DGX Spark (aarch64) - CPU: GB10 Grace Blackwell Superchip - Memory: 128GB unified memory Fixes build errors: - 'float32x4_t' does not name a type - 'vld1q_f32' was not declared in this scope - 'v_expf' was not declared in this scope - Missing FP16 NEON intrinsics	2025-11-09 14:22:40 +02:00
firecoperana	73c28dbef4	server: bug fix for preserved_tokens not preserved in process_token (#926 ) Co-authored-by: firecoperana <firecoperana>	2025-11-09 14:16:29 +02:00
firecoperana	b63309a918	Fix embedding missing, CORS and crash using verbose in server (#924 ) * server: fix crash when prompt has image and is too long * server: fix CORS * server: fix empty result for embedding * change error message to truncate prompt * server: fix slot id for save and load state * bug fix * server: update slot similarity to handle mtmd * server: quick hack to calculate number of token processed with image * server: fix out of range error when detokenizing prompt under verbose * Add back Access-Control-Allow-Origin * Server: Add prompt tokens in embedding results --------- Co-authored-by: firecoperana <firecoperana>	2025-11-09 14:16:03 +02:00
Kawrakow	5cc15d0ecf	CUDA MoE improvements (#923 ) * Use mmq_id in mul_mat_id * Better * Also use it in the fused up+gate op * Better -no-fmoe TG on CUDA Still much slower than -fmoe, but about 20-25% faster than what we had before. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-09 11:34:33 +02:00
Kawrakow	defa6945b3	CUDA: fuse copies to K and V cache (#921 ) * Fuse copies to K- and V-cache on CUDA * Adapt to latest main --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-08 18:13:58 +02:00
Kawrakow	3614c4f098	Adopt fix from mainline PR 17089 (#920 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-08 07:44:20 +02:00
Kawrakow	d0850dccc8	Disable add + fused_rms_norm fusion (#916 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 19:38:18 +02:00
Kawrakow	1c31b25380	Fix PPL increase caused by mmq_id (#913 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 18:58:09 +02:00
Nexes the Elder	f9a411e5db	More informative PPL readout line (#914 ) * More informative PPL readout line * trailing whitespace..	2025-11-07 16:41:24 +02:00
Kawrakow	532a05e466	CUDA: set compute parameters via command line arguments (#910 ) * cuda: set compute parameters via command line arguments * Also llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-07 07:11:23 +02:00
Kawrakow	49befdd4fb	Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-06 19:07:46 +02:00
firecoperana	e15a215e6b	model : Port Minimax M2 from mainline (#907 ) Co-authored-by: firecoperana <firecoperana>	2025-11-06 18:09:24 +02:00
Kawrakow	66ef68bc14	Fix compiler warning	2025-11-06 07:12:07 +02:00
firecoperana	18f5a6caef	Bug fixes for completions and prompt caching in server (#906 ) * Bug fixes for completions and prompt caching in server * Fix compiler warning about redefinition --------- Co-authored-by: firecoperana <firecoperana>	2025-11-06 07:10:51 +02:00
Kawrakow	50f95d7bf3	Disable CUDA fusion by default for now (#903 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:58:12 +02:00
Kawrakow	cb30f8e057	Merge Q and K into a single tensor (#892 ) * Merge Q and K into a single tensor * Make V mul mat follow QK mul mat so they can be fused, which gives slightly better TG performance. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:54:36 +02:00
Kawrakow	e68f50be9a	Allow quantization of ffn_gate_inp (#896 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:44:32 +02:00
firecoperana	7978f04996	Add vision support in llama-server (#901 ) * server: add support for vision model webui: add support for vision model * server : remove hack for extra parallel slot#10187 * llama : fix KV shift for qwen2vl #13870 * add no-context-shift parameter --------- Co-authored-by: firecoperana <firecoperana>	2025-11-05 10:43:46 +02:00
Kawrakow	92607d44c4	Much better CPU TG performance at long context for GLM-4.5 (#899 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 10:20:26 +02:00
Kawrakow	98357d9aa5	Adding cmake option to disable CUDA fusion (#902 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-05 07:09:27 +02:00
Kawrakow	11feb49562	Fix compilation failure after merging #883 (#900 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 19:28:52 +02:00
Thireus ☠	86597623a5	Port of Qwen3-VL support from mainline (#883 ) * Port of Qwen3-VL for latest ik_llama.cpp - convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead - SYCL and Metal support for imrope not added - Vulkan support for imrope not tested - Code not tested * Bugfix n_embd was declared multiple times https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655 * Fix n_embd issue with qwen3vl * model.output tensor not required https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389 * Improved logic for qkv combined tensors `59ceaf8fcb (r2480395800)` `59ceaf8fcb (r2480398187)` * Fix n_embd for merge_qkv() + cleaner code https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395 * Revert TENSOR_NOT_REQUIRED	2025-11-04 19:20:54 +02:00
Kawrakow	efcb5f9d9e	sweep-bench: be able to set TG tokens via -n (#897 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 14:39:30 +02:00
Kawrakow	c23fda2103	Disable some fusion, RoPE cache off by default (#894 ) * Disable some fusion and make rope cache off by default * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-04 07:50:14 +02:00
Kawrakow	fb0d5a995c	RoPE cache (#887 ) * Introducing rope cache When computing RoPE, the rotation angles in each layer are exactly the same, and only depend on the token positions (and other constant, model dependent parameters). So, I wonder, why don't we compute the angles just once and then reuse for the Q and K RoPE in each layer? This commit does it as a POC on the CPU, and uses it in the Qwen3-MoE compute graph. * cuda: neox works * WIP * rope_cache: norm works * Fused rope+rope * Fused rope+rope (norm) * Fused rms+rms+rope+rope (neox) - not working * WIP * Also qwen3 * Add command line arg to disable rope cache * Disable RoPE cache if rope type is not neox or norm * Add missing break after merge with main * Fused fused_rms+fused_rms+rope+rope (with -mqkv) * Fused fused_rms+fused_rms+rope+rope (without -mqkv) --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-03 18:42:20 +02:00
Kawrakow	846e736e85	cuda: add missing backwards RoPE op (#889 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-03 07:45:18 +02:00
Kawrakow	37c4d19021	Compiler warning	2025-10-31 14:58:00 +02:00
Kawrakow	55a704b67a	Fused Q and K fused_rms_norm for TG on CUDA (#882 ) * Biased mmvq: minor optimization * Fusing Q and K rms_norm for TG on CUDA * Remove commented out code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:41:28 +02:00
Kawrakow	cfb840379f	Biased mmvq: minor optimization (#880 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:21:18 +02:00
firecoperana	a3bd0158f7	Disable pipeline parallel for tensor override or allocation failed (#879 ) * disable pipeline parallelism when tensor override present * disable pipeline parallel if allocation failed --------- Co-authored-by: firecoperana <firecoperana>	2025-10-31 14:20:48 +02:00
Kawrakow	56fc5454ff	Merge Q, K, V (#878 ) * POC: merge Q, K, V into a single, contiguous tensor Done just for Qwen3-MoE, where I see a 4% uplift in TG. PP performance gain is sub-percent, if any. Still, it seems it makes sense to do it in general given the TG performance gain. * WIP * merge_qkv: it works for gpt-oss ...but we see a smaller TG gain (~1.5%) * WIP * Don't ignore the return value of create_tensors() else, when q, k, v get merged and we are running on the CPU, we get a crash because the backend is trying to use mmap, but that no longer works. * merge_qkv: bias can be required, optional, or mandatory * merge_qkv: glm4.5moe * merge_qkv: add command line argument to enable * merge_qkv: fix tensor dimensions * merge_qkv: llama-4 * merge_qkv: qwen3 (dense) * merge_qkv: simplify build_qwen3moe * cohere2 - simplify graph building --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-30 10:49:48 +02:00
jarrodfeaks	92517e74ad	fix v1/chat/completions assistant prefill (#874 )	2025-10-29 17:21:05 +02:00
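
Note on the RoPE cache commit (fb0d5a995c): its message observes that the rotation angles are the same in every layer and depend only on the token positions and constant model parameters, so they can be computed once and reused for the Q and K RoPE in each layer. The sketch below illustrates that idea in plain Python. It is not the repository's implementation: the function names (rope_angles, apply_rope_neox, run_layers), the frequency base of 10000, and the single-head list layout are all assumptions made for illustration, and the pairing of element i with element i + half follows the usual NeoX-style convention.

```python
import math


def rope_angles(positions, head_dim, theta_base=10000.0):
    """Build cos/sin tables with one row per token position.

    theta_base=10000.0 is an assumed, commonly used default; real models
    store their own value in the model metadata.
    """
    half = head_dim // 2
    inv_freq = [theta_base ** (-2.0 * i / head_dim) for i in range(half)]
    cos = [[math.cos(p * f) for f in inv_freq] for p in positions]
    sin = [[math.sin(p * f) for f in inv_freq] for p in positions]
    return cos, sin


def apply_rope_neox(x, cos, sin):
    """Rotate one head vector per token, pairing element i with i + half."""
    out = []
    for row, c, s in zip(x, cos, sin):
        half = len(row) // 2
        rotated = [0.0] * len(row)
        for i in range(half):
            a, b = row[i], row[i + half]
            rotated[i] = a * c[i] - b * s[i]
            rotated[i + half] = a * s[i] + b * c[i]
        out.append(rotated)
    return out


def run_layers(q_per_layer, k_per_layer, positions, head_dim):
    """Compute the angle tables once, then reuse them for Q and K in every layer."""
    cos, sin = rope_angles(positions, head_dim)  # the hypothetical "cache"
    return [
        (apply_rope_neox(q, cos, sin), apply_rope_neox(k, cos, sin))
        for q, k in zip(q_per_layer, k_per_layer)
    ]
```

Without the cache, each layer would rebuild the same cos/sin tables for both Q and K; here the trigonometric work is done once per batch of positions and only the cheap multiply-add rotation is repeated per layer.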


3988 Commits