ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-11 08:30:19 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	d2f79beba4	Disable RoPE cache if rope type is not neox or norm	2025-11-03 08:31:18 +02:00
Iwan Kawrakow	525dda2e80	Add command line arg to disable rope cache	2025-11-03 08:31:18 +02:00
Iwan Kawrakow	aa76ff2c9d	Also qwen3	2025-11-03 08:31:18 +02:00
Iwan Kawrakow	f5ac78de5c	Fused rope+rope	2025-11-03 08:31:18 +02:00
Iwan Kawrakow	ea97dc3a1c	rope_cache: norm works	2025-11-03 08:31:18 +02:00
Iwan Kawrakow	f2c4b3a8d1	cuda: neox works	2025-11-03 08:31:14 +02:00
Iwan Kawrakow	9a790a8905	Introducing rope cache When computing RoPE, the rotation angles in each layer are exactly the same, and only depend on the token positions (and other constant, model dependent parameters). So, I wonder, why don't we compute the angles just once and then reuse for the Q and K RoPE in each layer? This commit does it as a POC on the CPU, and uses it in the Qwen3-MoE compute graph.	2025-11-03 08:30:32 +02:00
Iwan Kawrakow	58922c23ca	Compiler warning	2025-10-31 14:58:00 +02:00
Kawrakow	8c8a7fb7c8	Fused Q and K fused_rms_norm for TG on CUDA (#882 ) * Biased mmvq: minor optimization * Fusing Q and K rms_norm for TG on CUDA * Remove commented out code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-31 14:41:28 +02:00
firecoperana	c7dbe3f2c1	Disable pipeline parallel for tensor override or allocation failed (#879 ) * disable pipeline parallelism when tensor override present * disable pipeline parallel if allocation failed --------- Co-authored-by: firecoperana <firecoperana>	2025-10-31 14:20:48 +02:00
Kawrakow	14760aaf46	Merge Q, K, V (#878 ) * POC: merge Q, K, V into a single, contiguous tensor Done just for Qwen3-MoE, where I see a 4% uplift in TG. PP performance gain is sub-percent, if any. Still, it seems it makes sense to do it in general given the TG performance gain. * WIP * merge_qkv: it works for gpt-oss ...but we see a smaller TG gain (~1.5%) * WIP * Don't ignore the return value of create_tensors() else, when q, k, v get merged and we are running on the CPU, we get a crash because the backend is trying to use mmap, but that no longer works. * merge_qkv: bias can be required, optional, or mandatory * merge_qkv: glm4.5moe * merge_qkv: add command line argument to enable * merge_qkv: fix tensor dimensions * merge_qkv: llama-4 * merge_qkv: qwen3 (dense) * merge_qkv: simplify build_qwen3moe * cohere2 - simplify graph building --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-30 10:49:48 +02:00
Iwan Kawrakow	9a651e8476	Fix device parsing bug	2025-10-29 08:28:57 +02:00
Iwan Kawrakow	65763a2a70	Fix warnings about LLAMA_DEBUG being redefined	2025-10-27 18:41:03 +02:00
firecoperana	6dc5bd847b	Support --device and --device-draft parameter (#866 ) * add --device and --device-draft parameter * don't print debug message in release mode * fix * bug fix to throw exception when no device specified * add const --------- Co-authored-by: firecoperana <firecoperana>	2025-10-27 18:13:28 +02:00
Kawrakow	bdf4f0ddce	Even more fused ops (#868 ) * Fuse Q, K, V gemv+add * More gemv+add fusing * Faster copy when tensors are contiguous Relevant for storing data into the KV cache. I see ~1% speedup for fast models (Ling-mini-2.0, gpt-oss-20b, etc.) * Cleanup * Make sure the bias really is 1 row to use fusion --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-27 16:09:01 +02:00
Kawrakow	16f30fcf31	Change flash attention and fmoe to be on by default (#863 ) * Change fmoe to be on by default * Change default fmoe also in llama-bench * Change flash attention to be on by default --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-25 09:37:28 +03:00
Kawrakow	2522c97dc9	Faster tensor name formatting (#860 ) * Adding fused mul+multi_add + CPU implementation * fused mul+multi_add: command line argument to disable it * Faster tensor name formatting We gain ~1% for Ling-mini-2.0 when running on CUDA. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-24 07:46:18 +03:00
Kawrakow	db3ba4999f	Fused mul + multi_add op (#858 ) * Adding fused mul+multi_add + CPU implementation * fused mul+multi_add: CUDA * fused mul+multi_add: command line argument to disable it --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-24 07:40:35 +03:00
Kawrakow	483cea527d	Fix experts mul node name (#857 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-23 09:46:01 +03:00
Kawrakow	0e1d33ca4a	Fuse add+add+fused_rms (#853 ) * Fuse add+add+fused_rms * Try this * Macro to easily enable/disable fusion * Various: * Check that all tensors involved are on the same device before applying fusion * Fuse sigmoid+scale+sum_rows+div * Fix the fused bailingmoe2 experts selection The issue there was that the bias was not per row, but per expert group, so only the first n_per_group biases were used for all experts. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-22 16:18:11 +03:00
Kawrakow	caf9759c97	Fuse add + fused_rms_norm (CUDA) (#852 ) * Combine all calls to llm_build_norm to a single line so more easily check what kind of arguments are being passed by simply using grep. * Combine add + fused_rms_norm For many models this happens at each layer: the result of the layer is added to the layer input, which then becomes the input to the next layer, which then is typically normalized via fused_rms_norm. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-21 14:29:50 +03:00
Kawrakow	22540cee60	Do not allocate KV cache for unused layers (#843 ) * Do not allocate KV cache for unused layers * Do not apply experts weight scale if it is 1 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-20 10:09:39 +03:00
Kawrakow	28d3e63805	Various fused ops around expert selection (#840 ) * Fuse sigmoid+add+grouped_topk+get_rows (CPU) * Fix CPU + CUDA but CUDA is somehow not 100% correct as I get a slightly different PPL (lower!) * Minor * Fuse sigmoid+add+topk+get_rows (CUDA) * Fuse sigmoid+add+topk+get_rows (CPU) * Fuse topk+view+get_rows+reshape+softmax (CPU) * Fuse topk+view+get_rows+reshape+softmax (CUDA) * cpu: turn off the openai topk fusing for now Something is not right and I don't see the bug. On the CPU one doesn't gain much if anything, so not a big loss. * Also fuse sum_rows and div --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-19 19:02:46 +03:00
Kawrakow	dbfd151594	Grouped expert routing (CPU only) (#836 ) * Better argsort (CPU) * Attempt at grouped topk * This seems to do the trick for grouped experts routing * Cleanup * Trying to merge, something is not right * Working merged grouped top_k (CPU) * Add command line option to enable grouped expert routing * Add grouped expert routing option to llama-bench --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-16 14:57:02 +03:00
Kawrakow	9d364b88ba	Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833 ) * Adding Ling/Ring (a.k.a., Bailing-MoE2) * Add expert group selection (not working, so turned off) * BailingMoE2 conversion * WIP * Bits and pieces --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-15 14:20:40 +03:00
Kawrakow	8d0d01a593	gpt-oss: duplicate experts biases when necessary (#829 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-14 14:38:40 +03:00
Kawrakow	9724ea9213	Attention mask tweaks for better long context performance (#825 ) * Parallelize mask We see non-negligible PP gains for long contexts. More importantly, the strange drop in performance observed for GPT-OSS for context >= 32k tokens is gone. * With FA on, create mask as f16 directly * WIP * Reduce KQ mask padding to 16 Why was it 64 in the first place? I don't observe any issues, while TG performance for long contexts improves by 2-4%. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-13 14:01:11 +03:00
Kawrakow	1db0c490be	Fix PATH_MAX not defined on Windows (#828 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-13 09:25:57 +03:00
Kawrakow	0030bc89c9	Fix performance regression introduced in #823 (#826 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-13 08:09:55 +03:00
Kawrakow	0ad1d34090	Enable and clean up compiler warnings in src (#824 ) * WIP: enable and clean up warnings in src * All warnings handled --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 16:01:13 +03:00
Kawrakow	335a1f9b71	Refactor file llama.cpp (#823 ) * llama_model and llama_hparams * llama_build_context Surprisingly small reduction in llama.cpp compile time given the reduction in LOCs (22k -> 14k) * LLM_TN llama.cpp compilation: 50 s -> 33 s * llama_quantize * arch names * All graph building is now in llm-build-context.cpp * hparams loading llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile. * We are now at 6 seconds to build the src folder * load -> create We are not actually loading the tensors, but just creating them. --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-10-11 11:35:20 +03:00
Downtown-Case	6051ba25ee	Mark some multi-prediction tensors as not required. (#814 )	2025-10-01 20:37:31 +02:00
Kawrakow	87e4762720	Port mtmd from mainline + Qwen2/2.5-VL support (#798 ) * Add mtmd: the beginning * Add mtmd: mtmd.cpp compiles * Add mtmd: clip initialization compiles * Add mtmd: clip.cpp compiles * Add mtmd: builds successfully * Add CPU implementation for GGML_OP_GLU * Add CUDA implementation for GGML_OP_GLU * Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW * Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW * Add mtmd: refresh CPU rope * Add mtmd: refresh CUDA rope * Add mtmd: add Qwen2-VL * Add mtmd: Qwen2.5-VL text seems to work with this change * Add mtmd: fix swiglu * Add mtmd: use LOG_TEE so generated tokens show up in terminal * Add mtmd: do not attempt to load a GPU backend if none are available * GLU, not GPU * Fix typo * Fix new/free mismatch * LOG stuff * Add mtmd: this fixes gibberish on second image --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 08:45:29 +02:00
Kawrakow	8e497e704e	Fused matrix multiplications (CUDA and CPU) (#796 ) * Quick attempt to fuse the Q, K, V GEMMs Doesn't do much on the CPU * Doesn't do much on the GPU either * Use llm_build_mul_mat_qkv * This is not needed * Revert timing on committed by mistake --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-24 16:52:54 +02:00
Kawrakow	0d1bbde1c4	Fix dequantization when requantizing (#795 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-24 12:44:30 +02:00
firecoperana	8cd2d7ccd7	model : add grok-2 support (#782 ) Co-authored-by: firecoperana <firecoperana>	2025-09-23 16:31:01 +02:00
Kawrakow	18f04350e9	cuda: fused top_k+softmax as used in most MoE models (#789 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-23 13:45:57 +02:00
firecoperana	33e071201f	Add Ernie 4.5 MOE and 0.3B Support (#759 ) * Add Ernie4_5MoeModel * add ernie 4.5 0.3B model --------- Co-authored-by: firecoperana <firecoperana>	2025-09-05 11:54:35 +02:00
firecoperana	cec8b70a7e	llama: enable K-shift for quantized KV cache for cuda (#760 ) cuda: add q8_0->f32 cpy operation (#9571) It will fail on unsupported backends or quant types. Co-authored-by: Ivan <nekotekina@gmail.com>	2025-09-05 11:54:18 +02:00
Kawrakow	0c15494c30	Offload only activated experts to the GPU (#698 ) * Offload only activated experts * This seems to do the trick for -fmoe * Do not recalculate activated experts for fused up/gate * Log out of bounds access details * Add a command line argument --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-04 12:22:30 +02:00
Kawrakow	f5e68bf8b6	Alternative CUDA FA for SWA models (#754 ) * Bounds for flash attention * Add n_swa to FA parameters * Fix it * This seems very slightly better * Using vec kernel when we have SWA * Need also this * f32 vec kernel * This is slightly better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-04 08:42:18 +02:00
Kawrakow	62f5382c2b	Revert "CUDA: prompt processing optimizations for MoE models (#739 )" (#748 ) This reverts commit `f22a9ef95a`. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-02 06:55:48 +02:00
Iwan Kawrakow	d10d90ae27	Remove double definition of LLAMA_LOG_DEBUG	2025-09-01 08:42:04 +03:00
firecoperana	0f9ecaec04	Tool calls support from mainline (#723 ) * Tool calls support from mainline * update cmake * revert api for /completions * Fix broken thinking process for gpt-oss * add missing args and fix webui bugs * add missing args and fix webui bugs2 * Fix reasoning format error * add usage * change default post_sampling_probs to true * add back generated_text * Remove server endpoints tests * add log * Chat fixes * Remove logs * webui: revert extra handling of thinking process --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-01 08:38:49 +03:00
Kawrakow	b66cecca45	Fused FFN_UP+FFN_GATE op (#741 ) * Fused up+gate+unary for regular (not MoE) FFN - CPU * WIP CUDA * Seems to be working on CUDA For a dense model we get 2-3% speedup for PP and ~0.6% for TG. * Add command line option This time the option is ON by default, and one needs to turn it off via -no-fug or --no-fused-up-gate --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-31 18:16:36 +03:00
Kawrakow	f22a9ef95a	CUDA: prompt processing optimizations for MoE models (#739 ) * Skip the row id computation for the ffn_down op Sadly, almost negligible performance gain. * Also this doesn't do much * Also this barely moves the needle * This is slightly better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-30 12:09:41 +03:00
Kawrakow	872ac10b02	Make yarn_log_multiplier optional (#738 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-28 14:09:59 +03:00
Kawrakow	dac5b48398	Check for NaNs while loading the model. (#727 ) * Check for NaNs while loading the model. * Also tell which experts have NaNs. * Add command line option to validate quants * Add checks for more quantization types * Add checks for more quantization types --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-27 19:00:17 +03:00
Mohan Krishnan	50f7119dfd	Fix undefined template std::basic_string<char> (#726 ) Getting this error when compiling on Mac with clang 17 Simple fix, add the string header in src/llama-impl.h Co-authored-by: Mohan Krishnan <mohan.krishnan@grab.com>	2025-08-25 11:34:01 +03:00
Kawrakow	9351cc3416	Remove scary warning about incompatible model (#717 ) * Remove scary warning about incompatible model * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-22 18:42:01 +03:00
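
Note on commit 9a790a8905 ("Introducing rope cache"): the message observes that the RoPE rotation angles depend only on the token positions and constant model parameters, so they can be computed once and reused for the Q and K rotations in every layer. The following is a minimal Python sketch of that idea, not the repository's C++/CUDA code. All function names, the default base of 10000, and the pair layouts labelled "norm" (adjacent pairs) and "neox" (split halves) are assumptions made for illustration.

```python
import math

def rope_angle_cache(positions, head_dim, theta_base=10000.0):
    """Compute (cos, sin) for every position and rotation pair, once.

    Hypothetical helper: the result depends only on positions and constants,
    so it can be shared by all layers.
    """
    half = head_dim // 2
    inv_freq = [theta_base ** (-2.0 * i / head_dim) for i in range(half)]
    return [[(math.cos(p * f), math.sin(p * f)) for f in inv_freq]
            for p in positions]

def apply_rope_norm(vec, cos_sin):
    """Rotate adjacent pairs (x[2i], x[2i+1]); assumed 'norm' layout."""
    out = list(vec)
    for i, (c, s) in enumerate(cos_sin):
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def apply_rope_neox(vec, cos_sin):
    """Rotate split-half pairs (x[i], x[i + n/2]); assumed 'neox' layout."""
    half = len(cos_sin)
    out = list(vec)
    for i, (c, s) in enumerate(cos_sin):
        x0, x1 = vec[i], vec[i + half]
        out[i] = x0 * c - x1 * s
        out[i + half] = x0 * s + x1 * c
    return out

def rope_all_layers(layers, positions, head_dim, rotate=apply_rope_norm):
    """Apply RoPE to the Q and K rows of every layer with one shared cache.

    layers: list of (q_rows, k_rows), each holding one vector per token.
    """
    cache = rope_angle_cache(positions, head_dim)  # computed once, not per layer
    out = []
    for q_rows, k_rows in layers:
        q = [rotate(row, cs) for row, cs in zip(q_rows, cache)]
        k = [rotate(row, cs) for row, cs in zip(k_rows, cache)]
        out.append((q, k))
    return out
```

Without the cache, each layer would re-evaluate the same cosines and sines for Q and again for K; with it, the trigonometric work is done once per batch of positions.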


225 Commits
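
Note on commit caf9759c97 (#852, "Fuse add + fused_rms_norm"): the message describes a pattern found in many models, where a layer's result is added to the layer input and the sum is then RMS-normalized for the next layer. The Python sketch below contrasts an unfused reference with a fused variant that accumulates the sum of squares while forming the residual sum. It illustrates the general technique only, not the CUDA kernel in the repository. The function names and the eps default are assumptions.

```python
import math

def add_then_rms_norm(x, residual, weight, eps=1e-6):
    """Unfused reference: build the sum, then make a separate pass to normalize."""
    summed = [a + b for a, b in zip(x, residual)]
    mean_sq = sum(v * v for v in summed) / len(summed)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return summed, [v * scale * w for v, w in zip(summed, weight)]

def fused_add_rms_norm(x, residual, weight, eps=1e-6):
    """Fused variant: the add and the sum of squares share one loop.

    Returns both the sum (input to the next residual add) and its
    normalized form (input to the next layer).
    """
    n = len(x)
    summed = [0.0] * n
    sum_sq = 0.0
    for i in range(n):
        v = x[i] + residual[i]
        summed[i] = v
        sum_sq += v * v  # accumulated while the sum is still at hand
    scale = 1.0 / math.sqrt(sum_sq / n + eps)
    normed = [summed[i] * scale * weight[i] for i in range(n)]
    return summed, normed
```

On a GPU the benefit of this kind of fusion typically comes from launching one kernel instead of two and avoiding a second read of the intermediate sum. The Python version only shows that the two formulations compute the same values up to floating-point rounding.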