* server: enable checkpoint for recurrent models
create checkpoint after cancel
fix ban string and rm context during rewind
add checkpoint interval
only save recurrent cache
* save checkpoint during pp
---------
Co-authored-by: firecoperana <firecoperana>
* wip: port MTP architecture
Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.
Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.
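The draft/verify cycle these changes enable can be sketched as follows. This is an illustrative stand-in, not the actual ikllama API: `accept_draft`, `accept_result`, and the argmax-comparison rule are assumptions made for the example; only the general shape (drafted tokens verified by the target model, with the cache position advanced by the accepted prefix) comes from the commits.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch of speculative acceptance: the target model has
// re-scored the drafted tokens; we keep the longest prefix where the
// draft agrees with the target's own argmax, then advance n_past.
// Names mirror llama.cpp style but the function itself is illustrative.
using llama_token = int;

struct accept_result {
    size_t n_accepted; // length of the agreed prefix
    size_t n_past;     // new cache position after acceptance
};

accept_result accept_draft(const std::vector<llama_token> & drafted,
                           const std::vector<llama_token> & target_argmax,
                           size_t n_past) {
    size_t n = 0;
    while (n < drafted.size() && n < target_argmax.size() &&
           drafted[n] == target_argmax[n]) {
        ++n;
    }
    // the target model's own sample at position n is always kept, so the
    // cache advances by the accepted prefix plus one token
    return { n, n_past + n + 1 };
}
```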
* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).
* core: enable hybrid outputs (logits + embeddings) for MTP support
* fix(mtp): correct KV-cache slot finding for updates
* fix(mtp): persist hidden states to prevent context corruption during drafting
* refactor(mtp): clean unused code
* fix(mtp): update server to new function names

* fix(mtp): fix graph and save hidden state
* mtp: refactor integration, context params and kv cache search
* mtp: fix hidden state extraction and speculative acceptance flow
* server: fix MTP warmup for long prompts and reset token buffer
* llama: refactor MTP operation state to context parameters
* server: fix n_past calculation in MTP acceptance
* llama: fix mtp enable flags
* speculative: refactor MTP to use common_speculative interface
* context: remove unused signatures
* clip: fix deprecated enum-enum conversion warning
* common: fix format string crash in help message
* context: fix mtp activation logic
* Optimizing q3next TG
* Fused add -> softplus -> mul on CUDA
* Remove forgotten debug log
* Increase ggml context size
Required for Qwen3-Next with batch/u-batch size of 4096
* WIP
* Avoid some contiguous ops
* Avoid some repeats
* Avoid some more repeats
* qwen3next: add architecture support and recurrent-state fixes
* qwen3next: optimize broadcast sub and single-seq ssm conv
* cuda: build MoE row mapping on device in mul_mat_id
* cuda: add guarded multi-seq fast path for ssm_conv
* docs: update qwen3next perf report for cuda MoE/SSM tuning
* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval
* qwen3next: split cpu/cuda eval builds and tune PP scheduling
* qwen3next: harden seq-state flow and support optional dense FFN layers
* qwen3next: trim delta-net graph overhead in chunking path
* qwen3next: remove redundant v_conv cont in delta path
* qwen3next: avoid extra cont on linear attention output
* qwen3next: drop redundant cont before recurrent state flatten
* qwen3next: keep recurrent state in 4d layout through delta path
* qwen3next: add fused delta-net op and wire model path
* tests: add backend-op coverage for ggml_delta_net
* qwen3next: add runtime switch for fused delta-net path
* docs: refresh qwen3next perf review and benchmark matrix
* qwen3next: default fused delta-net off and document quality checks
* qwen3next: add decode-only fused delta mode
* qwen3next: make fused delta safe by default and fix fused tensor layout
* qwen3next: warn when forcing fused decode mode
* qwen3next: add fused-delta regression runner script
* qwen3next: integrate fused regression into eval harness
* qwen3next: clean up chunked delta-net shape handling
* qwen3next: add absolute sanity guards to fused regression
* qwen3next: add unified regression runner script
* qwen3next: disable flash-attn for cpu-only contexts
* docs: reconcile qwen3next status and remaining upstream gaps
* common: add qwen3next fused-delta runtime flag
* cuda: add qwen3next delta-net kernel dispatch override
* docs: update qwen3next quality and serving baseline findings
* qwen3next: keep fused delta on safe path and remove PR artifacts
* qwen3next: align autoregressive delta-net decode layout
* Revert "qwen3next: align autoregressive delta-net decode layout"
This reverts commit 9241164a5e.
* cuda: port solve-tri fast-paths for qwen3next delta-net
* qwen3next: add fused-delta runtime flag and drop env toggle
* qwen3next: make fused delta single-flag and default on
* Account for GPU arch differences
* Revert "cuda: build MoE row mapping on device in mul_mat_id"
This reverts commit 89e9ecfa84.
* qwen3next: drop non-essential MoE scheduling and split heuristics
* qwen3next: avoid generic ggml_sub broadcast changes
* llama: restore only_active_experts log message
* Remove unnecessary hacks, disable fusion for now.
* qwen3next: port hybrid recurrent state memory semantics
* qwen3next: clean up recurrent state slot plumbing
* qwen3next: fix hybrid V-cache layout plumbing
* qwen3next: guard recurrent state slots against kv capacity
* qwen3next: persist recurrent state in session data
- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches
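The version bump that accompanies the new serialized field can be sketched like this. The constants, helper names, and buffer layout here are made up for the example; only the idea (write the recurrent state after a bumped version number, reject older files on read) comes from the commit above.

```cpp
#include <cstdint>
#include <cstring>
#include <stdexcept>
#include <vector>

// Illustrative versioned state round-trip: the recurrent state tensor
// (cache.s_l in the commit) is written after a bumped version number,
// and the reader rejects files with any other version.
static constexpr uint32_t SESSION_VERSION = 2; // bumped for the new field

void write_state(std::vector<uint8_t> & out, const std::vector<float> & s_l) {
    auto put = [&](const void * p, size_t n) {
        const uint8_t * b = static_cast<const uint8_t *>(p);
        out.insert(out.end(), b, b + n);
    };
    uint32_t ver = SESSION_VERSION;
    uint64_t n   = s_l.size();
    put(&ver, sizeof ver);
    put(&n,   sizeof n);
    put(s_l.data(), n * sizeof(float));
}

std::vector<float> read_state(const std::vector<uint8_t> & in) {
    size_t off = 0;
    auto get = [&](void * p, size_t n) {
        std::memcpy(p, in.data() + off, n);
        off += n;
    };
    uint32_t ver; get(&ver, sizeof ver);
    if (ver != SESSION_VERSION) throw std::runtime_error("bad session version");
    uint64_t n; get(&n, sizeof n);
    std::vector<float> s_l(n);
    get(s_l.data(), n * sizeof(float));
    return s_l;
}
```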
* qwen3next: drop unused fused-delta builder path
- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member
* qwen3next: remove unused fused-delta CLI/context plumbing
- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init
* ggml: remove unused DELTA_NET operator stack
* Missing include
* Reorder ops/unary ops
So we don't change the enum values of the mul mat ops again
* Minor
* Discard unnecessary changes in llama-build-context.cpp
* Minor
* Revert "Discard unnecessary changes in llama-build-context.cpp"
This reverts commit edadb80ed6.
* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches
* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next
* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next
It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.
Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.
* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next
For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.
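The blocked split described above can be sketched as follows. The 1024-element block size comes from the commit text; the round-robin distribution over threads and the function itself are illustrative, not the actual ggml implementation.

```cpp
#include <cstddef>
#include <thread>
#include <vector>

// Sketch of the blocked split: a single large row is cut into fixed-size
// blocks and the blocks are distributed round-robin over the threads,
// instead of handing the whole row to one thread.
void scale_blocked(float * x, size_t n, float s, int n_threads) {
    constexpr size_t kBlock = 1024;
    const size_t n_blocks = (n + kBlock - 1) / kBlock;
    std::vector<std::thread> workers;
    for (int t = 0; t < n_threads; ++t) {
        workers.emplace_back([=] {
            for (size_t b = t; b < n_blocks; b += n_threads) {
                const size_t beg = b * kBlock;
                const size_t end = beg + kBlock < n ? beg + kBlock : n;
                for (size_t i = beg; i < end; ++i) x[i] *= s;
            }
        });
    }
    for (auto & w : workers) w.join();
}
```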
* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next
* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next
* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512
* Multithreading for OP_SUB
* Don't commit with timing trace on
* Multithread neg and sigmoid
* Be able to turn on/off fusion more easily (CPU)
* Name the mul_mat ops so we know where the time goes
* WIP
* Much better PP on CUDA
* CUDA: fuse transpose -> cont -> sum_rows -> transpose
Needs a non-contiguous variant of sum_rows.
On the CPU this gave a 30+% improvement in TG performance;
on CUDA it is a disappointing 6-7%. I guess this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
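The non-contiguous sum_rows this fusion needs can be sketched in a minimal form: each logical row has `ne` elements spaced `nb_el` floats apart, as they would be in a transposed view, so the transpose/cont pair can be skipped entirely. The function and its parameter names are illustrative, not the actual ggml kernel.

```cpp
#include <cstddef>

// Minimal strided row-sum: sums n_rows logical rows of ne elements each,
// where consecutive elements of a row are nb_el floats apart and
// consecutive rows start nb_row floats apart. With nb_el > 1 this reads
// a transposed view directly, without materializing a contiguous copy.
void sum_rows_strided(const float * src, float * dst,
                      size_t n_rows, size_t ne, size_t nb_row, size_t nb_el) {
    for (size_t r = 0; r < n_rows; ++r) {
        float acc = 0.0f;
        const float * p = src + r * nb_row;
        for (size_t i = 0; i < ne; ++i) {
            acc += p[i * nb_el];
        }
        dst[r] = acc;
    }
}
```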
* CUDA: faster mul for special case relevant for Qwen3Next
Worth 1% in TG
* Fix CPU OP_CONT
---------
Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
* WIP
* This works but is slow
* Turn off the up / gate clamps for now
* OK we need the clamping
* Fuse the clamp (CUDA)
* Fuse the clamp (CPU)
* WIP
* Be able to use merged q, k, v
* Be able to use merged up/gate experts
* Fuse the clamp (CUDA mmvq)
* WIP: graph parallel for Step-3.5
* WIP
* This should be it
* Cleanup
* Fix merge
* Not working attempt to extend fused_mul_unary to the Step-3.5 case
* It works now, but performance gain is very minor
* Fix graph parallel when ngl < n_layers
* Fix using ffn_norm
When using graph parallel with ngl < n_layers, the ffn_norm tensor
may have ended up being split, while the ffn tensors are on the CPU.
In that case we get a crash because we attempt to use the not-split
buffer of ffn_norm, which is invalid. This commit fixes that.
* Cleanup
* Copy reduce result to other GPUs if necessary
* Avoid ggml_get_rows for TG
* For the output ops use the result of the split that ran on the main GPU
* More models
* Add ability to merge up+gate exps to more models
* We of course need to pass the merged tensor to build_ffn
* All the others
* Also Qwen3VL-MoE
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* WIP - not working
* WIP - GPT-OSS working
However, the approach is very clumsy. The only way I could correctly
repack the up/gate experts is to copy up and gate into host buffers,
repack into another host buffer, and copy back into the
ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.
My attempts to do this via a compute graph on the backend holding
the tensors was unsuccessful.
For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.
Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.
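The host-side repack described above can be sketched like this. Only the staging idea (copy up and gate to host, interleave into one merged buffer, copy back) comes from the commit; the row-interleaved layout, the function, and its names are assumptions made for the example.

```cpp
#include <cstddef>
#include <cstring>
#include <vector>

// Illustrative host-side repack: the up and gate expert matrices are
// staged in host memory and interleaved row-by-row into one merged
// buffer (gate row, then up row, per logical row), which a fused
// up/gate matmul kernel could then consume in one pass. The interleaved
// layout is an assumption for illustration.
std::vector<float> repack_up_gate(const std::vector<float> & up,
                                  const std::vector<float> & gate,
                                  size_t n_rows, size_t row_size) {
    std::vector<float> merged(2 * n_rows * row_size);
    for (size_t r = 0; r < n_rows; ++r) {
        std::memcpy(&merged[(2*r + 0) * row_size], &gate[r * row_size],
                    row_size * sizeof(float));
        std::memcpy(&merged[(2*r + 1) * row_size], &up[r * row_size],
                    row_size * sizeof(float));
    }
    return merged;
}
```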
* WIP
* WIP - Qwen3-MoE (and hopefully all others) working
But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.
* WIP: TG seems to be working
* Minor
* Add command line option to merge experts up/gate
* Add merge up/gate command line parameter to llama-bench
* Turn off merge_up_gate_exps if split mode graph
It is not yet implemented
* When no bias, allow merging up/gate with tensor overrides
* Arghh, we need to increase the context size again
* Cleanup
* Mimo-2 support
* Fix bug for head sizes not being the same
It still does not solve the Mimo-2 quantized cache issue.
* Fix quantized cache
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP: absorb adding input into std_attn and std_ffn
* WIP: NCCL infra
* WIP: add reduce and fake_cpy ops
* WIP
* WIP: graph appears to work, layer is broken
* WIP: Qwen3-MoE works with graph, layer still broken
* WIP: GLM-4.5 graph works
* WIP: fix sm layer (dense)
* WIP: fix sm layer (MoE)
* WIP: fast PP with bespoke 4-GPU NCCL
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).
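The four pairwise groups (0,1), (2,3), (0,2), (1,3) amount to a two-stage recursive-doubling all-reduce. The simulation below is a plain C++ stand-in for the NCCL calls, added to show why every rank ends up with the full sum; it is not the actual implementation.

```cpp
#include <cstddef>
#include <vector>

// Simulation of the pairwise scheme for 4 ranks: stage 1 sums within
// pairs (0,1) and (2,3) (partner = rank ^ 1); stage 2 sums across pairs
// (0,2) and (1,3) (partner = rank ^ 2). After both stages every rank
// holds the total - the classic recursive-doubling all-reduce that the
// four communicator groups implement.
void allreduce_pairs(std::vector<float> & vals) {
    for (int stage : {1, 2}) {
        std::vector<float> next(vals.size());
        for (size_t r = 0; r < vals.size(); ++r) {
            next[r] = vals[r] + vals[r ^ stage];
        }
        vals = next;
    }
}
```

This generalizes to 2^k GPUs with k stages (partner = rank ^ 2^s at stage s), which is one way around hand-writing pairwise groups for every GPU count.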
* WIP: Cohere2
* Explicitly set device
* Bespoke 3-GPU case
* WIP
* Do not repeat get_rows multiple times
* Fix 3 GPUs
* OK, let's leave it in
* Simple async
* This sync seems enough
* Only do async for 4 or more backends
With 2 GPUs (so, 3 backends) not using async is slightly faster
* Scheduler changes
* Use OpenMP if available
Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!
* Do not use OpenMP if there are tensor overrides
* Set omp max active levels
* Be more careful with having set the device before using a stream
* Command line option to turn on async. Set to false by default for now
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused_norm - same idea as fused_rms_norm
* Avoid computing the attention reduce op for cohere2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Implement the reduce op without NCCL available
* Be able to build without NCCL
cmake -DGGML_NCCL=OFF disables it
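A build guard of roughly this shape would implement the switch. Only the `GGML_NCCL` option name comes from the commit message; the `GGML_USE_NCCL` define, the `find_package` module, and the target names are assumptions made for the sketch.

```cmake
# Sketch: -DGGML_NCCL=OFF skips the NCCL lookup, so the reduce op falls
# back to the non-NCCL path (assumed here to be guarded by GGML_USE_NCCL).
option(GGML_NCCL "ggml: use NCCL for multi-GPU reduce" ON)

if (GGML_NCCL)
    find_package(NCCL)          # assumes a FindNCCL module is available
    if (NCCL_FOUND)
        target_compile_definitions(ggml PRIVATE GGML_USE_NCCL)
        target_link_libraries(ggml PRIVATE ${NCCL_LIBRARIES})
    endif()
endif()
```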
* Make --max-gpu work again
* Slightly better for 4 GPUs without NCCL
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Use -smgs or --split-mode-graph-scheduling on the CLI to bypass the disabling of split-mode graph scheduling when tensor overrides are used.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* This works and TG is decent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Rearrange graph nodes
So that we can do graph portions that are the same on 2 or more
GPUs at the same time.
* Separate graph compute implementation for split mode graph
* This is better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>