Commit Graph

533 Commits

Author SHA1 Message Date
yurko
eef360a85f cuda: add qwen3next delta-net kernel dispatch override 2026-02-08 14:38:30 -08:00
yurko
6dd990d15a qwen3next: add fused delta-net op and wire model path 2026-02-07 14:32:16 -08:00
Yurko
e64b43392f cuda: reduce qwen3next moe/ssm sync overhead and refresh eval 2026-02-06 14:46:59 +00:00
yurko
236633af99 cuda: add guarded multi-seq fast path for ssm_conv 2026-02-06 13:52:54 +00:00
yurko
89e9ecfa84 cuda: build MoE row mapping on device in mul_mat_id 2026-02-06 13:52:33 +00:00
yurko
9fbb50481e qwen3next: optimize broadcast sub and single-seq ssm conv 2026-02-06 12:50:43 +00:00
yurko
a7df116441 qwen3next: add architecture support and recurrent-state fixes 2026-02-06 12:13:09 +00:00
Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196)
* Apply offset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use CUDA graphs also when there are tensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use CUDA graphs also when there are tensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a bug somewhere

* This finally works
2026-01-18 08:02:49 +02:00
Kawrakow
709e1a5375 Fixing split mode graph with many GPUs (#1152)
* Attempt to fix the many GPU issue in split mode graph

* WIP: this seems more stable

Still hanging after a while if I try to use all 7 GPUs

* Reenable OpenMP in scheduler async

Seems solid up to 4 GPUs. It did hang with --max-gpu 6.

* printf cleanup
2026-01-17 08:05:24 +02:00
Kawrakow
c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, it is done in an extremely stupid way. The only way I could
correctly repack the up/gate experts is to copy up and gate into host
buffers, repack into another host buffer, and copy the result back into
the ffn_up_gate_exps tensor (a host-side sketch follows this entry).
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
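The repack described in commit c03c2d7cc6 above can be illustrated with a minimal host-side sketch. This is not the repository's implementation: it only assumes the public ggml backend API (ggml_backend_tensor_get/set, ggml_row_size, ggml_nrows) and, purely for illustration, that the merged tensor interleaves one up row with the matching gate row; the helper name merge_up_gate_host is hypothetical.

```cpp
// Illustrative sketch only (not the repository's code): copy up and gate
// into host buffers, repack into a third host buffer, copy back into the
// merged tensor. Layout (up row i followed by gate row i) is an assumption.
#include <cstdint>
#include <cstring>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

static void merge_up_gate_host(ggml_tensor * up, ggml_tensor * gate, ggml_tensor * up_gate) {
    const size_t  row_size = ggml_row_size(up->type, up->ne[0]);
    const int64_t n_rows   = ggml_nrows(up);

    std::vector<uint8_t> h_up  (ggml_nbytes(up));
    std::vector<uint8_t> h_gate(ggml_nbytes(gate));
    std::vector<uint8_t> h_out (ggml_nbytes(up_gate));

    // 1. pull both experts tensors from their backend into host memory
    ggml_backend_tensor_get(up,   h_up.data(),   0, h_up.size());
    ggml_backend_tensor_get(gate, h_gate.data(), 0, h_gate.size());

    // 2. repack into the third host buffer: up row i, then gate row i
    for (int64_t i = 0; i < n_rows; ++i) {
        memcpy(h_out.data() + (2*i + 0)*row_size, h_up.data()   + i*row_size, row_size);
        memcpy(h_out.data() + (2*i + 1)*row_size, h_gate.data() + i*row_size, row_size);
    }

    // 3. push the repacked data back into the merged up/gate tensor
    ggml_backend_tensor_set(up_gate, h_out.data(), 0, h_out.size());
}
```

Because every byte makes a device → host → device round trip, the cost scales with the full size of the experts tensors, which is the "very slow for giant 500 GB models" concern raised in the commit message.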
Kawrakow
c7348f6f55 Fix mla = 0 (#1130)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-10 10:34:30 +02:00
firecoperana
c03ee1a4d2 server: improve speed of speculative decoding (#1119)
* server: improve speed of speculative decoding

change logs

rpc: add recompute

spec dec fix

* Fix n_batch_size not set to context size for draft model

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
Kawrakow
8725d110d2 Fix data races in the reduce op (#1124)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-09 10:34:58 +02:00
Kawrakow
0456aa47d3 Do not abort on NCCL initialization failure (#1120)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 09:19:50 +02:00
firecoperana
9c1bef35e8 CUDA: compress-mode size (#1110)
Co-authored-by: firecoperana <firecoperana>
2026-01-07 18:33:17 +02:00
Kawrakow
a82dcbf3ee Fix ring reduction (#1114)
* Fix ring reduction

* Actually enable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 08:01:31 +02:00
Kawrakow
54a513768c Disable ring reduction for now (#1112)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-06 15:40:50 +02:00
Kawrakow
419a397ce0 Graph parallel for Mimo-V2-Flash (#1105)
* WIP

* Cleanup

* Set max_gpu to 2 for Mimo2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 09:58:54 +02:00
Kawrakow
385fc14110 Fix race in CUDA FA for head sizes 192/128 (#1104)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:21:07 +02:00
Kawrakow
ab50c6cdcb Mimo-V2-Flash support (#1096)
* Mimo-2 support

* Fix bug for head sizes not being the same

It still does not solve the Mimo-2 quantized cache issue.

* Fix quantized cache

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:00:01 +02:00
firecoperana
56dceefd6b Fix windows build with CUDA (#1101)
Co-authored-by: firecoperana <firecoperana>
2026-01-05 07:59:23 +02:00
Kawrakow
17a5a80946 Fix Windows build (#1097) 2025-12-29 14:18:27 +01:00
Kawrakow
519405dc97 Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
((0,1), (2,3), (0,2), (1,3)) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs);
a sketch of this pairwise setup follows this entry.

* WIP: Cohere2

* Explicitly set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so, 3 backends) not using async is slightly faster

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels

* Be more careful with having set the device before using a stream

* Command line option to turn on async. Set to false by default for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00
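The pairwise workaround described in commit 519405dc97 above can be sketched as follows. This is an illustrative sketch only, assuming 4 visible CUDA devices and the stock NCCL API; it is not taken from the repository, and the error handling is purely for demonstration.

```cpp
// Illustrative sketch (not the repository's code): create the four pairwise
// NCCL communicator groups (0,1), (2,3), (0,2), (1,3) instead of one 4-GPU clique.
#include <nccl.h>
#include <cstdio>

int main() {
    const int pairs[4][2] = { {0, 1}, {2, 3}, {0, 2}, {1, 3} };
    ncclComm_t comms[4][2];   // one communicator per device per pair

    for (int p = 0; p < 4; ++p) {
        // ncclCommInitAll builds a communicator clique over the listed devices.
        if (ncclCommInitAll(comms[p], 2, pairs[p]) != ncclSuccess) {
            fprintf(stderr, "failed to create communicator pair %d\n", p);
            return 1;
        }
    }

    // Reductions would then be issued as two concurrent intra-pair all-reduces
    // over (0,1) and (2,3), followed by a cross-pair exchange over (0,2) and
    // (1,3), instead of a single 4-way all-reduce.

    for (int p = 0; p < 4; ++p) {
        for (int r = 0; r < 2; ++r) {
            ncclCommDestroy(comms[p][r]);
        }
    }
    return 0;
}
```

As the commit message itself concedes, hard-coding device pairs does not generalize to arbitrary GPU counts, which is why it is described as a workaround rather than a solution.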
Kawrakow
7146de451d Be more careful with having set the device before using a stream (#1093)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 19:19:41 +01:00
Kawrakow
8687fca3ff Graph parallel: better PP performance for 3 and more GPUs (#1092)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-26 17:35:27 +01:00