Commit Graph

4185 Commits

Author SHA1 Message Date
yurko
343e335ff0 qwen3next: warn when forcing fused decode mode 2026-02-08 00:08:33 -08:00
yurko
64099e71c0 qwen3next: make fused delta safe by default and fix fused tensor layout 2026-02-08 00:06:29 -08:00
yurko
143e88ae77 qwen3next: add decode-only fused delta mode 2026-02-07 23:05:19 -08:00
yurko
9930f4d961 qwen3next: default fused delta-net off and document quality checks 2026-02-07 22:56:51 -08:00
yurko
81e788e2f6 docs: refresh qwen3next perf review and benchmark matrix 2026-02-07 17:31:17 -08:00
yurko
b33cef68ad qwen3next: add runtime switch for fused delta-net path 2026-02-07 17:31:17 -08:00
yurko
ed0565f801 tests: add backend-op coverage for ggml_delta_net 2026-02-07 14:34:56 -08:00
yurko
6dd990d15a qwen3next: add fused delta-net op and wire model path 2026-02-07 14:32:16 -08:00
yurko
5a6c4e8da5 qwen3next: keep recurrent state in 4d layout through delta path 2026-02-07 14:00:09 -08:00
yurko
de5bf44e8c qwen3next: drop redundant cont before recurrent state flatten 2026-02-07 13:45:37 -08:00
yurko
43edfa237b qwen3next: avoid extra cont on linear attention output 2026-02-07 13:30:29 -08:00
yurko
0e3891b348 qwen3next: remove redundant v_conv cont in delta path 2026-02-07 13:25:34 -08:00
yurko
a1163d0b68 qwen3next: trim delta-net graph overhead in chunking path 2026-02-07 13:21:02 -08:00
yurko
fffd27e3c8 qwen3next: harden seq-state flow and support optional dense FFN layers 2026-02-07 13:12:26 -08:00
yurko
6db8dc86ca qwen3next: split cpu/cuda eval builds and tune PP scheduling 2026-02-06 19:28:17 -08:00
Yurko
e64b43392f cuda: reduce qwen3next moe/ssm sync overhead and refresh eval 2026-02-06 14:46:59 +00:00
yurko
c767cfa1d3 docs: update qwen3next perf report for cuda MoE/SSM tuning 2026-02-06 13:52:54 +00:00
yurko
236633af99 cuda: add guarded multi-seq fast path for ssm_conv 2026-02-06 13:52:54 +00:00
yurko
89e9ecfa84 cuda: build MoE row mapping on device in mul_mat_id 2026-02-06 13:52:33 +00:00
yurko
9fbb50481e qwen3next: optimize broadcast sub and single-seq ssm conv 2026-02-06 12:50:43 +00:00
yurko
a7df116441 qwen3next: add architecture support and recurrent-state fixes 2026-02-06 12:13:09 +00:00
Kawrakow
a527b5af25 Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy
Better long-context CPU performance
2026-02-02 10:58:01 +02:00
Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
33308908db Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size
Reduce CUDA compute buffer size for mla=3
2026-01-31 14:24:14 +02:00
Kawrakow
b85a2a50d5 Reduce compute buffer size for mla=3 2026-01-31 10:43:05 +00:00
Kawrakow
373f043d41 Merge pull request #1208 from ikawrakow/ik/try_fix_1201 2026-01-30 23:12:07 +02:00
Kawrakow
4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow
098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
686fd1ebec Use standard output calculation for MiniMax-M2 graph parallel (#1199) 2026-01-29 09:06:40 +02:00
Kawrakow
f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196)
* Apply offset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
30381fc1fc Faster hybrid inference with shared experts (#1191) 2026-01-26 07:22:05 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
28f8320f3a Much faster rng sampling (#1187) 2026-01-25 09:11:27 +02:00
Kawrakow
04beeffa4e Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (#1183)
* Similar hack to #1182 for GLM-4.5/6/7

* Refinements

* Disable when the KV cache is not f16
2026-01-24 09:39:29 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use CUDA graphs also when there are tensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
573e23679d sweep_bench: set number of repetitions (#1176) 2026-01-22 12:28:30 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use CUDA graphs also when there are tensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00