Commit Graph

4194 Commits

Author SHA1 Message Date
yurko eef360a85f cuda: add qwen3next delta-net kernel dispatch override 2026-02-08 14:38:30 -08:00
yurko b5c9554a88 common: add qwen3next fused-delta runtime flag 2026-02-08 01:15:38 -08:00
yurko bd0dd7804b docs: reconcile qwen3next status and remaining upstream gaps 2026-02-08 01:12:40 -08:00
yurko 627d46912c qwen3next: disable flash-attn for cpu-only contexts 2026-02-08 01:04:38 -08:00
yurko a822db6f18 qwen3next: add unified regression runner script 2026-02-08 01:02:40 -08:00
yurko 691df60037 qwen3next: add absolute sanity guards to fused regression 2026-02-08 00:54:14 -08:00
yurko 670434ea8e qwen3next: clean up chunked delta-net shape handling 2026-02-08 00:49:37 -08:00
yurko 55270b0f98 qwen3next: integrate fused regression into eval harness 2026-02-08 00:40:55 -08:00
yurko 44db3947a1 qwen3next: add fused-delta regression runner script 2026-02-08 00:13:18 -08:00
yurko 343e335ff0 qwen3next: warn when forcing fused decode mode 2026-02-08 00:08:33 -08:00
yurko 64099e71c0 qwen3next: make fused delta safe by default and fix fused tensor layout 2026-02-08 00:06:29 -08:00
yurko 143e88ae77 qwen3next: add decode-only fused delta mode 2026-02-07 23:05:19 -08:00
yurko 9930f4d961 qwen3next: default fused delta-net off and document quality checks 2026-02-07 22:56:51 -08:00
yurko 81e788e2f6 docs: refresh qwen3next perf review and benchmark matrix 2026-02-07 17:31:17 -08:00
yurko b33cef68ad qwen3next: add runtime switch for fused delta-net path 2026-02-07 17:31:17 -08:00
yurko ed0565f801 tests: add backend-op coverage for ggml_delta_net 2026-02-07 14:34:56 -08:00
yurko 6dd990d15a qwen3next: add fused delta-net op and wire model path 2026-02-07 14:32:16 -08:00
yurko 5a6c4e8da5 qwen3next: keep recurrent state in 4d layout through delta path 2026-02-07 14:00:09 -08:00
yurko de5bf44e8c qwen3next: drop redundant cont before recurrent state flatten 2026-02-07 13:45:37 -08:00
yurko 43edfa237b qwen3next: avoid extra cont on linear attention output 2026-02-07 13:30:29 -08:00
yurko 0e3891b348 qwen3next: remove redundant v_conv cont in delta path 2026-02-07 13:25:34 -08:00
yurko a1163d0b68 qwen3next: trim delta-net graph overhead in chunking path 2026-02-07 13:21:02 -08:00
yurko fffd27e3c8 qwen3next: harden seq-state flow and support optional dense FFN layers 2026-02-07 13:12:26 -08:00
yurko 6db8dc86ca qwen3next: split cpu/cuda eval builds and tune PP scheduling 2026-02-06 19:28:17 -08:00
Yurko e64b43392f cuda: reduce qwen3next moe/ssm sync overhead and refresh eval 2026-02-06 14:46:59 +00:00
yurko c767cfa1d3 docs: update qwen3next perf report for cuda MoE/SSM tuning 2026-02-06 13:52:54 +00:00
yurko 236633af99 cuda: add guarded multi-seq fast path for ssm_conv 2026-02-06 13:52:54 +00:00
yurko 89e9ecfa84 cuda: build MoE row mapping on device in mul_mat_id 2026-02-06 13:52:33 +00:00
yurko 9fbb50481e qwen3next: optimize broadcast sub and single-seq ssm conv 2026-02-06 12:50:43 +00:00
yurko a7df116441 qwen3next: add architecture support and recurrent-state fixes 2026-02-06 12:13:09 +00:00
Kawrakow a527b5af25 Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy 2026-02-02 10:58:01 +02:00
    Better long-context CPU performance
Kawrakow 685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow 2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow 33308908db Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size 2026-01-31 14:24:14 +02:00
    Reduce CUDA compute buffer size for mla=3
Kawrakow b85a2a50d5 Reduce compute buffer size for mla=3 2026-01-31 10:43:05 +00:00
Kawrakow 373f043d41 Merge pull request #1208 from ikawrakow/ik/try_fix_1201 2026-01-30 23:12:07 +02:00
Kawrakow 4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow 098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow 811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow 686fd1ebec Use standard output calculation for MiniMax-M2 graph parallel (#1199) 2026-01-29 09:06:40 +02:00
Kawrakow f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow 02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196) 2026-01-29 07:27:53 +02:00
    * Apply offset to KQ_max in CUDA flash attention
    * Forgot to add to fattn-common.h
Kawrakow 68ed62447c Split mode graph for Minimax-M2 (#1195) 2026-01-29 07:27:06 +02:00
    * Split mode graph for Minimax-M2
    * Cleanup
    * Forgotten ffn_exp_probs_b
Kawrakow 68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193) 2026-01-28 10:27:14 +02:00
    * This seems much better for GQA = 12 TG
    * Remove unused arguments
Kawrakow 69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow 65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192) 2026-01-26 13:45:06 +02:00
    * Better FA for GLM-4.7-Flash
    * Adjust ncols for ADA_LOVELACE or better
Kawrakow 30381fc1fc Faster hybrid inference when shared experts (#1191) 2026-01-26 07:22:05 +02:00
Kawrakow 478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190) 2026-01-26 07:21:47 +02:00
    * This works
    * Make quantized KV cache work
    * Remove the glm45 graph building changes
    * Add condition
Kawrakow 28f8320f3a Much faster rng sampling (#1187) 2026-01-25 09:11:27 +02:00