Commit Graph

4194 Commits

Author SHA1 Message Date
yurko eef360a85f cuda: add qwen3next delta-net kernel dispatch override 2026-02-08 14:38:30 -08:00
yurko b5c9554a88 common: add qwen3next fused-delta runtime flag 2026-02-08 01:15:38 -08:00
yurko bd0dd7804b docs: reconcile qwen3next status and remaining upstream gaps 2026-02-08 01:12:40 -08:00
yurko 627d46912c qwen3next: disable flash-attn for cpu-only contexts 2026-02-08 01:04:38 -08:00
yurko a822db6f18 qwen3next: add unified regression runner script 2026-02-08 01:02:40 -08:00
yurko 691df60037 qwen3next: add absolute sanity guards to fused regression 2026-02-08 00:54:14 -08:00
yurko 670434ea8e qwen3next: clean up chunked delta-net shape handling 2026-02-08 00:49:37 -08:00
yurko 55270b0f98 qwen3next: integrate fused regression into eval harness 2026-02-08 00:40:55 -08:00
yurko 44db3947a1 qwen3next: add fused-delta regression runner script 2026-02-08 00:13:18 -08:00
yurko 343e335ff0 qwen3next: warn when forcing fused decode mode 2026-02-08 00:08:33 -08:00
yurko 64099e71c0 qwen3next: make fused delta safe by default and fix fused tensor layout 2026-02-08 00:06:29 -08:00
yurko 143e88ae77 qwen3next: add decode-only fused delta mode 2026-02-07 23:05:19 -08:00
yurko 9930f4d961 qwen3next: default fused delta-net off and document quality checks 2026-02-07 22:56:51 -08:00
yurko 81e788e2f6 docs: refresh qwen3next perf review and benchmark matrix 2026-02-07 17:31:17 -08:00
yurko b33cef68ad qwen3next: add runtime switch for fused delta-net path 2026-02-07 17:31:17 -08:00
yurko ed0565f801 tests: add backend-op coverage for ggml_delta_net 2026-02-07 14:34:56 -08:00
yurko 6dd990d15a qwen3next: add fused delta-net op and wire model path 2026-02-07 14:32:16 -08:00
yurko 5a6c4e8da5 qwen3next: keep recurrent state in 4d layout through delta path 2026-02-07 14:00:09 -08:00
yurko de5bf44e8c qwen3next: drop redundant cont before recurrent state flatten 2026-02-07 13:45:37 -08:00
yurko 43edfa237b qwen3next: avoid extra cont on linear attention output 2026-02-07 13:30:29 -08:00
yurko 0e3891b348 qwen3next: remove redundant v_conv cont in delta path 2026-02-07 13:25:34 -08:00
yurko a1163d0b68 qwen3next: trim delta-net graph overhead in chunking path 2026-02-07 13:21:02 -08:00
yurko fffd27e3c8 qwen3next: harden seq-state flow and support optional dense FFN layers 2026-02-07 13:12:26 -08:00
yurko 6db8dc86ca qwen3next: split cpu/cuda eval builds and tune PP scheduling 2026-02-06 19:28:17 -08:00
Yurko e64b43392f cuda: reduce qwen3next moe/ssm sync overhead and refresh eval 2026-02-06 14:46:59 +00:00
yurko c767cfa1d3 docs: update qwen3next perf report for cuda MoE/SSM tuning 2026-02-06 13:52:54 +00:00
yurko 236633af99 cuda: add guarded multi-seq fast path for ssm_conv 2026-02-06 13:52:54 +00:00
yurko 89e9ecfa84 cuda: build MoE row mapping on device in mul_mat_id 2026-02-06 13:52:33 +00:00
yurko 9fbb50481e qwen3next: optimize broadcast sub and single-seq ssm conv 2026-02-06 12:50:43 +00:00
yurko a7df116441 qwen3next: add architecture support and recurrent-state fixes 2026-02-06 12:13:09 +00:00
Kawrakow a527b5af25 Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy 2026-02-02 10:58:01 +02:00
    Better long-context CPU performance
Kawrakow 685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow 2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow 33308908db Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size 2026-01-31 14:24:14 +02:00
    Reduce CUDA compute buffer size for mla=3
Kawrakow b85a2a50d5 Reduce compute buffer size for mla=3 2026-01-31 10:43:05 +00:00
Kawrakow 373f043d41 Merge pull request #1208 from ikawrakow/ik/try_fix_1201 2026-01-30 23:12:07 +02:00
Kawrakow 4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow 098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow 811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow 686fd1ebec Use standard output calculation for MiniMax-M2 graph parallel (#1199) 2026-01-29 09:06:40 +02:00
Kawrakow f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow 02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196) 2026-01-29 07:27:53 +02:00
    * Apply offset to KQ_max in CUDA flash attention
    * Forgot to add to fattn-common.h
Kawrakow 68ed62447c Split mode graph for Minimax-M2 (#1195) 2026-01-29 07:27:06 +02:00
    * Split mode graph for Minimax-M2
    * Cleanup
    * Forgotten ffn_exp_probs_b
Kawrakow 68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193) 2026-01-28 10:27:14 +02:00
    * This seems much better for GQA = 12 TG
    * Remove unused arguments
Kawrakow 69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow 65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192) 2026-01-26 13:45:06 +02:00
    * Better FA for GLM-4.7-Flash
    * Adjust ncols for ADA_LOVELACE or better
Kawrakow 30381fc1fc Faster hybrid inference when shared experts (#1191) 2026-01-26 07:22:05 +02:00
Kawrakow 478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190) 2026-01-26 07:21:47 +02:00
    * This works
    * Make quantized KV cache work
    * Remove the glm45 graph building changes
    * Add condition
Kawrakow 28f8320f3a Much faster rng sampling (#1187) 2026-01-25 09:11:27 +02:00