yurko | eef360a85f | cuda: add qwen3next delta-net kernel dispatch override | 2026-02-08 14:38:30 -08:00
yurko | b5c9554a88 | common: add qwen3next fused-delta runtime flag | 2026-02-08 01:15:38 -08:00
yurko | bd0dd7804b | docs: reconcile qwen3next status and remaining upstream gaps | 2026-02-08 01:12:40 -08:00
yurko | 627d46912c | qwen3next: disable flash-attn for cpu-only contexts | 2026-02-08 01:04:38 -08:00
yurko | a822db6f18 | qwen3next: add unified regression runner script | 2026-02-08 01:02:40 -08:00
yurko | 691df60037 | qwen3next: add absolute sanity guards to fused regression | 2026-02-08 00:54:14 -08:00
yurko | 670434ea8e | qwen3next: clean up chunked delta-net shape handling | 2026-02-08 00:49:37 -08:00
yurko | 55270b0f98 | qwen3next: integrate fused regression into eval harness | 2026-02-08 00:40:55 -08:00
yurko | 44db3947a1 | qwen3next: add fused-delta regression runner script | 2026-02-08 00:13:18 -08:00
yurko | 343e335ff0 | qwen3next: warn when forcing fused decode mode | 2026-02-08 00:08:33 -08:00
yurko | 64099e71c0 | qwen3next: make fused delta safe by default and fix fused tensor layout | 2026-02-08 00:06:29 -08:00
yurko | 143e88ae77 | qwen3next: add decode-only fused delta mode | 2026-02-07 23:05:19 -08:00
yurko | 9930f4d961 | qwen3next: default fused delta-net off and document quality checks | 2026-02-07 22:56:51 -08:00
yurko | 81e788e2f6 | docs: refresh qwen3next perf review and benchmark matrix | 2026-02-07 17:31:17 -08:00
yurko | b33cef68ad | qwen3next: add runtime switch for fused delta-net path | 2026-02-07 17:31:17 -08:00
yurko | ed0565f801 | tests: add backend-op coverage for ggml_delta_net | 2026-02-07 14:34:56 -08:00
yurko | 6dd990d15a | qwen3next: add fused delta-net op and wire model path | 2026-02-07 14:32:16 -08:00
yurko | 5a6c4e8da5 | qwen3next: keep recurrent state in 4d layout through delta path | 2026-02-07 14:00:09 -08:00
yurko | de5bf44e8c | qwen3next: drop redundant cont before recurrent state flatten | 2026-02-07 13:45:37 -08:00
yurko | 43edfa237b | qwen3next: avoid extra cont on linear attention output | 2026-02-07 13:30:29 -08:00
yurko | 0e3891b348 | qwen3next: remove redundant v_conv cont in delta path | 2026-02-07 13:25:34 -08:00
yurko | a1163d0b68 | qwen3next: trim delta-net graph overhead in chunking path | 2026-02-07 13:21:02 -08:00
yurko | fffd27e3c8 | qwen3next: harden seq-state flow and support optional dense FFN layers | 2026-02-07 13:12:26 -08:00
yurko | 6db8dc86ca | qwen3next: split cpu/cuda eval builds and tune PP scheduling | 2026-02-06 19:28:17 -08:00
Yurko | e64b43392f | cuda: reduce qwen3next moe/ssm sync overhead and refresh eval | 2026-02-06 14:46:59 +00:00
yurko | c767cfa1d3 | docs: update qwen3next perf report for cuda MoE/SSM tuning | 2026-02-06 13:52:54 +00:00
yurko | 236633af99 | cuda: add guarded multi-seq fast path for ssm_conv | 2026-02-06 13:52:54 +00:00
yurko | 89e9ecfa84 | cuda: build MoE row mapping on device in mul_mat_id | 2026-02-06 13:52:33 +00:00
yurko | 9fbb50481e | qwen3next: optimize broadcast sub and single-seq ssm conv | 2026-02-06 12:50:43 +00:00
yurko | a7df116441 | qwen3next: add architecture support and recurrent-state fixes | 2026-02-06 12:13:09 +00:00
Kawrakow | a527b5af25 | Merge pull request #1212 from ikawrakow/ik/better_cpu_fa_thread_strategy: Better long-context CPU performance | 2026-02-02 10:58:01 +02:00
Kawrakow | 685df0e69d | Work buffer size | 2026-01-31 16:10:23 +00:00
Kawrakow | 2bf2fa8ba4 | Better CPU FA thread strategy | 2026-01-31 15:46:16 +00:00
Kawrakow | 33308908db | Merge pull request #1211 from ikawrakow/ik/reduce_mla3_compute_buffer_size: Reduce CUDA compute buffer size for mla=3 | 2026-01-31 14:24:14 +02:00
Kawrakow | b85a2a50d5 | Reduce compute buffer size for mla=3 | 2026-01-31 10:43:05 +00:00
Kawrakow | 373f043d41 | Merge pull request #1208 from ikawrakow/ik/try_fix_1201 | 2026-01-30 23:12:07 +02:00
Kawrakow | 4d13ae03b5 | Also these other two places | 2026-01-30 15:36:29 +00:00
Kawrakow | 098b1a2e04 | Fix MiniMax-M2 KV-cache loading/saving | 2026-01-30 13:38:07 +00:00
Kawrakow | 811f8c3393 | Fix bug in the CPU flash attention implementation (#1206) | 2026-01-30 11:37:34 +02:00
Kawrakow | 686fd1ebec | Use standard output calculation for MiniMax-M2 graph parallel (#1199) | 2026-01-29 09:06:40 +02:00
Kawrakow | f0c61adacc | Be able to set FA offset via command line argument (#1198) | 2026-01-29 08:56:47 +02:00
Kawrakow | 02ae22388f | Apply offfset to KQ_max in CUDA flash attention (#1196) | 2026-01-29 07:27:53 +02:00
    * Apply offfset to KQ_max in CUDA flash attention
    * Forgot to add to fattn-common.h
Kawrakow | 68ed62447c | Split mode graph for Minimax-M2 (#1195) | 2026-01-29 07:27:06 +02:00
    * Split mode graph for Minimax-M2
    * Cleanup
    * Forgotten ffn_exp_probs_b
Kawrakow | 68cd52e583 | Much faster long context TG for Minimax-M2 (#1194) | 2026-01-28 10:43:11 +02:00
Kawrakow | f9b5420e6a | Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193) | 2026-01-28 10:27:14 +02:00
    * This seems much better for GQA = 12 TG
    * Remove unused arguments
Kawrakow | 69fdd041c1 | Remove forgotten unused code | 2026-01-26 12:54:21 +00:00
Kawrakow | 65441c2385 | Even better GLM-4.7-Flash long context TG performance (#1192) | 2026-01-26 13:45:06 +02:00
    * Better FA for GLM-4.7-Flash
    * Adjust ncols for ADA_LOVELACE or better
Kawrakow | 30381fc1fc | Faster hybrid inference when shared experts (#1191) | 2026-01-26 07:22:05 +02:00
Kawrakow | 478b56871f | Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190) | 2026-01-26 07:21:47 +02:00
    * This works
    * Make quantized KV cache work
    * Remove the glm45 graph building changes
    * Add condition
Kawrakow | 28f8320f3a | Much faster rng sampling (#1187) | 2026-01-25 09:11:27 +02:00