Commit Graph

550 Commits

Author SHA1 Message Date
Kawrakow
7b68353e09 Fused delta net 3 (#1333)
* This is better than chunked

* Keep the state in registers

* Cleanup

* Remove unused stuff

* Minor

* Make fused delta-net the default

* Fix race
2026-02-27 15:02:56 +01:00
Kawrakow
facc8fdc44 Very slightly better fused delta-net (#1330) 2026-02-27 07:03:09 +01:00
Kawrakow
757bee6238 Add special FA handling for dense Qwen3.5 (#1328) 2026-02-26 11:27:41 +01:00
Kawrakow
2616efa296 Fused delta net 2 (#1320)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do,
actually want to see the biffer sizes.
2026-02-26 06:53:43 +01:00
Kawrakow
216f44363f Fix KT quantization yet again (#1321)
* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_uantize.cpp

* Fixes for k-quants

* Also this one
2026-02-25 18:07:12 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Kawrakow
38ca19d828 Minor delta-net tweak (#1308)
* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak
2026-02-24 15:22:57 +01:00
Kawrakow
7065488135 Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Samuel Oliveira Alves
51df09be8a Feat - add kimi 2.5 Vision (#1280)
* port kimi 25-vision  from upstream

* feat(clip): add support for Kimi K2.5 vision model
2026-02-19 08:15:03 +01:00
Kawrakow
d2d65c0d64 Better CPU performance for Qwen3-Next (#1283)
* Better CPU silu - +4% PP

* Improve ggml_compute_forward_dup_bytes
2026-02-18 15:55:11 +01:00
Kawrakow
84831fc3ee Don't disable CUDA graphs for Qwen3-Next (#1278) 2026-02-18 08:47:45 +01:00
Kawrakow
cafeef484c More Qwen3-Next optimizations (#1277)
* Optimizing q3next TG

* Fused add -> softplus -> mul on CUDA

* Remove forgotten debug log

* Increase ggml context size

Required for Qwen3-Next with batch/u-batch size of 4096

* WIP

* Avoid some contiguous ops

* Avoid some repeats

* Avoid some more repeats
2026-02-17 16:03:51 +01:00
Kawrakow
16fe459a49 Faster CPU PP performance for Qwen3-Next - optimize concat (#1276) 2026-02-17 11:46:27 +01:00
Kawrakow
35c99f9f41 Faster Qwen3-Next PP on CUDA - optimize concat (#1275) 2026-02-16 11:46:39 +01:00
Kawrakow
e30198a553 WIP: Qwen3Next (#1266)
* qwen3next: add architecture support and recurrent-state fixes

* qwen3next: optimize broadcast sub and single-seq ssm conv

* cuda: build MoE row mapping on device in mul_mat_id

* cuda: add guarded multi-seq fast path for ssm_conv

* docs: update qwen3next perf report for cuda MoE/SSM tuning

* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval

* qwen3next: split cpu/cuda eval builds and tune PP scheduling

* qwen3next: harden seq-state flow and support optional dense FFN layers

* qwen3next: trim delta-net graph overhead in chunking path

* qwen3next: remove redundant v_conv cont in delta path

* qwen3next: avoid extra cont on linear attention output

* qwen3next: drop redundant cont before recurrent state flatten

* qwen3next: keep recurrent state in 4d layout through delta path

* qwen3next: add fused delta-net op and wire model path

* tests: add backend-op coverage for ggml_delta_net

* qwen3next: add runtime switch for fused delta-net path

* docs: refresh qwen3next perf review and benchmark matrix

* qwen3next: default fused delta-net off and document quality checks

* qwen3next: add decode-only fused delta mode

* qwen3next: make fused delta safe by default and fix fused tensor layout

* qwen3next: warn when forcing fused decode mode

* qwen3next: add fused-delta regression runner script

* qwen3next: integrate fused regression into eval harness

* qwen3next: clean up chunked delta-net shape handling

* qwen3next: add absolute sanity guards to fused regression

* qwen3next: add unified regression runner script

* qwen3next: disable flash-attn for cpu-only contexts

* docs: reconcile qwen3next status and remaining upstream gaps

* common: add qwen3next fused-delta runtime flag

* cuda: add qwen3next delta-net kernel dispatch override

* docs: update qwen3next quality and serving baseline findings

* qwen3next: keep fused delta on safe path and remove PR artifacts

* qwen3next: align autoregressive delta-net decode layout

* Revert "qwen3next: align autoregressive delta-net decode layout"

This reverts commit 9241164a5e.

* cuda: port solve-tri fast-paths for qwen3next delta-net

* qwen3next: add fused-delta runtime flag and drop env toggle

* qwen3next: make fused delta single-flag and default on

* Account for GPU arch differences

* Revert "cuda: build MoE row mapping on device in mul_mat_id"

This reverts commit 89e9ecfa84.

* qwen3next: drop non-essential MoE scheduling and split heuristics

* qwen3next: avoid generic ggml_sub broadcast changes

* llama: restore only_active_experts log message

* Remove unnecessary hacks, disable fusion for now.

* qwen3next: port hybrid recurrent state memory semantics

* qwen3next: clean up recurrent state slot plumbing

* qwen3next: fix hybrid V-cache layout plumbing

* qwen3next: guard recurrent state slots against kv capacity

* qwen3next: persist recurrent state in session data

- serialize/restore qwen3next cache.s_l in state/session paths\n- bump session and sequence-state file versions for format change\n- fallback to single-token chunking for mixed repeated seq_id batches

* qwen3next: drop unused fused-delta builder path

- remove dead build_delta_net_fused lambda\n- remove unused llm_build_context::fused_delta member

* qwen3next: remove unused fused-delta CLI/context plumbing

- drop -fd/-no-fd options and related YAML dump field\n- remove fused_delta fields from public/internal context params\n- remove fused_delta assignment and logging in context init

* ggml: remove unused DELTA_NET operator stack

* Missing include

* Reorder ops/unary ops

So we don't change again the enum values of the mul mat ops

* Minor

* Discard unnecessary changes in llama-build-context.cpp

* Minor

* Revert "Discard unnecessary changes in llama-build-context.cpp"

This reverts commit edadb80ed6.

* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches

* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next

* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next

It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.

Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.

* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next

For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.

* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next

* CPU: fuse transpose -> cont -> sum_rows -> transpos: 17.6 -> 23.1 t/s for Qwen3Next

* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512

* Multithreading for OP_SUB

* Don't commit with timing trace on

* Multithread neg and sigmoid

* Be able to turn on/off fusion more easily (CPU)

* Name the mul_mat ops so we know where the time goes

* WIP

* Much better PP on CUDA

* CUDA: fuse transpose -> cont -> sum_rows -> transpose

Needs non-coontiguous variant of sum_rows.
On the CPU this gave 30+% improvement in TG performance,
on CUDA ist is disapointing 6-7%. I guess, this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.

* CUDA: faster mul for special case relevant for Qwen3Next

Worth 1% in TG

* Fix CPU OP_CONT

---------

Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
2026-02-16 06:50:28 +01:00
Kawrakow
82c4f27332 Fuse the attention gate in Step-3.5-Flash (#1244)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)

* WIP: graph parallel for Step-3.5

* WIP

* This should be it

* Cleanup

* Fix merge

* Not working attempt to extend fused_mul_unary to the Step-3.5 case

* It works now, but performance gain is very minor
2026-02-07 07:56:58 +02:00
Kawrakow
81ea911f0d Graph parallel for Step-3.5-Flash (#1236)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)

* WIP: graph parallel for Step-3.5

* WIP

* This should be it

* Cleanup

* Fix merge
2026-02-06 06:56:51 +02:00
Kawrakow
5a44324e4a Bespoke ggml_repeat for Step3.5-Flash (#1239) 2026-02-06 06:56:09 +02:00
Kawrakow
51fc78750f Remove forgotten printf 2026-02-05 16:52:24 +02:00
Kawrakow
a7befb3bed Change default FA offset to ln(2) (#1235)
* Change default FA offset to ln(2)

* Also here
2026-02-05 13:42:53 +02:00
Kawrakow
9c1c74acda Step-3.5-Flash support (#1231)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)
2026-02-05 08:13:22 +02:00
Kawrakow
f8acfc2bf0 Better CUDA TG for GQA = 10 (#1221)
* Better CUDA TG for GQA = 10

* Cleanup
2026-02-03 09:18:46 +02:00
Kawrakow
589d80f677 Fix CPU FA work buffer size (#1216) 2026-02-02 12:39:41 +02:00
Kawrakow
d5498c4467 Do not repack q8_0 for batch sizes less than 8 2026-02-02 09:07:45 +00:00
Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offfset to KQ_max in CUDA flash attention (#1196)
* Apply offfset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use GUDA graphs also when theretensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use GUDA graphs also when theretensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* Once at it, lets fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a big somewhere

* This finally works
2026-01-18 08:02:49 +02:00