Commit Graph

556 Commits

Author SHA1 Message Date
Kawrakow
f27678d39b ARM_NEON fused delta-net implementation (#1361)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-04 13:24:59 +01:00
mullecofo
2f93bf7563 Fix compilation on clang-cl.exe (#1355)
Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169

See the bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html

Clang (and GCC) supports a language feature called Vector Extensions.

To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.

MSVC follows a stricter, more traditional C++ model regarding intrinsics.

In MSVC, `__m512i` is defined in the header files (`<immintrin.h>`) as a struct or union (e.g., `typedef struct __m512i { ... } __m512i`). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like `int` or `float`.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the `__m512i` struct. Since the standard headers don't provide one, the compiler throws an error.
You must use the explicit intrinsic function provided by Intel/MSVC: `_mm512_or_si512(a, b)`.

To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.
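
For illustration only (this is not necessarily the exact code added in the PR), such an overload could look like the sketch below; the preprocessor guard is an assumption about where the overload is needed:

```cpp
#include <immintrin.h>

// Illustrative sketch: teach MSVC's C++ front end what `|` means for __m512i
// by forwarding to the intrinsic. Clang/GCC (including clang-cl's clang front
// end) already handle `|` on native vector types and would reject an operator
// overload on them, so the guard below (an assumption, not necessarily the
// PR's exact condition) keeps the overload out of their way.
#if defined(_MSC_VER) && !defined(__clang__)
static inline __m512i operator|(const __m512i & a, const __m512i & b) {
    return _mm512_or_si512(a, b);   // emits VPORQ/VPORD
}
#endif
```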
2026-03-04 08:00:28 +01:00
Kawrakow
fd16a418de Fix clang warnings on macOS (#1354)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-03 16:27:16 +01:00
Nexes the Elder
d4ac5f1566 gguf-split: fix the split output files naming (#1336)
* Fix gguf-split.cpp splits output naming

With this fix, the `.gguf` extension of the source file is no longer included in the output file name before the split numbering.

For example:

No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf

* increase GGML_MAX_CONTEXTS to 2048

* Revert GGML_MAX_CONTEXTS to 64
2026-03-02 08:43:47 +01:00
Kawrakow
0ff3a43289 Bring back #1333 and #1335 (#1340)
* Bring back fused delta net 3

* Remove autoregressive and chunking
2026-02-28 14:31:42 +01:00
Kawrakow
1922449b2c Revert delta net 3 (#1339)
* Revert "Simplify delta-net (#1335)"

This reverts commit e5fc30244c.

* Revert "Fused delta net 3 (#1333)"

This reverts commit 7b68353e09.
2026-02-28 13:12:08 +01:00
Kawrakow
7b68353e09 Fused delta net 3 (#1333)
* This is better than chunked

* Keep the state in registers

* Cleanup

* Remove unused stuff

* Minor

* Make fused delta-net the default

* Fix race
2026-02-27 15:02:56 +01:00
Kawrakow
facc8fdc44 Very slightly better fused delta-net (#1330) 2026-02-27 07:03:09 +01:00
Kawrakow
757bee6238 Add special FA handling for dense Qwen3.5 (#1328) 2026-02-26 11:27:41 +01:00
Kawrakow
2616efa296 Fused delta net 2 (#1320)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split into 2000 parts, and those who do
actually want to see the buffer sizes.
2026-02-26 06:53:43 +01:00
Kawrakow
216f44363f Fix KT quantization yet again (#1321)
* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one
2026-02-25 18:07:12 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Kawrakow
38ca19d828 Minor delta-net tweak (#1308)
* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak
2026-02-24 15:22:57 +01:00
Kawrakow
7065488135 Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Samuel Oliveira Alves
51df09be8a Feat - add kimi 2.5 Vision (#1280)
* port Kimi 2.5 vision from upstream

* feat(clip): add support for Kimi K2.5 vision model
2026-02-19 08:15:03 +01:00
Kawrakow
d2d65c0d64 Better CPU performance for Qwen3-Next (#1283)
* Better CPU silu - +4% PP

* Improve ggml_compute_forward_dup_bytes
2026-02-18 15:55:11 +01:00
Kawrakow
84831fc3ee Don't disable CUDA graphs for Qwen3-Next (#1278) 2026-02-18 08:47:45 +01:00
Kawrakow
cafeef484c More Qwen3-Next optimizations (#1277)
* Optimizing q3next TG

* Fused add -> softplus -> mul on CUDA

* Remove forgotten debug log

* Increase ggml context size

Required for Qwen3-Next with batch/u-batch size of 4096

* WIP

* Avoid some contiguous ops

* Avoid some repeats

* Avoid some more repeats
2026-02-17 16:03:51 +01:00
Kawrakow
16fe459a49 Faster CPU PP performance for Qwen3-Next - optimize concat (#1276) 2026-02-17 11:46:27 +01:00
Kawrakow
35c99f9f41 Faster Qwen3-Next PP on CUDA - optimize concat (#1275) 2026-02-16 11:46:39 +01:00
Kawrakow
e30198a553 WIP: Qwen3Next (#1266)
* qwen3next: add architecture support and recurrent-state fixes

* qwen3next: optimize broadcast sub and single-seq ssm conv

* cuda: build MoE row mapping on device in mul_mat_id

* cuda: add guarded multi-seq fast path for ssm_conv

* docs: update qwen3next perf report for cuda MoE/SSM tuning

* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval

* qwen3next: split cpu/cuda eval builds and tune PP scheduling

* qwen3next: harden seq-state flow and support optional dense FFN layers

* qwen3next: trim delta-net graph overhead in chunking path

* qwen3next: remove redundant v_conv cont in delta path

* qwen3next: avoid extra cont on linear attention output

* qwen3next: drop redundant cont before recurrent state flatten

* qwen3next: keep recurrent state in 4d layout through delta path

* qwen3next: add fused delta-net op and wire model path

* tests: add backend-op coverage for ggml_delta_net

* qwen3next: add runtime switch for fused delta-net path

* docs: refresh qwen3next perf review and benchmark matrix

* qwen3next: default fused delta-net off and document quality checks

* qwen3next: add decode-only fused delta mode

* qwen3next: make fused delta safe by default and fix fused tensor layout

* qwen3next: warn when forcing fused decode mode

* qwen3next: add fused-delta regression runner script

* qwen3next: integrate fused regression into eval harness

* qwen3next: clean up chunked delta-net shape handling

* qwen3next: add absolute sanity guards to fused regression

* qwen3next: add unified regression runner script

* qwen3next: disable flash-attn for cpu-only contexts

* docs: reconcile qwen3next status and remaining upstream gaps

* common: add qwen3next fused-delta runtime flag

* cuda: add qwen3next delta-net kernel dispatch override

* docs: update qwen3next quality and serving baseline findings

* qwen3next: keep fused delta on safe path and remove PR artifacts

* qwen3next: align autoregressive delta-net decode layout

* Revert "qwen3next: align autoregressive delta-net decode layout"

This reverts commit 9241164a5e.

* cuda: port solve-tri fast-paths for qwen3next delta-net

* qwen3next: add fused-delta runtime flag and drop env toggle

* qwen3next: make fused delta single-flag and default on

* Account for GPU arch differences

* Revert "cuda: build MoE row mapping on device in mul_mat_id"

This reverts commit 89e9ecfa84.

* qwen3next: drop non-essential MoE scheduling and split heuristics

* qwen3next: avoid generic ggml_sub broadcast changes

* llama: restore only_active_experts log message

* Remove unnecessary hacks, disable fusion for now.

* qwen3next: port hybrid recurrent state memory semantics

* qwen3next: clean up recurrent state slot plumbing

* qwen3next: fix hybrid V-cache layout plumbing

* qwen3next: guard recurrent state slots against kv capacity

* qwen3next: persist recurrent state in session data

- serialize/restore qwen3next cache.s_l in state/session paths
- bump session and sequence-state file versions for format change
- fallback to single-token chunking for mixed repeated seq_id batches

* qwen3next: drop unused fused-delta builder path

- remove dead build_delta_net_fused lambda
- remove unused llm_build_context::fused_delta member

* qwen3next: remove unused fused-delta CLI/context plumbing

- drop -fd/-no-fd options and related YAML dump field
- remove fused_delta fields from public/internal context params
- remove fused_delta assignment and logging in context init

* ggml: remove unused DELTA_NET operator stack

* Missing include

* Reorder ops/unary ops

So we don't change the enum values of the mul mat ops again

* Minor

* Discard unnecessary changes in llama-build-context.cpp

* Minor

* Revert "Discard unnecessary changes in llama-build-context.cpp"

This reverts commit edadb80ed6.

* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches

* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next

* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next

It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.

Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.

* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next

For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so it was done in a single thread.
We now simply use blocks of 1024 elements.
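
As an illustration (not the actual ggml implementation), splitting such a single-row op into fixed-size blocks distributed across threads could look roughly like this; the function and parameter names are hypothetical:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical sketch: scale one large row in blocks of 1024 elements,
// with each of the nth threads (thread index ith) taking every nth block.
static void scale_row_blocked(float * x, int64_t n, float s, int ith, int nth) {
    constexpr int64_t kBlock = 1024;
    const int64_t nblocks = (n + kBlock - 1)/kBlock;
    for (int64_t ib = ith; ib < nblocks; ib += nth) {
        const int64_t i0 = ib*kBlock;
        const int64_t i1 = std::min(i0 + kBlock, n);
        for (int64_t i = i0; i < i1; ++i) x[i] *= s;
    }
}
```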

* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next

* CPU: fuse transpose -> cont -> sum_rows -> transpose: 17.6 -> 23.1 t/s for Qwen3Next

* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512

* Multithreading for OP_SUB

* Don't commit with timing trace on

* Multithread neg and sigmoid

* Be able to turn on/off fusion more easily (CPU)

* Name the mul_mat ops so we know where the time goes

* WIP

* Much better PP on CUDA

* CUDA: fuse transpose -> cont -> sum_rows -> transpose

Needs a non-contiguous variant of sum_rows.
On the CPU this gave a 30+% improvement in TG performance;
on CUDA it is a disappointing 6-7%. I guess this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.
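
For illustration only (not the actual kernel code), a non-contiguous sum_rows can walk the source with explicit strides so no preceding cont/copy is needed; the names and signature below are hypothetical:

```cpp
#include <cstdint>

// Hypothetical sketch: sum each row of a strided (non-contiguous) float
// matrix into dst, avoiding a prior cont (copy-to-contiguous) op.
static void sum_rows_noncontig(const float * src, float * dst,
                               int64_t ncols, int64_t nrows,
                               int64_t col_stride, int64_t row_stride) {
    for (int64_t r = 0; r < nrows; ++r) {
        double sum = 0.0;
        for (int64_t c = 0; c < ncols; ++c) {
            sum += src[r*row_stride + c*col_stride];
        }
        dst[r] = (float) sum;
    }
}
```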

* CUDA: faster mul for special case relevant for Qwen3Next

Worth 1% in TG

* Fix CPU OP_CONT

---------

Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
2026-02-16 06:50:28 +01:00
Kawrakow
82c4f27332 Fuse the attention gate in Step-3.5-Flash (#1244)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)

* WIP: graph parallel for Step-3.5

* WIP

* This should be it

* Cleanup

* Fix merge

* Non-working attempt to extend fused_mul_unary to the Step-3.5 case

* It works now, but performance gain is very minor
2026-02-07 07:56:58 +02:00
Kawrakow
81ea911f0d Graph parallel for Step-3.5-Flash (#1236)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)

* WIP: graph parallel for Step-3.5

* WIP

* This should be it

* Cleanup

* Fix merge
2026-02-06 06:56:51 +02:00
Kawrakow
5a44324e4a Bespoke ggml_repeat for Step3.5-Flash (#1239) 2026-02-06 06:56:09 +02:00
Kawrakow
51fc78750f Remove forgotten printf 2026-02-05 16:52:24 +02:00
Kawrakow
a7befb3bed Change default FA offset to ln(2) (#1235)
* Change default FA offset to ln(2)

* Also here
2026-02-05 13:42:53 +02:00
Kawrakow
9c1c74acda Step-3.5-Flash support (#1231)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)
2026-02-05 08:13:22 +02:00
Kawrakow
f8acfc2bf0 Better CUDA TG for GQA = 10 (#1221)
* Better CUDA TG for GQA = 10

* Cleanup
2026-02-03 09:18:46 +02:00
Kawrakow
589d80f677 Fix CPU FA work buffer size (#1216) 2026-02-02 12:39:41 +02:00
Kawrakow
d5498c4467 Do not repack q8_0 for batch sizes less than 8 2026-02-02 09:07:45 +00:00
Kawrakow
685df0e69d Work buffer size 2026-01-31 16:10:23 +00:00
Kawrakow
2bf2fa8ba4 Better CPU FA thread strategy 2026-01-31 15:46:16 +00:00
Kawrakow
811f8c3393 Fix bug in the CPU flash attention implementation (#1206) 2026-01-30 11:37:34 +02:00
Kawrakow
f0c61adacc Be able to set FA offset via command line argument (#1198) 2026-01-29 08:56:47 +02:00
Kawrakow
02ae22388f Apply offset to KQ_max in CUDA flash attention (#1196)
* Apply offset to KQ_max in CUDA flash attention

* Forgot to add to fattn-common.h
2026-01-29 07:27:53 +02:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
68cd52e583 Much faster long context TG for Minimax-M2 (#1194) 2026-01-28 10:43:11 +02:00
Kawrakow
f9b5420e6a Much faster long-context TG for GLM-4.5/4.6/4.7/AIR (#1193)
* This seems much better for GQA = 12 TG

* Remove unused arguments
2026-01-28 10:27:14 +02:00
Kawrakow
69fdd041c1 Remove forgotten unused code 2026-01-26 12:54:21 +00:00
Kawrakow
65441c2385 Even better GLM-4.7-Flash long context TG performance (#1192)
* Better FA for GLM-4.7-Flash

* Adjust ncols for ADA_LOVELACE or better
2026-01-26 13:45:06 +02:00
Kawrakow
478b56871f Faster long context TG on CUDA for GLM-4.5/4.6/4.7/AIR (part 2) (#1190)
* This works

* Make quantized KV cache work

* Remove the glm45 graph building changes

* Add condition
2026-01-26 07:21:47 +02:00
Kawrakow
f0fb76da64 Better GLM-4.7-Flash long context TG performance (#1182)
* Better GLM-4.7-Flash long context TG performance

* Handle quantized cache
2026-01-24 07:05:48 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
66caa42b53 Fix build with GGML_CUDA_GRAPHS=OFF 2026-01-22 10:46:57 +00:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use CUDA graphs also when there are tensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
101fe54797 CUDA graphs with tensor overrides (#1172)
* Use CUDA graphs also when there are tensor overrides

* Change graph key
2026-01-22 12:28:11 +02:00
Kawrakow
1cb8cd534f Fix build failure when OpenMP is not available (#1171) 2026-01-22 12:26:23 +02:00
Kawrakow
77c18acc90 Fix non-contiguous batched cuBLAS (#1178) 2026-01-22 12:25:05 +02:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00