518 Commits

Author SHA1 Message Date
Kawrakow
55a704b67a Fused Q and K fused_rms_norm for TG on CUDA (#882)
* Biased mmvq: minor optimization

* Fusing Q and K rms_norm for TG on CUDA

* Remove commented out code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:41:28 +02:00
Kawrakow
cfb840379f Biased mmvq: minor optimization (#880)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:21:18 +02:00
Kawrakow
0459f595d7 CUDA: correctly detect if flash attention is supported (#875)
* Don't use vector kernels if K or V are quantized

* Correctly determine if FA is supported

* Also wmma

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-29 13:56:16 +02:00
Nexes the Elder
d50c2490fc correct typo (#876) 2025-10-28 19:01:45 +02:00
firecoperana
904e994bfb Support --device and --device-draft parameter (#866)
* add --device and --device-draft parameter

* don't print debug message in release mode

* fix

* bug fix to throw exception when no device specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
Kawrakow
eb8116b097 Even more fused ops (#868)
* Fuse Q, K, V gemv+add

* More gemv+add fusing

* Faster copy when tensors are contiguous

Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)

* Cleanup

* Make sure the bias really is 1 row to use fusion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-27 16:09:01 +02:00
Kawrakow
e34399c116 CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864)
* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add diagnostics

* Disable assert

* Fix TG fused up*unary(gate) when down cannot be fused

The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-26 17:08:50 +02:00
Kawrakow
70c0095e11 Faster tensor name formatting (#860)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: command line argument to disable it

* Faster tensor name formatting

We gain ~1% for Ling-mini-2.0 when running on CUDA.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:46:18 +03:00
Kawrakow
0549be76e5 Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
ed4e1a6588 Fuse add+add+fused_rms (#853)
* Fuse add+add+fused_rms

* Try this

* Macro to easily enable/disable fusion

* Various:

* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection

The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for all experts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-22 16:18:11 +03:00
Kawrakow
af5bf60cc8 Hopefully this fixes #854 (#855)
* Hopefully this fixes #854

* Also this one

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 19:07:23 +03:00
Kawrakow
366d66bc1a Fuse add + fused_rms_norm (CUDA) (#852)
* Combine all calls to llm_build_norm into a single line

so one can more easily check what kind of arguments are being passed
by simply using grep.

* Combine add + fused_rms_norm

For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, where it is typically normalized via
fused_rms_norm.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 14:29:50 +03:00
Kawrakow
a27d661aeb Fix fused grouped topk (#851)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 10:10:38 +03:00
Kawrakow
c23a17b6fe cuda: use better block sizes for rms_norm (#845)
* cuda: use better block sizes for rms_norm

* Minor

* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 08:12:48 +03:00
Kawrakow
7a41b3b1f5 Various fused ops around expert selection (#840)
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)

* Fix CPU + CUDA

but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)

* Minor

* Fuse sigmoid+add+topk+get_rows (CUDA)

* Fuse sigmoid+add+topk+get_rows (CPU)

* Fuse topk+view+get_rows+reshape+softmax (CPU)

* Fuse topk+view+get_rows+reshape+softmax (CUDA)

* cpu: turn off the openai topk fusing for now

Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.

* Also fuse sum_rows and div

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-19 19:02:46 +03:00
Kawrakow
1dcc044134 Grouped expert routing (CUDA) (#838)
* WIP

* cuda: grouped top_k

* This is very slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-18 07:22:35 +03:00
Kawrakow
cde642e591 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attempt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
Kawrakow
e66d307e13 Better argsort (CPU) (#835)
* Better argsort (CPU)

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 11:31:03 +03:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
ba9fefb73d gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Kawrakow
4e24d48e63 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* With FA on, create the mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
Kawrakow
475223079c Attempt to fix AVX2 FA (#807)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-30 08:06:53 +02:00
Kawrakow
9932e6b102 Fix gemma3 vision (#803)
* Remove unnecessary assert in im2col

* Remove unnecessary assert in im2col (CPU)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 11:15:32 +02:00
Kawrakow
c1a0e15377 Port mtmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Kawrakow
bc34573356 CPU: faster FA (#797)
* Avoid computing FA chunks where the mask is -infinity

* Avoid computing FA chunks where the mask is -infinity also for f16/bf16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-26 09:00:25 +02:00
Kawrakow
f8b66238fa Fused matrix multiplications (CUDA and CPU) (#796)
* Quick attempt to fuse the Q, K, V GEMMs

Doesn't do much on the CPU

* Doesn't do much on the GPU either

* Use llm_build_mul_mat_qkv

* This is not needed

* Revert timing code committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 16:52:54 +02:00
Kawrakow
f59b2909d4 cpu: fused softmax+topk (#794)
* cpu: fused softmax+topk

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 09:02:21 +02:00
Kawrakow
8b4208e789 Fix #772 (#790)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 16:43:02 +02:00
Kawrakow
4591e83825 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
Kawrakow
540a26514f This is very slightly better (#762)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 21:31:02 +02:00
Kawrakow
f74dd77143 Fix ggml_is_contiguously_allocated (#764)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 19:05:02 +02:00
firecoperana
49979ba9e9 llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
Kawrakow
13c3b6412e Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
Kawrakow
144d456717 Better CPU SWA (#757)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 11:58:16 +02:00
Kawrakow
4a6a6f17ee Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
Kawrakow
727f7b7d9f Refactor CUDA flash attention (#745)
* Factor out mma

* Factor out wmma

* Factor out vec

* Remove unnecessary includes from fattn.cu

* Move mma launch to fattn-mma-f16.cuh

* Slightly better PP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 10:12:56 +02:00
Kawrakow
d29c21ecbc Set default value of GGML_SCHED_MAX_COPIES to 1 (#751)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 07:04:39 +02:00
Kawrakow
56e0f897ae Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
Kawrakow
8de297b795 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
Kawrakow
d55e98519f CUDA: prompt processing optimizations for MoE models (#739)
* Skip the row id computation for the ffn_down op

Sadly, almost negligible performance gain.

* Also this doesn't do much

* Also this barely moves the needle

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-30 12:09:41 +03:00
Kawrakow
f529c3a808 Sanitize imatrix (#735)
* sanitize importance matrix: WIP

* sanitize importance matrix: iq4_k

* sanitize importance matrix: iq5_k, iq6_k

* sanitize imatrix: iq4_ks

* sanitize imatrix: iq4_kss

* sanitize imatrix: iq2_ks and iq2_kl

* sanitize imatrix: iq5_ks

* sanitize imatrix: iq4_nl_r4

* sanitize imatrix: q4_0_r8

* sanitize imatrix: q6_0_r4

* sanitize imatrix: iq4_xs_r8

* sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template

* sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4

* sanitize imatrix: repacked i-quants

* Minor

* Add more checks for iq3_k, iq3_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-29 09:08:15 +03:00
Kawrakow
e760b4dc41 Check for NaNs while loading the model. (#727)
* Check for NaNs while loading the model.

* Also tell which experts have NaNs.

* Add command line option to validate quants

* Add checks for more quantization types

* Add checks for more quantization types

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 19:00:17 +03:00
Kawrakow
ca5b6ab9b1 Fix typo 2025-08-27 14:43:44 +03:00
Kawrakow
1dcc34f70a Heuristics for mmq_id -> original threshold (#734)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:17:41 +03:00
Kawrakow
6afe9b48ab Sanitize importances for KT quantization (#720)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:04:15 +03:00
Kawrakow
3dc4dffed5 Fix avx2 GEMM mess (v2) (#724)
* This fixes confusion around Q8_0 on AVX2

* This does it for iq4_nl, including FA

* This does it for iq4_nl on Zen4, but FA does not work

* Slightly more clear

* Adding forgotten q8_0_r8 to num_rows()

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:03:47 +03:00
Kawrakow
ac4ec50f03 CUDA: much faster prompt processing for MoE models and small u-batch sizes (#728)
* WIP: adding mainline mmq_id implementation

* This seems to work

* Now also -fmoe works

* WIP

* WIP

* WIP

* This works for mainline supported quants

* mmq_id: add iq2_k, iq2_k_r4

* mmq_id: don't assume row size is multiple of type size (per row scales)

* mmq_id: don't assume row size is multiple of type size

* mmq_id: add iq2_ks

So we are sure it works with per row scales

* mmq_id: add iq2_kl

* mmq_id: add iq3_ks

* mmq_id: adding iq3_k, iq3_k_r4

* mmq_id: add iq4_kss, iq4_ks, iq4_ks_r4

* mmq_id: adding iq4_k, iq4_k_r4

* mmq_id: adding iq5_ks, iq5_ks_r4

* mmq_id: adding iq5_k, iq5_k_r4, q6_0

* mmq_id: adding iq6_k

* mmq_id: add iq1_s_r4

* mmq_id: adding iq1_kt, iq2_kt

* mmq_id: add iq3_kt, iq4_kt

* Add CUDA fp8 header

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-26 13:30:35 +03:00
Kawrakow
d3aecc7f37 Log for debugging #721 (#722)
* Log for debugging #721

* Remove the 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 15:24:34 +03:00
Kawrakow
8e23ecdd96 Fix more Q8_0 repacking mess on AVX2 (#719)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 09:04:51 +03:00
Kawrakow
e962ce8c70 CUDA: faster IQ2_K, IQ2_KS, IQ2_K_R4 (#716)
* Use bperm trick for iq2_ks gemm -> 7% gain

* Use bperm trick for iq2_k gemm -> ~5% gain

* Use bperm trick for iq2_k_r4 gemm -> ~3% gain

* Use bperm trick for iq2_ks gemv -> ~7% gain

* Use bperm trick for iq2_k gemv -> ~3% gain

* Use bperm trick for iq2_k_r4 gemv -> ~7% gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 07:25:35 +03:00