Commit Graph

418 Commits

Author SHA1 Message Date
Iwan Kawrakow
1f14f50dfd Try removing copy indirection 2025-10-27 11:39:18 +02:00
Iwan Kawrakow
444782523d Make sure the bias really is 1 row to use fusion 2025-10-27 07:10:03 +02:00
Iwan Kawrakow
1cf4f21463 Cleanup 2025-10-26 18:59:38 +02:00
Iwan Kawrakow
21a37bfc6c Faster copy when tensors are contiguous
Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)
2025-10-26 17:15:30 +02:00
Iwan Kawrakow
837c219b51 More gemv+add fusing 2025-10-26 17:14:10 +02:00
Iwan Kawrakow
5ddde01542 Fuse Q, K, V gemv+add 2025-10-26 17:13:11 +02:00
Kawrakow
f76e98536f CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864)
* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add disagnostics

* Disable assert

* Fix TG fused up*nary(gate) when down cannot be fused

The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-26 17:08:50 +02:00
Kawrakow
2522c97dc9 Faster tensor name formatting (#860)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: command line argument to disable it

* Faster tensor name formatting

We gain ~1% for Ling-mini-2.0 when running on CUDA.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:46:18 +03:00
Kawrakow
db3ba4999f Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
0e1d33ca4a Fuse add+add+fused_rms (#853)
* Fuse add+add+fused_rms

* Try this

* Macro to easily enable/disable fusion

* Various:

* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection

The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for al experts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-22 16:18:11 +03:00
Kawrakow
8aa3c2ec5e Hopefully this fixes #854 (#855)
* Hopefully this fixes #854

* Also this one

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 19:07:23 +03:00
Kawrakow
caf9759c97 Fuse add + fused_rms_norm (CUDA) (#852)
* Combine all calls to llm_build_norm to a single line

so more easily check what kind of arguments are being passed
by simply using grep.

* Combine add + fused_rms_norm

For many models this happens at each layer: the result of the
layer is added to the ayer input, which then becomes the input
to the next layer, which then is typically normalized via
fused_rms_norm.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 14:29:50 +03:00
Kawrakow
92231460cf Fix fused grouped topk (#851)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 10:10:38 +03:00
Kawrakow
f5571e241e cuda: use better block sizes for rms_norm (#845)
* cuda: use better block sizes for rms_norm

* Minor

* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 08:12:48 +03:00
Kawrakow
28d3e63805 Various fused ops around expert selection (#840)
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)

* Fix CPU + CUDA

but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)

* Minor

* Fuse sigmoid+add+topk+get_rows (CUDA)

* Fuse sigmoid+add+topk+get_rows (CPU)

* Fuse topk+view+get_rows+reshape+softmax (CPU)

* Fuse topk+view+get_rows+reshape+softmax (CUDA)

* cpu: turn off the openai topk fusing for now

Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.

* Also fuse sum_rows and div

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-19 19:02:46 +03:00
Kawrakow
747f411da5 Grouped expert routing (CUDA) (#838)
* WIP

* cuda: grouped top_k

* This is very slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-18 07:22:35 +03:00
Kawrakow
dbfd151594 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attemt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
Kawrakow
ecf8f931ea Better argsort (CPU) (#835)
* Better argsort (CPU)

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 11:31:03 +03:00
Kawrakow
9d364b88ba Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
8d0d01a593 gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Kawrakow
9724ea9213 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* Whith FA on, create mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
Kawrakow
e94d1a92a5 Attempt to fix AVX2 FA (#807)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-30 08:06:53 +02:00
Kawrakow
3d4977cb6e Fix gemma3 vision (#803)
* Remove unnecessary assert in im2col

* Remove unnecessary assert in im2col (CPU)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 11:15:32 +02:00
Kawrakow
87e4762720 Port mdmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Kawrakow
c108e4b7c9 CPU: faster FA (#797)
* Avoid computing FA chunks where the mask is -infinity

* Avoid computing FA chunks where the mask is -infinity also for f16/bf16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-26 09:00:25 +02:00
Kawrakow
8e497e704e Fused matrix multiplications (CUDA and CPU) (#796)
* Quick attempt to fuse the Q, K, V GEMMs

Doesn't do much on the CPU

* Doesn't do much on the GPU either

* Use llm_build_mul_mat_qkv

* This is not needed

* Revert timing on committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 16:52:54 +02:00
Kawrakow
cde2eb5e95 cpu: fused softmax+topk (#794)
* cpu: fused softmax+topk

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 09:02:21 +02:00
Kawrakow
45afaf3391 Fix #772 (#790)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 16:43:02 +02:00
Kawrakow
18f04350e9 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
Kawrakow
c519d4177b This is very slightly better (#762)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 21:31:02 +02:00
Kawrakow
c15f8ac508 Fix ggml_is_contiguously_allocated (#764)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 19:05:02 +02:00
firecoperana
cec8b70a7e llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
Kawrakow
0c15494c30 Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated expers for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
Kawrakow
06cc7c6894 Better CPU SWA (#757)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 11:58:16 +02:00
Kawrakow
f5e68bf8b6 Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
Kawrakow
3433c7b56d Refactor CUDA flash attention (#745)
* Factor out mma

* Factor out wmma

* Factor out vec

* Remove unnecessary includes from fattn.cu

* Move mma launch to fattn-mma-f16.cuh

* Slightly better PP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 10:12:56 +02:00
Kawrakow
1f4346381f Set default value of GGML_SCHED_MAX_COPIES to 1 (#751)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 07:04:39 +02:00
Kawrakow
62f5382c2b Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
Kawrakow
b66cecca45 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
Kawrakow
f22a9ef95a CUDA: prompt processing optimizations for MoE models (#739)
* Skip the row id computation for the ffn_down op

Sadly, almost negligible performance gain.

* Also this doesn't do much

* Also this barely moves the needle

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-30 12:09:41 +03:00
Kawrakow
46968d4ab1 Sanitize imatrix (#735)
* sanitize importance matrix: WIP

* sanitize importance matrix: iq4_k

* sanitize importance matrix: iq5_k, iq6_k

* sanitize imatrix: iq4_ks

* sanitize imatrix: iq4_kss

* sanitize imatrix: iq2_ks and iq2_kl

* sanitize imatrix: iq5_ks

* sanitize imatrix: iq4_nl_r4

* sanitize imatrix: q4_0_r8

* sanitize imatrix: q6_0_r4

* sanitize imatrix: iq4_xs_r8

* sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template

* sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4

* sanitize imatrix: repacked i-quants

* Minor

* Add more checks for iq3_k, iq3_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-29 09:08:15 +03:00
Kawrakow
dac5b48398 Check for NaNs while loading the model. (#727)
* Check for NaNs while loading the model.

* Also tell which experts have NaNs.

* Add command line option to validate quants

* Add checks for more quantization types

* Add checks for more quantizagtion types

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 19:00:17 +03:00
Iwan Kawrakow
03ce6fdb0d Fix typo 2025-08-27 14:43:44 +03:00
Kawrakow
966a6ce93c Heuristics for mmq_id -> original threshold (#734)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:17:41 +03:00
Kawrakow
931f04af53 Sanitize importances for KT quantization (#720)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:04:15 +03:00
Kawrakow
683d6f1fc8 Fix avx2 GEMM mess (v2) (#724)
* This fixes confusion around Q8_0 on AVX2

* This does it for iq4_nl, including FA

* This does it for iq4_nl on Zen4, but FA does not work

* Slightly more clear

* Adding forgotten q8_0_r8 to num_rows()

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:03:47 +03:00
Kawrakow
0cc32ff0b1 CUDA: muh faster prompt processing for MoE models and small u-batch sizes (#728)
* WIP: adding mainline mmq_id implementation

* This seems to work

* Now also -fmoe works

* WIP

* WIP

* WIP

* This works for mainline supported quants

* mmq_id: add iq2_k, iq2_k_r4

* mmiq_id: don't assume row size is multiple of type size (per row scales)

* mmiq_id: don't assume row size is multiple of type size

* mmq_id: add iq2_ks

So we are sure it works with per row scales

* mmq_id: add iq2_kl

* mmq_id: add iq3_ks

* mmq_id: adding iq3_k, iq3_k_r4

* mmq_id: add iq4_kss, iq4_ks, iq4_ks_r4

* mmq_id: adding iq4_k, iq4_k_r4

* mmq_id: adding iq5_ks, iq5_ks_r4

* mmq_id: adding iq5_k, iq5_k_r4, q6_0

* mmq_id: adding iq6_k

* mmq_id: add iq1_s_r4

* mmq_id: adding iq1_kt, iq2_kt

* mmq_id: add iq3_kt, iq4_kt

* Add CUDA fp8 header

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-26 13:30:35 +03:00
Kawrakow
e008c0e192 Log for debugging #721 (#722)
* Log for debugging #721

* Remove the 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 15:24:34 +03:00
Kawrakow
e919e89d5a Fix more Q8_0 repacking mess on AVX2 (#719)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 09:04:51 +03:00
Kawrakow
dfa6e2b5fa CUDA: faster IQ2_K, IQ2_KS, IQ2_K_R4 (#716)
* Use bperm trick for iq2_ks gemm -> 7% gain

* Use bperm trick for iq2_k gemm -> ~5% gain

* Use bperm trick for iq2_k_r4 gemm -> ~3% gain

* Use bperm trick for iq2_ks gemv -> ~7% gain

* Use bperm trick for iq2_k gemv -> ~3% gain

* Use bperm trick for iq2_k_r4 gemv -> ~7% gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 07:25:35 +03:00