Commit Graph

456 Commits

Author SHA1 Message Date
firecoperana
869557c8fd Update mtmd to improve accuracy of M-RoPE (#993)
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
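
As a toy sketch of that boundary effect (hypothetical greedy tokenizer and vocab, not the real model tokenizer), tokenizing the blocks separately and then concatenating can never produce a merged token that spans the block boundary:

```cpp
// Toy greedy longest-match tokenizer over a hypothetical vocab; it only
// illustrates the boundary effect described above, it is not the real
// tokenizer. The merged "\n\n" token can never appear when the two halves
// of the text are tokenized separately and then concatenated.
#include <iostream>
#include <string>
#include <vector>

static const std::vector<std::string> k_vocab = {"\n\n", "\n", "<row>", "<img>"};

static std::vector<std::string> tokenize(const std::string & text) {
    std::vector<std::string> out;
    size_t pos = 0;
    while (pos < text.size()) {
        std::string best;
        for (const auto & tok : k_vocab) {
            if (tok.size() > best.size() && text.compare(pos, tok.size(), tok) == 0) {
                best = tok;
            }
        }
        if (best.empty()) {
            best = text.substr(pos, 1);   // unknown byte, pass through
        }
        out.push_back(best);
        pos += best.size();
    }
    return out;
}

int main() {
    auto a = tokenize("<row>\n");             // per-block tokenization ...
    auto b = tokenize("\n<img>");
    a.insert(a.end(), b.begin(), b.end());    // ... then concatenation: 4 tokens

    auto joined = tokenize("<row>\n\n<img>"); // joint tokenization: 3 tokens

    std::cout << "separate: " << a.size() << ", joined: " << joined.size() << "\n";
}
```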

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
coincidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (#14928)

* model : add LightOnOCR-1B model (#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement min/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (#16921)

* mtmd: improve struct initialization (#16981)

* mtmd: allow QwenVL to process larger image by default (#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (#17123)

* mtmd: fix patch_size initialized to random value in audio models (#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
2025-11-29 07:27:15 +01:00
Kawrakow
d6daee337c Attempt to fix #1014 (#1017)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-27 15:58:18 +01:00
Kawrakow
a3b8efd687 Enable iq4_nl KV cache on CUDA (#1006)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-24 09:41:19 +01:00
Kawrakow
0243356650 Fix q6_0 dequantize (#1005)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-24 09:13:46 +01:00
Nexes the Elder
9a63e768ea Legacy quants cpy_blck_q_f16 function for K cache (#1001)
Short fix for the bug: ggml\src\ggml-cuda\cpy.cu:614: ggml_cuda_cpy_fn: unsupported type combination (q6_0 to f16), encountered when trying to use DeepSeek-V2-Lite with a quantized K cache. Note: I compile my ik_llama.cpp with GGML_CUDA_F16.

To fix this, I added a cpy_blck_q_f16 function, devised by comparing cpy_blck_q8_0_f32 with cpy_blck_q8_0_f16 and transposing the difference onto the other legacy quants on the basis of the cpy_blck_q_f32 function: a "rule of three" of sorts.
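
Roughly, the pattern looks like this host-side sketch with toy types and names (not the actual cpy.cu code or block layouts): the f32 and f16 copy paths differ only in the destination type, so one templated dequantize-and-store block covers both.

```cpp
// Illustrative only: a toy quant block and a copy function templated on the
// destination type, mirroring how a q->f16 path can be derived from the
// existing q->f32 path. Not the actual cpy.cu code or block layout.
#include <cstdint>
#include <cstdio>

struct block_q_toy {
    float  d;        // per-block scale
    int8_t qs[32];   // quantized values
};

template <typename dst_t>
static void cpy_blck_q_toy(const block_q_toy * src, dst_t * dst) {
    for (int i = 0; i < 32; ++i) {
        dst[i] = (dst_t)(src->d * src->qs[i]);   // dequantize into dst type
    }
}

int main() {
    block_q_toy b{0.5f, {}};
    for (int i = 0; i < 32; ++i) b.qs[i] = (int8_t)(i - 16);

    float f32[32];
    cpy_blck_q_toy<float>(&b, f32);   // the existing q->f32 shape of the code
    // The q->f16 path would instantiate the same template with a half type;
    // float is used here so the sketch builds without CUDA headers.
    printf("dequantized[0] = %f\n", f32[0]);
}
```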

Perplexity tests and inference now work consistently with -ctk q4_0, q4_1, q5_0 and q5_1 in that scenario, with the expected values and behavior.

The exception is q6_0, which sees its perplexity multiplied by 100. (I suspect the CUDA dequantize_q6_0 is incompatible with this PR for some reason, but that is beyond what I can fix.)

-ctk iq4_nl, which does not yet have a dequantize_iq4_nl function, is not usable that way for now.
2025-11-24 08:56:38 +01:00
Kawrakow
920f424929 Support GigaChat3 (#995)
* Fixing Gigachat support

* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)

* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-24 06:55:14 +01:00
Kawrakow
232050b473 Attempt to fix #974 (#983)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-19 15:48:39 +01:00
Kawrakow
d764edd652 Fuse sum_rows and div with topk-moe (#984)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-19 13:44:09 +01:00
Kawrakow
054c31cf8f Fuse Q and K RoPE (#980)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-19 09:08:42 +01:00
Kawrakow
0157f78061 Minor 2025-11-18 08:55:36 +00:00
Kawrakow
03da76eb05 Fix RoPE cache on multi-GPU setup (#966)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 11:50:48 +02:00
Kawrakow
37d72f9878 Fix ggml_cuda_fattn_is_supported (#968)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 11:50:29 +02:00
Kawrakow
c64e3e3482 Fix fused up+gate when mmq is not supported (#952)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:59:27 +02:00
Kawrakow
a1f60b3535 Add missing AVX512 operators for MSVC (#948)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:51 +02:00
Kawrakow
ce3ce97a29 Fix repacked legacy quants (#951)
* Fix q5_0_r4

The issue was in the tail part. As almost all models have tensor
rows whose length is a multiple of 128, that part was never triggered in testing.
But the gpt-oss models have an embedding size of 2880, so we end
up there and trigger the bug.
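
For illustration, the blocking pattern looks roughly like this (toy code, not the actual iqk gemm): full chunks of 128 values per row go through the main path and a separate tail handles the remainder, so the tail only runs when the row length is not a multiple of 128.

```cpp
// Illustration of why the tail path went untested (toy code, not the actual
// iqk gemm): when the row length is a multiple of 128 the remainder loop
// never executes. gpt-oss, with n_embd = 2880 = 22*128 + 64, is what finally
// exercised it here.
#include <cstdio>

static void process_row(int n_per_row) {
    constexpr int kBlock = 128;
    const int n_full = (n_per_row / kBlock) * kBlock;

    for (int i = 0; i < n_full; i += kBlock) {
        // main path: one full block of 128 values at a time
    }
    for (int i = n_full; i < n_per_row; ++i) {
        // tail path: leftover values, the code the fix touches
    }
    printf("n_per_row = %4d -> full blocks: %2d, tail: %d\n",
           n_per_row, n_full / kBlock, n_per_row - n_full);
}

int main() {
    process_row(4096);   // multiple of 128: tail = 0, tail path never runs
    process_row(2880);   // gpt-oss embedding size: tail = 64
}
```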

* Fix q6_0_r4

Same fix as q5_0_r4

* Fix q4_0_r8

* Fix q5_0_r4 and q6_0_r4 also on Zen4

* Fix q4_0_r8 also on Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 15:35:37 +02:00
Kawrakow
32edcb4b74 Fix rope_norm_fast_cuda (#945)
* Fix rope_norm_fast_cuda

* One more

* Also fix mrope and vision

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 08:54:37 +02:00
Kawrakow
0d97b9c0bf Enable fusion by default (#939)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 10:35:48 +02:00
Kawrakow
219fe93973 Opt from #880 also for iqk cuda gemv (#938)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 10:01:34 +02:00
Kawrakow
86e2bec04e DeepSeek FA optimizations (#929)
* Use new-new-mma also for MLA=3, and use mask bounds

This gives us ~25% better PP at 32k tokens compared to main

* This seems better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 09:55:30 +02:00
Kawrakow
adba641347 DeepSeek TG optimizations (#928)
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1

Combined effect: about +2% in TG performance with full GPU offload

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 09:52:07 +02:00
Kawrakow
bf474e9bff Use fused gemv+add only for TG (#933)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 08:34:24 +02:00
Kawrakow
56ee303254 Make biased gemv fusion optional (#931)
* Make biased gemv fusion optional

* Fix one path through gemv fusion

* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-09 19:09:47 +02:00
Lennart Lopin
fd37776584 Add ARM Grace Blackwell (NVIDIA DGX Spark) support (#922)
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.

Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions

Build requirements for ARM:
  cmake .. -DGGML_CUDA=ON \
           -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
           -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"

Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory

Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
2025-11-09 14:22:40 +02:00
Kawrakow
5cc15d0ecf CUDA MoE improvements (#923)
* Use mmq_id in mul_mat_id

* Better

* Also use it in the fused up+gate op

* Better -no-fmoe TG on CUDA

Still much slower than -fmoe, but about 20-25% faster than what
we had before.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-09 11:34:33 +02:00
Kawrakow
defa6945b3 CUDA: fuse copies to K and V cache (#921)
* Fuse copies to K- and V-cache on CUDA

* Adapt to latest main

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-08 18:13:58 +02:00
Kawrakow
3614c4f098 Adopt fix from mainline PR 17089 (#920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-08 07:44:20 +02:00
Kawrakow
d0850dccc8 Disable add + fused_rms_norm fusion (#916)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 19:38:18 +02:00
Kawrakow
1c31b25380 Fix PPL increase caused by mmq_id (#913)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 18:58:09 +02:00
Kawrakow
532a05e466 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Kawrakow
49befdd4fb Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-06 19:07:46 +02:00
Kawrakow
50f95d7bf3 Disable CUDA fusion by default for now (#903)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:58:12 +02:00
Kawrakow
92607d44c4 Much better CPU TG performance at long context for GLM-4.5 (#899)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:20:26 +02:00
Kawrakow
98357d9aa5 Adding cmake option to disable CUDA fusion (#902)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 07:09:27 +02:00
Kawrakow
11feb49562 Fix compilation failure after merging #883 (#900)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 19:28:52 +02:00
Thireus ☠
86597623a5 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sycl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
c23fda2103 Disable some fusion, RoPE cache off by default (#894)
* Disable some fusion and make rope cache off by default

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 07:50:14 +02:00
Kawrakow
fb0d5a995c RoPE cache (#887)
* Introducing rope cache

When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?

This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.
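
A minimal CPU sketch of that idea (hypothetical helper names, not the actual ggml implementation): the cos/sin table is built once from the token positions and then reused for the Q and K rotation in every layer.

```cpp
// Sketch only: precompute RoPE cos/sin once per batch and reuse across layers.
// In a real model each layer rotates its own fresh Q/K tensors; the point here
// is that the angle table itself never changes between layers.
#include <cmath>
#include <cstdio>
#include <vector>

struct rope_cache {
    int head_dim;
    std::vector<float> cs;   // [n_pos * head_dim]: cos,sin interleaved per pair
};

static rope_cache build_rope_cache(const std::vector<int> & pos, int head_dim, float freq_base) {
    rope_cache c{head_dim, std::vector<float>(pos.size() * head_dim)};
    for (size_t p = 0; p < pos.size(); ++p) {
        for (int i = 0; i < head_dim; i += 2) {
            const float theta = pos[p] * std::pow(freq_base, -float(i) / head_dim);
            c.cs[p * head_dim + i]     = std::cos(theta);
            c.cs[p * head_dim + i + 1] = std::sin(theta);
        }
    }
    return c;
}

// NeoX-style rotation of one head using the cached angles; called for both
// Q and K in every layer without recomputing any angle.
static void apply_rope(float * x, const rope_cache & c, size_t p) {
    const int half = c.head_dim / 2;
    for (int j = 0; j < half; ++j) {
        const float cv = c.cs[p * c.head_dim + 2 * j];
        const float sv = c.cs[p * c.head_dim + 2 * j + 1];
        const float x0 = x[j], x1 = x[j + half];
        x[j]        = x0 * cv - x1 * sv;
        x[j + half] = x0 * sv + x1 * cv;
    }
}

int main() {
    const std::vector<int> pos = {0, 1, 2, 3};
    const rope_cache cache = build_rope_cache(pos, /*head_dim=*/8, /*freq_base=*/10000.0f);

    for (int layer = 0; layer < 4; ++layer) {        // same cache, every layer
        std::vector<float> q(8, 1.0f), k(8, 1.0f);   // stand-ins for this layer's Q/K
        apply_rope(q.data(), cache, /*p=*/3);
        apply_rope(k.data(), cache, /*p=*/3);
        printf("layer %d: q[0] = %f\n", layer, q[0]);
    }
}
```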

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 18:42:20 +02:00
Kawrakow
846e736e85 cuda: add missing backwards RoPE op (#889)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 07:45:18 +02:00
Kawrakow
55a704b67a Fused Q and K fused_rms_norm for TG on CUDA (#882)
* Biased mmvq: minor optimization

* Fusing Q and K rms_norm for TG on CUDA

* Remove commented out code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:41:28 +02:00
Kawrakow
cfb840379f Biased mmvq: minor optimization (#880)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:21:18 +02:00
Kawrakow
0459f595d7 CUDA: correctly detect if flash attention is supported (#875)
* Don't use vector kernels if K or V are quantized

* Correctly determine if FA is supported

* Also wmma

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-29 13:56:16 +02:00
Nexes the Elder
d50c2490fc correct typo (#876) 2025-10-28 19:01:45 +02:00
firecoperana
904e994bfb Support --device and --device-draft parameter (#866)
* add --device and --device-draft parameter

* don't print debug message in release mode

* fix

* bug fix to throw exception when no device specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
Kawrakow
eb8116b097 Even more fused ops (#868)
* Fuse Q, K, V gemv+add

* More gemv+add fusing

* Faster copy when tensors are contiguous

Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)
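
Roughly, the fast path looks like the following sketch (toy 2-D view type, not the ggml implementation): when source and destination are both contiguous, the whole copy collapses to a single memcpy instead of a per-row strided loop.

```cpp
// Toy illustration of a contiguous fast path for tensor copies; the real
// code operates on ggml tensors, this just shows the branch.
#include <cstdio>
#include <cstring>
#include <vector>

struct view2d {
    float * data;
    int     rows, cols;
    int     row_stride;                         // in elements; == cols when contiguous
    bool contiguous() const { return row_stride == cols; }
};

static void copy_view(const view2d & src, view2d & dst) {
    if (src.contiguous() && dst.contiguous()) { // fast path: single memcpy
        std::memcpy(dst.data, src.data, sizeof(float) * src.rows * src.cols);
        return;
    }
    for (int r = 0; r < src.rows; ++r) {        // general path: row by row
        std::memcpy(dst.data + r * dst.row_stride,
                    src.data + r * src.row_stride,
                    sizeof(float) * src.cols);
    }
}

int main() {
    std::vector<float> a(4 * 8, 1.0f), b(4 * 8, 0.0f);
    view2d src{a.data(), 4, 8, 8};
    view2d dst{b.data(), 4, 8, 8};
    copy_view(src, dst);
    printf("b[0] = %f\n", b[0]);
}
```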

* Cleanup

* Make sure the bias really is 1 row to use fusion

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-27 16:09:01 +02:00
Kawrakow
e34399c116 CUDA: fuse ffn_up*unary_op(ffn_gate) for MMVQ (V2) (#864)
* Args for MMVQ functions

* WIP

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)

We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.
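
The shape of the fusion, as a plain CPU sketch with hypothetical names (the actual change is in the MMVQ CUDA kernels): the up and gate dot products and the unary op are computed per output row in one pass, instead of two separate matvecs followed by an elementwise multiply.

```cpp
// Sketch of the fused up*unary(gate) pattern (plain CPU code, toy shapes,
// not the MMVQ kernel): each output element is produced in a single pass
// over the activations x.
#include <cmath>
#include <cstdio>
#include <vector>

static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// W_up and W_gate are [n_out x n_in], row-major; x is [n_in].
static std::vector<float> ffn_up_gate_fused(const std::vector<float> & w_up,
                                            const std::vector<float> & w_gate,
                                            const std::vector<float> & x,
                                            int n_out, int n_in) {
    std::vector<float> y(n_out);
    for (int r = 0; r < n_out; ++r) {
        float up = 0.0f, gate = 0.0f;
        for (int c = 0; c < n_in; ++c) {         // one pass over x per row
            up   += w_up  [r * n_in + c] * x[c];
            gate += w_gate[r * n_in + c] * x[c];
        }
        y[r] = silu(gate) * up;                  // unary op applied right away
    }
    return y;
}

int main() {
    const int n_in = 4, n_out = 2;
    std::vector<float> w_up(n_out * n_in, 0.5f), w_gate(n_out * n_in, 0.25f), x(n_in, 1.0f);
    auto y = ffn_up_gate_fused(w_up, w_gate, x, n_out, n_in);
    printf("y[0] = %f\n", y[0]);
}
```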

* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)

* Fusing also for iqk/trellis/repacked quants

* Fusing mmvq also in non-MoE up+gate

* Fuse mul_mat_id and add_id into a single kernel for mmvq

* Also iqk quants

* Split mmvq.cu and iqk_mmvq.cu into separate template instances

* Put iqk mmvq implementations into template instances

* Somehow I forgot to change the ggml_type in the legacy template calls

* Add diagnostics

* Disable assert

* Fix TG fused up*unary(gate) when down cannot be fused

The wrong memory buffer got used in that case

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-26 17:08:50 +02:00
Kawrakow
70c0095e11 Faster tensor name formatting (#860)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: command line argument to disable it

* Faster tensor name formatting

We gain ~1% for Ling-mini-2.0 when running on CUDA.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:46:18 +03:00
Kawrakow
0549be76e5 Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
ed4e1a6588 Fuse add+add+fused_rms (#853)
* Fuse add+add+fused_rms

* Try this

* Macro to easily enable/disable fusion

* Various:

* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection

The issue there was that the bias was not per row, but per expert
group, so only the first n_per_group biases were used
for all experts.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-22 16:18:11 +03:00
Kawrakow
af5bf60cc8 Hopefully this fixes #854 (#855)
* Hopefully this fixes #854

* Also this one

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 19:07:23 +03:00
Kawrakow
366d66bc1a Fuse add + fused_rms_norm (CUDA) (#852)
* Combine all calls to llm_build_norm into a single line

so one can more easily check what kind of arguments are being passed
by simply using grep.

* Combine add + fused_rms_norm

For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer and is then typically normalized via
fused_rms_norm.
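
A minimal CPU sketch of the fused op (illustrative only, not the CUDA kernel), assuming the usual rms_norm definition: the residual add and the normalization run together, so the intermediate sum never needs to be materialized as a separate tensor.

```cpp
// out = rms_norm(x + residual) * weight, with the add folded into the norm.
#include <cmath>
#include <cstdio>
#include <vector>

static void add_rms_norm_fused(const float * x, const float * residual,
                               const float * weight, float * out,
                               int n, float eps = 1e-6f) {
    std::vector<float> tmp(n);
    float sumsq = 0.0f;
    for (int i = 0; i < n; ++i) {
        tmp[i] = x[i] + residual[i];             // residual add
        sumsq += tmp[i] * tmp[i];
    }
    const float scale = 1.0f / std::sqrt(sumsq / n + eps);
    for (int i = 0; i < n; ++i) {
        out[i] = tmp[i] * scale * weight[i];     // rms norm + weight
    }
}

int main() {
    const int n = 4;
    std::vector<float> x(n, 1.0f), res(n, 2.0f), w(n, 1.0f), out(n);
    add_rms_norm_fused(x.data(), res.data(), w.data(), out.data(), n);
    printf("out[0] = %f\n", out[0]);   // all inputs equal -> ~1.0
}
```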

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-21 14:29:50 +03:00