Commit Graph

30 Commits

Author SHA1 Message Date
Kawrakow
63d0389e18 WIP split mode attn
Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.
2025-12-01 09:34:14 +00:00
Kawrakow
a8cb1860b3 Guards against using merge_qkv with split mode "graph" 2025-12-01 07:11:54 +00:00
Kawrakow
a27904877a Minor 2025-11-30 18:05:13 +00:00
Kawrakow
eb9882407f Better 2025-11-30 18:05:13 +00:00
Kawrakow
072020a678 WIP tensor overrides
Runs with wrong results, don't see where the issue could be.
2025-11-30 18:05:13 +00:00
Kawrakow
9e1d14f9c3 WIP GLM4.5 - this works
PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.
2025-11-30 18:05:13 +00:00
Kawrakow
43f644e482 WIP GLM4.5 - runs with wrong results 2025-11-30 18:05:13 +00:00
Kawrakow
ceaa71bcc8 Rename split mode "row" to split mode "graph" 2025-11-30 18:05:13 +00:00
Kawrakow
094b51d5ae Make it work with partial offload
but no tensor overrides yet, just ngl < num_layers.
2025-11-30 18:05:13 +00:00
Kawrakow
5305590f04 Show memory used per device 2025-11-30 18:05:13 +00:00
Kawrakow
7d84dca29e Fix attn split
Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.
2025-11-30 18:05:13 +00:00
Kawrakow
4bcfb40711 This is slightly better
the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even thiough the two can be done in parallel up to the sunchronization
point. So, all that is left to do is to trick the scheduler to create
to splits that can be done in parallel, and then have a graph split
where the results get combined.
2025-11-30 18:05:13 +00:00
Kawrakow
97143330a1 This works, but it is slow 2025-11-30 18:05:13 +00:00
Kawrakow
5d68e4eb35 WIP: it runs with wrong result
But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entore FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling
2025-11-30 18:05:12 +00:00
Kawrakow
32c6df015b WIP 2025-11-30 18:05:12 +00:00
Kawrakow
920f424929 Support GigaChat3 (#995)
* Fixing Gigachat support

* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)

* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-24 06:55:14 +01:00
Kawrakow
d72206dd79 Add mqkv and rcache for Gemma3 (#972)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 19:10:41 +02:00
Kawrakow
4d003e29ee Allow distinct output tensor for Gemma models (#969)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 12:12:41 +02:00
Kawrakow
668c37d4cf DeepSeek: enable option to merge Q and K tensors (#941)
* Merge Q and K for DeepSeek

* Formatting

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 08:23:04 +02:00
Kawrakow
263be6670b Add support for SmolLM3 (#934)
* Convert from HF

* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 15:40:12 +02:00
firecoperana
e15a215e6b model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Kawrakow
cb30f8e057 Merge Q and K into a single tensor (#892)
* Merge Q and K into a single tensor

* Make V mul mat follow QK mul mat

so they can be fused, which gives a slightly bbetter TG performance.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:54:36 +02:00
Thireus ☠
86597623a5 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sysl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
55a704b67a Fused Q and K fused_rms_norm for TG on CUDA (#882)
* Biased mmvq: minor optimization

* Fusing Q and K rms_norm for TG on CUDA

* Remove commented out code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:41:28 +02:00
Kawrakow
56fc5454ff Merge Q, K, V (#878)
* POC: merge Q, K, V into a single, contiguous tensor

Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems it makes sense to do it in general given
the TG performance gain.

* WIP

* merge_qkv: it works for gpt-oss

...but we see a smaller TG gain (~1.5%)

* WIP

* Don't ignore the return value of create_tensors()

else, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend is trying to use mmap,
but that no longer works.

* merge_qkv: bias can be required, optional, or mandatory

* merge_qkv: glm4.5moe

* merge_qkv: add command loine argument to enable

* merge_qkv: fix tensor dimensions

* merge_qkv: llama-4

* merge_qkv: qwen3 (dense)

* merge_qkv: simplify build_qwen3moe

* cohere2 - simplify graph building

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-30 10:49:48 +02:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
ba9fefb73d gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Kawrakow
78409c95ff Fix performance regression introduced in #823 (#826)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 08:09:55 +03:00
Kawrakow
764eefd1bc Enable and clean up compiler warnings in src (#824)
* WIP: enable and clean up warnings in src

* All warnings handled

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 16:01:13 +03:00
Kawrakow
4daff01b39 Refactor file llama.cpp (#823)
* llama_model and llama_hparams

* llama_build_context

Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)

* LLM_TN

llama.cpp compilation: 50 s -> 33 s

* llama_quantize

* arch names

* All graph building is now in llm-build-context.cpp

* hparams loading

llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.

* We are now at 6 seconds to build the src folder

* load -> create

We are not actually loading the tensors, but just creating them.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 11:35:20 +03:00