Commit Graph

15 Commits

Author SHA1 Message Date
Kawrakow
f1191036b2 Support GigaChat3 (#995)
* Fixing Gigachat support

* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)

* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-24 06:55:14 +01:00
Kawrakow
294aec2bc2 Add mqkv and rcache for Gemma3 (#972)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 19:10:41 +02:00
Kawrakow
3008fdf0b6 Allow distinct output tensor for Gemma models (#969)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 12:12:41 +02:00
Kawrakow
9e2b21fbc9 DeepSeek: enable option to merge Q and K tensors (#941)
* Merge Q and K for DeepSeek

* Formatting

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 08:23:04 +02:00
Kawrakow
e4145c013f Add support for SmolLM3 (#934)
* Convert from HF

* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 15:40:12 +02:00
firecoperana
0378f38c27 model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Kawrakow
1a3aaa33c1 Merge Q and K into a single tensor (#892)
* Merge Q and K into a single tensor

* Make V mul mat follow QK mul mat

so they can be fused, which gives a slightly bbetter TG performance.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:54:36 +02:00
Thireus ☠
5536e99d42 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sysl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
8c8a7fb7c8 Fused Q and K fused_rms_norm for TG on CUDA (#882)
* Biased mmvq: minor optimization

* Fusing Q and K rms_norm for TG on CUDA

* Remove commented out code

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:41:28 +02:00
Kawrakow
14760aaf46 Merge Q, K, V (#878)
* POC: merge Q, K, V into a single, contiguous tensor

Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems it makes sense to do it in general given
the TG performance gain.

* WIP

* merge_qkv: it works for gpt-oss

...but we see a smaller TG gain (~1.5%)

* WIP

* Don't ignore the return value of create_tensors()

else, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend is trying to use mmap,
but that no longer works.

* merge_qkv: bias can be required, optional, or mandatory

* merge_qkv: glm4.5moe

* merge_qkv: add command loine argument to enable

* merge_qkv: fix tensor dimensions

* merge_qkv: llama-4

* merge_qkv: qwen3 (dense)

* merge_qkv: simplify build_qwen3moe

* cohere2 - simplify graph building

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-30 10:49:48 +02:00
Kawrakow
9d364b88ba Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
8d0d01a593 gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Kawrakow
0030bc89c9 Fix performance regression introduced in #823 (#826)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 08:09:55 +03:00
Kawrakow
0ad1d34090 Enable and clean up compiler warnings in src (#824)
* WIP: enable and clean up warnings in src

* All warnings handled

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 16:01:13 +03:00
Kawrakow
335a1f9b71 Refactor file llama.cpp (#823)
* llama_model and llama_hparams

* llama_build_context

Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)

* LLM_TN

llama.cpp compilation: 50 s -> 33 s

* llama_quantize

* arch names

* All graph building is now in llm-build-context.cpp

* hparams loading

llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.

* We are now at 6 seconds to build the src folder

* load -> create

We are not actually loading the tensors, but just creating them.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 11:35:20 +03:00