Commit Graph

3994 Commits

Author SHA1 Message Date
Iwan Kawrakow
ca03d07bb6 Add --chat-template-file to usage 2025-11-14 11:07:32 +02:00
Kawrakow
9e2b21fbc9 DeepSeek: enable option to merge Q and K tensors (#941)
* Merge Q and K for DeepSeek

* Formatting

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 08:23:04 +02:00
firecoperana
ba1a753106 Fix cuda init error in rpc (#957)
Co-authored-by: firecoperana <firecoperana>
2025-11-14 06:59:54 +02:00
Kawrakow
716d318fd8 Fix fused up+gate when mmq is not supported (#952)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:59:27 +02:00
Kawrakow
459bf5812d Add missing AVX512 operators for MSVC (#948)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:51 +02:00
Kawrakow
41bde27541 Graph reuse (#947)
* Add mainline compatible FA command line option

* Graph reuse: add command line argument to turn it on

* WIP

* This seems to work

* This is perhaps cleaner

* Change the command line option to -gr

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:19 +02:00
Iwan Kawrakow
be1a8cb9d8 Fix flash attention long argument for mainloine compatibility 2025-11-13 19:22:16 +02:00
Kawrakow
f4202c812e Fix repacked legacy quants (#951)
* Fix q5_0_r4

The issue waqs in the tail part. As almost all models have tensor
rows that are multiple of 128, that part was never triggered in testing.
But ithe gpt-oss models have an embedding size of 2880, so we end
up there and trigger the bug.

* Fix q6_0_r4

Same fix as q5_0_r4

* Fix q4_0_r8

* Fix q5_0_r4 and q6_0_r4 also on Zen4

* Fix q4_0_r8 also on Zen4

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 15:35:37 +02:00
Kawrakow
38abd0e289 Set default MLA to 3 also in llama-bench (#949)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 09:52:06 +02:00
Kawrakow
bbc127d10e Add mainline compatible FA command line option (#944)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 08:55:33 +02:00
Kawrakow
5f73351638 Fix rope_norm_fast_cuda (#945)
* Fix rope_norm_fast_cuda

* One more

* Also fix mrope and vision

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 08:54:37 +02:00
Kawrakow
8a8de91a42 Set mla=3 by default (#943)
so more recent users that haven't followed the history of FlashMLA
evolution and hence don't know about the MLA options get the best setting
without having to add -mla 3 on the command line.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-12 11:00:58 +02:00
Kawrakow
9ecfee6c03 Enable fusion by default (#939)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 10:35:48 +02:00
Kawrakow
463c694676 Opt from #880 also for iqk cuda gemv (#938)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 10:01:34 +02:00
Kawrakow
5e7f6711e4 Add --n-cpu-moe to llama_bench (#937)
* Add --n-cpu-moe to llama_banch

* Add usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:44:59 +02:00
Kawrakow
1e6f8ff89a Add rcache to llama-bench (#936)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:06:18 +02:00
Kawrakow
489554bc11 Minor: remove unnecesssary calls to build_inp_out_ids (#935)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 17:38:46 +02:00
Kawrakow
e4145c013f Add support for SmolLM3 (#934)
* Convert from HF

* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 15:40:12 +02:00
Kawrakow
a313b71bf8 DeepSeek FA optimizations (#929)
* Use new-new-mma also for MLA=3, and use mask bounds

This gives us ~25% better PP at 32k tokens compared to main

* This seems better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 09:55:30 +02:00
Kawrakow
7747000f3b DeepSeek TG optimizations for TG (#928)
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1

Combined effect: about +2% in TG performance with full GPU offload

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 09:52:07 +02:00
firecoperana
9dfbc69aee Server: Add --draft-params to set draft model parameter via command line args (#932)
* Add command line argument for draft model

* Remove second context of draft model

* Format print

* print usage if parsing -draft fails

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-10 09:51:07 +02:00
Kawrakow
ad688e10f4 Use fused gemv+add only for TG (#933)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 08:34:24 +02:00
Kawrakow
db3bed2461 Make biased gemv fusion optional (#931)
* Make biased gemv fusion optional

* Fix one path through gemv fusion

* Remove forgotten printf

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-09 19:09:47 +02:00
Iwan Kawrakow
0db683e478 Fix compiler warning 2025-11-09 14:35:59 +02:00
Lennart Lopin
1da9c218b0 Add ARM Grace Blackwell (NVIDIA DGX Spark) support (#922)
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.

Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions

Build requirements for ARM:
  cmake .. -DGGML_CUDA=ON \
           -DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
           -DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"

Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory

Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
2025-11-09 14:22:40 +02:00
firecoperana
ff4c1c6eb3 server: bug fix for preserved_tokens not preserved in process_token (#926)
Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:29 +02:00
firecoperana
03235a4b11 Fix embedding missing, CORS and crash using verbose in server (#924)
* server: fix crash when prompt has image and is too long

* server: fix CORS

* server: fix empty result for embedding

* change error message to truncate prompt

* server: fix slot id for save and load state

* bug fix

* server: update slot similarity to handle mtmd

* server: quick hack to calculate number of token processed with image

* server: fix out of range error when detokenizing prompt under verbose

* Add back Access-Control-Allow-Origin

* Server: Add prompt tokens in embedding results

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:03 +02:00
Kawrakow
9207a48ab2 CUDA MoE improvements (#923)
* Use mmq_id in mul_mat_id

* Better

* Also use it in the fused up+gate op

* Better -no-fmoe TG on CUDA

Still much slower than -fmoe, but abot 20-25% faster than what
we had before.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-09 11:34:33 +02:00
Kawrakow
e5fc02c71a CUDA: fuse copies to K and V cache (#921)
* Fuse copies to K- and V-cache on CUDA

* Adapt to latest main

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-08 18:13:58 +02:00
Kawrakow
55576c93b2 Adopt fix from mainline PR 17089 (#920)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-08 07:44:20 +02:00
Kawrakow
d62e8c51ed Disable add + fused_rms_norm fusion (#916)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 19:38:18 +02:00
Kawrakow
12b9902395 Fix PPL increase caused by mmq_id (#913)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 18:58:09 +02:00
Nexes the Elder
6a805c73b4 More informative PPL readout line (#914)
* More informative PPL readout line

* trailing whitespace..
2025-11-07 16:41:24 +02:00
Kawrakow
9d0b834405 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Kawrakow
665434e5ec Fix iqk_mul_mat when number of rows is not multiple of repack rows (#911)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-06 19:07:46 +02:00
firecoperana
0378f38c27 model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Iwan Kawrakow
575e2c2821 Fix compiler warning 2025-11-06 07:12:07 +02:00
firecoperana
d0dfbf9771 Bug fixes for completions and prompt caching in server (#906)
* Bug fixes for completions and prompt caching in server

* Fix compiler warning about redefinition

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-06 07:10:51 +02:00
Kawrakow
320fc606cd Disable CUDA fusion by default for now (#903)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:58:12 +02:00
Kawrakow
1a3aaa33c1 Merge Q and K into a single tensor (#892)
* Merge Q and K into a single tensor

* Make V mul mat follow QK mul mat

so they can be fused, which gives a slightly bbetter TG performance.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:54:36 +02:00
Kawrakow
abb966eba1 Allow quantization of ffn_gate_inp (#896)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:44:32 +02:00
firecoperana
15159a87d4 Add vision support in llama-server (#901)
* server: add support for vision model
webui: add support for vision model

* server : remove hack for extra parallel slot#10187

* llama : fix KV shift for qwen2vl #13870

* add no-context-shift parameter

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-05 10:43:46 +02:00
Kawrakow
5b38d431ac Much better CPU TG performance at long context for GLM-4.5 (#899)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:20:26 +02:00
Kawrakow
85035d606c Adding cmake option to disable CUDA fusion (#902)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 07:09:27 +02:00
Kawrakow
dfa9689e72 Fix compilation failure after merging #883 (#900)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 19:28:52 +02:00
Thireus ☠
5536e99d42 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sysl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
7e956a32ce sweep-bench: be able to set TG tokens via -n (#897)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 14:39:30 +02:00
Kawrakow
cd8d0b0832 Disable some fusion, RoPE cache off by default (#894)
* Disable some fusion and make rope cahe off by default

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 07:50:14 +02:00
Kawrakow
1cfd19862f RoPE cache (#887)
* Introducing rope cache

When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?

This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 18:42:20 +02:00
Kawrakow
d890b9fee0 cuda: add missing backwards RoPE op (#889)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 07:45:18 +02:00