Commit Graph

4135 Commits

Author SHA1 Message Date
Kawrakow
a6651d017a Change graph key 2026-01-20 15:35:53 +00:00
Kawrakow
c307525cb2 Use CUDA graphs also when there are tensor overrides 2026-01-20 15:26:47 +00:00
Kawrakow
6f1a69352f Fuse experts bias in top_k_moe kernel (#1170)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0

* Fuse bias in top_k_moe kernel if present
2026-01-20 15:38:51 +02:00
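
A minimal sketch of the idea behind fusing the expert bias into the top_k_moe step: instead of a separate add pass over the router logits, the bias is applied while the scores are scanned for the top-k experts. All names here (`RouterChoice`, `top_k_with_bias`, `n_experts`, `k`) are illustrative stand-ins, not the actual kernel interface.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

struct RouterChoice { float score; int32_t expert; };

// Select the top-k experts for one token, folding the optional bias in on the fly.
std::vector<RouterChoice> top_k_with_bias(const float * logits, const float * bias,
                                          int n_experts, int k) {
    std::vector<RouterChoice> all(n_experts);
    for (int e = 0; e < n_experts; ++e) {
        float s = logits[e];
        if (bias) s += bias[e];          // fused: no separate bias-add pass over the logits
        all[e] = {s, e};
    }
    std::partial_sort(all.begin(), all.begin() + k, all.end(),
                      [](const RouterChoice & a, const RouterChoice & b) { return a.score > b.score; });
    all.resize(k);
    return all;
}
```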
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
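
A hedged sketch of the "avoid ggml_get_rows for TG" idea: with a single-token batch, the set of output rows is the whole (one-row) tensor, so the explicit row gather is the identity and can be elided. `build_output_rows`, `tensor_ref`, and their arguments are hypothetical stand-ins, not the real graph-building API.

```cpp
#include <cstdint>

struct tensor_ref { /* opaque handle into the compute graph */ };

tensor_ref build_output_rows(tensor_ref hidden, const int32_t * out_ids, int n_out, int n_tokens,
                             tensor_ref (*get_rows)(tensor_ref, const int32_t *, int)) {
    if (n_tokens == 1 && n_out == 1) {
        return hidden;                       // TG: the single row is already the output row
    }
    return get_rows(hidden, out_ids, n_out); // PP: gather only the rows whose logits are needed
}
```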
Kawrakow
132a01d25d GLM-4.7-Flash support (#1168)
* GLM-4.7-Flash support

* Model type

* Make FA work for mla != 0
2026-01-20 12:46:52 +02:00
Kawrakow
ef5f17940c sampling: refactor sorting (#1166)
* sampling: refactor sorting

* Couldn't look at it without fixing it.
2026-01-19 16:48:54 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
fa58c20c42 A hopefully more efficient adaptive_p sampling (#1161)
* A hopefully more efficient adaptive_p sampling

* While at it, let's fix the formatting too

* More formatting

* Correctly accumulate sampling time for adaptive_p
2026-01-19 15:01:55 +02:00
Kawrakow
6a5c180be9 Fix bf16 additions on CUDA arch < Ampere (#1164)
* Fix bf16 additions on CUDA arch < Ampere

* Prevent using NCCL if graph reduce type is bf16 and arch < AMPERE
2026-01-19 12:27:52 +02:00
Kawrakow
0c0b6e4b8b Copy reduce result to other GPUs if necessary (#1156) 2026-01-19 08:40:26 +02:00
dungquixote42
6dfbef27ec Adaptive p: bugfix + optimization + refactor (#1155)
* adaptive-p sampler: fix zeroed orig_probs bug and refactor

- Fix bug where original probabilities were captured as zero by calculating
  them from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities,
  filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to follow most other samplers.

* resolve merge bug

* adaptive-p: revert reordering function definitions
2026-01-18 08:26:06 +02:00
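
A sketch of the relevance filter this commit describes: candidates whose logit falls more than a fixed delta (16.6f) below the maximum contribute a negligible probability (exp(-16.6) is on the order of 6e-8) and are dropped, while the survivors are tracked by token id in an unordered_map. This mirrors the commit's description, not its exact code; the function name and signature are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <unordered_map>

std::unordered_map<int32_t, float> relevant_probs(const float * logits, const int32_t * ids, int n,
                                                  float delta = 16.6f) {
    float max_logit = logits[0];
    for (int i = 1; i < n; ++i) max_logit = std::max(max_logit, logits[i]);

    std::unordered_map<int32_t, float> probs;
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        if (max_logit - logits[i] > delta) continue;   // irrelevant: probability is effectively zero
        float p = std::exp(logits[i] - max_logit);
        probs[ids[i]] = p;
        sum += p;
    }
    for (auto & kv : probs) kv.second /= sum;          // normalize over the kept candidates
    return probs;
}
```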
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a bug somewhere

* This finally works
2026-01-18 08:02:49 +02:00
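
A hedged illustration of why a BF16 reduce type halves inter-GPU traffic for split mode graph: the fp32 partial results are packed to bf16 before the copy and widened back before accumulation. bf16 keeps the top 16 bits of the fp32 bit pattern; the rounding below is a simple round-to-nearest that ignores NaN handling, and the function names are illustrative only.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

static inline uint16_t f32_to_bf16(float f) {
    uint32_t u; std::memcpy(&u, &f, sizeof(u));
    u += 0x8000;                      // round to nearest (ties toward larger magnitude)
    return static_cast<uint16_t>(u >> 16);
}

static inline float bf16_to_f32(uint16_t h) {
    uint32_t u = static_cast<uint32_t>(h) << 16;
    float f; std::memcpy(&f, &u, sizeof(f));
    return f;
}

// Pack a partial-result buffer for transfer, then accumulate it on the receiving side.
std::vector<uint16_t> pack_reduce_buffer(const float * src, size_t n) {
    std::vector<uint16_t> out(n);
    for (size_t i = 0; i < n; ++i) out[i] = f32_to_bf16(src[i]);
    return out;
}

void accumulate_reduce_buffer(float * dst, const uint16_t * src, size_t n) {
    for (size_t i = 0; i < n; ++i) dst[i] += bf16_to_f32(src[i]);
}
```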
firecoperana
ee463b079e Webui: add text completions and adaptive_p sampling (#1153)
* Webui: add text completions and adaptive_p sampling

* update description

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-17 08:37:07 +02:00
Kawrakow
709e1a5375 Fixing split mode graph with many GPUs (#1152)
* Attempt to fix the many GPU issue in split mode graph

* WIP: this seems more stable

Still hanging after a while if I try to use all 7 GPUs

* Reenable OpenMP in scheduler async

Seems solid up to 4 GPUs. It did hang with --max-gpu 6.

* printf cleanup
2026-01-17 08:05:24 +02:00
Kawrakow
cb1063f6cd Fix experts/shared experts split (#1147) 2026-01-14 15:35:16 +02:00
hksdpc255
3a0b234669 Add context management to the MiroThinker template (simulate official agent behavior) (#1143) 2026-01-13 18:08:59 +02:00
firecoperana
672df48ed1 server: keep logit bias unchanged when client does not set it (#1144)
Co-authored-by: firecoperana <firecoperana>
2026-01-13 18:08:09 +02:00
Kawrakow
0adff91363 Make adding tensor overrides to llama-bench table optional (#1141) 2026-01-13 11:08:13 +02:00
Kawrakow
9d9ed6a032 Add -sas, --scheduler-async to llama-bench (#1140) 2026-01-13 10:23:50 +02:00
hksdpc255
e1c4c4a495 Fix Anthropic Messages API (#1136)
* server: stop processing the prompt when client disconnects

implement generator-based API for task results

Update httplib.h to 0.27.0

Fix embedding error

Stop prompt processing when disconnected

* Port upstream https://github.com/ggml-org/llama.cpp/pull/18551

* add back anthropic

* Fix merge issue caused by github webui

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-13 08:37:29 +02:00
Kawrakow
013831bba5 Fix compilation errors 2026-01-13 08:12:49 +02:00
Kawrakow
978202a754 Merge ffn_up and ffn_gate experts tensors (part 2) (#1139)
* Add ability to merge up+gate exps to more models

* We need to of course pass the merged tensor to build_ffn

* All the others

* Also Qwen3VL-MoE

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-13 08:07:52 +02:00
hksdpc255
54a1f68d32 Add chat parser for MiroThinker (#1138)
* Add chat parser for MiroThinker

* Add MiroThinker template

* delete space
2026-01-13 08:07:12 +02:00
firecoperana
1a461525d5 server: stop processing the prompt when client disconnects (#1134)
implement generator-based API for task results

Update httplib.h to 0.27.0

Fix embedding error

Stop prompt processing when disconnected

Co-authored-by: firecoperana <firecoperana>
2026-01-13 07:56:59 +02:00
Kawrakow
d3e3ad40f9 Compiler warning and white space 2026-01-12 19:06:17 +02:00
Kawrakow
c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors was unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
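
A sketch of the host-side repack described above: for every expert, the ffn_up and ffn_gate rows are copied back-to-back into one merged buffer, so a single matrix multiplication later yields both the up and the gate activations. The exact layout (gate first or up first, per-expert blocks vs. full interleaving) is an assumption here; the commit only says the repack goes through host buffers.

```cpp
#include <cstring>
#include <vector>

// up, gate: expert-after-expert buffers of expert_bytes each.
std::vector<char> merge_up_gate(const char * up, const char * gate,
                                int n_experts, size_t expert_bytes) {
    std::vector<char> merged(2 * static_cast<size_t>(n_experts) * expert_bytes);
    for (int e = 0; e < n_experts; ++e) {
        char * dst = merged.data() + 2 * static_cast<size_t>(e) * expert_bytes;
        std::memcpy(dst,                up   + static_cast<size_t>(e) * expert_bytes, expert_bytes); // up half
        std::memcpy(dst + expert_bytes, gate + static_cast<size_t>(e) * expert_bytes, expert_bytes); // gate half
    }
    return merged;
}
```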
bndlfm
bf0c6c57bb addOpenGLRunpath -> autoAddDriverRunpath in .devops/nix/package.nix (#1135) 2026-01-12 15:16:37 +02:00
Kawrakow
738dc60b78 We don't need these 2026-01-10 15:32:21 +00:00
Kawrakow
c7348f6f55 Fix mla = 0 (#1130)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-10 10:34:30 +02:00
Kawrakow
c7dba35702 Update AUTHORS (#1129)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-10 08:10:21 +02:00
firecoperana
c03ee1a4d2 server: improve speed of speculative decoding (#1119)
* server: improve speed of speculative decoding

change logs

rpc: add recompute

spec dec fix

* Fix n_batch_size not set to context size for draft model

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
dungquixote42
52ad1c6421 Implement Adaptive-P Sampler (#1100)
* initial implementation of adaptive-p sampler

* explicitly mark candidates unsorted + cleanup qualifiers

* cosmetic update

* reorg prototypes

* lockstep with mainline

* add _impl for _init + reorg

* add LLAMA_API to prototypes

* update sharpness to 10

* lockstep: rng seed

* delete llama_sampling member in llama_sampler_adaptive_p

* fix LLAMA_API return type

* lockstep: rng seed cont

* actually correct implementation

* lockstep: sorting behavior

* const -> constexpr for known constants

* add missing space

* fix softmax usage in adaptive p sampler

* cosmetic changes

* implement do-not-sort version of softmax

* simplify rng seed, add static to constexpr

* refactor: remove iface + use shared rng + use actually original probabilities

* adaptive-p: add dedicated rng back in

* fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling

* adaptive-p: fuse first softmax with transformation

* adaptive-p: implement binary search selection

* adaptive-p: update comment
2026-01-10 07:58:53 +02:00
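
A sketch of the "binary search selection" step mentioned in the last bullets: after the candidate probabilities are accumulated into a running CDF, a single uniform draw is located with std::upper_bound instead of a linear scan. The adaptive-p transformation itself (the "sharpness" reshaping of the distribution) is not reproduced here; this only illustrates the selection step, with an illustrative function name.

```cpp
#include <algorithm>
#include <random>
#include <vector>

int sample_from_cdf(const std::vector<float> & probs, std::mt19937 & rng) {
    std::vector<float> cdf(probs.size());
    float acc = 0.0f;
    for (size_t i = 0; i < probs.size(); ++i) { acc += probs[i]; cdf[i] = acc; }

    std::uniform_real_distribution<float> dist(0.0f, acc);   // acc is ~1, but use the true sum
    float r = dist(rng);
    auto it = std::upper_bound(cdf.begin(), cdf.end(), r);   // first entry with cdf > r
    if (it == cdf.end()) --it;                                // guard against r landing on acc
    return static_cast<int>(it - cdf.begin());
}
```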
Kawrakow
dd3c3f72f2 Fix split mode graph for GPT-OSS with partial offload (#1128)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-10 07:57:43 +02:00
Kawrakow
08a0da389c Better VRAM utilization strategy for split mode graph (#1126)
* Better VRAM utilization strategy for split mode graph

* Fix assert when --max-gpu is less than available GPUs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-09 13:36:02 +02:00
Kawrakow
8725d110d2 Fix data races in the reduce op (#1124)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-09 10:34:58 +02:00
Kawrakow
eaf2e1c15a Split mode "graph" for Ernie-4.5-MoE (#1121)
* Ernie-4.5-MoE split mode graph

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 16:46:41 +02:00
Kawrakow
0456aa47d3 Do not abort on NCCL initialization failure (#1120)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 09:19:50 +02:00
Kawrakow
5ef98f8b0f Split mode "graph" for GPT-OSS (#1118)
* Split mode "graph" for GPT-OSS

* Force split_mode_f16 to false

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 09:14:15 +02:00
firecoperana
9c1bef35e8 CUDA: compress-mode size (#1110)
Co-authored-by: firecoperana <firecoperana>
2026-01-07 18:33:17 +02:00
Kawrakow
99fbd84971 Split mode "graph" for Hunyuan-MoE (#1116)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 13:38:08 +02:00
Kawrakow
ab1616767b Enable up to 4 GPUs for Mimo2-Flash (#1115)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 09:40:29 +02:00
Kawrakow
a82dcbf3ee Fix ring reduction (#1114)
* Fix ring reduction

* Actually enable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 08:01:31 +02:00
Kawrakow
54a513768c Disable ring reduction for now (#1112)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-06 15:40:50 +02:00
Kawrakow
3c99284b67 Split mode 'graph' for Qwen3-VL (#1107)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 17:32:00 +02:00
Kawrakow
218dcc5727 Split mode graph for Qwen3 (#1106)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 14:31:36 +02:00
Kawrakow
419a397ce0 Graph parallel for Mimo-V2-Flash (#1105)
* WIP

* Cleanup

* Set max_gpu to 2 for Mimo2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 09:58:54 +02:00
Kawrakow
385fc14110 Fix race in CUDA FA for head sizes 192/128 (#1104)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:21:07 +02:00
Kawrakow
ab50c6cdcb Mimo-V2-Flash support (#1096)
* Mimo-2 support

* Fix bug for head sizes not being the same

It still does not solve the Mimo-2 quantized cache issue.

* Fix quantized cache

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:00:01 +02:00
firecoperana
56dceefd6b Fix windows build with CUDA (#1101)
Co-authored-by: firecoperana <firecoperana>
2026-01-05 07:59:23 +02:00