Commit Graph

1115 Commits

Author SHA1 Message Date
firecoperana
d71a3ec315 Server: refactor and rename functions (#1151)
* Server: rename functions and refactor code

rename functions

refactor update slots

rename params_base

rename timings

* change

* Revert kv cache name changes

* Revert 2

* fix test build error

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
firecoperana
ee463b079e Webui: add text completions and adaptive_p sampling (#1153)
* Webui: add text completions and adaptive_p sampling

* update description

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-17 08:37:07 +02:00
firecoperana
672df48ed1 server: keep logit bias unchanged when client does not set it (#1144)
Co-authored-by: firecoperana <firecoperana>
2026-01-13 18:08:09 +02:00
Kawrakow
0adff91363 Make adding tensor overrides to llama-bench table optional (#1141) 2026-01-13 11:08:13 +02:00
Kawrakow
9d9ed6a032 Add -sas, --scheduler-async to llama-bench (#1140) 2026-01-13 10:23:50 +02:00
hksdpc255
e1c4c4a495 Fix Anthropic Messages API (#1136)
* server: stop processing the prompt when client disconnects

implement generator-based API for task results

Update httplib.h to 0.27.0

Fix embedding error

Stop prompt processing when disconnected

* Port upstream https://github.com/ggml-org/llama.cpp/pull/18551

* add back anthropic

* Fix merge issue caused by github webui

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-13 08:37:29 +02:00
firecoperana
1a461525d5 server: stop processing the prompt when client disconnects (#1134)
implement generator-based API for task results

Update httplib.h to 0.27.0

Fix embedding error

Stop prompt processing when disconnected

Co-authored-by: firecoperana <firecoperana>
2026-01-13 07:56:59 +02:00
Kawrakow
d3e3ad40f9 Compiler warning and white space 2026-01-12 19:06:17 +02:00
Kawrakow
c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors was unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
firecoperana
c03ee1a4d2 server: improve speed of speculative decoding (#1119)
* server: improve speed of speculative decoding

change logs

rpc: add recompute

spec dec fix

* Fix n_batch_size not set to context size for draft model

---------

Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
dungquixote42
52ad1c6421 Implement Adaptive-P Sampler (#1100)
* initial implementation of adaptive-p sampler

* explicitly mark candidates unsorted + cleanup qualifiers

* cosmetic update

* reorg prototypes

* lockstep with mainline

* add _impl for _init + reorg

* add LLAMA_API to prototypes

* update sharpness to 10

* lockstep: rng seed

* delete llama_sampling member in llama_sampler_adaptive_p

* fix LLAMA_API return type

* lockstep: rng seed cont

* actually correct implementation

* lockstep: sorting behavior

* const -> constexpr for known constants

* add missing space

* fix softmax usage in adaptive p sampler

* cosmetic changes

* implement do-not-sort version of softmax

* simpify rng seed, add static to constexpr

* refactor: remove iface + use shared rng + use actually original probabilities

* adaptive-p: add dedicated rng back in

* fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling

* adaptive-p: fuse first softmax with transformation

* adaptive-p: implement binary search selection

* adaptive-p: update comment
2026-01-10 07:58:53 +02:00
firecoperana
56dceefd6b Fix windows build with CUDA (#1101)
Co-authored-by: firecoperana <firecoperana>
2026-01-05 07:59:23 +02:00
Kawrakow
17a5a80946 Fix Windows build (#1097) 2025-12-29 14:18:27 +01:00
firecoperana
903377bc34 Webui: improve scroll and bug fixes (#1082)
* Webui: fix message scroll back due to setPending

smooth scroll

remove throttle

increase scroll margin

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html
#	examples/server/webui/src/utils/app.context.tsx

* webui: don't scroll to bottom when conversation changes or edit message

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html

* Webui: fix save config error

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html

* Webui: add api key to request model name

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html

* Update

* webui: fix loading dots display issue

# Conflicts:
#	examples/server/public/index.html.gz
#	examples/server/webui/dist/index.html
#	examples/server/webui/src/components/ChatMessage.tsx

* Webui: cancel scroll when user moves up

---------

Co-authored-by: firecoperana <firecoperana>
2025-12-24 12:30:26 +01:00
firecoperana
2a633c4357 server: exclude thinking tokens when finding the slot (#1079)
refactor find slot

enable by default

Fix load prompt

rename variables

Co-authored-by: firecoperana <firecoperana>
2025-12-22 09:46:45 +01:00
firecoperana
0e91b89cd3 Refactor chat and server file (#1062)
* Add alternative log functions

* chat: fix int overflow, prevent size calculation in float/double (#17357)

* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

# Conflicts:
#	common/chat.cpp

* server: split server.cpp code into server/common/task/queue/context

* Fix compiler warning

* Clean up code

* common: use native MultiByteToWideChar

* move server prompt to server task

* Clean code

* delete utils.hpp

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
279b4c44fc Fix the fix (#1054)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 08:05:33 +01:00
Kawrakow
a2efa22f10 Fix llama-bench - missing buffer override comparison operator (#1053)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:21:06 +01:00
i4TsU
e61cd0303a QoL/bugfixes for llama-bench (#1052)
* include cuda-params and -ot in llama-bench output

* cleanup redundant type mapping

* fix wrong field name

* fix preexisting mistake in cuda_params help text (default value)

* fix preexisting mistake in kompute column header

* adjust code style to match current norms

* simplify/fix inverted columns

* fix field->value pairings/order

* remove dead field `f16_kv`

* sql printer deserves a way out too

* actually enable the new improvements....
2025-12-11 07:04:15 +01:00
Kawrakow
a719349982 POC: CUDA tensor parallel (MoE models) (#1022)
* Remove most of split mode row

* WIP

* WIP: also allocate the KV cache using tensor split

* WIP: it runs with wrong result

But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entore FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling

* WIP

* This works, but it is slow

* This is slightly better

the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even thiough the two can be done in parallel up to the sunchronization
point. So, all that is left to do is to trick the scheduler to create
to splits that can be done in parallel, and then have a graph split
where the results get combined.

* Playing games with the scheduler

This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.

* Fix attn split

Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.

* Show memory used per device

* Make it work with partial offload

but no tensor overrides yet, just ngl < num_layers.

* Allow for f16 source in fused_rms_norm

* This results in faster PP.

Now PP is faster than split mode layer for L3-70B.

* Rename split mode "row" to split mode "graph"

* Leave FFN partial results as f16

* WIP GLM4.5 - runs with wrong results

* WIP GLM4.5 - this works

PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.

* Work around compiler bug

It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disapears, so clearly
a compiler bug.

* Make graph reuse work with split mode graph

* Remove more split mode row remnants

* WIP tensor overrides

Runs with wrong results, don't see where the issue could be.

* This works but is slow

Still does not work for row-interleaved quants

* Slightly better

* Slightly better

* Row-interleaved quants work

* Better

* Minor

* Guarad against using split mode "graph" for unsupported models

* Guards against using merge_qkv with split mode "graph"

* WIP split mode attn

Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.

* Split mode graph for qwen3moe

* Try to better distribute the splits

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-01 19:25:40 +01:00
firecoperana
15771072c7 RPC: support multiple devices including cpu (#1024)
* RPC support multiple devices

* rpc : update documentation (#16441)

Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

# Conflicts:
#	examples/rpc/README.md

* Remove memory settings

* rpc : cache and reuse compute graphs (#15405)

Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.

* Add -cpu to include cpu backend

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
2025-11-30 18:48:02 +01:00
firecoperana
1cad1ec1cc Update grammar (#1023)
* grammar : fix JSON Schema for string regex with top-level alt. (#9903)

Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```

* grammars : add English-only grammar (#10612)

* grammar : handle maxItems == 0 in JSON schema (#13117)

Co-authored-by: Richard Lyons <frob@cloudstaff.com>

* grammar-parser : fix possible null-deref (#9004)

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680

Signed-off-by: David Korczynski <david@adalogics.com>

* llama : fix typo in llama-grammar.h [no ci] (#11816)

* * server: fix "--grammar-file" parameter (#12285)

* common : use std::string_view now that we target c++17 (#14319)

* json : support `enum` values within `allOf` (#15830)

* grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626)

* grammar : support array references in json schema (#16792)

* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts:
#	tests/test-json-schema-to-grammar.cpp

* merge fix

* llama : minor grammar refactor (#10897)

* llama: fix error on bad grammar (#12628)

* grammar : fix integer overflow (#17381)

* Fix DoS / integer overflow

* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)

* White space

* Actually, since it's unsigned, use UINT64_MAX
# Conflicts:
#	src/llama-grammar.cpp

* grammar: fix regression caused by #17381 (#17412)

* grammar: fix regression caused by #17381

* more readable
# Conflicts:
#	src/llama-grammar.cpp

* Merge Fix

* Fix warnings

---------

Signed-off-by: David Korczynski <david@adalogics.com>
Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: frob <rick+github@frob.com.au>
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
Co-authored-by: DavidKorczynski <david@adalogics.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-30 18:45:38 +01:00
firecoperana
869557c8fd Update mtmd to improve accuracy of M-RoPE (#993)
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (#14928)

* model : add LightOnOCR-1B model (#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (#16921)

* mtmd: improve struct initialization (#16981)

* mtmd: allow QwenVL to process larger image by default (#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (#17123)

* mtmd: fix patch_size initialized to random value in audio models (#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
2025-11-29 07:27:15 +01:00
hksdpc255
d24ea9e48e server : add Anthropic Messages API support (#1012) 2025-11-29 07:24:59 +01:00
Kawrakow
45cd1a70f5 Fix llama-bench mla parameter (#1016)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-27 09:33:30 +01:00
firecoperana
0b126b2ca6 Fix prompt tokenization issue during prompt processing (#1008)
* Find common tokens between prompt and cache
Fix wrong context size usage for mtmd
Use start position of common part
server: handle context shift

* Add size check for inexact match

* Change

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-26 10:34:26 +01:00
firecoperana
07d08e15ad webui update (#1003)
webui: add system message in export conversation, support upload conversation with system message
Webui: show upload only when in new conversation
Webui: Add model name
webui: increase height of chat message window when clicking editing
Webui: autoclose settings dialog dropdown and maximze screen width when zoom in
webui: fix date issues and add more dates
webui: change error to toast.error.
server: add n_past and slot_id in props_simple
webui: add cache tokens, context and prompt speed in chat
webui: modernize ui
webui: change welcome message
webui: change speed display
webui: change run python icon
webui: add config to use server defaults for sampler
webui: put speed on left and context on right

webui: recognize AsciiDoc files as valid text files (#16850)

* webui: recognize AsciiDoc files as valid text files

* webui: add an updated static webui build

* webui: add the updated dependency list

* webui: re-add an updated static webui build

Add a setting to display message generation statistics (#16901)

* feat: Add setting to display message generation statistics

* chore: build static webui output

webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757)

* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog

Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group

Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls

* webui: fullscreen HTML preview dialog using bits-ui

* Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: pedantic style tweak for CodePreviewDialog close button

* webui: remove overengineered preview language logic

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

webui: auto-refresh /props on inference start to resync model metadata (#16784)

* webui: auto-refresh /props on inference start to resync model metadata

- Add no-cache headers to /props and /slots
- Throttle slot checks to 30s
- Prevent concurrent fetches with promise guard
- Trigger refresh from chat streaming for legacy and ModelSelector
- Show dynamic serverWarning when using cached data

* fix: restore proper legacy behavior in webui by using unified /props refresh

Updated assistant message bubbles to show each message's stored model when available,
falling back to the current server model only when the per-message value is missing

When the model selector is disabled, now fetches /props and prioritizes that model name
over chunk metadata, then persists it with the streamed message so legacy mode properly
reflects the backend configuration

* fix: detect first valid SSE chunk and refresh server props once

* fix: removed the slots availability throttle constant and state

* webui: purge ai-generated cruft

* chore: update webui static build

feat(webui): improve LaTeX rendering with currency detection (#16508)

* webui : Revised LaTeX formula recognition

* webui : Further examples containg amounts

* webui : vitest for maskInlineLaTeX

* webui: Moved preprocessLaTeX to lib/utils

* webui: LaTeX in table-cells

* chore: update webui build output (use theirs)

* webui: backslash in LaTeX-preprocessing

* chore: update webui build output

* webui: look-behind backslash-check

* chore: update webui build output

* Apply suggestions from code review

Code maintenance (variable names, code formatting, string handling)

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: Moved constants to lib/constants.

* webui: package woff2 inside base64 data

* webui: LaTeX-line-break in display formula

* chore: update webui build output

* webui: Bugfix (font embedding)

* webui: Bugfix (font embedding)

* webui: vite embeds assets

* webui: don't suppress 404 (fonts)

* refactor: KaTeX integration with SCSS

Moves KaTeX styling to SCSS for better customization and font embedding.

This change includes:
- Adding `sass` as a dev dependency.
- Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding.
- Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables.

* fix: LaTeX processing within blockquotes

* webui: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

server : add props.model_alias (#16943)

* server : add props.model_alias

webui: fix keyboard shortcuts for new chat & edit chat title (#17007)

Better UX for handling multiple attachments in WebUI (#17246)

webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618)

* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to toggle tool call visibility to the Developer
  settings section directly under the model selector option

* webui: remove scroll listener causing unnecessary layout updates (model selector)

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: npm run format & update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

webui: Fix clickability around chat processing statistics UI (#17278)

* fix: Better pointer events handling in chat processing info elements

* chore: update webui build output

Fix merge error

webui: Add a "Continue" Action for Assistant Message (#16971)

* feat: Add "Continue" action for assistant messages

* feat: Continuation logic & prompt improvements

* chore: update webui build output

* feat: Improve logic for continuing the assistant message

* chore: update webui build output

* chore: Linting

* chore: update webui build output

* fix: Remove synthetic prompt logic, use the prefill feature by sending the conversation payload ending with assistant message

* chore: update webui build output

* feat: Enable "Continue" button based on config & non-reasoning model type

* chore: update webui build output

* chore: Update packages with `npm audit fix`

* fix: Remove redundant error

* chore: update webui build output

* chore: Update `.gitignore`

* fix: Add missing change

* feat: Add auto-resizing for Edit Assistant/User Message textareas

* chore: update webui build output

Improved file naming & structure for UI components (#17405)

* refactor: Component iles naming & structure

* chore: update webui build output

* refactor: Dialog titles + components namig

* chore: update webui build output

* refactor: Imports

* chore: update webui build output

webui: hide border of button

webui: update

webui: update

webui: update

add vision

webui: minor settings reorganization and add disable autoscroll option (#17452)

* webui: added a dedicated 'Display' settings section that groups visualization options

* webui: added a Display setting to toggle automatic chat scrolling

* chore: update webui build output

Co-authored-by: firecoperana <firecoperana>
2025-11-24 07:03:45 +01:00
gapeleon
f6163dd58f Fix: Register missing /apply-template endpoint (#999)
The handle_apply_template handler was defined but never registered as
  an HTTP endpoint, causing 404 errors when calling POST /apply-template.

  This commit adds the missing endpoint registration to match the
  functionality available in llama.cpp mainline.

Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
2025-11-24 06:53:15 +01:00
Yap Sok Ann
7505165dee Fix truncated logprobs when streaming is off (#998)
The logic to skip the logprobs of the stop token was originally from
ggml-org/llama.cpp#2849, and was later modified as part of
ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD.

The latter change wasn't included in #723. Then, after #958 got merged,
the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2,
resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, such that
the logprobs of the stop token will always be included in the response,
when logprobs is enabled. From testing, this matches with the behavior
of Fireworks inference server, for both chat completions and text
completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
2025-11-24 06:52:15 +01:00
Kawrakow
0f6986a33c Disable split mode "row" (#987)
* Disable split mode "row"

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-19 16:15:50 +01:00
firecoperana
bacb8fb79f Server: Handle context shift better to reduce prompt processing time (#973)
* Handle context shift better to reduce pp

Add context-shift args

Add back ga_n in context shift

* optimize discard function and bring back n_keep = -1

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-19 16:04:48 +01:00
firecoperana
b40d11b22d Fix kv cache save and load for GLM model (#965)
Co-authored-by: firecoperana <firecoperana>
2025-11-15 17:04:16 +02:00
firecoperana
5ec0def0ef Fix compiler warnings (#963)
* Fix changes meaning warnings

* A couple of more warnings and formatting

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-15 07:07:15 +02:00
firecoperana
bb358223cd server: cache prompt to host memory (#954)
* server : host-memory prompt caching

change similarity calculation and prompt save conditions

Remove unneeded token limit

rename variable

Separate prompt save and load logic

change default values

change log

remove truncate prompt logic

* add description

* bug fixes

* remove token limit in init

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-14 18:40:13 +02:00
firecoperana
177b5d2a47 Fix cuda init error in rpc (#957)
Co-authored-by: firecoperana <firecoperana>
2025-11-14 06:59:54 +02:00
Kawrakow
6b9d1bf4b4 Graph reuse (#947)
* Add mainline compatible FA command line option

* Graph reuse: add command line argument to turn it on

* WIP

* This seems to work

* This is perhaps cleaner

* Change the command line option to -gr

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:19 +02:00
Kawrakow
88c02fa108 Set default MLA to 3 also in llama-bench (#949)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 09:52:06 +02:00
Kawrakow
25cd985c9b Add --n-cpu-moe to llama_bench (#937)
* Add --n-cpu-moe to llama_banch

* Add usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:44:59 +02:00
Kawrakow
121ed91165 Add rcache to llama-bench (#936)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:06:18 +02:00
firecoperana
eea6cc4433 Server: Add --draft-params to set draft model parameter via command line args (#932)
* Add command line argument for draft model

* Remove second context of draft model

* Format print

* print usage if parsing -draft fails

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-10 09:51:07 +02:00
firecoperana
73c28dbef4 server: bug fix for preserved_tokens not preserved in process_token (#926)
Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:29 +02:00
firecoperana
b63309a918 Fix embedding missing, CORS and crash using verbose in server (#924)
* server: fix crash when prompt has image and is too long

* server: fix CORS

* server: fix empty result for embedding

* change error message to truncate prompt

* server: fix slot id for save and load state

* bug fix

* server: update slot similarity to handle mtmd

* server: quick hack to calculate number of token processed with image

* server: fix out of range error when detokenizing prompt under verbose

* Add back Access-Control-Allow-Origin

* Server: Add prompt tokens in embedding results

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:03 +02:00
Nexes the Elder
f9a411e5db More informative PPL readout line (#914)
* More informative PPL readout line

* trailing whitespace..
2025-11-07 16:41:24 +02:00
Kawrakow
532a05e466 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Kawrakow
66ef68bc14 Fix compiler warning 2025-11-06 07:12:07 +02:00
firecoperana
18f5a6caef Bug fixes for completions and prompt caching in server (#906)
* Bug fixes for completions and prompt caching in server

* Fix compiler warning about redefinition

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-06 07:10:51 +02:00
Kawrakow
e68f50be9a Allow quantization of ffn_gate_inp (#896)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:44:32 +02:00
firecoperana
7978f04996 Add vision support in llama-server (#901)
* server: add support for vision model
webui: add support for vision model

* server : remove hack for extra parallel slot#10187

* llama : fix KV shift for qwen2vl #13870

* add no-context-shift parameter

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-05 10:43:46 +02:00
Thireus ☠
86597623a5 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sysl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
efcb5f9d9e sweep-bench: be able to set TG tokens via -n (#897)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 14:39:30 +02:00