Commit Graph

4076 Commits

Author SHA1 Message Date
Iwan Kawrakow
2de3a96510 Avoid computing the attention reduce op for cohere2 2025-12-24 10:14:58 +00:00
Iwan Kawrakow
dc1770b8df Adding fused_norm - same idea as fused_rms_norm 2025-12-24 08:18:56 +00:00
Kawrakow
1d7d0225a0 Graph parallel: the next generation (#1080)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitely set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Implement the reduce op without NCCL available

* Be able to build without NCCL

cmake -DGGML_NCCL=OFF disables it

* Make --max-gpu work again

* Slightly better for 4 GPUs without NCCL

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 08:31:48 +01:00
firecoperana
5562605076 server: exclude thinking tokens when finding the slot (#1079)
refactor find slot

enable by default

Fix load prompt

rename variables

Co-authored-by: firecoperana <firecoperana>
2025-12-22 09:46:45 +01:00
Kawrakow
21fc9322f9 cuda: set device to src device before p2p copy (#1073)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 12:50:34 +01:00
Nexes the Elder
7bb79eff48 add split-mode-graph-scheduling parameter (#1068)
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides is used.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2025-12-17 07:58:19 +01:00
Kawrakow
51eea5715f Better PP performance with split mode "graph" and 3+ GPUs (#1069)
* This should do the trick for PP

* Command line option to set max. extra VRAM that the scheduler can use

* Fix bug and cleanup

* Looks like with this change it is working with tensor overrides

* Nah, it is not working

* OK, this seems to be working

* Disable split scheduling with tensor overrides

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-17 07:40:25 +01:00
Kawrakow
8ccceff4e9 Much better TG speed with split mode "graph" (#1067)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-16 19:48:20 +01:00
firecoperana
756c3f8f43 Fix log issue for llama-cli (#1071)
Co-authored-by: firecoperana <firecoperana>
2025-12-16 18:12:16 +01:00
firecoperana
269cc761db Add back the fix for Kimi-K2 tool-call parsing issues (#1070)
Co-authored-by: firecoperana <firecoperana>
2025-12-16 14:44:47 +01:00
firecoperana
090f354d33 Refactor chat and server file (#1062)
* Add alternative log functions

* chat: fix int overflow, prevent size calculation in float/double (#17357)

* chat: fix int overflow, prevent size calculation in float/double

* Update common/chat.cpp

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)

# Conflicts:
#	common/chat.cpp

* server: split server.cpp code into server/common/task/queue/context

* Fix compiler warning

* Clean up code

* common: use native MultiByteToWideChar

* move server prompt to server task

* Clean code

* delete utils.hpp

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00
Kawrakow
0a36cea555 Use actual active number of layers when preparing splits (#1065)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-14 07:44:13 +01:00
Kawrakow
f90d1fdd06 Split mode "graph" for Cohere2 (#1061)
* This works and TG is descent, but PP is low

* Better

* Apply f_logit_scale before mul mat with output tensor

* This is better for PP: 600 t/s -> 700 t/s

* To not lose this again

* WIP

* Equal split

* WIP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 20:30:08 +01:00
Kawrakow
5645be6cfc Fix sync logic (#1064)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 18:40:49 +01:00
Kawrakow
f667bd58b0 Undo sync reduction (#1063)
I'm finding issues for Qwen3-MoE

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-13 16:58:32 +01:00
Kawrakow
df02c39650 Do not use split mode graph scheduling if there are tensor overrides (#1060)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:48:38 +01:00
Kawrakow
cc14d4a3cc Fix overflow in offset calculation in mmq (#1059)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 14:31:06 +01:00
Kawrakow
b74fb479af Be able to enable or disable P2P via command line argument (#1058)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 13:36:42 +01:00
Kawrakow
0698501ae2 Slightly faster TG for split mode "graph" (#1057)
* Rearrange graph nodes

So that we can do graph portions that are the same on 2 or more
GPUs at the same time.

* Separate graph compute implementation for split mode graph

* This is better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-12 07:54:37 +01:00
Kawrakow
6a0e72aeae Fix #1055 (#1056)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 14:44:32 +01:00
abc-nix
0feb046e6b enable peer access (NVlink) (#1050)
* enable peer access for cuda

* Remove redundant loop
2025-12-11 08:31:56 +01:00
Kawrakow
59dba9f778 Fix the fix (#1054)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 08:05:33 +01:00
Kawrakow
9484d150d8 Be able to set a max. number of GPUs to be used in split mode graph (#1051)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:22:53 +01:00
Kawrakow
6a5a707ac0 Fix llama-bench - missing buffer override comparison operator (#1053)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:21:06 +01:00
Kawrakow
00d939c811 Reduce back-end syncs (#1049)
* Reduce backend synchronization calls

* Also this

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-11 07:04:44 +01:00
i4TsU
62f907c663 QoL/bugfixes for llama-bench (#1052)
* include cuda-params and -ot in llama-bench output

* cleanup redundant type mapping

* fix wrong field name

* fix preexisting mistake in cuda_params help text (default value)

* fix preexisting mistake in kompute column header

* adjust code style to match current norms

* simplify/fix inverted columns

* fix field->value pairings/order

* remove dead field `f16_kv`

* sql printer deserves a way out too

* actually enable the new improvements....
2025-12-11 07:04:15 +01:00
Kawrakow
53f693a708 KV cache read/write for split mode "graph" (#1048)
* Handle split cache (write)

* Handle split cache (read)

* Fix writing the data twice

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-09 06:50:53 +01:00
Djip007
808ce4907c Unroll for loop for repacked BF16 MATMUL (#1047)
see https://github.com/ikawrakow/ik_llama.cpp/discussions/1028 for
detail
2025-12-08 06:09:45 +01:00
Kawrakow
c9fcfb9a7a Fix annoying compiler warnings (#1042)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 09:59:07 +01:00
Kawrakow
87f6943e4b Automatically disable CUDA graphs for split mode "graph" (#1040)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-06 07:38:02 +01:00
Kawrakow
a3737f4296 CUDA: set current device in compute_forward (#1039)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 16:47:50 +01:00
firecoperana
e741ec8a5d CUDA: Fix FA for Pascal GPU (#1036)
Co-authored-by: firecoperana <firecoperana>
2025-12-05 16:42:14 +01:00
Kawrakow
f4def9b300 Don't split the output tensor (#1038)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 15:56:53 +01:00
Kawrakow
b43801a2d2 Fix debug build (#1037)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-05 14:06:22 +01:00
Kawrakow
b715342e82 K-cache Hadamard transforms (CUDA) (#1034)
* Hadamard transforms for K-cache on CUDA

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 18:46:22 +01:00
Kawrakow
658ced0abd Hadamard transforms for K-cache - CPU only (#1033)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
Kawrakow
08961718f3 Allow empty splits (#1029)
* Allow empty splits

* Fix type, add additional asserts

* Fix also output

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 13:52:41 +01:00
Kawrakow
bcb218102d Use standard attention for Ministral3 (#1032)
Required adding the "temperature scaling" to the standard attention
implementation.

But in this way split mode "graph" is automatically supported.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 13:43:31 +01:00
Kawrakow
74c56067b4 Fix bug in ggml_cuda_op_scale_tensor (#1031)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 11:32:19 +01:00
Kawrakow
fcc2df11df Adding ministral3: this seems to work (#1030)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-03 11:01:21 +01:00
Kawrakow
40097e7e41 Slightly better graph split strategy (#1026)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-02 18:50:52 +01:00
Kawrakow
8e3041b263 POC: CUDA tensor parallel (MoE models) (#1022)
* Remove most of split mode row

* WIP

* WIP: also allocate the KV cache using tensor split

* WIP: it runs with wrong result

But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entore FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling

* WIP

* This works, but it is slow

* This is slightly better

the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU
one needs to wait for the computation on the first GPU to finish,
even thiough the two can be done in parallel up to the sunchronization
point. So, all that is left to do is to trick the scheduler to create
to splits that can be done in parallel, and then have a graph split
where the results get combined.

* Playing games with the scheduler

This change tricks it into doing the right thing^TM.
Still quite a bit slower than split mode layer for the 8B LlaMA model.
But for the 70B LlaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.

* Fix attn split

Granularity for Wq, Wo is not just head size, but
head size * gqa_ratio.
Else the Wk, Wv tensors end up not being a multiple of the
head size when we divide the split determined by Wo with
the gqa_ratio.

* Show memory used per device

* Make it work with partial offload

but no tensor overrides yet, just ngl < num_layers.

* Allow for f16 source in fused_rms_norm

* This results in faster PP.

Now PP is faster than split mode layer for L3-70B.

* Rename split mode "row" to split mode "graph"

* Leave FFN partial results as f16

* WIP GLM4.5 - runs with wrong results

* WIP GLM4.5 - this works

PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X of sm layer.

* Work around compiler bug

It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and turn the
functions inside into static, the warning disapears, so clearly
a compiler bug.

* Make graph reuse work with split mode graph

* Remove more split mode row remnants

* WIP tensor overrides

Runs with wrong results, don't see where the issue could be.

* This works but is slow

Still does not work for row-interleaved quants

* Slightly better

* Slightly better

* Row-interleaved quants work

* Better

* Minor

* Guarad against using split mode "graph" for unsupported models

* Guards against using merge_qkv with split mode "graph"

* WIP split mode attn

Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.

* Split mode graph for qwen3moe

* Try to better distribute the splits

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-01 19:25:40 +01:00
Kawrakow
507f3a4d14 Fix build with RPC not enabled (#1025)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-30 19:04:54 +01:00
firecoperana
e89064e657 RPC: support multiple devices including cpu (#1024)
* RPC support multiple devices

* rpc : update documentation (#16441)

Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

# Conflicts:
#	examples/rpc/README.md

* Remove memory settings

* rpc : cache and reuse compute graphs (#15405)

Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.

* Add -cpu to include cpu backend

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
2025-11-30 18:48:02 +01:00
firecoperana
52adcf1e90 Update grammar (#1023)
* grammar : fix JSON Schema for string regex with top-level alt. (#9903)

Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:

```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```

Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).

This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):

```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```

* grammars : add English-only grammar (#10612)

* grammar : handle maxItems == 0 in JSON schema (#13117)

Co-authored-by: Richard Lyons <frob@cloudstaff.com>

* grammar-parser : fix possible null-deref (#9004)

Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680

Signed-off-by: David Korczynski <david@adalogics.com>

* llama : fix typo in llama-grammar.h [no ci] (#11816)

* * server: fix "--grammar-file" parameter (#12285)

* common : use std::string_view now that we target c++17 (#14319)

* json : support `enum` values within `allOf` (#15830)

* grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626)

* grammar : support array references in json schema (#16792)

* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts:
#	tests/test-json-schema-to-grammar.cpp

* merge fix

* llama : minor grammar refactor (#10897)

* llama: fix error on bad grammar (#12628)

* grammar : fix integer overflow (#17381)

* Fix DoS / integer overflow

* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)

* White space

* Actually, since it's unsigned, use UINT64_MAX
# Conflicts:
#	src/llama-grammar.cpp

* grammar: fix regression caused by #17381 (#17412)

* grammar: fix regression caused by #17381

* more readable
# Conflicts:
#	src/llama-grammar.cpp

* Merge Fix

* Fix warnings

---------

Signed-off-by: David Korczynski <david@adalogics.com>
Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: frob <rick+github@frob.com.au>
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
Co-authored-by: DavidKorczynski <david@adalogics.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-30 18:45:38 +01:00
firecoperana
0a3e1d1449 Update mtmd to improve accuracy of M-RoPE (#993)
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (#14928)

* model : add LightOnOCR-1B model (#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (#16921)

* mtmd: improve struct initialization (#16981)

* mtmd: allow QwenVL to process larger image by default (#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (#17123)

* mtmd: fix patch_size initialized to random value in audio models (#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
2025-11-29 07:27:15 +01:00
hksdpc255
e7ecdb8f0d server : add Anthropic Messages API support (#1012) 2025-11-29 07:24:59 +01:00
Kawrakow
bcdd3031d8 Attempt to fix #1014 (#1017)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-27 15:58:18 +01:00
Kawrakow
e60f71887b Fix llama-bench mla parameter (#1016)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-27 09:33:30 +01:00
firecoperana
5f3485c2c2 Change default RPC order and fix wrong RPC server order in --device arg (#1011)
* Change default RPC order and fix wrong RPC order in --device arg

* Update

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-26 16:51:51 +01:00