But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
wait for GPU 0 to finish its entore FFN calculation before it can
start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling
* RPC support multiple devices
* rpc : update documentation (#16441)
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
# Conflicts:
# examples/rpc/README.md
* Remove memory settings
* rpc : cache and reuse compute graphs (#15405)
Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.
* Add -cpu to include cpu backend
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
* grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:
```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```
Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).
This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):
```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
* grammars : add English-only grammar (#10612)
* grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
* grammar-parser : fix possible null-deref (#9004)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680
Signed-off-by: David Korczynski <david@adalogics.com>
* llama : fix typo in llama-grammar.h [no ci] (#11816)
* * server: fix "--grammar-file" parameter (#12285)
* common : use std::string_view now that we target c++17 (#14319)
* json : support `enum` values within `allOf` (#15830)
* grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626)
* grammar : support array references in json schema (#16792)
* grammar : support array references in json schema
* Update json-schema-to-grammar.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* grammar : improve regex when naming ref derived rules
* grammar : replace non-conformant definitions array with anyOf test case
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts:
# tests/test-json-schema-to-grammar.cpp
* merge fix
* llama : minor grammar refactor (#10897)
* llama: fix error on bad grammar (#12628)
* grammar : fix integer overflow (#17381)
* Fix DoS / integer overflow
* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)
* White space
* Actually, since it's unsigned, use UINT64_MAX
# Conflicts:
# src/llama-grammar.cpp
* grammar: fix regression caused by #17381 (#17412)
* grammar: fix regression caused by #17381
* more readable
# Conflicts:
# src/llama-grammar.cpp
* Merge Fix
* Fix warnings
---------
Signed-off-by: David Korczynski <david@adalogics.com>
Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: frob <rick+github@frob.com.au>
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
Co-authored-by: DavidKorczynski <david@adalogics.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefices3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
conicidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# convert_hf_to_gguf_update.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/gguf_writer.py
# src/llama-vocab.cpp
# src/llama-vocab.h
* mtmd : support home-cooked Mistral Small Omni (#14928)
* model : add LightOnOCR-1B model (#16764)
* model : add LightOnOCR-1B model
* add test
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
* mtmd : fix idefics3 preprocessing (#16806)
* mtmd : fix idefics3 preprocessing
* disable granite test
* fix test for granite
* model: Add support for CogVLM model (#15002)
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# examples/mtmd/clip.cpp
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
# src/llama-arch.cpp
# src/llama-arch.h
# src/llama-model.cpp
# src/llama-model.h
* mtmd: refactor preprocessing + support max/min pixels (#16878)
* mtmd: refactor preprocessing + support max/min pixels
* fix mlp type
* implement mix/max pixels
* improve hparams
* better image preproc for qwen
* fix
* fix out of bound composite
* fix (2)
* fix token calculation
* get_merge_kernel_size()
* fix llama4 and lfm2
* gonna fix them all
* use simple resize for qwen
* qwen: increase min tokens
* no resize if dst size == src size
* restore to initial min/max tokens value for qwen
# Conflicts:
# examples/mtmd/clip.cpp
* clip : use FA (#16837)
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* model: add Janus Pro for image understanding (#16906)
* Add support for Janus Pro
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Address reviewer suggestions
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add JANUS_PRO constant
* Update clip model handling
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Refactor JANUS_PRO handling in clip.cpp
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* em whitespace
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
* mtmd: pad mask for qwen2.5vl (#16954)
* mtmd: pad mask for qwen2.5vl
* improve
* mtmd: add --image-min/max-tokens (#16921)
* mtmd: improve struct initialization (#16981)
* mtmd: allow QwenVL to process larger image by default (#17020)
* Disable flash attention
* mtmd : fix embedding size for image input (#17123)
* mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models
* add default hparams
* add llama_model_n_embd_inp
* Fix load qwen3 vl
Change batch size
* Add description
* Fix cli build error
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
* Fixing Gigachat support
* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)
* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mainline compatible FA command line option
* Graph reuse: add command line argument to turn it on
* WIP
* This seems to work
* This is perhaps cleaner
* Change the command line option to -gr
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
so more recent users that haven't followed the history of FlashMLA
evolution and hence don't know about the MLA options get the best setting
without having to add -mla 3 on the command line.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1
Combined effect: about +2% in TG performance with full GPU offload
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merge Q and K into a single tensor
* Make V mul mat follow QK mul mat
so they can be fused, which gives a slightly bbetter TG performance.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: add support for vision model
webui: add support for vision model
* server : remove hack for extra parallel slot#10187
* llama : fix KV shift for qwen2vl #13870
* add no-context-shift parameter
---------
Co-authored-by: firecoperana <firecoperana>
* Introducing rope cache
When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?
This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.
* cuda: neox works
* WIP
* rope_cache: norm works
* Fused rope+rope
* Fused rope+rope (norm)
* Fused rms+rms+rope+rope (neox) - not working
* WIP
* Also qwen3
* Add command line arg to disable rope cache
* Disable RoPE cache if rope type is not neox or norm
* Add missing break after merge with main
* Fused fused_rms+fused_rms+rope+rope (with -mqkv)
* Fused fused_rms+fused_rms+rope+rope (without -mqkv)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Biased mmvq: minor optimization
* Fusing Q and K rms_norm for TG on CUDA
* Remove commented out code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* POC: merge Q, K, V into a single, contiguous tensor
Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems it makes sense to do it in general given
the TG performance gain.
* WIP
* merge_qkv: it works for gpt-oss
...but we see a smaller TG gain (~1.5%)
* WIP
* Don't ignore the return value of create_tensors()
else, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend is trying to use mmap,
but that no longer works.
* merge_qkv: bias can be required, optional, or mandatory
* merge_qkv: glm4.5moe
* merge_qkv: add command loine argument to enable
* merge_qkv: fix tensor dimensions
* merge_qkv: llama-4
* merge_qkv: qwen3 (dense)
* merge_qkv: simplify build_qwen3moe
* cohere2 - simplify graph building
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse Q, K, V gemv+add
* More gemv+add fusing
* Faster copy when tensors are contiguous
Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)
* Cleanup
* Make sure the bias really is 1 row to use fusion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Change fmoe to be on by default
* Change default fmoe also in llama-bench
* Change flash attention to be on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: command line argument to disable it
* Faster tensor name formatting
We gain ~1% for Ling-mini-2.0 when running on CUDA.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for al experts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm to a single line
so more easily check what kind of arguments are being passed
by simply using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the ayer input, which then becomes the input
to the next layer, which then is typically normalized via
fused_rms_norm.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Do not allocate KV cache for unused layers
* Do not apply experts weight scale if it is 1
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attemt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>