But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with attn evaluation, GPU 1 must wait for GPU 0 to finish its
entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
wait for GPU 0 to finish its entire FFN calculation before it can
start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduler (see the sketch below)
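To make the point concrete, here is a plain C++ sketch (hypothetical, not ggml
scheduler code) of what a bespoke TP schedule would do instead: each device keeps
its own copy of the positions, runs the cheap per-token ops (rms_norm, RoPE)
locally on its shard, and therefore never waits for the other device's full
attn/FFN result.

```cpp
// Hypothetical TP sketch: both "devices" hold the positions and run rms_norm +
// RoPE on their own Q/K shard, so neither blocks on the other's computation.
#include <cmath>
#include <cstdio>
#include <thread>
#include <vector>

static void rms_norm(std::vector<float> & x, float eps = 1e-6f) {
    float ss = 0.0f;
    for (float v : x) ss += v*v;
    const float scale = 1.0f/std::sqrt(ss/x.size() + eps);
    for (float & v : x) v *= scale;
}

// rotate consecutive dimension pairs; the angles depend only on pos
static void rope(std::vector<float> & x, int pos, float freq_base = 10000.0f) {
    const int n = (int) x.size();
    for (int i = 0; i < n/2; ++i) {
        const float theta = pos*std::pow(freq_base, -2.0f*i/n);
        const float c = std::cos(theta), s = std::sin(theta);
        const float x0 = x[2*i], x1 = x[2*i + 1];
        x[2*i]     = x0*c - x1*s;
        x[2*i + 1] = x0*s + x1*c;
    }
}

int main() {
    const std::vector<int> positions = {0, 1, 2, 3};  // replicated to every device
    auto run_device = [&](int dev) {
        // each "GPU" normalizes and rotates its own slice of the heads locally
        for (int pos : positions) {
            std::vector<float> shard(8, 1.0f + dev);  // stand-in for this device's Q/K slice
            rms_norm(shard);
            rope(shard, pos);
        }
        std::printf("device %d: finished its shard independently\n", dev);
    };
    std::thread t0(run_device, 0), t1(run_device, 1);
    t0.join(); t1.join();
    return 0;
}
```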
* RPC: support multiple devices
* rpc : update documentation (#16441)
Update the README file to match the newly added functionality of
exposing multiple devices from a single server.
Co-authored-by: Diego Devesa <slarengh@gmail.com>
# Conflicts:
# examples/rpc/README.md
* Remove memory settings
* rpc : cache and reuse compute graphs (#15405)
Store the last computed graph and reuse it when possible.
Also do not return a response from GRAPH_COMPUTE and assume it always
completes successfully. If this is not the case, the server closes
the connection. This saves us a network round trip to the server.
(A minimal sketch of the caching idea follows after this commit's trailers.)
* Add -cpu to include cpu backend
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
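For illustration, a minimal sketch of the graph-caching idea from the commit
above, assuming the serialized graph bytes can simply be compared with the
previous request (the names are hypothetical, not the actual rpc-server code):

```cpp
// Hypothetical sketch: keep the last serialized graph and skip the rebuild when
// the client sends identical bytes. GRAPH_COMPUTE sends no response; on failure
// the server simply closes the connection, which the client treats as an error.
#include <cstdint>
#include <vector>

struct rpc_graph_cache {
    std::vector<uint8_t> last_graph;  // bytes of the previously received graph
    bool                 valid = false;

    // returns true if the already-built graph can be reused as-is
    bool try_reuse(const std::vector<uint8_t> & serialized) {
        if (valid && serialized == last_graph) {
            return true;              // same bytes -> same graph, nothing to rebuild
        }
        last_graph = serialized;      // remember for the next GRAPH_COMPUTE request
        valid      = true;
        return false;                 // caller deserializes and builds the graph
    }
};
```

The round-trip saving comes from the removed GRAPH_COMPUTE response; the cache
additionally avoids re-deserializing and rebuilding an identical graph.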
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefics3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image, which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately and then concatenated (see the
sketch after this commit's trailers).
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
coincidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# convert_hf_to_gguf_update.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/gguf_writer.py
# src/llama-vocab.cpp
# src/llama-vocab.h
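The double-newline mismatch described in the templating commit above can be
reproduced with a toy greedy tokenizer (made-up vocab, purely illustrative):
tokenizing the chunks separately and concatenating the results can never produce
the merged "\n\n" token that whole-string tokenization picks.

```cpp
// Toy greedy tokenizer over a made-up vocab, illustrating why tokenizing text
// blocks separately and concatenating the results yields two "\n" tokens where
// whole-string tokenization yields a single "\n\n" token.
#include <cstdio>
#include <string>
#include <vector>

static std::vector<std::string> tokenize(const std::string & text) {
    // "\n\n" is listed before "\n" so the merged token wins when it is present
    static const std::vector<std::string> vocab = {"\n\n", "\n", "<img>", "text"};
    std::vector<std::string> out;
    size_t i = 0;
    while (i < text.size()) {
        bool matched = false;
        for (const auto & tok : vocab) {
            if (text.compare(i, tok.size(), tok) == 0) {
                out.push_back(tok);
                i += tok.size();
                matched = true;
                break;
            }
        }
        if (!matched) { out.push_back(text.substr(i, 1)); ++i; }  // byte fallback
    }
    return out;
}

int main() {
    auto a = tokenize("text\n");                   // first block
    auto b = tokenize("\n<img>");                  // second block
    a.insert(a.end(), b.begin(), b.end());         // concatenated: text, \n, \n, <img>
    auto c = tokenize("text\n\n<img>");            // whole string: text, \n\n, <img>
    std::printf("block-wise : %zu tokens\n", a.size());  // 4
    std::printf("whole text : %zu tokens\n", c.size());  // 3
    return 0;
}
```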
* mtmd : support home-cooked Mistral Small Omni (#14928)
* model : add LightOnOCR-1B model (#16764)
* model : add LightOnOCR-1B model
* add test
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
* mtmd : fix idefics3 preprocessing (#16806)
* mtmd : fix idefics3 preprocessing
* disable granite test
* fix test for granite
* model: Add support for CogVLM model (#15002)
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# examples/mtmd/clip.cpp
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
# src/llama-arch.cpp
# src/llama-arch.h
# src/llama-model.cpp
# src/llama-model.h
* mtmd: refactor preprocessing + support max/min pixels (#16878)
* mtmd: refactor preprocessing + support max/min pixels
* fix mlp type
* implement min/max pixels
* improve hparams
* better image preproc for qwen
* fix
* fix out of bound composite
* fix (2)
* fix token calculation
* get_merge_kernel_size()
* fix llama4 and lfm2
* gonna fix them all
* use simple resize for qwen
* qwen: increase min tokens
* no resize if dst size == src size
* restore to initial min/max tokens value for qwen
# Conflicts:
# examples/mtmd/clip.cpp
* clip : use FA (#16837)
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
* model: add Janus Pro for image understanding (#16906)
* Add support for Janus Pro
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Address reviewer suggestions
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add JANUS_PRO constant
* Update clip model handling
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Refactor JANUS_PRO handling in clip.cpp
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* em whitespace
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
* mtmd: pad mask for qwen2.5vl (#16954)
* mtmd: pad mask for qwen2.5vl
* improve
* mtmd: add --image-min/max-tokens (#16921)
* mtmd: improve struct initialization (#16981)
* mtmd: allow QwenVL to process larger image by default (#17020)
* Disable flash attention
* mtmd : fix embedding size for image input (#17123)
* mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models
* add default hparams
* add llama_model_n_embd_inp
* Fix load qwen3 vl
Change batch size
* Add description
* Fix cli build error
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
* Add mainline compatible FA command line option
* Graph reuse: add command line argument to turn it on
* WIP
* This seems to work
* This is perhaps cleaner
* Change the command line option to -gr
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
so more recent users that haven't followed the history of FlashMLA
evolution and hence don't know about the MLA options get the best setting
without having to add -mla 3 on the command line.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Introducing rope cache
When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model-dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse them for the Q and K RoPE in each layer?
This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph. (A minimal sketch of the idea
follows after this commit's trailers.)
* cuda: neox works
* WIP
* rope_cache: norm works
* Fused rope+rope
* Fused rope+rope (norm)
* Fused rms+rms+rope+rope (neox) - not working
* WIP
* Also qwen3
* Add command line arg to disable rope cache
* Disable RoPE cache if rope type is not neox or norm
* Add missing break after merge with main
* Fused fused_rms+fused_rms+rope+rope (with -mqkv)
* Fused fused_rms+fused_rms+rope+rope (without -mqkv)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
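A minimal sketch of the rope-cache idea described in this commit, with made-up
shapes and a plain rotate-consecutive-pairs convention (not the actual ggml
kernels): the cos/sin table depends only on the positions and constant hparams,
so it can be built once per graph and shared by the Q and K RoPE of every layer.

```cpp
// RoPE cache sketch: precompute cos/sin per (position, frequency pair) once,
// then every layer's Q and K rotation just reads the table.
#include <cmath>
#include <vector>

struct rope_cache {
    int n_dims = 0;              // rotated dimensions per head
    std::vector<float> cs;       // layout: [n_pos][n_dims/2][cos, sin]

    void build(const std::vector<int> & pos, int n_dims_, float freq_base) {
        n_dims = n_dims_;
        cs.resize(pos.size()*n_dims);
        for (size_t p = 0; p < pos.size(); ++p) {
            for (int i = 0; i < n_dims/2; ++i) {
                const float theta = pos[p]*std::pow(freq_base, -2.0f*i/n_dims);
                cs[p*n_dims + 2*i + 0] = std::cos(theta);
                cs[p*n_dims + 2*i + 1] = std::sin(theta);
            }
        }
    }

    // rotate one head of Q or K at token index p in-place (same table, any layer)
    void apply(float * x, size_t p) const {
        for (int i = 0; i < n_dims/2; ++i) {
            const float c = cs[p*n_dims + 2*i], s = cs[p*n_dims + 2*i + 1];
            const float x0 = x[2*i], x1 = x[2*i + 1];
            x[2*i]     = x0*c - x1*s;
            x[2*i + 1] = x0*s + x1*c;
        }
    }
};
```

Building the table once and calling apply() for both Q and K in every layer
replaces 2*n_layer angle computations per token with a single one.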
* POC: merge Q, K, V into a single, contiguous tensor
Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems to make sense to do it in general given
the TG performance gain. (A minimal sketch of the merged
layout follows after this commit's trailers.)
* WIP
* merge_qkv: it works for gpt-oss
...but we see a smaller TG gain (~1.5%)
* WIP
* Don't ignore the return value of create_tensors()
Otherwise, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend tries to use mmap,
which no longer works.
* merge_qkv: bias can be required, optional, or mandatory
* merge_qkv: glm4.5moe
* merge_qkv: add command line argument to enable
* merge_qkv: fix tensor dimensions
* merge_qkv: llama-4
* merge_qkv: qwen3 (dense)
* merge_qkv: simplify build_qwen3moe
* cohere2 - simplify graph building
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
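A toy sketch of the merged layout referenced above (plain row-major matrices,
no bias, single token; not the actual ggml tensors): Wq, Wk and Wv are stacked
into one weight so a single matrix multiply produces Q, K and V as contiguous
slices of one output buffer.

```cpp
// Merged QKV sketch: one [(n_q+n_k+n_v) x n_embd] weight, one matrix-vector
// multiply per token, and q/k/v are just offsets into the shared result.
#include <vector>

struct qkv_merged {
    int n_embd = 0, n_q = 0, n_k = 0, n_v = 0;
    std::vector<float> w;        // Wq rows first, then Wk, then Wv (row-major)

    // y = W * x; q/k/v end up pointing into y
    void forward(const std::vector<float> & x, std::vector<float> & y,
                 float * & q, float * & k, float * & v) const {
        const int n_out = n_q + n_k + n_v;
        y.assign(n_out, 0.0f);
        for (int r = 0; r < n_out; ++r) {
            float acc = 0.0f;
            for (int c = 0; c < n_embd; ++c) acc += w[(size_t) r*n_embd + c]*x[c];
            y[r] = acc;
        }
        q = y.data();
        k = y.data() + n_q;
        v = y.data() + n_q + n_k;
    }
};
```

The attraction for TG is replacing three small GEMMs per layer with one larger
one over the same data.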
* Change fmoe to be on by default
* Change default fmoe also in llama-bench
* Change flash attention to be on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Do not allocate KV cache for unused layers
* Do not apply experts weight scale if it is 1
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped expert routing (a minimal
sketch of the general idea follows after this commit's trailers)
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
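For reference, a minimal sketch of group-limited top-k routing of the kind used
by some MoE models: rank the expert groups (here by their strongest expert
score), keep the best n_group_used groups, and take the top_k experts only from
those. This illustrates the general technique, not the exact CPU/CUDA kernels.

```cpp
// Group-limited top-k routing sketch: restrict expert selection to the best
// groups, then take the global top_k among the surviving candidates.
#include <algorithm>
#include <cstdio>
#include <vector>

static std::vector<int> grouped_topk(const std::vector<float> & scores,
                                     int n_group, int n_group_used, int top_k) {
    const int experts_per_group = (int) scores.size()/n_group;

    // rank groups by their best expert score
    std::vector<int> groups(n_group);
    for (int g = 0; g < n_group; ++g) groups[g] = g;
    auto group_max = [&](int g) {
        return *std::max_element(scores.begin() +  g     *experts_per_group,
                                 scores.begin() + (g + 1)*experts_per_group);
    };
    std::partial_sort(groups.begin(), groups.begin() + n_group_used, groups.end(),
                      [&](int a, int b) { return group_max(a) > group_max(b); });

    // gather the experts of the selected groups, then take the top_k among them
    std::vector<int> candidates;
    for (int i = 0; i < n_group_used; ++i) {
        for (int e = 0; e < experts_per_group; ++e) {
            candidates.push_back(groups[i]*experts_per_group + e);
        }
    }
    std::partial_sort(candidates.begin(), candidates.begin() + top_k, candidates.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    candidates.resize(top_k);
    return candidates;
}

int main() {
    const std::vector<float> scores = {0.1f, 0.9f, 0.2f, 0.3f, 0.8f, 0.7f, 0.05f, 0.15f};
    for (int id : grouped_topk(scores, /*n_group*/ 4, /*n_group_used*/ 2, /*top_k*/ 2)) {
        std::printf("selected expert %d (score %.2f)\n", id, scores[id]);
    }
    return 0;
}
```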
* Parallelize mask
We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for contexts >= 32k tokens is gone.
(A minimal sketch of the row-parallel idea follows after this
commit's trailers.)
* With FA on, create mask as f16 directly
* WIP
* Reduce KQ mask padding to 16
Why was it 64 in the first place?
I don't observe any issues, while TG performance
for long contexts improves by 2-4%.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
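A plain C++ sketch of the row-parallel mask construction (illustrative only; the
actual change also writes the mask as f16 directly when FA is on, which is
omitted here):

```cpp
// Build the causal KQ mask with rows split across threads. Each row of the
// [n_tokens x n_kv] mask is independent (0 where attention is allowed, -INF
// where it is masked), so the loop parallelizes trivially.
#include <algorithm>
#include <cmath>
#include <thread>
#include <vector>

static void build_causal_mask(std::vector<float> & mask, int n_kv, int n_tokens, int n_threads) {
    mask.assign((size_t) n_tokens*n_kv, 0.0f);
    auto worker = [&](int first, int last) {
        for (int i = first; i < last; ++i) {
            // token i sits at absolute position n_kv - n_tokens + i and may only
            // attend to kv positions up to (and including) its own
            for (int j = n_kv - n_tokens + i + 1; j < n_kv; ++j) {
                mask[(size_t) i*n_kv + j] = -INFINITY;
            }
        }
    };
    std::vector<std::thread> threads;
    const int rows_per_thread = (n_tokens + n_threads - 1)/n_threads;
    for (int t = 0; t < n_threads; ++t) {
        const int first = t*rows_per_thread;
        const int last  = std::min(n_tokens, first + rows_per_thread);
        if (first < last) threads.emplace_back(worker, first, last);
    }
    for (auto & th : threads) th.join();
}
```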
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing change committed by mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Offload only activated experts
* This seems to do the trick for -fmoe
* Do not recalculate activated experts for fused up/gate
* Log out of bounds access details
* Add a command line argument
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bounds for flash attention
* Add n_swa to FA parameters
* Fix it
* This seems very slightly better
* Using vec kernel when we have SWA
* Need also this
* f32 vec kernel
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>