* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)
* feat: Add granite-docling conversion using trillion pretokenizer
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add granite-docling vocab pre enum
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use granite-docling pre
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add clip_is_idefics3
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Allow multi-token boundary sequences for image templating
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Add tiling support for idefics3 in clip.cpp
This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
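For context, a minimal sketch of the tiling math this commit implements, with illustrative names (the real logic lives in clip.cpp and may differ in details): the image is resized so its longest side fits the preprocessor limit, then cut into a grid of tiles, with one downscaled global image appended.

```cpp
// Hypothetical sketch of idefics3-style tiling; names are illustrative.
#include <algorithm>
#include <cmath>

struct slice_grid {
    int cols, rows; // tile grid; one extra global image is appended
};

slice_grid idefics3_grid(int w, int h, int longest_edge, int tile_size) {
    // scale so that max(w, h) <= longest_edge; never upscale
    const float scale = std::min(1.0f, (float) longest_edge / std::max(w, h));
    const int rw = (int) std::round(w * scale);
    const int rh = (int) std::round(h * scale);
    // ceil-divide the resized image into tile_size x tile_size slices
    return {
        (rw + tile_size - 1) / tile_size,
        (rh + tile_size - 1) / tile_size,
    };
}
```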
* feat: Partial support for full templating for idefics3 in mtmd
There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image, which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately and then concatenated.
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
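A toy illustration of the effect described above (`toy_tokenize` is a hypothetical stand-in for the real tokenizer): if the vocab contains a dedicated "\n\n" token, tokenizing the chunks separately and concatenating can never produce it, while tokenizing the joined string does.

```cpp
// Toy greedy longest-match tokenizer over a vocab with both "\n\n" and "\n".
#include <cstdio>
#include <string>
#include <vector>

static std::vector<std::string> toy_tokenize(const std::string & s) {
    static const std::vector<std::string> vocab = {"\n\n", "\n", "a", "b"};
    std::vector<std::string> out;
    for (size_t i = 0; i < s.size();) {
        bool matched = false;
        for (const auto & tok : vocab) { // vocab sorted longest-first
            if (s.compare(i, tok.size(), tok) == 0) {
                out.push_back(tok);
                i += tok.size();
                matched = true;
                break;
            }
        }
        if (!matched) ++i; // skip bytes outside the toy vocab
    }
    return out;
}

int main() {
    const auto joined = toy_tokenize("a\n\nb"); // "a", "\n\n", "b" -> 3 tokens
    const auto chunks = toy_tokenize("a\n").size()
                      + toy_tokenize("\nb").size(); // "a","\n" + "\n","b" -> 4
    std::printf("joined: %zu tokens, chunked: %zu tokens\n",
                joined.size(), chunks);
}
```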
* feat: Fully working image preprocessing for idefics3 w/ resize and slicing
Branch: gabe-l-hart/GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* fix: Use the longest side instead of size * scale_factor
For Granite Docling, these come out to the same value, but that was just a
coincidence.
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
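A worked example of the coincidence, with hypothetical values (not read from the actual Granite Docling config):

```cpp
#include <cassert>

int main() {
    const int image_size   = 512;  // hypothetical vision tile size
    const int scale_factor = 4;    // hypothetical pixel-shuffle factor
    const int longest_edge = 2048; // hypothetical preprocessor_config value
    // equal here, but only by coincidence; other models may differ
    assert(image_size * scale_factor == longest_edge);
}
```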
* fix: Allow batch encoding and remove clip_is_idefics3
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Remove unnecessary conditionals for empty token vectors
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* refactor: Use image_manipulation util
Branch: GraniteDocling
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
* add test model
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# convert_hf_to_gguf_update.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/gguf_writer.py
# src/llama-vocab.cpp
# src/llama-vocab.h
* mtmd : support home-cooked Mistral Small Omni (#14928)
* model : add LightOnOCR-1B model (#16764)
* model : add LightOnOCR-1B model
* add test
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
* mtmd : fix idefics3 preprocessing (#16806)
* mtmd : fix idefics3 preprocessing
* disable granite test
* fix test for granite
* model: Add support for CogVLM model (#15002)
* Added GGUF mappings for CogVLM model
* Add tensor mapping for CogVLM visual encoder
* Add CogVLM to conversion script, no vision part yet
* Added CogVLM vision model to conversion script
* Add graph for CogVLM CLIP model
* Add graph for CogVLM
* Fixes for CogVLM. Now compiles.
* Model now runs
* Fixes for cogvlm graph
* Account for graph context change after rebase
* Changes for whitespace
* Changes in convert script according to comments
* Switch CogVLM LLM graph to merged QKV tensor
* Use rope_type variable instead of direct definition
* Change CogVLM CLIP encoder to use SWIGLU
* Switch CogVLM CLIP to use merged QKV
* Apply rebase edits and remove ggml_cont call that is now unnecessary
* clean up
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
# convert_hf_to_gguf.py
# examples/mtmd/clip.cpp
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
# src/llama-arch.cpp
# src/llama-arch.h
# src/llama-model.cpp
# src/llama-model.h
* mtmd: refactor preprocessing + support max/min pixels (#16878)
* mtmd: refactor preprocessing + support max/min pixels
* fix mlp type
* implement min/max pixels
* improve hparams
* better image preproc for qwen
* fix
* fix out of bound composite
* fix (2)
* fix token calculation
* get_merge_kernel_size()
* fix llama4 and lfm2
* gonna fix them all
* use simple resize for qwen
* qwen: increase min tokens
* no resize if dst size == src size
* restore to initial min/max tokens value for qwen
# Conflicts:
# examples/mtmd/clip.cpp
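A minimal sketch of the min/max-pixel clamping idea, assuming area-based scaling with preserved aspect ratio; names are illustrative, not the mtmd API:

```cpp
#include <algorithm>
#include <cmath>

struct extent { int w, h; };

// Scale the image so its pixel area lands inside [min_pixels, max_pixels].
extent clamp_pixels(int w, int h, long min_pixels, long max_pixels) {
    const long area = (long) w * h;
    float scale = 1.0f;
    if (area > max_pixels) {
        scale = std::sqrt((float) max_pixels / area); // shrink to fit budget
    } else if (area < min_pixels) {
        scale = std::sqrt((float) min_pixels / area); // grow to the minimum
    }
    return { std::max(1, (int) (w * scale)), std::max(1, (int) (h * scale)) };
}
```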
* clip : use FA (#16837)
* clip : use FA
* cont : add warning about unsupported ops
* implement "auto" mode for clip flash attn
* clip : print more detailed op support info during warmup
* cont : remove obsolete comment [no ci]
* improve debugging message
* trailing space
* metal : remove stray return
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
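A sketch of what an "auto" flash-attention mode can look like: `ggml_backend_supports_op` is a real ggml call, but the surrounding policy is an assumption, not the actual clip.cpp warmup code.

```cpp
#include "ggml.h"
#include "ggml-backend.h"

// During warmup, check whether the backend supports the FA op on the
// shapes the graph will actually use; otherwise fall back to non-FA.
static bool clip_can_use_fa(ggml_backend_t backend, struct ggml_cgraph * gf) {
    for (int i = 0; i < ggml_graph_n_nodes(gf); ++i) {
        struct ggml_tensor * node = ggml_graph_node(gf, i);
        if (node->op == GGML_OP_FLASH_ATTN_EXT &&
            !ggml_backend_supports_op(backend, node)) {
            return false; // log a warning and use the non-FA graph
        }
    }
    return true;
}
```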
* model: add Janus Pro for image understanding (#16906)
* Add support for Janus Pro
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Update gguf-py/gguf/tensor_mapping.py
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Address reviewer suggestions
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* Add JANUS_PRO constant
* Update clip model handling
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
* Refactor JANUS_PRO handling in clip.cpp
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
* Update tools/mtmd/clip.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* em whitespace
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
# convert_hf_to_gguf.py
# gguf-py/gguf/constants.py
# gguf-py/gguf/tensor_mapping.py
* mtmd: pad mask for qwen2.5vl (#16954)
* mtmd: pad mask for qwen2.5vl
* improve
* mtmd: add --image-min/max-tokens (#16921)
* mtmd: improve struct initialization (#16981)
* mtmd: allow QwenVL to process larger image by default (#17020)
* Disable flash attention
* mtmd : fix embedding size for image input (#17123)
* mtmd: fix patch_size initialized to random value in audio models (#17128)
* mtmd: fix patch_size initialized to random value in audio models
* add default hparams
* add llama_model_n_embd_inp
* Fix load qwen3 vl
Change batch size
* Add description
* Fix cli build error
---------
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
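For reference, a scalar CPU sketch of one GLU row in its swiglu flavour (illustrative, not the actual ggml kernel): the row holds the gate in the first half and the linear part in the second half, and the output is silu(gate) * linear.

```cpp
#include <cmath>

static void glu_swiglu_row(const float * x, float * dst, int n /* even */) {
    const int half = n / 2;
    for (int i = 0; i < half; ++i) {
        const float g = x[i];        // gate half
        const float v = x[half + i]; // linear half
        dst[i] = (g / (1.0f + std::exp(-g))) * v; // silu(g) * v
    }
}
```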
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in the terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
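A sketch of swiglu_oai as I understand the gpt-oss reference; alpha = 1.702 and limit = 7.0 are assumptions here, not verified against this tree:

```cpp
#include <algorithm>
#include <cmath>

// Gate clamped from above, linear part clamped symmetrically, then
// gate * sigmoid(alpha * gate) * (linear + 1).
static float swiglu_oai(float gate, float linear,
                        float alpha = 1.702f, float limit = 7.0f) {
    gate   = std::min(gate, limit);
    linear = std::clamp(linear, -limit, limit);
    return gate / (1.0f + std::exp(-alpha * gate)) * (linear + 1.0f);
}
```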
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for the 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so this still needs to be addressed.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
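A scalar sketch of what a sink does to one softmax row, under the usual interpretation (illustrative; the actual attn-vec kernels operate on tiles and registers): each head carries a learned sink logit that joins the denominator but contributes no value, scaling the whole row down.

```cpp
#include <algorithm>
#include <cmath>

static void softmax_with_sink(const float * logits, float * p, int n, float sink) {
    float m = sink;
    for (int i = 0; i < n; ++i) m = std::max(m, logits[i]);
    float denom = std::exp(sink - m); // sink term: no value vector attached
    for (int i = 0; i < n; ++i) denom += std::exp(logits[i] - m);
    for (int i = 0; i < n; ++i) p[i] = std::exp(logits[i] - m) / denom;
}
```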
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but I observe slightly better PP and slightly
worse TG performance with it.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable;
perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if it is really needed, it is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* add dry sampler
* use vocab instead of model in dry_init function
* fix compile error for build test
---------
Co-authored-by: firecoperana <firecoperana>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>