mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-01-27 01:29:51 +00:00)
169 lines
5.4 KiB
Bash
Executable File
#!/usr/bin/env bash

# make sure we are in the right directory
SCRIPT_DIR=$( cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd )
cd "$SCRIPT_DIR"

#export LLAMA_CACHE="$SCRIPT_DIR/tmp"

set -eux

mkdir -p "$SCRIPT_DIR/output"

PROJ_ROOT="$SCRIPT_DIR/../.."
cd "$PROJ_ROOT"
# Check if the first argument is "big", then run the tests with big models.
# This is useful if we're running the script on a larger machine, so we can test the big models.
RUN_BIG_TESTS=false
if [ "${1:-}" = "big" ]; then
    RUN_BIG_TESTS=true
    echo "Include BIG models..."
fi

RUN_HUGE_TESTS=false
if [ "${1:-}" = "huge" ]; then
    RUN_HUGE_TESTS=true
    RUN_BIG_TESTS=true
    echo "Include BIG and HUGE models..."
fi
###############

arr_prefix=()
arr_hf=()
arr_tmpl=() # chat template
arr_file=()

add_test_vision() {
    local hf=$1
    local tmpl=${2:-""} # default to empty string if not provided
    arr_prefix+=("[vision]")
    arr_hf+=("$hf")
    arr_tmpl+=("$tmpl")
    arr_file+=("test-1.jpeg")
}

add_test_audio() {
    local hf=$1
    arr_prefix+=("[audio] ")
    arr_hf+=("$hf")
    arr_tmpl+=("") # no need for a chat template
    arr_file+=("test-2.mp3")
}
add_test_vision "ggml-org/SmolVLM-500M-Instruct-GGUF:Q8_0"
add_test_vision "ggml-org/SmolVLM2-2.2B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/SmolVLM2-500M-Video-Instruct-GGUF:Q8_0"
add_test_vision "ggml-org/gemma-3-4b-it-GGUF:Q4_K_M"
add_test_vision "THUDM/glm-edge-v-5b-gguf:Q4_K_M"
add_test_vision "second-state/Llava-v1.5-7B-GGUF:Q2_K" "vicuna"
add_test_vision "cjpais/llava-1.6-mistral-7b-gguf:Q3_K_M" "vicuna"
add_test_vision "ibm-research/granite-vision-3.2-2b-GGUF:Q4_K_M"
add_test_vision "second-state/MiniCPM-Llama3-V-2_5-GGUF:Q2_K" # model from openbmb is corrupted
add_test_vision "openbmb/MiniCPM-V-2_6-gguf:Q2_K"
add_test_vision "openbmb/MiniCPM-o-2_6-gguf:Q4_0"
add_test_vision "bartowski/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
add_test_vision "ggml-org/InternVL2_5-1B-GGUF:Q8_0"
add_test_vision "ggml-org/InternVL3-1B-Instruct-GGUF:Q8_0"
add_test_vision "ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M"
add_test_vision "ggml-org/LFM2-VL-450M-GGUF:Q8_0"
add_test_vision "ggml-org/granite-docling-258M-GGUF:Q8_0"
add_test_vision "ggml-org/LightOnOCR-1B-1025-GGUF:Q8_0"

add_test_audio "ggml-org/ultravox-v0_5-llama-3_2-1b-GGUF:Q8_0"
add_test_audio "ggml-org/Qwen2.5-Omni-3B-GGUF:Q4_K_M"
add_test_audio "ggml-org/Voxtral-Mini-3B-2507-GGUF:Q4_K_M"
# to test the big models, run: ./tests.sh big
if [ "$RUN_BIG_TESTS" = true ]; then
    add_test_vision "ggml-org/pixtral-12b-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Mistral-Small-3.1-24B-Instruct-2503-GGUF" "mistral-v7"
    add_test_vision "ggml-org/Qwen2-VL-2B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2-VL-7B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2.5-VL-3B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2.5-VL-7B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/InternVL3-8B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/InternVL3-14B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M"
    # add_test_vision "ggml-org/Qwen2.5-VL-32B-Instruct-GGUF:Q4_K_M" # does not work on my mac M3 Ultra
    add_test_vision "ggml-org/Kimi-VL-A3B-Thinking-2506-GGUF:Q4_K_M"

    add_test_audio "ggml-org/ultravox-v0_5-llama-3_1-8b-GGUF:Q4_K_M"
    add_test_audio "ggml-org/Qwen2.5-Omni-7B-GGUF:Q4_K_M"
fi

# to test the huge models, run: ./tests.sh huge
# this will run both the big and huge models
# huge models are > 32B parameters
if [ "$RUN_HUGE_TESTS" = true ]; then
    add_test_vision "ggml-org/Qwen2.5-VL-72B-Instruct-GGUF:Q4_K_M"
    add_test_vision "ggml-org/Llama-4-Scout-17B-16E-Instruct-GGUF:IQ1_S"
fi
# these models always give the wrong answer, not sure why
# add_test_vision "ggml-org/SmolVLM-Instruct-GGUF:Q4_K_M"
# add_test_vision "ggml-org/SmolVLM-256M-Instruct-GGUF:Q8_0"
# add_test_vision "ggml-org/SmolVLM2-256M-Video-Instruct-GGUF:Q8_0"

# these models have a broken chat template, not usable
# add_test_vision "cmp-nct/Yi-VL-6B-GGUF:Q5_K"
# add_test_vision "guinmoon/MobileVLM-3B-GGUF:Q4_K_M" "deepseek"
###############

cmake --build build -j --target llama-mtmd-cli
arr_res=()

for i in "${!arr_hf[@]}"; do
    bin="llama-mtmd-cli"
    prefix="${arr_prefix[$i]}"
    hf="${arr_hf[$i]}"
    tmpl="${arr_tmpl[$i]}"
    inp_file="${arr_file[$i]}"

    echo "Running test with binary: $bin and HF model: $hf"
    echo ""
    echo ""

    # ${tmpl:+...} expands to the --chat-template option only when tmpl is non-empty
    output=$(\
        "$PROJ_ROOT/build/bin/$bin" \
        -hf "$hf" \
        --image "$SCRIPT_DIR/$inp_file" \
        -p "what is the publisher name of the newspaper?" \
        --temp 0 -n 128 \
        ${tmpl:+--chat-template "$tmpl"} \
        2>&1 | tee /dev/tty)

    echo "$output" > "$SCRIPT_DIR/output/$bin-$(echo "$hf" | tr '/' '-').log"

    # pass if the output either contains "new york" or both "men" and "walk"
    if echo "$output" | grep -iq "new york" \
        || (echo "$output" | grep -iq "men" && echo "$output" | grep -iq "walk")
    then
        result="$prefix \033[32mOK\033[0m: $bin $hf"
    else
        result="$prefix \033[31mFAIL\033[0m: $bin $hf"
    fi
    echo -e "$result"
    arr_res+=("$result")

    echo ""
    echo ""
    echo ""
    echo "#################################################"
    echo "#################################################"
    echo ""
    echo ""
done
set +x

for i in "${!arr_res[@]}"; do
    echo -e "${arr_res[$i]}"
done
echo ""
echo "Output logs are saved in $SCRIPT_DIR/output"