Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-04-24 00:19:19 +00:00
Latest commit (merged upstream multimodal changes):

* model : Granite Docling + Idefics3 preprocessing (SmolVLM) (#16206)
  * feat: Add granite-docling conversion using the trillion pretokenizer
  * feat: Add granite-docling vocab pre enum
  * fix: Use granite-docling pre
  * feat: Add clip_is_idefics3
  * feat: Allow multi-token boundary sequences for image templating
  * feat: Add tiling support for idefics3 in clip.cpp
    (this should likely be moved into llava_uhd::get_slice_instructions, but for
    now it avoids disrupting the logic there)
  * feat: Partial support for full templating for idefics3 in mtmd
    (there are still errors encoding some of the image chunks, but the token
    sequence now matches transformers almost perfectly, except that the double
    newline before the global image shows up as two consecutive newline tokens
    instead of a single double-newline token, likely because the blocks are
    tokenized separately and then concatenated)
  * feat: Fully working image preprocessing for idefics3 with resize and slicing
  * feat: Parse the preprocessor config's longest side and add it to the mmproj hparams
  * fix: Use the longest side instead of size * scale_factor
    (for Granite Docling these come out to the same value, but that was just a coincidence)
  * fix: Allow batch encoding and remove clip_is_idefics3
  * refactor: Remove unnecessary conditionals for empty token vectors
  * refactor: Use the image_manipulation util
  * add test model
  Conflicts: convert_hf_to_gguf.py, convert_hf_to_gguf_update.py,
  gguf-py/gguf/constants.py, gguf-py/gguf/gguf_writer.py,
  src/llama-vocab.cpp, src/llama-vocab.h
* mtmd : support home-cooked Mistral Small Omni (#14928)
* model : add LightOnOCR-1B model (#16764), with test
  Conflicts: convert_hf_to_gguf.py, gguf-py/gguf/constants.py
* mtmd : fix idefics3 preprocessing (#16806); disable, then fix, the granite test
* model : Add support for CogVLM model (#15002)
  * GGUF mappings and tensor mappings for the CogVLM visual encoder
  * CogVLM (text and vision) added to the conversion script
  * graphs for the CogVLM CLIP model and the LLM; fixes until the model compiles and runs
  * switch the CogVLM LLM graph to a merged QKV tensor; use the rope_type variable
    instead of a direct definition
  * switch the CogVLM CLIP encoder to SWIGLU and merged QKV
  * apply rebase edits, remove a ggml_cont call that is now unnecessary, clean up
  Conflicts: convert_hf_to_gguf.py, examples/mtmd/clip.cpp,
  gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py,
  src/llama-arch.cpp, src/llama-arch.h, src/llama-model.cpp, src/llama-model.h
* mtmd : refactor preprocessing + support min/max pixels (#16878)
  * fix mlp type; implement min/max pixels; improve hparams; better image preprocessing for qwen
  * fix out-of-bound composite; fix token calculation; add get_merge_kernel_size()
  * fix llama4 and lfm2; use simple resize for qwen; qwen: increase min tokens;
    no resize if dst size == src size; restore the initial min/max token values for qwen
  Conflicts: examples/mtmd/clip.cpp
* clip : use flash attention (#16837)
  * implement an "auto" mode for clip flash attention, warn about unsupported ops,
    and print more detailed op support info during warmup
* model : add Janus Pro for image understanding (#16906)
  * add the JANUS_PRO constant, tensor mappings, and clip model handling;
    refactor the JANUS_PRO handling in clip.cpp; address reviewer suggestions
  Conflicts: convert_hf_to_gguf.py, gguf-py/gguf/constants.py, gguf-py/gguf/tensor_mapping.py
* mtmd : pad mask for qwen2.5vl (#16954)
* mtmd : add --image-min/max-tokens (#16921)
* mtmd : improve struct initialization (#16981)
* mtmd : allow QwenVL to process larger images by default (#17020)
* Disable flash attention
* mtmd : fix embedding size for image input (#17123)
* mtmd : fix patch_size initialized to a random value in audio models (#17128); add default hparams
* add llama_model_n_embd_inp
* Fix qwen3 vl loading; change batch size
* Add description
* Fix cli build error

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
185 lines · 6.3 KiB · C++
#pragma once

#include "llama.h"

#include <string>
#include <vector>
#include <memory>

// pre-tokenization types
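// NOTE: the pre-type is resolved at load time from the tokenizer.ggml.pre GGUF
// metadata string (see get_tokenizer_pre() below); these enum values are
// internal and are not themselves stored in the model file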
enum llama_vocab_pre_type {
    LLAMA_VOCAB_PRE_TYPE_DEFAULT          = 0,
    LLAMA_VOCAB_PRE_TYPE_LLAMA3           = 1,
    LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_LLM     = 2,
    LLAMA_VOCAB_PRE_TYPE_DEEPSEEK_CODER   = 3,
    LLAMA_VOCAB_PRE_TYPE_FALCON           = 4,
    LLAMA_VOCAB_PRE_TYPE_MPT              = 5,
    LLAMA_VOCAB_PRE_TYPE_STARCODER        = 6,
    LLAMA_VOCAB_PRE_TYPE_GPT2             = 7,
    LLAMA_VOCAB_PRE_TYPE_REFACT           = 8,
    LLAMA_VOCAB_PRE_TYPE_COMMAND_R        = 9,
    LLAMA_VOCAB_PRE_TYPE_STABLELM2        = 10,
    LLAMA_VOCAB_PRE_TYPE_QWEN2            = 11,
    LLAMA_VOCAB_PRE_TYPE_OLMO             = 12,
    LLAMA_VOCAB_PRE_TYPE_DBRX             = 13,
    LLAMA_VOCAB_PRE_TYPE_SMAUG            = 14,
    LLAMA_VOCAB_PRE_TYPE_PORO             = 15,
    LLAMA_VOCAB_PRE_TYPE_CHATGLM3         = 16,
    LLAMA_VOCAB_PRE_TYPE_CHATGLM4         = 17,
    LLAMA_VOCAB_PRE_TYPE_VIKING           = 18,
    LLAMA_VOCAB_PRE_TYPE_JAIS             = 19,
    LLAMA_VOCAB_PRE_TYPE_TEKKEN           = 20,
    LLAMA_VOCAB_PRE_TYPE_SMOLLM           = 21,
    LLAMA_VOCAB_PRE_TYPE_CODESHELL        = 22,
    LLAMA_VOCAB_PRE_TYPE_BLOOM            = 23,
    LLAMA_VOCAB_PRE_TYPE_GPT3_FINNISH     = 24,
    LLAMA_VOCAB_PRE_TYPE_EXAONE           = 25,
    LLAMA_VOCAB_PRE_TYPE_CHAMELEON        = 26,
    LLAMA_VOCAB_PRE_TYPE_MINERVA          = 27,
    LLAMA_VOCAB_PRE_TYPE_DEEPSEEK3_LLM    = 28,
    LLAMA_VOCAB_PRE_TYPE_GPT4O            = 29,
    LLAMA_VOCAB_PRE_TYPE_SUPERBPE         = 30,
    LLAMA_VOCAB_PRE_TYPE_TRILLION         = 31,
    LLAMA_VOCAB_PRE_TYPE_BAILINGMOE       = 32,
    LLAMA_VOCAB_PRE_TYPE_LLAMA4           = 33,
    LLAMA_VOCAB_PRE_TYPE_PIXTRAL          = 34,
    LLAMA_VOCAB_PRE_TYPE_SEED_CODER       = 35,
    LLAMA_VOCAB_PRE_TYPE_HUNYUAN          = 36,
    LLAMA_VOCAB_PRE_TYPE_KIMI_K2          = 37,
    LLAMA_VOCAB_PRE_TYPE_HUNYUAN_DENSE    = 38,
    LLAMA_VOCAB_PRE_TYPE_GROK_2           = 39,
    LLAMA_VOCAB_PRE_TYPE_MINIMAX_M2       = 40,
    LLAMA_VOCAB_PRE_TYPE_GRANITE_DOCLING  = 41,
};

struct LLM_KV;
struct llama_model_loader;

struct llama_vocab {
    struct token_data {
        std::string      text;
        float            score;
        llama_token_attr attr;
    };

    llama_vocab();
    ~llama_vocab();

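    // populate the vocabulary from the model's GGUF metadata (tokenizer
    // model/pre strings, token list, merges, special token ids and flags)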
    void load(llama_model_loader & ml, const LLM_KV & kv);

    std::string get_tokenizer_model() const;
    std::string get_tokenizer_pre() const;

    enum llama_vocab_type     get_type()     const;
    enum llama_vocab_pre_type get_pre_type() const;

    uint32_t n_tokens() const;
    uint32_t n_token_types() const;

    std::string type_name() const;

    bool is_normal      (llama_token id) const;
    bool is_unknown     (llama_token id) const;
    bool is_control     (llama_token id) const;
    bool is_byte        (llama_token id) const;
    bool is_user_defined(llama_token id) const;
    bool is_unused      (llama_token id) const;
    bool is_eog         (llama_token id) const;

    uint8_t     token_to_byte(llama_token id) const;
    llama_token byte_to_token(uint8_t ch)     const;

    llama_token text_to_token(const std::string & text) const;

    const token_data & get_token_data(llama_token id) const;

    const char *     token_get_text (llama_token id) const;
    float            token_get_score(llama_token id) const;
    llama_token_attr token_get_attr (llama_token id) const;

    llama_token token_bos() const;
    llama_token token_eos() const;
    llama_token token_eot() const;
    llama_token token_eom() const;
    llama_token token_unk() const;
    llama_token token_sep() const;
    llama_token token_nl () const;
    llama_token token_pad() const;
    llama_token token_mask() const;

    llama_token token_prefix() const;
    llama_token token_middle() const;
    llama_token token_suffix() const;
    llama_token token_fim_pre() const;
    llama_token token_fim_suf() const;
    llama_token token_fim_mid() const;
    llama_token token_fim_pad() const;
    llama_token token_fim_rep() const;
    llama_token token_fim_sep() const;

    bool get_add_space_prefix          () const;
    bool get_add_bos                   () const;
    bool get_add_eos                   () const;
    bool get_add_sep                   () const;
    bool get_ignore_merges             () const;
    bool get_clean_spaces              () const;
    bool get_remove_extra_whitespaces  () const;
    bool get_escape_whitespaces        () const;
    bool get_treat_whitespace_as_suffix() const;

    int max_token_len() const;

    int find_bpe_rank(const std::string & token_left, const std::string & token_right) const;
    std::vector<std::string> get_bpe_merges() const;

    std::vector<char> get_precompiled_charsmap() const;
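
    // copy the token ids into `tokens` (up to n_tokens_max); mirrors the
    // llama_tokenize() contract in llama.h: returns the number of tokens on
    // success, or a negative number (minus the required count) if the buffer
    // is too small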
    int32_t tokenize(
            const char *  text,
            int32_t       text_len,
            llama_token * tokens,
            int32_t       n_tokens_max,
            bool          add_special,
            bool          parse_special) const;

    std::vector<llama_token> tokenize(
            const std::string & raw_text,
            bool add_special,
            bool parse_special = false) const;

    // does not write null-terminator to buf
    int32_t token_to_piece(
            llama_token token,
            char *      buf,
            int32_t     length,
            int32_t     lstrip,
            bool        special) const;

    // use cached data
    const std::string & token_to_piece(llama_token token) const;
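
    // mirrors the llama_detokenize() contract in llama.h: returns the number
    // of characters written on success, or a negative number (minus the
    // required size) if text_len_max is too small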
    int32_t detokenize(
            const llama_token * tokens,
            int32_t             n_tokens,
            char *              text,
            int32_t             text_len_max,
            bool                remove_special,
            bool                unparse_special) const;

    std::string detokenize(
            const std::vector<llama_token> & tokens,
            bool special) const;

    void print_info() const;

private:
    struct impl;
    std::unique_ptr<impl> pimpl;
};

const struct llama_vocab * llama_get_vocab(const struct llama_context * ctx);

bool        llama_token_is_eog(const struct llama_vocab * vocab, llama_token token);
llama_token llama_token_bos   (const struct llama_vocab * vocab);
llama_token llama_token_eos   (const struct llama_vocab * vocab);
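
Usage sketch (illustrative, not part of the header): the typical call pattern for
this API, assuming a llama_context obtained from a loaded model through the usual
llama.h entry points; the demo() wrapper is hypothetical and error handling is
omitted.

#include <cstdio>

static void demo(const llama_context * ctx) {
    const llama_vocab * vocab = llama_get_vocab(ctx);

    // convenience overload: tokenizes and returns the ids directly
    const std::vector<llama_token> tokens =
        vocab->tokenize("Hello world", /*add_special=*/true, /*parse_special=*/false);

    for (const llama_token t : tokens) {
        // cached piece lookup; the returned string is owned by the vocab
        printf("%d -> '%s'\n", t, vocab->token_to_piece(t).c_str());
    }

    // back to text, dropping special tokens
    const std::string text = vocab->detokenize(tokens, /*special=*/false);
    printf("round-trip: '%s'\n", text.c_str());
}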