Commit Graph

106 Commits

Author SHA1 Message Date
sasha0552
08bf59c672 convert-hf : set the model name based on cli arg, if present (#7693)
`--model-name` argument was added a while ago but did not do anything.
This commit fixes this issue and enables this feature.
2024-06-09 16:39:25 +10:00
compilade
46f8d599f6 convert-hf : match model part name prefix and suffix (#7687)
In #7075, to fix the conversion of (some) models using model-00001-of-00001.safetensors instead of model.safetensors for a single model part we simply used the same logic as the part count to get the part names. 

But this doesn't always work correctly, like when unusual additional model files like consolidated.safetensors in https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3 are present.

This commit matching both the prefix and the suffix of the model part names should fix this problem without breaking any previously-supported upstream models. But according to report by @teleprint-me there is still some
persistent problem, but shall do in the meantime.
2024-06-09 12:47:25 +10:00
compilade
bad6961237 gguf-py : decouple adding metadata from writing in GGUFWriter (#7827)
Main changes of this PR is to consolidate GGUFWriter.add_key and GGUFWriter.add_val into GGUFWriter.add_key_value. 

In addition use_temp_file is now opt-in instead of opt-out defaulting to False.

Also GGUFWriter now does not require output file name until when actually writing to it.

And GGUFWriter doesn't really need to eagerly prepare the data layout of the metadata
2024-06-09 12:34:29 +10:00
Joan Fontanals
add6ba8d05 llama : add jina v2 base code (#7596)
* feat: add changes to handle jina v2 base code

* fix: do not complicate things

* fix: fix the usage of the code model

* fix: fix comments

* fix: fix linting issues

* fix: remove ollama patches

* style : minor

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-06-06 10:22:41 +03:00
Galunid
e4c231c3ff Fix encoding in python scripts (#7733) 2024-06-06 03:07:24 +10:00
Galunid
ea8a7d7f25 convert-hf : Handle NotImplementedError in convert-hf-to-gguf (#7660) 2024-05-31 17:42:33 +02:00
Galunid
81e37e4303 Move convert.py to examples/convert-legacy-llama.py (#7430)
* Move convert.py to examples/convert-no-torch.py

* Fix CI, scripts, readme files

* convert-no-torch -> convert-legacy-llama

* Move vocab thing to vocab.py

* Fix convert-no-torch -> convert-legacy-llama

* Fix lost convert.py in ci/run.sh

* Fix imports

* Fix gguf not imported correctly

* Fix flake8 complaints

* Fix check-requirements.sh

* Get rid of ADDED_TOKENS_FILE, FAST_TOKENIZER_FILE

* Review fixes
2024-05-30 21:40:00 +10:00
Giuseppe Scrivano
c65d048afb llama : support small Granite models (#7481)
* Add optional MLP bias for Granite models

Add optional MLP bias for ARCH_LLAMA to support Granite models.
Partially addresses ggerganov/llama.cpp/issues/7116
Still needs some more changes to properly support Granite.

* llama: honor add_space_prefix from the model configuration

propagate the add_space_prefix configuration from the HF model
configuration to the gguf file and honor it with the gpt2 tokenizer.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

* llama: add support for small granite models

it works only for the small models 3b and 8b.

The convert-hf-to-gguf.py script uses the vocabulary size of the
granite models to detect granite and set the correct configuration.

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>

---------

Signed-off-by: Giuseppe Scrivano <gscrivan@redhat.com>
Co-authored-by: Steffen Roecker <sroecker@redhat.com>
2024-05-28 21:49:49 +03:00
fairydreaming
e354ad8256 Add support for DeepseekV2ForCausalLM (#7519)
* common : increase max number of experts to 160

* common : add tensors ATTN_Q_A, ATTN_Q_A_NORM, ATTN_Q_B, ATTN_KV_A_MQA, ATTN_KV_A_NORM, ATTN_KV_B needed by DeepSeek-V2 MLA (multi-head latent attention) architecture

* common : add model header parameters: leading_dense_block_count, expert_feed_forward_length, expert_shared_count, expert_weights_scale, attention.q_lora_rank, attention.kv_lora_rank, rope.scaling.yarn_log_multiplier

* convert-hf : add model conversion support for DeepseekV2ForCausalLM

* llama : add model types for DeepSeek-V2 and DeepSeek-V2-Lite models

* llama : add two new llm_build_moe_ffn() arguments: scale_w (whether to scale weights of selected MoE experts) and w_scale (numerical value of the scaling factor)

* llama : add inference support for LLM_ARCH_DEEPSEEK2

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-28 17:07:05 +02:00
Galunid
19a748b5eb Fix aya-23 conversion scripts (#7539) 2024-05-26 16:02:34 +02:00
Bartowski
cbfef1b8c1 llama : add Smaug 70B support (#7402) 2024-05-26 15:28:35 +03:00
compilade
2e044457a1 gguf-py : fix and simplify quantized shape round-trip (#7483)
* gguf-py : fix and simplify quantized shape round-trip

* gguf-py : remove unused import
2024-05-25 11:11:48 +10:00
fairydreaming
0682aaed8d Add support for ArcticForCausalLM (#7020)
* common : increase max number of experts to 128

* common : add tensor LLM_TENSOR_FFN_NORM_EXPS for normalization before MoE that runs in parallel to attention + ffn

* gguf-py : add architecture-specific block mappings that override selected general block mappings

* convert-hf : add model conversion support for ArcticForCausalLM

* convert-hf : use added_tokens_decoder from tokenizer_config.json to redefine tokens from SentencePiece model (only for ArcticForCausalLM)

* llama : add inference support for LLM_ARCH_ARCTIC

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-24 14:31:13 +02:00
fairydreaming
29d6974d16 Add missing inference support for GPTNeoXForCausalLM (Pythia and GPT-NeoX base models) (#7461)
* convert-hf : add conversion of bloom-style qkv tensor to gpt-style qkv (code borrowed from BloomModel)

* llama : add inference support for LLM_ARCH_GPTNEOX

* llama : add model types for every Pythia variant and GPT-NeoX

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2024-05-23 11:49:53 +02:00
liuwei-git
c1a6ad7577 llama : add phi3 128K model support (#7225)
* add phi3 128k support in convert-hf-to-gguf

* add phi3 128k support in cuda

* address build warnings on llama.cpp

* adjust index value in cuda long rope freq factors

* add long rope support in ggml cpu backend

* make freq factors only depend on ctx size

* remove unused rope scaling type 'su' frin gguf converter

* fix flint warnings on convert-hf-to-gguf.py

* set to the short freq factor when context size is small than trained context size

* add one line of comments

* metal : support rope freq_factors

* ggml : update ggml_rope_ext API to support freq. factors

* backends : add dev messages to support rope freq. factors

* minor : style

* tests : update to use new rope API

* backends : fix pragma semicolons

* minor : cleanup

* llama : move rope factors from KV header to tensors

* llama : remove tmp assert

* cuda : fix compile warning

* convert : read/write n_head_kv

* llama : fix uninitialized tensors

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-21 23:28:32 +03:00
Georgi Gerganov
61ab7a8eb1 tests : test-tokenizer-0.sh print more info (#7402) 2024-05-21 19:53:48 +03:00
jaime-m-p
0dbe001317 Tokenizer SPM fixes for phi-3 and llama-spm (bugfix) (#7425)
* Update brute force test: add_special
* Update brute force test: default values for add_bos_token and add_eos_token
* Enable rtrim when pre-inserting BOS

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Revert "server : fix test regexes"
2024-05-21 14:39:48 +02:00
jaime-m-p
49a32c0167 Tokenizer SPM fixes for phi-3 and llama-spm (#7375)
* Update brute force test: special tokens
* Fix added tokens
  - Try to read 'added_tokens.json'.
  - Try to read 'tokenizer_config.json'.
  - Try to read 'tokenizer.json'.
* Fix special tokens rtrim

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* server : fix test regexes
2024-05-20 20:15:57 +02:00
Georgi Gerganov
60faeefff0 llama : remove Persimmon (#7408)
* llama : remove Persimmon

* requirements : remove
2024-05-21 02:35:28 +10:00
Anas Ahouzi
753bb58afa Add StableLM2 pre-tokenizer (#7349)
* Add StableLM pre-tokenizer

* Fix space

* Fix trailing whitespace
2024-05-19 22:46:46 +10:00
Georgi Gerganov
59f9af2239 convert : fix set_vocab_sentencepiece (#6866)
* convert : fix set_vocab_sentencepiece

* Update convert-hf-to-gguf.py
2024-05-18 08:46:20 +03:00
Aarni Koskela
5ddcb26ba6 py : convert-hf-to-gguf-update improvements (#7340)
* convert-hf-to-gguf-update: automate updating

* convert-hf-to-gguf-update: improve download

* share requests session for performance
* create directories only when needed, don't skip downloads when empty directory encountered
* be more graceful about errors
2024-05-17 15:11:45 +03:00
amd-lalithnc
938cbd3d10 convert : fix Qwen/Qwen-7b conversion (#7308) 2024-05-17 10:01:58 +03:00
compilade
2ea6201d71 convert-hf : support direct Q8_0 conversion (#7234)
* convert-hf : support q8_0 conversion

* convert-hf : add missing ftype

This was messing with the checksums otherwise.

* convert-hf : add missing ftype to Baichuan and Xverse

I didn't notice these on my first pass.
2024-05-13 14:10:51 -04:00
Joan Fontanals
bf009f1d45 llama : rename jina tokenizers to v2 (#7249)
* refactor: rename jina tokenizers to v2

* refactor: keep refactoring non-breaking
2024-05-13 11:35:14 +03:00
compilade
770c662564 convert-hf : support bfloat16 conversion (#7158)
* convert-hf : support bfloat16 conversion

* gguf-py : flake8 fixes

* convert-hf : add missing space after comma

* convert-hf : get bit-exact same output as ./quantize

The quantization version was missing.

* convert-hf : don't round bf16 NANs

* convert-hf : save some memory with np.int16 intermediate bf16 weights

* convert-hf : more closely match llama.cpp with which weights to keep in f32

* convert-hf : add --outtype auto-f16

A reason for this to exist is for model quantizers who want an initial
GGUF with the most fidelity to the original model while still using
a 16-bit float type instead of 32-bit floats.

* convert-hf : remove a semicolon because flake8 doesn't like it

It's a reflex from when programming in C/C++, I guess.

* convert-hf : support outtype templating in outfile name

* convert-hf : rename --outtype auto-f16 to --outtype auto
2024-05-11 11:06:26 -04:00
Joan Fontanals
bbd8009868 llama : add Jina Embeddings architecture (#6826)
* feat: first things to do

* feat: create tensors for Jina architecture

* fix: use other tensors

* feat: embedding gets results

* fix: fix usage of ALIBI

* fix: clean prints

* fix: do some cleanup unused vars

* fix: revert changes to Makefile and CMakeLists

* fix: revert some changes

* fix: fix small detail

* fix: fix convert formatting

* fix: fix linting and editor

* feat: set proper vocab settings

* fix: JinaBertForMaskedLM registration

* feat: support q_normalization and k_normalization in Jina arch

* feat: handle gpt2 tokenizer with Jina architecture

* feat: example comments in embedding

* feat: rename Jina Bert to Jina Bert V2

* fix: add some changes as per review

* feat: proper KQ_pos for Jina embeddings

* feat: add capacity to load models ES and DE for Spanish

* llama : fix pre-tokenizers

* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* minor : clean-up

* embedding : add warning about missing SEP

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-11 10:46:09 +03:00
Georgi Gerganov
11ca6a98cf ggml : full ALiBi support (#7192)
* ggml : full ALiBi support

* ggml : update ggml_soft_max_ext() CUDA, SYCL

* ggml : ggml_flash_attn_ext() support ALiBi (CPU)

* ggml : ggml_flash_attn_ext() support ALiBi (Metal)

* ggml : fix warning

* ggml : ggml_flash_attn_ext() support ALiBi (CUDA)

ggml-ci

* ggml : fix assert message

* vulkan : add dev notes

* ggml : require mask when using ALiBi

ggml-ci

* convert : fix convert for refact models
2024-05-11 10:32:41 +03:00
compilade
282537cab8 convert-hf : save memory with lazy evaluation (#7075)
* convert-hf : begin refactoring write_tensor

* convert : upgrade to sentencepiece v0.2.0

* convert-hf : remove unused n_dims in extra_*_tensors

* convert-hf : simplify MoE weights stacking

* convert-hf : flake8 linter doesn't like semicolons

* convert-hf : allow unusual model part names

For example, loading `model-00001-of-00001.safetensors` now works.

* convert-hf : fix stacking MoE expert tensors

`torch.stack` and `torch.cat` don't do the same thing.

* convert-hf : fix Mamba conversion

Tested to work even with a SentencePiece-based tokenizer.

* convert : use a string for the SentencePiece tokenizer path

* convert-hf : display tensor shape

* convert-hf : convert norms to f32 by default

* convert-hf : sort model part names

`os.listdir` is said to list files in arbitrary order.
Sorting the file names should let "model-00009-of-00042.safetensors"
be loaded before "model-00010-of-00042.safetensors".

* convert-hf : use an ABC for Model again

It seems Protocol can't be used as a statically type-checked ABC,
because its subclasses also can't be instantiated. (why did it seem to work?)

At least there's still a way to throw an error when forgetting to define
the `model_arch` property of any registered Model subclasses.

* convert-hf : use a plain class for Model, and forbid direct instantiation

There are no abstract methods used anyway,
so using ABC isn't really necessary.

* convert-hf : more consistent formatting of cmdline args

* convert-hf : align the message logged for converted tensors

* convert-hf : fix Refact conversion

* convert-hf : save memory with lazy evaluation

* convert-hf : flake8 doesn't like lowercase L as a variable name

* convert-hf : remove einops requirement for InternLM2

* convert-hf : faster model parts loading

Instead of pre-loading them all into a dict, iterate on the tensors
in the model parts progressively as needed in Model.write_tensors

Conversion for some architectures relies on checking for the presence
of specific tensor names, so for multi-part models, the weight map is read
from the relevant json file to quickly get these names up-front.

* convert-hf : minor changes for consistency

* gguf-py : add tqdm as a dependency

It's small, and used for a progress bar
in GGUFWriter.write_tensors_to_file
2024-05-08 18:16:38 -04:00
Ren Xuancheng
269bdbb907 llama : add BPE pre-tokenization for Qwen2 (#7114)
* Add BPE pre-tokenization for Qwen2.

* minor : fixes

---------

Co-authored-by: Ren Xuancheng <17811943+jklj077@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-08 15:06:43 +03:00
DAN™
6604fe6bb6 convert : add BPE pre-tokenization for DBRX (#7132)
* Add BPE pre-tokenization for DBRX.

* Add vocab GGUFs.

* Remove test.

* Remove GGUFs.
2024-05-08 13:43:23 +03:00
nopperl
9bf2c358fe Fix OLMo HF to GGUF conversion (#6910) 2024-05-07 21:39:43 +02:00
DAN™
591a32f4f1 command-r : add BPE pre-tokenization (#7063)
* Add BPE pre-tokenization for Command-R/R+.

* Bump transformers convert requirement.

* command-r : add individual digits regex

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-05-05 08:19:30 +03:00
Georgi Gerganov
b518940ede tests : add test-tokenizer-0.sh + fix some tokenizers (#7036)
* tests : add test-tokenizer-0.sh

* unicode : add all unicode number ranges

* starcoder : fix pre-tokenizer

* tests : add test that fails with DeepSeek tokenizers

* falcon : fix regex

* unicode : regenerate unicode tables

* refact : add tokenizer model

* lint : fix

* tests : disable failing tests

ggml-ci

* refact : add tests files

ggml-ci

* convert : print -> logging

ggml-ci

* lint : fix

* unicode : digit -> number

* phi-3 : update
2024-05-04 08:32:32 +03:00
Brian
7bd1c13a56 convert.py : add python logging instead of print() (#6511)
* convert.py: add python logging instead of print()

* convert.py: verbose flag takes priority over dump flag log suppression

* convert.py: named instance logging

* convert.py: use explicit logger id string

* convert.py: convert extra print() to named logger

* convert.py: sys.stderr.write --> logger.error

* *.py: Convert all python scripts to use logging module

* requirements.txt: remove extra line

* flake8: update flake8 ignore and exclude to match ci settings

* gh-actions: add flake8-no-print to flake8 lint step

* pre-commit: add flake8-no-print to flake8 and also update pre-commit version

* convert-hf-to-gguf.py: print() to logger conversion

* *.py: logging basiconfig refactor to use conditional expression

* *.py: removed commented out logging

* fixup! *.py: logging basiconfig refactor to use conditional expression

* constant.py: logger.error then exit should be a raise exception instead

* *.py: Convert logger error and sys.exit() into a raise exception (for atypical error)

* gguf-convert-endian.py: refactor convert_byteorder() to use tqdm progressbar

* verify-checksum-model.py: This is the result of the program, it should be printed to stdout.

* compare-llama-bench.py: add blank line for readability during missing repo response

* reader.py: read_gguf_file() use print() over logging

* convert.py: warning goes to stderr and won't hurt the dump output

* gguf-dump.py: dump_metadata() should print to stdout

* convert-hf-to-gguf.py: print --> logger.debug or ValueError()

* verify-checksum-models.py: use print() for printing table

* *.py: refactor logging.basicConfig()

* gguf-py/gguf/*.py: use __name__ as logger name

Since they will be imported and not run directly.

* python-lint.yml: use .flake8 file instead

* constants.py: logger no longer required

* convert-hf-to-gguf.py: add additional logging

* convert-hf-to-gguf.py: print() --> logger

* *.py: fix flake8 warnings

* revert changes to convert-hf-to-gguf.py for get_name()

* convert-hf-to-gguf-update.py: use triple quoted f-string instead

* *.py: accidentally corrected the wrong line

* *.py: add compilade warning suggestions and style fixes
2024-05-03 22:36:41 +03:00
Bartowski
abde0cec51 Remove .attention from skipped tensors to match more accurately (#7051) 2024-05-03 01:49:09 +02:00
Georgi Gerganov
aac7c0192a convert : use utf8 encoding (#7000)
* convert : use utf8 encoding

* convert : update instructions and warning message
2024-04-30 11:05:25 +03:00
Georgi Gerganov
820703bf9c llama : fix BPE pre-tokenization (#6920)
* merged the changes from deepseeker models to main branch

* Moved regex patterns to unicode.cpp and updated unicode.h

* Moved header files

* Resolved issues

* added and refactored unicode_regex_split and related functions

* Updated/merged the deepseek coder pr

* Refactored code

* Adding unicode regex mappings

* Adding unicode regex function

* Added needed functionality, testing remains

* Fixed issues

* Fixed issue with gpt2 regex custom preprocessor

* unicode : fix? unicode_wstring_to_utf8

* lint : fix whitespaces

* tests : add tokenizer tests for numbers

* unicode : remove redundant headers

* tests : remove and rename tokenizer test scripts

* tests : add sample usage

* gguf-py : reader prints warnings on duplicate keys

* llama : towards llama3 tokenization support (wip)

* unicode : shot in the dark to fix tests on Windows

* unicode : first try custom implementations

* convert : add "tokenizer.ggml.pre" GGUF KV (wip)

* llama : use new pre-tokenizer type

* convert : fix pre-tokenizer type writing

* lint : fix

* make : add test-tokenizer-0-llama-v3

* wip

* models : add llama v3 vocab file

* llama : adapt punctuation regex + add llama 3 regex

* minor

* unicode : set bomb

* unicode : set bomb

* unicode : always use std::wregex

* unicode : support \p{N}, \p{L} and \p{P} natively

* unicode : try fix windows

* unicode : category support via std::regex

* unicode : clean-up

* unicode : simplify

* convert : add convert-hf-to-gguf-update.py

ggml-ci

* lint : update

* convert : add falcon

ggml-ci

* unicode : normalize signatures

* lint : fix

* lint : fix

* convert : remove unused functions

* convert : add comments

* convert : exercise contractions

ggml-ci

* lint : fix

* cmake : refactor test targets

* tests : refactor vocab tests

ggml-ci

* tests : add more vocabs and tests

ggml-ci

* unicode : cleanup

* scripts : ignore new update script in check-requirements.sh

* models : add phi-3, mpt, gpt-2, starcoder

* tests : disable obsolete

ggml-ci

* tests : use faster bpe test

ggml-ci

* llama : more prominent warning for old BPE models

* tests : disable test-tokenizer-1-bpe due to slowness

ggml-ci

---------

Co-authored-by: Jaggzh <jaggz.h@gmail.com>
Co-authored-by: Kazim Abrar Mahi <kazimabrarmahi135@gmail.com>
2024-04-29 16:58:41 +03:00
Christian Zhou-Zheng
4b73489d69 convert : fix conversion of some BERT embedding models (#6937) 2024-04-29 16:34:41 +03:00
Junyang Lin
e528175412 convert : add support of codeqwen due to tokenizer (#6707)
* add support of codeqwen due to tokenizer

* override load_hparams

* fix typo

* fix load_params

* convert : fix whitespace

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 10:16:21 +03:00
liuwei-git
3e35ff21bc llama : add phi3 support (#6852)
* add explicit phi3 support

* add explicit phi3 support

* remove unused code

* convert : add BOS token

* llama : match EOT token <|end|>

* llama : minor / style

* llama : tabs -> spaces

* convert : fix lint checks

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-24 10:00:37 +03:00
Pedro Cuenca
a512902ca8 llama : support Llama 3 HF conversion (#6745)
* Support Llama 3 conversion

The tokenizer is BPE.

* style

* Accept suggestion

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* llama : add llama_token_is_eog()

ggml-ci

* llama : auto-detect more EOT tokens when missing in KV data

* convert : replacing EOS token is a hack

* llama : fix codegemma EOT token + add TODOs

* llama : fix model type string for 8B model

---------

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
nopperl
460a782b1c Implement the OLMo architecture (#6741)
* implement olmo architecture

* remove unused variable

* remove unused moe branch

* remove check for weight

* remove superfluous moe, bias and rope tensors

* clarified comment

* fix clamp_kqv setting

* remove obsolete parameter name filter
2024-04-19 11:35:54 +02:00
Zheng.Deng
76a155daa7 convert : fix autoawq gemma (#6704)
* fix autoawq quantized gemma model convert error

using autoawq to quantize gemma model will include a lm_head.weight tensor in model-00001-of-00002.safetensors. it result in this situation that convert-hf-to-gguf.py can't map lm_head.weight. skip loading this tensor could prevent this error.

* change code to full string match and print necessary message

change code to full string match and print a short message to inform users that lm_head.weight has been skipped.

---------

Co-authored-by: Zheng.Deng <32841220+CUGfred@users.noreply.github.com>
2024-04-16 23:51:07 +03:00
Ashish
f758b2eea4 llama : add StableLM2 12B (#6635)
* StableLM2 12B support for huggingface -> GGUF

* StableLM12 tensormapping and constants

* StableLM-2-12b model support

* fix

* Added 12B support

* Removed autoformatting; resolved bug where model_arch was not selecting StableLM2

* Formatting

* Do QK norm stacking in model conversion step

* Converge StableLM and StableLM2 code to simplify graph construction

* Fix accidental removal

* Removed warnings

* Revert formatter

* Move QK norm stack to private function so it's easier to read

* refactor stablelm graph builder to support 1.6, 3b and 12b more efficiently

* Proper check for None type for new_name to avoid crash; formatting; revert change to base class `write_tensors()`

* Format

* Formatting

* format

Co-authored-by: compilade <git@compilade.net>

* Fix incorrect check for K norm

* space after commas; Keep indentation multiple of 4 spaces

* Flake8 format

* Removed unnecessary conditional branches

* Removed unused comment

* Fixed incorrect tensor passing

* Format

---------

Co-authored-by: compilade <git@compilade.net>
2024-04-16 18:48:35 +03:00
Shijie
c037100422 llama : add qwen2moe (#6074)
* support qwen2moe

* fix-review

* metal : support unary ops for nelements % 4 != 0

* metal : require contiguousness for float4 unary kernels

* metal : require contiguousness for float4 unary kernels (cont)

* fix-review

* names : for brevity "SHARED_EXP" -> "SHEXP"

* llama : reuse build_moe_ffn()

* llama : add model type name

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-16 18:40:48 +03:00
Daniel Bevenius
ecc2398add gguf : add special tokens metadata for FIM/Infill (#6689)
This commit adds special token metadata for Fill-In-the-Middle
(FIM)/Infill to the GGUF model.

The motivation for this is that currently there is support for CodeLlama
but other models exist now like CodeGemma, but the different models use
different token ids for the special tokens and this commit allows for
supporting multiple models.

Signed-off-by: Daniel Bevenius <daniel.bevenius@gmail.com>
2024-04-16 09:13:13 +03:00
James A Capozzoli
ae5d1b6888 convert : enable the --use-temp-file cli flag (#6645) 2024-04-14 11:40:18 +03:00
Pierrick Hymbert
5dc0da80c8 model: support arch DbrxForCausalLM (#6515)
* model: dbrx convert to gguf
#6344

* llama: support dbrx
#6344

* doc: dbrx: add the model as supported

* scripts: get-wikitext-2 add unzip

* llama: increase maximum experts allowed

* llama: factorize moe graph implementation between grok, mixtral and dbrx


---------

Co-authored-by: Megha Agarwal <16129366+megha95@users.noreply.github.com>
2024-04-13 11:33:52 +02:00
Jared Van Bortel
ac727da885 BERT tokenizer fixes (#6498)
Key changes:
* BERT conversion: fix abuse of LlamaHfVocab, do not set BOS or EOS
* Nomic Embed conversion: pad vocab instead of slicing embedding tensor
* llama_tokenize: handle added special tokens like HF does
2024-04-09 13:44:08 -04:00