Commit Graph

20 Commits

Author SHA1 Message Date
ubergarm
32540ac619 Ling-1T convert fixup (#837)
* Conditionally write moe_shared_expert_intermediate_size

Ling-1T config.json does *not* have `moe_shared_expert_intermediate_size`.
Ling-flash-2.0a *does* have it.

This small patch just makes the gguf_writer detect the key and write it
conditionally, as needed (a sketch follows after this commit entry).

* Fix Ling-1T missing moe_shared_expert_intermediate_size

Thanks CISC for the proper patch to include the needed values!
2025-10-17 07:52:31 +03:00
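A minimal sketch of the conditional handling described in the commit above, reading hparams straight from the model's config.json; the writer helper name is an assumption based on gguf-py's shared-expert keys, not a copy of the actual patch:

```python
# Minimal sketch (not the actual patch): only write the shared-expert FFN size
# when the checkpoint's config.json actually provides it.
import json

def write_shared_expert_size(gguf_writer, config_path: str) -> None:
    with open(config_path) as f:
        hparams = json.load(f)
    # Ling-flash-2.0 provides this key; Ling-1T does not, so skip it gracefully.
    shared_ffn = hparams.get("moe_shared_expert_intermediate_size")
    if shared_ffn is not None:
        # assumed gguf-py helper; verify the exact writer method in gguf-py
        gguf_writer.add_expert_shared_feed_forward_length(shared_ffn)
```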
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
firecoperana
079231c291 model : add grok-2 support (#782)
Co-authored-by: firecoperana <firecoperana>
2025-09-23 16:31:01 +02:00
firecoperana
de97c33b40 fix convert error for ernie 4.5 (#774) 2025-09-11 07:59:24 +02:00
firecoperana
426032c27a Add Ernie 4.5 MOE and 0.3B Support (#759)
* Add Ernie4_5MoeModel

* add ernie 4.5 0.3B model

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-05 11:54:35 +02:00
Thireus ☠
d65d5fe29e Add support for GLM-4.5 models (#668)
* GLM-4.5

* GLM-4.5

* GLM-4.5

* convert_hf_to_gguf.py compatibility bugfix with GLM-4.5

From @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3145913701

* Add ubergarm comments + my own

* Revert to llama.cpp script version that produced good BF16

See: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3147374559

* Support for jinja chat templates

See https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3148109962

* GLM-4.5 llama.cpp final port

* Handle TENSOR_SKIP

Ported the changes from:

f129567dc0
dcbbd2cb05

Except op info since ik_llama.cpp doesn't support this operation.

* Bugfix for TENSOR_SKIP

skip loading if a tensor has the TENSOR_SKIP flag - @ubergarm via https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155297198

* Update llama.cpp

Restore original GGML_ASSERT

* Fix chat template detection

Changes suggested by @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155927840

* Revert to original GGML_ASSERT
2025-08-07 07:55:00 +03:00
ubergarm
5e357db589 Fixup kimi-k2 convert indentation (#617) 2025-07-16 15:24:20 +02:00
ubergarm
d3ed217798 kimi-k2 convert script and chat template (#612)
* convert_hf_to_gguf for Kimi-K2-Instruct

Adapt mainline `PR14653` for the tokenizer while maintaining proper MLA
tensors. Tested with this workflow: use deepseek fp8_cast_bf16.py with
triton-cpu to upcast the fp8 safetensors to bf16 safetensors, then run
this convert_hf_to_gguf (a sketch of the workflow follows after this
commit entry).

* Add Kimi-K2 chat template

moonshotai/Kimi-K2-Instruct

https://github.com/ikawrakow/ik_llama.cpp/pull/609#issuecomment-3071259454

* kimi-k2: add the assistant turn ("ass") to the template to get a response
2025-07-15 19:54:04 +02:00
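A hedged sketch of the workflow described in the commit above, driving both steps from Python; the paths are placeholders and the fp8_cast_bf16.py argument names follow the DeepSeek-V3 reference script, so verify them against your local copy:

```python
# Sketch of the fp8 -> bf16 -> GGUF workflow described above (paths and the
# fp8_cast_bf16.py flags are assumptions; check them against your local scripts).
import subprocess

FP8_DIR = "Kimi-K2-Instruct"          # original fp8 safetensors
BF16_DIR = "Kimi-K2-Instruct-bf16"    # upcast bf16 safetensors
OUT_GGUF = "Kimi-K2-Instruct-BF16.gguf"

# Step 1: upcast fp8 safetensors to bf16 (DeepSeek's script, run with triton-cpu).
subprocess.run(
    ["python", "fp8_cast_bf16.py",
     "--input-fp8-hf-path", FP8_DIR,
     "--output-bf16-hf-path", BF16_DIR],
    check=True,
)

# Step 2: convert the bf16 safetensors to GGUF with this repo's converter.
subprocess.run(
    ["python", "convert_hf_to_gguf.py", BF16_DIR,
     "--outfile", OUT_GGUF, "--outtype", "bf16"],
    check=True,
)
```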
saood06
02d675717e Support for dots.llm1 models (#573)
* Add llama.cpp changes for dots1 support

* Add python changes for dots1 support

* Fix to make it convert

* Remove V reshaping, remove BOS by default for dots1 and fix warmup to handle models without BOS

* Minor fix

* Remove commented lines
2025-07-10 02:37:36 -05:00
Fizz~
27ff5bf57e Special handling of Seed Coder FIM tokens (#585)
* Special handling of Seed Coder FIM tokens

* vocab: Add Seed Coder pretokenizer

* Formatting fix

* Update llama.h
2025-07-06 12:13:55 +02:00
Nexes the Elder
7c5d9aba86 convert_hf_to_gguf.py : conversion from hf weights to Q6_0 (#483)
* Direct conversion from fp16 to Q6_0

* forgotten comma

* More precise info
2025-06-03 09:30:30 +03:00
Nexes the Elder
86170b2048 Legacy quants conversion schemes in convert_hf_to_gguf.py (#449)
* Legacy quants conversion schemes in convert_hf_to_gguf.py

This is notably in order to make smaller conversions for generating an iMatrix file.

`Q4_0` and `Q4_1` here keep embeddings, output, attn_k and attn_v in q5_0.
`Q5_0` and `Q5_1` here keep embeddings, output, attn_k and attn_v in q8_0.
(A sketch of this per-tensor scheme follows after this commit entry.)

Adapted from the following llama.cpp mainline PR: https://github.com/ggml-org/llama.cpp/pull/9022
Original author: @chentyjpm

Also fixes 2 forgotten mentions of FTYPE IQ3_KL in the llama.cpp file.

* forgotten IQ5_KS case mention
2025-05-24 11:49:10 +03:00
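A minimal sketch of the per-tensor scheme described in the commit above; the function and the name matching are illustrative, not the actual convert_hf_to_gguf.py logic:

```python
# Illustrative per-tensor override for the legacy conversion schemes described
# above: sensitive tensors are kept in a higher-precision legacy quant.
def pick_legacy_quant(outtype: str, tensor_name: str) -> str:
    sensitive = ("token_embd", "output.weight", "attn_k", "attn_v")
    if any(part in tensor_name for part in sensitive):
        if outtype in ("q4_0", "q4_1"):
            return "q5_0"  # Q4_0/Q4_1 schemes keep these tensors in q5_0
        if outtype in ("q5_0", "q5_1"):
            return "q8_0"  # Q5_0/Q5_1 schemes keep these tensors in q8_0
    return outtype  # everything else uses the requested legacy quant
```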
saood06
a7e5b01540 Fix missing rope_freqs with convert_hf_to_gguf (#402)
* lora : fix llama conversion script with ROPE_FREQS

* convert : refactor rope_freqs generation

This should also fix vocab-only conversion for Phi-3.

* convert : adapt MiniCPM3 to separate rope_freqs insertion

MiniCPM3's tokenizer is treated as a SentencePiece tokenizer to avoid
having to run its custom Python code, which mixes tokenization and tool
calls in the same file.

* gguf-py : add long and short RoPE factors to tensor mappings

Empty, but the key names are used to populate the mappings. (A sketch of
what rope_freqs holds follows after this commit entry.)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Francis Couture-Harpin <git@compilade.net>
2025-05-09 09:17:41 -05:00
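For context, rope_freqs holds the standard RoPE inverse frequencies; a minimal sketch of generating such a tensor at conversion time, using only the standard formula rather than the refactored script code:

```python
import numpy as np

def rope_freqs(head_dim: int, rope_theta: float = 10000.0) -> np.ndarray:
    # Standard RoPE inverse frequencies: rope_theta^(-2i/head_dim)
    # for i = 0 .. head_dim/2 - 1, stored as the rope_freqs tensor.
    exponents = np.arange(0, head_dim, 2, dtype=np.float64) / head_dim
    return (1.0 / (rope_theta ** exponents)).astype(np.float32)
```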
saood06
87bfad8437 Support for Llama-3-Nemotron models (#377)
* conflict resolution

* Changes to make it work and add longrope support

* Changes to n_attention_wv rule

* Untested support of 253B

* DeciLMCausalModel now reads rope_theta from config.json properly

* Remove errant Granite mentions

* Better n_attention_wv rule

* Update vocab.py

---------

Co-authored-by: Yee Man Chan <ymchan@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-09 10:09:59 +03:00
Ben Harris
8b62ee32ca Apply Qwen3 PR from llama.cpp (#355) 2025-04-29 10:02:08 +02:00
saood06
e6c85a5b95 Add support for bitnet2b_2501 model (#337)
* add support for bitnet2b_2501 model

* Fixes

* Support both model names

---------

Co-authored-by: potassiummmm <zhou.hansong@outlook.com>
2025-04-22 08:34:13 +02:00
Kawrakow
3e536b95b0 Add optional MLA (#188)
* Deepseek MLA Optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make MLA optional

* Remove some unnecessary copies in the MLA attention

* Deepseek MLA Optimizations V2 (#195)

* Avoid allocating MHA KV cache when MLA is turned on

* Added missing gguf-py file

* Added final optimizations

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>

* Make sure we do have wk_b and wv_b before enabling MLA

---------

Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Use type_k and type_v to set the types of the MLA caches

They were hard-coded at f16.
On my Ryzen-7950X with native bf16 support I get a fairly
significant PP performance boost with bf16 KV-cache:
PP-4096 = 320 t/s up from 292 t/s with fp16 KV-cache.

* Better gemm strategy when nth > nhead

It gives a ~10% PP performance boost for DeepSeek-Lite with 32 threads
(with or without MLA).
Before this commit, when nth > nhead, heads were processed
sequentially with all nth threads participating in each
matrix multiplication. Now we find the gcd of nhead and
nth and split threads into nth/gcd groups, each group
processing nhead/gcd heads (a small arithmetic sketch follows after
this commit entry).

---------

Co-authored-by: Saood Karim <saood05@gmail.com>
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-02-09 19:48:44 +02:00
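A small arithmetic sketch of the thread-splitting strategy described above, illustrative only; the real change lives in the C++ gemm code:

```python
from math import gcd

def split_threads(nth: int, nhead: int) -> tuple[int, int]:
    # As described above: with g = gcd(nhead, nth), threads are split into
    # nth // g groups and each group processes nhead // g heads, instead of
    # all nth threads working on one head at a time.
    g = gcd(nth, nhead)
    return nth // g, nhead // g

# e.g. nhead=16 with nth=32 threads -> g=16: 2 thread groups,
# each assigned 1 head at a time, per the formula in the commit text.
```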
saood06
5c0a01bdaf Deepseek V3 support added (#176)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23 18:24:10 +02:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
0ceeb11721 Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00