Commit Graph

18 Commits

Author SHA1 Message Date
firecoperana
e15a215e6b model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
764eefd1bc Enable and clean up compiler warnings in src (#824)
* WIP: enable and clean up warnings in src

* All warnings handled

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 16:01:13 +03:00
Kawrakow
c1a0e15377 Port mdmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
firecoperana
079231c291 model : add grok-2 support (#782)
Co-authored-by: firecoperana <firecoperana>
2025-09-23 16:31:01 +02:00
firecoperana
d7882c3cf8 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00
Kawrakow
6b82a64273 Disable "...is not marked as EOG" messages (#712)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-20 16:47:14 +03:00
Kawrakow
633e0617b0 Enable CUDA graphs for MoE models + GPT-OSS support (#689)
* gmp-oss: common

* gpt-oss: attnetion sinks, swiglu_oai

* gpt-oss: WIP llama

Model loads and runs (CPU only), but PPL is much to high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?

* gpt-oss: CPU seems to be working

It was the SWA thta was missing in the previous commit.

There are issues with EOG tokens, so this still needs to be added.

* CUDA: ADD_ID

Just a copy from mainline

* gpt-oss: Seems to be working on CUDA

* gpt-oss: add sinks to the attn-vec kernels

* CUDA: add head size of 64 to new mma

Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.

* gpt-oss: add ability to use -fmoe (only CUDA for now)

* Move row sums to the write place

* Add sinks to iqk flash attention

* gpt_oss: Implement -fmoe on the CPU

* Simdify swiglu_oai

Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal trottling imore often
because of making the CPU work too hard.

* llama: factor out model loader

* Builds successfully

* It runs, but mmap does not work

* Fix llama_mmap so mmap works

* Minor

* Fix CUDA after latest changes

* Attempt to use CUDA graphs with MoE models - not working

* CUDA graphs WIP - still not working

* CUDA graphs - seems to be working

Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.

* Make q8_0 cache work for DeepSeek models with CUDA graphs

* cuda: cpy for q6_0

* Fix llama_mmap on non-Linux platforms

* Adding forgotten file

* Iterating on Windows build failures

* cuda: re-add q8_0 -> q8_0 transpose

so mla = 2 can be used with CUDA graphs and q8_0 cache.

* Disable graphs without -fmoe

* Minor

* Turn graphs on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 09:18:07 +03:00
Thireus ☠
d65d5fe29e Add support for GLM-4.5 models (#668)
* GLM-4.5

* GLM-4.5

* GLM-4.5

* convert_hf_to_gguf.py compatibility bugfix with GLM-4.5

From @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3145913701

* Add ubergarm comments + my own

* Revert to llama.cpp script version that produced good BF16

See: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3147374559

* Support for jinja chat templates

See https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3148109962

* GLM-4.5 llama.cpp final port

* Handle TENSOR_SKIP

Ported the hanges from:

f129567dc0
dcbbd2cb05

Except op info since ik_llama.cpp doesn't support this operation.

* Bugfix for TENSOR_SKIP

skip loading if a tensor has the TENSOR_SKIP flag - @ubergarm via https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155297198

* Update llama.cpp

Restore original GGLM_ASSERT

* Fix chat template detection

Changes suggested by @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155927840

* Revert to original GGML_ASSERT
2025-08-07 07:55:00 +03:00
Aleksey Nikiforov
da8998c6c6 Ported kimi-k2 support from llama.cpp (#609)
Original patch by @gabriellarson:
https://github.com/ggml-org/llama.cpp/pull/14654

Co-authored-by: anikifoss <anikifoss>
2025-07-14 18:43:52 +02:00
ubergarm
db49223e8c add hunyuan moe support for 561 (#565)
* add hunyuan moe

* Don't reshape Vcur

* Apply chat template fix from mainline PR14584
2025-07-09 10:29:40 +02:00
Fizz~
27ff5bf57e Special handling of Seed Coder FIM tokens (#585)
* Special handling of Seed Coder FIM tokens

* vocab: Add Seed Coder pretokenizer

* Formatting fix

* Update llama.h
2025-07-06 12:13:55 +02:00
Kawrakow
c8e6d9cfe7 Add Falcon-Edge support (#555)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-26 08:48:52 +02:00
firecoperana
d1f92e24d3 add dry sampler (#513)
* add dry sampler

* use vocab instead of model in dry_init function

* fix compile error for build test

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-19 10:24:53 +03:00
Kawrakow
5c127b279f LlaMA-4 support (text only) (#321)
* llama4: WIP

* llama4: this seems to be working

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-04-10 09:05:21 +02:00
saood06
5c0a01bdaf Deepseek V3 support added (#176)
Co-authored-by: Stanisław Szymczyk <sszymczy@gmail.com>
2025-01-23 18:24:10 +02:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
0ceeb11721 Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00