4446 Commits

Author SHA1 Message Date
Kawrakow
aa053205e8 Faster fused_rms_norm on the CPU (#1427) 2026-03-14 16:33:34 +01:00
dungquixote42
be2940f57a Adaptive P sampler: update review logic, delete old code comments, put prep stage after logit bias (#1386)
* simpler n_rewind logic, delete old comments

* use more consistent names, add updt_w_cur to json schema

* align comments

* refactor review logic, update struct/variable names

* revert cosmetic changes

* check enable/disable in llama_prep_adaptive_p_impl()

* delete extra whitespaces after statement

* show target in debug prints

* more concise debug print

* delete old comments

* update with loop instead of move()

* comment out all adaptive p debug prints

* more debug prints

* move review() variables: common_sampler struct -> common_sampler_review() args

* match n_unsent type

* fix merge bugs, delete adaptive p references in buffer_and_check_string_ban()

* restore accidental erasure

* Revert "adaptive p: collect probability before logit bias"

This reverts commit 1434878461.
2026-03-14 12:34:12 +01:00
mcm007
a6a1da9a28 Fix Issue 1382 (#1424)
* Use cuda 86 instead of default

"default" fails to build

* Update docker README.md

- Use 86 architecture
- Examples for mix of architectures
- Where to identify Cuda version
- Hint to clean unused images
- How to build without llama-swap
2026-03-14 12:27:29 +01:00
Kawrakow
46018f89ed Fuse SILU and SSM_CONV (CPU) (#1421) 2026-03-14 08:27:32 +01:00
Kawrakow
c2b8e95700 Be able to use imatrix computed with merged ffn_gate_up_exps (#1419)
* Be able to use imatrix computed with merged ffn_gate_up_exps

* Also the other way around
2026-03-13 17:57:56 +01:00
Kawrakow
633c1baa94 Enable imatrix calculation for models with fused ffn_up/gate_exps tensors (#1418) 2026-03-13 17:57:38 +01:00
Kawrakow
07ab0d263b Add ffn_gate_up_exps to --cpu-moe and --n-cpu-moe overrides (#1422) 2026-03-13 17:56:08 +01:00
Kawrakow
9f4656fa7d Faster top_n_sigma sampler (#1417)
* Faster top_n_sigma sampler

* This is better: 4000 t/s -> 8000 t/s
2026-03-13 10:53:45 +01:00
Kawrakow
7fab617684 Enable split mode graph for on-the-fly merged up/gate experts (#1413)
* Split mode graph for on-the-fly merged ffn_up/gate_exps

* Cleanup

* Also handle merged bias
2026-03-13 08:11:46 +01:00
hksdpc255
9b90fd37cb Improve MiroThinker chat template compatibility with the new Jinja template engine (#1404)
* Improve compatibility with the new Jinja template engine

* Refactor MiroThinker chat template using macros

* Add MiroThinker-compat chat template

For compatibility reasons. It can be removed when upstream fixes the jinja render engine bugs.

* Add workaround for llama.cpp crashing
2026-03-13 08:11:17 +01:00
firecoperana
433531ddae server : support multi-modal context checkpoints and prompt caching (#1398)
* server : support multi-modal context checkpoints and prompt caching

do not create checkpoint right after image processing

improve mtmd check for slot ops

fix context shift

do not abort if template parse failed

* change to debug message when detecting ban token

---------

Co-authored-by: firecoperana <firecoperana>
2026-03-13 08:07:57 +01:00
Kawrakow
d2141b802b Update AUTHORS 2026-03-13 07:09:56 +01:00
SneedwareInc
525d8b8a40 Update server string+regex ban documentation (#1407)
* Update server string/regex ban documentation

* Update README.md

* Update README.md
2026-03-13 07:08:38 +01:00
Kawrakow
714329f4ca Remove pre-merged up/gate notice from the README
No need for that after PRs #1408 and #1412
2026-03-12 17:29:36 +01:00
Kawrakow
c85361fe2f Split mode graph for models with pre-merged ffn_up/ffn_gate experts (#1412)
* WIP: support pre-merged up/gate experts

Haha, mainline has elected to arrange the merged tensors
the other way around compared to what I had done in the on-the-fly merge.

* Change the order of on-the-fly packed up/gate

* OpenAI

* CUDA TG

* CPU

* Split mode graph for models with pre-merged ffn_up/ffn_gate experts
2026-03-12 17:26:48 +01:00
Kawrakow
5713d3b38b Support models with merged up/gate experts (#1408)
* WIP: support pre-merged up/gate experts

Haha, mainline has elected to arrange the merged tensors
the other way around compared to what I had done in the on-the-fly merge.

* Change the order of on-the-fly packed up/gate

* OpenAI

* CUDA TG

* CPU
2026-03-12 09:25:57 +01:00
Kawrakow
afa6439ac3 Faster convolution on AVX2 (#1400)
* Faster ssm_conv on AVX2

* Move the optimized ssm_conv to iqk

* Minor
2026-03-11 19:28:38 +01:00
Kawrakow
1f4dcab5c6 Add ability to merge up/gate expert tensors to Qwen3.5-MoE/Qwen3-Next (#1403) 2026-03-11 19:28:12 +01:00
saood06
2161ee01cb Vibe coded script + constants from mainline + pip requirements (#1405) 2026-03-11 15:41:17 +01:00
Marcel Coetzee
4d09e04501 common : add env vars for cache_type_k/v, mlock, k_cache_hadamard and enable env vars for all tools (#1402)
Two changes:

1. Add four missing environment variable bindings to
   gpt_params_parse_from_env():

   - LLAMA_ARG_CACHE_TYPE_K  (string, e.g. "q8_0")
   - LLAMA_ARG_CACHE_TYPE_V  (string, e.g. "q8_0")
   - LLAMA_ARG_MLOCK         (bool, "1"/"true")
   - LLAMA_ARG_K_CACHE_HADAMARD (bool, "1"/"true")

2. Call gpt_params_parse_from_env() from gpt_params_parse() so that
   ALL tools (llama-cli, llama-bench, etc.) respect env vars, not
   just llama-server. Env vars act as defaults; CLI flags override.

Follows the existing get_env() pattern and uses the same
LLAMA_ARG_ prefix convention as the other env vars.

Co-authored-by: Pipboyguy <>
2026-03-11 15:35:26 +01:00
SneedwareInc
4a247593dc Make string ban more robust and add regex ban (#1243)
* Test new ctx_sampling->n_rewind system

* CRLF quickfix

* Adaptive p check

* merge banned_n

* Fix attempt 1

* Fix attempt 2
2026-03-11 15:30:27 +01:00
Kawrakow
fd4638f0e8 Update README with model compatibility warnings
Add warnings about incompatible models with merged ffn_up_exps and ffn_gate_exps tensors.
2026-03-11 12:06:45 +01:00
Kawrakow
bb45cc3c74 Arghh (#1397) 2026-03-10 18:28:35 +01:00
Kawrakow
cda15bf175 Discard very first compute graph for recurrent models (#1393) 2026-03-10 09:41:47 +01:00
Kawrakow
f90b4c2f27 Full graph parallel for Qwen3.5 (dense and MoE) (#1388)
* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

* WIP

Loads and starts running, crashes with illegal memory access in
quantize_mmq_q8_1. This almost always indicates NaNs in the input
to the MoE FFN part.

* WIP

* WIP

Loads and runs, wrong results (very high PPL)
Performance looks promising, around 25% better than previous sm graph.
Needs f32 or bf16 graph reduce type.

* WIP - still wrong

* Fix after rebase

* WIP

* WIP

* This seems to be working for dense Qwen3.5!!!

* WIP: Qwen3-Next is not quite working

* Some cleanup

* Disable Qwen3-Next for now

* Disable graph parallel when mmproj was specified

* Read/write split recurrent state

* That should not crash

* Re-enable vision - it works now

* Recurrent layers should now be counted for split cache
2026-03-10 09:08:24 +01:00
Kawrakow
14492bfdd2 Make split mode graph work with vision enabled (#1392) 2026-03-10 06:56:39 +01:00
Kawrakow
666ea0e983 Revise build instructions for ik_llama.cpp
Updated documentation to reflect changes from 'llama.cpp' to 'ik_llama.cpp' and clarified build instructions.
2026-03-09 11:23:39 +01:00
mullecofo
f67fd9a452 Update README.md with build instructions for Windows (#1372)
* Fix compilation on clang-cl.exe

Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169

See the bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html

Clang (and GCC) supports a language feature called Vector Extensions.

To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.

MSVC follows a stricter, more traditional C++ model regarding intrinsics.

In MSVC, __m512i is defined in the header files (<immintrin.h>) as a struct or union (e.g., typedef struct __m512i { ... } __m512i). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like int or float.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the __m512i struct. Since the standard headers don't provide one, the compiler throws an error.
You must use the explicit intrinsic function provided by Intel/MSVC: _mm512_or_si512(a, b).

To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.

* Update README.md with build instructions for Windows

Current README lacks any guide for Windows users, whereas the build process on that platform is quite complicated

* Update build.md with instruction about clang-cl.exe

Brings step-by-step build instruction for Windows

* Apply suggestions from code review

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

* Polish build.md for Windows usage

Added example of use for Windows

* Apply suggestions from code review

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-03-09 11:17:26 +01:00
firecoperana
ab1d74074b common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)
---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>

common : add nemotron 3 parsing (#18077)

common : add parser for ministral/mistral large 3/devstral 2 (#17713)

common : default content to an empty string (#18485)

chat: make tool description and parameters optional per OpenAI spec (#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667

common : implement new jinja template engine (#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (#18905)

jinja : fix lexing of float literals with sign (#18901)

jinja : add missing tojson filter for bool (#18900)

jinja : attribute support for join, map and sort (#18883)

jinja : fix object item order (and properly implement dictsort) (#18904)

tests : add test-jinja -py option for cross-checking (#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : run test-jinja -py on high perf [no ci] (#18916)

jinja : fix undefined keys and attributes and int/float as bool (#18924)

jinja: support none|string (#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

jinja : implement mixed type object keys (#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

`tojson` is not a supported `undefined` filter

keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (#19176)

jinja : add unordered_map include to value.h [no ci] (#19205)

jinja : add missing 'in' test to template engine (#19004) (#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (#18994)

chat : fix translategemma crash on common_chat_format_example (#19019)

chat: fix language input for translategemma (#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>

chat: fix case where template accepts type content only (#19419)

mtmd : chat : Fix extra \n between text and media marker (#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces output identical to the HF implementation.

However `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.

PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.

With this change the number of input tokens is identical to the HF
implementation, and as a result the output is also identical.

I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

common : fix improper trimming in XML parser on complete message (#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (#19785)

jinja : correct default size for string slices (#19913)

common : handle unicode during partial json parsing (#16526)

common : fix json schema with '\' in literals (#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-03-09 11:03:33 +01:00
Kawrakow
542988773c Update README with backend support notes
Clarify backend support and usage of quantized models in README.
2026-03-09 07:36:08 +01:00
Kawrakow
344688ce50 Add all Qwen3.5 model types (#1378)
* Add all Qwen3.5 model types

* Need also this
2026-03-07 09:01:33 +01:00
usrlocalben
c1c3421462 Fix incorrect --amb n_max_head fitting (#1375)
kv_f32_size should be fit to --amb by number of divisions, not heads per
division.

Regression in b85a2a5
2026-03-07 09:01:14 +01:00
Kawrakow
277fc1d26f Do not repeat yourself (#1373)
* DRY - part 1

* DRY - part 2

* DRY - part 3

* Fix NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-06 16:06:51 +01:00
Kawrakow
082addead2 Minor: do not do SILU on the whole convolution output (#1374) 2026-03-06 16:06:34 +01:00
Kawrakow
fa0c29843d Fix split mode graph with Qwen3.5-MoE/Qwen3-Next hybrid inference (#1368) 2026-03-06 07:26:15 +01:00
Kawrakow
3208660d20 Be able to quantize mmproj files (#1367) 2026-03-06 07:25:40 +01:00
Kawrakow
1ef4b5eddc Disable split mode graph for recurrent/hybrid models when tensor overrides (#1366) 2026-03-05 10:25:50 +01:00
Kawrakow
8fb002207a Fused delta-net (AVX512) (#1362) 2026-03-05 07:55:05 +01:00
firecoperana
2add439e43 grammar: fix trigger pattern init error (#1365)
Co-authored-by: firecoperana <firecoperana>
2026-03-05 07:54:41 +01:00
dungquixote42
a903409a5e fix adaptive p sampler rewinding too far back (#1359)
* fix adaptive p sampler rewinding too far back

* update comments

* correct default value for total_weight, more comments

* new variables/names

* update comment for n_rewind

* move null pointer check back to common_sampler_review()

* refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()
2026-03-04 13:26:25 +01:00
Kawrakow
f27678d39b ARM_NEON fused delta-net implementation (#1361)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-04 13:24:59 +01:00
mullecofo
2f93bf7563 Fix compilation on clang-cl.exe (#1355)
Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169

See the bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html

Clang (and GCC) supports a language feature called Vector Extensions.

To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.

MSVC follows a stricter, more traditional C++ model regarding intrinsics.

In MSVC, __m512i is defined in the header files (<immintrin.h>) as a struct or union (e.g., typedef struct __m512i { ... } __m512i). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like int or float.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the __m512i struct. Since the standard headers don't provide one, the compiler throws an error.
You must use the explicit intrinsic function provided by Intel/MSVC: _mm512_or_si512(a, b).

To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.
2026-03-04 08:00:28 +01:00
Kawrakow
fd16a418de Fix clang warnings on macOS (#1354)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-03 16:27:16 +01:00
Yap Sok Ann
ea3e8e30e1 Allow arbitrary argument order for Q3C, Q3CN, and Qwen3.5 (#1352)
This should fix the read file at offset/limit issue, where the tool
definition has offset before limit, while the model sets limit before
offset.
2026-03-03 15:39:16 +01:00
Kawrakow
505e2c57f9 Reduce memory use when processing large images (#1349) 2026-03-02 17:54:56 +01:00
Kawrakow
3735e88925 Remove unused tensors from delta-net (#1350) 2026-03-02 16:02:40 +01:00
Nexes the Elder
d4ac5f1566 gguf-split: fix the split output files naming (#1336)
* Fix gguf-split.cpp splits output naming

With this fix, the original .gguf extension of the source file is no longer included in the output file name ahead of the split numbering.

ex:

No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf

* increase ggml_max_context to 2048

* Revert GGML_MAX_CONTEXTS to 64
2026-03-02 08:43:47 +01:00
Kawrakow
d239dabcc6 Graph parallel for Qwen-3.5-MoE (#1347)
* Graph parallel for Qwen3.5-MoE

* Add --max-gpu to llama-bench

* Fix graph reuse when not all GPUs participate in self-attention
2026-03-02 07:48:43 +01:00
firecoperana
8f9e19d57c server: add checkpoint tolerance and fix grammar_trigger init (#1346)
Co-authored-by: firecoperana <firecoperana>
2026-03-02 07:45:32 +01:00
Kawrakow
a568e12c8f Minor delta-net tweak (#1337) 2026-03-01 17:45:02 +01:00