Commit Graph

4268 Commits

Author SHA1 Message Date
firecoperana
ab1d74074b common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369)
---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
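
The combinator style can be sketched in a few lines: each parser maps an input position to the position after a match (or to failure), and combinators compose small parsers into larger ones. A minimal illustration with entirely hypothetical names, not the PR's actual API:

```cpp
#include <functional>
#include <optional>
#include <string_view>

// A parser maps (input, position) to the position after a match, or nullopt.
using parser = std::function<std::optional<size_t>(std::string_view, size_t)>;

// Match a literal token.
static parser lit(std::string_view tok) {
    return [tok](std::string_view s, size_t pos) -> std::optional<size_t> {
        if (pos + tok.size() <= s.size() && s.substr(pos, tok.size()) == tok) {
            return pos + tok.size();
        }
        return std::nullopt;
    };
}

// Sequence: 'a' followed by 'b'.
static parser seq(parser a, parser b) {
    return [a, b](std::string_view s, size_t pos) -> std::optional<size_t> {
        if (auto p = a(s, pos)) return b(s, *p);
        return std::nullopt;
    };
}

// Ordered choice (the PEG '/' operator): try 'a', fall back to 'b'.
static parser choice(parser a, parser b) {
    return [a, b](std::string_view s, size_t pos) {
        if (auto p = a(s, pos)) return p;
        return b(s, pos);
    };
}
```

For example, `seq(lit("<tool_call>"), choice(lit("{"), lit("[")))` would recognize the opening of a hypothetical tool-call block.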

common : add nemotron 3 parsing (#18077)

common : add parser for ministral/mistral large 3/devstral 2 (#17713)

common : default content to an empty string (#18485)

chat: make tool description and parameters optional per OpenAI spec (#18478)

Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.

Attempts to fix #17667
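
A minimal sketch of the relaxed parsing, assuming nlohmann::json (used elsewhere in common); the helper itself is illustrative:

```cpp
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

// 'name' stays required, but 'description' and 'parameters'
// fall back to benign defaults instead of throwing.
static void parse_tool_function(const json & fn) {
    const std::string name        = fn.at("name");                          // required: throws if absent
    const std::string description = fn.value("description", "");            // optional per OpenAI spec
    const json        parameters  = fn.value("parameters", json::object()); // optional per OpenAI spec
    (void) name; (void) description; (void) parameters;
}
```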

common : implement new jinja template engine (#18462)
---------

Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

jinja: correct member access rule (#18905)

jinja : fix lexing of float literals with sign (#18901)

jinja : add missing tojson filter for bool (#18900)

jinja : attribute support for join, map and sort (#18883)

jinja : fix object item order (and properly implement dictsort) (#18904)

tests : add test-jinja -py option for cross-checking (#18906)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

ci : run test-jinja -py on high perf [no ci] (#18916)

jinja : fix undefined keys and attributes and int/float as bool (#18924)

jinja: support none|string (#18995)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

jinja : implement mixed type object keys (#18955)

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)

* `tojson` is not a supported `undefined` filter
* keep it DRY and fix some types

jinja : do not pass empty tools and add some none filters (#19176)

jinja : add unordered_map include to value.h [no ci] (#19205)

jinja : add missing 'in' test to template engine (#19004) (#19239)

The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".

This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.

Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.

Includes test cases for all three containment types plus
reject/select filter usage.
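
A sketch of the containment logic, using nlohmann::json as a stand-in for the engine's value type:

```cpp
#include <stdexcept>
#include <string>
#include <nlohmann/json.hpp>

using json = nlohmann::json;

static bool test_is_in(const json & needle, const json & haystack) {
    if (haystack.is_array()) {      // element containment: 2 in [1, 2, 3]
        for (const auto & item : haystack) {
            if (item == needle) return true;
        }
        return false;
    }
    if (haystack.is_string()) {     // substring containment: "b" in "abc"
        return haystack.get<std::string>().find(needle.get<std::string>()) != std::string::npos;
    }
    if (haystack.is_object()) {     // key containment: "k" in {"k": 1}
        return haystack.contains(needle.get<std::string>());
    }
    throw std::runtime_error("'in' test: container must be an array, string or object");
}
```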

---------

Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

Add Jinja support for "indent" string filter (#19529)

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

add vendor

refactor chat

server : support preserving reasoning_content in assistant message (#18994)

chat : fix translategemma crash on common_chat_format_example (#19019)

chat: fix language input for translategemma (#19052)

Co-authored-by: Aldehir Rojas <hello@alde.dev>

chat: fix case where template accepts type content only (#19419)

mtmd : chat : Fix extra \n between text and media marker (#19595)

Thanks to @tugot17 for detecting and reporting the issue.

For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces output identical to the HF implementation.

However, `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.

This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.

This PR introduces a new type `media_marker` and uses it for media markers.
Extra logic is added to prevent the insertion of newlines before and after
media markers.

With this change the number of input tokens is identical to the HF
implementation, and as a result the output is also identical.

I explored other ways to address the issue:
* completely remove the `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`

Please propose alternative ways of fixing this issue.
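
The joining rule can be sketched as follows, with the part types assumed from the description above:

```cpp
#include <string>
#include <vector>

// Assumed stand-ins for the content part types, including the new media_marker.
enum class part_type { text, media_marker };
struct content_part { part_type type; std::string text; };

static std::string join_parts(const std::vector<content_part> & parts) {
    std::string out;
    for (size_t i = 0; i < parts.size(); ++i) {
        // insert the '\n' separator only between two plain text parts,
        // never before or after a media marker
        if (i > 0 &&
            parts[i - 1].type == part_type::text &&
            parts[i    ].type == part_type::text) {
            out += '\n';
        }
        out += parts[i].text;
    }
    return out;
}
```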

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

common : merge qwen3-coder and nemotron nano 3 parsers (#19765)

common : fix improper trimming in XML parser on complete message (#19805)

Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>

jinja: correct stats for tojson and string filters (#19785)

jinja : correct default size for string slices (#19913)

common : handle unicode during partial json parsing (#16526)

common : fix json schema with '\' in literals (#17307)

add back qwen_coder_xml and mirothinker

Co-authored-by: Aldehir Rojas <hello@alde.dev>
2026-03-09 11:03:33 +01:00
Kawrakow
542988773c Update README with backend support notes
Clarify backend support and usage of quantized models in README.
2026-03-09 07:36:08 +01:00
Kawrakow
344688ce50 Add all Qwen3.5 model types (#1378)
* Add all Qwen3.5 model types

* Need also this
2026-03-07 09:01:33 +01:00
usrlocalben
c1c3421462 Fix incorrect --amb n_max_head fitting (#1375)
kv_f32_size should be fit to --amb by number of divisions, not heads per
division.

Regression in b85a2a5
2026-03-07 09:01:14 +01:00
Kawrakow
277fc1d26f Do not repeat yourself (#1373)
* DRY - part 1

* DRY - part 2

* DRY - part 3

* Fix NEON

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-06 16:06:51 +01:00
Kawrakow
082addead2 Minor: do not apply SILU to the whole convolution output (#1374) 2026-03-06 16:06:34 +01:00
Kawrakow
fa0c29843d Fix split mode graph with Qwen3.5-MoE/Qwen3-Next hybrid inference (#1368) 2026-03-06 07:26:15 +01:00
Kawrakow
3208660d20 Be able to quantize mmproj files (#1367) 2026-03-06 07:25:40 +01:00
Kawrakow
1ef4b5eddc Disable split mode graph for recurrent/hybrid models when tensor overrides are used (#1366) 2026-03-05 10:25:50 +01:00
Kawrakow
8fb002207a Fused delta-net (AVX512) (#1362) 2026-03-05 07:55:05 +01:00
firecoperana
2add439e43 grammar: fix trigger pattern init error (#1365)
Co-authored-by: firecoperana <firecoperana>
2026-03-05 07:54:41 +01:00
dungquixote42
a903409a5e fix adaptive p sampler rewinding too far back (#1359)
* fix adaptive p sampler rewinding too far back

* update comments

* correct default value for total_weight, more comments

* new variables/names

* update comment for n_rewind

* move null pointer check back to common_sampler_review()

* refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()
2026-03-04 13:26:25 +01:00
Kawrakow
f27678d39b ARM_NEON fused delta-net implementation (#1361)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-04 13:24:59 +01:00
mullecofo
2f93bf7563 Fix compilation on clang-cl.exe (#1355)
Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169

See the bitwise arithmetic here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html

Clang (and GCC) supports a language feature called Vector Extensions.

To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.

MSVC follows a stricter, more traditional C++ model regarding intrinsics.

In MSVC, `__m512i` is defined in the header files (`<immintrin.h>`) as a struct or union (e.g., `typedef struct __m512i { ... } __m512i`). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like `int` or `float`.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the `__m512i` struct. Since the standard headers don't provide one, the compiler throws an error.
You must use the explicit intrinsic function provided by Intel/MSVC: `_mm512_or_si512(a, b)`.

To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.
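
For reference, the kind of shim this implies (a sketch, not necessarily the exact code in this PR):

```cpp
#include <immintrin.h>

#if defined(_MSC_VER) && !defined(__clang__)
// MSVC treats __m512i as an opaque struct, so teach it what '|' means by
// forwarding to the explicit AVX-512 intrinsic. The __clang__ guard keeps
// this away from clang-cl, whose vector extensions already define the
// operator natively (a second definition would conflict).
static inline __m512i operator|(__m512i a, __m512i b) {
    return _mm512_or_si512(a, b);
}
#endif
```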
2026-03-04 08:00:28 +01:00
Kawrakow
fd16a418de Fix clang warnings on macOS (#1354)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-03-03 16:27:16 +01:00
Yap Sok Ann
ea3e8e30e1 Allow arbitrary argument order for Q3C, Q3CN, and Qwen3.5 (#1352)
This should fix the read file at offset/limit issue, where the tool
definition has offset before limit, while the model sets limit before
offset.
2026-03-03 15:39:16 +01:00
Kawrakow
505e2c57f9 Reduce memory use when processing large images (#1349) 2026-03-02 17:54:56 +01:00
Kawrakow
3735e88925 Remove unused tensors from delta-net (#1350) 2026-03-02 16:02:40 +01:00
Nexes the Elder
d4ac5f1566 gguf-split: fix the split output file naming (#1336)
* Fix gguf-split.cpp splits output naming

With this fix, the extension of the source .gguf file is no longer included in the output file name before the split numbering.

ex:

No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf

* Increase GGML_MAX_CONTEXTS to 2048

* Revert GGML_MAX_CONTEXTS to 64
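
The renaming amounts to stripping the source extension before appending the split suffix; a sketch (the actual helper in gguf-split may differ):

```cpp
#include <cstdio>
#include <string>

// "model.gguf" + (1, 200) -> "model-00001-of-00200.gguf"
// instead of the old "model.gguf-00001-of-00200.gguf".
static std::string split_path(std::string base, int i_split, int n_split) {
    const std::string ext = ".gguf";
    if (base.size() >= ext.size() &&
        base.compare(base.size() - ext.size(), ext.size(), ext) == 0) {
        base.resize(base.size() - ext.size()); // drop the source extension
    }
    char suffix[32];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", i_split, n_split);
    return base + suffix;
}
```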
2026-03-02 08:43:47 +01:00
Kawrakow
d239dabcc6 Graph parallel for Qwen-3.5-MoE (#1347)
* Graph parallel for Qwen3.5-MoE

* Add --max-gpu to llama-bench

* Fix graph reuse when not all GPUs participate in self-attention
2026-03-02 07:48:43 +01:00
firecoperana
8f9e19d57c server: add checkpoint tolerance and fix grammar_trigger init (#1346)
Co-authored-by: firecoperana <firecoperana>
2026-03-02 07:45:32 +01:00
Kawrakow
a568e12c8f Minor delta-net tweak (#1337) 2026-03-01 17:45:02 +01:00
Kawrakow
04c140fe54 Make vision work with Qwen-3.5 models (#1345) 2026-03-01 17:44:37 +01:00
Kawrakow
0ff3a43289 Bring back #1333 and #1335 (#1340)
* Bring back fused delta net 3

* Remove autoregressive and chunking
2026-02-28 14:31:42 +01:00
Kawrakow
1922449b2c Revert delta net 3 (#1339)
* Revert "Simplify delta-net (#1335)"

This reverts commit e5fc30244c.

* Revert "Fused delta net 3 (#1333)"

This reverts commit 7b68353e09.
2026-02-28 13:12:08 +01:00
Kawrakow
e5fc30244c Simplify delta-net (#1335)
* Simplify delta-net

* Minor

* Minor
2026-02-28 11:12:19 +01:00
Kawrakow
702e0765b8 Update README with clarification on '_XL' models
Clarified warning about Unsloth '_XL' models in README.
2026-02-27 16:22:10 +01:00
Kawrakow
7b68353e09 Fused delta net 3 (#1333)
* This is better than chunked

* Keep the state in registers

* Cleanup

* Remove unused stuff

* Minor

* Make fused delta-net the default

* Fix race
2026-02-27 15:02:56 +01:00
Kawrakow
1e6d36b1b4 Graph parallel for dense Qwen-3.5 models (#1331)
* Graph parallel for dense Qwen-3.5 models

* Cleanup
2026-02-27 07:03:25 +01:00
Kawrakow
facc8fdc44 Very slightly better fused delta-net (#1330) 2026-02-27 07:03:09 +01:00
Kawrakow
62a7dcac5a Move the Qwen-3.5 models to the standard attention mechanism (#1329) 2026-02-26 15:50:51 +01:00
Kawrakow
757bee6238 Add special FA handling for dense Qwen3.5 (#1328) 2026-02-26 11:27:41 +01:00
Kawrakow
0aa6f7e7cd Adding support for dense Qwen-3.5 models (#1326) 2026-02-26 08:51:01 +01:00
Kawrakow
2616efa296 Fused delta net 2 (#1320)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do
actually want to see the buffer sizes.
2026-02-26 06:53:43 +01:00
Kawrakow
87b35dac0c Faster quantization for MoE models with many experts (#1322) 2026-02-26 06:52:28 +01:00
firecoperana
3fac78c48b server: enable checkpoint for recurrent models (#1310)
* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-26 06:51:18 +01:00
Kawrakow
216f44363f Fix KT quantization yet again (#1321)
* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one
2026-02-25 18:07:12 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Nexes the Elder
0bf7043a7b Display the size of the tensors overridden during tensor loading (#1318)
* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And move to debug level the later display of the unnamed buffer override sizes.

Ex: `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
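
The unit change is the usual binary conversion; applied to the example above:

```cpp
#include <cstddef>

// 668467200 bytes / (1024 * 1024) = 637.50 MiB exactly.
static double bytes_to_mib(std::size_t n_bytes) {
    return (double) n_bytes / (1024.0 * 1024.0);
}
```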
2026-02-25 07:36:27 +01:00
Nexes the Elder
170467e835 Llama-quantize: Partial requant feature (#1313)
* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUFs split tensor by tensor and for those split in groups of several tensors (though the latter is not much tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup
2026-02-25 07:25:15 +01:00
Joshua Jolley
68431b049a server: propagate task index to response objects for batch requests (#1303)
When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
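
A sketch of the shape of the fix, with the server types reduced to stand-ins (member names taken from the message):

```cpp
// Simplified stand-ins for the server types involved.
struct server_task     { int index = 0; };      // position of the prompt in the batch
struct server_slot     { const server_task * task = nullptr; };
struct server_response { int index = 0; };      // must mirror the task index

// The same one-line assignment lands in send_partial_response,
// send_final_response, and send_embedding.
static server_response make_response(const server_slot & slot) {
    server_response res;
    res.index = slot.task->index; // previously left at its default of 0
    return res;
}
```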
2026-02-24 15:39:38 +01:00
dungquixote42
aaa545c3dc adaptive p: collect probability before logit bias (#1314) 2026-02-24 15:39:17 +01:00
Kawrakow
38ca19d828 Minor delta-net tweak (#1308)
* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak
2026-02-24 15:22:57 +01:00
Kawrakow
7065488135 Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Kawrakow
cfb6747776 llama-quantize: --dry-run option (#1309) 2026-02-24 15:21:52 +01:00
TheAIGuyFromAR
96b8298472 Fix typo in merge-up-gate-experts argument (#1311) 2026-02-24 15:13:22 +01:00
Kawrakow
68bd30d99c Fix max nodes (again) (#1306) 2026-02-23 11:17:37 +01:00
Kawrakow
2bb40f8c35 Fix llm_arch_is_hybrid (#1305) 2026-02-23 08:55:53 +01:00
Kawrakow
5dacb5355a Graph parallel for Qwen3-Next (#1292)
* WIP

* This works, but is slower than split mode layer
2026-02-23 07:58:00 +01:00
Yap Sok Ann
dcf50d8279 Fix tool call for Qwen3.5 (#1300)
* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* https://github.com/ggml-org/llama.cpp/pull/19635
* https://github.com/ggml-org/llama.cpp/pull/19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one
2026-02-23 07:54:56 +01:00