* Fix compilation on clang-cl.exe
Fixes https://github.com/ikawrakow/ik_llama.cpp/issues/1169
See the bitwise arithmetic definitions here: https://clang.llvm.org/doxygen/avx512fintrin_8h_source.html
Clang (and GCC) supports a language feature called Vector Extensions.
To Clang, `__m512i` is not just a "struct" or a "bag of bits"; it is recognized by the compiler as a native vector type.
Because it is a native vector type, Clang automatically maps standard C operators to the corresponding hardware instructions.
When you write `a | b`, Clang sees that a and b are 512-bit integer vectors.
It implicitly understands that the bitwise OR operator (|) applies to these vectors.
It automatically generates the VPORQ (or VPORD) instruction without needing any helper function.
MSVC follows a stricter, more traditional C++ model regarding intrinsics.
In MSVC, __m512i is defined in the header files (<immintrin.h>) as a struct or union (e.g., typedef struct __m512i { ... } __m512i). To the MSVC compiler, it is essentially a user-defined data type, not a fundamental language primitive like int or float.
Standard C++ does not define what `|` means for a user-defined struct.
MSVC does not have the same "Vector Extensions" that automatically apply operators to these structs.
When you write `a | b` in MSVC, the compiler looks for a definition of `operator|` for the __m512i struct. Since the standard headers don't provide one, the compiler throws an error.
You must use the explicit intrinsic function provided by Intel/MSVC: _mm512_or_si512(a, b).
To get the nice syntax `(a | b)` in MSVC, you have to manually "teach" the compiler what `|` means by defining the `operator|` overload yourself.
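A minimal sketch of the overload approach described above. To stay compilable without AVX-512 support, it uses a hypothetical stand-in struct `vec512` (eight 64-bit lanes) instead of the real `__m512i`; the names `vec512` and `lane` are assumptions for illustration only. For the real type, the body of the overload would presumably be a single call to `_mm512_or_si512(a, b)`, defined only for compilers that lack the built-in operator.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-in for a struct-like 512-bit vector type (8 x 64-bit lanes).
struct vec512 {
    std::array<std::uint64_t, 8> lane;
};

// Without an overload like this, `a | b` on a plain struct does not compile.
// For the real __m512i the body would be: return _mm512_or_si512(a, b);
inline vec512 operator|(const vec512 & a, const vec512 & b) {
    vec512 r{};
    for (std::size_t i = 0; i < r.lane.size(); ++i) {
        r.lane[i] = a.lane[i] | b.lane[i];
    }
    return r;
}
```

With the overload in scope, `a | b` reads the same on every compiler, and the intrinsic call stays hidden in one place.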
* Update README.md with build instructions for Windows
The current README lacks any guide for Windows users, even though the build process on that platform is quite complicated
* Update build.md with instructions about clang-cl.exe
Adds step-by-step build instructions for Windows
* Apply suggestions from code review
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* Polish build.md for Windows usage
Added a usage example for Windows
* Apply suggestions from code review
---------
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
common : add nemotron 3 parsing (#18077)
common : add parser for ministral/mistral large 3/devstral 2 (#17713)
common : default content to an empty string (#18485)
chat: make tool description and parameters optional per OpenAI spec (#18478)
Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.
Attempts to fix #17667
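The lenient behaviour described above can be sketched as follows. This is not the actual parser: `json_object` is a hypothetical stand-in for a parsed JSON object (field name to raw value), and `tool_function`, `value_or` and `parse_tool_function` are illustrative names. The sketch assumes `name` remains required while `description` and `parameters` fall back to defaults.

```cpp
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a parsed JSON object: field name -> raw value.
using json_object = std::map<std::string, std::string>;

struct tool_function {
    std::string name;
    std::string description; // optional, defaults to ""
    std::string parameters;  // optional, defaults to an empty JSON object
};

// Return the field's value if present, otherwise the given default.
inline std::string value_or(const json_object & obj, const std::string & key,
                            const std::string & def) {
    auto it = obj.find(key);
    return it == obj.end() ? def : it->second;
}

inline tool_function parse_tool_function(const json_object & fn) {
    tool_function out;
    out.name        = fn.at("name"); // still required: throws std::out_of_range if missing
    out.description = value_or(fn, "description", "");
    out.parameters  = value_or(fn, "parameters", "{}");
    return out;
}
```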
common : implement new jinja template engine (#18462)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
jinja: correct member access rule (#18905)
jinja : fix lexing of float literals with sign (#18901)
jinja : add missing tojson filter for bool (#18900)
jinja : attribute support for join, map and sort (#18883)
jinja : fix object item order (and properly implement dictsort) (#18904)
tests : add test-jinja -py option for cross-checking (#18906)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
ci : run test-jinja -py on high perf [no ci] (#18916)
jinja : fix undefined keys and attributes and int/float as bool (#18924)
jinja: support none|string (#18995)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
jinja : implement mixed type object keys (#18955)
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
`tojson` is not a supported `undefined` filter
keep it DRY and fix some types
jinja : do not pass empty tools and add some none filters (#19176)
jinja : add unordered_map include to value.h [no ci] (#19205)
jinja : add missing 'in' test to template engine (#19004) (#19239)
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
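A simplified sketch of the three containment cases, assuming C++17. The `container` variant and the `is_in` function are hypothetical and much smaller than the engine's real value type; the sketch also assumes that object containment means key lookup, as in Python's Jinja2.

```cpp
#include <algorithm>
#include <map>
#include <string>
#include <variant>
#include <vector>

// Hypothetical simplified value type: a string, an array of strings, or an object.
using str_array  = std::vector<std::string>;
using str_object = std::map<std::string, std::string>;
using container  = std::variant<std::string, str_array, str_object>;

// `needle in haystack`: substring for strings, element for arrays, key for objects.
inline bool is_in(const std::string & needle, const container & haystack) {
    if (const auto * s = std::get_if<std::string>(&haystack)) {
        return s->find(needle) != std::string::npos;
    }
    if (const auto * a = std::get_if<str_array>(&haystack)) {
        return std::find(a->begin(), a->end(), needle) != a->end();
    }
    const auto & o = std::get<str_object>(haystack);
    return o.count(needle) > 0;
}
```

Registering one such predicate as a named test is what lets filters like `select("in", ...)` and `reject("in", ...)` reuse it.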
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Add Jinja support for "indent" string filter (#19529)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
add vendor
refactor chat
server : support preserving reasoning_content in assistant message (#18994)
chat : fix translategemma crash on common_chat_format_example (#19019)
chat: fix language input for translategemma (#19052)
Co-authored-by: Aldehir Rojas <hello@alde.dev>
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
chat: fix case where template accepts type content only (#19419)
mtmd : chat : Fix extra \n between text and media marker (#19595)
Thanks to @tugot17 for detecting and reporting the issue.
For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct), `llama-mtmd-cli` produces output identical to the HF implementation.
However, `llama-server` doesn't. I traced it down to an extra newline
inserted after `<__media__>`.
This happens in `to_json_oaicompat`, which treats media markers as text
and joins all parts with a `\n` separator.
This PR introduces a new type, `media_marker`, and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.
With this change, the number of input tokens is identical to the HF
implementation and, as a result, the output is also identical.
I explored other ways to address the issue:
* completely remove the `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`
Please propose alternative ways of fixing this issue.
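The joining rule described above can be sketched roughly like this. `part_type`, `content_part` and `join_parts` are hypothetical names, not the actual code in `to_json_oaicompat`; the sketch only shows the idea of inserting `\n` between two text parts and never next to a media marker.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical part type: plain text or a media marker such as "<__media__>".
enum class part_type { text, media_marker };

struct content_part {
    part_type   type;
    std::string value;
};

// Join parts with "\n", but never insert a newline before or after a media marker.
inline std::string join_parts(const std::vector<content_part> & parts) {
    std::string out;
    for (std::size_t i = 0; i < parts.size(); ++i) {
        if (i > 0 &&
            parts[i].type     == part_type::text &&
            parts[i - 1].type == part_type::text) {
            out += "\n";
        }
        out += parts[i].value;
    }
    return out;
}
```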
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
common : merge qwen3-coder and nemotron nano 3 parsers (#19765)
common : fix improper trimming in XML parser on complete message (#19805)
Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>
jinja: correct stats for tojson and string filters (#19785)
jinja : correct default size for string slices (#19913)
common : handle unicode during partial json parsing (#16526)
common : fix json schema with '\' in literals (#17307)
add back qwen_coder_xml and mirothinker
Co-authored-by: Aldehir Rojas <hello@alde.dev>
* fix adaptive p sampler rewinding too far back
* update comments
* correct default value for total_weight, more comments
* new variables/names
* update comment for n_rewind
* move null pointer check back to common_sampler_review()
* refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()
* Fix gguf-split.cpp splits output naming
With this fix, the original extension of the source .gguf file is no longer included in the output file name before the split numbering.
Example:
No more model.gguf-00001-of-00200.gguf
Instead, model-00001-of-00200.gguf
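A minimal sketch of the naming rule, assuming the `-%05d-of-%05d.gguf` suffix pattern visible in the example above. `split_prefix` and `split_name` are hypothetical helper names, not the functions used in gguf-split.cpp.

```cpp
#include <cstdio>
#include <string>

// Strip a trailing ".gguf" (if any) from the source path to get the output prefix.
inline std::string split_prefix(const std::string & path) {
    const std::string ext = ".gguf";
    if (path.size() >= ext.size() &&
        path.compare(path.size() - ext.size(), ext.size(), ext) == 0) {
        return path.substr(0, path.size() - ext.size());
    }
    return path;
}

// Build "<prefix>-00001-of-00200.gguf" style names (1-based split number).
inline std::string split_name(const std::string & path, int split_no, int split_count) {
    char suffix[48];
    std::snprintf(suffix, sizeof(suffix), "-%05d-of-%05d.gguf", split_no, split_count);
    return split_prefix(path) + suffix;
}
```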
* Increase GGML_MAX_CONTEXTS to 2048
* Revert GGML_MAX_CONTEXTS to 64
* Revive fused delta-net
* Add command line argument for fused delta net
* Simplify/improve CUDA delta-net
* Add -fdn to llama-bench
* More CUDA fused delta net optimizations
* CPU optimizations
* Much faster fused delta-net on the CPU
It seems it is faster than the chunked implementation!
* Change meaning of fdn from bool flag to threshold value
* Use eps = 1e-6
* Give some nodes a name
* Don't re-apply L2 norm - it has already been done
* This seems quite a bit better
* More tweaks
* Restore per context buffer size log
Not everybody uses models split into 2000 parts, and those who do
actually want to see the buffer sizes.
* server: enable checkpoint for recurrent models
create checkpoint after cancel
fix ban string and rm context during rewind
add checkpoint interval
only save recurrent cache
* save checkpoint during pp
---------
Co-authored-by: firecoperana <firecoperana>
* Display the size of the tensors overridden during tensor loading
Ex:
`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`
becomes
`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`
Also move the later display of the unnamed buffer override sizes to debug output.
Ex: `llm_load_tensors: CPU buffer size = XXX.XX MiB`
That double display clutters the screen without being very informative.
* Change the bytes display to MiB.
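For illustration, a byte count can be rendered in MiB roughly as below. `format_mib` is a hypothetical helper, and the two-decimal format is an assumption taken from the `XXX.XX MiB` pattern in the log line above.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Format a byte count as MiB with two decimals (1 MiB = 1024 * 1024 bytes).
inline std::string format_mib(std::uint64_t n_bytes) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%.2f MiB", n_bytes / 1024.0 / 1024.0);
    return buf;
}
```

For the tensor size shown in the example, 668467200 bytes, this yields `637.50 MiB`.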
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
---------
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* Partial Requant feature for llama-quantize
- Inspired by the recently ported --dry-run feature.
- Allows partially requantizing a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs split tensor by tensor as well as by groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.
* Create output directory if it doesn't exist in llama-quantize
* Create output directory if it doesn't exist in gguf-split
* Add exit when directory fails to be created on Windows
* Use std::filesystem
* cleanup
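The directory handling from the commits above can be sketched with `std::filesystem` (C++17) as follows. `ensure_parent_dir` is a hypothetical helper name; the caller is assumed to exit or report an error when it returns false.

```cpp
#include <filesystem>
#include <string>
#include <system_error>

// Ensure the parent directory of `out_path` exists; returns false on failure.
inline bool ensure_parent_dir(const std::string & out_path) {
    namespace fs = std::filesystem;
    const fs::path dir = fs::path(out_path).parent_path();
    if (dir.empty()) {
        return true; // the file goes into the current directory
    }
    std::error_code ec;
    fs::create_directories(dir, ec); // not an error if the directory already exists
    return !ec && fs::is_directory(dir);
}
```

Using the `std::error_code` overload avoids exceptions and behaves the same on Windows and POSIX systems.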
When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.
Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.
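A simplified sketch of why the index matters. The `task` and `result` structs and both functions are hypothetical and far smaller than the server's real types; only the assignment of the task's index to the result mirrors the fix described above.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical, simplified task/result types for illustration.
struct task   { int index; std::string prompt; };
struct result { int index = 0; std::string text; };

// Each result copies the index of the task it answers, so results can be
// matched to prompts even if they arrive out of order.
inline result make_result(const task & t, const std::string & text) {
    result res;
    res.index = t.index; // without this assignment every result reports index 0
    res.text  = text;
    return res;
}

// Client side: place each result in the slot given by its index.
inline std::vector<std::string> order_by_index(const std::vector<result> & results) {
    std::vector<std::string> out(results.size());
    for (const auto & r : results) {
        out.at(static_cast<std::size_t>(r.index)) = r.text;
    }
    return out;
}
```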
Generated with [Devin](https://cli.devin.ai/docs)
Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>