---------
Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
common : add nemotron 3 parsing (#18077)
common : add parser for ministral/mistral large 3/devstral 2 (#17713)
common : default content to an empty string (#18485)
chat: make tool description and parameters optional per OpenAI spec (#18478)
Per the OpenAI API specification, both 'description' and 'parameters'
fields in tool function definitions are optional. Previously, the parser
would throw an exception if these fields were missing.
Attempts to fix#17667
common : implement new jinja template engine (#18462)
---------
Co-authored-by: Alde Rojas <hello@alde.dev>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
jinja: correct member access rule (#18905)
jinja : fix lexing of float literals with sign (#18901)
jinja : add missing tojson filter for bool (#18900)
jinja : attribute support for join, map and sort (#18883)
jinja : fix object item order (and properly implement dictsort) (#18904)
tests : add test-jinja -py option for cross-checking (#18906)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
ci : run test-jinja -py on high perf [no ci] (#18916)
jinja : fix undefined keys and attributes and int/float as bool (#18924)
jinja: support none|string (#18995)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
jinja : implement mixed type object keys (#18955)
---------
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147)
`tojson` is not a supported `undefined` filter
keep it DRY and fix some types
jinja : do not pass empty tools and add some none filters (#19176)
jinja : add unordered_map include to value.h [no ci] (#19205)
jinja : add missing 'in' test to template engine (#19004) (#19239)
The jinja template parser was missing the 'in' test from
global_builtins(), causing templates using reject("in", ...),
select("in", ...), or 'x is in(y)' to fail with
"selectattr: unknown test 'in'".
This broke tool-calling for Qwen3-Coder and any other model
whose chat template uses the 'in' test.
Added test_is_in supporting array, string, and object containment
checks, mirroring the existing 'in' operator logic in runtime.cpp.
Includes test cases for all three containment types plus
reject/select filter usage.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
---------
Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Add Jinja support for "indent" string filter (#19529)
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
add vendor
refactor chat
server : support preserving reasoning_content in assistant message (#18994)
chat : fix translategemma crash on common_chat_format_example (#19019)
chat: fix language input for translategemma (#19052)
Co-authored-by: Aldehir Rojas <hello@alde.dev>
---------
Co-authored-by: Aldehir Rojas <hello@alde.dev>
chat: fix case where template accepts type content only (#19419)
mtmd : chat : Fix extra \n between text and media marker (#19595)
Thanks to @tugot17 for detecting and reporting the issue.
For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation.
However `llama-server` doesn't. I traced it down to extra newline
inserted after `<__media__>`.
This happens in `to_json_oaicompat`, that treats media markers as text
and joins all parts with `\n` separator.
PR introduces new type `media_marker` and uses it for media markers.
Extra logic is added to prevent insertion of newlines before and after
media markers.
With this change number of input tokens is identical to HF
implementation and as a result the output is also identical.
I explored other ways to address the issue
* remove completely `\n` between text parts in `to_json_oaicompat`
* merge text messages in server-common.cpp before sending them to `to_json_oaicompat`
Please propose alternative ways of fixing this issue.
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
---------
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
common : merge qwen3-coder and nemotron nano 3 parsers (#19765)
common : fix improper trimming in XML parser on complete message (#19805)
Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com>
jinja: correct stats for tojson and string filters (#19785)
jinja : correct default size for string slices (#19913)
common : handle unicode during partial json parsing (#16526)
common : fix json schema with '\' in literals (#17307)
add back qwen_coder_xml and mirothinker
Co-authored-by: Aldehir Rojas <hello@alde.dev>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* ci : start using Pythia models over OpenLlama
ggml-ci
* ci : disable q2_k ppl tests
* ci : use convert-hf-to-gguf.py
* ci : update gg_get_model
* ci : fix convert outfile name
ggml-ci
* llama : gptneox arch use F32 attn prec
ggml-ci
* ggml : add ggml_flash_attn_ext API
* ggml : fix GQA support in ggml_flash_attn_ext
* ggml : online attention (CPU)
* metal : initial implementation
* metal : f16 precision
* metal : reduce branches
* metal : specialize for head size
* wip : 8 rows per simd group
* wip : 4 rows per simd group
* wip : template for rows per warp
* metal : parallelize across KV size
* metal : parallel reduce across heads
* metal : efficient flash_attn_f16 implementation
* metal : avoid redundant loads of the attention
* metal : scale and mask in matrix form
* metal : fix comment
* llama : avoid ggml_cast, use F32 query
* metal : add parallel reduce version (disabled)
* metal : move output into local memory + optimize
- the result from each simdgroup now stays in the registers
- significantly reduced SRAM usage
- more efficient skipping of -INF blocks
- avoid simdgroup barrier in hot loop
- add comments
* metal : add tests, fix scaling, support C > 32
* metal : improve precision
* ggml : fix f16 mad
* metal : minor
* metal : support Q > 8
* tests : add ATTN tests
* metal : disable buffer allocation logs
* tests : more
* metal : faster inner loop for C == 32
* metal : fix array initialization
* tests : ifdef
* ggml : switch to padded F16 mask for ggml_soft_max, ggml_flash_attn_ext
* ggml : fix ggml_soft_max mask requirement
* cuda : fix soft_max to use correct mask size
* cuda : add flash_attn kernel (wip)
* metal : optimize softmax for C > 32
* metal : optimize softmax
* tests : minor fix
* cuda : avoid zeroing fragments
* tests : update dims
* cuda : fix __hisinf() result check
* cuda : avoid warp_reduce for smax
* cuda : use int instead of int64_t
Noticeably improves performance (thanks to Johannes)
* cuda : make loops use the same loop values
Thanks Johannes again for the tip
* cuda : unroll some of the loops
* cuda : avoid __hisinf branches
* cuda : use half2 in softmax
* cuda : switch to 1 warp for bs > 16
* cuda : speed-up reduce part of the kernel
* cuda : unroll Q*K^T loop
* cuda : fix -INF block check
* cuda : simplify softmax
* cuda : fix matrix names
* cuda : minor
* llama : adapt to F16 KQ_pos
* llama : adapt new models to F16 KQ_mask
* ggml : fix F16 store (ARM NEON)
* llama : fix type of KQ_mask and KQ_pos
* ggml : fix CPU soft_max
* tests : add hs=256
* cuda : fix build
* metal : improve perf via smaller int registers
* cuda : adapt soft_max to F16 mask and pos
* CUDA: faster FlashAttention, kernel for bs == 1
* 16 cols for Phi-2
* no vec for hs, no hs==256 ncols==32 for Volta
* adjust kernel selection logic
* 4 warps, 256 stride for all D
* no ncols == 64
* Multiple parallel blocks for batch size 1
* fix compile warnings
* fix excessive KQ_b loads
* fix cmake build
* fix KV cache padding, NaN from INFINITY (#6438)
* llama : flash_attn cparam + fix defrag
* server: support flash_attn param
* server: bench: enable flash_attn param
* CUDA: refactor host code, dyn. par. blocks
* fix flash_attn_vec_f16 race condition
* flush softmax exp below threshold to 0
* store temp KQ in registers
* Calculate KQ as FP32 if KQV has GGML_PREC_F32
* Add __hgt2_mask implementation for CUDA 11
* fix KQ FP32 precision fpr parallel_blocks > 1
* llama-bench : add -fa,--flash-attn arg
* metal : add BS=1 kernel for flash attention (#6508)
* metal : add BS=1 kernel for flash attention (wip)
* metal : support more than 1 warps
* metal : opts
* metal : opt
* metal : switch to parallel reduce
* metal : reduce registers
* metal : simplify
* metal : initial FA vec kernel
* metal : use F32 attention accumulators
* batched-bench : add fattn arg
* llama : simplify llama_build_kv_store
ggml-ci
* llama : adapt build_olmo to changes
* ggml : fix arm fp16 store on windows
* metal : clean-up
* metal : clean-up kernel code
* metal : minor
* tests : remove benchmarks
ggml-ci
* ggml : fix avx512 const correctness
ggml-ci
* ggml : fix soft_max with bias on CPU
ggml-ci
* common : print --flash-attn in help
* ggml : fix num dimensions in ggml_flash_attn_ext
* llama : force disable flash attention for incompatible models
* ggml : ggml_soft_max support F16/F32 mask/pos
ggml-ci
* cuda : uint -> uint32_t
* cuda : "constexpr dim3" -> "const dim3"
ggml-ci
* cuda : try to fix __hgt2_mask
ggml-ci
* ggml : add TODO's for F16/F32 mask/pos support in other backends
* llama : replace bool need_kq_pos with use_alibi
* llama : prep ALiBi support for BERT models
ggml-ci
* llama : fix n_batch requirements
ggml-ci
* cont
* server : add help for --flash-attn arg
* llama : disable FA for AMD
* tests : remove TMP_ATTN_BENCH
ggml-ci
* llama : support save/load state with FA enabled
ggml-ci
* ci : add CUDA save-load-state tests
ggml-ci
* llama : llama_kv_cache_clear zeroes data + fix save-load seq
ggml-ci
* llama : fix copy-paste errors, add TODO
* llama : disallow incompatible states
* llama : update llama_state_get_size after v_trans field
* metal : remove tmp log
* llama : add static reminder for llama_state_get_size
* metal : fix max nsg
ggml-ci
* ci : fix arg order
ggml-ci
---------
Co-authored-by: Johannes Gäßler <johannesg@5d6.de>
Co-authored-by: Pierrick HYMBERT <pierrick.hymbert@gmail.com>
* Fix --split-max-size
Byte size calculation was done on int and overflowed.
* add tests.sh
* add examples test scripts to ci run
Will autodiscover examples/*/tests.sh scripts and run them.
* move WORK_PATH to a subdirectory
* clean up before and after test
* explicitly define which scripts to run
* add --split-max-size to readme
* first update for migration
* update init_cublas
* add debug functio, commit all help code
* step 1
* step 2
* step3 add fp16, slower 31->28
* add GGML_LIST_DEVICE function
* step 5 format device and print
* step6, enhance error check, remove CUDA macro, enhance device id to fix none-zero id issue
* support main device is non-zero
* step7 add debug for code path, rm log
* step 8, rename all macro & func from cuda by sycl
* fix error of select non-zero device, format device list
* ren ggml-sycl.hpp -> ggml-sycl.h
* clear CMAKE to rm unused lib and options
* correct queue: rm dtct:get_queue
* add print tensor function to debug
* fix error: wrong result in 658746bb26702e50f2c59c0e4ada8e9da6010481
* summary dpct definition in one header file to replace folder:dpct
* refactor device log
* mv dpct definition from folder dpct to ggml-sycl.h
* update readme, refactor build script
* fix build with sycl
* set nthread=1 when sycl, increase performance
* add run script, comment debug code
* add ls-sycl-device tool
* add ls-sycl-device, rm unused files
* rm rear space
* dos2unix
* Update README_sycl.md
* fix return type
* remove sycl version from include path
* restore rm code to fix hang issue
* add syc and link for sycl readme
* rm original sycl code before refactor
* fix code err
* add know issue for pvc hang issue
* enable SYCL_F16 support
* align pr4766
* check for sycl blas, better performance
* cleanup 1
* remove extra endif
* add build&run script, clean CMakefile, update guide by review comments
* rename macro to intel hardware
* editor config format
* format fixes
* format fixes
* editor format fix
* Remove unused headers
* skip build sycl tool for other code path
* replace tab by space
* fix blas matmul function
* fix mac build
* restore hip dependency
* fix conflict
* ren as review comments
* mv internal function to .cpp file
* export funciton print_sycl_devices(), mv class dpct definition to source file
* update CI/action for sycl code, fix CI error of repeat/dup
* fix action ID format issue
* rm unused strategy
* enable llama_f16 in ci
* fix conflict
* fix build break on MacOS, due to CI of MacOS depend on external ggml, instead of internal ggml
* fix ci cases for unsupported data type
* revert unrelated changed in cuda cmake
remove useless nommq
fix typo of GGML_USE_CLBLAS_SYCL
* revert hip cmake changes
* fix indent
* add prefix in func name
* revert no mmq
* rm cpu blas duplicate
* fix no_new_line
* fix src1->type==F16 bug.
* pass batch offset for F16 src1
* fix batch error
* fix wrong code
* revert sycl checking in test-sampling
* pass void as arguments of ggml_backend_sycl_print_sycl_devices
* remove extra blank line in test-sampling
* revert setting n_threads in sycl
* implement std::isinf for icpx with fast math.
* Update ci/run.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update examples/sycl/run-llama2.sh
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Update CMakeLists.txt
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* add copyright and MIT license declare
* update the cmd example
---------
Co-authored-by: jianyuzh <jianyu.zhang@intel.com>
Co-authored-by: luoyu-intel <yu.luo@intel.com>
Co-authored-by: Meng, Hengyu <hengyu.meng@intel.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* scripts : add lib.sh and lib_test.sh
* scripts : stub out new ci-run.sh script
* scripts : switch to PascalCase for functions
This looks a little odd at first, but I find it very useful as a
convention to know if a command is part of our code vs a builtin.
* scripts : add some fancy conversion from snake_case to PascalCase
* Add venv to ci/run.sh
* Revert scripts work
* scripts : add wrapper script for local use of ci/run.sh
* Simplify .gitignore for tests, clang-tidy fixes
* Label all ctest tests
* ci : ctest uses -L main
* Attempt at writing ctest_with_model
* Update test-model-load-cancel
* ci : add ctest_with_model for debug and release
ggml-ci
* Fix gg_get_model function
ggml-ci
* got stuck on CMake
* Add get_model.cpp to tests/CMakeLists.txt
ggml-ci
* Fix README.md output for ctest_with_model
ggml-ci
* workflows : use `-L main` for all ctest
ggml-ci
* Fixes
* GG_RUN_CTEST_MODELFILE => LLAMACPP_TESTMODELFILE
* Always show warning rather than failing if model file variable is not
set
* scripts : update usage text for ci-run.sh
* ggml : add IQ2 to test-backend-ops + refactoring
ggml-ci
* cuda : update supports_op for IQ2
ggml-ci
* ci : enable LLAMA_CUBLAS=1 for CUDA nodes
ggml-ci
* cuda : fix out-of-bounds-access in `mul_mat_vec_q`
ggml-ci
* tests : avoid creating RNGs for each Q tensor
ggml-ci
* tests : avoid creating RNGs for each tensor
ggml-ci
* backend : add eval callback
ggml-ci
* backend : group nodes in a single compute when user don't need them
* backend : clean-up the implementation
ggml-ci
* simple : do not perform tensor data copy if not needed
* simple : fix
* imatrix : offload to GPU support
* imatrix : fix ggml_mul_mat_id hanlding
ggml-ci
* ci : add imatrix test
ggml-ci
* ci : rearrange output
ggml-ci
* ggml : disable fast-math for Metal (cmake build only)
ggml-ci
* metal : fix Metal API debug warnings
* cmake : add -fno-inline for Metal build (#4545)
* metal : fix API debug warnings
* metal : fix compile warnings
* metal : use uint64_t for strides
* cmake : rename option to LLAMA_METAL_SHADER_DEBUG
* metal : fix mat-vec Q8_0 kernel for BS > 1
* metal : normalize mat-vec kernel signatures
* cmake : respect LLAMA_QKK_64 option
* metal : fix mat-vec Q4_K kernel for QK_K == 64
ggml-ci
* ci : add lora test
ggml-ci
* move lora summary to the top, add lora logs
ggml-ci
* ci : decrease CPU ppl runs to 2 to avoide 20 min timeout
ggml-ci
* add 7b lora test
use 1 thread for CUDA generation tests
ggml-ci
* add test with q8_0 (cpu only)
ggml-ci
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* ci : add 7B CUDA tests
ggml-ci
* ci : add Q2_K to the tests
* ci : bump CUDA ppl chunks
ggml-ci
* ci : increase CUDA TG len + add --ignore-eos
* ci : reduce CUDA ppl cunks down to 4 to save time
* ci : run ctest
ggml-ci
* ci : add open llama 3B-v2 tests
ggml-ci
* ci : disable wget progress output
ggml-ci
* ci : add open llama 3B-v2 tg tests for q4 and q5 quantizations
ggml-ci
* tests : try to fix tail free sampling test
ggml-ci
* ci : add K-quants
ggml-ci
* ci : add short perplexity tests
ggml-ci
* ci : add README.md
* ppl : add --chunks argument to limit max number of chunks
ggml-ci
* ci : update README