ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-24 22:59:14 +00:00

Author	SHA1	Message	Date
SamuelOliveirads	d93dfb5e6b	fix: save/restore sampler state during speculative checkpoint When speculative decoding rejects draft tokens and restores the recurrent state checkpoint, the sampler (RNG, grammar, prev tokens) must also be restored to maintain consistency. Without this, the sampler state reflects the rejected draft tokens, leading to potential divergence. Uses common_sampler_clone() to snapshot the sampler before the speculative batch decode, and restores it on rejection.	2026-04-16 22:36:37 -03:00
SamuelOliveirads	d670cf85cd	server: spec checkpoints for recurrent models	2026-04-16 21:53:52 -03:00
dmaivel	4f4bcfbe67	Add --defer-experts flag to defer expert mmap residency on Linux (#1634 ) * Add --defer-experts flag to defer expert mmap residency on Linux * Disable warmup when defer-experts is enabled	2026-04-16 08:54:44 +02:00
Samuel Oliveira Alves	470d3a3b5b	Add support for parallel graphs to GLM MTP (#1637 ) * mtp: fix split graph assert * Add mtp split graph mode * remove unused ffn function for unsupported mtp * revert cuda context syncronization	2026-04-16 08:05:34 +02:00
dungquixote42	869b83bc49	Add Unicode allowlist (#1597 ) * initial commit * cleanup * fix whitelist arg parsing and simplify keyword search state * rename white* to allow* * add vocab_pieces init function, rename update functions, delete accidentally added file * delete temporary bias code * auto-generate fill function with script data inside * deduplicate allowlist unicode rule parsing * minor cleanup * delete unnecessary header * refactor allowlist to support sequential rule sets via keywords * add early exit for zero-rules case * delete accidentally added file	2026-04-10 18:22:57 +02:00
Samuel Oliveira Alves	557b674f63	Add llama_context to MTP (#1601 ) * wip: separate llama_context for MTP with graph reuse * wip: fix KV cache desync with separate MTP context * refactor: remove dead mtp logic code, encapsulate KV mirroring * mtp-context: derive args directly from the main model's context * mtp: fix kv cache positions * clean small comments * minor refactor for context shift	2026-04-09 15:33:56 +02:00
Kawrakow	9b5785ad6b	Gemma4 tokenizer fixes (#1603 )	2026-04-09 15:33:28 +02:00
Samuel Oliveira Alves	3de81530c5	Allow tuning of the best args for speculative decoding. (#1595 ) * wip: build spec tuner for spefic args * wip: test different reward system * spec-tune: fix the reward to find best params given a good TPS * spec-tune: refactor logic for its own file * minor clean for comments and modules	2026-04-08 08:02:42 +02:00
Nexes the Elder	0a6e4335f7	Little maintenance (#1579 ) * Little maintenance * llama-quantize : Add the missing items in the help * Add GGML_MAX_CONTEXTS define in the general cmakelist.txt * Make the KV cache (CPU) based warnings clearer * Correct placement of GGML_MAX_CONTEXTS definition * Revert wrong indents This reverts commit `d0728cbb6c`. * Moving the GGML_MAX_CONTEXTS definition to src/CMakeLists.txt * Update warning message for unsupported KV cache types * forgotten antislash	2026-04-08 07:58:49 +02:00
firecoperana	5e8bb724ce	server: support slot save/restore/erase for mtmd tokens and checkpoints (#1584 ) Co-authored-by: firecoperana <firecoperana>	2026-04-05 08:41:04 +02:00
Kawrakow	73742c5db9	mtmd: be able to use alternative types for the KQ multiplication (#1567 ) mtmd: allow using types other than f32 for KQ Do not cast q if kq_type is quantized * Fix formatting * More formatting	2026-04-02 08:04:05 +02:00
hksdpc255	46f9f0fb31	fix #1524 (#1543 )	2026-03-29 18:50:09 +02:00
Kawrakow	bc2c74c9db	Add --fit to llama-bench (#1542 )	2026-03-29 08:05:07 +02:00
Kawrakow	798af8676a	Correct available split modes in llama-bench (#1539 )	2026-03-28 09:28:01 +01:00
Samuel Oliveira Alves	1f3e832cb3	Improve mtp acceptance rate (#1499 ) * wip: port MTP architecture Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`. Changes include: - Updating `llama_batch` to support `mtp_params`. - Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft). - Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`). - Adapting the embedding extraction logic to skip MTP update passes. * Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model). * core: enable hybrid outputs (logits + embeddings) for MTP support * fix(mtp): correct KV-cache slot finding for updates * fix(mtp): persist hidden states to prevent context corruption during drafting * refactor(mtp): clean unused code * fix(mtp): update server to new functions name * fix(mtp): fix graph and save hidden state * mtp: refactor integration, context params and kv cache search * mtp: fix hidden state extraction and speculative acceptance flow * server: fix MTP warmup for long prompts and reset token buffer * llama: refactor MTP operation state to context parameters * server: fix n_past calculation in MTP acceptance * llama: fix mtp enable flags * speculative: refactor MTP to use common_speculative interface * context: remove unused signatures * clip: fix deprecated enum-enum conversion warning * common: fix format string crash in help message * context: fix mtp activation logic * llamat: always use the extracted embedding * llama: get all embeddings to kv cache * llama: revert logit to not run mtp for not supported arch * llama: allocate all the n_outputs for MTP * wip * server-context: get only the last embedding for hidden state * ggml-backend: fix array of bounds in debug build * server-context: run mt kv update to each prompt batch * revert segmentation fault fixes * glm-mtp(feat): optimize graph embedding and recursive drafting	2026-03-25 10:20:22 +01:00
Nexes the Elder	094f76ee86	Cleaner log for adjusted splits (#1494 ) * sweep-bench: add more skipped patterns to --minilog * cleaner log for adjusted splits * Add totalization for adjusted splits * Clean up semicolons * Addition for totalizer ^^ * Change accordingly to review * Forgotten leftover removed * 'total' instead of 'totalized'	2026-03-24 07:49:40 +01:00
firecoperana	cdf9142aa5	fix grammar stack empty error for qwen3.5 (#1490 ) * fix grammar stack empty error for qwen3.5 * Add to --help --------- Co-authored-by: firecoperana <firecoperana>	2026-03-24 07:48:20 +01:00
Kawrakow	4eb08208f2	Fix misleading quantize error message (#1493 )	2026-03-23 13:55:18 +01:00
firecoperana	0c9bc3ed28	server: support --minilog to log request message for completions/response/anthropic and response (#1477 ) Co-authored-by: firecoperana <firecoperana>	2026-03-20 16:13:43 +01:00
firecoperana	10b44eca72	server: sync anthropic api code (#1469 ) * server: sync anthropic api code * fix cc header issue --------- Co-authored-by: firecoperana <firecoperana>	2026-03-20 10:18:45 +01:00
Juk Armstrong	08f81b5afd	Fix batch calculation for image processing (#1475 )	2026-03-20 10:16:05 +01:00
Nexes the Elder	6c665f38fd	sweep-bench: add -minilog argument to reduce verbose logging (#1468 ) Purpose: Add --minilog flag to llama-sweep-bench that filters log output to show only essential GPU/layer distribution information while suppressing verbose model metadata and per-layer device assignment messages. Changes: - Add llama_selective_log_callback with blacklist approach (sweep-bench.cpp) Blacklisted patterns (hidden): - Per-layer device assignments ('Setting default device in layer') - KV metadata dump header and entries - Tensor type counts - Model validation messages - EOG/special token cache info - Metadata printout (llm_load_print_meta, print_info) - Layer sizes table - Tensor loading info (llm_load_tensors) - Separator lines - Most common cases of incomplete/continuation lines are also hidden All other log output is shown, including: - GPU VRAM info - Split/buffer distribution per device - Graph split estimates - Final benchmark table and timings	2026-03-20 09:40:56 +01:00
firecoperana	f9b7fe9749	llama: add --dry-run option (#1462 ) Co-authored-by: firecoperana <firecoperana>	2026-03-18 17:20:17 +01:00
Nexes the Elder	61fad8b094	Print timings in sweep-bench (#1454 )	2026-03-18 06:57:00 +01:00
StrikeOner	a399456c12	fix: propagate CPPHTTPLIB_OPENSSL_SUPPORT to cpp-httplib target when LLAMA_SERVER_SSL=ON (#1451 ) Without this, libcpp-httplib.a is compiled without SSL support, causing an undefined reference to httplib::SSLServer at link time even though the OpenSSL libraries are present on the link line. Fixes #1449 Co-authored-by: kerem seyhan <kerem.seyhan@codecut.de>	2026-03-17 16:39:11 +01:00
hksdpc255	fe92e30d1e	server : preserve anthropic thinking blocks in conversion (#1441 )	2026-03-16 13:59:19 +01:00
hksdpc255	18a9b4c125	fix chat parser not been used in anthropic api (#1437 )	2026-03-16 08:59:01 +01:00
hksdpc255	a655a95378	Prevent adding content that starts with 'x-anthropic-' to system_content. (#1436 )	2026-03-16 08:57:09 +01:00
dungquixote42	be2940f57a	Adaptive P sampler: update review logic, delete old code comments, put prep stage after logit bias (#1386 ) * simpler n_rewind logic, delete old comments * use more consistent names, add updt_w_cur to json schema * align comments * refactor review logic, update struct/variable names * revert cosmetic changes * check enable/disable in llama_prep_adaptive_p_impl() * delete extra whitespaces after statement * show target in debug prints * more concise debug print * delete old comments * update with loop instead of move() * comment out all adaptive p debug prints * more debug prints * move review() variables: common_sampler struct -> common_sampler_review() args * match n_unsent type * fix merge bugs, delete adaptive p references in buffer_and_check_string_ban() * restore accidental erasure * Revert "adaptive p: collect probability before logit bias" This reverts commit `1434878461`.	2026-03-14 12:34:12 +01:00
Kawrakow	633c1baa94	Enable imatrix calculation for models with fused ffn_up/gate_exps tensors (#1418 )	2026-03-13 17:57:38 +01:00
firecoperana	433531ddae	server : support multi-modal context checkpoints and prompt caching (#1398 ) * server : support multi-modal context checkpoints and prompt caching do not create checkpoint right after image processing improve mtmd check for slot ops fix context shift do not abort if template parse failed * change to debug message when detecting ban token --------- Co-authored-by: firecoperana <firecoperana>	2026-03-13 08:07:57 +01:00
SneedwareInc	525d8b8a40	Update server string+regex ban documentation (#1407 ) * Update server string/regex ban documentation * Update README.md * Update README.md	2026-03-13 07:08:38 +01:00
SneedwareInc	4a247593dc	Make string ban more robust and add regex ban (#1243 ) * Test new ctx_sampling->n_rewind system * CRLF quickfix * Adaptive p check * merge banned_n * Fix attempt 1 * Fix attempt 2	2026-03-11 15:30:27 +01:00
firecoperana	ab1d74074b	common : introduce composable PEG parser combinators for chat parsing and new jinja template engine (#1369 ) --------- Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com> common : add nemotron 3 parsing (#18077) common : add parser for ministral/mistral large 3/devstral 2 (#17713) common : default content to an empty string (#18485) chat: make tool description and parameters optional per OpenAI spec (#18478) Per the OpenAI API specification, both 'description' and 'parameters' fields in tool function definitions are optional. Previously, the parser would throw an exception if these fields were missing. Attempts to fix #17667 common : implement new jinja template engine (#18462) --------- Co-authored-by: Alde Rojas <hello@alde.dev> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> jinja: correct member access rule (#18905) jinja : fix lexing of float literals with sign (#18901) jinja : add missing tojson filter for bool (#18900) jinja : attribute support for join, map and sort (#18883) jinja : fix object item order (and properly implement dictsort) (#18904) tests : add test-jinja -py option for cross-checking (#18906) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> ci : run test-jinja -py on high perf [no ci] (#18916) jinja : fix undefined keys and attributes and int/float as bool (#18924) jinja: support none\|string (#18995) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> jinja : implement mixed type object keys (#18955) --------- Co-authored-by: Xuan Son Nguyen <son@huggingface.co> jinja : undefined should be treated as sequence/iterable (return string/array) by filters/tests (#19147) `tojson` is not a supported `undefined` filter keep it DRY and fix some types jinja : do not pass empty tools and add some none filters (#19176) jinja : add unordered_map include to value.h [no ci] (#19205) jinja : add missing 'in' test to template engine (#19004) (#19239) The jinja template parser was missing the 'in' test from global_builtins(), causing templates using reject("in", ...), select("in", ...), or 'x is in(y)' to fail with "selectattr: unknown test 'in'". This broke tool-calling for Qwen3-Coder and any other model whose chat template uses the 'in' test. Added test_is_in supporting array, string, and object containment checks, mirroring the existing 'in' operator logic in runtime.cpp. Includes test cases for all three containment types plus reject/select filter usage. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com> --------- Co-authored-by: Sid Mohan <sidmohan0@users.noreply.github.com> Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com> Co-authored-by: Xuan Son Nguyen <son@huggingface.co> Add Jinja support for "indent" string filter (#19529) Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> add vendor refactor chat server : support preserving reasoning_content in assistant message (#18994) chat : fix translategemma crash on common_chat_format_example (#19019) chat: fix language input for translategemma (#19052) Co-authored-by: Aldehir Rojas <hello@alde.dev> --------- Co-authored-by: Aldehir Rojas <hello@alde.dev> chat: fix case where template accepts type content only (#19419) mtmd : chat : Fix extra \n between text and media marker (#19595) Thanks to @tugot17 for detecting and reporting the issue. For vision models (e.g. LFM2.5-VL-1.6B and Qwen/Qwen3-VL-4B-Instruct) `llama-mtmd-cli` produces identical output to HF implementation. However `llama-server` doesn't. I traced it down to extra newline inserted after `<__media__>`. This happens in `to_json_oaicompat`, that treats media markers as text and joins all parts with `\n` separator. PR introduces new type `media_marker` and uses it for media markers. Extra logic is added to prevent insertion of newlines before and after media markers. With this change number of input tokens is identical to HF implementation and as a result the output is also identical. I explored other ways to address the issue * remove completely `\n` between text parts in `to_json_oaicompat` * merge text messages in server-common.cpp before sending them to `to_json_oaicompat` Please propose alternative ways of fixing this issue. Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> --------- Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> common : merge qwen3-coder and nemotron nano 3 parsers (#19765) common : fix improper trimming in XML parser on complete message (#19805) Co-authored-by: Jules LEIDELINGER <11395311+julio75012@users.noreply.github.com> jinja: correct stats for tojson and string filters (#19785) jinja : correct default size for string slices (#19913) common : handle unicode during partial json parsing (#16526) common : fix json schema with '\' in literals (#17307) add back qwen_coder_xml and mirothinker Co-authored-by: Aldehir Rojas <hello@alde.dev>	2026-03-09 11:03:33 +01:00
dungquixote42	a903409a5e	fix adaptive p sampler rewinding too far back (#1359 ) * fix adaptive p sampler rewinding too far back * update comments * correct default value for total_weight, more comments * new variables/names * update comment for n_rewind * move null pointer check back to common_sampler_review() * refactor weighted_sum and total_weight to vector<pair>, better boundary check in llama_review_adaptive_p_impl()	2026-03-04 13:26:25 +01:00
Kawrakow	fd16a418de	Fix clang warnings on macOS (#1354 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2026-03-03 16:27:16 +01:00
Kawrakow	505e2c57f9	Reduce memory use when processing large images (#1349 )	2026-03-02 17:54:56 +01:00
Nexes the Elder	d4ac5f1566	gguf-split: fix the split output files naming (#1336 ) * Fix gguf-split.cpp splits output naming With this fix, the initial extension of the source .gguf file is not included in the naming of the output file before the numeration of the splits. ex: No more model.gguf-00001-of-00200.gguf Instead, model-00001-of-00200.gguf * increase ggml_max_context to 2048 * Revert GGML_MAX_CONTEXTS to 64	2026-03-02 08:43:47 +01:00
Kawrakow	d239dabcc6	Graph parallel for Qwen-3.5-MoE (#1347 ) * Graph parallel for Qwen3.5-MoE * Add --max-gpu to llama-bench * Fix graph reuse when not all GPUs participate in self-attention	2026-03-02 07:48:43 +01:00
firecoperana	8f9e19d57c	server: add checkpoint tolerance and fix grammar_trigger init (#1346 ) Co-authored-by: firecoperana <firecoperana>	2026-03-02 07:45:32 +01:00
Kawrakow	04c140fe54	Make vision woork with Qwen-3.5 models (#1345 )	2026-03-01 17:44:37 +01:00
Kawrakow	0ff3a43289	Bring back #1333 and #1335 (#1340 ) * Bring back fused delta net 3 * Remove autoregressive and chunking	2026-02-28 14:31:42 +01:00
Kawrakow	1922449b2c	Revert delta net 3 (#1339 ) * Revert "Simplify delta-net (#1335)" This reverts commit `e5fc30244c`. * Revert "Fused delta net 3 (#1333)" This reverts commit `7b68353e09`.	2026-02-28 13:12:08 +01:00
Kawrakow	e5fc30244c	Simplify delta-net (#1335 ) * Simplify delta-net * Minor * Minor	2026-02-28 11:12:19 +01:00
Kawrakow	7b68353e09	Fused delta net 3 (#1333 ) * This is better than chunked * Keep the state in registers * Cleanup * Remove unused stuff * Minor * Make fused delta-net the default * Fix race	2026-02-27 15:02:56 +01:00
firecoperana	3fac78c48b	server: enable checkpoint for recurrent models (#1310 ) * server: enable checkpoint for recurrent models create checkpoint after cancel fix ban string and rm context during rewind add checkpoint interval only save recurrent cache * save checkpoint during pp --------- Co-authored-by: firecoperana <firecoperana>	2026-02-26 06:51:18 +01:00
Kawrakow	c77ec4b8b8	Fused delta-net (#1315 ) * Revive fused delta-net * Add command line argument for fused delta net * Simplify/improve CUDA delta-net * Add -fdn to llama-bench * More CUDA fused delta net optimizations * CPU optimizations * Much faster fused delta-net on the CPU It seems it is faster than the chunked implementation! * Change meaning of fdn from bool flag to threshold value * Use eps = 1e-6 * Give some nodes a name	2026-02-25 14:12:48 +01:00
Nexes the Elder	170467e835	Llama-quantize: Partial requant feature (#1313 ) * Partial Requant feature for llama-quantize - Inspired by the recently portcopied --dry-run feature. - Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory. - Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split). - Vibe coded. * Create output directory if it doesn't exist in llama-quantize * Create output directory if it doesn't exist in gguf-split * Add exit when directory fails to be created on Windows * Use std::filesystem * cleanup	2026-02-25 07:25:15 +01:00
Joshua Jolley	68431b049a	server: propagate task index to response objects for batch requests (#1303 ) When multiple prompts are sent in a single /v1/completions request, each response needs to carry the correct index so the client can match results to their corresponding prompts. The index field was not being set on partial responses, final responses, or embedding responses, causing batch results to all report index 0. Set res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding. Generated with [Devin](https://cli.devin.ai/docs) Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com> Co-authored-by: Devin <noreply@cognition.ai>	2026-02-24 15:39:38 +01:00
Kawrakow	cfb6747776	llama-quantize: --dry-run option (#1309 )	2026-02-24 15:21:52 +01:00

1178 Commits