ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-26 09:09:50 +00:00

Author	SHA1	Message	Date
Kawrakow	98b30e5e81	Faster adaptive_p sampling (#1165 ) * A hopefully more efficient adaptive_p sampling * Once at it, lets fix the formatting too * More formatting * Hopefully better * This should be better * Correctly accumulate adaptive_p sampling time * AVX2	2026-01-19 16:03:09 +02:00
Kawrakow	fa58c20c42	A hopefully more efficient adaptive_p sampling (#1161 ) * A hopefully more efficient adaptive_p sampling * Once at it, lets fix the formatting too * More formatting * Correctly accumulate sampling time for adaptive_p	2026-01-19 15:01:55 +02:00
dungquixote42	6dfbef27ec	Adaptive p: bugfix + optimization + refactor (#1155 ) * adaptive-p sampler: fix zeroed orig_probs bug and refactor - Fix bug where original probabilities were captured as zero by calculating them from logits in llama_prep_adaptive_p (new). - Replace vector with unordered_map to track candidate probabilities, filtering for relevance via logit delta (16.6f). - Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info> - Update function signatures to follow most other samplers. * resolve merge bug * adaptive-p: revert reordering function definitions	2026-01-18 08:26:06 +02:00
firecoperana	d71a3ec315	Server: refactor and rename functions (#1151 ) * Server: rename functions and refactor code rename functions refactor update slots rename params_base rename timings * change * Revert kv cache name changes * Revert 2 * fix test build error --------- Co-authored-by: firecoperana <firecoperana>	2026-01-18 08:16:57 +02:00
firecoperana	ee463b079e	Webui: add text completions and adaptive_p sampling (#1153 ) * Webui: add text completions and adaptive_p sampling * update description --------- Co-authored-by: firecoperana <firecoperana>	2026-01-17 08:37:07 +02:00
dungquixote42	52ad1c6421	Implement Adaptive-P Sampler (#1100 ) * initial implementation of adaptive-p sampler * explicitly mark candidates unsorted + cleanup qualifiers * cosmetic update * reorg prototypes * lockstep with mainline * add _impl for _init + reorg * add LLAMA_API to prototypes * update sharpness to 10 * lockstep: rng seed * delete llama_sampling member in llama_sampler_adaptive_p * fix LLAMA_API return type * lockstep: rng seed cont * actually correct implementation * lockstep: sorting behavior * const -> constexpr for known constants * add missing space * fix softmax usage in adaptive p sampler * cosmetic changes * implement do-not-sort version of softmax * simpify rng seed, add static to constexpr * refactor: remove iface + use shared rng + use actually original probabilities * adaptive-p: add dedicated rng back in * fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling * adaptive-p: fuse first softmax with transformation * adaptive-p: implement binary search selection * adaptive-p: update comment	2026-01-10 07:58:53 +02:00
hksdpc255	d7476a1b46	fix grammar for Kimi-K2 (#1103 ) * Update key-value separator and value end format * Sample grammar first if resampling --------- Co-authored-by: firecoperana <firecoperana>	2026-01-05 07:57:25 +02:00
firecoperana	1cad1ec1cc	Update grammar (#1023 ) * grammar : fix JSON Schema for string regex with top-level alt. (#9903) Prior to this commit, using a JSON Schema containing a string with `pattern` regular expression that uses top-level alternation (e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON output from the constrained sampling grammar, because it ended up creating a grammar rule like this for the string: ``` thing ::= "\"" "A" | "B" | "C" | "D" "\"" space ``` Note that this rule will only match a starting quote for the "A" case, and will only match an ending quote for the "D" case, so this rule will always produce invalid JSON when used for sampling (that is, the JSON will always be lacking the starting quote, the ending quote, or both). This was fixed in a simple way by adding parentheses to the generated rule (for all string pattern rules, to keep it simple), such that the new generated rule looks like this (correct): ``` thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space ``` * grammars : add English-only grammar (#10612) * grammar : handle maxItems == 0 in JSON schema (#13117) Co-authored-by: Richard Lyons <frob@cloudstaff.com> * grammar-parser : fix possible null-deref (#9004) Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680 Signed-off-by: David Korczynski <david@adalogics.com> * llama : fix typo in llama-grammar.h [no ci] (#11816) * * server: fix "--grammar-file" parameter (#12285) * common : use std::string_view now that we target c++17 (#14319) * json : support `enum` values within `allOf` (#15830) * grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626) * grammar : support array references in json schema (#16792) * grammar : support array references in json schema * Update json-schema-to-grammar.cpp Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> * grammar : improve regex when naming ref derived rules * grammar : replace non-conformant definitions array with anyOf test case --------- Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> # Conflicts: # tests/test-json-schema-to-grammar.cpp * merge fix * llama : minor grammar refactor (#10897) * llama: fix error on bad grammar (#12628) * grammar : fix integer overflow (#17381) * Fix DoS / integer overflow * Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :) * White space * Actually, since it's unsigned, use UINT64_MAX # Conflicts: # src/llama-grammar.cpp * grammar: fix regression caused by #17381 (#17412) * grammar: fix regression caused by #17381 * more readable # Conflicts: # src/llama-grammar.cpp * Merge Fix * Fix warnings --------- Signed-off-by: David Korczynski <david@adalogics.com> Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: frob <rick+github@frob.com.au> Co-authored-by: Richard Lyons <frob@cloudstaff.com> Co-authored-by: DavidKorczynski <david@adalogics.com> Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com> Co-authored-by: firecoperana <firecoperana> Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com> Co-authored-by: Aldehir Rojas <hello@alde.dev> Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com> Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com> Co-authored-by: Xuan-Son Nguyen <son@huggingface.co> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-11-30 18:45:38 +01:00
Kawrakow	e2f21c8dc8	Move minja and nlohmann/json to vendor (#802 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-27 09:12:35 +02:00
firecoperana	d7882c3cf8	Tool calls support from mainline (#723 ) * Tool calls support from mainline * update cmake * revert api for /completions * Fix broken thinking process for gpt-oss * add missing args and fix webui bugs * add missing args and fix webui bugs2 * Fix reasoning format error * add usage * change default post_sampling_probs to true * add back generated_text * Remove server endpoints tests * add log * Chat fixes * Remove logs * webui: revert extra handling of thinking process --------- Co-authored-by: firecoperana <firecoperana> Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-09-01 08:38:49 +03:00
g2mt	b6bc5eedad	Port speculative decoding from upstream to llama-server (#645 ) * server : integrate speculative decoding * server: Fix field names * server: fix include, whitespace * fix compile errors in speculative.cpp * add llama_sampling_sample_and_accept_n to sampling * finish porting speculative decoding in server * port functions from common/speculative, common/sampling * remove arg * fix function names * init params_dft to none * correct value for n_ctx * prefix kv cache tensors with model name to avoid conflict * fix call arguments * fix spec decoding args * correct slot.id * use n_max * port the rest of sampling funcs * fix func arguments * slot.id starts at 1? * Revert "prefix kv cache tensors with model name to avoid conflict" This reverts commit `fbd5dfd866`. * disable draft logging * disable logging in speculative.cpp in mainline, these would be LOG_DEBUG, but since ik_llama doesnt support it, logging is disabled entirely * add more draft model parameters * fix * pass flash_attn * add speculative params for parity * set speculative params in launch_slot_with_task instead	2025-08-16 07:26:44 +03:00
Kawrakow	1db6a073cb	Do not crash when there is no DRY sampler (#578 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-07-03 15:26:52 +02:00
firecoperana	d1f92e24d3	add dry sampler (#513 ) * add dry sampler * use vocab instead of model in dry_init function * fix compile error for build test --------- Co-authored-by: firecoperana <firecoperana>	2025-06-19 10:24:53 +03:00
Kawrakow	1d28b2a9a1	Adding top-n-sigma sampler (#489 ) * Adding top-n-sigma sampler * Fix typos in XTC PR * Update README.md for main and server * More README * More README --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-03 17:35:09 +03:00
Kawrakow	accf69b126	Adding the XTC sampler (#486 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-06-03 11:32:03 +03:00
Kawrakow	0ceeb11721	Merge mainline llama.cpp (#3 ) * Merging mainline - WIP * Merging mainline - WIP AVX2 and CUDA appear to work. CUDA performance seems slightly (~1-2%) lower as it is so often the case with llama.cpp/ggml after some "improvements" have been made. * Merging mainline - fix Metal * Remove check --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2024-07-27 07:55:01 +02:00
Georgi Gerganov	43b6515153	common : normalize naming style (#7462 ) * common : normalize naming style ggml-ci * common : match declaration / definition order * zig : try to fix build	2024-05-22 20:04:20 +03:00
Olivier Chafik	287fa980b8	`grammars`: fix resampling logic regression (#7424 )	2024-05-21 20:40:00 +01:00
Johannes Gäßler	70a18260b2	server: fix reported top tokens for temperature 0 (#7203 )	2024-05-11 10:11:28 +02:00
Johannes Gäßler	e56a09c3dd	server: fix incorrectly reported token probabilities (#7125 ) * server: normalize token probabilities * fix temperature == 0.0f	2024-05-07 23:07:58 +02:00
David Renshaw	4fcd38d3e2	sampling : use std::random_device{}() for default random seed (#6962 )	2024-04-29 16:35:45 +03:00
Johannes Gäßler	95878a0936	Server: fix seed for multiple slots (#6835 ) * Server: add tests for consistent results * sampling: separate rng per sampling context	2024-04-24 11:08:36 +02:00
Minsoo Cheong	67aff3f53d	sampling : deduplicated code for probability distribution access (#6240 ) * sampling: remove duplicated code for probability distribution access * free original_logits * fix original_logits allocation * fixes based on review @cebtenzzre * change function name to `llama_sampling_prepare`	2024-03-24 10:54:07 +02:00
Clint Herron	4a6d25766b	grammar : handle missing "root" node (#6004 )	2024-03-13 20:10:40 +02:00
Minsoo Cheong	e47512e1b7	speculative : implement stochastic speculative sampling (#5625 ) * (WIP) Implement stochastic speculative decoding * sample from residual distribution on draft accept failure * fix #5657: force greedy sampling with probs when temp is 0 * remove p_accept parameter * fix style * remove unused variables * add srand() in speculative.cpp * replace use of rand() with mt19937 sampling * fixes based on review (@JohannesGaessler) * fix r random generation * randomly select next sequence to verify + fix bug in memory freeing * fix bug in active_seqs sync * fix uniform int distribution initialization * remove warnings from comparison between int and size_t * check grammar in `llama_sample_probability_distribution_impl` * remove malloc code by utilizing vectors * add PR link to README	2024-03-04 20:24:00 +02:00
Pierrick Hymbert	0c39081992	server: tests - slow inference causes timeout on the CI (#5715 ) * server: tests - longer inference timeout for CI	2024-02-25 22:48:33 +01:00
Robey Holderith	fbc3ee16c2	common, server : surface min_keep as its own parameter (#5567 ) * Feature - surface min_keep as its own parameter * Updated README with min_keep param	2024-02-18 21:11:16 +02:00
Georgi Gerganov	1441b588ab	sampling : do not set min_keep to n_probs (#5564 )	2024-02-18 19:38:06 +02:00
Alexey Parfenov	14c96c2c4c	server : add "samplers" param to control the samplers order (#5494 )	2024-02-16 13:33:25 +02:00
Alexey Parfenov	f2f1af418b	common : use enums for sampler types (#5418 ) * common: use enums for sampler types * Apply suggestions from code review Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * minor : spaces --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-11 15:43:31 +02:00
Georgi Gerganov	d6f4a7c4bc	common : fix compile warning	2024-02-11 15:33:43 +02:00
Johannes Gäßler	5a19feb61f	sampling: fix top_k <= 0 (#5388 ) * sampling: fix top_k <= 0 * Update llama.cpp Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2024-02-08 09:46:30 +01:00
Michael Klimenko	fc949e58f3	Remove unused data and add fixes (#5154 ) * Remove unused data and add fixes * Add missing file * Address review comments * Replace the scope of vq allocation	2024-01-27 15:25:55 +01:00
l3utterfly	c6e551b2a3	llama : dynamic temperature sampling (#4972 ) * implemented dynamic temperature sampling from koboldcpp * removed trailing whitespace * removed unused temp parameter in llama_sample_entropy * exposed exponent_val in dynamic temp sampler * added debug check for printf statements * use nullptr in llama_sample_softmax call during llama_sample_entropy this avoids counting the time taken stats twice Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * return earlier if there is only 1 candiate (i.e. max_entropy == 0) * reformat 't' case in llama_sample_queue Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com> * check for one or zero candidates case in llama_sample_entropy --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>	2024-01-25 22:06:22 +02:00
David Friehs	af3d54c2c0	llama : apply classifier-free guidance to logits directly (#4951 )	2024-01-15 15:06:52 +02:00
Alexey Parfenov	593a2e1be5	server : allow to specify custom prompt for penalty calculation (#3727 )	2023-12-23 11:31:49 +02:00
kalomaze	57c6c251f8	grammar : check the full vocab only if necessary (opt) (#4306 ) * Check the full vocab for grammar only if necessary * Fix missing logit restoration step (?) Does this matter, actually? * Fix whitespace / formatting * Adjust comment * Didn't mean to push test gbnf * Split sampling into the helper function (?) And also revert the changes made to the header * common : fix final newline --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-12-23 11:27:07 +02:00
Georgi Gerganov	1ff71c6ab8	common : fix compile warning	2023-12-06 10:41:03 +02:00
MaggotHATE	6a3e340d1e	sampling : custom samplers order (#4285 ) * Samplers sequence order w parameter * Cleaned commented code * Fixed formatting * Rewrote with unordered_map * Revert and rewrite, too many problems and safeguards would be needed * Fixed code style * Code style fixes according to review * More readable samplers input string, fixed help * Style fix in sampler_queue * Formatting fixes * Fixing whitespaces	2023-12-05 12:05:51 +02:00
l3utterfly	5dee23d4fe	sampling : null grammar field after reset (#3885 )	2023-11-01 15:40:43 +02:00
kalomaze	1603f191ad	samplers : Min-P sampler implementation [alternative to Top P/Top K] (#3841 ) * Introduce the new Min-P sampler by @kalomaze The Min-P sampling method was designed as an alternative to Top-P, and aims to ensure a balance of quality and variety. The parameter p represents the minimum probability for a token to be considered, relative to the probability of the most likely token. * Min-P enabled and set to 0.05 default --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: cebtenzzre <cebtenzzre@gmail.com>	2023-10-31 20:44:49 +01:00
Georgi Gerganov	dd8b7789a0	llama : add option for greedy sampling with probs (#3813 ) * llama : add option for greedy sampling with probs * llama : add comment about llama_sample_token_greedy() missing probs * sampling : temp == 0.0 -> no probs, temp < 0.0 -> probs	2023-10-28 14:23:11 +03:00
Marcus Dunn	a04926db16	llama : remove token functions with `context` args in favor of `model` (#3720 ) * added `llama_model_token_` variants to all the `llama_token_` functions. * added `LLAMA_API` * formatting Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> * removed old `llama_token` functions * changed 3 more functions to take in model - `llama_token_get_text` - `llama_token_get_score` - `llama_token_get_type` * added back docs * fixed main.cpp * changed token functions to use new model variants * changed token functions to use new model variants --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-10-23 22:40:03 +03:00
Georgi Gerganov	ede7949722	sampling : refactor init to use llama_sampling_params (#3696 ) * sampling : refactor init to use llama_sampling_params * llama : combine repetition, frequency and presence penalties in 1 call * examples : remove embd-input and gptneox-wip * sampling : rename penalty params + reduce size of "prev" vector * sampling : add llama_sampling_print helper * sampling : hide prev behind API and apply #3661 ggml-ci	2023-10-20 21:07:23 +03:00
Georgi Gerganov	57dbdbdc54	speculative : add tree-based sampling example (#3624 ) * sampling : one sequence per sampling context ggml-ci * speculative : add tree-based sampling support ggml-ci * speculative : reuse the n_parallel CLI param * speculative : refactor sampling * examples : fix build after sampling refactoring ggml-ci * batched : fix n_seq_id * sampling : fix malloc ggml-ci * swift : fix build ggml-ci * swift : try to fix build ggml-ci * prompts : add assistant.txt * common : add llama_batch_add() and llama_batch_clear() helpers * speculative : minor refactor ggml-ci * minor : comments + rename ggml-ci * speculative : fix off-by-one for n_drafted * speculative : fix the n_drafted fix + p constants	2023-10-18 16:21:57 +03:00
Kerfuffle	ecd831a6b8	common : fix mirostat state when using multiple sequences (#3543 ) * Fix mirostat state when using multiple sequences * Fix mirostat by completely refactoring sampling! * Try to fix zig build. * Export function to fetch/create default sampler states Code formatting cleanups and add some comments Silence a warning about id not being used when logging is disabled * Apply some renaming suggestions. Fix comments that were out of sync with the pull. * Use more consistant naming convention for sampling contexts	2023-10-11 22:35:46 +03:00

46 Commits
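
Note on the Min-P commit (1603f191ad): its message describes the parameter p as the minimum probability for a token to be considered, relative to the probability of the most likely token. A minimal sketch of that rule follows, assuming the input is an already normalized probability vector. The name `min_p_filter` and its signature are hypothetical and are not taken from the repository's API; the real sampler works on the library's own candidate arrays and may differ in details such as `min_keep` handling.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper for illustration only (not the ik_llama.cpp API).
// Returns the indices of tokens whose probability is at least
// min_p times the probability of the most likely token.
std::vector<std::size_t> min_p_filter(const std::vector<float> & probs, float min_p) {
    // Assumption: min_p is a fraction in [0, 1].
    assert(min_p >= 0.0f && min_p <= 1.0f);

    std::vector<std::size_t> kept;
    if (probs.empty()) {
        return kept;
    }

    // The cutoff scales with the top token's probability, so the filter
    // is strict when the model is confident and permissive when it is not.
    const float max_prob  = *std::max_element(probs.begin(), probs.end());
    const float threshold = min_p * max_prob;

    for (std::size_t i = 0; i < probs.size(); ++i) {
        if (probs[i] >= threshold) {
            kept.push_back(i);
        }
    }
    return kept;
}
```

With probabilities {0.5, 0.3, 0.15, 0.05} and min_p = 0.2, the threshold is 0.1, so the first three tokens survive and the last one is dropped. The surviving probabilities would still need to be renormalized before sampling.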