* adaptive-p sampler: fix zeroed orig_probs bug and refactor
- Fix bug where original probabilities were captured as zero by calculating
them from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities,
filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to follow most other samplers.
* resolve merge bug
* adaptive-p: revert reordering function definitions
* grammar : fix JSON Schema for string regex with top-level alt. (#9903)
Prior to this commit, using a JSON Schema containing a string
with `pattern` regular expression that uses top-level alternation
(e.g. `"pattern": "^A|B|C|D$"`) would result in invalid JSON
output from the constrained sampling grammar, because it
ended up creating a grammar rule like this for the string:
```
thing ::= "\"" "A" | "B" | "C" | "D" "\"" space
```
Note that this rule will only match a starting quote for the "A" case,
and will only match an ending quote for the "D" case,
so this rule will always produce invalid JSON when used for sampling
(that is, the JSON will always be lacking the starting quote,
the ending quote, or both).
This was fixed in a simple way by adding parentheses to the
generated rule (for all string pattern rules, to keep it simple),
such that the new generated rule looks like this (correct):
```
thing ::= "\"" ("A" | "B" | "C" | "D") "\"" space
```
* grammars : add English-only grammar (#10612)
* grammar : handle maxItems == 0 in JSON schema (#13117)
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
* grammar-parser : fix possible null-deref (#9004)
Fixes: https://bugs.chromium.org/p/oss-fuzz/issues/detail?id=70680
Signed-off-by: David Korczynski <david@adalogics.com>
* llama : fix typo in llama-grammar.h [no ci] (#11816)
* * server: fix "--grammar-file" parameter (#12285)
* common : use std::string_view now that we target c++17 (#14319)
* json : support `enum` values within `allOf` (#15830)
* grammar : use int64_t to avoid int overflows in int schema to grammar conversion logic (#16626)
* grammar : support array references in json schema (#16792)
* grammar : support array references in json schema
* Update json-schema-to-grammar.cpp
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
* grammar : improve regex when naming ref derived rules
* grammar : replace non-conformant definitions array with anyOf test case
---------
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts:
# tests/test-json-schema-to-grammar.cpp
* merge fix
* llama : minor grammar refactor (#10897)
* llama: fix error on bad grammar (#12628)
* grammar : fix integer overflow (#17381)
* Fix DoS / integer overflow
* Remove optional, use INT64_MAX instead as placeholder value (it's technically -1, so it fits :)
* White space
* Actually, since it's unsigned, use UINT64_MAX
# Conflicts:
# src/llama-grammar.cpp
* grammar: fix regression caused by #17381 (#17412)
* grammar: fix regression caused by #17381
* more readable
# Conflicts:
# src/llama-grammar.cpp
* Merge Fix
* Fix warnings
---------
Signed-off-by: David Korczynski <david@adalogics.com>
Co-authored-by: Joe Eli McIlvain <joe.eli.mac@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: frob <rick+github@frob.com.au>
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
Co-authored-by: DavidKorczynski <david@adalogics.com>
Co-authored-by: Daniel Bevenius <daniel.bevenius@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>
Co-authored-by: Olivier Chafik <olivier.chafik@gmail.com>
Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gmp-oss: common
* gpt-oss: attnetion sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much to high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA thta was missing in the previous commit.
There are issues with EOG tokens, so this still needs to be added.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the write place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal trottling imore often
because of making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* add dry sampler
* use vocab instead of model in dry_init function
* fix compile error for build test
---------
Co-authored-by: firecoperana <firecoperana>
* Adding top-n-sigma sampler
* Fix typos in XTC PR
* Update README.md for main and server
* More README
* More README
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>