Files
ik_llama.cpp/common/peg-parser.cpp
firecoperana e0596bf614 Autoparser - complete refactoring of parser architecture (#1376)
* Autoparser - complete refactoring of parser architecture

Autoparser: add optional argument reshuffle capability

Autoparser: True streaming (#20177)

* Relax atomicity constraint for nicer, more pleasent, True Streaming parsing

* Whitespace

* Remove redundant atomics

Revert to OAI-compatible args (#20213)

* Revert to OAI-compatible args

* Apply workaround::func_args_not_string

Fix structured outputs (#20223)

* Fix structured outputs

* Update common/chat-auto-parser-generator.cpp

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Aldehir Rojas <hello@alde.dev>

Fix compile bug (#20203)

* Fix compile bug

* Update common/chat-auto-parser-helpers.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
# Conflicts:
#	common/chat-auto-parser-helpers.cpp

common : gracefully handle incomplete output (#20191)

* common : handle incomplete UTF-8 at end of input in PEG parser

* cont : if reached end prematurely, emit needs_more_input to propagate partial output

* cont: refactor peg parse context to add lenient flag

* cont : remove partial flag, keep lenient flag

PEG parser for LFM2 (#20251)

* PEG parser for LFM2

* Simplify using python_value()

common: map developer role to system (#20215)

* Map developer role to system
* Simplify

common: consolidate PEG string parsers (#20263)

* common : consolidate PEG string parsers
* cont : fix json_string_content()

examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968)

* Fix logic for retrieving schema items in `json_schema_to_grammar.py`

If `schema['items']` is `{}` and `prefixItems not in schema', as `{}` is Falsy, the original code here will raise an error.

I think if `schema['items']` is `{}`, them items should just be `{}`

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tests for arrays with empty items

Add two unit tests to `tests/test-json-schema-to-grammar.cpp` that validate handling of arrays when 'items' is an empty schema and when 'prefixItems' is present alongside an empty 'items'. Both tests expect the same generated grammar, ensuring the JSON Schema->grammar conversion treats an empty 'items' schema (and the presence of 'prefixItems') correctly and covering this edge case.

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Reduce level of content parser warning message to avoid log spam on non-debug verbosity (#20347)

do not return if template parse failed

add arg to enable parallel tool call

common : fix incorrect uses of stoul (#20313)
# Conflicts:
#	common/arg.cpp
#	src/llama-grammar.cpp

examples : fix empty items in json_schema_to_grammar.py [no ci] (#19968)

* Fix logic for retrieving schema items in `json_schema_to_grammar.py`

If `schema['items']` is `{}` and `prefixItems not in schema', as `{}` is Falsy, the original code here will raise an error.

I think if `schema['items']` is `{}`, them items should just be `{}`

* Apply suggestion from @CISC

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add tests for arrays with empty items

Add two unit tests to `tests/test-json-schema-to-grammar.cpp` that validate handling of arrays when 'items' is an empty schema and when 'prefixItems' is present alongside an empty 'items'. Both tests expect the same generated grammar, ensuring the JSON Schema->grammar conversion treats an empty 'items' schema (and the presence of 'prefixItems') correctly and covering this edge case.

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

Add support for MiroThinker with new jinja template

common/parser: handle reasoning budget (#20297)

* v1

* Finished!

* Handlie cli

* Reasoning sampler

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Less explosive terminology :)

* Add utf-8 case and tests

* common : migrate reasoning budget sampler to common

* cont : clean up

* cont : expose state and allow passing as initial state

* cont : remove unused imports

* cont : update state machine doc string

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Alde Rojas <hello@alde.dev>

common/parser: use nlohmann::ordered_json to preserve parameter order (#20385)

common/parser: add GigaChatV3/3.1 models support (#19931)

Co-authored-by: Mishusha <pmv26021975@gmail.com>

common/parser: gracefully handle undetected tool parser, print error message. (#20286)

fix: prevent nullptr dereference (#20552)

common : fix iterator::end() dereference (#20445)
# Conflicts:
#	common/regex-partial.cpp

jinja : add capability check for object args (#20612)

common/parser: add `--skip-chat-parsing` to force a pure content parser. (#20289)

* Add `--force-pure-content` to force a pure content parser.

* Update common/arg.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

common : rework gpt-oss parser (#20393)

* common : rework gpt-oss parser

* cont : fix gpt-oss tests

* cont : add structured output test

* cont : rename final to final_msg

common : fix gpt-oss content removal (#20745)

common/parser: add proper reasoning tag prefill reading (#20424)

* Implement proper prefill extraction

* Refactor cli parameters, update docs, move reasoning budget sampler part to common/reasoning-budget.cpp

* Update tools/server/server-task.cpp

* refactor: move grammars to variant, remove grammar_external, handle exception internally

* Make code less C++y

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

chat : handle tool calls with no required args in TAG_WITH_TAGGED format (#20764)

* chat : handle tool calls with no required args in TAG_WITH_TAGGED format

* Update tests/test-chat.cpp [no ci]

Co-authored-by: Aldehir Rojas <hello@alde.dev>

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: Aldehir Rojas <hello@alde.dev>

common/parser : fix out_of_range crash in throw path (#20424 regression) (#20777)

* chat : fix out_of_range crash in throw path (#20424 regression)

#20424 introduced effective_input = generation_prompt + input, but the
throw path uses input.substr(result.end) where result.end is a position
within effective_input. Every thinking model with a non-empty
generation_prompt crashes with std::out_of_range instead of the intended
error message.

Test crashes on unpatched master, passes with fix:

  cmake -B build -DLLAMA_BUILD_TESTS=ON -DLLAMA_BUILD_TOOLS=OFF
  cmake --build build --target test-chat
  ./build/bin/test-chat

* Update test-chat.cpp

* Update test-chat.cpp

* Update test-chat.cpp

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>

jinja : fix heap OOB read in value equality comparison (#20782)

Address GHSA-q9j6-4hhc-rq9p and GHSA-2q4c-9gq5-5vfp.

The three-iterator overload of std::equal in value_array_t::equivalent()
and value_object_t::equivalent() reads past the end of the shorter
container when comparing arrays or objects of different lengths.

Use the four-iterator overload (C++14) which checks both range lengths.

Found-by: Pwno

common : fix typo in debug log ('extracft' -> 'extract') (#20807)

common/parser: fix nasty bug causing subtle corruption of generation prompt (#20825)

jinja : refactor token advancement (#20864)

* refactor token advancement

* exercise sub-expressions

common/autoparser : detect reasoning markers when enable_thinking changes system prompt (#20859)

common : replace wrap_for_generation with a prefix convenience function and fix gpt-oss (#20912)

jinja: fix macro with kwargs (#20960)

* jinja: fix macro with kwargs

* Apply suggestions from code review

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* fix newline problem

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

common : inhibit lazy grammar sampler while reasoning is active (#20970)

* common : inhibit grammar while reasoning budget is active

* cont : update force_pos in accept

* cont : fix tests

* cont : tweak should apply logic

* cont : return early not using grammar sampler

* Add tests

* cont : prevent backend sampling when reasoning budget enabled

* cont : fix typo

---------

Co-authored-by: Piotr Wilkin <piotr.wilkin@syndatis.com>
# Conflicts:
#	common/reasoning-budget.h
#	common/sampling.cpp
#	tools/cli/cli.cpp
#	tools/server/server-common.cpp
#	tools/server/server-task.cpp

common/parser: fix reasoning whitespace bugs + extra parser tests (#21085)

* fix whitespace reasoning issues + add reconstruction tests

* Proper fix

* fix Nemotron autoparser test expectations to include newline in marker

common : add reasoning_format = none support to gpt-oss (#21094)

common/json-schema: fix: handle non-capturing groups (?:...) in JSON schema pattern converter (#21124)

The regex-to-grammar converter in _visit_pattern() crashes with SIGSEGV
when a JSON schema "pattern" field contains a non-capturing group (?:...).

Root cause: when the parser sees '(' followed by '?', it pushes a warning
but does not advance past '?:'. The recursive transform() call then
interprets '?' as a quantifier and calls seq.back() on an empty vector,
causing undefined behavior.

This commonly occurs when serving OpenAI-compatible tool calls from
clients that include complex regex patterns in their JSON schemas (e.g.,
date validation patterns like ^(?:(?:\d\d[2468][048]|...)-02-29|...)$).

The fix:
- Skip '?:' after '(' to treat non-capturing groups as regular groups
- For unsupported syntax (?=, ?!, etc.), skip to matching ')' safely,
  handling escaped characters to avoid miscounting parenthesis depth
- Adjust the ')' unbalanced-parentheses check using direct char
  comparisons instead of substr
- Add test cases for non-capturing groups (C++ only, as the JS/Python
  implementations do not yet support this syntax)

common/parser: fix handling of tool definition with missing properties key (#21128)

jinja : handle empty expressions correctly (#20913)

* Reject empty computed member expressions before returning slices[0] from parse_member_expression_arguments().

* Treat empty computed member expressions with Jinja2 undefined semantics

Treat empty computed member expressions like `a[]` as undefined instead of
raising a parser error, to match Jinja2 behavior.

- return a noop expression for empty computed member arguments
- return undefined when a computed member key evaluates to undefined
- add Jinja tests covering `a[]|default('fallback')` and `a[] is undefined`

* Handle undefined computed member properties

Move undefined-property handling to the common member access path, and add a test covering `a[undefined] is undefined`.

* Use default undefined value in member access

Initialize val and then return it when property is undefined.

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* empty statement parses to blank_expression instead of noop_statement

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

common : gpt-oss handle builtin and unsolicited tool calls (#21213)

fix: tool call parsing for LFM2 and LFM2.5 models (#21242)

* fix: tool call parsing for LFM2 and LFM2.5 models'

* refactor: add test / break out lfm2 and lfm2.5 parsing logic
# Conflicts:
#	common/chat.cpp

Relax prefill parser to allow space. (#21240)

* Relax prefill parser to allow space.

* Move changes from prefix() to parser generation

* Only allow spaces if we're not having a pure content parser next

common : add commentary rules for gpt-oss-20b (#21286)

add reasoning budget

model, mtmd: fix gguf conversion for audio/vision mmproj (#21309)

* fix gguf conversion for audio/vision mmproj

* fix test
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/eval-callback/eval-callback.cpp
#	examples/mtmd/CMakeLists.txt
#	examples/mtmd/clip-impl.h
#	examples/mtmd/mtmd.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	gguf-py/gguf/tensor_mapping.py
#	src/CMakeLists.txt
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h
#	src/llama-vocab.cpp
#	src/models/models.h
#	tests/test-llama-archs.cpp
#	tools/mtmd/clip-graph.h
#	tools/mtmd/clip-model.h
#	tools/mtmd/clip.cpp
#	tools/mtmd/models/models.h

fix: gemma 4 template (#21326)

chat : avoid including json in chat.h (#21306)

jinja: coerce input for string-specific filters (#21370)

common : fix tool call type detection for nullable and enum schemas (#21327)

* common : fix tool call type detection for nullable and enum schemas

* common, tests : fix grammar delegation for nullable/enum schemas and add tests

Fix enum type inference to scan all enum values (not just index 0) so
schemas like {"enum": [0, "celsius"]} correctly detect string type.

Fix schema_delegates in peg-parser to handle nullable type arrays
(["string", "null"]) and typeless enum schemas in raw mode, allowing
the tagged parser to use raw text instead of JSON-formatted strings.

Add test cases for Qwen3-Coder (TAG_WITH_TAGGED format):
- nullable string ["string", "null"]
- nullable string with null first ["null", "string"]
- nullable integer ["integer", "null"]
- enum without explicit type key

common/parser: fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers (#21230)

* Fix call ID detection (Mistral parser mostly) + atomicity for tag-json parsers

* Rename

* Update common/chat-auto-parser-generator.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

common : add gemma 4 specialized parser (#21418)

* common : add gemma4 dedicated parser

* cont : add '<|tool_response>' as eog

* cont : emit JSON from Gemma4 tool call AST

* cont : more fixes

* cont : refactor convert function

* cont : refine rules and mapping

* cont : add more tests

* cont : clean up

* cont : remove autoparser gemma4 implementation

* cont : more cleanup

* cont : rename gemma4.jinja to match the others

* cont : add custom template to support interleaved thinking

* cont : preserve reasoning in model turns

* cont : fix initializer error

* cont : fix unused vars

* cont : fix accidental static

* cont : fix specialized_template signature

* fix extra semicolon

* remove debug line and extra space [no ci]

fix reasoning budget

parser: fix MiniMax handling (#21573)

jinja : support ensure_ascii=true, string repetition and int/float self-filtering (#21623)

* feat: jinja engine improvements for reka-edge

Port three Jinja engine improvements needed for the reka-edge model:
1. Python-style string repetition ("ab" * 3 → "ababab")
2. ensure_ascii=true support for tojson filter (escapes non-ASCII to \uXXXX)
3. int() builtin on value_int_t (identity, needed for Reka Edge template)

* fix: escape invalid utf8 bytes when ensure_ascii=true

The json_ensure_ascii_preserving_format function does not correctly
handle an edge case where if UTF-8 parsing fails, it adds the non-ascii
character back to the output as a raw byte.

This commit fixes that by adding the unicode standard replacement
character \\ufffd to the output instead. This is the standard behavior
for various programming languages like Python, Rust, Go, etc.

* chore: address PR comments

1. Add todo comment for supporting string repetition for array/tuples
2. Add support for float identity operation
3. Move invalid ascii test case to test_fuzzing

* chore: accept suggestion for common/jinja/value.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

common : simplify autoparser tagged parser rules (#21216)

* common : simplify autoparser tagged parser rules

* cont : remove upper limit on optional args

* cont : revert changes to parsing at the end

* cont : undo arbitrary ordering of optional args

* cont : fix uninitialized required parameters

* revert to simplify merge

* re-apply patches

* restore flexible optional arg ordering tests

common : fix ambiguous grammar rule in gemma4 (#21661)

* common : fix ambiguous grammar rule in gemma4

* cont : fix missing comma...

common : enable reasoning budget sampler for gemma4 (#21697)

* fix: enable reasoning budget sampler for gemma4

Add thinking_start_tag and thinking_end_tag to
common_chat_params_init_gemma4(). Without these, the reasoning
budget sampler never activates for gemma4.

Make the newline after "thought" optional in the PEG parser to
handle budget=0 (sampler forces end tag before the newline).

Add test case for empty thinking block.

Fixes #21487

* use p.space() instead of p.optional(p.literal("\n")) in gemma4 thought parser

common : better align to the updated official gemma4 template (#21704)

fix: Fix broken structured output when using $refs in json_schema (#21699)

chat: dedicated DeepSeek v3.2 parser + "official" template (#21785)

Hide render_message_to_json warning

common/gemma4 : handle parsing edge cases (#21760)

common: skip reasoning budget sampler when no budget is requested (#21870)

* common: skip reasoning budget sampler when no budget is requested

After I added thinking_start_tag / thinking_end_tag for gemma4 in #21697, the reasoning budget sampler gets unconditionally created even when no budget is configured (the default -1). The same applies to kimi_k2, lfm2, lfm2_5, and ministral_3 which also set these tags. The budget gets converted to INT_MAX, so the sampler never actually forces any tokens but still runs per-token checks (start tag matching in IDLE state, token-to-piece conversion + UTF-8 checks in COUNTING state).

More importantly, the mere existence of the sampler (non-null rbudget) disables backend sampling. Backend sampling lets the GPU select tokens directly, avoiding a full logits transfer from GPU to CPU every token. This could explain the 30% speed regression reported in #21784 (98 t/s to 70 t/s on Vulkan).

So I added a reasoning_budget_tokens >= 0 check to the sampler creation condition. When the budget is unlimited, the sampler is not created, backend sampling stays enabled, and no per-token overhead is added. When a budget is explicitly set (0, 128, 1024, etc.), the sampler is created and works as before.

* common: preserve rbudget when grammar is lazy

Following up on the review feedback on #21870: keep the reasoning budget sampler when grammar_lazy is true, so the thinking-block grammar suppression from #20970 still works when tools are in use. This way, we only skip the sampler when both no budget is set AND grammar is not lazy.

autoparser: support case of JSON_NATIVE with per-call markers (test case: Reka-Edge) (#21892)

* fix grammar

* fix add sampled token

---------

Co-authored-by: Piotr Wilkin (ilintar) <piotr.wilkin@syndatis.com>
Co-authored-by: firecoperana <firecoperana>
2026-04-22 10:04:13 +02:00

2129 lines
80 KiB
C++

#include "peg-parser.h"
#include "common.h"
#include "json-schema-to-grammar.h"
#include "log.h"
#include "unicode.h"
#include <algorithm>
#include <initializer_list>
#include <map>
#include <memory>
#include <nlohmann/json.hpp>
#include <regex>
#include <stdexcept>
#include <unordered_set>
// Trick to catch missing branches
template <typename T>
inline constexpr bool is_always_false_v = false;
const char * common_peg_parse_result_type_name(common_peg_parse_result_type type) {
switch (type) {
case COMMON_PEG_PARSE_RESULT_FAIL: return "fail";
case COMMON_PEG_PARSE_RESULT_SUCCESS: return "success";
case COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT: return "need_more_input";
default: return "unknown";
}
}
static bool is_hex_digit(const char c) {
return (c >= '0' && c <= '9') || (c >= 'a' && c <= 'f') || (c >= 'A' && c <= 'F');
}
// Trie for matching multiple literals.
// This is used in common_peg_until_parser and to build a GBNF exclusion grammar
struct trie {
struct node {
std::map<uint32_t, size_t> children; // Use uint32_t to store Unicode codepoints
bool is_word;
};
std::vector<node> nodes;
trie(const std::vector<std::string> & words) {
create_node(); // root node
for (const auto & w : words) {
insert(w);
}
}
enum match_result { NO_MATCH, PARTIAL_MATCH, COMPLETE_MATCH };
// Check if a delimiter starts at the given position
match_result check_at(std::string_view sv, size_t start_pos) const {
size_t current = 0; // Start at root
size_t pos = start_pos;
// LOG_DBG("%s: checking at pos %zu, sv='%s'\n", __func__, start_pos, std::string(sv).c_str());
while (pos < sv.size()) {
auto result = common_parse_utf8_codepoint(sv, pos);
if (result.status != utf8_parse_result::SUCCESS) {
break;
}
auto it = nodes[current].children.find(result.codepoint);
if (it == nodes[current].children.end()) {
// Can't continue matching
return match_result{match_result::NO_MATCH};
}
current = it->second;
pos += result.bytes_consumed;
// Check if we've matched a complete word
if (nodes[current].is_word) {
return match_result{match_result::COMPLETE_MATCH};
}
}
// Reached end of input while still in the trie (not at root)
if (current != 0) {
// We're in the middle of a potential match
return match_result{match_result::PARTIAL_MATCH};
}
// Reached end at root (no match)
return match_result{match_result::NO_MATCH};
}
struct prefix_and_next {
std::vector<uint32_t> prefix;
std::vector<uint32_t> next_chars;
};
std::vector<prefix_and_next> collect_prefix_and_next() {
std::vector<uint32_t> prefix;
std::vector<prefix_and_next> result;
collect_prefix_and_next(0, prefix, result);
return result;
}
private:
void collect_prefix_and_next(size_t index, std::vector<uint32_t> & prefix, std::vector<prefix_and_next> & out) {
if (!nodes[index].is_word) {
if (!nodes[index].children.empty()) {
std::vector<uint32_t> chars;
chars.reserve(nodes[index].children.size());
for (const auto & p : nodes[index].children) {
chars.push_back(p.first);
}
out.emplace_back(prefix_and_next{prefix, chars});
}
}
for (const auto & p : nodes[index].children) {
uint32_t ch = p.first;
auto child = p.second;
prefix.push_back(ch);
collect_prefix_and_next(child, prefix, out);
prefix.pop_back();
}
}
size_t create_node() {
size_t index = nodes.size();
nodes.emplace_back();
return index;
}
void insert(const std::string & word) {
size_t current = 0;
size_t pos = 0;
while (pos < word.length()) {
auto result = common_parse_utf8_codepoint(word, pos);
if (result.status != utf8_parse_result::SUCCESS) {
break;
}
uint32_t ch = result.codepoint;
pos += result.bytes_consumed;
auto it = nodes[current].children.find(ch);
if (it == nodes[current].children.end()) {
size_t child = create_node();
nodes[current].children[ch] = child;
current = child;
} else {
current = it->second;
}
}
nodes[current].is_word = true;
}
};
static std::pair<uint32_t, size_t> parse_hex_escape(const std::string & str, size_t pos, int hex_count) {
if (pos + hex_count > str.length()) {
return {0, 0};
}
uint32_t value = 0;
for (int i = 0; i < hex_count; i++) {
char c = str[pos + i];
if (!is_hex_digit(c)) {
return {0, 0};
}
value <<= 4;
if ('a' <= c && c <= 'f') {
value += c - 'a' + 10;
} else if ('A' <= c && c <= 'F') {
value += c - 'A' + 10;
} else if ('0' <= c && c <= '9') {
value += c - '0';
} else {
break;
}
}
return {value, static_cast<size_t>(hex_count)};
}
static std::pair<uint32_t, size_t> parse_char_class_char(const std::string & content, size_t pos) {
if (content[pos] == '\\' && pos + 1 < content.length()) {
switch (content[pos + 1]) {
case 'x': {
auto result = parse_hex_escape(content, pos + 2, 2);
if (result.second > 0) {
return {result.first, 2 + result.second};
}
// Invalid escape, treat as literal 'x'
return {static_cast<uint32_t>('x'), 2};
}
case 'u': {
auto result = parse_hex_escape(content, pos + 2, 4);
if (result.second > 0) {
return {result.first, 2 + result.second};
}
// Invalid escape, treat as literal 'u'
return {static_cast<uint32_t>('u'), 2};
}
case 'U': {
auto result = parse_hex_escape(content, pos + 2, 8);
if (result.second > 0) {
return {result.first, 2 + result.second};
}
// Invalid escape, treat as literal 'U'
return {static_cast<uint32_t>('U'), 2};
}
case 'n': return {'\n', 2};
case 't': return {'\t', 2};
case 'r': return {'\r', 2};
case '\\': return {'\\', 2};
case ']': return {']', 2};
case '[': return {'[', 2};
default: return {static_cast<uint32_t>(content[pos + 1]), 2};
}
}
// Regular character - return as codepoint
return {static_cast<uint32_t>(static_cast<unsigned char>(content[pos])), 1};
}
static std::pair<std::vector<common_peg_chars_parser::char_range>, bool> parse_char_classes(const std::string & classes) {
std::vector<common_peg_chars_parser::char_range> ranges;
bool negated = false;
std::string content = classes;
if (content.front() == '[') {
content = content.substr(1);
}
if (content.back() == ']') {
content.pop_back();
}
// Check for negation
if (!content.empty() && content.front() == '^') {
negated = true;
content = content.substr(1);
}
size_t i = 0;
while (i < content.length()) {
auto [start, start_len] = parse_char_class_char(content, i);
i += start_len;
if (i + 1 < content.length() && content[i] == '-') {
// Range detected
auto [end, end_len] = parse_char_class_char(content, i + 1);
ranges.push_back(common_peg_chars_parser::char_range{start, end});
i += 1 + end_len;
} else {
ranges.push_back(common_peg_chars_parser::char_range{start, start});
}
}
return {ranges, negated};
}
common_peg_ast_id common_peg_ast_arena::find_by_tag(const common_peg_ast_node & parent, const std::string & tag, int max_depth) const {
for (auto child_id : parent.children) {
const auto & child = get(child_id);
if (child.tag == tag) {
return child_id;
}
if (max_depth > 1) {
auto result = find_by_tag(child, tag, max_depth - 1);
if (result != COMMON_PEG_INVALID_AST_ID) {
return result;
}
}
}
return COMMON_PEG_INVALID_AST_ID;
}
common_peg_ast_id common_peg_ast_arena::find_by_rule(const common_peg_ast_node & parent, const std::string & rule, int max_depth) const {
for (auto child_id : parent.children) {
const auto & child = get(child_id);
if (child.rule == rule) {
return child_id;
}
if (max_depth > 1) {
auto result = find_by_rule(child, rule, max_depth - 1);
if (result != COMMON_PEG_INVALID_AST_ID) {
return result;
}
}
}
return COMMON_PEG_INVALID_AST_ID;
}
void common_peg_ast_arena::visit(common_peg_ast_id id, const common_peg_ast_visitor & visitor) const {
if (id == COMMON_PEG_INVALID_AST_ID) {
return;
}
const auto & node = get(id);
visitor(node);
for (const auto & child : node.children) {
visit(child, visitor);
}
}
void common_peg_ast_arena::visit(const common_peg_parse_result & result, const common_peg_ast_visitor & visitor) const {
for (const auto & node : result.nodes) {
visit(node, visitor);
}
}
struct parser_executor;
common_peg_parser_id common_peg_arena::add_parser(common_peg_parser_variant parser) {
common_peg_parser_id id = parsers_.size();
parsers_.push_back(std::move(parser));
return id;
}
void common_peg_arena::add_rule(const std::string & name, common_peg_parser_id id) {
rules_[name] = id;
}
common_peg_parser_id common_peg_arena::get_rule(const std::string & name) const {
auto it = rules_.find(name);
if (it == rules_.end()) {
throw std::runtime_error("Rule not found: " + name);
}
return it->second;
}
struct parser_executor {
const common_peg_arena & arena;
common_peg_parse_context & ctx;
size_t start_pos;
parser_executor(const common_peg_arena & arena, common_peg_parse_context & ctx, size_t start)
: arena(arena), ctx(ctx), start_pos(start) {}
std::string debug_indent() const { return std::string(ctx.parse_depth * 2, ' '); }
std::string debug_input_snippet(size_t pos, size_t len = 60) const {
if (pos >= ctx.input.size()) {
return "<EOF>";
}
auto snippet = ctx.input.substr(pos, len);
// Escape newlines for display
std::string result;
for (char c : snippet) {
if (c == '\n') {
result += "\\n";
} else if (c == '\r') {
result += "\\r";
} else if (c == '\t') {
result += "\\t";
} else {
result += c;
}
}
if (pos + len < ctx.input.size()) {
result += "...";
}
return result;
}
common_peg_parse_result operator()(const common_peg_epsilon_parser & /* p */) const {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos);
}
common_peg_parse_result operator()(const common_peg_start_parser & /* p */) const {
return common_peg_parse_result(
start_pos == 0 ? COMMON_PEG_PARSE_RESULT_SUCCESS : COMMON_PEG_PARSE_RESULT_FAIL,
start_pos
);
}
common_peg_parse_result operator()(const common_peg_end_parser & /* p */) const {
return common_peg_parse_result(
start_pos >= ctx.input.size() ? COMMON_PEG_PARSE_RESULT_SUCCESS : COMMON_PEG_PARSE_RESULT_FAIL,
start_pos
);
}
common_peg_parse_result operator()(const common_peg_literal_parser & p) {
auto pos = start_pos;
for (auto i = 0u; i < p.literal.size(); ++i) {
if (pos >= ctx.input.size()) {
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, pos);
}
if (ctx.input[pos] != p.literal[i]) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
++pos;
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
common_peg_parse_result operator()(const common_peg_sequence_parser & p) {
if (ctx.is_debug()) {
LOG_DBG("%sSEQ start at %zu '%s' (%zu children)\n", debug_indent().c_str(), start_pos,
debug_input_snippet(start_pos).c_str(), p.children.size());
}
ctx.parse_depth++;
auto pos = start_pos;
std::vector<common_peg_ast_id> nodes;
for (size_t i = 0; i < p.children.size(); i++) {
const auto & child_id = p.children[i];
if (ctx.is_debug()) {
fprintf(stderr, "%sSEQ child %zu: %s\n", debug_indent().c_str(), i, arena.dump(child_id).c_str());
}
auto result = arena.parse(child_id, ctx, pos);
if (ctx.is_debug()) {
fprintf(stderr, "%sSEQ child %zu: %s at %zu->%zu\n", debug_indent().c_str(), i,
common_peg_parse_result_type_name(result.type), result.start, result.end);
}
if (result.fail()) {
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sSEQ -> FAIL\n", debug_indent().c_str());
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos, result.end);
}
if (!result.nodes.empty()) {
nodes.insert(nodes.end(), result.nodes.begin(), result.nodes.end());
}
if (result.need_more_input()) {
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sSEQ -> NEED_MORE\n", debug_indent().c_str());
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, result.end, std::move(nodes));
}
pos = result.end;
}
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sSEQ -> SUCCESS at %zu->%zu\n", debug_indent().c_str(), start_pos, pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos, std::move(nodes));
}
common_peg_parse_result operator()(const common_peg_choice_parser & p) {
if (ctx.is_debug()) {
fprintf(stderr, "%sCHOICE start at %zu '%s' (%zu options)\n", debug_indent().c_str(), start_pos,
debug_input_snippet(start_pos).c_str(), p.children.size());
}
ctx.parse_depth++;
auto pos = start_pos;
for (size_t i = 0; i < p.children.size(); i++) {
const auto & child_id = p.children[i];
if (ctx.is_debug()) {
fprintf(stderr, "%sCHOICE option %zu: %s\n", debug_indent().c_str(), i, arena.dump(child_id).c_str());
}
auto result = arena.parse(child_id, ctx, pos);
if (ctx.is_debug()) {
fprintf(stderr, "%sCHOICE option %zu: %s\n", debug_indent().c_str(), i,
common_peg_parse_result_type_name(result.type));
}
if (!result.fail()) {
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sCHOICE -> %s (option %zu)\n", debug_indent().c_str(),
common_peg_parse_result_type_name(result.type), i);
}
return result;
}
}
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sCHOICE -> FAIL (no options matched)\n", debug_indent().c_str());
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
common_peg_parse_result operator()(const common_peg_repetition_parser & p) {
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT start at %zu '%s' (min=%d, max=%d)\n", debug_indent().c_str(), start_pos,
debug_input_snippet(start_pos).c_str(), p.min_count, p.max_count);
}
ctx.parse_depth++;
auto pos = start_pos;
int match_count = 0;
std::vector<common_peg_ast_id> nodes;
// Try to match up to max_count times (or unlimited if max_count is -1)
while (p.max_count == -1 || match_count < p.max_count) {
if (pos >= ctx.input.size()) {
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT: at end of input, count=%d\n", debug_indent().c_str(), match_count);
}
break;
}
auto result = arena.parse(p.child, ctx, pos);
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT iter %d: %s at %zu->%zu, nodes=%zu\n", debug_indent().c_str(), match_count,
common_peg_parse_result_type_name(result.type), result.start, result.end, result.nodes.size());
fprintf(stderr, "%sREPEAT CHILD: %s\n", debug_indent().c_str(), arena.dump(p.child).c_str());
}
if (result.success()) {
// Prevent infinite loop on empty matches
if (result.end == pos) {
if (ctx.is_debug()) {
fprintf(stderr, "%s REPEAT: empty match, stopping\n", debug_indent().c_str());
}
break;
}
if (!result.nodes.empty()) {
nodes.insert(nodes.end(), result.nodes.begin(), result.nodes.end());
}
pos = result.end;
match_count++;
continue;
}
if (result.need_more_input()) {
if (!result.nodes.empty()) {
nodes.insert(nodes.end(), result.nodes.begin(), result.nodes.end());
}
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT -> NEED_MORE (count=%d, nodes=%zu)\n", debug_indent().c_str(),
match_count, nodes.size());
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, result.end, std::move(nodes));
}
// Child failed - stop trying
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT: child failed, stopping\n", debug_indent().c_str());
}
break;
}
// Check if we got enough matches
if (p.min_count > 0 && match_count < p.min_count) {
ctx.parse_depth--;
if (pos >= ctx.input.size() && ctx.is_lenient()) {
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT -> NEED_MORE (not enough matches: %d < %d)\n", debug_indent().c_str(),
match_count, p.min_count);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, pos, std::move(nodes));
}
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT -> FAIL (not enough matches: %d < %d)\n", debug_indent().c_str(), match_count,
p.min_count);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos, pos);
}
ctx.parse_depth--;
if (ctx.is_debug()) {
fprintf(stderr, "%sREPEAT -> SUCCESS (count=%d, nodes=%zu)\n", debug_indent().c_str(), match_count,
nodes.size());
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos, std::move(nodes));
}
common_peg_parse_result operator()(const common_peg_and_parser & p) {
auto result = arena.parse(p.child, ctx, start_pos);
// Pass result but don't consume input
return common_peg_parse_result(result.type, start_pos);
}
common_peg_parse_result operator()(const common_peg_not_parser & p) {
auto result = arena.parse(p.child, ctx, start_pos);
if (result.success()) {
// Fail if the underlying parser matches
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
if (result.need_more_input()) {
// Propagate - need to know what child would match before negating
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos);
}
// Child failed, so negation succeeds
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos);
}
common_peg_parse_result operator()(const common_peg_any_parser & /* p */) const {
// Parse a single UTF-8 codepoint (not just a single byte)
auto result = common_parse_utf8_codepoint(ctx.input, start_pos);
if (result.status == utf8_parse_result::INCOMPLETE) {
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos);
}
if (result.status == utf8_parse_result::INVALID) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, start_pos + result.bytes_consumed);
}
common_peg_parse_result operator()(const common_peg_space_parser & /* p */) {
auto pos = start_pos;
while (pos < ctx.input.size()) {
auto c = static_cast<unsigned char>(ctx.input[pos]);
if (std::isspace(c)) {
++pos;
} else {
break;
}
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
common_peg_parse_result operator()(const common_peg_chars_parser & p) const {
auto pos = start_pos;
int match_count = 0;
// Try to match up to max_count times (or unlimited if max_count is -1)
while (p.max_count == -1 || match_count < p.max_count) {
auto result = common_parse_utf8_codepoint(ctx.input, pos);
if (result.status == utf8_parse_result::INCOMPLETE) {
if (match_count >= p.min_count) {
// We have enough matches, succeed with what we have
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
// Not enough matches yet
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, pos);
}
if (result.status == utf8_parse_result::INVALID) {
// Malformed UTF-8 in input
if (match_count >= p.min_count) {
// We have enough matches, succeed up to here
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
// Not enough matches, fail
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
// Check if this codepoint matches our character class
bool matches = false;
for (const auto & range : p.ranges) {
if (range.contains(result.codepoint)) {
matches = true;
break;
}
}
// If negated, invert the match result
if (p.negated) {
matches = !matches;
}
if (matches) {
pos += result.bytes_consumed;
++match_count;
} else {
// Character doesn't match, stop matching
break;
}
}
// Check if we got enough matches
if (match_count < p.min_count) {
if (pos >= ctx.input.size() && ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos, pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
static common_peg_parse_result handle_escape_sequence(common_peg_parse_context & ctx, size_t start, size_t & pos, const char delimiter) {
++pos; // consume '\'
if (pos >= ctx.input.size()) {
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start, pos);
}
char c = ctx.input[pos];
if (c == delimiter || c == '\\' || c == '/' || c == 'b' || c == 'f' || c == 'n' || c == 'r' || c == 't') {
++pos;
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start, pos);
} else if (c == 'u') {
return handle_unicode_escape(ctx, start, pos);
} else {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start);
}
}
static common_peg_parse_result handle_unicode_escape(common_peg_parse_context & ctx, size_t start, size_t & pos) {
++pos; // consume 'u'
for (int i = 0; i < 4; ++i) {
if (pos >= ctx.input.size()) {
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start, pos);
}
if (!is_hex_digit(ctx.input[pos])) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start);
}
++pos;
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start, pos);
}
common_peg_parse_result operator()(const common_peg_string_parser & p) {
auto pos = start_pos;
// Parse string content (without quotes)
while (pos < ctx.input.size()) {
char c = ctx.input[pos];
if (c == p.delimiter) {
// Found closing delimiter - success (don't consume it)
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
if (c == '\\') {
auto result = handle_escape_sequence(ctx, start_pos, pos, p.delimiter);
if (!result.success()) {
return result;
}
} else {
auto utf8_result = common_parse_utf8_codepoint(ctx.input, pos);
if (utf8_result.status == utf8_parse_result::INCOMPLETE) {
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, pos);
}
if (utf8_result.status == utf8_parse_result::INVALID) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
pos += utf8_result.bytes_consumed;
}
}
// Reached end without finding closing quote
if (!ctx.is_lenient()) {
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos, pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, pos);
}
common_peg_parse_result operator()(const common_peg_until_parser & p) const {
trie matcher(p.delimiters);
// Scan input and check for delimiters
size_t pos = start_pos;
size_t last_valid_pos = start_pos;
while (pos < ctx.input.size()) {
auto utf8_result = common_parse_utf8_codepoint(ctx.input, pos);
if (utf8_result.status == utf8_parse_result::INCOMPLETE) {
// Incomplete UTF-8 sequence
if (!ctx.is_lenient()) {
// Input is complete but UTF-8 is incomplete = malformed
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
// Return what we have so far (before incomplete sequence)
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, last_valid_pos);
}
if (utf8_result.status == utf8_parse_result::INVALID) {
// Malformed UTF-8
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_FAIL, start_pos);
}
// Check if a delimiter starts at this position
auto match = matcher.check_at(ctx.input, pos);
if (match == trie::COMPLETE_MATCH) {
// Found a complete delimiter, return everything before it
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
if (match == trie::PARTIAL_MATCH) {
// Found a partial match extending to end of input, return everything before it
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, pos);
}
pos += utf8_result.bytes_consumed;
last_valid_pos = pos;
}
if (last_valid_pos == ctx.input.size() && ctx.is_lenient()) {
// Reached the end of a partial stream, there might still be more input that we need to consume.
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_NEED_MORE_INPUT, start_pos, last_valid_pos);
}
return common_peg_parse_result(COMMON_PEG_PARSE_RESULT_SUCCESS, start_pos, last_valid_pos);
}
common_peg_parse_result operator()(const common_peg_schema_parser & p) {
return arena.parse(p.child, ctx, start_pos);
}
common_peg_parse_result operator()(const common_peg_rule_parser & p) {
// Parse the child
auto result = arena.parse(p.child, ctx, start_pos);
if (!result.fail()) {
std::string_view text;
if (result.start < ctx.input.size()) {
text = std::string_view(ctx.input).substr(result.start, result.end - result.start);
}
auto node_id = ctx.ast.add_node(
p.name,
"",
result.start,
result.end,
text,
std::move(result.nodes),
result.need_more_input()
);
return common_peg_parse_result(result.type, result.start, result.end, { node_id });
}
return result;
}
common_peg_parse_result operator()(const common_peg_tag_parser & p) {
// Parse the child
if (ctx.is_debug()) {
fprintf(stderr, "%sTAG: %s\n", debug_indent().c_str(), p.tag.c_str());
}
auto result = arena.parse(p.child, ctx, start_pos);
if (!result.fail()) {
std::string_view text;
if (result.start < ctx.input.size()) {
text = std::string_view(ctx.input).substr(result.start, result.end - result.start);
}
auto node_id = ctx.ast.add_node(
"",
p.tag,
result.start,
result.end,
text,
std::move(result.nodes),
result.need_more_input()
);
return common_peg_parse_result(result.type, result.start, result.end, { node_id });
}
return result;
}
common_peg_parse_result operator()(const common_peg_ref_parser & p) {
auto rule_id = arena.get_rule(p.name);
return arena.parse(rule_id, ctx, start_pos);
}
common_peg_parse_result operator()(const common_peg_atomic_parser & p) {
auto result = arena.parse(p.child, ctx, start_pos);
if (result.need_more_input()) {
// Clear nodes so they don't propagate up.
result.nodes.clear();
}
return result;
}
common_peg_parse_result operator()(const common_peg_gbnf_parser & p) {
return arena.parse(p.child, ctx, start_pos);
}
};
common_peg_parse_result common_peg_arena::parse(common_peg_parse_context & ctx, size_t start) const {
if (root_ == COMMON_PEG_INVALID_PARSER_ID) {
throw std::runtime_error("No root parser set");
}
return parse(root_, ctx, start);
}
common_peg_parse_result common_peg_arena::parse(common_peg_parser_id id, common_peg_parse_context & ctx, size_t start) const {
// Execute parser
const auto & parser = parsers_.at(id);
parser_executor exec(*this, ctx, start);
return std::visit(exec, parser);
}
common_peg_parser_id common_peg_arena::resolve_ref(common_peg_parser_id id) {
const auto & parser = parsers_.at(id);
if (auto ref = std::get_if<common_peg_ref_parser>(&parser)) {
return get_rule(ref->name);
}
return id;
}
static void bfs_node(common_peg_ast_arena &arena, std::ostringstream & oss, const common_peg_ast_node & node, int indent) {
for (int i = 0; i < indent; i++) {
oss << " ";
}
oss << "NODE " << node.id;
if (!node.rule.empty()) {
oss << " (rule " << node.rule << ")";
}
if (!node.tag.empty()) {
oss << " (tag " << node.tag << ")";
}
oss << " ['" << node.text << "']\n";
for (const auto child : node.children) {
bfs_node(arena, oss, arena.get(child), indent + 1);
}
}
std::string common_peg_ast_arena::dump() {
std::ostringstream oss;
for (auto & node : nodes_) {
bfs_node(*this, oss, node, 0);
}
return oss.str();
}
void common_peg_arena::resolve_refs() {
// Walk through all parsers and replace refs with their corresponding rule IDs
for (auto & parser : parsers_) {
std::visit([this](auto & p) {
using T = std::decay_t<decltype(p)>;
if constexpr (std::is_same_v<T, common_peg_sequence_parser>) {
for (auto & child : p.children) {
child = resolve_ref(child);
}
} else if constexpr (std::is_same_v<T, common_peg_choice_parser>) {
for (auto & child : p.children) {
child = resolve_ref(child);
}
} else if constexpr (std::is_same_v<T, common_peg_repetition_parser> ||
std::is_same_v<T, common_peg_and_parser> ||
std::is_same_v<T, common_peg_not_parser> ||
std::is_same_v<T, common_peg_tag_parser> ||
std::is_same_v<T, common_peg_atomic_parser> ||
std::is_same_v<T, common_peg_gbnf_parser>) {
p.child = resolve_ref(p.child);
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
p.child = resolve_ref(p.child);
} else if constexpr (std::is_same_v<T, common_peg_schema_parser>) {
p.child = resolve_ref(p.child);
} else if constexpr (std::is_same_v<T, common_peg_epsilon_parser> ||
std::is_same_v<T, common_peg_start_parser> ||
std::is_same_v<T, common_peg_end_parser> ||
std::is_same_v<T, common_peg_ref_parser> ||
std::is_same_v<T, common_peg_until_parser> ||
std::is_same_v<T, common_peg_literal_parser> ||
std::is_same_v<T, common_peg_string_parser> ||
std::is_same_v<T, common_peg_chars_parser> ||
std::is_same_v<T, common_peg_any_parser> ||
std::is_same_v<T, common_peg_space_parser>) {
// These rules do not have children
} else {
static_assert(is_always_false_v<T>);
}
}, parser);
}
// Also flatten root if it's a ref
if (root_ != COMMON_PEG_INVALID_PARSER_ID) {
root_ = resolve_ref(root_);
}
}
std::string common_peg_arena::dump(common_peg_parser_id id) const {
std::unordered_set<common_peg_parser_id> visited;
return dump_impl(id, visited);
}
std::string common_peg_arena::dump_impl(common_peg_parser_id id,
std::unordered_set<common_peg_parser_id> & visited) const {
// Check for cycles
if (visited.count(id)) {
return "[cycle]";
}
visited.insert(id);
const auto & parser = parsers_.at(id);
return std::visit([this, &visited](const auto & p) -> std::string {
using T = std::decay_t<decltype(p)>;
if constexpr (std::is_same_v<T, common_peg_epsilon_parser>) {
return "Epsilon";
} else if constexpr (std::is_same_v<T, common_peg_start_parser>) {
return "Start";
} else if constexpr (std::is_same_v<T, common_peg_end_parser>) {
return "End";
} else if constexpr (std::is_same_v<T, common_peg_literal_parser>) {
return "Literal(" + p.literal + ")";
} else if constexpr (std::is_same_v<T, common_peg_sequence_parser>) {
std::vector<std::string> parts;
for (const auto & child : p.children) {
parts.push_back(dump_impl(child, visited));
}
return "Sequence(" + string_join(parts, ", ") + ")";
} else if constexpr (std::is_same_v<T, common_peg_choice_parser>) {
std::vector<std::string> parts;
for (const auto & child : p.children) {
parts.push_back(dump_impl(child, visited));
}
return "Choice(" + string_join(parts, ", ") + ")";
} else if constexpr (std::is_same_v<T, common_peg_repetition_parser>) {
if (p.max_count == -1) {
return "Repetition(" + dump_impl(p.child, visited) + ", " + std::to_string(p.min_count) +
", unbounded)";
}
return "Repetition(" + dump_impl(p.child, visited) + ", " + std::to_string(p.min_count) + ", " + std::to_string(p.max_count) + ")";
} else if constexpr (std::is_same_v<T, common_peg_and_parser>) {
return "And(" + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_not_parser>) {
return "Not(" + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_atomic_parser>) {
return "Atomic(" + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_gbnf_parser>) {
return "Gbnf(" + p.grammar + ", " + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_any_parser>) {
return "Any";
} else if constexpr (std::is_same_v<T, common_peg_space_parser>) {
return "Space";
} else if constexpr (std::is_same_v<T, common_peg_chars_parser>) {
if (p.max_count == -1) {
return "CharRepeat(" + p.pattern + ", " + std::to_string(p.min_count) + ", unbounded)";
}
return "CharRepeat(" + p.pattern + ", " + std::to_string(p.min_count) + ", " + std::to_string(p.max_count) + ")";
} else if constexpr (std::is_same_v<T, common_peg_string_parser>) {
return "String(" + std::string(1, p.delimiter) + ")";
} else if constexpr (std::is_same_v<T, common_peg_until_parser>) {
return "Until(" + string_join(p.delimiters, " | ") + ")";
} else if constexpr (std::is_same_v<T, common_peg_schema_parser>) {
return "Schema(" + dump_impl(p.child, visited) + ", " + (p.schema ? p.schema->dump() : "null") + ")";
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
return "Rule(" + p.name + ", " + dump_impl(p.child, visited) + ")";
} else if constexpr (std::is_same_v<T, common_peg_ref_parser>) {
return "Ref(" + p.name + ")";
} else if constexpr (std::is_same_v<T, common_peg_tag_parser>) {
return "Tag(" + p.tag + ", " + dump(p.child) + ")";
} else if constexpr (std::is_same_v<T, common_peg_atomic_parser>) {
return "Atomic(" + dump(p.child) + ")";
} else {
return "Unknown";
}
}, parser);
}
common_peg_parser & common_peg_parser::operator=(const common_peg_parser & other) {
id_ = other.id_;
return *this;
}
common_peg_parser & common_peg_parser::operator+=(const common_peg_parser & other) {
id_ = builder_.sequence({id_, other.id_});
return *this;
}
common_peg_parser & common_peg_parser::operator|=(const common_peg_parser & other) {
id_ = builder_.choice({id_, other.id_});
return *this;
}
common_peg_parser common_peg_parser::operator+(const common_peg_parser & other) const {
return builder_.sequence({id_, other.id_});
}
common_peg_parser common_peg_parser::operator|(const common_peg_parser & other) const {
return builder_.choice({id_, other.id_});
}
common_peg_parser common_peg_parser::operator<<(const common_peg_parser & other) const {
return builder_.sequence({id_, builder_.space(), other.id_});
}
common_peg_parser common_peg_parser::operator+(const char * str) const {
return *this + builder_.literal(str);
}
common_peg_parser common_peg_parser::operator+(const std::string & str) const {
return *this + builder_.literal(str);
}
common_peg_parser common_peg_parser::operator<<(const char * str) const {
return *this << builder_.literal(str);
}
common_peg_parser common_peg_parser::operator<<(const std::string & str) const {
return *this << builder_.literal(str);
}
common_peg_parser common_peg_parser::operator|(const char * str) const {
return *this | builder_.literal(str);
}
common_peg_parser common_peg_parser::operator|(const std::string & str) const {
return *this | builder_.literal(str);
}
common_peg_parser operator+(const char * str, const common_peg_parser & p) {
return p.builder().literal(str) + p;
}
common_peg_parser operator+(const std::string & str, const common_peg_parser & p) {
return operator+(str.c_str(), p);
}
common_peg_parser operator<<(const char * str, const common_peg_parser & p) {
return p.builder().literal(str) << p;
}
common_peg_parser operator<<(const std::string & str, const common_peg_parser & p) {
return operator<<(str.c_str(), p);
}
common_peg_parser operator|(const char * str, const common_peg_parser & p) {
return p.builder().literal(str) | p;
}
common_peg_parser operator|(const std::string & str, const common_peg_parser & p) {
return operator|(str.c_str(), p);
}
static std::string rule_name(const std::string & name) {
static const std::regex invalid_rule_chars_re("[^a-zA-Z0-9-]+");
return std::regex_replace(name, invalid_rule_chars_re, "-");
}
common_peg_parser_builder::common_peg_parser_builder() {}
common_peg_parser common_peg_parser_builder::sequence(const std::vector<common_peg_parser_id> & parsers) {
// Flatten nested sequences
std::vector<common_peg_parser_id> flattened;
for (const auto & p : parsers) {
const auto & parser = arena_.get(p);
if (auto seq = std::get_if<common_peg_sequence_parser>(&parser)) {
flattened.insert(flattened.end(), seq->children.begin(), seq->children.end());
} else {
flattened.push_back(p);
}
}
return wrap(arena_.add_parser(common_peg_sequence_parser{flattened}));
}
common_peg_parser common_peg_parser_builder::sequence(const std::vector<common_peg_parser> & parsers) {
std::vector<common_peg_parser_id> ids;
ids.reserve(parsers.size());
for (const auto & p : parsers) {
ids.push_back(p.id());
}
return sequence(ids);
}
common_peg_parser common_peg_parser_builder::sequence(std::initializer_list<common_peg_parser> parsers) {
std::vector<common_peg_parser_id> ids;
ids.reserve(parsers.size());
for (const auto & p : parsers) {
ids.push_back(p.id());
}
return sequence(ids);
}
common_peg_parser common_peg_parser_builder::choice(const std::vector<common_peg_parser_id> & parsers) {
// Flatten nested choices
std::vector<common_peg_parser_id> flattened;
for (const auto & p : parsers) {
const auto & parser = arena_.get(p);
if (auto choice = std::get_if<common_peg_choice_parser>(&parser)) {
flattened.insert(flattened.end(), choice->children.begin(), choice->children.end());
} else {
flattened.push_back(p);
}
}
return wrap(arena_.add_parser(common_peg_choice_parser{flattened}));
}
common_peg_parser common_peg_parser_builder::choice(const std::vector<common_peg_parser> & parsers) {
std::vector<common_peg_parser_id> ids;
ids.reserve(parsers.size());
for (const auto & p : parsers) {
ids.push_back(p.id());
}
return choice(ids);
}
common_peg_parser common_peg_parser_builder::choice(std::initializer_list<common_peg_parser> parsers) {
std::vector<common_peg_parser_id> ids;
ids.reserve(parsers.size());
for (const auto & p : parsers) {
ids.push_back(p.id());
}
return choice(ids);
}
common_peg_parser common_peg_parser_builder::chars(const std::string & classes, int min, int max) {
auto [ranges, negated] = parse_char_classes(classes);
return wrap(arena_.add_parser(common_peg_chars_parser{classes, ranges, negated, min, max}));
}
common_peg_parser common_peg_parser_builder::schema(const common_peg_parser & p, const std::string & name, const nlohmann::ordered_json & schema, bool raw) {
return wrap(arena_.add_parser(common_peg_schema_parser{p.id(), name, std::make_shared<nlohmann::ordered_json>(schema), raw}));
}
common_peg_parser common_peg_parser_builder::rule(const std::string & name, const common_peg_parser & p, bool trigger) {
auto clean_name = rule_name(name);
auto rule_id = arena_.add_parser(common_peg_rule_parser{clean_name, p.id(), trigger});
arena_.add_rule(clean_name, rule_id);
return ref(clean_name);
}
common_peg_parser common_peg_parser_builder::rule(const std::string & name, const std::function<common_peg_parser()> & builder_fn, bool trigger) {
auto clean_name = rule_name(name);
if (arena_.has_rule(clean_name)) {
return ref(clean_name);
}
// Create placeholder rule to allow recursive references
auto placeholder = any(); // Temporary placeholder
auto placeholder_rule_id = arena_.add_parser(common_peg_rule_parser{clean_name, placeholder.id(), trigger});
arena_.add_rule(clean_name, placeholder_rule_id);
// Build the actual parser
auto parser = builder_fn();
// Replace placeholder with actual rule
auto rule_id = arena_.add_parser(common_peg_rule_parser{clean_name, parser.id(), trigger});
arena_.rules_[clean_name] = rule_id;
return ref(clean_name);
}
void common_peg_parser_builder::set_root(const common_peg_parser & p) {
arena_.set_root(p.id());
}
common_peg_arena common_peg_parser_builder::build() {
arena_.resolve_refs();
return std::move(arena_);
}
// String primitives
common_peg_parser common_peg_parser_builder::string_content(char delimiter) {
return wrap(arena_.add_parser(common_peg_string_parser{delimiter}));
}
common_peg_parser common_peg_parser_builder::double_quoted_string() {
return rule("double-quoted-string", [this]() {
return sequence({literal("\""), string_content('"'), literal("\""), space()});
});
}
common_peg_parser common_peg_parser_builder::single_quoted_string() {
return rule("single-quoted-string", [this]() {
return sequence({literal("'"), string_content('\''), literal("'"), space()});
});
}
common_peg_parser common_peg_parser_builder::quoted_string() {
return rule("quoted-string", [this]() {
return choice({double_quoted_string(), single_quoted_string()});
});
}
// JSON parsers
common_peg_parser common_peg_parser_builder::json_number() {
return rule("json-number", [this]() {
auto digit1_9 = chars("[1-9]", 1, 1);
auto digits = chars("[0-9]");
auto int_part = choice({literal("0"), sequence({digit1_9, chars("[0-9]", 0, -1)})});
auto frac = sequence({literal("."), digits});
auto exp = sequence({choice({literal("e"), literal("E")}), optional(chars("[+-]", 1, 1)), digits});
// Negative lookahead: only commit the number when the next character can't extend it.
// At EOF in partial mode, chars returns NEED_MORE → negate propagates NEED_MORE → number not committed.
// This prevents premature commits of partial numbers (e.g. "3" when "3.14" is incoming).
auto not_number_continuation = negate(chars("[0-9.eE+-]", 1, 1));
return sequence({ optional(literal("-")), int_part, optional(frac), optional(exp), not_number_continuation, space() });
});
}
common_peg_parser common_peg_parser_builder::json_string() {
return rule("json-string", [this]() {
return sequence({literal("\""), string_content('"'), literal("\""), space()});
});
}
common_peg_parser common_peg_parser_builder::json_bool() {
return rule("json-bool", [this]() {
return sequence({choice({literal("true"), literal("false")}), space()});
});
}
common_peg_parser common_peg_parser_builder::json_null() {
return rule("json-null", [this]() {
return sequence({literal("null"), space()});
});
}
common_peg_parser common_peg_parser_builder::json_object() {
return rule("json-object", [this]() {
auto ws = space();
auto member = sequence({json_string(), ws, literal(":"), ws, json()});
auto members = sequence({member, zero_or_more(sequence({ws, literal(","), ws, member}))});
return sequence({
literal("{"),
ws,
choice({
literal("}"),
sequence({members, ws, literal("}")})
}),
ws
});
});
}
common_peg_parser common_peg_parser_builder::json_array() {
return rule("json-array", [this]() {
auto ws = space();
auto elements = sequence({json(), zero_or_more(sequence({literal(","), ws, json()}))});
return sequence({
literal("["),
ws,
choice({
literal("]"),
sequence({elements, ws, literal("]")})
}),
ws
});
});
}
common_peg_parser common_peg_parser_builder::json() {
return rule("json-value", [this]() {
return choice({
json_object(),
json_array(),
json_string(),
json_number(),
json_bool(),
json_null()
});
});
}
common_peg_parser common_peg_parser_builder::python_string() {
return rule("python-string", [this]() {
return choice({double_quoted_string(), single_quoted_string()});
});
}
common_peg_parser common_peg_parser_builder::python_number() {
return json_number();
}
common_peg_parser common_peg_parser_builder::python_bool() {
return rule("python-bool", [this]() {
return sequence({
choice({literal("True"), literal("False")}),
space()
});
});
}
common_peg_parser common_peg_parser_builder::python_null() {
return rule("python-none", [this]() {
return sequence({literal("None"), space()});
});
}
common_peg_parser common_peg_parser_builder::python_dict() {
return rule("python-dict", [this]() {
auto ws = space();
auto member = sequence({python_string(), ws, literal(":"), ws, python_value()});
auto members = sequence({member, zero_or_more(sequence({ws, literal(","), ws, member}))});
return sequence({
literal("{"),
ws,
choice({
literal("}"),
sequence({members, ws, literal("}")})
}),
ws
});
});
}
common_peg_parser common_peg_parser_builder::python_array() {
return rule("python-array", [this]() {
auto ws = space();
auto elements = sequence({python_value(), zero_or_more(sequence({literal(","), ws, python_value()}))});
return sequence({
literal("["),
ws,
choice({
literal("]"),
sequence({elements, ws, literal("]")})
}),
ws
});
});
}
common_peg_parser common_peg_parser_builder::python_value() {
return rule("python-value", [this]() {
return choice({
python_dict(),
python_array(),
python_string(),
python_number(),
python_bool(),
python_null()
});
});
}
common_peg_parser common_peg_parser_builder::marker() {
auto sharp_bracket_parser = literal("<") + until(">") + literal(">");
auto square_bracket_parser = literal("[") + until("]") + literal("]");
return choice({ sharp_bracket_parser, square_bracket_parser });
}
common_peg_parser common_peg_parser_builder::json_member(const std::string & key, const common_peg_parser & p) {
auto ws = space();
return sequence({
literal("\"" + key + "\""),
ws,
literal(":"),
ws,
p,
});
}
static std::string gbnf_escape_char_class(uint32_t c) {
if (c == '-' || c == ']' || c == '[' || c == '\\') {
return "\\" + std::string(1, (char) c);
}
// Escape whitespace control characters
if (c == '\n') {
return "\\n";
}
if (c == '\t') {
return "\\t";
}
if (c == '\r') {
return "\\r";
}
// Printable ASCII
if (c >= 0x20 && c <= 0x7E) {
return std::string(1, (char) c);
}
// Hex escape
char buf[16];
const char * hex = "0123456789ABCDEF";
if (c <= 0xFF) {
buf[0] = '\\';
buf[1] = 'x';
buf[2] = hex[(c >> 4) & 0xF];
buf[3] = hex[c & 0xF];
buf[4] = '\0';
} else if (c <= 0xFFFF) {
buf[0] = '\\';
buf[1] = 'u';
buf[2] = hex[(c >> 12) & 0xF];
buf[3] = hex[(c >> 8) & 0xF];
buf[4] = hex[(c >> 4) & 0xF];
buf[5] = hex[c & 0xF];
buf[6] = '\0';
} else {
buf[0] = '\\';
buf[1] = 'U';
for (int i = 0; i < 8; i++) {
buf[2 + i] = hex[(c >> ((7 - i) * 4)) & 0xF];
}
buf[10] = '\0';
}
return std::string(buf);
}
static std::string gbnf_excluding_pattern(const std::vector<std::string> & strings) {
trie matcher(strings);
auto pieces = matcher.collect_prefix_and_next();
std::string pattern;
for (size_t i = 0; i < pieces.size(); ++i) {
if (i > 0) {
pattern += " | ";
}
const auto & pre = pieces[i].prefix;
const auto & chars = pieces[i].next_chars;
std::string cls;
cls.reserve(chars.size());
for (uint32_t ch : chars) {
cls += gbnf_escape_char_class(ch);
}
if (!pre.empty()) {
pattern += gbnf_format_literal(common_unicode_cpts_to_utf8(pre)) + " [^" + cls + "]";
} else {
pattern += "[^" + cls + "]";
}
}
return "(" + pattern + ")*";
}
static std::unordered_set<std::string> collect_reachable_rules(
const common_peg_arena & arena,
const common_peg_parser_id & rule
) {
std::unordered_set<std::string> reachable;
std::unordered_set<std::string> visited;
std::function<void(common_peg_parser_id)> visit = [&](common_peg_parser_id id) {
const auto & parser = arena.get(id);
std::visit([&](const auto & p) {
using T = std::decay_t<decltype(p)>;
if constexpr (std::is_same_v<T, common_peg_epsilon_parser> ||
std::is_same_v<T, common_peg_start_parser> ||
std::is_same_v<T, common_peg_end_parser> ||
std::is_same_v<T, common_peg_until_parser> ||
std::is_same_v<T, common_peg_literal_parser> ||
std::is_same_v<T, common_peg_chars_parser> ||
std::is_same_v<T, common_peg_space_parser> ||
std::is_same_v<T, common_peg_any_parser> ||
std::is_same_v<T, common_peg_string_parser>) {
// These parsers do not have any children
} else if constexpr (std::is_same_v<T, common_peg_sequence_parser>) {
for (auto child : p.children) {
visit(child);
}
} else if constexpr (std::is_same_v<T, common_peg_choice_parser>) {
for (auto child : p.children) {
visit(child);
}
} else if constexpr (std::is_same_v<T, common_peg_repetition_parser> ||
std::is_same_v<T, common_peg_and_parser> ||
std::is_same_v<T, common_peg_not_parser> ||
std::is_same_v<T, common_peg_tag_parser> ||
std::is_same_v<T, common_peg_atomic_parser> ||
std::is_same_v<T, common_peg_gbnf_parser> ||
std::is_same_v<T, common_peg_schema_parser>) {
visit(p.child);
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
if (visited.find(p.name) == visited.end()) {
visited.insert(p.name);
reachable.insert(p.name);
visit(p.child);
}
} else if constexpr (std::is_same_v<T, common_peg_ref_parser>) {
// Traverse rules so we pick up everything
auto referenced_rule = arena.get_rule(p.name);
visit(referenced_rule);
} else {
static_assert(is_always_false_v<T>);
}
}, parser);
};
visit(rule);
return reachable;
}
// GBNF generation implementation
void common_peg_arena::build_grammar(const common_grammar_builder & builder, bool lazy) const {
auto schema_delegates = [](const common_peg_schema_parser & s) -> bool {
if (!s.schema) {
return true;
}
if (s.raw && s.schema->contains("type")) {
const auto & type_val = s.schema->at("type");
if (type_val.is_string() && type_val == "string") {
return true;
}
// Handle nullable types like ["string", "null"] - delegate when the
// non-null type is string, since the tagged format uses raw text
if (type_val.is_array()) {
for (const auto & t : type_val) {
if (t.is_string() && t.get<std::string>() != "null") {
return t.get<std::string>() == "string";
}
}
}
}
// Delegate for enum schemas in raw mode - enum values are literal strings
if (s.raw && !s.schema->contains("type") && s.schema->contains("enum")) {
return true;
}
return false;
};
// Unwrap the parser so we can properly check if it's a sequence or choice
auto effective_parser = [&](common_peg_parser_id id) -> const common_peg_parser_variant & {
while (true) {
const auto & p = parsers_.at(id);
if (const auto * tag = std::get_if<common_peg_tag_parser>(&p)) {
id = tag->child;
} else if (const auto * atomic = std::get_if<common_peg_atomic_parser>(&p)) {
id = atomic->child;
} else if (const auto * schema = std::get_if<common_peg_schema_parser>(&p)) {
if (schema_delegates(*schema)) {
id = schema->child;
} else {
return p;
}
} else {
return p;
}
}
};
// Generate GBNF for a parser
std::function<std::string(common_peg_parser_id)> to_gbnf = [&](common_peg_parser_id id) -> std::string {
const auto & parser = parsers_.at(id);
return std::visit([&](const auto & p) -> std::string {
using T = std::decay_t<decltype(p)>;
if constexpr (std::is_same_v<T, common_peg_epsilon_parser> ||
std::is_same_v<T, common_peg_start_parser> ||
std::is_same_v<T, common_peg_end_parser>) {
return "";
} else if constexpr (std::is_same_v<T, common_peg_literal_parser>) {
return gbnf_format_literal(p.literal);
} else if constexpr (std::is_same_v<T, common_peg_sequence_parser>) {
std::string s;
for (const auto & child : p.children) {
auto child_gbnf = to_gbnf(child);
if (child_gbnf.empty()) {
continue;
}
if (!s.empty()) {
s += " ";
}
const auto & child_parser = effective_parser(child);
if (std::holds_alternative<common_peg_choice_parser>(child_parser) ||
std::holds_alternative<common_peg_sequence_parser>(child_parser)) {
s += "(" + child_gbnf + ")";
} else {
s += child_gbnf;
}
}
return s;
} else if constexpr (std::is_same_v<T, common_peg_choice_parser>) {
std::string s;
for (const auto & child : p.children) {
if (!s.empty()) {
s += " | ";
}
auto child_gbnf = to_gbnf(child);
const auto & child_parser = effective_parser(child);
if (std::holds_alternative<common_peg_choice_parser>(child_parser)) {
s += "(" + child_gbnf + ")";
} else {
s += child_gbnf;
}
}
return s;
} else if constexpr (std::is_same_v<T, common_peg_repetition_parser>) {
auto child_gbnf = to_gbnf(p.child);
const auto & child_parser = effective_parser(p.child);
if (std::holds_alternative<common_peg_choice_parser>(child_parser) ||
std::holds_alternative<common_peg_sequence_parser>(child_parser)) {
child_gbnf = "(" + child_gbnf + ")";
}
if (p.min_count == 0 && p.max_count == 1) {
return child_gbnf + "?";
}
if (p.min_count == 0 && p.max_count == -1) {
return child_gbnf + "*";
}
if (p.min_count == 1 && p.max_count == -1) {
return child_gbnf + "+";
}
if (p.max_count == -1) {
return child_gbnf + "{" + std::to_string(p.min_count) + ",}";
}
if (p.min_count == p.max_count) {
if (p.min_count == 1) {
return child_gbnf;
}
return child_gbnf + "{" + std::to_string(p.min_count) + "}";
}
return child_gbnf + "{" + std::to_string(p.min_count) + "," + std::to_string(p.max_count) + "}";
} else if constexpr (std::is_same_v<T, common_peg_and_parser> || std::is_same_v<T, common_peg_not_parser>) {
return ""; // Lookahead not supported in GBNF
} else if constexpr (std::is_same_v<T, common_peg_any_parser>) {
return ".";
} else if constexpr (std::is_same_v<T, common_peg_space_parser>) {
return "space";
} else if constexpr (std::is_same_v<T, common_peg_chars_parser>) {
std::string result = p.pattern;
if (p.min_count == 0 && p.max_count == 1) {
return result + "?";
}
if (p.min_count == 0 && p.max_count == -1) {
return result + "*";
}
if (p.min_count == 1 && p.max_count == -1) {
return result + "+";
}
if (p.max_count == -1) {
return result + "{" + std::to_string(p.min_count) + ",}";
}
if (p.min_count == p.max_count) {
if (p.min_count == 1) {
return result;
}
return result + "{" + std::to_string(p.min_count) + "}";
}
return result + "{" + std::to_string(p.min_count) + "," + std::to_string(p.max_count) + "}";
} else if constexpr (std::is_same_v<T, common_peg_string_parser>) {
const std::string delim(1, p.delimiter);
return R"(( [^)" + delim + R"(\\] | "\\" ( [)" + delim + R"(\\/ bfnrt] | "u" [0-9a-fA-F]{4} ) )*)";
} else if constexpr (std::is_same_v<T, common_peg_until_parser>) {
if (p.delimiters.empty()) {
return ".*";
}
return gbnf_excluding_pattern(p.delimiters);
} else if constexpr (std::is_same_v<T, common_peg_schema_parser>) {
if (schema_delegates(p)) {
return to_gbnf(p.child);
}
return builder.add_schema(p.name, *p.schema);
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
return p.name;
} else if constexpr (std::is_same_v<T, common_peg_ref_parser>) {
// Refs should not exist after flattening, but kept just in case
return p.name;
} else if constexpr (std::is_same_v<T, common_peg_tag_parser>) {
return to_gbnf(p.child);
} else if constexpr (std::is_same_v<T, common_peg_atomic_parser>) {
return to_gbnf(p.child);
} else if constexpr (std::is_same_v<T, common_peg_gbnf_parser>) {
return p.grammar;
} else {
static_assert(is_always_false_v<T>);
}
}, parser);
};
// Collect reachable rules
std::unordered_set<std::string> reachable_rules;
if (lazy) {
// Collect rules reachable from trigger rules
for (const auto & [name, id] : rules_) {
const auto & parser = parsers_.at(id);
if (auto rule = std::get_if<common_peg_rule_parser>(&parser)) {
if (rule->trigger) {
// Mark trigger as reachable and visit it
reachable_rules.insert(name);
auto add_rules = collect_reachable_rules(*this, id);
reachable_rules.insert(add_rules.begin(), add_rules.end());
}
}
}
} else {
// Collect rules reachable from root
reachable_rules = collect_reachable_rules(*this, root_);
}
// Create GBNF rules for all reachable rules
for (const auto & [name, rule_id] : rules_) {
if (reachable_rules.find(name) == reachable_rules.end()) {
continue;
}
const auto & parser = parsers_.at(rule_id);
if (auto rule = std::get_if<common_peg_rule_parser>(&parser)) {
builder.add_rule(rule->name, to_gbnf(rule->child));
}
}
if (lazy) {
// Generate root rule from trigger rules only
std::vector<std::string> trigger_names;
for (const auto & [name, rule_id] : rules_) {
const auto & parser = parsers_.at(rule_id);
if (auto rule = std::get_if<common_peg_rule_parser>(&parser)) {
if (rule->trigger) {
trigger_names.push_back(rule->name);
}
}
}
// Sort for predictable order
std::sort(trigger_names.begin(), trigger_names.end());
builder.add_rule("root", string_join(trigger_names, " | "));
} else if (root_ != COMMON_PEG_INVALID_PARSER_ID) {
builder.add_rule("root", to_gbnf(root_));
}
}
static nlohmann::json serialize_parser_variant(const common_peg_parser_variant & variant) {
using json = nlohmann::json;
return std::visit([](const auto & p) -> json {
using T = std::decay_t<decltype(p)>;
if constexpr (std::is_same_v<T, common_peg_epsilon_parser>) {
return json{{"type", "epsilon"}};
} else if constexpr (std::is_same_v<T, common_peg_start_parser>) {
return json{{"type", "start"}};
} else if constexpr (std::is_same_v<T, common_peg_end_parser>) {
return json{{"type", "end"}};
} else if constexpr (std::is_same_v<T, common_peg_literal_parser>) {
return json{{"type", "literal"}, {"literal", p.literal}};
} else if constexpr (std::is_same_v<T, common_peg_sequence_parser>) {
return json{{"type", "sequence"}, {"children", p.children}};
} else if constexpr (std::is_same_v<T, common_peg_choice_parser>) {
return json{{"type", "choice"}, {"children", p.children}};
} else if constexpr (std::is_same_v<T, common_peg_repetition_parser>) {
return json{
{"type", "repetition"},
{"child", p.child},
{"min_count", p.min_count},
{"max_count", p.max_count}
};
} else if constexpr (std::is_same_v<T, common_peg_and_parser>) {
return json{{"type", "and"}, {"child", p.child}};
} else if constexpr (std::is_same_v<T, common_peg_not_parser>) {
return json{{"type", "not"}, {"child", p.child}};
} else if constexpr (std::is_same_v<T, common_peg_any_parser>) {
return json{{"type", "any"}};
} else if constexpr (std::is_same_v<T, common_peg_space_parser>) {
return json{{"type", "space"}};
} else if constexpr (std::is_same_v<T, common_peg_chars_parser>) {
json ranges = json::array();
for (const auto & range : p.ranges) {
ranges.push_back({{"start", range.start}, {"end", range.end}});
}
return json{
{"type", "chars"},
{"pattern", p.pattern},
{"ranges", ranges},
{"negated", p.negated},
{"min_count", p.min_count},
{"max_count", p.max_count}
};
} else if constexpr (std::is_same_v<T, common_peg_string_parser>) {
return json{{"type", "string"}, {"delimiter", std::string(1, p.delimiter)}};
} else if constexpr (std::is_same_v<T, common_peg_until_parser>) {
return json{{"type", "until"}, {"delimiters", p.delimiters}};
} else if constexpr (std::is_same_v<T, common_peg_schema_parser>) {
return json{
{"type", "schema"},
{"child", p.child},
{"name", p.name},
{"schema", p.schema ? *p.schema : nullptr},
{"raw", p.raw}
};
} else if constexpr (std::is_same_v<T, common_peg_rule_parser>) {
return json{
{"type", "rule"},
{"name", p.name},
{"child", p.child},
{"trigger", p.trigger}
};
} else if constexpr (std::is_same_v<T, common_peg_ref_parser>) {
return json{{"type", "ref"}, {"name", p.name}};
} else if constexpr (std::is_same_v<T, common_peg_atomic_parser>) {
return json{{"type", "atomic"}, {"child", p.child}};
} else if constexpr (std::is_same_v<T, common_peg_tag_parser>) {
return json{
{"type", "tag"},
{"child", p.child},
{"tag", p.tag}
};
} else if constexpr (std::is_same_v<T, common_peg_gbnf_parser>) {
return json{{"type", "gbnf"}, {"child", p.child}, {"grammar", p.grammar}};
}
}, variant);
}
nlohmann::json common_peg_arena::to_json() const {
auto parsers = nlohmann::json::array();
for (const auto & parser : parsers_) {
parsers.push_back(serialize_parser_variant(parser));
}
return nlohmann::json{
{"parsers", parsers},
{"rules", rules_},
{"root", root_}
};
}
static common_peg_parser_variant deserialize_parser_variant(const nlohmann::json & j) {
if (!j.contains("type") || !j["type"].is_string()) {
throw std::runtime_error("Parser variant JSON missing or invalid 'type' field");
}
std::string type = j["type"];
if (type == "epsilon") {
return common_peg_epsilon_parser{};
}
if (type == "start") {
return common_peg_start_parser{};
}
if (type == "end") {
return common_peg_end_parser{};
}
if (type == "literal") {
if (!j.contains("literal") || !j["literal"].is_string()) {
throw std::runtime_error("literal parser missing or invalid 'literal' field");
}
return common_peg_literal_parser{j["literal"]};
}
if (type == "sequence") {
if (!j.contains("children") || !j["children"].is_array()) {
throw std::runtime_error("sequence parser missing or invalid 'children' field");
}
return common_peg_sequence_parser{j["children"].get<std::vector<common_peg_parser_id>>()};
}
if (type == "choice") {
if (!j.contains("children") || !j["children"].is_array()) {
throw std::runtime_error("choice parser missing or invalid 'children' field");
}
return common_peg_choice_parser{j["children"].get<std::vector<common_peg_parser_id>>()};
}
if (type == "repetition") {
if (!j.contains("child") || !j.contains("min_count") || !j.contains("max_count")) {
throw std::runtime_error("repetition parser missing required fields");
}
return common_peg_repetition_parser{
j["child"].get<common_peg_parser_id>(),
j["min_count"].get<int>(),
j["max_count"].get<int>()
};
}
if (type == "and") {
if (!j.contains("child")) {
throw std::runtime_error("and parser missing 'child' field");
}
return common_peg_and_parser{j["child"].get<common_peg_parser_id>()};
}
if (type == "not") {
if (!j.contains("child")) {
throw std::runtime_error("not parser missing 'child' field");
}
return common_peg_not_parser{j["child"].get<common_peg_parser_id>()};
}
if (type == "any") {
return common_peg_any_parser{};
}
if (type == "space") {
return common_peg_space_parser{};
}
if (type == "chars") {
if (!j.contains("pattern") || !j.contains("ranges") || !j.contains("negated") ||
!j.contains("min_count") || !j.contains("max_count")) {
throw std::runtime_error("chars parser missing required fields");
}
common_peg_chars_parser parser;
parser.pattern = j["pattern"];
parser.negated = j["negated"];
parser.min_count = j["min_count"];
parser.max_count = j["max_count"];
for (const auto & range_json : j["ranges"]) {
if (!range_json.contains("start") || !range_json.contains("end")) {
throw std::runtime_error("char_range missing 'start' or 'end' field");
}
parser.ranges.push_back({
range_json["start"].get<uint32_t>(),
range_json["end"].get<uint32_t>()
});
}
return parser;
}
if (type == "string") {
if (!j.contains("delimiter")) {
throw std::runtime_error("string parser missing delimiter field.");
}
std::string delimiter = j["delimiter"];
if (delimiter.empty()) {
throw std::runtime_error("string parser delimiter is empty.");
}
return common_peg_string_parser{delimiter[0]};
}
if (type == "until") {
if (!j.contains("delimiters") || !j["delimiters"].is_array()) {
throw std::runtime_error("until parser missing or invalid 'delimiters' field");
}
return common_peg_until_parser{j["delimiters"].get<std::vector<std::string>>()};
}
if (type == "schema") {
if (!j.contains("child") || !j.contains("name") || !j.contains("schema") || !j.contains("raw")) {
throw std::runtime_error("schema parser missing required fields");
}
common_peg_schema_parser parser;
parser.child = j["child"].get<common_peg_parser_id>();
parser.name = j["name"];
if (!j["schema"].is_null()) {
parser.schema = std::make_shared<nlohmann::ordered_json>(j["schema"]);
}
parser.raw = j["raw"].get<bool>();
return parser;
}
if (type == "rule") {
if (!j.contains("name") || !j.contains("child") || !j.contains("trigger")) {
throw std::runtime_error("rule parser missing required fields");
}
return common_peg_rule_parser{
j["name"].get<std::string>(),
j["child"].get<common_peg_parser_id>(),
j["trigger"].get<bool>()
};
}
if (type == "ref") {
if (!j.contains("name") || !j["name"].is_string()) {
throw std::runtime_error("ref parser missing or invalid 'name' field");
}
return common_peg_ref_parser{j["name"]};
}
if (type == "atomic") {
if (!j.contains("child")) {
throw std::runtime_error("tag parser missing required fields");
}
return common_peg_atomic_parser{
j["child"].get<common_peg_parser_id>(),
};
}
if (type == "tag") {
if (!j.contains("child") || !j.contains("tag")) {
throw std::runtime_error("tag parser missing required fields");
}
return common_peg_tag_parser{
j["child"].get<common_peg_parser_id>(),
j["tag"].get<std::string>(),
};
}
if (type == "gbnf") {
if (!j.contains("child") || !j.contains("grammar")) {
throw std::runtime_error("gbnf parser missing required fields");
}
return common_peg_gbnf_parser{
j["child"].get<common_peg_parser_id>(),
j["grammar"].get<std::string>(),
};
}
throw std::runtime_error("Unknown parser type: " + type);
}
common_peg_arena common_peg_arena::from_json(const nlohmann::json & j) {
if (!j.contains("parsers") || !j["parsers"].is_array()) {
throw std::runtime_error("JSON missing or invalid 'parsers' array");
}
if (!j.contains("rules") || !j["rules"].is_object()) {
throw std::runtime_error("JSON missing or invalid 'rules' object");
}
if (!j.contains("root")) {
throw std::runtime_error("JSON missing 'root' field");
}
common_peg_arena arena;
const auto & parsers_json = j["parsers"];
arena.parsers_.reserve(parsers_json.size());
for (const auto & parser_json : parsers_json) {
arena.parsers_.push_back(deserialize_parser_variant(parser_json));
}
arena.rules_ = j["rules"].get<std::unordered_map<std::string, common_peg_parser_id>>();
for (const auto & [name, id] : arena.rules_) {
if (id >= arena.parsers_.size()) {
throw std::runtime_error("Rule '" + name + "' references invalid parser ID: " + std::to_string(id));
}
}
arena.root_ = j["root"].get<common_peg_parser_id>();
if (arena.root_ != COMMON_PEG_INVALID_PARSER_ID && arena.root_ >= arena.parsers_.size()) {
throw std::runtime_error("Root references invalid parser ID: " + std::to_string(arena.root_));
}
return arena;
}
std::string common_peg_arena::save() const {
return to_json().dump();
}
void common_peg_arena::load(const std::string & data) {
*this = from_json(nlohmann::json::parse(data));
}
common_peg_arena build_peg_parser(const std::function<common_peg_parser(common_peg_parser_builder & builder)> & fn) {
common_peg_parser_builder builder;
builder.set_root(fn(builder));
return builder.build();
}