//
// Copyright (C) 2023-2025 The llama.cpp authors
// Copyright (C) 2024-2025 Iwan Kawrakow
// MIT license
// SPDX-License-Identifier: MIT
//

// Various helper functions and utilities

#pragma once

#include "llama.h"

#include "sampling.h"

#define LOG_NO_FILE_LINE_FUNCTION
#include "log.h"
#include <set>
#include <cmath>
#include <string>
#include <sstream>
#include <string_view>
#include <vector>
#include <random>
#include <thread>
#include <unordered_map>
#include <tuple>
#include <map>

#ifdef _WIN32
#define DIRECTORY_SEPARATOR '\\'
#else
#define DIRECTORY_SEPARATOR '/'
#endif // _WIN32

#define die(msg)          do { fputs("error: " msg "\n", stderr);                exit(1); } while (0)
#define die_fmt(fmt, ...) do { fprintf(stderr, "error: " fmt "\n", __VA_ARGS__); exit(1); } while (0)

#define print_build_info() do {                                                                    \
    fprintf(stderr, "%s: build = %d (%s)\n", __func__, LLAMA_BUILD_NUMBER, LLAMA_COMMIT);          \
    fprintf(stderr, "%s: built with %s for %s\n", __func__, LLAMA_COMPILER, LLAMA_BUILD_TARGET);   \
} while(0)
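
// Example (illustrative, not part of the API): typical use of the error macros above;
// `path` and `f` are hypothetical locals.
//
//     FILE * f = fopen(path.c_str(), "rb");
//     if (f == nullptr) {
//         die_fmt("failed to open '%s'", path.c_str());
//     }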

#define DEFAULT_MODEL_PATH "models/7B/ggml-model-f16.gguf"

struct llama_lora_adapter_info {
    std::string path;
    float scale;
};

struct llama_lora_adapter_container : llama_lora_adapter_info {
    struct llama_lora_adapter * adapter;
};

// build info
extern int LLAMA_BUILD_NUMBER;
extern char const * LLAMA_COMMIT;
extern char const * LLAMA_COMPILER;
extern char const * LLAMA_BUILD_TARGET;

struct llama_control_vector_load_info;

//
// CPU utils
//

int32_t cpu_get_num_physical_cores();
int32_t cpu_get_num_math();

enum llama_example {
    LLAMA_EXAMPLE_COMMON,
    LLAMA_EXAMPLE_SPECULATIVE,
    LLAMA_EXAMPLE_MAIN,
    LLAMA_EXAMPLE_EMBEDDING,
    LLAMA_EXAMPLE_PERPLEXITY,
    LLAMA_EXAMPLE_RETRIEVAL,
    LLAMA_EXAMPLE_PASSKEY,
    LLAMA_EXAMPLE_IMATRIX,
    LLAMA_EXAMPLE_BENCH,
    LLAMA_EXAMPLE_SERVER,
    LLAMA_EXAMPLE_CVECTOR_GENERATOR,
    LLAMA_EXAMPLE_EXPORT_LORA,
    LLAMA_EXAMPLE_MTMD,
    LLAMA_EXAMPLE_LOOKUP,
    LLAMA_EXAMPLE_PARALLEL,
    LLAMA_EXAMPLE_TTS,
    LLAMA_EXAMPLE_DIFFUSION,
    LLAMA_EXAMPLE_FINETUNE,

    LLAMA_EXAMPLE_COUNT,
};

//
// CLI argument parsing
//

// dimensionality reduction methods, used by cvector-generator
enum dimre_method {
    DIMRE_METHOD_PCA,
    DIMRE_METHOD_MEAN,
};

// reasoning API response format (not to be confused with the chat template's reasoning format)
enum common_reasoning_format {
    COMMON_REASONING_FORMAT_NONE,
    COMMON_REASONING_FORMAT_AUTO,
    COMMON_REASONING_FORMAT_DEEPSEEK_LEGACY, // Extract thinking tag contents and return as `message.reasoning_content`, or leave inline in <think> tags in stream mode
    COMMON_REASONING_FORMAT_DEEPSEEK,        // Extract thinking tag contents and return as `message.reasoning_content`, including in streaming deltas.
};

enum common_webui {
    COMMON_WEBUI_NONE,
    COMMON_WEBUI_AUTO,
    COMMON_WEBUI_LLAMACPP,
};

common_webui common_webui_from_name(const std::string& format);

struct model_paths {
    std::string path = ""; // model local path // NOLINT
    std::string url = ""; // model url to download // NOLINT
    std::string hf_repo = ""; // HF repo // NOLINT
    std::string hf_file = ""; // HF file // NOLINT
    std::string docker_repo = ""; // Docker repo // NOLINT
};

struct gpt_params {
    uint32_t seed = LLAMA_DEFAULT_SEED; // RNG seed

    int32_t n_threads = cpu_get_num_math();
    int32_t n_threads_draft = -1;
    int32_t n_threads_batch = -1; // number of threads to use for batch processing (-1 = use n_threads)
    int32_t n_threads_batch_draft = -1;
    int32_t n_predict = -1; // new tokens to predict
    int32_t n_ctx = 0; // context size
    int32_t n_ctx_draft = 0; // context size for draft model
    int32_t n_batch = 2048; // logical batch size for prompt processing (must be >=32 to use BLAS)
    int32_t n_ubatch = 512; // physical batch size for prompt processing (must be >=32 to use BLAS)
    int32_t n_keep = 0; // number of tokens to keep from initial prompt
    int32_t n_draft = 16; // number of tokens to draft during speculative decoding
    int32_t n_draft_min = 1; // minimum number of tokens to draft during speculative decoding
    float p_draft_min = 0.8f; // minimum speculative decoding probability (greedy)
    int32_t n_chunks = -1; // max number of chunks to process (-1 = unlimited)
    int32_t n_parallel = 1; // number of parallel sequences to decode
    int32_t n_sequences = 1; // number of sequences to decode
    float p_split = 0.1f; // speculative decoding split probability
    int32_t n_gpu_layers = -1; // number of layers to store in VRAM (-1 - use default)
    int32_t n_gpu_layers_draft = -1; // number of layers to store in VRAM for the draft model (-1 - use default)
    int32_t main_gpu = 0; // the GPU that is used for scratch and small tensors
    float tensor_split[128] = {0}; // how split tensors should be distributed across GPUs
    int32_t grp_attn_n = 1; // group-attention factor
    int32_t grp_attn_w = 512; // group-attention width
    int32_t n_print = -1; // print token count every n tokens (-1 = disabled)
    float rope_freq_base = 0.0f; // RoPE base frequency
    float rope_freq_scale = 0.0f; // RoPE frequency scaling factor
    float yarn_ext_factor = -1.0f; // YaRN extrapolation mix factor
    float yarn_attn_factor = -1.0f; // YaRN magnitude scaling factor
    float yarn_beta_fast = -1.0f; // YaRN low correction dim
    float yarn_beta_slow = -1.0f; // YaRN high correction dim
    int32_t yarn_orig_ctx = 0; // YaRN original context length
    float defrag_thold = -1.0f; // KV cache defragmentation threshold

    ggml_backend_sched_eval_callback cb_eval = nullptr;
    void * cb_eval_user_data = nullptr;

    ggml_numa_strategy numa = GGML_NUMA_STRATEGY_DISABLED;

    enum llama_split_mode split_mode = LLAMA_SPLIT_MODE_LAYER; // how to split the model across GPUs
    enum llama_rope_scaling_type rope_scaling_type = LLAMA_ROPE_SCALING_TYPE_UNSPECIFIED;
    enum llama_pooling_type pooling_type = LLAMA_POOLING_TYPE_UNSPECIFIED; // pooling type for embeddings
    enum llama_attention_type attention_type = LLAMA_ATTENTION_TYPE_UNSPECIFIED; // attention type for embeddings

    // sampling parameters
    struct llama_sampling_params sparams;

    std::string model = ""; // model path
    std::string model_draft = ""; // draft model for speculative decoding
    std::string model_alias = "unknown"; // model alias
    std::string model_url = ""; // model url to download
    std::string hf_token = ""; // HF token
    std::string hf_repo = ""; // HF repo
    std::string hf_file = ""; // HF file
    std::string prompt = "";
    std::string prompt_file = ""; // store the external prompt file name
    bool prompt_is_binary = false; // don't fool around when the prompt contains binary data (as is the case for multiple choice)
    std::string path_prompt_cache = ""; // path to file for saving/loading prompt eval state
    std::string input_prefix = ""; // string to prefix user inputs with
    std::string input_suffix = ""; // string to suffix user inputs with
    std::string logdir = ""; // directory in which to save YAML log files
    std::string lookup_cache_static = ""; // path of static ngram cache file for lookup decoding
    std::string lookup_cache_dynamic = ""; // path of dynamic ngram cache file for lookup decoding
    std::string logits_file = ""; // file for saving *all* logits
    std::string rpc_servers = ""; // comma separated list of RPC servers

    std::vector<std::string> in_files; // all input files
    std::vector<std::string> antiprompt; // strings upon which more user input is prompted (a.k.a. reverse prompts)
    std::vector<llama_model_kv_override> kv_overrides;
    std::vector<llama_model_tensor_buft_override> tensor_buft_overrides;
    std::vector<std::pair<int,int>> offload_policy;

    std::vector<std::pair<std::string, std::string>> replacements_draft; // main to speculative model replacements

    bool lora_init_without_apply = false; // only load lora to memory, but do not apply it to ctx (user can manually apply lora later using llama_lora_adapter_apply)
    std::vector<llama_lora_adapter_info> lora_adapters; // lora adapter path with user defined scale

    std::vector<llama_control_vector_load_info> control_vectors; // control vector with user defined scale

    int32_t verbosity = 0;
    int32_t control_vector_layer_start = -1; // layer range for control vector
    int32_t control_vector_layer_end = -1; // layer range for control vector

    int32_t ppl_stride = 0; // stride for perplexity calculations. If left at 0, the pre-existing approach will be used.
    int32_t ppl_output_type = 0; // = 0 -> ppl output is as usual, = 1 -> ppl output is num_tokens, ppl, one per line
                                 // (which is more convenient to use for plotting)
    //
    bool hellaswag = false; // compute HellaSwag score over random tasks from datafile supplied in prompt
    size_t hellaswag_tasks = 400; // number of tasks to use when computing the HellaSwag score

    bool winogrande = false; // compute Winogrande score over random tasks from datafile supplied in prompt
    size_t winogrande_tasks = 0; // number of tasks to use when computing the Winogrande score. If 0, all tasks will be computed

    bool multiple_choice = false; // compute TruthfulQA score over random tasks from datafile supplied in prompt
    size_t multiple_choice_tasks = 0; // number of tasks to use when computing the TruthfulQA score. If 0, all tasks will be computed

    bool kl_divergence = false; // compute KL divergence

    bool usage = false; // print usage
    bool use_color = false; // use color to distinguish generations and inputs
    bool special = false; // enable special token output
    bool interactive = false; // interactive mode
    bool interactive_first = false; // wait for user input immediately
    bool conversation = false; // conversation mode (does not print special tokens and suffix/prefix)
    bool prompt_cache_all = false; // save user input and generations to prompt cache
    bool prompt_cache_ro = false; // open the prompt cache read-only and do not update it

    bool escape = true; // escape "\n", "\r", "\t", "\'", "\"", and "\\"
    bool multiline_input = false; // reverse the usage of `\`
    bool simple_io = false; // improves compatibility with subprocesses and limited consoles
    bool cont_batching = true; // insert new sequences for decoding on-the-fly
    bool flash_attn = true; // flash attention
    int mla_attn = 0; // MLA 0: standard attention, 1: MLA with K and transposed V cache, 2: MLA with just K cache
    int attn_max_batch = 0; // Max batch size to use when computing attention (only applicable if flash_attn = false)
    bool fused_moe_up_gate = true; // fused up*unary(gate) op for MoE models
    bool fused_up_gate = true; // fused up*unary(gate) op
    bool fused_mmad = true; // fused mul+multi_add op
    bool grouped_expert_routing = false; // whether to use grouped expert routing (BailingMoeV2 arch)
    int min_experts = -1;
    float thresh_experts = 0;

    bool input_prefix_bos = false; // prefix BOS to user inputs, preceding input_prefix
    bool ignore_eos = false; // ignore generated EOS tokens
    bool logits_all = false; // return logits for all tokens in the batch
    bool use_mmap = true; // use mmap for faster loads
    bool use_mlock = false; // use mlock to keep model in memory
    bool verbose_prompt = false; // print prompt tokens before generation
    bool display_prompt = true; // print prompt before generation
    bool infill = false; // use infill mode
    bool dump_kv_cache = false; // dump the KV cache contents for debugging purposes
    bool no_kv_offload = false; // disable KV offloading
    bool warmup = true; // warmup run
    bool batch_warmup = false; // batch warmup run
    bool check_tensors = false; // validate tensor data
    bool repack_tensors = false; // repack tensors if interleaved variant is available
    bool use_thp = false; // use transparent huge pages (linux only)
    bool validate_quants = false; // if true, check for NaNs while loading the model
    bool only_active_exps = true; // if true, offload only active experts (relevant only for hybrid CPU/GPU)

    std::string cache_type_k = "f16"; // KV cache data type for the K
    std::string cache_type_v = "f16"; // KV cache data type for the V
    std::string cache_type_k_draft = ""; // KV cache data type for K for the draft model
    std::string cache_type_v_draft = ""; // KV cache data type for V for the draft model

    // multimodal models (see examples/mtmd)
    model_paths mmproj;
    bool mmproj_use_gpu = true; // use GPU for multimodal model
    bool no_mmproj = false; // explicitly disable multimodal model
    std::vector<std::string> image; // path to image file(s)

    // embedding
    bool embedding = false; // get only sentence embedding
    int32_t embd_normalize = 2; // normalisation for embeddings (-1=none, 0=max absolute int16, 1=taxicab, 2=euclidean, >2=p-norm)
    std::string embd_out = ""; // empty = default, "array" = [[],[]...], "json" = openai style, "json+" = same "json" + cosine similarity matrix
    std::string embd_sep = "\n"; // separator of embeddings

    // server params
    int32_t port = 8080; // server listens on this network port
    int32_t timeout_read = 600; // http read timeout in seconds
    int32_t timeout_write = timeout_read; // http write timeout in seconds
    int32_t n_threads_http = -1; // number of threads to process HTTP requests
    bool send_done = false; // send done message as required for OAI compatibility

    std::string hostname = "127.0.0.1";
    std::string public_path = "";
    std::string chat_template = "";
    bool use_jinja = false; // NOLINT
    std::string system_prompt = "";
    bool enable_chat_template = true;
    common_reasoning_format reasoning_format = COMMON_REASONING_FORMAT_DEEPSEEK;
    int reasoning_budget = -1;
    bool prefill_assistant = true;

    std::vector<std::string> api_keys;

    std::string ssl_file_key = "";
    std::string ssl_file_cert = "";

    std::map<std::string, std::string> default_template_kwargs;

    // "advanced" endpoints are disabled by default for better security
    common_webui webui = COMMON_WEBUI_AUTO;
    bool endpoint_slots = true;
    bool endpoint_props = false; // only controls POST requests, not GET
    bool endpoint_metrics = false;

    bool log_json = false;

    std::string slot_save_path;
    std::string sql_save_file;
    std::string sqlite_zstd_ext_file;

    float slot_prompt_similarity = 0.5f;

    // batched-bench params
    bool is_pp_shared = false;

    std::vector<int32_t> n_pp;
    std::vector<int32_t> n_tg;
    std::vector<int32_t> n_pl;

    // retrieval params
    std::vector<std::string> context_files; // context files to embed

    int32_t chunk_size = 64; // chunk size for context embedding

    std::string chunk_separator = "\n"; // chunk separator for context embedding

    // passkey params
    int32_t n_junk = 250; // number of times to repeat the junk text
    int32_t i_pos = -1; // position of the passkey in the junk text

    // imatrix params
    std::string out_file = "imatrix.dat"; // save the resulting imatrix to this file
    std::string output_tensor_name = "output.weight"; // name of the output tensor

    int32_t n_out_freq = 10; // output the imatrix every n_out_freq iterations
    int32_t n_save_freq = 0; // save the imatrix every n_save_freq iterations
    int32_t i_chunk = 0; // start processing from this chunk

    bool process_output = false; // collect data for the output tensor
    bool compute_ppl = true; // whether to compute perplexity

    // cvector-generator params
    int n_pca_batch = 100;
    int n_pca_iterations = 1000;
    dimre_method cvector_dimre_method = DIMRE_METHOD_PCA;
    std::string cvector_outfile = "control_vector.gguf";
    std::string cvector_positive_file = "examples/cvector-generator/positive.txt";
    std::string cvector_negative_file = "examples/cvector-generator/negative.txt";

    bool spm_infill = false; // suffix/prefix/middle pattern for infill

    std::string lora_outfile = "ggml-lora-merged-f16.gguf";

    bool sweep_bench_output_jsonl = false;
};

void gpt_params_handle_hf_token(gpt_params & params);
void gpt_params_handle_model_default(gpt_params & params);

bool gpt_params_parse_ex (int argc, char ** argv, gpt_params & params);
bool gpt_params_parse (int argc, char ** argv, gpt_params & params);
bool gpt_params_find_arg (int argc, char ** argv, const std::string & arg, gpt_params & params, int & i, bool & invalid_param);
void gpt_params_print_usage(int argc, char ** argv, const gpt_params & params);

std::string gpt_params_get_system_info(const gpt_params & params);
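
// Example (illustrative sketch, not a prescribed pattern): a typical program entry point parses
// the command line into a gpt_params instance and reports the system info before doing any work.
//
//     int main(int argc, char ** argv) {
//         gpt_params params;
//         if (!gpt_params_parse(argc, argv, params)) {
//             return 1;
//         }
//         fprintf(stderr, "%s\n", gpt_params_get_system_info(params).c_str());
//         // ... continue with llama_init_from_gpt_params(params), see Model utils below ...
//     }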

//
// String utils
//
std::string string_join(const std::vector<std::string>& values, const std::string& separator);
std::string string_strip(const std::string & str);
std::string string_get_sortable_timestamp();

static bool string_starts_with(const std::string& str,
                               const std::string& prefix) { // While we wait for C++20's std::string::starts_with...
    return str.rfind(prefix, 0) == 0;
}

std::vector<std::string> string_split(const std::string& str, const std::string& delimiter);
std::vector<std::string> string_split(const std::string& str, char delim);

void string_replace_all(std::string & s, const std::string & search, const std::string & replace);
// While we wait for C++20's std::string::ends_with...
bool string_ends_with(const std::string_view& str, const std::string_view& suffix);
size_t string_find_partial_stop(const std::string_view& str, const std::string_view& stop);

std::string regex_escape(const std::string& s);

template<class T>
static std::vector<T> string_split(const std::string & str, char delim) {
    std::vector<T> values;
    std::istringstream str_stream(str);
    std::string token;
    while (std::getline(str_stream, token, delim)) {
        T value;
        std::istringstream token_stream(token);
        token_stream >> value;
        values.push_back(value);
    }
    return values;
}

template<>
std::vector<std::string> string_split<std::string>(const std::string& input, char separator)
{
    std::vector<std::string> parts;
    size_t begin_pos = 0;
    size_t separator_pos = input.find(separator);
    while (separator_pos != std::string::npos) {
        std::string part = input.substr(begin_pos, separator_pos - begin_pos);
        parts.emplace_back(part);
        begin_pos = separator_pos + 1;
        separator_pos = input.find(separator, begin_pos);
    }
    parts.emplace_back(input.substr(begin_pos, separator_pos - begin_pos));
    return parts;
}
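
// Example (illustrative): splitting comma-separated values with the helpers above.
//
//     std::vector<int>         counts = string_split<int>("10,20,30", ',');        // {10, 20, 30}
//     std::vector<std::string> names  = string_split<std::string>("a,b,,c", ',');  // {"a", "b", "", "c"}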

bool string_parse_kv_override(const char * data, std::vector<llama_model_kv_override> & overrides);
void string_process_escapes(std::string & input);

//
// Filesystem utils
//

bool fs_validate_filename(const std::string & filename);
bool fs_create_directory_with_parents(const std::string & path);

std::string fs_get_cache_directory();
std::string fs_get_cache_file(const std::string & filename);
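
// Example (illustrative sketch): resolving a per-user cache path; `fname` is a hypothetical
// file name supplied by the caller.
//
//     if (fs_validate_filename(fname)) {
//         const std::string cached = fs_get_cache_file(fname);
//         // ... download or open the file at `cached` ...
//     }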

//
// Model utils
//

struct llama_init_result {
    struct llama_model * model = nullptr;
    struct llama_context * context = nullptr;
    std::vector<llama_lora_adapter_container> lora_adapters;
};

struct llama_init_result llama_init_from_gpt_params(gpt_params & params);

struct llama_model_params llama_model_params_from_gpt_params (const gpt_params & params);
struct llama_context_params llama_context_params_from_gpt_params(const gpt_params & params);

struct llama_model * llama_load_model_from_url(const char * model_url, const char * path_model, const char * hf_token, const struct llama_model_params & params);
struct llama_model * llama_load_model_from_hf(const char * repo, const char * file, const char * path_model, const char * hf_token, const struct llama_model_params & params);

// clear LoRA adapters from context, then apply new list of adapters
void llama_lora_adapters_apply(struct llama_context * ctx, std::vector<llama_lora_adapter_container> & lora_adapters);
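
// Example (illustrative sketch): loading a model and context from parsed parameters.
// Cleanup uses llama_free/llama_free_model from llama.h; error handling is kept minimal here.
//
//     llama_init_result ir = llama_init_from_gpt_params(params);
//     if (ir.model == nullptr || ir.context == nullptr) {
//         die("failed to load the model");
//     }
//     // ... run inference with ir.context ...
//     llama_free(ir.context);
//     llama_free_model(ir.model);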

// Batch utils

void llama_batch_clear(struct llama_batch & batch);

void llama_batch_add(
        struct llama_batch & batch,
        llama_token id,
        llama_pos pos,
        const std::vector<llama_seq_id> & seq_ids,
        bool logits);
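
// Example (illustrative): filling a batch with prompt tokens for a single sequence, requesting
// logits only for the last token. llama_batch_init/llama_batch_free and llama_decode come from
// llama.h; `ctx` and `tokens` are assumed to exist.
//
//     llama_batch batch = llama_batch_init(512, 0, 1);
//     llama_batch_clear(batch);
//     for (size_t i = 0; i < tokens.size(); ++i) {
//         llama_batch_add(batch, tokens[i], (llama_pos) i, { 0 }, i == tokens.size() - 1);
//     }
//     llama_decode(ctx, batch);
//     llama_batch_free(batch);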

//
// Vocab utils
//

// tokenizes a string into a vector of tokens
// should work similar to Python's `tokenizer.encode`
std::vector<llama_token> llama_tokenize(
        const struct llama_context * ctx,
        const std::string & text,
        bool add_special,
        bool parse_special = false);

std::vector<llama_token> llama_tokenize(
        const struct llama_model * model,
        const std::string & text,
        bool add_special,
        bool parse_special = false);

// converts a single token into its text piece, optionally rendering special/control tokens
// should work similar to Python's `tokenizer.id_to_piece`
std::string llama_token_to_piece(
        const struct llama_context * ctx,
        llama_token token,
        bool special = true);

std::string llama_token_to_piece(
        const struct llama_model* model,
        llama_token token,
        bool special = true);

// detokenizes a vector of tokens into a string
// should work similar to Python's `tokenizer.decode`
// optionally renders special/control tokens
std::string llama_detokenize(
        llama_context * ctx,
        const std::vector<llama_token> & tokens,
        bool special = true);

// Uses the value from the model metadata if possible, otherwise
// defaults to true when model type is SPM, otherwise false.
bool llama_should_add_bos_token(const llama_model * model);
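
// Example (illustrative): a tokenize/detokenize round trip; `ctx` and `model` are assumed to
// come from llama_init_from_gpt_params().
//
//     const bool add_bos = llama_should_add_bos_token(model);
//     std::vector<llama_token> toks = llama_tokenize(ctx, "Hello world", add_bos);
//     for (llama_token t : toks) {
//         printf("%d -> '%s'\n", t, llama_token_to_piece(ctx, t).c_str());
//     }
//     const std::string text = llama_detokenize(ctx, toks);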

//
// Chat template utils
//
//struct common_tool_call {
// std::string name;
// std::string arguments;
// std::string id;
//};
//
//// same as llama_chat_message, but uses std::string
//struct common_chat_msg {
// std::string role;
// std::string content;
// std::vector<common_tool_call> tool_calls;
// std::string reasoning_content = "";
//};

//// Check if the template supplied via "--chat-template" is supported or not. Returns true if it's valid
//bool llama_chat_verify_template(const struct llama_model* , const std::string& tmpl, bool use_jinja);
//
//namespace minja {
// class chat_template;
//}
//
//typedef minja::chat_template common_chat_template;
//
//struct common_chat_templates {
// bool has_explicit_template; // Model had builtin template or template override was specified.
// std::unique_ptr<common_chat_template> template_default; // always set (defaults to chatml)
// std::unique_ptr<common_chat_template> template_tool_use;
//};
//
//
//// CPP wrapper for llama_chat_apply_template
//// If the built-in template is not supported, we default to chatml
//// If the custom "tmpl" is not supported, we throw an error
//std::string llama_chat_apply_template(
// const struct llama_model* model,
// const common_chat_template& tmpl,
// const std::vector< common_chat_msg>& chat,
// bool add_ass,
// bool use_jinja);
//
//// Format single message, while taking into account the position of that message in chat history
//std::string llama_chat_format_single(const struct llama_model* model,
// const common_chat_template& tmpl,
// const std::vector< common_chat_msg>& past_msg,
// const common_chat_msg& new_msg,
// bool add_ass,
// bool use_jinja);
//
//// Returns an example of formatted chat
//std::string llama_chat_format_example(const struct llama_model* model,
// const common_chat_template& tmpl, bool use_jinja);
//
//common_chat_templates llama_chat_templates_from_model(const struct llama_model* model, const std::string& chat_template_override);


//
// KV cache utils
//

// Dump the KV cache view with the number of sequences per cell.
void llama_kv_cache_dump_view(const llama_kv_cache_view & view, int row_size = 80);

// Dump the KV cache view showing individual sequences in each cell (long output).
void llama_kv_cache_dump_view_seqs(const llama_kv_cache_view & view, int row_size = 40);

//
// Embedding utils
//

void llama_embd_normalize(const float * inp, float * out, int n, int embd_norm = 2);

float llama_embd_similarity_cos(const float * embd1, const float * embd2, int n);
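
// Example (illustrative): normalizing two raw embeddings of dimension n_embd and comparing them;
// `a`, `b`, and `n_embd` are assumed to exist. The default embd_norm = 2 is euclidean (L2).
//
//     std::vector<float> a_norm(n_embd), b_norm(n_embd);
//     llama_embd_normalize(a.data(), a_norm.data(), n_embd);
//     llama_embd_normalize(b.data(), b_norm.data(), n_embd);
//     const float sim = llama_embd_similarity_cos(a_norm.data(), b_norm.data(), n_embd);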

//
// Control vector utils
//

struct llama_control_vector_data {
    int n_embd;

    // stores data for layers [1, n_layer] where n_layer = data.size() / n_embd
    std::vector<float> data;
};

struct llama_control_vector_load_info {
    float strength;

    std::string fname;
};

// Load control vectors, scale each by strength, and add them together.
// On error, returns {-1, empty}
llama_control_vector_data llama_control_vector_load(const std::vector<llama_control_vector_load_info> & load_infos);
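
// Example (illustrative): loading two control vectors with different strengths; the file names
// are placeholders. A negative n_embd in the result signals a load error.
//
//     std::vector<llama_control_vector_load_info> infos = {
//         { 0.8f, "control_vector_a.gguf" },
//         { 0.4f, "control_vector_b.gguf" },
//     };
//     llama_control_vector_data cvec = llama_control_vector_load(infos);
//     if (cvec.n_embd == -1) { /* load failed */ }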

//
// Split utils
//

static const char * const LLM_KV_SPLIT_NO = "split.no";
static const char * const LLM_KV_SPLIT_COUNT = "split.count";
static const char * const LLM_KV_SPLIT_TENSORS_COUNT = "split.tensors.count";

//
// YAML utils
//

void yaml_dump_vector_float (FILE * stream, const char * prop_name, const std::vector<float> & data);
void yaml_dump_vector_int (FILE * stream, const char * prop_name, const std::vector<int> & data);
void yaml_dump_string_multiline(FILE * stream, const char * prop_name, const char * data);

void yaml_dump_non_result_info(
    FILE * stream, const gpt_params & params, const llama_context * lctx,
    const std::string & timestamp, const std::vector<int> & prompt_tokens, const char * model_desc);

std::string string_format(const char* fmt, ...);