Commit Graph

1092 Commits

Author SHA1 Message Date
hksdpc255
d24ea9e48e server : add Anthropic Messages API support (#1012) 2025-11-29 07:24:59 +01:00
Kawrakow
45cd1a70f5 Fix llama-bench mla parameter (#1016)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-27 09:33:30 +01:00
firecoperana
0b126b2ca6 Fix prompt tokenization issue during prompt processing (#1008)
* Find common tokens between prompt and cache
Fix wrong context size usage for mtmd
Use start position of common part
server: handle context shift

* Add size check for inexact match

* Change

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-26 10:34:26 +01:00
firecoperana
07d08e15ad webui update (#1003)
webui: add system message in export conversation, support upload conversation with system message
Webui: show upload only when in new conversation
Webui: Add model name
webui: increase height of chat message window when clicking editing
Webui: autoclose settings dialog dropdown and maximze screen width when zoom in
webui: fix date issues and add more dates
webui: change error to toast.error.
server: add n_past and slot_id in props_simple
webui: add cache tokens, context and prompt speed in chat
webui: modernize ui
webui: change welcome message
webui: change speed display
webui: change run python icon
webui: add config to use server defaults for sampler
webui: put speed on left and context on right

webui: recognize AsciiDoc files as valid text files (#16850)

* webui: recognize AsciiDoc files as valid text files

* webui: add an updated static webui build

* webui: add the updated dependency list

* webui: re-add an updated static webui build

Add a setting to display message generation statistics (#16901)

* feat: Add setting to display message generation statistics

* chore: build static webui output

webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe (#16757)

* webui: add HTML/JS preview support to MarkdownContent with sandboxed iframe dialog

Extended MarkdownContent to flag previewable code languages,
add a preview button alongside copy controls, manage preview
dialog state, and share styling for the new button group

Introduced CodePreviewDialog.svelte, a sandboxed iframe modal
for rendering HTML/JS previews with consistent dialog controls

* webui: fullscreen HTML preview dialog using bits-ui

* Update tools/server/webui/src/lib/components/app/misc/CodePreviewDialog.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/misc/MarkdownContent.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: pedantic style tweak for CodePreviewDialog close button

* webui: remove overengineered preview language logic

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

webui: auto-refresh /props on inference start to resync model metadata (#16784)

* webui: auto-refresh /props on inference start to resync model metadata

- Add no-cache headers to /props and /slots
- Throttle slot checks to 30s
- Prevent concurrent fetches with promise guard
- Trigger refresh from chat streaming for legacy and ModelSelector
- Show dynamic serverWarning when using cached data

* fix: restore proper legacy behavior in webui by using unified /props refresh

Updated assistant message bubbles to show each message's stored model when available,
falling back to the current server model only when the per-message value is missing

When the model selector is disabled, now fetches /props and prioritizes that model name
over chunk metadata, then persists it with the streamed message so legacy mode properly
reflects the backend configuration

* fix: detect first valid SSE chunk and refresh server props once

* fix: removed the slots availability throttle constant and state

* webui: purge ai-generated cruft

* chore: update webui static build

feat(webui): improve LaTeX rendering with currency detection (#16508)

* webui : Revised LaTeX formula recognition

* webui : Further examples containg amounts

* webui : vitest for maskInlineLaTeX

* webui: Moved preprocessLaTeX to lib/utils

* webui: LaTeX in table-cells

* chore: update webui build output (use theirs)

* webui: backslash in LaTeX-preprocessing

* chore: update webui build output

* webui: look-behind backslash-check

* chore: update webui build output

* Apply suggestions from code review

Code maintenance (variable names, code formatting, string handling)

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: Moved constants to lib/constants.

* webui: package woff2 inside base64 data

* webui: LaTeX-line-break in display formula

* chore: update webui build output

* webui: Bugfix (font embedding)

* webui: Bugfix (font embedding)

* webui: vite embeds assets

* webui: don't suppress 404 (fonts)

* refactor: KaTeX integration with SCSS

Moves KaTeX styling to SCSS for better customization and font embedding.

This change includes:
- Adding `sass` as a dev dependency.
- Introducing a custom SCSS file to override KaTeX variables and disable TTF/WOFF fonts, relying solely on WOFF2 for embedding.
- Adjusting the Vite configuration to resolve `katex-fonts` alias and inject SCSS variables.

* fix: LaTeX processing within blockquotes

* webui: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

server : add props.model_alias (#16943)

* server : add props.model_alias

webui: fix keyboard shortcuts for new chat & edit chat title (#17007)

Better UX for handling multiple attachments in WebUI (#17246)

webui: add OAI-Compat Harmony tool-call streaming visualization and persistence in chat UI (#16618)

* webui: add OAI-Compat Harmony tool-call live streaming visualization and persistence in chat UI

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to toggle tool call visibility to the Developer
  settings section directly under the model selector option

* webui: remove scroll listener causing unnecessary layout updates (model selector)

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: npm run format & update webui build output

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

webui: Fix clickability around chat processing statistics UI (#17278)

* fix: Better pointer events handling in chat processing info elements

* chore: update webui build output

Fix merge error

webui: Add a "Continue" Action for Assistant Message (#16971)

* feat: Add "Continue" action for assistant messages

* feat: Continuation logic & prompt improvements

* chore: update webui build output

* feat: Improve logic for continuing the assistant message

* chore: update webui build output

* chore: Linting

* chore: update webui build output

* fix: Remove synthetic prompt logic, use the prefill feature by sending the conversation payload ending with assistant message

* chore: update webui build output

* feat: Enable "Continue" button based on config & non-reasoning model type

* chore: update webui build output

* chore: Update packages with `npm audit fix`

* fix: Remove redundant error

* chore: update webui build output

* chore: Update `.gitignore`

* fix: Add missing change

* feat: Add auto-resizing for Edit Assistant/User Message textareas

* chore: update webui build output

Improved file naming & structure for UI components (#17405)

* refactor: Component iles naming & structure

* chore: update webui build output

* refactor: Dialog titles + components namig

* chore: update webui build output

* refactor: Imports

* chore: update webui build output

webui: hide border of button

webui: update

webui: update

webui: update

add vision

webui: minor settings reorganization and add disable autoscroll option (#17452)

* webui: added a dedicated 'Display' settings section that groups visualization options

* webui: added a Display setting to toggle automatic chat scrolling

* chore: update webui build output

Co-authored-by: firecoperana <firecoperana>
2025-11-24 07:03:45 +01:00
gapeleon
f6163dd58f Fix: Register missing /apply-template endpoint (#999)
The handle_apply_template handler was defined but never registered as
  an HTTP endpoint, causing 404 errors when calling POST /apply-template.

  This commit adds the missing endpoint registration to match the
  functionality available in llama.cpp mainline.

Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
2025-11-24 06:53:15 +01:00
Yap Sok Ann
7505165dee Fix truncated logprobs when streaming is off (#998)
The logic to skip the logprobs of the stop token was originally from
ggml-org/llama.cpp#2849, and was later modified as part of
ggml-org/llama.cpp#10643 to be applied only to STOP_TYPE_WORD.

The latter change wasn't included in #723. Then, after #958 got merged,
the logic got inadvertently applied to GLM-4.5/4.6 and Kimi K2,
resulting in truncated logprobs when streaming is off.

This commit reverts the logic from ggml-org/llama.cpp#2849, such that
the logprobs of the stop token will always be included in the response,
when logprobs is enabled. From testing, this matches with the behavior
of Fireworks inference server, for both chat completions and text
completions endpoints.

Also fix logprobs param handling for the text completion endpoint.
2025-11-24 06:52:15 +01:00
Kawrakow
0f6986a33c Disable split mode "row" (#987)
* Disable split mode "row"

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-19 16:15:50 +01:00
firecoperana
bacb8fb79f Server: Handle context shift better to reduce prompt processing time (#973)
* Handle context shift better to reduce pp

Add context-shift args

Add back ga_n in context shift

* optimize discard function and bring back n_keep = -1

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-19 16:04:48 +01:00
firecoperana
b40d11b22d Fix kv cache save and load for GLM model (#965)
Co-authored-by: firecoperana <firecoperana>
2025-11-15 17:04:16 +02:00
firecoperana
5ec0def0ef Fix compiler warnings (#963)
* Fix changes meaning warnings

* A couple of more warnings and formatting

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-15 07:07:15 +02:00
firecoperana
bb358223cd server: cache prompt to host memory (#954)
* server : host-memory prompt caching

change similarity calculation and prompt save conditions

Remove unneeded token limit

rename variable

Separate prompt save and load logic

change default values

change log

remove truncate prompt logic

* add description

* bug fixes

* remove token limit in init

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-14 18:40:13 +02:00
firecoperana
177b5d2a47 Fix cuda init error in rpc (#957)
Co-authored-by: firecoperana <firecoperana>
2025-11-14 06:59:54 +02:00
Kawrakow
6b9d1bf4b4 Graph reuse (#947)
* Add mainline compatible FA command line option

* Graph reuse: add command line argument to turn it on

* WIP

* This seems to work

* This is perhaps cleaner

* Change the command line option to -gr

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:19 +02:00
Kawrakow
88c02fa108 Set default MLA to 3 also in llama-bench (#949)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-13 09:52:06 +02:00
Kawrakow
25cd985c9b Add --n-cpu-moe to llama_bench (#937)
* Add --n-cpu-moe to llama_banch

* Add usage

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:44:59 +02:00
Kawrakow
121ed91165 Add rcache to llama-bench (#936)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-11 08:06:18 +02:00
firecoperana
eea6cc4433 Server: Add --draft-params to set draft model parameter via command line args (#932)
* Add command line argument for draft model

* Remove second context of draft model

* Format print

* print usage if parsing -draft fails

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-10 09:51:07 +02:00
firecoperana
73c28dbef4 server: bug fix for preserved_tokens not preserved in process_token (#926)
Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:29 +02:00
firecoperana
b63309a918 Fix embedding missing, CORS and crash using verbose in server (#924)
* server: fix crash when prompt has image and is too long

* server: fix CORS

* server: fix empty result for embedding

* change error message to truncate prompt

* server: fix slot id for save and load state

* bug fix

* server: update slot similarity to handle mtmd

* server: quick hack to calculate number of token processed with image

* server: fix out of range error when detokenizing prompt under verbose

* Add back Access-Control-Allow-Origin

* Server: Add prompt tokens in embedding results

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-09 14:16:03 +02:00
Nexes the Elder
f9a411e5db More informative PPL readout line (#914)
* More informative PPL readout line

* trailing whitespace..
2025-11-07 16:41:24 +02:00
Kawrakow
532a05e466 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Kawrakow
66ef68bc14 Fix compiler warning 2025-11-06 07:12:07 +02:00
firecoperana
18f5a6caef Bug fixes for completions and prompt caching in server (#906)
* Bug fixes for completions and prompt caching in server

* Fix compiler warning about redefinition

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-06 07:10:51 +02:00
Kawrakow
e68f50be9a Allow quantization of ffn_gate_inp (#896)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:44:32 +02:00
firecoperana
7978f04996 Add vision support in llama-server (#901)
* server: add support for vision model
webui: add support for vision model

* server : remove hack for extra parallel slot#10187

* llama : fix KV shift for qwen2vl #13870

* add no-context-shift parameter

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-05 10:43:46 +02:00
Thireus ☠
86597623a5 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sysl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
efcb5f9d9e sweep-bench: be able to set TG tokens via -n (#897)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 14:39:30 +02:00
Kawrakow
cfb840379f Biased mmvq: minor optimization (#880)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-31 14:21:18 +02:00
jarrodfeaks
92517e74ad fix v1/chat/completions assistant prefill (#874) 2025-10-29 17:21:05 +02:00
Nexes the Elder
e68dabc242 A few server commits from mainline. (#872)
server : handle models with missing EOS token (#8997)

server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment

server : init stop and error fields of the result struct (#9026)

server : fix duplicated n_predict key in the generation_settings (#8994)

server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var

server : add some missing env variables (#9116)
* server : add some missing env variables
* add LLAMA_ARG_HOST to server dockerfile
* also add LLAMA_ARG_CONT_BATCHING

Credits are to the respective authors.
Not a single merge conflict occurred.
Compiled, then tested without bug.
2025-10-28 09:58:31 +02:00
firecoperana
904e994bfb Support --device and --device-draft parameter (#866)
* add --device and --device-draft parameter

* don't print debug message in release mode

* fix

* bug fix to throw exception when no device specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
firecoperana
bf991ba60a Add --webui arg to launch llama.cpp new webui (#786)
* Add new webui from llama.cpp

* Add new webui

* feat: Improve mobile UI for Settings Dialog (#16084)

* feat: Improve mobile UI for Settings Dialog

* chore: update webui build output

* fix: Linting errors

* chore: update webui build output
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsFields.svelte
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsSection.svelte
#	tools/server/public/index.html.gz

* webui : fix handling incomplete chunks (#16107)

* Always show message actions for mobile UI + improvements for user message sizing (#16076)

# Conflicts:
#	.gitignore
#	examples/server/webui_llamacpp/package.json
#	examples/server/webui_llamacpp/scripts/dev.sh
#	tools/server/webui/scripts/post-build.sh

* webui: switch to hash-based routing (alternative of #16079) (#16157)

* Switched web UI to hash-based routing

* Added hash to missed goto function call

* Removed outdated SPA handling code

* Fixed broken sidebar home link
# Conflicts:
#	examples/server/webui_llamacpp/src/routes/+layout.ts
#	tools/server/server.cpp

* Allow viewing conversations even when llama server is down (#16255)

* webui: allow viewing conversations and sending messages even if llama-server is down

- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.

* feat: Add UI for `props` endpoint unavailable + cleanup logic

* webui: extend cached props fallback to offline errors

Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.

* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages

e.g., to restart the backend over llama-swap, even while cached /props data is in use

* chore: update webui build output

---------

Co-authored-by: Pascal <admin@serveurperso.com>
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatScreen/ChatScreenWarning.svelte
#	examples/server/webui_llamacpp/src/lib/constants/localstorage-keys.ts

* Enhance text file detection logic for file attachments (#16199)

* feat: Enhances text file detection logic

* chore: Build static `webui` output

* chore: update webui build output
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/constants/binary-detection.ts

* Show message actions by default (#16289)

* fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312)

* Improve Mobile UI for dialogs and action dropdowns (#16222)

* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <admin@serveurperso.com>

* Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Chatapi ignore empty sampling (#16330)

* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output

* webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363)

* Add optional setting for showing "Model used:" information (#16337)

* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output

* Improve code block color theming (#16325)

* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build

* Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build

* fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356)

Use <svelte:window bind:innerHeight> instead of manual resize listener

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui : Fix messages payload sent to chat completions (#16402)

* fix: Include just the currently active message branches instead of all in chat completions request

* chore: Build webui static output

* chore: Formatting

* chore: update webui build output

* Capture model name only after first token (streaming) or completed request (#16405)

* feat: Capture model name only after first token (streaming) or completed request (non-streaming)

* chore: update webui build output

* chore: update webui build output

* Fix missing messages on sibling navigation (#16408)

* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs

* chore: update webui build output

* chore: update webui build output

* webui : added download action (#13552) (#16282)

* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* webui : Updated static build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: centralize CoT parsing in backend for streaming mode (#16394)

* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing

- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages

* refactor: implement streaming-aware universal reasoning parser

Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.

- Rework try_parse_reasoning() to track whitespace, partial tags, and
  multiple reasoning segments, allowing proper separation of reasoning_content
  and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
  formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
  behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments

The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.

Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.

* refactor: address review feedback from allozaur

- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)

- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows

* refactor: address review feedback from ngxson

* debug: say goodbye to curl -N, hello one-click raw stream

- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story

- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example

* npm run format

* chat-parser: address review feedback from ngxson

Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	common/arg.cpp
#	examples/server/webui_llamacpp/src/lib/utils/thinking.ts
#	tools/server/README.md

* No markdown in cot (#16483)

* fix: let the model think in plaintext

* chore: npm run format + npm run build

* webui: updated the chat service to only include max_tokens in the req… (#16489)

* webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel

* chore: update webui build output

* feat: render user content as markdown option (#16358)

* feat: render user content as markdown option
- Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options
- Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle
- Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise
- chore: update webui build output

* chore: update webui build output

* webui: remove client-side context pre-check and rely on backend for limits (#16506)

* fix: make SSE client robust to premature [DONE] in agentic proxy chains

* webui: remove client-side context pre-check and rely on backend for limits

Removed the client-side context window pre-check and now simply sends messages
while keeping the dialog imports limited to core components, eliminating the
maximum context alert path

Simplified streaming and non-streaming chat error handling to surface a generic
'No response received from server' error whenever the backend returns no content

Removed the obsolete maxContextError plumbing from the chat store so state
management now focuses on the core message flow without special context-limit cases

* webui: cosmetic rename of error messages

* Update tools/server/webui/src/lib/stores/chat.svelte.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/stores/chat.svelte.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/components/app/dialogs/ChatErrorDialog.svelte
#	examples/server/webui_llamacpp/src/lib/components/app/dialogs/MaximumContextAlertDialog.svelte
#	examples/server/webui_llamacpp/src/lib/services/context.ts

* fix: add remark plugin to render raw HTML as literal text (#16505)

* fix: add remark plugin to render raw HTML as literal text

Implemented a missing MDAST stage to neutralize raw HTML like major LLM WebUIs
do ensuring consistent and safe Markdown rendering

Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure

Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization

Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required

Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify

* fix: address review feedback from allozaur

* chore: update webui build output
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/constants/literal-html.ts

* Add server-driven parameter defaults and syncing (#16515)

# Conflicts:
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ParameterSourceIndicator.svelte
#	examples/server/webui_llamacpp/src/lib/constants/precision.ts
#	examples/server/webui_llamacpp/src/lib/services/parameter-sync.spec.ts
#	examples/server/webui_llamacpp/src/lib/services/parameter-sync.ts
#	examples/server/webui_llamacpp/src/lib/utils/config-helpers.ts
#	examples/server/webui_llamacpp/src/lib/utils/precision.ts

* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters (#16599)

* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters

So inline and block equations are converted before KaTeX rendering,
enabling proper display of model-generated LaTeX in the WebUI

* chore: update webui build output

* webui: reorganize settings layout (#16607)

* webui: reorganize settings layout

* chore: update webui build output

* fix: remove unused variable

* chore: update webui build output

* Enable per-conversation loading states to allow having parallel conversations (#16327)

* feat: Per-conversation loading states and tracking streaming stats

* chore: update webui build output

* refactor: Chat state management

Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.

This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.

Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.

* feat: Adds loading indicator to conversation items

* chore: update webui build output

* fix: Fix aborting chat streaming

Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.

This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks are now checked for abort signals to prevent further processing after abortion.

* refactor: Remove redundant comments

* chore: build webui static output

* refactor: Cleanup

* chore: update webui build output

* chore: update webui build output

* fix: Conversation loading indicator for regenerating messages

* chore: update webui static build

* feat: Improve configuration

* feat: Install `http-server` as dev dependency to not need to rely on `npx` in CI

* Import/Export UX improvements (#16619)

* webui : added download action (#13552)

* webui : import and export (for all conversations)

* webui : fixed download-format, import of one conversation

* webui : add ExportedConversations type for chat import/export

* feat: Update naming & order

* chore: Linting

* feat: Import/Export UX improvements

* chore: update webui build output

* feat: Update UI placement of Import/Export tab in Chat Settings Dialog

* refactor: Cleanup

chore: update webui build output

* feat: Enable shift-click multiple conversation items selection

* chore: update webui static build

* chore: update webui static build

---------

Co-authored-by: Sascha Rogmann <github@rogmann.org>
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ConversationSelectionDialog.svelte
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ImportExportTab.svelte
#	examples/server/webui_llamacpp/src/lib/utils/conversation-utils.ts

* Prevent premature submission on IME input (#16673)

* fix: Prevent premature submission on IME input

* chore: update webui static build

* refactor: Put IME completion checker in a helper function and add checking for `KeyboardEvent.eventKey === 229`

* chore: update webui static build

* chore: update webui static build

* chore: update webui static build
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/utils/is-ime-composing.ts

* Handle legacy 'context' attachments (#16687)

* webui: introduce OpenAI-compatible model selector in JSON payload (#16562)

* webui: introduce OpenAI-compatible model selector in JSON payload

* webui: restore OpenAI-Compatible model source of truth and unify metadata capture

This change re-establishes a single, reliable source of truth for the active model:
fully aligned with the OpenAI-Compat API behavior

It introduces a unified metadata flow that captures the model field from both
streaming and non-streaming responses, wiring a new onModel callback through ChatService
The model name is now resolved directly from the API payload rather than relying on
server /props or UI assumptions

ChatStore records and persists the resolved model for each assistant message during
streaming, ensuring consistency across the UI and database
Type definitions for API and settings were also extended to include model metadata
and the onModel callback, completing the alignment with OpenAI-Compat semantics

* webui: address review feedback from allozaur

* webui: move model selector into ChatForm (idea by @allozaur)

* webui: make model selector more subtle and integrated into ChatForm

* webui: replaced the Flowbite selector with a native Svelte dropdown

* webui: add developer setting to toggle the chat model selector

* webui: address review feedback from allozaur

Normalized streamed model names during chat updates
by trimming input and removing directory components before saving
or persisting them, so the conversation UI shows only the filename

Forced model names within the chat form selector dropdown to render as
a single-line, truncated entry with a tooltip revealing the full name

* webui: toggle displayed model source for legacy vs OpenAI-Compat modes

When the selector is disabled, it falls back to the active server model name from /props

When the model selector is enabled, the displayed model comes from the message metadata
(the one explicitly selected and sent in the request)

* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormActions.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/constants/localstorage-keys.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/services/chat.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/services/chat.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: refactor model selector and persistence helpers

- Replace inline portal and event listeners with proper Svelte bindings
- Introduce 'persisted' store helper for localStorage sync without runes
- Extract 'normalizeModelName' utils + Vitest coverage
- Simplify ChatFormModelSelector structure and cleanup logic

Replaced the persisted store helper's use of '$state/$effect' runes with
a plain TS implementation to prevent orphaned effect runtime errors
outside component context

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: document normalizeModelName usage with inline examples

* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/stores/models.svelte.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* Update tools/server/webui/src/lib/stores/models.svelte.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: extract ModelOption type into dedicated models.d.ts

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* webui: refine ChatMessageAssistant displayedModel source logic

* webui: stabilize dropdown, simplify model extraction, and init assistant model field

* chore: update webui static build

* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: npm format, update webui static build

* webui: align sidebar trigger position, remove z-index glitch

* chore: update webui build output

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
# Conflicts:
#	examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
#	examples/server/webui_llamacpp/src/lib/services/models.ts
#	examples/server/webui_llamacpp/src/lib/stores/models.svelte.ts
#	examples/server/webui_llamacpp/src/lib/stores/persisted.svelte.ts
#	examples/server/webui_llamacpp/src/lib/types/models.d.ts
#	examples/server/webui_llamacpp/src/lib/utils/model-names.test.ts
#	examples/server/webui_llamacpp/src/lib/utils/model-names.ts
#	examples/server/webui_llamacpp/src/lib/utils/portal-to-body.ts

* webui: support q URL parameter (#16728)

* webui: support q URL parameter

Fixes #16722
I’ve checked that it works with Firefox’s AI tools

* webui: apply suggestions from code review

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* chore: update webui static build

---------

Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>

* build fix

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Quentin Bramas <quentin.bramas@gmail.com>
Co-authored-by: Isaac McFadyen <isaac@imcf.me>
Co-authored-by: Pascal <admin@serveurperso.com>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sascha Rogmann <github@rogmann.org>
Co-authored-by: Florian Badie <florianbadie@odrling.xyz>
2025-10-27 14:22:02 +02:00
firecoperana
40eabe5dd8 fix bugs (#870)
Co-authored-by: firecoperana <firecoperana>
2025-10-27 14:17:48 +02:00
Kawrakow
41d6c42b96 Change flash attention and fmoe to be on by default (#863)
* Change fmoe to be on by default

* Change default fmoe also in llama-bench

* Change flash attention to be on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-25 09:37:28 +03:00
Kawrakow
36f9601e8d Make ooae on by default and add to llama-bench (#842)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-20 08:32:41 +03:00
Kawrakow
cde642e591 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attemt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
Viktor Ivakin
2b71974af9 Fix incomplete utf-8 characters in streaming text completions (#810) 2025-10-13 16:25:29 +03:00
AesSedai
23275ac066 Remove duplicate 99% KLD output, add additional percentiles to match mainline (#817) 2025-10-05 07:13:32 +02:00
Kawrakow
e2f21c8dc8 Move minja and nlohmann/json to vendor (#802)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 09:12:35 +02:00
Kawrakow
c1a0e15377 Port mdmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Yap Sok Ann
4f9b0ec4f0 Fix logprobs (#787)
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus
optimization to do partial sort when sorting the logits.

That mainline PR and friends were partially cherry-picked by #723, but
wasn't really in a working state yet.

A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
  done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that there is a TG performance penalty to return the logprobs, as
we need to sort the logits. By doing partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:

This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)

If cherry-pick the full sort from mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-25 15:43:30 +02:00
firecoperana
17f7f1ed18 Update webui to handle reasoning content and include usage stats in server only when requested (#791)
* handle reasoning content in webui
server : include usage statistics only when user request them (#16052)
server : only attempt to enable thinking if using jinja (#15967)

* config reasoning_content in webui and change default to auto

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-24 07:45:09 +02:00
firecoperana
a6da22beb2 Deepseek V3.1 native tool calling support (OpenAI Style) (#771) 2025-09-13 07:51:40 +02:00
firecoperana
8403308d8e fix v1 completions streaming mode (#768) 2025-09-09 15:38:12 +02:00
firecoperana
d7882c3cf8 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00
Kawrakow
8de297b795 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
saood06
7a68553487 Add mikupad to ik_llama as an alternative WebUI (#558)
* mikupad.html in ik_llama.cpp (functional but WIP)

* Remove hardcoded extension and add error handling to extension loading

* Update version number and add features array to version

* Make version endpoint always accessible

* Fix case with empty sql

* Add useful error message when launched without sql file

* Add sigma sampler

* Update sigma step and max based on docs

* Remove selectedSessionId and handle it with URL fragment

* Export All (code only, no UI)

* Add compression to server.cpp

* Major UI work (and also add update backend endpoints to accomadate)

* Finalize UI

* Fix visual bug

* fix merge conflict issue

* Pull in full sqlite_modern_cpp repo for the license as it is not attached to source files

* Make compression not show in sidebar if extension is not loaded

* Finalize build, Put support behing LLAMA_SERVER_SQLITE3: command not found build option, and update error message to include the build option is not passed situation

* Fix compile without flag on systems without it installed
2025-08-24 08:27:29 -05:00
Kawrakow
0b448997ec Does this fix #690? (#711)
* Does this fix #690?

* Another attempt

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 19:17:33 +03:00
Kawrakow
bc2cce5ce1 Disable experimental code that causes issues with MSVC (#707)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-19 18:09:49 +03:00
g2mt
06bed7e01b Port universal assisted decoding to llama-server (#699)
* port universal assisted decoding to server

* fix calls

* fix LOG_INFO

* fix llama_detokenize call

* use emplace_back
2025-08-18 09:22:23 +03:00