server : handle models with missing EOS token (#8997)
server : fix segfault on long system prompt (#8987)
* server : fix segfault on long system prompt
* server : fix parallel generation with very small batch sizes
* server : fix typo in comment
server : init stop and error fields of the result struct (#9026)
server : fix duplicated n_predict key in the generation_settings (#8994)
server : support reading arguments from environment variables (#9105)
* server : support reading arguments from environment variables
* add -fa and -dt
* readme : specify non-arg env var
server : add some missing env variables (#9116)
* server : add some missing env variables
* add LLAMA_ARG_HOST to server dockerfile
* also add LLAMA_ARG_CONT_BATCHING
Credits go to the respective authors.
Not a single merge conflict occurred.
Compiled and tested without issues.
* Fuse Q, K, V gemv+add
* More gemv+add fusing
* Faster copy when tensors are contiguous
Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)
* Cleanup
* Make sure the bias really is 1 row to use fusion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add new webui from llama.cpp
* Add new webui
* feat: Improve mobile UI for Settings Dialog (#16084)
* feat: Improve mobile UI for Settings Dialog
* chore: update webui build output
* fix: Linting errors
* chore: update webui build output
# Conflicts:
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsFields.svelte
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ChatSettingsSection.svelte
# tools/server/public/index.html.gz
* webui : fix handling incomplete chunks (#16107)
* Always show message actions for mobile UI + improvements for user message sizing (#16076)
# Conflicts:
# .gitignore
# examples/server/webui_llamacpp/package.json
# examples/server/webui_llamacpp/scripts/dev.sh
# tools/server/webui/scripts/post-build.sh
* webui: switch to hash-based routing (alternative of #16079) (#16157)
* Switched web UI to hash-based routing
* Added hash to missed goto function call
* Removed outdated SPA handling code
* Fixed broken sidebar home link
# Conflicts:
# examples/server/webui_llamacpp/src/routes/+layout.ts
# tools/server/server.cpp
* Allow viewing conversations even when llama server is down (#16255)
* webui: allow viewing conversations and sending messages even if llama-server is down
- Cached llama.cpp server properties in browser localStorage on startup, persisting successful fetches and reloading them when refresh attempts fail so the chat UI continues to render while the backend is unavailable.
- Cleared the stored server properties when resetting the store to prevent stale capability data after cache-backed operation.
- Kept the original error-splash behavior when no cached props exist so fresh installs still surface a clear failure state instead of rendering stale data.
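A minimal TypeScript sketch of the caching behavior described above (the storage key and helper name are assumptions, not the actual webui code):
```ts
// Sketch only: cache the last successful /props response in localStorage so the
// chat UI can keep rendering while llama-server is unreachable.
const PROPS_CACHE_KEY = 'llamacpp.serverProps'; // hypothetical storage key

export async function fetchServerProps(baseUrl = ''): Promise<unknown | null> {
  try {
    const res = await fetch(`${baseUrl}/props`);
    if (!res.ok) throw new Error(`HTTP ${res.status}`);
    const props = await res.json();
    localStorage.setItem(PROPS_CACHE_KEY, JSON.stringify(props)); // persist successful fetches
    return props;
  } catch {
    const cached = localStorage.getItem(PROPS_CACHE_KEY); // reload the last good copy on failure
    return cached ? JSON.parse(cached) : null;            // null -> fresh install, show the error splash
  }
}
```
Connection failures and server 5xx responses both land in the catch branch, which is what lets the warning banner replace the full error screen whenever a cached copy exists.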
* feat: Add UI for `props` endpoint unavailable + cleanup logic
* webui: extend cached props fallback to offline errors
Treat connection failures (refused, DNS, timeout, fetch) the same way as
server 5xx so the warning banner shows up when cache is available, instead
of falling back to a full error screen.
* webui: Left the chat form enabled when a server warning is present so operators can keep sending messages
e.g., to restart the backend over llama-swap, even while cached /props data is in use
* chore: update webui build output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
# Conflicts:
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatScreen/ChatScreenWarning.svelte
# examples/server/webui_llamacpp/src/lib/constants/localstorage-keys.ts
* Enhance text file detection logic for file attachments (#16199)
* feat: Enhances text file detection logic
* chore: Build static `webui` output
* chore: update webui build output
# Conflicts:
# examples/server/webui_llamacpp/src/lib/constants/binary-detection.ts
* Show message actions by default (#16289)
* fix: preserved zero values in chat settings inputs and textareas by switching to nullish coalescing for field values and default placeholders (#16312)
* Improve Mobile UI for dialogs and action dropdowns (#16222)
* fix: Always show conversation item actions
* feat: Improve Alert Dialog and Dialog mobile UI
* feat: Add settings reset to default confirmation
* fix: Close Edit dialog on save
* chore: update webui build output
* webui: implement proper z-index system and scroll management
- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture
* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides
* feat: Use `dvh` instead of computed px height for dialogs max height on mobile
* chore: update webui build output
* feat: Improve Settings fields UI
* chore: update webui build output
* chore: update webui build output
---------
Co-authored-by: Pascal <admin@serveurperso.com>
* Fix thinking blocks with quotes + add handling `[THINK]...[/THINK]` blocks (#16326)
* fix: prevent reasoning blocks with quotes from being truncated
* chore: update webui build output
* feat: Improve thinking content parsing
* test: Adds ChatMessage component stories for different thinking blocks
* chore: update webui build output
* fix: ChatMessage story fix
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Chatapi ignore empty sampling (#16330)
* fix: skip empty sampling fields instead of coercing to 0 in chat API options
* chore: update webui build output
* webui: Remove running `llama-server` within WebUI `dev.sh` script (#16363)
* Add optional setting for showing "Model used:" information (#16337)
* feat: Add a setting to include model name used to generate the message
* feat: UI improvements
* feat: Save model info along with the database message entry creation
* chore: Build webui static output
* Improve code block color theming (#16325)
* feat: Improve code block theming
* chore: update webui build output
* chore: Update webui static build
* Conversation action dialogs as singletons from Chat Sidebar + apply conditional rendering for Actions Dropdown for Chat Conversation Items (#16369)
* fix: Render Conversation action dialogs as singletons from Chat Sidebar level
* chore: update webui build output
* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup
* chore: Update webui static build
* fix: Always truncate conversation names
* chore: Update webui static build
* fix: track viewportHeight via window.innerHeight to avoid unwanted scrolling (#16356)
Use <svelte:window bind:innerHeight> instead of manual resize listener
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui : Fix messages payload sent to chat completions (#16402)
* fix: Include just the currently active message branches instead of all in chat completions request
* chore: Build webui static output
* chore: Formatting
* chore: update webui build output
* Capture model name only after first token (streaming) or completed request (#16405)
* feat: Capture model name only after first token (streaming) or completed request (non-streaming)
* chore: update webui build output
* chore: update webui build output
* Fix missing messages on sibling navigation (#16408)
* fix: resolve message disappearing issue when navigating between regenerated siblings by using current leaf nodes instead of cached sibling IDs
* chore: update webui build output
* chore: update webui build output
* webui : added download action (#13552) (#16282)
* webui : added download action (#13552)
* webui : import and export (for all conversations)
* webui : fixed download-format, import of one conversation
* webui : add ExportedConversations type for chat import/export
* feat: Update naming & order
* chore: Linting
* webui : Updated static build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing
- Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing
- Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops
- Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic
- Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages
* refactor: implement streaming-aware universal reasoning parser
Remove the streaming mode limitation from --reasoning-format by refactoring
try_parse_reasoning() to handle incremental parsing of <think> tags across
all formats.
- Rework try_parse_reasoning() to track whitespace, partial tags, and
multiple reasoning segments, allowing proper separation of reasoning_content
and content in streaming mode
- Parse reasoning tags before tool call handling in content-only and Llama 3.x
formats to ensure inline <think> blocks are captured correctly
- Change default reasoning_format from 'auto' to 'deepseek' for consistent
behavior
- Add 'deepseek-legacy' option to preserve old inline behavior when needed
- Update CLI help and documentation to reflect streaming support
- Add parser tests for inline <think>...</think> segments
The parser now continues processing content after </think> closes instead of
stopping, enabling proper message.reasoning_content and message.content
separation in both streaming and non-streaming modes.
Fixes the issue where streaming responses would dump everything (including
post-thinking content) into reasoning_content while leaving content empty.
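The incremental splitting idea can be illustrated with a small TypeScript sketch; this is not the C++ implementation (the real parser is try_parse_reasoning(), which additionally handles whitespace, tags split across chunks, and multiple reasoning segments):
```ts
// Illustrative sketch only: route streamed text into reasoning_content while inside
// <think>...</think> and into content otherwise, and keep going after </think>.
interface Delta { reasoning_content: string; content: string }

function splitThinking(chunk: string, insideThink: boolean): { delta: Delta; insideThink: boolean } {
  const delta: Delta = { reasoning_content: '', content: '' };
  let rest = chunk;
  while (rest.length > 0) {
    if (insideThink) {
      const end = rest.indexOf('</think>');
      if (end === -1) { delta.reasoning_content += rest; break; }
      delta.reasoning_content += rest.slice(0, end);
      rest = rest.slice(end + '</think>'.length);
      insideThink = false; // text after </think> is regular content, not reasoning
    } else {
      const start = rest.indexOf('<think>');
      if (start === -1) { delta.content += rest; break; }
      delta.content += rest.slice(0, start);
      rest = rest.slice(start + '<think>'.length);
      insideThink = true;
    }
  }
  return { delta, insideThink };
}
```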
* refactor: address review feedback from allozaur
- Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component
- Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse
- Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed)
- store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block
- inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication
- repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows
* refactor: address review feedback from ngxson
* debug: say goodbye to curl -N, hello one-click raw stream
- adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: add Storybook example for raw LLM output and scope reasoning format toggle per story
- Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample
- Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example
* npm run format
* chat-parser: address review feedback from ngxson
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
# Conflicts:
# common/arg.cpp
# examples/server/webui_llamacpp/src/lib/utils/thinking.ts
# tools/server/README.md
* No markdown in cot (#16483)
* fix: let the model think in plaintext
* chore: npm run format + npm run build
* webui: updated the chat service to only include max_tokens in the req… (#16489)
* webui: updated the chat service to only include max_tokens in the request payload when the setting is explicitly provided, while still mapping explicit zero or null values to the infinite-token sentinel (sketched below)
* chore: update webui build output
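A sketch of the max_tokens handling referenced above, assuming -1 as the infinite-token sentinel; the function and field names are illustrative:
```ts
// Sketch: only include max_tokens when the setting was explicitly provided;
// map an explicit 0 or null to the infinite-token sentinel.
const INFINITE_TOKENS = -1; // assumed sentinel value

function applyMaxTokens(payload: Record<string, unknown>, maxTokens: number | null | undefined): void {
  if (maxTokens === undefined) return; // setting not provided: omit the field entirely
  payload.max_tokens = maxTokens === null || maxTokens === 0 ? INFINITE_TOKENS : maxTokens;
}
```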
* feat: render user content as markdown option (#16358)
* feat: render user content as markdown option
- Add a persisted 'renderUserContentAsMarkdown' preference to the settings defaults and info metadata so the choice survives reloads like other options
- Surface the new 'Render user content as Markdown' checkbox in the General section of the chat settings dialog, beneath the PDF toggle
- Render user chat messages with 'MarkdownContent' when the new setting is enabled, matching assistant formatting while preserving the existing card styling otherwise
- chore: update webui build output
* chore: update webui build output
* webui: remove client-side context pre-check and rely on backend for limits (#16506)
* fix: make SSE client robust to premature [DONE] in agentic proxy chains
* webui: remove client-side context pre-check and rely on backend for limits
Removed the client-side context window pre-check and now simply sends messages
while keeping the dialog imports limited to core components, eliminating the
maximum context alert path
Simplified streaming and non-streaming chat error handling to surface a generic
'No response received from server' error whenever the backend returns no content
Removed the obsolete maxContextError plumbing from the chat store so state
management now focuses on the core message flow without special context-limit cases
* webui: cosmetic rename of error messages
* Update tools/server/webui/src/lib/stores/chat.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/chat.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatScreen/ChatScreen.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
# Conflicts:
# examples/server/webui_llamacpp/src/lib/components/app/dialogs/ChatErrorDialog.svelte
# examples/server/webui_llamacpp/src/lib/components/app/dialogs/MaximumContextAlertDialog.svelte
# examples/server/webui_llamacpp/src/lib/services/context.ts
* fix: add remark plugin to render raw HTML as literal text (#16505)
* fix: add remark plugin to render raw HTML as literal text
Implemented a missing MDAST stage to neutralize raw HTML, as major LLM WebUIs
do, ensuring consistent and safe Markdown rendering
Introduced 'remarkLiteralHtml', a plugin that converts raw HTML nodes in the
Markdown AST into plain-text equivalents while preserving indentation and
line breaks. This ensures consistent rendering and prevents unintended HTML
execution, without altering valid Markdown structure
Kept 'remarkRehype' in the pipeline since it performs the required conversion
from MDAST to HAST for KaTeX, syntax highlighting, and HTML serialization
Refined the link-enhancement logic to skip unnecessary DOM rewrites,
fixing a subtle bug where extra paragraphs were injected after the first
line due to full innerHTML reconstruction, and ensuring links open in new
tabs only when required
Final pipeline: remarkGfm -> remarkMath -> remarkBreaks -> remarkLiteralHtml
-> remarkRehype -> rehypeKatex -> rehypeHighlight -> rehypeStringify
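A minimal sketch of the pipeline and the literal-HTML stage, assuming the standard unified/remark/rehype packages (the real plugin also preserves indentation and line breaks more carefully):
```ts
import { unified } from 'unified';
import remarkParse from 'remark-parse';
import remarkGfm from 'remark-gfm';
import remarkMath from 'remark-math';
import remarkBreaks from 'remark-breaks';
import remarkRehype from 'remark-rehype';
import rehypeKatex from 'rehype-katex';
import rehypeHighlight from 'rehype-highlight';
import rehypeStringify from 'rehype-stringify';
import { visit } from 'unist-util-visit';
import type { Root } from 'mdast';

// Sketch of remarkLiteralHtml: turn raw HTML nodes in the MDAST into plain text
// nodes so they render literally instead of being dropped or executed.
function remarkLiteralHtml() {
  return (tree: Root) => {
    visit(tree, 'html', (node: any) => {
      node.type = 'text'; // keep node.value; it is now escaped and shown as literal text
    });
  };
}

const processor = unified()
  .use(remarkParse)
  .use(remarkGfm)
  .use(remarkMath)
  .use(remarkBreaks)
  .use(remarkLiteralHtml)
  .use(remarkRehype)
  .use(rehypeKatex)
  .use(rehypeHighlight)
  .use(rehypeStringify);

const html = String(processor.processSync('<b>raw html</b> stays literal, **this** renders as bold.'));
```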
* fix: address review feedback from allozaur
* chore: update webui build output
# Conflicts:
# examples/server/webui_llamacpp/src/lib/constants/literal-html.ts
* Add server-driven parameter defaults and syncing (#16515)
# Conflicts:
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ParameterSourceIndicator.svelte
# examples/server/webui_llamacpp/src/lib/constants/precision.ts
# examples/server/webui_llamacpp/src/lib/services/parameter-sync.spec.ts
# examples/server/webui_llamacpp/src/lib/services/parameter-sync.ts
# examples/server/webui_llamacpp/src/lib/utils/config-helpers.ts
# examples/server/webui_llamacpp/src/lib/utils/precision.ts
* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters (#16599)
* fix: added a normalization step for MathJax-style \[\] and \(\) delimiters
So inline and block equations are converted before KaTeX rendering,
enabling proper display of model-generated LaTeX in the WebUI
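A sketch of the normalization step (helper name assumed), converting the MathJax-style delimiters into the `$`/`$$` forms that remark-math and KaTeX expect:
```ts
// Sketch: normalize \[...\] to $$...$$ (display math) and \(...\) to $...$ (inline math)
// before the Markdown/KaTeX pipeline runs.
export function normalizeMathDelimiters(text: string): string {
  return text
    .replace(/\\\[([\s\S]+?)\\\]/g, (_m, body) => `$$${body}$$`) // block equations
    .replace(/\\\((.+?)\\\)/g, (_m, body) => `$${body}$`);       // inline equations
}
```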
* chore: update webui build output
* webui: reorganize settings layout (#16607)
* webui: reorganize settings layout
* chore: update webui build output
* fix: remove unused variable
* chore: update webui build output
* Enable per-conversation loading states to allow having parallel conversations (#16327)
* feat: Per-conversation loading states and tracking streaming stats
* chore: update webui build output
* refactor: Chat state management
Consolidates loading state management by using a global `isLoading` store synchronized with individual conversation states.
This change ensures proper reactivity and avoids potential race conditions when updating the UI based on the loading status of different conversations. It also improves the accuracy of statistics displayed.
Additionally, slots service methods are updated to use conversation IDs for per-conversation state management, avoiding global state pollution.
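A minimal sketch of the per-conversation state idea (names assumed; the actual webui keeps this in Svelte stores):
```ts
// Sketch: track loading per conversation ID instead of a single shared boolean,
// and derive the global indicator from the individual states.
const loadingByConversation = new Map<string, boolean>();

export function setConversationLoading(conversationId: string, loading: boolean): void {
  if (loading) loadingByConversation.set(conversationId, true);
  else loadingByConversation.delete(conversationId);
}

export function isConversationLoading(conversationId: string): boolean {
  return loadingByConversation.get(conversationId) ?? false;
}

export function isAnyConversationLoading(): boolean {
  return loadingByConversation.size > 0; // global flag stays in sync with per-conversation states
}
```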
* feat: Adds loading indicator to conversation items
* chore: update webui build output
* fix: Fix aborting chat streaming
Improves the chat stream abortion process by ensuring that partial responses are saved before the abort signal is sent.
This avoids a race condition where the onError callback could clear the streaming state before the partial response is saved. Additionally, the stream reading loop and callbacks now check for abort signals to prevent further processing after abortion.
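The ordering described above, sketched with assumed names (save the partial response first, then abort, and have the read loop honor the signal):
```ts
// Sketch only: persist the partial response before the abort signal is sent,
// then stop reading as soon as the signal fires.
async function stopGeneration(controller: AbortController, savePartialResponse: () => Promise<void>): Promise<void> {
  await savePartialResponse(); // 1. save what has streamed so far
  controller.abort();          // 2. only then cancel the in-flight request
}

async function readStream(
  reader: ReadableStreamDefaultReader<Uint8Array>,
  signal: AbortSignal,
  onChunk: (chunk: Uint8Array) => void
): Promise<void> {
  while (!signal.aborted) {    // 3. loop and callbacks re-check the abort signal
    const { done, value } = await reader.read();
    if (done || signal.aborted) break;
    if (value) onChunk(value);
  }
}
```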
* refactor: Remove redundant comments
* chore: build webui static output
* refactor: Cleanup
* chore: update webui build output
* chore: update webui build output
* fix: Conversation loading indicator for regenerating messages
* chore: update webui static build
* feat: Improve configuration
* feat: Install `http-server` as a dev dependency so we don't need to rely on `npx` in CI
* Import/Export UX improvements (#16619)
* webui : added download action (#13552)
* webui : import and export (for all conversations)
* webui : fixed download-format, import of one conversation
* webui : add ExportedConversations type for chat import/export
* feat: Update naming & order
* chore: Linting
* feat: Import/Export UX improvements
* chore: update webui build output
* feat: Update UI placement of Import/Export tab in Chat Settings Dialog
* refactor: Cleanup
chore: update webui build output
* feat: Enable shift-click multiple conversation items selection
* chore: update webui static build
* chore: update webui static build
---------
Co-authored-by: Sascha Rogmann <github@rogmann.org>
# Conflicts:
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ConversationSelectionDialog.svelte
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatSettings/ImportExportTab.svelte
# examples/server/webui_llamacpp/src/lib/utils/conversation-utils.ts
* Prevent premature submission on IME input (#16673)
* fix: Prevent premature submission on IME input
* chore: update webui static build
* refactor: Put the IME composition check in a helper function and also check for `KeyboardEvent.keyCode === 229` (helper sketched below)
* chore: update webui static build
* chore: update webui static build
* chore: update webui static build
# Conflicts:
# examples/server/webui_llamacpp/src/lib/utils/is-ime-composing.ts
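A sketch of the helper referenced above (the exported name follows the is-ime-composing.ts conflict path; the exact check is assumed):
```ts
// Sketch: treat a keydown as part of IME composition when the event reports it,
// or when the legacy IME keyCode 229 is seen, so Enter doesn't submit prematurely.
export function isImeComposing(event: KeyboardEvent): boolean {
  return event.isComposing || event.keyCode === 229;
}
```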
* Handle legacy 'context' attachments (#16687)
* webui: introduce OpenAI-compatible model selector in JSON payload (#16562)
* webui: introduce OpenAI-compatible model selector in JSON payload
* webui: restore OpenAI-Compatible model source of truth and unify metadata capture
This change re-establishes a single, reliable source of truth for the active model:
fully aligned with the OpenAI-Compat API behavior
It introduces a unified metadata flow that captures the model field from both
streaming and non-streaming responses, wiring a new onModel callback through ChatService
The model name is now resolved directly from the API payload rather than relying on
server /props or UI assumptions
ChatStore records and persists the resolved model for each assistant message during
streaming, ensuring consistency across the UI and database
Type definitions for API and settings were also extended to include model metadata
and the onModel callback, completing the alignment with OpenAI-Compat semantics
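A sketch of the metadata capture (the onModel callback name comes from the description above; the rest is assumed):
```ts
// Sketch: resolve the model name from the OpenAI-compatible response payload
// itself (streaming or not) instead of relying on /props.
interface StreamCallbacks {
  onChunk: (text: string) => void;
  onModel?: (model: string) => void;
}

function handleSseLine(line: string, callbacks: StreamCallbacks, state: { modelSeen: boolean }): void {
  if (!line.startsWith('data: ') || line === 'data: [DONE]') return;
  const payload = JSON.parse(line.slice('data: '.length));
  if (!state.modelSeen && typeof payload.model === 'string') {
    callbacks.onModel?.(payload.model); // capture the model field from the API payload
    state.modelSeen = true;
  }
  const delta = payload.choices?.[0]?.delta?.content;
  if (typeof delta === 'string' && delta.length > 0) callbacks.onChunk(delta);
}
```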
* webui: address review feedback from allozaur
* webui: move model selector into ChatForm (idea by @allozaur)
* webui: make model selector more subtle and integrated into ChatForm
* webui: replaced the Flowbite selector with a native Svelte dropdown
* webui: add developer setting to toggle the chat model selector
* webui: address review feedback from allozaur
Normalized streamed model names during chat updates
by trimming input and removing directory components before saving
or persisting them, so the conversation UI shows only the filename
Forced model names within the chat form selector dropdown to render as
a single-line, truncated entry with a tooltip revealing the full name
* webui: toggle displayed model source for legacy vs OpenAI-Compat modes
When the selector is disabled, it falls back to the active server model name from /props
When the model selector is enabled, the displayed model comes from the message metadata
(the one explicitly selected and sent in the request)
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormActions.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/constants/localstorage-keys.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/services/chat.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/services/chat.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: refactor model selector and persistence helpers
- Replace inline portal and event listeners with proper Svelte bindings
- Introduce 'persisted' store helper for localStorage sync without runes
- Extract 'normalizeModelName' utils + Vitest coverage
- Simplify ChatFormModelSelector structure and cleanup logic
Replaced the persisted store helper's use of '$state/$effect' runes with
a plain TS implementation to prevent orphaned effect runtime errors
outside component context
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: document normalizeModelName usage with inline examples
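For reference, a sketch matching the behavior described above (trim, then strip directory components so only the file name is shown):
```ts
// Sketch of normalizeModelName: show only the model file name in the UI.
export function normalizeModelName(name: string): string {
  const trimmed = name.trim();
  const last = trimmed.split(/[\\/]/).pop();
  return last && last.length > 0 ? last : trimmed;
}

// e.g. normalizeModelName('  /models/Qwen2.5-7B-Instruct-Q4_K_M.gguf ')
//   -> 'Qwen2.5-7B-Instruct-Q4_K_M.gguf'
```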
* Update tools/server/webui/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/models.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* Update tools/server/webui/src/lib/stores/models.svelte.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: extract ModelOption type into dedicated models.d.ts
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* webui: refine ChatMessageAssistant displayedModel source logic
* webui: stabilize dropdown, simplify model extraction, and init assistant model field
* chore: update webui static build
* Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessageAssistant.svelte
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: npm format, update webui static build
* webui: align sidebar trigger position, remove z-index glitch
* chore: update webui build output
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
# Conflicts:
# examples/server/webui_llamacpp/src/lib/components/app/chat/ChatForm/ChatFormModelSelector.svelte
# examples/server/webui_llamacpp/src/lib/services/models.ts
# examples/server/webui_llamacpp/src/lib/stores/models.svelte.ts
# examples/server/webui_llamacpp/src/lib/stores/persisted.svelte.ts
# examples/server/webui_llamacpp/src/lib/types/models.d.ts
# examples/server/webui_llamacpp/src/lib/utils/model-names.test.ts
# examples/server/webui_llamacpp/src/lib/utils/model-names.ts
# examples/server/webui_llamacpp/src/lib/utils/portal-to-body.ts
* webui: support q URL parameter (#16728)
* webui: support q URL parameter
Fixes #16722
I’ve checked that it works with Firefox’s AI tools
* webui: apply suggestions from code review
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* chore: update webui static build
---------
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
* build fix
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com>
Co-authored-by: Quentin Bramas <quentin.bramas@gmail.com>
Co-authored-by: Isaac McFadyen <isaac@imcf.me>
Co-authored-by: Pascal <admin@serveurperso.com>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
Co-authored-by: Sascha Rogmann <github@rogmann.org>
Co-authored-by: Florian Badie <florianbadie@odrling.xyz>
* Args for MMVQ functions
* WIP
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)
We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)
* Fusing also for iqk/trellis/repacked quants
* Fusing mmvq also in non-MoE up+gate
* Fuse mul_mat_id and add_id into a single kernel for mmvq
* Also iqk quants
* Split mmvq.cu and iqk_mmvq.cu into separate template instances
* Put iqk mmvq implementations into template instances
* Somehow I forgot to change the ggml_type in the legacy template calls
* Add diagnostics
* Disable assert
* Fix TG fused up*unary(gate) when down cannot be fused
The wrong memory buffer got used in that case
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Change fmoe to be on by default
* Change default fmoe also in llama-bench
* Change flash attention to be on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: command line argument to disable it
* Faster tensor name formatting
We gain ~1% for Ling-mini-2.0 when running on CUDA.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for all experts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm onto a single line
so one can more easily check what kinds of arguments are being passed
simply by using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, where it is typically normalized via
fused_rms_norm.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Do not allocate KV cache for unused layers
* Do not apply experts weight scale if it is 1
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Conditionally write moe_shared_expert_intermediate_size
Ling-1T config.json does *not* have `moe_shared_expert_intermediate_size`.
Ling-flash-2.0a *does* have it.
This small patch just makes the gguf_writer detect and write it
conditionally, as needed.
* Fix Ling-1T missing moe_shared_expert_intermediate_size
Thanks CISC for the proper patch to include the needed values!
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Parallelize mask
We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.
* With FA on, create mask as f16 directly
* WIP
* Reduce KQ mask padding to 16
Why was it 64 in the first place?
I don't observe any issues, while TG performance
for long contexts improves by 2-4%.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus an
optimization to do a partial sort when sorting the logits.
That mainline PR and friends were partially cherry-picked by #723, but
the result wasn't really in a working state yet.
A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
still a TODO in mainline [1].
Note that there is a TG performance penalty for returning the logprobs, as
we need to sort the logits. By doing a partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:
This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)
If we cherry-pick the full sort from the mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)
[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019
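Conceptually, the partial sort keeps only the top-k candidates needed for the logprobs instead of ordering the whole vocabulary; a rough language-agnostic sketch of that selection (the server code itself is C++):
```ts
// Sketch only: select the k largest logits without sorting the full array.
function topKLogits(logits: ArrayLike<number>, k: number): { token: number; logit: number }[] {
  const top: { token: number; logit: number }[] = [];
  for (let token = 0; token < logits.length; token++) {
    const logit = logits[token];
    if (top.length < k) {
      top.push({ token, logit });
      top.sort((a, b) => b.logit - a.logit); // k is tiny, keeping it ordered is cheap
    } else if (logit > top[top.length - 1].logit) {
      top[top.length - 1] = { token, logit };
      top.sort((a, b) => b.logit - a.logit);
    }
  }
  return top;
}
```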
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing change committed by mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* handle reasoning content in webui
server : include usage statistics only when the user requests them (#16052)
server : only attempt to enable thinking if using jinja (#15967)
* config reasoning_content in webui and change default to auto
---------
Co-authored-by: firecoperana <firecoperana>