Commit Graph

367 Commits

Author SHA1 Message Date
firecoperana
8403308d8e fix v1 completions streaming mode (#768) 2025-09-09 15:38:12 +02:00
firecoperana
d7882c3cf8 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00
saood06
7a68553487 Add mikupad to ik_llama as an alternative WebUI (#558)
* mikupad.html in ik_llama.cpp (functional but WIP)

* Remove hardcoded extension and add error handling to extension loading

* Update version number and add features array to version

* Make version endpoint always accessible

* Fix case with empty sql

* Add useful error message when launched without sql file

* Add sigma sampler

* Update sigma step and max based on docs

* Remove selectedSessionId and handle it with URL fragment

* Export All (code only, no UI)

* Add compression to server.cpp

* Major UI work (and also add update backend endpoints to accomadate)

* Finalize UI

* Fix visual bug

* fix merge conflict issue

* Pull in full sqlite_modern_cpp repo for the license as it is not attached to source files

* Make compression not show in sidebar if extension is not loaded

* Finalize build, Put support behing LLAMA_SERVER_SQLITE3: command not found build option, and update error message to include the build option is not passed situation

* Fix compile without flag on systems without it installed
2025-08-24 08:27:29 -05:00
Kawrakow
0b448997ec Does this fix #690? (#711)
* Does this fix #690?

* Another attempt

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 19:17:33 +03:00
g2mt
06bed7e01b Port universal assisted decoding to llama-server (#699)
* port universal assisted decoding to server

* fix calls

* fix LOG_INFO

* fix llama_detokenize call

* use emplace_back
2025-08-18 09:22:23 +03:00
Pavel Dudkouski
259cbf0bde Added "usage" to server response (#695)
- Fixed "timings" calculation
- Always send "timings" in last message
2025-08-17 02:25:14 +00:00
g2mt
b6bc5eedad Port speculative decoding from upstream to llama-server (#645)
* server : integrate speculative decoding

* server: Fix field names

* server: fix include, whitespace

* fix compile errors in speculative.cpp

* add llama_sampling_sample_and_accept_n to sampling

* finish porting speculative decoding in server

* port functions from common/speculative, common/sampling

* remove arg

* fix function names

* init params_dft to none

* correct value for n_ctx

* prefix kv cache tensors with model name to avoid conflict

* fix call arguments

* fix spec decoding args

* correct slot.id

* use n_max

* port the rest of sampling funcs

* fix func arguments

* slot.id starts at 1?

* Revert "prefix kv cache tensors with model name to avoid conflict"

This reverts commit fbd5dfd866.

* disable draft logging

* disable logging in speculative.cpp

in mainline, these would be LOG_DEBUG, but since ik_llama doesnt support
it, logging is disabled entirely

* add more draft model parameters

* fix

* pass flash_attn

* add speculative params for parity

* set speculative params in launch_slot_with_task instead
2025-08-16 07:26:44 +03:00
firecoperana
21ced1e3c1 Fix completions endpoint (#684)
Co-authored-by: firecoperana <firecoperana>
2025-08-11 09:43:20 +03:00
firecoperana
ff024df079 add jinja template support (#677)
Co-authored-by: firecoperana <firecoperana>
2025-08-09 12:50:30 +00:00
Anton Sokolchenko
dc1746338c Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing

* Fix DeepSeek R1 parsing issue with responses wrapped in think tags

Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.

Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.

* Implement proper reasoning handling following original llama.cpp patterns

- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration

Based on original llama.cpp implementation patterns.

* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue

 Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix

Next: Update server to automatically apply this configuration.

* Complete server integration fix for DeepSeek R1 thinking tag termination

- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users

* Add TDD test case for DeepSeek R1 thinking tag termination issue

- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution

* Add remaining TDD test changes for DeepSeek R1 thinking tag fix

* Add debug output after upstream merge

* Remove temporary benchmark and debug files

- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)
2025-08-08 13:56:44 +03:00
Anton Sokolchenko
05a61510b9 Fix Qwen3 content extraction breaking code formatting (#661)
Problem:
- qwen3::extract_content_during_parsing() used aggressive regex to collapse multiple newlines
- This broke proper code formatting (e.g., PEP 8's 2 empty lines between functions)
- Affected non-tool-call streaming output where formatting is critical

Solution:
- Replace aggressive std::regex_replace(R"(\n\s*\n)", "\n") with gentle string_strip()
- Follow original llama.cpp patterns: only trim leading/trailing whitespace
- Preserve internal formatting including multiple newlines
- Add proper include for common.h to access string_strip function

Changes:
- examples/server/parsers/qwen3_parser.hpp: Replace whitespace cleanup with string_strip()
- tests/test-function-calls.cpp: Add test_qwen3_whitespace_preservation() to prevent regression

Testing:
-  PEP 8 compliance: 2 empty lines between functions preserved
-  Tool call parsing: All Qwen3 tests continue to pass
-  No regressions: Existing functionality maintained
-  Follows original llama.cpp whitespace handling patterns
2025-08-07 08:22:01 +03:00
Anton Sokolchenko
f4051d9c3e Deepseek R1 function calls (more formats) (#652)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing
2025-08-07 08:15:57 +03:00
firecoperana
ddceb0a55d Merge pull request #648 from ikawrakow/fcp/missing_token_ps
Fix missing token per second for webui after function call update
2025-07-26 21:13:52 -05:00
Anton Sokolchenko
33daaf7310 Fix text generation endpoint (#654) 2025-07-26 19:36:48 -05:00
firecoperana
f443040d49 webui: move preset settings to top
webui:bug fix
2025-07-25 18:03:01 -05:00
firecoperana
981259fb8b bug fix no timings after tool update 2025-07-25 17:52:43 -05:00
Anton Sokolchenko
cfc8f5a61b Enable LLM function calls (#643) 2025-07-24 20:24:12 +02:00
Anton Sokolchenko
9ee72225dc Function calling support for Kimi-K2 (#628)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
2025-07-23 18:11:42 +02:00
firecoperana
18eeb48941 Webui: New Features for Conversations, Settings, and Chat Messages (#618)
* Webui: add Rename/Upload conversation in header and sidebar

webui: don't change modified date when renaming conversation

* webui: add a preset feature to the settings #14649

* webui: Add editing assistant messages #13522

Webui: keep the following message while editing assistance response.

webui: change icon to edit message

* webui: DB import and export #14347

* webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
fix sidebar being covered by main content #14082

---------

Co-authored-by: firecoperana <firecoperana>
2025-07-20 12:33:55 +02:00
firecoperana
d1f92e24d3 add dry sampler (#513)
* add dry sampler

* use vocab instead of model in dry_init function

* fix compile error for build test

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-19 10:24:53 +03:00
Kawrakow
b8142a583d Send [DONE] for OAI compatibility (#470)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-17 10:32:53 +03:00
firecoperana
07777dde1f Add top n sigma sampler and other webui fix (#512)
Co-authored-by: firecoperana <firecoperana>
2025-06-12 08:19:26 +03:00
saood06
fdc60d0aae Docs update (#509)
* use npm as deps manager and vite as bundler

* update XTC docs

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
2025-06-09 05:32:03 -05:00
firecoperana
d1ae1504c6 Fix non rpc build error (#506)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

* fix rpc function not found error

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 17:27:00 +03:00
Kawrakow
7fc39ae7e8 Revert "Rpc improvement (#480)"
This reverts commit 8a5f8573ae.
2025-06-08 14:49:50 +03:00
firecoperana
ed9e8ecc9b Rpc improvement (#480)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 14:43:21 +03:00
firecoperana
6cbd73c60e Webui improvement (#481)
* update webui

* add token/s in webui

* add webui files

* fix webui first message disappear in some browser

* add missing html files

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-08 14:38:47 +03:00
saood06
933b061441 Add an endpoint that lists all the saved prompt caches to server (#502) 2025-06-07 00:22:56 -05:00
Kawrakow
1d28b2a9a1 Adding top-n-sigma sampler (#489)
* Adding top-n-sigma sampler

* Fix typos in XTC PR

* Update README.md for main and server

* More README

* More README

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-03 17:35:09 +03:00
saood06
ccc00a4a56 set cache_prompt default to true (#465) 2025-05-28 08:18:25 +03:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
0ceeb11721 Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00
sasha0552
c7d9dd7634 server : fix smart slot selection (#8020) 2024-06-20 09:57:10 +10:00
Sigbjørn Skjæret
083d5edc87 Only use FIM middle token if it exists (#7648)
* Only use FIM middle if it exists

* Only use FIM middle if it exists
2024-06-18 22:19:45 +10:00
Olivier Chafik
b267b997c5 build: rename main → llama-cli, server → llama-server, llava-cli → llama-llava-cli, etc... (#7809)
* `main`/`server`: rename to `llama` / `llama-server` for consistency w/ homebrew

* server: update refs -> llama-server

gitignore llama-server

* server: simplify nix package

* main: update refs -> llama

fix examples/main ref

* main/server: fix targets

* update more names

* Update build.yml

* rm accidentally checked in bins

* update straggling refs

* Update .gitignore

* Update server-llm.sh

* main: target name -> llama-cli

* Prefix all example bins w/ llama-

* fix main refs

* rename {main->llama}-cmake-pkg binary

* prefix more cmake targets w/ llama-

* add/fix gbnf-validator subfolder to cmake

* sort cmake example subdirs

* rm bin files

* fix llama-lookup-* Makefile rules

* gitignore /llama-*

* rename Dockerfiles

* rename llama|main -> llama-cli; consistent RPM bin prefixes

* fix some missing -cli suffixes

* rename dockerfile w/ llama-cli

* rename(make): llama-baby-llama

* update dockerfile refs

* more llama-cli(.exe)

* fix test-eval-callback

* rename: llama-cli-cmake-pkg(.exe)

* address gbnf-validator unused fread warning (switched to C++ / ifstream)

* add two missing llama- prefixes

* Updating docs for eval-callback binary to use new `llama-` prefix.

* Updating a few lingering doc references for rename of main to llama-cli

* Updating `run-with-preset.py` to use new binary names.
Updating docs around `perplexity` binary rename.

* Updating documentation references for lookup-merge and export-lora

* Updating two small `main` references missed earlier in the finetune docs.

* Update apps.nix

* update grammar/README.md w/ new llama-* names

* update llama-rpc-server bin name + doc

* Revert "update llama-rpc-server bin name + doc"

This reverts commit e474ef1df481fd8936cd7d098e3065d7de378930.

* add hot topic notice to README.md

* Update README.md

* Update README.md

* rename gguf-split & quantize bins refs in **/tests.sh

---------

Co-authored-by: HanClinto <hanclinto@gmail.com>
2024-06-13 00:41:52 +01:00
Georgi Gerganov
f196f2a2c9 server : restore numeric prompts (#7883) 2024-06-12 14:42:29 +03:00
Olivier Chafik
52819e6643 json: refine constraint for whitespace to avoid runaways yet allow pretty print (#7866) 2024-06-11 02:22:57 +01:00
Olivier Chafik
1de5991f7c json: document schema conversion in GBNF readme, align manual grammar examples & converters (#7841)
* json: fix char pattern in grammar converters

* json: prevent number precision & whitespace runaways in example grammars

* json: add doc to grammar readme
2024-06-11 01:00:30 +01:00
Georgi Gerganov
968cfb9d8d server : improve "prompt" handling (#7847) 2024-06-10 14:59:55 +03:00
mgroeber9110
36f9df2257 server: do not remove whitespace at the start of a completion chunk (#7830) 2024-06-09 20:50:35 +10:00
sasha0552
66217bbac6 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
Johannes Gäßler
e8ac0b3518 server: update cache_prompt documentation [no ci] (#7745) 2024-06-07 11:15:49 +02:00
woodx
f91a2cbb62 server : do not get prompt in infill mode (#7286)
* avoid to get prompt in infill mode and embedding mode

* remove embedding mode

* refactor format

---------

Co-authored-by: wudexiang <wudexiang@bytedance.com>
2024-06-07 10:09:45 +03:00
Georgi Gerganov
c2a2806fac imatrix : migrate to gpt_params (#7771)
* imatrix : migrate to gpt_params

ggml-ci

* imatrix : add --save-frequency cli arg

* common : fix --no-ppl
2024-06-06 16:30:58 +03:00
Olivier Chafik
bb0026f4f1 grammars: x{min,max} repetition operator (#6640)
* grammars: x{min,max} repetition operator + tweak +/*/? to avoid duplication of original over alternates

* grammars: handle `x{n}` and fix `x{n,n}`

* grammars: document new repetition operators

* grammars: uniform use of int for min & max

* grammars: refactor parser test

* grammar: parsing tests w/ natural pretty print of updated expectations

* grammars: much prettier print of expectations (+ TEST_GRAMMAR_PARSER_PRINT_ALL=1 to force all)

* grammars: improve test pretty print again

* grammars: pretty print rules and chars

* grammars: fix copy rule skipping

* grammars: disallow `a{,}` (not allowed in regexps)

* Update common/grammar-parser.cpp

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: fix copy rule skipping (again) & display of expectations

* grammars: more test cases

* grammars: update reps parsing to bring ? / * / + closer to before

* json: use new GBNF repetitions{m,n} syntax

* grammars: update performance gotchas w/ repetition advice

* Update examples/json_schema_to_grammar.py

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* grammars: comment on rule repetitions

* grammars: ensure unambiguous number alternatives

* grammar: nit typo switched error msgs

* grammar: nit numbering in comment

* json: update numeric rule to be unambiguous

* Apply suggestions from code review

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* Update examples/server/public/json-schema-to-grammar.mjs

Co-authored-by: Clint Herron <hanclinto@gmail.com>

* json: fix integral-part

* grammar: add repetition tests

---------

Co-authored-by: Clint Herron <hanclinto@gmail.com>
2024-06-06 10:07:06 +01:00
Georgi Gerganov
8822dcce8d common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remote --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Yazan Agha-Schrader
e606e940c3 server : new UI (#7633)
* ic

* migrate my eary work

* add the belonging stuff: css,favicon etc

* de prompts

* chore: Update HTML meta tags in index.html file

* add api-key css classes

* some necessary fixes

* Add API key CSS classes and update styling in style.css

* clean the code

* move API to the top, rearrange param sliders. update css

* add tooltips to the parameters with comprehensible explanations

* fix FloatField and BoolField tooltips

* fix grammar field width

* use template literales for promptFormats.js

* update const ModelGenerationInfo

* remove ms per token, since not relevant for most webui users and use cases

* add phi-3 prompt template

* add phi3 to dropdown

* add css class

* update forgotten css theme

* add user message suffix

* fix chatml & add llama3 format

* fix llama3 prompt template

* more prompt format fixes

* add more comon stop tokens

* add missing char

* do not separate with new line or comma

* move prompt style

* add hacky llama2 prompt solution, reduce redundancy in promptFormats.js

* fix toggle state localstorage

* add cmd-r prompt et reduce redundancy

* set default prompt to empty

* move files, clean code

* fix css path

* add a button to the new ui

* move new ui to "/public" due to otherwise problematic CORS behaviour

* include new ui in cpp

* fix wrong link to old ui

* renaming to ensure consistency

* fix typos "prompt-format" -> "prompt-formats"

* use correct indent

* add new ui files to makefile

* fix typo
2024-06-01 22:31:48 +03:00
HanishKVC
cfd5c8cc34 SimpleChat: Simple histogram/repeatMatching driven garbageTrimming, Settings UI, Streaming mode, OpenAi Compat (Model, Authorization Bearer), Save/Restore session, Auto Settings UI (#7548)
* SimpleChat:DU:BringIn local helper js modules using importmap

Use it to bring in a simple trim garbage at end logic, which is
used to trim received response.

Also given that importmap assumes esm / standard js modules, so
also global variables arent implicitly available outside the
modules. So add it has a member of document for now

* SimpleChat:DU: Add trim garbage at end in loop helper

* SimpleChat:DU:TrimGarbage if unable try skip char and retry

* SimpleChat:DU: Try trim using histogram based info

TODO: May have to add max number of uniq chars in histogram at
end of learning phase.

* SimpleChat:DU: Switch trim garbage hist based to maxUniq simple

Instead of blindly building histogram for specified substring
length, and then checking if any new char within specified min
garbage length limit, NOW exit learn state when specified maxUniq
chars are found. Inturn there should be no new chars with in
the specified min garbage length required limit.

TODO: Need to track char classes like alphabets, numerals and
special/other chars.

* SimpleChat:DU: Bring in maxType to the mix along with maxUniq

Allow for more uniq chars, but then ensure that a given type of
char ie numerals or alphabets or other types dont cross the
specified maxType limit. This allows intermixed text garbage
to be identified and trimmed.

* SimpleChat:DU: Cleanup debug log messages

* SimpleChat:UI: Move html ui base helpers into its own module

* SimpleChat:DU:Avoid setting frequence/Presence penalty

Some models like llama3 found to try to be over intelligent by
repeating garbage still, but by tweaking the garbage a bit so that
it is not exactly same. So avoid setting these penalties and let
the model's default behaviour work out, as is.

Also the simple minded histogram based garbage trimming from end,
works to an extent, when the garbage is more predictable and
repeatative.

* SimpleChat:UI: Add and use a para-create-append helper

Also update the config params dump to indicate that now one needs
to use document to get hold of gMe global object, this is bcas of
moving to module type js.

Also add ui.mjs to importmap

* SimpleChat:UI: Helper to create bool button and use it wrt settings

* SimpleChat:UI: Add Select helper and use it wrt ChatHistoryInCtxt

* SimpleChat:UI:Select: dict-name-value, value wrt default, change

Take a dict/object of name-value pairs instead of just names.
Inturn specify the actual value wrt default, rather than the
string representing that value.

Trap the needed change event rather than click wrt select.

* SimpleChat:UI: Add Div wrapped label+element helpers

Move settings related elements to use the new div wrapped ones.

* SimpleChat:UI:Add settings button and bring in settings ui

* SimpleChat:UI:Settings make boolean button text show meaning

* SimpleChat: Update a bit wrt readme and notes in du

* SimpleChat: GarbageTrim enable/disable, show trimmed part ifany

* SimpleChat: highlight trim, garbage trimming bitmore aggressive

Make it easy for end user to identified the trimmed text.

Make garbage trimming logic, consider a longer repeat garbage
substring.

* SimpleChat: Cleanup a bit wrt Api end point related flow

Consolidate many of the Api end point related basic meta data into
ApiEP class.

Remove the hardcoded ApiEP/Mode settings from html+js, instead use
the generic select helper logic, inturn in the settings block.

Move helper to generate the appropriate request json string based
on ApiEP into SimpleChat class itself.

* SimpleChat:Move extracting assistant response to SimpleChat class

so also the trimming of garbage.

* SimpleChat:DU: Bring in both trim garbage logics to try trim

* SimpleChat: Cleanup readme a bit, add one more chathistory length

* SimpleChat:Stream:Initial handshake skeleton

Parse the got stream responses and try extract the data from it.

It allows for a part read to get a single data line or multiple
data line. Inturn extract the json body and inturn the delta
content/message in it.

* SimpleChat: Move handling oneshot mode server response

Move handling of the oneshot mode server response into SimpleChat.

Also add plumbing for moving multipart server response into same.

* SimpleChat: Move multi part server response handling in

* SimpleChat: Add MultiPart Response handling, common trimming

Add logic to call into multipart/stream server response handling.

Move trimming of garbage at the end into the common handle_response
helper.

Add new global flag to control between oneshot and multipart/stream
mode of fetching response. Allow same to be controlled by user.

If in multipart/stream mode, send the stream flag to the server.

* SimpleChat: show streamed generative text as it becomes available

Now that the extracting of streamed generated text is implemented,
add logic to show the same on the screen.

* SimpleChat:DU: Add NewLines helper class

To work with an array of new lines. Allow adding, appending,
shifting, ...

* SimpleChat:DU: Make NewLines shift more robust and flexible

* SimpleChat:HandleResponseMultiPart using NewLines helper

Make handle_response_multipart logic better and cleaner. Now it
allows for working with the situation, where the delta data line
got from server in stream mode, could be split up when recving,
but still the logic will handle it appropriately.

ALERT: Rather except (for now) for last data line wrt a request's
response.

* SimpleChat: Disable console debug by default by making it dummy

Parallely save a reference to the original func.

* SimpleChat:MultiPart/Stream flow cleanup

Dont try utf8-decode and newlines-add_append if no data to work on.

If there is no more data to get (ie done is set), then let NewLines
instance return line without newline at end, So that we dont miss
out on any last-data-line without newline kind of scenario.

Pass stream flag wrt utf-8 decode, so that if any multi-byte char
is only partly present in the passed buffer, it can be accounted
for along with subsequent buffer. At sametime, bcas of utf-8's
characteristics there shouldnt be any unaccounted bytes at end,
for valid block of utf8 data split across chunks, so not bothering
calling with stream set to false at end. LATER: Look at TextDecoder's
implementation, for any over intelligence, it may be doing..
If needed, one can use done flag to account wrt both cases.

* SimpleChat: Move baseUrl to Me and inturn gMe

This should allow easy updating of the base url at runtime by the
end user.

* SimpleChat:UI: Add input element helper

* SimpleChat: Add support for changing the base url

This ensures that if the user is running the server with a
different port or wants to try connect to server on a different
machine, then this can be used.

* SimpleChat: Move request headers into Me and gMe

Inturn allow Authorization to be sent, if not empty.

* SimpleChat: Rather need to use append to insert headers

* SimpleChat: Allow Authorization header to be set by end user

* SimpleChat:UI+: Return div and element wrt creatediv helpers

use it to set placeholder wrt Authorization header.

Also fix copy-paste oversight.

* SimpleChat: readme wrt authorization, maybe minimal openai testing

* SimpleChat: model request field for openai/equivalent compat

May help testing with openai/equivalent web services, if they
require this field.

* SimpleChat: readme stream-utf-8 trim-english deps, exception2error

* Readme: Add a entry for simplechat in the http server section

* SimpleChat:WIP:Collate internally, Stream mode Trap exceptions

This can help ensure that data fetched till that point, can be
made use of, rather than losing it.

On some platforms, the time taken wrt generating a long response,
may lead to the network connection being broken when it enters
some user-no-interaction related power saving mode.

* SimpleChat:theResp-origMsg: Undo a prev change to fix non trim

When the response handling was moved into SimpleChat, I had changed
a flow bit unnecessarily and carelessly, which resulted in the non
trim flow, missing out on retaining the ai assistant response.

This has been fixed now.

* SimpleChat: Save message internally in handle_response itself

This ensures that throwing the caught exception again for higher
up logic, doesnt lose the response collated till that time.

Go through theResp.assistant in catch block, just to keep simple
consistency wrt backtracing just in case.

Update the readme file.

* SimpleChat:Cleanup: Add spacing wrt shown req-options

* SimpleChat:UI: CreateDiv Divs map to GridX2 class

This allows the settings ui to be cleaner structured.

* SimpleChat: Show Non SettingsUI config field by default

* SimpleChat: Allow for multiline system prompt

Convert SystemPrompt into a textarea with 2 rows. Reduce
user-input-textarea to 2 rows from 3, so that overall
vertical space usage remains same.

Shorten usage messages a bit, cleanup to sync with settings ui.

* SimpleChat: Add basic skeleton for saving and loading chat

Inturn when ever a chat message (system/user/model) is added,
the chat will be saved into browser's localStorage.

* SimpleChat:ODS: Add a prefix to chatid wrt ondiskstorage key

* SimpleChat:ODS:WIP:TMP: Add UI to load previously saved chat

This is a temporary flow

* SimpleChat:ODS:Move restore/load saved chat btn setup to Me

This also allows being able to set the common system prompt
ui element to loaded chat's system prompt.

* SimpleChat:Readme updated wrt save and restore chat session info

* SimpleChat:Show chat session restore button, only if saved session

* SimpleChat: AutoCreate ChatRequestOptions settings to an extent

* SimpleChat: Update main README wrt usage with server
2024-06-02 02:20:18 +10:00
Georgi Gerganov
d91eb8d805 server : update js (#7670) 2024-05-31 22:23:04 +03:00
mgroeber9110
b156a7902e server: do not remove whitespace at the start of a completion chunk (#7524) 2024-05-28 14:55:51 +10:00