firecoperana
7470d8bf50
Merge branch 'main' into fcp/string_ban
2026-02-04 21:56:08 -06:00
gapeleon
17d101863d
server: add dynamic control vector management endpoints (#1223)
...
This implements the ability to load, unload, and scale control vectors
(representation engineering) mid-inference, following the existing
task-queue pattern used by LoRA adapters.
New Endpoints:
- GET /control-vectors
- POST /control-vectors/load
- POST /control-vectors/unload
- POST /control-vectors/apply (handles scaling)
Technical Notes:
- Centralizes vector aggregation logic to share implementation between
load, unload, and apply tasks.
- Vectors are applied globally to the model context.
- Enforces dimension validation on load to safely reject incompatible
vectors.
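As a usage sketch, assuming the server listens on localhost:8080 and these routes speak JSON (the field names "path", "id", and "scale" are illustrative guesses, not a documented schema), a client built on the bundled cpp-httplib could drive the endpoints like this:

    // Hedged usage sketch: the routes come from this commit; the JSON
    // field names ("path", "id", "scale") are assumptions for illustration.
    #include "httplib.h"
    #include <cstdio>

    int main() {
        httplib::Client cli("http://localhost:8080");

        // Load a control vector from disk.
        auto load = cli.Post("/control-vectors/load",
                             R"({"path": "vectors/example.gguf"})",
                             "application/json");
        if (!load || load->status != 200) {
            std::fprintf(stderr, "load failed\n");
            return 1;
        }

        // Re-scale it mid-inference; per the notes, /apply handles scaling.
        cli.Post("/control-vectors/apply",
                 R"({"id": 0, "scale": 0.8})",
                 "application/json");

        // List what is currently loaded.
        if (auto res = cli.Get("/control-vectors")) {
            std::printf("%s\n", res->body.c_str());
        }
        return 0;
    }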
Co-authored-by: Gapeleon <gapeleon@users.noreply.github.com>
2026-02-04 16:07:18 +02:00
firecoperana
4f3f1be6bd
init n_buffer
2026-02-02 16:45:54 -06:00
firecoperana
3b38df11e8
server: add string ban
2026-02-02 15:56:42 -06:00
firecoperana
d71a3ec315
Server: refactor and rename functions (#1151)
...
* Server: rename functions and refactor code
rename functions
refactor update slots
rename params_base
rename timings
* change
* Revert kv cache name changes
* Revert 2
* fix test build error
---------
Co-authored-by: firecoperana <firecoperana>
2026-01-18 08:16:57 +02:00
hksdpc255
e1c4c4a495
Fix Anthropic Messages API (#1136)
...
* server: stop processing the prompt when client disconnects
implement generator-based API for task results
Update httplib.h to 0.27.0
Fix embedding error
Stop prompt processing when disconnected
* Port upstream https://github.com/ggml-org/llama.cpp/pull/18551
* Add back Anthropic support
* Fix merge issue caused by the GitHub web UI
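For reference, a minimal request in the Anthropic Messages shape this PR fixes, sent through the bundled cpp-httplib. The /v1/messages path, anthropic-version header, and body fields follow Anthropic's published API; it is an assumption here that this server accepts exactly these fields:

    #include "httplib.h"
    #include <cstdio>

    int main() {
        httplib::Client cli("http://localhost:8080");
        // Header and body shape per Anthropic's Messages API docs.
        httplib::Headers headers = {{"anthropic-version", "2023-06-01"}};
        const char *body = R"({
            "model": "default",
            "max_tokens": 128,
            "messages": [{"role": "user", "content": "Hello"}]
        })";
        if (auto res = cli.Post("/v1/messages", headers, body, "application/json")) {
            std::printf("%d %s\n", res->status, res->body.c_str());
        }
        return 0;
    }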
---------
Co-authored-by: firecoperana <firecoperana>
2026-01-13 08:37:29 +02:00
firecoperana
1a461525d5
server: stop processing the prompt when client disconnects (#1134)
...
implement generator-based API for task results
Update httplib.h to 0.27.0
Fix embedding error
Stop prompt processing when disconnected
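A minimal sketch of the control flow this change introduces, assuming a generator-style result source and a connection-liveness callback from the HTTP layer; all names here (result_generator, next_result, cancel) are illustrative stand-ins, not this repository's API:

    #include <functional>
    #include <optional>
    #include <string>

    struct task_result { std::string text; bool is_final; };

    // Hypothetical generator interface over the server task queue.
    struct result_generator {
        // Stand-ins: a real generator would block on the task queue.
        std::optional<task_result> next_result() { return std::nullopt; }
        void cancel() {}
    };

    void stream_results(result_generator &gen,
                        const std::function<bool()> &is_connection_closed,
                        const std::function<void(const std::string &)> &send_chunk) {
        while (auto res = gen.next_result()) {
            if (is_connection_closed()) {
                gen.cancel();   // free the slot instead of finishing the prompt
                return;
            }
            send_chunk(res->text);
            if (res->is_final) return;
        }
    }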
Co-authored-by: firecoperana <firecoperana>
2026-01-13 07:56:59 +02:00
firecoperana
c03ee1a4d2
server: improve speed of speculative decoding (#1119)
...
* server: improve speed of speculative decoding
change logs
rpc: add recompute
fix speculative decoding
* Fix n_batch_size not being set to the context size for the draft model
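For context, a schematic of the draft-then-verify loop whose speed this PR improves; both model calls are stubbed, since the point is the accept/reject control flow, not this repository's actual functions:

    #include <cstdint>
    #include <vector>

    using llama_token = int32_t;

    // Stand-ins for the draft model's and target model's sampling calls.
    static llama_token draft_sample(const std::vector<llama_token> &) { return 0; }
    static llama_token target_sample(const std::vector<llama_token> &) { return 0; }

    // One speculative step: the draft model proposes n_draft tokens cheaply,
    // the target model verifies them, and the longest agreeing prefix is kept.
    static int speculative_step(std::vector<llama_token> &ctx, int n_draft) {
        std::vector<llama_token> proposed;
        for (int i = 0; i < n_draft; ++i) {
            std::vector<llama_token> ext = ctx;
            ext.insert(ext.end(), proposed.begin(), proposed.end());
            proposed.push_back(draft_sample(ext));
        }

        int accepted = 0;
        std::vector<llama_token> ext = ctx;
        for (llama_token t : proposed) {
            llama_token verified = target_sample(ext); // batched in practice
            ext.push_back(verified);
            if (verified != t) break;  // first mismatch: stop, keep the
                                       // target's correction instead
            ++accepted;
        }
        // A real implementation also samples one bonus token on full acceptance.
        ctx = ext;
        return accepted;
    }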
---------
Co-authored-by: firecoperana <firecoperana>
2026-01-10 08:01:22 +02:00
firecoperana
2a633c4357
server: exclude thinking tokens when finding the slot (#1079)
...
Refactor slot finding
Enable by default
Fix prompt loading
Rename variables
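A sketch of the matching idea, assuming reasoning spans are delimited by <think>...</think> (the tag form and both helpers are illustrative): strip those spans from the cached and incoming prompts before scoring the common prefix, so per-request reasoning text does not defeat slot reuse:

    #include <cstddef>
    #include <string>

    // Remove every "<think>...</think>" span (assumed tag form).
    static std::string strip_thinking(std::string s) {
        const std::string open = "<think>", close = "</think>";
        for (size_t b; (b = s.find(open)) != std::string::npos; ) {
            size_t e = s.find(close, b);
            if (e == std::string::npos) { s.erase(b); break; }
            s.erase(b, e + close.size() - b);
        }
        return s;
    }

    // Longest common prefix after excluding thinking spans; higher is
    // better when choosing which slot's cached prompt to reuse.
    static size_t slot_match_score(const std::string &cached,
                                   const std::string &incoming) {
        std::string a = strip_thinking(cached), b = strip_thinking(incoming);
        size_t n = 0;
        while (n < a.size() && n < b.size() && a[n] == b[n]) ++n;
        return n;
    }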
Co-authored-by: firecoperana <firecoperana>
2025-12-22 09:46:45 +01:00
firecoperana
0e91b89cd3
Refactor chat and server file (#1062)
...
* Add alternative log functions
* chat: fix int overflow, prevent size calculation in float/double (#17357) (see the sketch after this list)
* Update common/chat.cpp
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* common : move all common_chat_parse_* to chat-parser.cpp. (#17481)
# Conflicts:
# common/chat.cpp
* server: split server.cpp code into server/common/task/queue/context
* Fix compiler warning
* Clean up code
* common: use native MultiByteToWideChar (see the sketch after this list)
* move server prompt to server task
* Clean code
* delete utils.hpp
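Two of the bullets above are concrete enough to sketch. First, the int-overflow fix: size arithmetic kept in int, or routed through float/double, loses exactness or overflows once the product exceeds INT_MAX, while size_t arithmetic does not. A schematic, not the repository's exact code:

    #include <cstddef>

    // What the fix amounts to: keep size products in size_t end to end.
    size_t buf_bytes(size_t n_tokens, size_t n_embd) {
        // int n_bytes = n_tokens * n_embd * sizeof(float);  // truncates into int
        return n_tokens * n_embd * sizeof(float);            // size_t throughout
    }

Second, the MultiByteToWideChar change: the documented two-call Win32 pattern (the first call sizes the buffer, the second converts). The API usage is as documented by Microsoft; the wrapper itself is illustrative:

    #include <string>
    #include <windows.h>

    std::wstring utf8_to_wide(const std::string &s) {
        if (s.empty()) return std::wstring();
        // First call: query the required wide-character count.
        int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int) s.size(),
                                    nullptr, 0);
        std::wstring w(n, L'\0');
        // Second call: perform the actual conversion into the buffer.
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int) s.size(), &w[0], n);
        return w;
    }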
---------
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: DAN™ <dranger003@gmail.com>
2025-12-15 08:27:20 +01:00