Commit Graph

33 Commits

Author SHA1 Message Date
firecoperana
d1ae1504c6 Fix non rpc build error (#506)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

* fix rpc function not found error

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 17:27:00 +03:00
Kawrakow
7fc39ae7e8 Revert "Rpc improvement (#480)"
This reverts commit 8a5f8573ae.
2025-06-08 14:49:50 +03:00
firecoperana
ed9e8ecc9b Rpc improvement (#480)
* Add RPC backend in device list to override tensors.

* rpc : prevent crashes on invalid input (#9040)

Add more checks which prevent RPC server from crashing if invalid input
is received from client
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : print error message when failed to connect endpoint (#9042)

* Fix RPC error

* Add vulkan, sycl to rpc backend

* add thread in rpc cpu backend

* add cache folder and other improvement in rpc

* add header file

* support for models with non-512 aligned tensors

* rpc : do not wait for response when sending RPC_CMD_SET_TENSOR (#12943)

RPC_CMD_SET_TENSOR always returns an empty response and we send this 4
times per token. We can improve TG speed if we don't wait for this empty
response.

The performance impact of this change depends on the network latency.
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* fix(rpc): Improve input validation and error handling (#13069)

* fix(rpc): Improve input validation and error handling

The `rpc-server` was vulnerable to Denial of Service attacks via
several RPC commands (`SET_TENSOR`, `GRAPH_COMPUTE`, etc.). Malformed
messages could trigger failed assertions (e.g., invalid `ggml_type`)
or out-of-bounds reads/writes leading to `GGML_ABORT` calls,
crashing the server process.

This PR introduces robust input validation and replaces `abort()`
calls with graceful error handling:

- **Type Validation:** `deserialize_tensor` now checks if the
  `tensor->type` is within the valid `GGML_TYPE_COUNT` range
  *before* calling `ggml_new_tensor_4d`. Returns `nullptr` on
  invalid type.
- **Bounds Checks:** Replaced `GGML_ABORT` in `set_tensor`,
  `set_tensor_hash`, and `get_tensor` handlers with error
  logging and returning `false` when data/offset parameters
  are out of buffer bounds.
- **Size Checks:** Added safe arithmetic checks (for overflow) in
  `graph_compute` when calculating required message sizes based
  on client-provided `n_nodes` and `n_tensors`. Returns early
  if the reported sizes conflict with the actual message size or
  would lead to overflow.
- **Error Propagation:**
    - `create_node` now checks for `nullptr` return values from
      `deserialize_tensor` and its recursive calls, propagating
      `nullptr` upwards on failure. Uses `find` instead of `at`
      for safer map access.
    - `copy_tensor` now checks for `nullptr` from `deserialize_tensor`
      and sets the response status to failure if deserialization
      or bounds checks fail.
    - `graph_compute` now checks for `nullptr` return from
      `create_node` and returns failure status correctly. The final
      return value now reflects the actual computation status.

These changes improve the RPC server's resilience
against malformed client requests, preventing crashes and ensuring
errors are handled more gracefully.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): address pr comments

removed comments and unnecessary returns

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): ambiguous nullptr from create_node

rpc_server::create_node could previously return nullptr if the input ID
was 0 (valid) or if an internal error (deserialization, recursion
failure) occurred (invalid). This ambiguity made error handling
difficult for the caller (`graph_compute`).

This commit clarifies the meaning of nullptr:
- `graph_compute` now checks if the input 'id' was non-zero when
  `create_node` returns nullptr, correctly identifying failures
  versus intentional null links.
- `create_node` avoids recursive calls for zero IDs and propagates
  nullptr unambiguously on failure during recursion.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): initial zero check in create_node

The caller (`graph_compute`) already checks `id != 0` when handling
a `nullptr` return from `create_node`, correctly distinguishing
intentional null links from actual errors. This makes the initial
`if (id == 0)` check redundant.

Also removes the log message when a tensor ID is not found in the
provided map which was added in this branch.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* fix(rpc): Handle get_alloc_size failure in server

Check the return value of `server.get_alloc_size` in the RPC server
loop. If the call fails, return early to close the connection.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): input size validation in graph_compute

Removes detailed, step-by-step size calculations and overflow
checks in favor of simpler direct comparisons, assuming 64-bit
overflow is unlikely.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove extra status code setting

Removes the explicit setting of `response.result = GGML_STATUS_FAILED`
when `create_node` returns `nullptr` within `graph_compute`.
Primary signal is the `false` return value in case of failure.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

* refactor(rpc): remove redundant check for tensor->type

Breaks CI on ubuntu-cpu-make. Tensor type is uint32_t, thus
the check is not needed.

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
# Conflicts:
#	ggml/src/ggml-rpc.cpp

* rpc : fix cache directory initialization (#13188)

Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
# Conflicts:
#	examples/rpc/rpc-server.cpp

* rpc : avoid uninitialized memory in serialize_tensor (#13210)

Zero out the name and padding buffers.

* fix merge error

* Add hello command in RPC

* bug fix

* add rpc header

* fix bug for missing rpc names

* add tpc no delay for rpc

* add back webui

---------

Signed-off-by: Ville Vesilehto <ville@vesilehto.fi>
Signed-off-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
Co-authored-by: matt23456 <matt23456>
Co-authored-by: Ville Vesilehto <ville@vesilehto.fi>
Co-authored-by: xiaofei <hbuxiaofei@gmail.com>
Co-authored-by: Justin Santa Barbara <justinsb@google.com>
2025-06-08 14:43:21 +03:00
firecoperana
6cbd73c60e Webui improvement (#481)
* update webui

* add token/s in webui

* add webui files

* fix webui first message disappear in some browser

* add missing html files

---------

Co-authored-by: firecoperana <firecoperana>
2025-06-08 14:38:47 +03:00
Kawrakow
1a4cfbcc53 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Kawrakow
0ceeb11721 Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00
sasha0552
66217bbac6 server : smart slot selection using Longest Common Prefix (#7728)
* server : Smart selection of available slot using Longest Common Substring

* add usage

* remove trailing whitespaces

* Use Longest Common Prefix (LCP) instead of LCS

* Rename argument
2024-06-08 10:50:31 +03:00
Georgi Gerganov
8822dcce8d common : refactor cli arg parsing (#7675)
* common : gpt_params_parse do not print usage

* common : rework usage print (wip)

* common : valign

* common : rework print_usage

* infill : remove cfg support

* common : reorder args

* server : deduplicate parameters

ggml-ci

* common : add missing header

ggml-ci

* common : remote --random-prompt usages

ggml-ci

* examples : migrate to gpt_params

ggml-ci

* batched-bench : migrate to gpt_params

* retrieval : migrate to gpt_params

* common : change defaults for escape and n_ctx

* common : remove chatml and instruct params

ggml-ci

* common : passkey use gpt_params
2024-06-04 21:23:39 +03:00
Benjamin Findley
9b9b4eb946 change default temperature of OAI compat API from 0 to 1 (#7226)
* change default temperature of OAI compat API from 0 to 1

* make tests explicitly send temperature to OAI API
2024-05-13 12:40:08 +10:00
Johannes Gäßler
4e708b6d16 JSON: [key] -> .at(key), assert() -> GGML_ASSERT (#7143) 2024-05-08 21:53:08 +02:00
Xuan Son Nguyen
777e128db5 clean up json_value & server_log (#7142) 2024-05-08 13:24:14 +02:00
Pedro Cuenca
a512902ca8 llama : support Llama 3 HF conversion (#6745)
* Support Llama 3 conversion

The tokenizer is BPE.

* style

* Accept suggestion

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>

* llama : add llama_token_is_eog()

ggml-ci

* llama : auto-detect more EOT tokens when missing in KV data

* convert : replacing EOS token is a hack

* llama : fix codegemma EOT token + add TODOs

* llama : fix model type string for 8B model

---------

Co-authored-by: Sourab Mangrulkar <13534540+pacman100@users.noreply.github.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-04-21 14:50:41 +03:00
Pierrick Hymbert
73e01781d4 ci: bench: support sse and fix prompt processing time / server: add tokens usage in stream OAI response (#6495)
* ci: bench: support sse and fix prompt processing time
server: add tokens usage in stream mode

* ci: bench: README.md EOL

* ci: bench: remove total pp and tg as it is not accurate

* ci: bench: fix case when there is no token generated

* ci: bench: change to the 95 percentile for pp and tg as it is closer to what the server exports in metrics

* ci: bench: fix finish reason rate
2024-04-06 05:40:47 +02:00
JH23X
0f4e3af782 server : handle exception on wrong type in request (#6452)
Co-authored-by: Jonas Holzner <jonas.holzner.external@hensoldt.net>
2024-04-03 21:09:52 +03:00
Xuan Son Nguyen
242d1c1e90 Server: clean up OAI params parsing function (#6284)
* server: clean up oai parsing function

* fix response_format

* fix empty response_format

* minor fixes

* add TODO for logprobs

* update docs
2024-03-25 09:42:17 +01:00
Pierrick Hymbert
6331d10b51 server: flush stdout after logging in both text and json layout (#6253) 2024-03-23 13:18:45 +01:00
Olivier Chafik
f41d10fa62 json-schema-to-grammar : fix order of props + non-str const/enum (#6232)
* json: ordered json in server/schema converter to respect orig order

* json: ws nits

* json: support non-string const / enums
2024-03-22 15:07:44 +02:00
Olivier Chafik
c5b204162d json-schema-to-grammar improvements (+ added to server) (#5978)
* json: fix arrays (disallow `[,1]`)

* json: support tuple types (`[number, string]`)

* json: support additionalProperties (`{[k: string]: [string,number][]}`)

* json: support required / optional properties

* json: add support for pattern

* json: resolve $ref (and support https schema urls)

* json: fix $ref resolution

* join: support union types (mostly for nullable types I think)

* json: support allOf + nested anyOf

* json: support any (`{}` or `{type: object}`)

* json: fix merge

* json: temp fix for escapes

* json: spaces in output and unrestricted output spaces

* json: add typings

* json:fix typo

* Create ts-type-to-grammar.sh

* json: fix _format_literal (json.dumps already escapes quotes)

* json: merge lit sequences and handle negatives

{"type": "string", "pattern": "^({\"question\": \"[^\"]+\", \"response\": \"[^\"]+\"}\\n)+$"}

* json: handle pattern repetitions

* Update json-schema-to-grammar.mjs

* Create regex-to-grammar.py

* json: extract repeated regexp patterns to subrule

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* json: handle schema from pydantic Optional fields

* Update json-schema-to-grammar.py

* Update json-schema-to-grammar.py

* Update ts-type-to-grammar.sh

* Update ts-type-to-grammar.sh

* json: simplify nullable fields handling

* json: accept duplicate identical rules

* json: revert space to 1 at most

* json: reuse regexp pattern subrules

* json: handle uuid string format

* json: fix literal escapes

* json: add --allow-fetch

* json: simplify range escapes

* json: support negative ranges in patterns

* Delete commit.txt

* json: custom regex parser, adds dot support & JS-portable

* json: rm trailing spaces

* Update json-schema-to-grammar.mjs

* json: updated server & chat `( cd examples/server && ./deps.sh )`

* json: port fixes from mjs to python

* Update ts-type-to-grammar.sh

* json: support prefixItems alongside array items

* json: add date format + fix uuid

* json: add date, time, date-time formats

* json: preserve order of props from TS defs

* json: port schema converter to C++, wire in ./server

* json: nits

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* Update json-schema-to-grammar.cpp

* json: fix mjs implementation + align outputs

* Update json-schema-to-grammar.mjs.hpp

* json: test C++, JS & Python versions

* json: nits + regen deps

* json: cleanup test

* json: revert from c++17 to 11

* json: nit fixes

* json: dirty include for test

* json: fix zig build

* json: pass static command to std::system in tests (fixed temp files)

* json: fix top-level $refs

* json: don't use c++20 designated initializers

* nit

* json: basic support for reserved names `{number:{number:{root:number}}}`

* Revamp test cmake to allow args (WORKING_DIRECTORY needed for JSON test)

* json: re-ran server deps.sh

* json: simplify test

* json: support mix of additional props & required/optional

* json: add tests for some expected failures

* json: fix type=const in c++, add failure expectations for non-str const&enum

* json: test (& simplify output of) empty schema

* json: check parsing in test + fix value & string refs

* json: add server tests for OAI JSON response_format

* json: test/fix top-level anyOf

* json: improve grammar parsing failures

* json: test/fix additional props corner cases

* json: fix string patterns (was missing quotes)

* json: ws nit

* json: fix json handling in server when there's no response_format

* json: catch schema conversion errors in server

* json: don't complain about unknown format type in server if unset

* json: cleaner build of test

* json: create examples/json-schema-pydantic-example.py

* json: fix date pattern

* json: move json.hpp & json-schema-to-grammar.{cpp,h} to common

* json: indent 4 spaces

* json: fix naming of top-level c++ function (+ drop unused one)

* json: avoid using namespace std

* json: fix zig build

* Update server.feature

* json: iostream -> fprintf

* json: space before & refs for consistency

* json: nits
2024-03-21 11:50:43 +00:00
Karthick
1d1248fb8c Server: Handle n_keep parameter in the request (#6174) 2024-03-20 12:02:34 +01:00
Xuan Son Nguyen
2534910086 Server: Use multi-task for embeddings endpoint (#6001)
* use multitask for embd endpoint

* specify types

* remove redundant {"n_predict", 0}
2024-03-13 11:39:11 +01:00
Xuan Son Nguyen
2375a2e6d9 Server: format error to json (#5961)
* server: format error to json

* server: do not crash on grammar error

* fix api key test case

* revert limit max n_predict

* small fix

* correct coding style

* update completion.js

* launch_slot_with_task

* update docs

* update_slots

* update webui

* update readme
2024-03-11 10:56:41 +01:00
Minsoo Cheong
1feee1846b server : maintain chat completion id for streaming responses (#5988)
* server: maintain chat completion id for streaming responses

* Update examples/server/utils.hpp

* Update examples/server/utils.hpp

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-03-11 10:09:32 +02:00
Georgi Gerganov
8259cb3547 server : refactor (#5882)
* server : refactoring (wip)

* server : remove llava/clip objects from build

* server : fix empty prompt handling + all slots idle logic

* server : normalize id vars

* server : code style

* server : simplify model chat template validation

* server : code style

* server : minor

* llama : llama_chat_apply_template support null buf

* server : do not process embedding requests when disabled

* server : reorganize structs and enums + naming fixes

* server : merge oai.hpp in utils.hpp

* server : refactor system prompt update at start

* server : disable cached prompts with self-extend

* server : do not process more than n_batch tokens per iter

* server: tests: embeddings use a real embeddings model (#5908)

* server, tests : bump batch to fit 1 embedding prompt

* server: tests: embeddings fix build type Debug is randomly failing (#5911)

* server: tests: embeddings, use different KV Cache size

* server: tests: embeddings, fixed prompt do not exceed n_batch, increase embedding timeout, reduce number of concurrent embeddings

* server: tests: embeddings, no need to wait for server idle as it can timout

* server: refactor: clean up http code (#5912)

* server : avoid n_available var

ggml-ci

* server: refactor: better http codes

* server : simplify json parsing + add comment about t_last

* server : rename server structs

* server : allow to override FQDN in tests

ggml-ci

* server : add comments

---------

Co-authored-by: Pierrick Hymbert <pierrick.hymbert@gmail.com>
2024-03-07 11:41:53 +02:00
Pierrick Hymbert
17a641c829 server: tests: passkey challenge / self-extend with context shift demo (#5832)
* server: tests: add models endpoint scenario

* server: /v1/models add some metadata

* server: tests: add debug field in context before scenario

* server: tests: download model from HF, add batch size

* server: tests: add passkey test

* server: tests: add group attention params

* server: do not truncate prompt tokens if self-extend through group attention is enabled

* server: logs: do not truncate log values

* server: tests - passkey - first good working value of nga

* server: tests: fix server timeout

* server: tests: fix passkey, add doc, fix regex content matching, fix timeout

* server: tests: fix regex content matching

* server: tests: schedule slow tests on master

* server: metrics: fix when no prompt processed

* server: tests: self-extend add llama-2-7B and Mixtral-8x7B-v0.1

* server: tests: increase timeout for completion

* server: tests: keep only the PHI-2 test

* server: tests: passkey add a negative test
2024-03-02 22:00:14 +01:00
Xuan Son Nguyen
f304cb4978 Server: normalize naming (#5779)
* server: normalize naming

* fix spacing
2024-02-29 21:42:11 +01:00
Pierrick Hymbert
6c8868e093 server: logs - unified format and --log-format option (#5700)
* server: logs - always use JSON logger, add add thread_id in message, log task_id and slot_id

* server : skip GH copilot requests from logging

* server : change message format of server_log()

* server : no need to repeat log in comment

* server : log style consistency

* server : fix compile warning

* server : fix tests regex patterns on M2 Ultra

* server: logs: PR feedback on log level

* server: logs: allow to choose log format in json or plain text

* server: tests: output server logs in text

* server: logs switch init logs to server logs macro

* server: logs ensure value json value does not raised error

* server: logs reduce level VERBOSE to VERB to max 4 chars

* server: logs lower case as other log messages

* server: logs avoid static in general

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: logs PR feedback: change text log format to: LEVEL [function_name] message | additional=data

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-25 13:50:32 +01:00
Pierrick Hymbert
7f6af033d4 server: concurrency fix + monitoring - add /metrics prometheus compatible endpoint (#5708)
* server: monitoring - add /metrics prometheus compatible endpoint

* server: concurrency issue, when 2 task are waiting for results, only one call thread is notified

* server: metrics - move to a dedicated struct
2024-02-25 13:49:43 +01:00
Pierrick Hymbert
bceef5565a server: health: fix race condition on slots data using tasks queue (#5634)
* server: health: fix race condition on slots data using tasks queue

* server: health:
    * include_slots only if slots_endpoint
    * fix compile warning task.target_id not initialized.
2024-02-21 15:47:48 +01:00
Xuan Son Nguyen
8410e862e7 Server: use llama_chat_apply_template (#5593)
* server: use llama_chat_apply_template

* server: remove trailing space

* server: fix format_chat

* server: fix help message

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>

* server: fix formatted_chat

---------

Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
2024-02-20 15:58:27 +01:00
Daniel Hiltgen
42c5518ad5 server : graceful server shutdown (#5244)
This updates the server queue to support graceful shutdown of the server on signals.
2024-02-18 18:23:16 +02:00
Xuan Son Nguyen
71993ba165 server : add llama2 chat template (#5425)
* server: add mistral chat template

* server: fix typo

* server: rename template mistral to llama2

* server: format_llama2: remove BOS

* server: validate "--chat-template" argument

* server: clean up using_chatml variable

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>

---------

Co-authored-by: Jared Van Bortel <cebtenzzre@gmail.com>
2024-02-11 12:16:22 +02:00
Georgi Gerganov
32e84d74f1 sync : ggml 2024-01-27 17:00:24 +02:00
Xuan Son Nguyen
88bd8be65e server : refactored the task processing logic (#5065)
* server: add llama_server_queue struct

* server: add llama_server_response_event

* server: add comments

* server: move all mutexes away from server.cpp

* server: correct multitask response

* server: only add back deferred tasks when one slot is available

* server: fix a race condition cause by "request_completion"
2024-01-26 14:42:20 +02:00