Uvicorn can log in both the request disconnect handler and the
CancelledError. However, these sometimes don't work and both
need to be checked. But, don't log twice if one works.
Signed-off-by: kingbri <bdashore3@proton.me>
Log all the parts of a request if the config flag is set. The logged
fields are all server side anyways, so nothing is being exposed to
clients.
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit 21516bd7b5.
This skips EOS and implementing it the proper way seems more
costly than necessary.
Signed-off-by: kingbri <bdashore3@proton.me>
This helps make the generation loop more efficient by skipping past
chunks that aren't providing any tokens anyways. The offset isn't
affected.
Signed-off-by: kingbri <bdashore3@proton.me>
Middleware runs on both the request and response. Therefore, streaming
responses had increased latency when processing tasks and sending
data to the client which resulted in erratic streaming behavior.
Use a depends to add request IDs since it only executes when the
request is run rather than expecting the response to be sent as well.
For the future, it would be best to think about limiting the time
between each tick of chunk data to be safe.
Signed-off-by: kingbri <bdashore3@proton.me>
Identify which request is being processed to help users disambiguate
which logs correspond to which request.
Signed-off-by: kingbri <bdashore3@proton.me>
Raise a 422 exception for the disconnect. This prevents pydantic
errors when returning a "response" which doesn't contain anything
in this case.
Signed-off-by: kingbri <bdashore3@proton.me>
If a job causes the generator to error, tabby stops working until
a relaunch. It's better to try establishing a system of redundancy
and remake the generator in the event that it fails.
May replace this with an exit signal for a fatal error instead, but
not sure.
Signed-off-by: kingbri <bdashore3@proton.me>
It's possible that tracebacks can give too much info about a system
when sent over the API. Gate this under a flag to send them only
when debugging since this feature is still useful.
Signed-off-by: kingbri <bdashore3@proton.me>
Place the logic into their proper utility functions and cleanup
the code with formatting.
Also, OAI's docs specify that a [DONE] return is needed when everything
is finished.
Signed-off-by: kingbri <bdashore3@proton.me>
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically anything that can be viewed
on the filesystem outside of anything that's currently loaded is
not allowed to be returned unless an admin key is present.
This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.
Signed-off-by: kingbri <bdashore3@proton.me>
Pass a request and internally unwrap the headers. In addition, allow
X-admin-key to get checked in an API key request.
Signed-off-by: kingbri <bdashore3@proton.me>
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.
In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.
Signed-off-by: kingbri <bdashore3@proton.me>
If flash attention is already turned off by exllamaV2 itself, don't
try creating a paged generator. Also condense all the redundant
logic into one if statement.
Also check arch_compat_overrides to see if flash attention should
be disabled for a model arch (ex. Gemma 2)
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).
However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.
Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.
This behavior may change in the future, but I think it solves the
issue for now.
Signed-off-by: kingbri <bdashore3@proton.me>