Use the standard "dict.get(key) or default" idiom to fetch values
from kwargs with a fallback value, avoiding possible KeyErrors.
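A minimal sketch of the idiom (parameter names are illustrative, not
taken from the codebase):

```python
# .get() returns None when the key is absent; `or` then supplies the default.
def load_params(**kwargs):
    gpu_split = kwargs.get("gpu_split") or "auto"
    max_seq_len = kwargs.get("max_seq_len") or 4096
    return gpu_split, max_seq_len
```

One caveat: `or` also replaces falsy values such as 0 or "", so an
explicit `max_seq_len=0` falls back to the default as well.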
Signed-off-by: kingbri <bdashore3@proton.me>
Low_mem doesn't work in exl2, and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.
A better alternative is the 8-bit cache, which works and helps save
VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
* Enable automatic calculation of NTK-aware alpha scaling when the rope_alpha arg is not passed in the config, using the same formula used for draft models
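A sketch of the calculation; the quadratic coefficients below are an
assumption based on the draft-model formula this change mirrors, so
verify them against the repo before relying on them:

```python
def auto_rope_alpha(max_seq_len: int, base_seq_len: int) -> float:
    """Estimate NTK-aware RoPE alpha from the context-length ratio.

    Coefficients are an assumption (mirroring the draft-model formula);
    names here are illustrative.
    """
    ratio = max_seq_len / base_seq_len
    # No scaling is needed when the requested context fits the base context.
    if ratio <= 1.0:
        return 1.0
    return -0.13436 + 0.80541 * ratio + 0.28833 * ratio ** 2
```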
sse_starlette kept firing ping responses when setting an event took
too long. Rather than using a hacky workaround, switch to FastAPI's
built-in streaming response and construct SSE messages with a utility
function.
This helps make the API more robust and removes an extra requirement.
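The utility function can be as small as wrapping a JSON payload in the
SSE wire format (a "data:" field terminated by a blank line); the
function name here is hypothetical:

```python
def get_sse_packet(json_data: str) -> str:
    """Wrap a JSON payload in the SSE wire format: a `data:` field
    followed by a blank line that terminates the event."""
    return f"data: {json_data}\n\n"
```

A FastAPI route would then return
`StreamingResponse(generator(), media_type="text/event-stream")`, where
the generator yields `get_sse_packet(...)` strings.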
Signed-off-by: kingbri <bdashore3@proton.me>
This reverts commit cad144126f.
Change this parameter back to repetition_decay. This is different from
rep_pen_slope used in other backends such as Kobold and NAI.
Still keep the fallback condition, though.
Signed-off-by: kingbri <bdashore3@proton.me>
Unlike other backends, tabby attempts to generate even if the context
is greater than the max sequence length, truncating the given context.
Rather than artificially erroring out, warn that the console metrics
output will be incorrect and that the user should ensure
context <= max_seq_len.
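The check could look something like this sketch (the real check lives
in the generation path; names here are illustrative):

```python
import logging

logger = logging.getLogger(__name__)

def check_context(prompt_tokens: int, max_seq_len: int) -> bool:
    """Warn instead of erroring when the prompt exceeds max_seq_len.

    Returns True when the context fits.
    """
    if prompt_tokens > max_seq_len:
        logger.warning(
            "Context length %d exceeds max_seq_len %d; the prompt will be "
            "truncated and the reported console metrics will be inaccurate.",
            prompt_tokens, max_seq_len,
        )
        return False
    return True
```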
Signed-off-by: kingbri <bdashore3@proton.me>
Documented in previous commits. Also, for version checking, check the
value in kwargs instead of whether the key is present, since requests
pass default values.
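The distinction in a minimal sketch (function names are illustrative):

```python
# Checking membership treats an explicit default (e.g. stream=False) as
# "enabled"; checking the value does not.
def uses_streaming_by_key(**kwargs) -> bool:
    return "stream" in kwargs          # wrong: True even for stream=False

def uses_streaming_by_value(**kwargs) -> bool:
    return bool(kwargs.get("stream"))  # right: respects the default value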
Signed-off-by: kingbri <bdashore3@proton.me>
Model: Add extra information to the printout and fix the
divide-by-zero error.
Auth: Fix validation of API and admin keys to compare the entire key.
References #7 and #6
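A guard along these lines fixes the divide-by-zero when a generation
finishes within timer resolution (sketch; names are illustrative):

```python
def tokens_per_second(new_tokens: int, elapsed: float) -> float:
    """Compute generation speed, guarding against a zero elapsed time."""
    if elapsed <= 0:
        return 0.0
    return new_tokens / elapsed
```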
Signed-off-by: kingbri <bdashore3@proton.me>
Speculative decoding makes use of draft models that ingest the prompt
before forwarding it to the main model.
Add options in the config to support this. API options will occur
in a different commit.
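The config shape might look something like the following sketch; the
section and key names are assumptions, not necessarily the ones this
commit adds:

```yaml
# Hypothetical draft-model section in config.yml
draft:
  draft_model_name: TinyLlama-1.1B-exl2  # small model that ingests the prompt first
  draft_rope_alpha: 1.0                  # RoPE alpha for the draft model
```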
Signed-off-by: kingbri <bdashore3@proton.me>
Add the EOS token to the stop strings after checking kwargs. If
ban_eos_token is on, don't add the EOS token, for good measure.
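A sketch of the logic; the argument names mirror the commit message,
the rest is illustrative:

```python
def build_stop_strings(eos_token: str, **kwargs) -> list:
    """Append the EOS token to the stop strings unless ban_eos_token is set."""
    stop_strings = list(kwargs.get("stop") or [])
    # A banned EOS token can never be sampled, so there is no point
    # matching it as a stop string either.
    if not kwargs.get("ban_eos_token") and eos_token not in stop_strings:
        stop_strings.append(eos_token)
    return stop_strings
```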
Signed-off-by: kingbri <bdashore3@proton.me>
Models can be loaded and unloaded via the API. Also add authentication
for API usage and for administrator tasks.
The two types of authorization use different keys.
Also fix the unload function to properly free all used VRAM.
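Freeing VRAM reliably means dropping every reference before asking the
allocator to release its cache. A sketch, where `container` and its
attributes are illustrative:

```python
import gc

def unload(container) -> None:
    """Drop all references to the loaded model, then reclaim memory."""
    container.model = None
    container.cache = None
    gc.collect()  # collect the now-unreferenced tensors
    try:
        import torch
        torch.cuda.empty_cache()  # return cached CUDA blocks to the driver
    except ImportError:
        pass  # CPU-only environment; nothing extra to free
```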
Signed-off-by: kingbri <bdashore3@proton.me>
The models endpoint fetches all the models that OAI has to offer.
However, since this is an OAI clone, just list the models inside
the user's configured model directory instead.
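Since each exl2 model lives in its own folder, listing the configured
directory's subdirectories is enough. A sketch (names illustrative):

```python
from pathlib import Path

def list_models(model_dir: str) -> list:
    """Return the model names in the user's configured model directory."""
    return sorted(p.name for p in Path(model_dir).iterdir() if p.is_dir())
```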
Signed-off-by: kingbri <bdashore3@proton.me>
Add support for /v1/completions with the option to use streaming
if needed. Also rewrite API endpoints to use async where possible,
since that improves request performance.
Model container parameter names also needed rewriting, and fallback
cases are now set to their disabled values.
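The streaming path can be sketched as an async generator that yields
SSE-formatted chunks; a FastAPI route would wrap it in
`StreamingResponse(..., media_type="text/event-stream")`. Names here
are illustrative:

```python
import asyncio
import json

async def stream_completion(chunks):
    """Async generator yielding SSE-formatted completion chunks."""
    for text in chunks:
        # Yield control back to the event loop between chunks so other
        # requests can make progress.
        await asyncio.sleep(0)
        yield f"data: {json.dumps({'text': text})}\n\n"
```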
Signed-off-by: kingbri <bdashore3@proton.me>
YAML is a more flexible format when it comes to configuration.
Command-line arguments are difficult to remember and configure,
especially for an API with complicated command-line names. Rather than
using half-baked text files, implement a proper config solution.
Also add a progress bar when loading models from the command line.
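The config file might be laid out something like this sketch; section
and key names are assumptions, not taken from this commit:

```yaml
# Hypothetical config.yml sketch
network:
  host: 127.0.0.1
  port: 5000
model:
  model_dir: models          # folder containing exl2 model subdirectories
  model_name: my-model-exl2  # model to load at startup
```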
Signed-off-by: kingbri <bdashore3@proton.me>
Use command-line arguments to load an initial model if necessary.
API routes are broken, but we should use the container from now on
as the primary interface to the exllama2 library.
Also, these args should be turned into a YAML configuration file in
the future.
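A sketch of the argument parsing with stdlib argparse; the flag names
are illustrative (and, per the note above, were later folded into a
YAML config):

```python
import argparse

def make_parser() -> argparse.ArgumentParser:
    """Command-line args for loading an initial model (hypothetical flags)."""
    parser = argparse.ArgumentParser(description="Load an initial model")
    parser.add_argument("--model-name", help="model folder to load at startup")
    parser.add_argument("--max-seq-len", type=int, default=4096)
    parser.add_argument("--gpu-split", help='e.g. "16,24"; omit for auto')
    return parser
```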
Signed-off-by: kingbri <bdashore3@proton.me>