Apply the default overrides after inline config has been merged.
Do not require an inline config to apply use_as_default and other
overrides.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous changes broke existing configs and max_seq_len was
force-overriden to 4096. This helps single-user setups since they
do not really benefit from the split cache_size max_seq_len mechanism
(except if batching).
cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
Breakdown:
cache_size and max_seq_len specified -> values
only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
use_as_default was not being properly applied into model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.
In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fallthrough.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
A common problem in TabbyAPI is that users who want to get up and
running with a model always had issues with max_seq_len causing OOMs.
This is because model devs set max context values in the millions which
requires a lot of VRAM.
To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.
Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the brunt that backend
devs need to feel when their next model isn't supported.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Seemed out of place in the common load function. In addition, rename
the transformers utils signature which actually takes a directory
instead of a file.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
One goal is to try migrating away from kwargs and use the ModelLoadRequest
instead. However, Pydantic doesn't support async validators making
parsing of the inline config impossible due to its use of aiofiles.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The admin takes priority over the regular user. Therefore, if a model
is loading, ignore all incoming generation requests
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The model_type internal reference was changed to an enum for
a more extendable loading process. Return the current model type
when loading a new model.
Signed-off-by: kingbri <bdashore3@proton.me>
Adds the ability to load vision parts of text + image models. Requires
an explicit flag in config because there isn't a way to automatically
determine whether the vision tower should be used.
Signed-off-by: kingbri <bdashore3@proton.me>
ExllamaV2 should check for solely exllamav2, otherwise errors don't
make sense. Migrate the combined "exl2" computed property to "inference"
since those are the required dependencies for minimal inference.
Signed-off-by: kingbri <bdashore3@proton.me>
There are two ways to load a model:
1. Via the load endpoint
2. Inline with a completion
The defaults were not applying on the inline load, so rewrite to fix
that. However, while doing this, set up a defaults dictionary rather
than comparing it at runtime and remove the pydantic default lambda
on all the model load fields.
This makes the code cleaner and establishes a clear config tree for
loading models.
Signed-off-by: kingbri <bdashore3@proton.me>
Using aiofiles, there's no longer a possiblity of blocking file operations
that can hang up the event loop. In addition, partially migrate
classes to use asynchronous init instead of the normal python magic method.
The only exception is config, since that's handled in the synchonous
init before the event loop starts.
Signed-off-by: kingbri <bdashore3@proton.me>
The config categories can have defined separation, but preserve
the dynamic nature of adding new config options by making all the
internal class vars as dictionaries.
This was necessary since storing global callbacks stored a state
of the previous global_config var that wasn't populated.
Signed-off-by: kingbri <bdashore3@proton.me>
Using "auto" for rope alpha removes ambiguity on how to explicitly
enable automatic rope calculation. The same behavior of None -> auto
calculate still exists, but can be overwritten if a model's tabby_config.yml
includes `rope_alpha`.
Signed-off-by: kingbri <bdashore3@proton.me>
It's best to pass them down the config stack.
API/User config.yml -> model config.yml -> model config.json -> fallback.
Doing this allows for seamless flow and yielding control to each
member in the stack.
Signed-off-by: kingbri <bdashore3@proton.me>
Storing a pathlib type makes it easier to manipulate the model
directory path in the long run without constantly fetching it
from the config.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, when a SIGINT was emitted and a model load is running,
the API didn't shut down until the load finished due to waitng for
the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.
Signed-off-by: kingbri <bdashore3@proton.me>
Embedding models are managed on a separate backend, but are run
in parallel with the model itself. Therefore, manage this in a separate
container with separate routes.
Signed-off-by: kingbri <bdashore3@proton.me>
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.
Signed-off-by: kingbri <bdashore3@proton.me>
Move OpenAPI export as an env var within the main function. This
allows for easy export by running main.
In addition, an env variable provides global and explicit state to
disable conditional wheel imports (ex. Exl2 and torch) which caused
errors at first.
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the parameters under the "model" block in config.yml only
handled the loading of a model on startup. This meant that any subsequent
API request required each parameter to be filled out or use a sane default
(usually defaults to the model's config.json).
However, there are cases where admins may want an argument from the
config to apply if the parameter isn't provided in the request body.
To help alleviate this, add a mechanism that works like sampler overrides
where users can specify a flag that acts as a fallback.
Therefore, this change both preserves the source of truth of what
parameters the admin is loading and adds some convenience for users
that want customizable defaults for their requests.
This behavior may change in the future, but I think it solves the
issue for now.
Signed-off-by: kingbri <bdashore3@proton.me>
Add a sequential lock and wait until jobs are completed before executing
any loading requests that directly alter the model. However, we also
need to block any new requests that come in until the load is finished,
so add a condition that triggers once the lock is free.
Signed-off-by: kingbri <bdashore3@proton.me>
If the model didn't load properly, the container still exists until
unload is called. However, the name check still registered as true.
Signed-off-by: kingbri <bdashore3@proton.me>
Run these iterators on the background thread. On startup, the API
spawns a background thread as needed to run sync code on without blocking
the event loop.
Use asyncio's run_thread function since it allows for errors to be
propegated.
Signed-off-by: kingbri <bdashore3@proton.me>