Apply the default overrides after inline config has been merged.
Do not require an inline config to apply use_as_default and other
overrides.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The default is the minimum between max_position_embeddings and cache_size.
On AMD and older than Ampere NVIDIA GPUs, cache_size is ignored due
to not being supported by batching on exl2.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Also remove the intermediate base_seq_len and target_seq_len variables
to make code clearer.
If paged mode is off, max_seq_len becomes the prime mover since batching
is unavailable.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Since cache_size is a more important parameter now for multi-user
setups, mark it as such by placing it below max_seq_len.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous changes broke existing configs and max_seq_len was
force-overriden to 4096. This helps single-user setups since they
do not really benefit from the split cache_size max_seq_len mechanism
(except if batching).
cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
Breakdown:
cache_size and max_seq_len specified -> values
only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
In Windows, checking for a command yields a FileNotFound error if
the utility isn't found. This led to complicated logic which can
be solved by using which instead.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
UV is now supported as first-party in tabbyAPI's start script, so
add a dedicated section to it and recommend over miniconda.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Uv is the definitive package installation tool for Python, so add
support to check for it via the start script.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This allows for users to use nccl or native depending on the GPU setup.
NCCL is only available with Linux built wheels.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Works on cuda 12.4 and up. If CUDA doesn't exist, then don't enable
the backend. This is an env var that needs to be set, so it's not really
possible to set it via config.yml.
This used to be experimental, but it's probably fine to keep it enabled
since it only provides a benefit.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Adding these to each generation chunk helps remove redundancy and
unecessary request ID operations.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>