Also remove the intermediate base_seq_len and target_seq_len variables
to make code clearer.
If paged mode is off, max_seq_len becomes the prime mover since batching
is unavailable.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous changes broke existing configs and max_seq_len was
force-overriden to 4096. This helps single-user setups since they
do not really benefit from the split cache_size max_seq_len mechanism
(except if batching).
cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
Breakdown:
cache_size and max_seq_len specified -> values
only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
This allows for users to use nccl or native depending on the GPU setup.
NCCL is only available with Linux built wheels.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Adding these to each generation chunk helps remove redundancy and
unecessary request ID operations.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
A common problem in TabbyAPI is that users who want to get up and
running with a model always had issues with max_seq_len causing OOMs.
This is because model devs set max context values in the millions which
requires a lot of VRAM.
To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function to check the installed package and then report
to the user to upgrade.
This is also sent to requests for loading and unloading, so keep the
error short.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.
Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the brunt that backend
devs need to feel when their next model isn't supported.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This parameter is way too confusing and does not make sense in
the modern LLM space.
Change approved by all maintainers.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>