mirror of
https://github.com/theroyallab/tabbyAPI.git
synced 2026-03-15 00:07:28 +00:00
ExllamaV3: Handle max_seq_len defined and cache_size undefined case
The previous changes broke existing configs: max_seq_len was force-overridden to 4096. This helps single-user setups, since they do not really benefit from the split cache_size/max_seq_len mechanism (except when batching). cache_size is still the prime mover in exl3 due to its paging mechanism. Ideally, for multi-user setups, cache_size should take as much VRAM as possible and max_seq_len should be limited.

Breakdown:
- cache_size and max_seq_len specified -> use the given values
- only cache_size or max_seq_len specified -> max_seq_len = cache_size, and vice versa
- neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
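The breakdown above can be sketched as a small resolution function. This is an illustrative reading of the commit message, not tabbyAPI's actual code; the function name and signature are hypothetical.

```python
def resolve_cache_params(cache_size, max_seq_len, max_position_embeddings):
    """Illustrative sketch of the cache_size / max_seq_len fallback rules.

    Mirrors the breakdown in the commit message; names are assumptions.
    """
    if cache_size is None and max_seq_len is None:
        # Neither specified: default the cache, cap seq len at the
        # model's trained context (max_position_embeddings)
        cache_size = 4096
        max_seq_len = min(max_position_embeddings, cache_size)
    elif cache_size is None:
        # Only max_seq_len specified: cache covers exactly one context
        cache_size = max_seq_len
    elif max_seq_len is None:
        # Only cache_size specified: a single sequence may fill the cache
        max_seq_len = cache_size
    # Both specified: use the given values unchanged
    return cache_size, max_seq_len
```

For a multi-user deployment, the commit suggests the opposite tuning: set cache_size as large as VRAM allows and keep max_seq_len smaller, so the paged cache can serve several sequences at once.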
@@ -157,7 +157,7 @@ async def load_model_gen(model_path: pathlib.Path, **kwargs):

     # Override the max sequence length based on user
     max_seq_len = kwargs.get("max_seq_len")
-    if max_seq_len == -1 or max_seq_len is None:
+    if max_seq_len == -1:
         kwargs["max_seq_len"] = hf_model.hf_config.max_position_embeddings

     # Create a new container and check if the right dependencies are installed
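The effect of the changed branch can be shown in isolation: only an explicit -1 now expands to the model's max_position_embeddings, while an unset (None) max_seq_len passes through for the cache_size fallback to resolve. A minimal sketch, with a hypothetical helper name standing in for the load path:

```python
def apply_max_seq_len_override(kwargs, max_position_embeddings):
    """Sketch of the patched branch: -1 is the only sentinel that
    expands to the model's trained context; None is left untouched
    so the cache_size/max_seq_len fallback logic can handle it."""
    max_seq_len = kwargs.get("max_seq_len")
    if max_seq_len == -1:
        kwargs["max_seq_len"] = max_position_embeddings
    return kwargs
```

Before this commit, the `or max_seq_len is None` condition meant an unset value was also expanded here, which is what clobbered existing configs.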