ExllamaV3: Handle max_seq_len defined and cache_size undefined case

The previous changes broke existing configs: max_seq_len was
force-overridden to 4096. This fix helps single-user setups, since they
do not really benefit from the split cache_size/max_seq_len mechanism
(except when batching).

cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
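
For example (illustrative numbers, not taken from this commit): with a paged
cache, roughly cache_size // max_seq_len full-length requests can be resident
in the cache at the same time.

    # Illustrative values only; the exact sizing depends on VRAM and quant.
    cache_size = 65536   # total cached tokens shared across all requests
    max_seq_len = 8192   # per-request context limit
    concurrent_full_length_requests = cache_size // max_seq_len  # -> 8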

Breakdown:
- cache_size and max_seq_len both specified -> use the given values
- only one of cache_size/max_seq_len specified -> the missing one defaults
  to the other (max_seq_len = cache_size and vice versa)
- neither specified -> cache_size = 4096,
  max_seq_len = min(max_position_embeddings, cache_size)
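
A minimal sketch of this resolution logic (the helper name and structure are
illustrative, not the exact code in this commit):

    def resolve_cache_and_seq_len(cache_size, max_seq_len, max_position_embeddings):
        if cache_size is None and max_seq_len is None:
            # Neither specified: default the cache to 4096 tokens and clamp
            # max_seq_len to what the model supports.
            cache_size = 4096
            max_seq_len = min(max_position_embeddings, cache_size)
        elif cache_size is None:
            # Only max_seq_len specified: size the cache to match it.
            cache_size = max_seq_len
        elif max_seq_len is None:
            # Only cache_size specified: let one request use the whole cache.
            max_seq_len = cache_size
        return cache_size, max_seq_len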

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kingbri committed 2025-10-14 21:48:36 -04:00
parent 04ca346732
commit 62e9fa217a
2 changed files with 24 additions and 10 deletions


@@ -157,7 +157,7 @@ async def load_model_gen(model_path: pathlib.Path, **kwargs):
     # Override the max sequence length based on user
     max_seq_len = kwargs.get("max_seq_len")
-    if max_seq_len == -1 or max_seq_len is None:
+    if max_seq_len == -1:
         kwargs["max_seq_len"] = hf_model.hf_config.max_position_embeddings
     # Create a new container and check if the right dependencies are installed