ExllamaV3: Handle max_seq_len defined and cache_size undefined case

The previous changes broke existing configs: max_seq_len was
force-overridden to 4096. This fix helps single-user setups, since they
do not really benefit from the split cache_size/max_seq_len mechanism
(except when batching).

cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
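
For example (illustrative numbers, not taken from this commit): with a paged
cache, roughly cache_size // max_seq_len full-length requests can be resident
in the cache at the same time.

    # Illustrative values only; the exact sizing depends on VRAM and quant.
    cache_size = 65536   # total cached tokens shared across all requests
    max_seq_len = 8192   # per-request context limit
    concurrent_full_length_requests = cache_size // max_seq_len  # -> 8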

Breakdown:
- cache_size and max_seq_len both specified -> use the given values
- only one of cache_size/max_seq_len specified -> the missing one defaults
  to the other (max_seq_len = cache_size and vice versa)
- neither specified -> cache_size = 4096,
  max_seq_len = min(max_position_embeddings, cache_size)
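
A minimal sketch of this resolution logic (the helper name and structure are
illustrative, not the exact code in this commit):

    def resolve_cache_and_seq_len(cache_size, max_seq_len, max_position_embeddings):
        if cache_size is None and max_seq_len is None:
            # Neither specified: default the cache to 4096 tokens and clamp
            # max_seq_len to what the model supports.
            cache_size = 4096
            max_seq_len = min(max_position_embeddings, cache_size)
        elif cache_size is None:
            # Only max_seq_len specified: size the cache to match it.
            cache_size = max_seq_len
        elif max_seq_len is None:
            # Only cache_size specified: let one request use the whole cache.
            max_seq_len = cache_size
        return cache_size, max_seq_len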

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kingbri committed 2025-10-14 21:48:36 -04:00
parent 04ca346732
commit 62e9fa217a
2 changed files with 24 additions and 10 deletions


@@ -157,7 +157,7 @@ async def load_model_gen(model_path: pathlib.Path, **kwargs):
     # Override the max sequence length based on user
     max_seq_len = kwargs.get("max_seq_len")
-    if max_seq_len == -1 or max_seq_len is None:
+    if max_seq_len == -1:
         kwargs["max_seq_len"] = hf_model.hf_config.max_position_embeddings
     # Create a new container and check if the right dependencies are installed