Mirror of https://github.com/theroyallab/tabbyAPI.git, synced 2026-04-28 02:01:24 +00:00
Model: Adjust max output len
The max output length should be hardcoded to 16, since it is the number of tokens predicted per forward pass. 16 is a good value for both normal inference and speculative decoding, and it also saves VRAM compared to the previous default of 2048.

Signed-off-by: kingbri <bdashore3@proton.me>
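To see why this saves VRAM, note that the output logits buffer scales with the output length times the vocabulary size. The sketch below is illustrative only: the vocab size and fp16 element size are assumptions for a typical Llama-style model, not values taken from this commit.

```python
# Illustrative sketch (not tabbyAPI code): rough size of the logits
# buffer as a function of max output length. Assumes a 32000-entry
# vocabulary and 2-byte (fp16) logits, which are hypothetical defaults.

def logits_buffer_bytes(max_output_len: int,
                        vocab_size: int = 32000,
                        bytes_per_value: int = 2) -> int:
    """Approximate logits buffer size for one forward pass."""
    return max_output_len * vocab_size * bytes_per_value

old = logits_buffer_bytes(2048)  # previous default
new = logits_buffer_bytes(16)    # value hardcoded by this commit
print(f"2048 tokens: {old / 2**20:.1f} MiB")  # ~125 MiB
print(f"  16 tokens: {new / 2**20:.1f} MiB")  # ~1 MiB
```

Under these assumed sizes, dropping from 2048 to 16 shrinks the buffer by roughly two orders of magnitude, while 16 still covers the tokens produced per forward pass during normal and speculative decoding.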
@@ -143,6 +143,10 @@ class ExllamaV2Container:
         # Make the max seq len 4096 before preparing the config
         # This is a better default than 2038
         self.config.max_seq_len = 4096
+
+        # Hardcode max output length to 16
+        self.config.max_output_len = 16
+
         self.config.prepare()
 
         # Then override the base_seq_len if present