Model: Adjust max output len

Max output len should be hardcoded to 16, since it is the number of
tokens predicted per forward pass. 16 works well for both normal
inference and speculative decoding, and it saves VRAM compared to the
previous default of 2048.
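As a rough sketch of why this saves VRAM: the logits buffer scales with max_output_len, since logits over the full vocabulary are materialized for each output position. The function below is illustrative only (not tabbyAPI or exllamav2 code); the vocab size of 32000 and fp16 storage are assumptions for the arithmetic.

```python
# Hypothetical estimate of the logits buffer that max_output_len sizes.
# vocab_size=32000 and 2-byte (fp16) values are illustrative assumptions.
def logits_buffer_bytes(max_output_len: int, vocab_size: int = 32000,
                        bytes_per_value: int = 2) -> int:
    """One logit per vocab entry is stored for each output position."""
    return max_output_len * vocab_size * bytes_per_value

old = logits_buffer_bytes(2048)  # previous default
new = logits_buffer_bytes(16)    # value hardcoded by this commit
print(f"2048 -> {old / 2**20:.1f} MiB, 16 -> {new / 2**20:.1f} MiB")
```

Under these assumptions the buffer shrinks by a factor of 128, from roughly 125 MiB to under 1 MiB.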

Signed-off-by: kingbri <bdashore3@proton.me>
kingbri
2024-03-18 22:10:00 -04:00
parent 2704ff8344
commit 4f75fb5588


@@ -143,6 +143,10 @@ class ExllamaV2Container:
 # Make the max seq len 4096 before preparing the config
 # This is a better default than 2038
 self.config.max_seq_len = 4096
+# Hardcode max output length to 16
+self.config.max_output_len = 16
 self.config.prepare()
 # Then override the base_seq_len if present