Model: Adjust max output len

Max output len should be hardcoded to 16, since it is the number of
tokens predicted per forward pass. 16 works well for both normal
inference and speculative decoding, and it saves VRAM compared to the
previous default of 2048.
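As a rough sketch of why this saves VRAM: the logits buffer scales with max_output_len, since logits over the full vocabulary are materialized for each output position. The function below is illustrative only (not tabbyAPI or exllamav2 code); the vocab size of 32000 and fp16 storage are assumptions for the arithmetic.

```python
# Hypothetical estimate of the logits buffer that max_output_len sizes.
# vocab_size=32000 and 2-byte (fp16) values are illustrative assumptions.
def logits_buffer_bytes(max_output_len: int, vocab_size: int = 32000,
                        bytes_per_value: int = 2) -> int:
    """One logit per vocab entry is stored for each output position."""
    return max_output_len * vocab_size * bytes_per_value

old = logits_buffer_bytes(2048)  # previous default
new = logits_buffer_bytes(16)    # value hardcoded by this commit
print(f"2048 -> {old / 2**20:.1f} MiB, 16 -> {new / 2**20:.1f} MiB")
```

Under these assumptions the buffer shrinks by a factor of 128, from roughly 125 MiB to under 1 MiB.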

Signed-off-by: kingbri <bdashore3@proton.me>
kingbri
2024-03-18 22:10:00 -04:00
parent 2704ff8344
commit 4f75fb5588


@@ -143,6 +143,10 @@ class ExllamaV2Container:
 # Make the max seq len 4096 before preparing the config
 # This is a better default than 2038
 self.config.max_seq_len = 4096
+# Hardcode max output length to 16
+self.config.max_output_len = 16
 self.config.prepare()
 # Then override the base_seq_len if present