Model: Fix chunk size handling

Use the correct class attribute name for max_attention_size and fix
the declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.
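
As a rough illustration of the "model load" side, a client could pass the new parameter in its load request payload. This is a hypothetical sketch: the payload key names below are assumptions based on this commit's description, not a verified API reference.

```python
# Hypothetical model-load payload including the newly exposed chunk_size.
# "name" and "chunk_size" keys are assumed from this commit's description.
payload = {
    "name": "my-model",   # placeholder model directory name
    "chunk_size": 1024,   # lower values reduce VRAM use during prompt ingestion
}

# The payload would then be sent to the server's model load endpoint,
# e.g. via an HTTP POST (omitted here, since the exact route is not shown
# in this commit).
```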

Signed-off-by: kingbri <bdashore3@proton.me>
Author: kingbri
Date: 2024-04-07 18:10:50 -04:00
parent 30c4554572
commit d759a15559
3 changed files with 14 additions and 7 deletions

@@ -107,6 +107,10 @@ model:
# Possible values FP16, FP8, Q4. (default: FP16)
#cache_mode: FP16
# Chunk size for prompt ingestion. A lower value reduces VRAM usage at the cost of ingestion speed (default: 2048)
# NOTE: Effects vary depending on the model. An ideal value is between 512 and 4096
#chunk_size: 2048
# Set the prompt template for this model. If empty, attempts to look for the model's chat template. (default: None)
# If a model contains multiple templates in its tokenizer_config.json, set prompt_template to the name
# of the template you want to use.
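
Putting the new option to use, a config.yml that trades ingestion speed for lower VRAM usage might uncomment and lower the value. This is a sketch assuming the surrounding `model:` block shown in the diff above; the specific value 1024 is an illustrative choice within the suggested 512–4096 range, not a recommendation from this commit.

```yaml
model:
  # Keep the cache in FP16 (the default shown above)
  cache_mode: FP16

  # Halve the default chunk size to reduce VRAM usage during
  # prompt ingestion, at the cost of slower ingestion.
  chunk_size: 1024
```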