Model: Fix chunk size handling

Use the correct class attribute name for max_attention_size and fix
the declaration of the draft model's chunk_size.

Also expose the parameter to the end user in both config and model
load.
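
As a rough illustration of the "model load" side, a client could pass the new parameter in its load request payload. This is a hypothetical sketch: the payload key names below are assumptions based on this commit's description, not a verified API reference.

```python
# Hypothetical model-load payload including the newly exposed chunk_size.
# "name" and "chunk_size" keys are assumed from this commit's description.
payload = {
    "name": "my-model",   # placeholder model directory name
    "chunk_size": 1024,   # lower values reduce VRAM use during prompt ingestion
}

# The payload would then be sent to the server's model load endpoint,
# e.g. via an HTTP POST (omitted here, since the exact route is not shown
# in this commit).
```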

Signed-off-by: kingbri <bdashore3@proton.me>
Author: kingbri
Date: 2024-04-07 18:10:50 -04:00
parent 30c4554572
commit d759a15559
3 changed files with 14 additions and 7 deletions

@@ -107,6 +107,10 @@ model:
# Possible values FP16, FP8, Q4. (default: FP16)
#cache_mode: FP16
# Chunk size for prompt ingestion. A lower value reduces VRAM usage at the cost of ingestion speed (default: 2048)
# NOTE: Effects vary depending on the model. An ideal value is between 512 and 4096
#chunk_size: 2048
# Set the prompt template for this model. If empty, attempts to look for the model's chat template. (default: None)
# If a model contains multiple templates in its tokenizer_config.json, set prompt_template to the name
# of the template you want to use.
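
Putting the new option to use, a config.yml that trades ingestion speed for lower VRAM usage might uncomment and lower the value. This is a sketch assuming the surrounding `model:` block shown in the diff above; the specific value 1024 is an illustrative choice within the suggested 512–4096 range, not a recommendation from this commit.

```yaml
model:
  # Keep the cache in FP16 (the default shown above)
  cache_mode: FP16

  # Halve the default chunk size to reduce VRAM usage during
  # prompt ingestion, at the cost of slower ingestion.
  chunk_size: 1024
```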