Files
tabbyAPI/config_sample.yml
kingbri 27ebec3b35 Model: Add speculative decoding support via config
Speculative decoding makes use of draft models that ingest the prompt
before forwarding it to the main model.

Add options in the config to support this. API options will occur
in a different commit.

Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-18 01:42:20 -05:00

62 lines
2.2 KiB
YAML

# Unless specified in the comments, DO NOT put these options in quotes!
# You can use https://www.yamllint.com/ if you want to check your YAML formatting.
# Options for networking
network:
# The IP to host on (default: 127.0.0.1).
# Use 0.0.0.0 to expose on all network adapters
host: 127.0.0.1
# The port to host on (default: 5000)
port: 5000
# Options for model overrides and loading
model:
# Overrides the directory to look for models (default: models)
# Windows users, DO NOT put this path in quotes! This directory will be invalid otherwise.
# model_dir: your model directory path
# An initial model to load. Make sure the model is located in the model directory!
# A model can be loaded later via the API.
# model_name: A model name
# Set the following to enable speculative decoding
# draft_model_dir: your model directory path to use as draft model (path is independent from model_dir)
# draft_rope_alpha: 1.0 (default: the draft model's alpha value is calculated automatically to scale to the size of the full model.)
# The below parameters apply only if model_name is set
# Maximum model context length (default: 4096)
max_seq_len: 4096
# Automatically allocate resources to GPUs (default: True)
gpu_split_auto: True
# An integer array of GBs of vram to split between GPUs (default: [])
# gpu_split: [20.6, 24]
# Rope scaling parameters (default: 1.0)
rope_scale: 1.0
rope_alpha: 1.0
# Disable Flash-attention 2. Set to True for GPUs lower than Nvidia's 3000 series. (default: False)
no_flash_attention: False
# Enable low vram optimizations in exllamav2 (default: False)
low_mem: False
# Enable 8 bit cache mode for VRAM savings (slight performance hit). Possible values FP16, FP8. (default: FP16)
# cache_mode: FP16
# Options for draft models (speculative decoding). This will use more VRAM!
# draft:
# Overrides the directory to look for draft (default: models)
# draft_model_dir: Your draft model directory path
# An initial draft model to load. Make sure this model is located in the model directory!
# A draft model can be loaded later via the API.
# draft_model_name: A model name
# Rope parameters for draft models (default: 1.0)
# draft_rope_alpha: 1.0