Docs: Update

Update getting started and server options

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
kingbri
2025-03-12 00:40:15 -04:00
parent 4196bb6bc8
commit de77955428
2 changed files with 97 additions and 72 deletions


@@ -6,7 +6,7 @@ To get started, make sure you have the following installed on your system:
- Do NOT install Python from the Microsoft Store! This will cause issues with pip.
- Alternatively, you can use Miniconda or uv if either is already installed on your system.
> [!NOTE]
> Prefer a video guide? Watch the step-by-step tutorial on [YouTube](https://www.youtube.com/watch?v=03jYz0ijbUU)
@@ -40,16 +40,16 @@ To get started, make sure you have the following installed on your system:
### For Advanced Users
5. Follow steps 1-2 in the [For Beginners](#for-beginners) section
6. Create a python environment through venv:
1. Follow steps 1-2 in the [For Beginners](#for-beginners) section
2. Create a python environment through venv:
1. `python -m venv venv`
2. Activate the venv
1. On Windows: `.\venv\Scripts\activate`
2. On Linux: `source venv/bin/activate`
7. Install the pyproject features based on your system:
3. Install the pyproject features based on your system:
1. CUDA 12.x: `pip install -U .[cu121]`
2. ROCm 5.6: `pip install -U .[amd]`
8. Start the API by either
4. Start the API by either
1. Run `start.bat/sh`. The script will check if you're in a conda environment and skip venv checks.
2. Run `python main.py` to start the API. This won't automatically upgrade your dependencies.
@@ -78,19 +78,19 @@ You can also access the configuration parameters under [2. Configuration](https:
## Where next?
9. Take a look at the [usage docs](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage)
10. Get started with [community projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects): Find loaders, UIs, and more created by the wider AI community. Any OAI compatible client is also supported.
1. Take a look at the [usage docs](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage)
2. Get started with [community projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects): Find loaders, UIs, and more created by the wider AI community. Any OAI compatible client is also supported.
## Updating
There are a couple ways to update TabbyAPI:
11. **Update scripts** - Inside the update_scripts folder, you can run the following scripts:
1. **Update scripts** - Inside the update_scripts folder, you can run the following scripts:
1. `update_deps`: Updates dependencies to their latest versions.
2. `update_deps_and_pull`: Updates dependencies and pulls the latest commit of the Github repository.
These scripts exit after running their respective tasks. To start TabbyAPI, run `start.bat` or `start.sh`.
12. **Manual** - Install the pyproject features and update dependencies depending on your GPU:
2. **Manual** - Install the pyproject features and update dependencies depending on your GPU:
1. `pip install -U .[cu121]` = CUDA 12.x
2. `pip install -U .[amd]` = ROCm 6.0
@@ -113,11 +113,11 @@ NOTE:
Here are ways to install exllamav2:
13. From a [wheel/release](https://github.com/turboderp/exllamav2#method-2-install-from-release-with-prebuilt-extension) (Recommended)
1. From a [wheel/release](https://github.com/turboderp/exllamav2#method-2-install-from-release-with-prebuilt-extension) (Recommended)
1. Find the version that corresponds to your CUDA and Python versions. For example, a wheel with `cu121` and `cp311` corresponds to CUDA 12.1 and Python 3.11
14. From [pip](https://github.com/turboderp/exllamav2#method-3-install-from-pypi): `pip install exllamav2`
2. From [pip](https://github.com/turboderp/exllamav2#method-3-install-from-pypi): `pip install exllamav2`
2. This is a JIT-compiled extension, which means that the initial launch of tabbyAPI will take some time. The build may also fail if your environment isn't configured properly.
15. From [source](https://github.com/turboderp/exllamav2#method-1-install-from-source)
3. From [source](https://github.com/turboderp/exllamav2#method-1-install-from-source)
## Other installation methods
@@ -128,15 +128,15 @@ These are short-form instructions for other methods that users can use to instal
### Conda
16. Install [Miniconda3](https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html) with python 3.11 as your base python
17. Create a new conda environment `conda create -n tabbyAPI python=3.11`
18. Activate the conda environment `conda activate tabbyAPI`
19. Install optional dependencies if they aren't present
1. Install [Miniconda3](https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html) with Python 3.11 as your base Python
2. Create a new conda environment `conda create -n tabbyAPI python=3.11`
3. Activate the conda environment `conda activate tabbyAPI`
4. Install optional dependencies if they aren't present
1. CUDA via
1. CUDA 12 - `conda install -c "nvidia/label/cuda-12.4.1" cuda`
2. Git via `conda install -k git`
20. Clone TabbyAPI via `git clone https://github.com/theroyallab/tabbyAPI`
21. Continue installation steps from:
5. Clone TabbyAPI via `git clone https://github.com/theroyallab/tabbyAPI`
6. Continue installation steps from:
1. [For Beginners](#for-beginners) - Step 3. The start scripts detect if you're in a conda environment and skip the venv check.
2. [For Advanced Users](#for-advanced-users) - Step 3
@@ -170,4 +170,4 @@ volumes:
    # Comment this to build a docker image from source
    image: ghcr.io/theroyallab/tabbyapi:latest
```
7. Run `docker compose -f docker/docker-compose.yml up` to build the image and start the server.


@@ -1,4 +1,8 @@
## Server Options
> [!NOTE]
> If you want the latest config.yml docs, please look at the comments in `config_sample.yml`
TabbyAPI primarily uses a config.yml file to adjust various options. This is the preferred method and exposes every option TabbyAPI offers.
CLI arguments are also included, but they serve to *override* the options set in config.yml. Therefore, they behave a bit differently than in other programs, especially with booleans.
@@ -9,79 +13,100 @@ In addition, some config.yml options are too complex to represent as command arg
All of these options have descriptive comments above them. You should not need to reference this documentation page unless absolutely necessary.
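To make the override behavior concrete, here is a minimal, illustrative config.yml excerpt. The key names come from the tables below; the `network` block name is assumed from `config_sample.yml`, which remains the authoritative layout.
```yaml
# Minimal illustrative excerpt of a config.yml (not a complete file).
# The "network" block name is assumed; check config_sample.yml for the exact layout.
network:
  host: 127.0.0.1   # value used when no CLI override is given
  port: 5000
# Launching with a CLI flag such as `--port 5001` overrides the port above,
# while every other option keeps its config.yml value.
```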
### Networking Options
| Config Option | Type (Default) | Description |
|-----------------|------------------------|--------------------------------------------------------------|
| host | String (127.0.0.1) | Set the IP address used for hosting TabbyAPI |
| port | Int (5000) | Set the TCP Port use for TabbyAPI |
| disable_auth | Bool (False) | Disables API authentication |
| send_tracebacks | Bool (False) | Send server tracebacks to client.<br><br>Note: It's not recommended to enable this if sharing the instance with others. |
| api_servers | List[String] (["OAI"]) | API servers to enable. Possible values `"OAI", "Kobold"` |
### Logging Options
Note: With CLI args, all logging parameters are prefixed by `log-`. For example, `prompt` will be `--log-prompt true/false`.
| Config Option | Type (Default) | Description |
|-------------------|----------------|--------------------------------------------------------|
| prompt | Bool (False) | Logs prompts to the console |
| generation_params | Bool (False) | Logs request generation options to the console |
| requests | Bool (False) | Logs a request's URL, Body, and Headers to the console |
| Config Option | Type (Default) | Description |
| ---------------------- | ---------------------- | ----------------------------------------------------------------------------------------------------------------------- |
| host | String (127.0.0.1) | Set the IP address used for hosting TabbyAPI |
| port                   | Int (5000)             | Set the TCP port used for TabbyAPI                                                                                       |
| disable_auth           | Bool (False)           | Disables API authentication                                                                                              |
| disable_fetch_requests | Bool (False)           | Disables fetching external content when responding to requests (e.g. fetching images from URLs)                          |
| send_tracebacks        | Bool (False)           | Send server tracebacks to the client.<br><br>Note: It's not recommended to enable this if sharing the instance with others. |
| api_servers            | List[String] (["OAI"]) | API servers to enable. Possible values: `"OAI"`, `"Kobold"`                                                              |
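For reference, a networking block using these options might look like the following sketch (the `network` block name is assumed from `config_sample.yml`; values are illustrative):
```yaml
network:
  host: 0.0.0.0            # listen on all interfaces instead of the 127.0.0.1 default
  port: 5000
  disable_auth: false
  disable_fetch_requests: false
  send_tracebacks: false   # leave off when sharing the instance with others
  api_servers: ["OAI", "Kobold"]
```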
### Logging Options
| Config Option | Type (Default) | Description |
| --------------------- | -------------- | ------------------------------------------------------ |
| log_prompt | Bool (False) | Logs prompts to the console |
| log_generation_params | Bool (False) | Logs request generation options to the console |
| log_requests          | Bool (False)   | Logs a request's URL, body, and headers to the console |
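A minimal sketch of the logging block, assuming it is named `logging` as in `config_sample.yml`:
```yaml
logging:
  log_prompt: true             # print prompts to the console
  log_generation_params: false
  log_requests: false
```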
### Sampling Options
Note: This block is for sampling overrides, not samplers themselves.
| Config Option | Type (Default) | Description |
|-----------------|----------------|--------------------------------------------------------------|
| Config Option | Type (Default) | Description |
| --------------- | -------------- | ------------------------------------------------------------------------- |
| override_preset | String (None)  | Start up with the given sampler override preset from the sampler_overrides folder |
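Assuming the block is named `sampling` as in `config_sample.yml`, an override preset could be selected like this (the preset name is a placeholder):
```yaml
sampling:
  # Loads sampler_overrides/my_preset.yml at startup (file name is hypothetical)
  override_preset: my_preset
```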
### Developer Options
Note: These are experimental flags that may be removed at any point.
| Config Option | Type (Default) | Description |
|---------------------------|----------------|--------------------------------------------------------------|
| unsafe_launch | Bool (False) | Skips dependency checks on startup. Only recommended for debugging. |
| disable_request_streaming | Bool (False) | Forcefully disables streaming requests |
| cuda_malloc_backend | Bool (False) | Uses pytorch's CUDA malloc backend to load models. Helps save VRAM.<br><br>Safe to enable. |
| uvloop | Bool (False) | Use a faster asyncio event loop. Can increase performance.<br><br>Safe to enable. |
| Config Option | Type (Default) | Description |
| ------------------------- | -------------- | ----------------------------------------------------------------------------------------------------------------------------------------------- |
| unsafe_launch | Bool (False) | Skips dependency checks on startup. Only recommended for debugging. |
| disable_request_streaming | Bool (False) | Forcefully disables streaming requests |
| cuda_malloc_backend       | Bool (False)   | Uses PyTorch's CUDA malloc backend to load models. Helps save VRAM.<br><br>Safe to enable.                                                       |
| realtime_process_priority | Bool (False)   | Set the process priority to "Realtime". Administrator/sudo access is required; otherwise the priority is set as high as userland allows.         |
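An illustrative developer block, assuming the `developer` block name from `config_sample.yml`:
```yaml
developer:
  unsafe_launch: false
  disable_request_streaming: false
  cuda_malloc_backend: true      # safe to enable; can save VRAM
  realtime_process_priority: false
```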
### Model Options
Note: Most of the options here will only apply on initial model load/startup (ephemeral). They will not persist unless you add the option name to `use_as_default`.
| Config Option | Type (Default) | Description |
|-----------------------|-------------------|--------------------------------------------------------------|
| model_dir | String ("models") | Directory to look for models.<br><br>Note: Persisted across subsequent load requests |
| use_dummy_models | Bool (False) | Send a dummy OAI model card when calling the `/v1/models` endpoint. Used for clients which enforce specific OAI models.<br><br>Note: Persisted across subsequent load requests |
| model_name | String (None) | Folder name of a model to load. The below parameters will not apply unless this is filled out. |
| use_as_default | List[String] ([]) | Keys to use by default when loading models. For example, putting `cache_mode` in this array will make every model load with that value unless specified by the API request.<br><br>Note: Also applies to the `draft` sub-block |
| max_seq_len | Float (None) | Maximum sequence length of the model. Uses the value from config.json if not specified here. |
| override_base_seq_len | Float (None) | Overrides the base sequence length of a model. You probably don't want to use this. max_seq_len is better.<br><br>Note: This is only required for automatic RoPE alpha calculation AND if the model has an incorrect base sequence length (ex. Mistral 7b) |
| tensor_parallel | Bool (False) | Use tensor parallelism to load the model. This ignores the value of gpu_split_auto. |
| gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
| autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
| gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
| cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
| cache_size | Int (max_seq_len) | <br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
| chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
| max_batch_size | Int (None) | The absolute maximum amount of prompts to process at one time. This value is automatically adjusted based on cache size. |
| prompt_template | String (None) | Name of a jinja2 chat template to apply for this model. Must be located in the `templates` directory. |
| num_experts_per_token | Int (None) | Number of experts to use per-token for MoE models. Pulled from the config.json if not specified. |
| fasttensors | Bool (False) | Possibly increases model loading speeds. |
| Config Option | Type (Default) | Description |
| --------------------- | -------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| model_dir | String ("models") | Directory to look for models.<br><br>Note: Persisted across subsequent load requests |
| inline_model_loading | Bool (False) | Enables ability to switch models using the `model` argument in a generation request. More info in [Usage](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage#inline-loading) |
| use_dummy_models | Bool (False) | Send a dummy OAI model card when calling the `/v1/models` endpoint. Used for clients which enforce specific OAI models.<br><br>Note: Persisted across subsequent load requests |
| dummy_model_names | List[String] (["gpt-3.5-turbo"]) | List of dummy names to send on model endpoint requests |
| model_name | String (None) | Folder name of a model to load. The below parameters will not apply unless this is filled out. |
| use_as_default | List[String] ([]) | Keys to use by default when loading models. For example, putting `cache_mode` in this array will make every model load with that value unless specified by the API request.<br><br>Note: Also applies to the `draft` sub-block |
| max_seq_len | Float (None) | Maximum sequence length of the model. Uses the value from config.json if not specified here. Also called the max context length. |
| tensor_parallel | Bool (False) | Enables tensor parallelism. Automatically falls back to autosplit if GPU split isn't provided. <br><br>Note: `gpu_split_auto` is ignored when this is enabled. |
| gpu_split_auto | Bool (True) | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
| autosplit_reserve | List[Int] ([96]) | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
| gpu_split | List[Float] ([]) | Float array of GBs to split a model between GPUs. |
| rope_scale | Float (1.0) | Adjustment for rope scale (or compress_pos_emb) |
| rope_alpha | Float (None) | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
| cache_mode | String ("FP16") | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
| cache_size | Int (max_seq_len) | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
| chunk_size | Int (2048) | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
| max_batch_size | Int (None) | The absolute maximum amount of prompts to process at one time. This value is automatically adjusted based on cache size. |
| prompt_template | String (None) | Name of a jinja2 chat template to apply for this model. Must be located in the `templates` directory. |
| vision | Bool (False) | Enable vision support for the provided model (if it exists). |
| num_experts_per_token | Int (None) | Number of experts to use per-token for MoE models. Pulled from the config.json if not specified. |
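A sketch of a model block combining some of the common options above. The `model` block name is assumed from `config_sample.yml`, and the folder name and sequence length are placeholders:
```yaml
model:
  model_dir: models
  model_name: MyModel-exl2-4.0bpw   # placeholder folder inside model_dir
  max_seq_len: 8192                 # example value; defaults to config.json if unset
  cache_mode: Q4                    # quantized K/V cache to save VRAM
  gpu_split_auto: true
  use_as_default: ["cache_mode"]    # persist cache_mode across later load requests
```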
### Draft Model Options
Note: Sub-block of Model Options. Same rules apply.
| Config Option | Type (Default) | Description |
|------------------|-------------------|--------------------------------------------------------------|
| Config Option | Type (Default) | Description |
| ---------------- | ----------------- | ------------------------------------------------------------------------------------------ |
| draft_model_dir | String ("models") | Directory to look for draft models.<br><br>Note: Persisted across subsequent load requests |
| draft_model_name | String (None)     | Folder name of a draft model to load.                                                        |
| draft_rope_scale | Float (1.0)       | RoPE scale value for the draft model.                                                        |
| draft_rope_alpha | Float (1.0) | RoPE alpha value for the draft model. Leave blank for auto-calculation. |
| draft_cache_mode | String ("FP16") | Cache mode for the draft model.<br><br>Options: FP16, Q8, Q6, Q4 |
| draft_gpu_split | List[Float] ([]) | Float array of GBs to split a draft model between GPUs. |
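Following the note above that this is a sub-block of Model Options, a draft model could be configured roughly as below. The `draft` sub-block name is taken from the `use_as_default` note and should be checked against `config_sample.yml`; folder names are placeholders:
```yaml
model:
  model_name: MyModel-exl2-4.0bpw                # placeholder
  draft:                                         # sub-block name assumed; see config_sample.yml
    draft_model_dir: models
    draft_model_name: MyDraftModel-exl2-4.0bpw   # placeholder
    draft_cache_mode: FP16
```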
### Lora Options
Note: Sub-block of Model Options. Same rules apply.
| Config Option | Type (Default) | Description |
|---------------|------------------|--------------------------------------------------------------|
| lora_dir | String ("loras") | Directory to look for loras.<br><br>Note: Persisted across subsequent load requests |
| loras | List[loras] ([]) | List of lora objects to apply to the model. Each object contains a name and scaling. |
| name | String (None) | Folder name of a lora to load.<br><br>Note: An element of the `loras` key |
| scaling       | Float (1.0)      | Weight with which the lora is applied to the parent model. For example, a scaling of 0.9 applies the lora at 90% strength.<br><br>Note: An element of the `loras` key |
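A sketch of the lora sub-block, assuming it is named `lora` under the model block (check `config_sample.yml`); the lora name is a placeholder:
```yaml
model:
  lora:                        # sub-block name assumed; see config_sample.yml
    lora_dir: loras
    loras:
      - name: my-style-lora    # placeholder folder inside lora_dir
        scaling: 0.9           # apply at 90% strength
```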
### Embeddings Options
Note: Most of the options here will only apply on initial embedding model load/startup (ephemeral).
| Config Option | Type (Default) | Description |
|----------------------|-------------------|--------------------------------------------------------------|
| embedding_model_dir | String ("models") | Directory to look for embedding models.<br><br>Note: Persisted across subsequent load requests |
| Config Option | Type (Default) | Description |
| -------------------- | ----------------- | ---------------------------------------------------------------------------------------------------------------------------- |
| embedding_model_dir | String ("models") | Directory to look for embedding models.<br><br>Note: Persisted across subsequent load requests |
| embeddings_device | String ("cpu") | Device to load an embedding model on.<br><br>Options: cpu, cuda, auto<br><br>Note: Persisted across subsequent load requests |
| embedding_model_name | String (None) | Folder name of an embedding model to load using infinity-emb. |
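Finally, an embeddings block sketch (the `embeddings` block name is assumed from `config_sample.yml`; the model name is a placeholder):
```yaml
embeddings:
  embedding_model_dir: models
  embeddings_device: cpu                        # cpu, cuda, or auto
  embedding_model_name: my-embedding-model      # placeholder, loaded via infinity-emb
```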