Docs: Update
Update getting started and server options

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
@@ -6,7 +6,7 @@ To get started, make sure you have the following installed on your system:

- Do NOT install python from the Microsoft store! This will cause issues with pip.
- Alternatively, you can use miniconda or uv if it's present on your system.

> [!NOTE]
> Prefer a video guide? Watch the step-by-step tutorial on [YouTube](https://www.youtube.com/watch?v=03jYz0ijbUU)

@@ -40,16 +40,16 @@ To get started, make sure you have the following installed on your system:

### For Advanced Users

1. Follow steps 1-2 in the [For Beginners](#for-beginners) section
2. Create a python environment through venv:
   1. `python -m venv venv`
   2. Activate the venv
      1. On Windows: `.\venv\Scripts\activate`
      2. On Linux: `source venv/bin/activate`
3. Install the pyproject features based on your system:
   1. Cuda 12.x: `pip install -U .[cu121]`
   2. ROCm 5.6: `pip install -U .[amd]`
4. Start the API by either
   1. Run `start.bat/sh`. The script will check if you're in a conda environment and skip venv checks.
   2. Run `python main.py` to start the API. This won't automatically upgrade your dependencies.

@@ -78,19 +78,19 @@ You can also access the configuration parameters under [2. Configuration](https:

## Where next?

1. Take a look at the [usage docs](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage)
2. Get started with [community projects](https://github.com/theroyallab/tabbyAPI/wiki/09.-Community-Projects): Find loaders, UIs, and more created by the wider AI community. Any OAI compatible client is also supported.

## Updating

There are a couple ways to update TabbyAPI:

1. **Update scripts** - Inside the update_scripts folder, you can run the following scripts:
   1. `update_deps`: Updates dependencies to their latest versions.
   2. `update_deps_and_pull`: Updates dependencies and pulls the latest commit of the Github repository.

   These scripts exit after running their respective tasks. To start TabbyAPI, run `start.bat` or `start.sh`.

2. **Manual** - Install the pyproject features and update dependencies depending on your GPU:
   1. `pip install -U .[cu121]` = CUDA 12.x
   2. `pip install -U .[amd]` = ROCm 6.0

@@ -113,11 +113,11 @@ NOTE:

Here are ways to install exllamav2:

1. From a [wheel/release](https://github.com/turboderp/exllamav2#method-2-install-from-release-with-prebuilt-extension) (Recommended)
   1. Find the version that corresponds with your cuda and python version. For example, a wheel with `cu121` and `cp311` corresponds to CUDA 12.1 and python 3.11
2. From [pip](https://github.com/turboderp/exllamav2#method-3-install-from-pypi): `pip install exllamav2`
   1. This is a JIT compiled extension, which means that the initial launch of tabbyAPI will take some time. The build may also not work due to improper environment configuration.
3. From [source](https://github.com/turboderp/exllamav2#method-1-install-from-source)

## Other installation methods

@@ -128,15 +128,15 @@ These are short-form instructions for other methods that users can use to instal

### Conda

1. Install [Miniconda3](https://docs.conda.io/projects/miniconda/en/latest/miniconda-other-installer-links.html) with python 3.11 as your base python
2. Create a new conda environment: `conda create -n tabbyAPI python=3.11`
3. Activate the conda environment: `conda activate tabbyAPI`
4. Install optional dependencies if they aren't present
   1. CUDA via
      1. CUDA 12 - `conda install -c "nvidia/label/cuda-12.4.1" cuda`
   2. Git via `conda install -k git`
5. Clone TabbyAPI via `git clone https://github.com/theroyallab/tabbyAPI`
6. Continue installation steps from:
   1. [For Beginners](#for-beginners) - Step 3. The start scripts detect if you're in a conda environment and skip the venv check.
   2. [For Advanced Users](#For-advanced-users) - Step 3

@@ -170,4 +170,4 @@ volumes:

```yaml
# Comment this to build a docker image from source
image: ghcr.io/theroyallab/tabbyapi:latest
```

7. Run `docker compose -f docker/docker-compose.yml up` to build the dockerfile and start the server.

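For orientation, here is a minimal sketch of what a compose file along these lines can look like. This is not the repository's actual `docker/docker-compose.yml`; the service name, port, and volume paths are illustrative assumptions, and GPU passthrough settings are omitted.

```yaml
# Illustrative sketch only - edit the real file at docker/docker-compose.yml instead.
services:
  tabbyapi:                                    # assumed service name
    # Comment this to build a docker image from source
    image: ghcr.io/theroyallab/tabbyapi:latest
    ports:
      - "5000:5000"                            # assumes the default TabbyAPI port
    volumes:
      - ./models:/app/models                   # assumed mount point for local models
      - ./config.yml:/app/config.yml           # assumed config location inside the container
```
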
@@ -1,4 +1,8 @@

## Server Options

> [!NOTE]
> If you want the latest config.yml docs, please look at the comments in `config_sample.yml`

TabbyAPI primarily uses a config.yml file to adjust various options. This is the preferred way and has the ability to adjust all options of TabbyAPI.

CLI arguments are also included, but those serve to *override* the options set in config.yml. Therefore, they act a bit differently compared to other programs, especially with booleans.

@@ -9,79 +13,100 @@ In addition, some config.yml options are too complex to represent as command arg

All of these options have descriptive comments above them. You should not need to reference this documentation page unless absolutely necessary.

### Networking Options

| Config Option   | Type (Default)         | Description |
|-----------------|------------------------|-------------|
| host            | String (127.0.0.1)     | Set the IP address used for hosting TabbyAPI |
| port            | Int (5000)             | Set the TCP port used for TabbyAPI |
| disable_auth    | Bool (False)           | Disables API authentication |
| send_tracebacks | Bool (False)           | Send server tracebacks to the client.<br><br>Note: It's not recommended to enable this if sharing the instance with others. |
| api_servers     | List[String] (["OAI"]) | API servers to enable. Possible values: `"OAI", "Kobold"` |

| Config Option          | Type (Default)         | Description |
|------------------------|------------------------|-------------|
| host                   | String (127.0.0.1)     | Set the IP address used for hosting TabbyAPI |
| port                   | Int (5000)             | Set the TCP port used for TabbyAPI |
| disable_auth           | Bool (False)           | Disables API authentication |
| disable_fetch_requests | Bool (False)           | Disables fetching external content when responding to requests (ex. fetching images from URLs) |
| send_tracebacks        | Bool (False)           | Send server tracebacks to the client.<br><br>Note: It's not recommended to enable this if sharing the instance with others. |
| api_servers            | List[String] (["OAI"]) | API servers to enable. Possible values: `"OAI", "Kobold"` |

### Logging Options

Note: With CLI args, all logging parameters are prefixed by `log-`. For example, `prompt` will be `--log-prompt true/false`.

| Config Option     | Type (Default) | Description |
|-------------------|----------------|-------------|
| prompt            | Bool (False)   | Logs prompts to the console |
| generation_params | Bool (False)   | Logs request generation options to the console |
| requests          | Bool (False)   | Logs a request's URL, Body, and Headers to the console |

| Config Option         | Type (Default) | Description |
|-----------------------|----------------|-------------|
| log_prompt            | Bool (False)   | Logs prompts to the console |
| log_generation_params | Bool (False)   | Logs request generation options to the console |
| log_requests          | Bool (False)   | Logs a request's URL, Body, and Headers to the console |

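To visualize where these options live, the sketch below shows a minimal config.yml fragment for the networking and logging sections. The block names (`network`, `logging`) and exact key placement are assumptions based on `config_sample.yml`; treat that file as the source of truth.

```yaml
# Minimal sketch, not a full config - see config_sample.yml for the authoritative layout.
network:
  host: 127.0.0.1          # bind address for the API
  port: 5000               # TCP port
  disable_auth: false
  disable_fetch_requests: false
  send_tracebacks: false   # leave off when sharing the instance
  api_servers: ["OAI"]     # add "Kobold" to also enable the Kobold-compatible server

logging:
  log_prompt: false
  log_generation_params: false
  log_requests: false
```
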
### Sampling Options

Note: This block is for sampling overrides, not samplers themselves.

| Config Option   | Type (Default) | Description |
|-----------------|----------------|-------------|
| override_preset | String (None)  | Start up with the given sampler override preset from the sampler_overrides folder |

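A corresponding config.yml sketch, assuming the block is named `sampling` as in `config_sample.yml` and using a placeholder preset name:

```yaml
sampling:
  # Load the named sampler override preset from the sampler_overrides folder at startup.
  # "my_preset" is a placeholder, not a preset shipped with TabbyAPI.
  override_preset: my_preset
```
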
### Developer Options

Note: These are experimental flags that may be removed at any point.

| Config Option             | Type (Default) | Description |
|---------------------------|----------------|-------------|
| unsafe_launch             | Bool (False)   | Skips dependency checks on startup. Only recommended for debugging. |
| disable_request_streaming | Bool (False)   | Forcefully disables streaming requests |
| cuda_malloc_backend       | Bool (False)   | Uses pytorch's CUDA malloc backend to load models. Helps save VRAM.<br><br>Safe to enable. |
| uvloop                    | Bool (False)   | Use a faster asyncio event loop. Can increase performance.<br><br>Safe to enable. |

| Config Option             | Type (Default) | Description |
|---------------------------|----------------|-------------|
| unsafe_launch             | Bool (False)   | Skips dependency checks on startup. Only recommended for debugging. |
| disable_request_streaming | Bool (False)   | Forcefully disables streaming requests |
| cuda_malloc_backend       | Bool (False)   | Uses pytorch's CUDA malloc backend to load models. Helps save VRAM.<br><br>Safe to enable. |
| realtime_process_priority | Bool (False)   | Set the process priority to "Realtime". Administrator/sudo access required, otherwise the priority is set to the highest it can go in userland. |

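For reference, a hedged sketch of how these flags might look in config.yml (block name `developer` assumed from `config_sample.yml`):

```yaml
developer:
  unsafe_launch: false              # skip dependency checks (debugging only)
  disable_request_streaming: false
  cuda_malloc_backend: false        # safe to enable; can save VRAM
  realtime_process_priority: false  # needs admin/sudo to take full effect
```
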
### Model Options

Note: Most of the options here will only apply on initial model load/startup (ephemeral). They will not persist unless you add the option name to `use_as_default`.

| Config Option         | Type (Default)    | Description |
|-----------------------|-------------------|-------------|
| model_dir             | String ("models") | Directory to look for models.<br><br>Note: Persisted across subsequent load requests |
| use_dummy_models      | Bool (False)      | Send a dummy OAI model card when calling the `/v1/models` endpoint. Used for clients which enforce specific OAI models.<br><br>Note: Persisted across subsequent load requests |
| model_name            | String (None)     | Folder name of a model to load. The below parameters will not apply unless this is filled out. |
| use_as_default        | List[String] ([]) | Keys to use by default when loading models. For example, putting `cache_mode` in this array will make every model load with that value unless specified by the API request.<br><br>Note: Also applies to the `draft` sub-block |
| max_seq_len           | Float (None)      | Maximum sequence length of the model. Uses the value from config.json if not specified here. |
| override_base_seq_len | Float (None)      | Overrides the base sequence length of a model. You probably don't want to use this. max_seq_len is better.<br><br>Note: This is only required for automatic RoPE alpha calculation AND if the model has an incorrect base sequence length (ex. Mistral 7b) |
| tensor_parallel       | Bool (False)      | Use tensor parallelism to load the model. This ignores the value of gpu_split_auto. |
| gpu_split_auto        | Bool (True)       | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
| autosplit_reserve     | List[Int] ([96])  | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
| gpu_split             | List[Float] ([])  | Float array of GBs to split a model between GPUs. |
| rope_scale            | Float (1.0)       | Adjustment for rope scale (or compress_pos_emb) |
| rope_alpha            | Float (None)      | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
| cache_mode            | String ("FP16")   | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
| cache_size            | Int (max_seq_len) | Size of the K/V cache.<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
| chunk_size            | Int (2048)        | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
| max_batch_size        | Int (None)        | The absolute maximum amount of prompts to process at one time. This value is automatically adjusted based on cache size. |
| prompt_template       | String (None)     | Name of a jinja2 chat template to apply for this model. Must be located in the `templates` directory. |
| num_experts_per_token | Int (None)        | Number of experts to use per-token for MoE models. Pulled from the config.json if not specified. |
| fasttensors           | Bool (False)      | Possibly increases model loading speeds. |

| Config Option         | Type (Default)                   | Description |
|-----------------------|----------------------------------|-------------|
| model_dir             | String ("models")                | Directory to look for models.<br><br>Note: Persisted across subsequent load requests |
| inline_model_loading  | Bool (False)                     | Enables ability to switch models using the `model` argument in a generation request. More info in [Usage](https://github.com/theroyallab/tabbyAPI/wiki/03.-Usage#inline-loading) |
| use_dummy_models      | Bool (False)                     | Send a dummy OAI model card when calling the `/v1/models` endpoint. Used for clients which enforce specific OAI models.<br><br>Note: Persisted across subsequent load requests |
| dummy_model_names     | List[String] (["gpt-3.5-turbo"]) | List of dummy names to send on model endpoint requests |
| model_name            | String (None)                    | Folder name of a model to load. The below parameters will not apply unless this is filled out. |
| use_as_default        | List[String] ([])                | Keys to use by default when loading models. For example, putting `cache_mode` in this array will make every model load with that value unless specified by the API request.<br><br>Note: Also applies to the `draft` sub-block |
| max_seq_len           | Float (None)                     | Maximum sequence length of the model. Uses the value from config.json if not specified here. Also called the max context length. |
| tensor_parallel       | Bool (False)                     | Enables tensor parallelism. Automatically falls back to autosplit if GPU split isn't provided.<br><br>Note: `gpu_split_auto` is ignored when this is enabled. |
| gpu_split_auto        | Bool (True)                      | Automatically split the model across multiple GPUs. Manual GPU split isn't used if this is enabled. |
| autosplit_reserve     | List[Int] ([96])                 | Amount of empty VRAM to reserve when loading with autosplit.<br><br>Represented as an array of MB per GPU used. |
| gpu_split             | List[Float] ([])                 | Float array of GBs to split a model between GPUs. |
| rope_scale            | Float (1.0)                      | Adjustment for rope scale (or compress_pos_emb) |
| rope_alpha            | Float (None)                     | Adjustment for rope alpha. Leave blank to automatically calculate based on the max_seq_len. |
| cache_mode            | String ("FP16")                  | Cache mode for the model.<br><br>Options: FP16, Q8, Q6, Q4 |
| cache_size            | Int (max_seq_len)                | Size of the K/V cache<br><br>Note: If using CFG, the cache size should be 2 * max_seq_len. |
| chunk_size            | Int (2048)                       | Amount of tokens per chunk with ingestion. A lower value reduces VRAM usage at the cost of ingestion speed. |
| max_batch_size        | Int (None)                       | The absolute maximum amount of prompts to process at one time. This value is automatically adjusted based on cache size. |
| prompt_template       | String (None)                    | Name of a jinja2 chat template to apply for this model. Must be located in the `templates` directory. |
| vision                | Bool (False)                     | Enable vision support for the provided model (if it exists). |
| num_experts_per_token | Int (None)                       | Number of experts to use per-token for MoE models. Pulled from the config.json if not specified. |

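As an illustration, here is a minimal sketch of a typical model block. The block name `model` and the folder name are assumptions; the keys mirror the table above, and anything omitted falls back to its default.

```yaml
model:
  model_dir: models
  model_name: MyModel-exl2-4.0bpw   # hypothetical folder inside model_dir
  max_seq_len: 8192                 # omit to fall back to the model's config.json
  cache_mode: Q4                    # FP16 (default), Q8, Q6, or Q4
  gpu_split_auto: true              # or set gpu_split: [20.0, 24.0] for a manual split
  autosplit_reserve: [96]           # MB of VRAM kept free per GPU during autosplit
  use_as_default: ["cache_mode"]    # keys that persist for later load requests
```
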
### Draft Model Options

Note: Sub-block of Model Options. Same rules apply.

| Config Option    | Type (Default)    | Description |
|------------------|-------------------|-------------|
| draft_model_dir  | String ("models") | Directory to look for draft models.<br><br>Note: Persisted across subsequent load requests |
| draft_model_name | String (None)     | Folder name of a draft model to load. |
| draft_rope_scale | Float (1.0)       | RoPE scale value for the draft model. |
| draft_rope_alpha | Float (1.0)       | RoPE alpha value for the draft model. Leave blank for auto-calculation. |
| draft_cache_mode | String ("FP16")   | Cache mode for the draft model.<br><br>Options: FP16, Q8, Q6, Q4 |
| draft_gpu_split  | List[Float] ([])  | Float array of GBs to split a draft model between GPUs. |

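A sketch of the draft sub-block. Nesting it under `model` as `draft` is an assumption based on the note above; the draft model name is a placeholder.

```yaml
model:
  draft:
    draft_model_dir: models
    draft_model_name: MyDraftModel-exl2   # placeholder folder name
    draft_rope_alpha:                     # leave blank for auto-calculation
    draft_cache_mode: FP16
    draft_gpu_split: []                   # GBs per GPU, e.g. [4.0, 4.0]
```
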
### Lora Options

Note: Sub-block of Model Options. Same rules apply.

| Config Option | Type (Default)   | Description |
|---------------|------------------|-------------|
| lora_dir      | String ("loras") | Directory to look for loras.<br><br>Note: Persisted across subsequent load requests |
| loras         | List[loras] ([]) | List of lora objects to apply to the model. Each object contains a name and scaling. |
| name          | String (None)    | Folder name of a lora to load.<br><br>Note: An element of the `loras` key |
| scaling       | Float (1.0)      | "Weight" to apply the lora on the parent model. For example, applying a lora with 0.9 scaling will lower the amount of application on the parent model.<br><br>Note: An element of the `loras` key |

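A sketch of the lora sub-block, again assuming it nests under `model` the way the note above describes; the lora name is a placeholder.

```yaml
model:
  lora:
    lora_dir: loras
    loras:
      - name: my-style-lora   # placeholder folder name inside lora_dir
        scaling: 0.9          # weight applied onto the parent model
```
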
### Embeddings Options

Note: Most of the options here will only apply on initial embedding model load/startup (ephemeral).

| Config Option        | Type (Default)    | Description |
|----------------------|-------------------|-------------|
| embedding_model_dir  | String ("models") | Directory to look for embedding models.<br><br>Note: Persisted across subsequent load requests |
| embeddings_device    | String ("cpu")    | Device to load an embedding model on.<br><br>Options: cpu, cuda, auto<br><br>Note: Persisted across subsequent load requests |
| embedding_model_name | String (None)     | Folder name of an embedding model to load using infinity-emb. |

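And a sketch for embeddings, assuming a top-level `embeddings` block as in `config_sample.yml`; the model name is a placeholder.

```yaml
embeddings:
  embedding_model_dir: models
  embeddings_device: cpu               # cpu, cuda, or auto
  embedding_model_name: my-embedder    # placeholder folder, loaded via infinity-emb
```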