The default `pip install .[cu12,extras]` lets pip resolve xformers
transitively (via infinity-emb / sentence-transformers in the extras
group), which can pull a cu130-aligned wheel that requires
libcudart.so.13. On hosts with NVIDIA driver 590.x (cu128-only), this
fails at import time with:
ImportError: libcudart.so.13: cannot open shared object file
Reproduced on K3s clusters running 12 exllamav2/exllamav3 deployment
pods across 6 hosts; all crash-looped on the published `:latest` image,
which had transitively resolved xformers to a cu130 wheel.
Fix: split the install into two pip invocations. Install the cu12 group
first to lock torch + cu128 wheels for exllamav2 / exllamav3 / flash_attn,
then install the extras group with --no-deps so pip cannot resolve
xformers (or any other transitive dep) outside the cu128 lock.
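As a sketch, the split install looks like the following (the `cu12` and `extras` group names come from the description above; the exact pyproject layout is assumed):

```shell
# Step 1: install the cu12 group alone, so torch and the cu128 wheels
# for exllamav2 / exllamav3 / flash_attn are resolved and locked first.
pip install .[cu12]

# Step 2: install the extras group without dependency resolution, so
# pip cannot pull xformers (or any other transitive dep) as a wheel
# built against a CUDA runtime newer than the cu128 lock.
pip install --no-deps .[extras]
```

Note that `--no-deps` means the runtime dependencies of the extras packages must already be satisfied by step 1; anything genuinely missing will surface as an `ImportError` at startup rather than being silently resolved to an incompatible wheel.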
Also align the Windows py3.12 flash_attn wheel version to v0.7.13 to
match the other Windows variants (py3.10, py3.11, py3.13). The py3.12
variant was pinned to v0.7.6 while the rest were on v0.7.13, leaving
py3.12 Windows users on an older flash_attn release with no semantic
reason for the divergence.
Tested on Hydra K3s cluster (NVIDIA 590.48.01-open + cu128 base image
nvidia/cuda:12.8.1-runtime-ubuntu24.04 + torch 2.9.0+cu128). All 12
exllamav2/v3 deployments now import cleanly and serve /v1/models.
Co-authored-by: Josh Jones <scoobydont-666@users.noreply.github.com>
TabbyAPI
Important
In addition to the README, please read the Wiki page for information about getting started!
Note
Need help? Join the Discord Server and get the
Tabby role. Please be nice when asking questions.
Note
Tool calling support has been revamped and now no longer relies on modified Jinja templates. See the docs for more.
Note
Want to run GGUF models? Take a look at YALS, TabbyAPI's sister project.
A FastAPI based application that allows for generating text with an LLM (large language model) using the Exllamav2 and Exllamav3 backends.
TabbyAPI is also the official API backend server for ExllamaV2 and V3.
Disclaimer
This project is marked as rolling release. There may be bugs and changes down the line. Please be aware that you might need to reinstall dependencies from time to time.
TabbyAPI is a hobby project made for a small number of users. It is not meant to run on production servers. For that, please look at other solutions that support those workloads.
Getting Started
Important
Looking for more information? Check out the Wiki.
For a step-by-step guide, choose the format that works best for you:
📖 Read the Wiki – Covers installation, configuration, API usage, and more.
🎥 Watch the Video Guide – A hands-on walkthrough to get you up and running quickly.
Features
- OpenAI compatible API
- Loading/unloading models
- HuggingFace model downloading
- Embedding model support
- JSON schema + Regex + EBNF support
- AI Horde support
- Speculative decoding via draft models
- Multi-lora with independent scaling (e.g. a weight of 0.9)
- Inbuilt proxy to override client request parameters/samplers
- Flexible Jinja2 template engine for chat completions that conforms to HuggingFace chat templates
- Concurrent inference with asyncio
- Utilizes modern python paradigms
- Continuous batching engine using paged attention
- Fast classifier-free guidance
- OAI style tool/function calling
And much more. If something is missing here, PR it in!
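Because the API is OpenAI compatible, a plain HTTP client is enough to talk to it. A minimal sketch, assuming a local instance on TabbyAPI's default `127.0.0.1:5000` with an API key in `$TABBY_API_KEY`:

```shell
# List the currently available models (OpenAI-style endpoint).
curl -H "Authorization: Bearer $TABBY_API_KEY" \
  http://127.0.0.1:5000/v1/models

# Request a chat completion from the loaded model.
curl -H "Authorization: Bearer $TABBY_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64}' \
  http://127.0.0.1:5000/v1/chat/completions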
Supported Model Types
TabbyAPI uses Exllama as a powerful and fast backend for model inference, loading, etc. Therefore, the following types of models are supported:
- Exl2/GPTQ (deprecated, will be removed in the near future)
- Exl3 (Highly recommended)
- FP16
In addition, TabbyAPI supports parallel batching using paged attention for Nvidia Ampere GPUs and higher.
Contributing
Use the template when creating issues or pull requests, otherwise the developers may not look at your post.
If you have issues with the project:
- Describe the issue in detail
- If you have a feature request, please indicate it as such.
If you have a Pull Request:
- Describe the pull request in detail: what you are changing and why
Acknowledgements
TabbyAPI would not exist without the work of other contributors and FOSS projects:
Developers and Permissions
Creators/Developers: