Commit Graph

1092 Commits

kingbri
ad64942fa1 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:49:13 -04:00
kingbri
f205349c81 Config: Fix use_as_default application
Apply the default overrides after the inline config has been merged.

Do not require an inline config for use_as_default and other
overrides to apply.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:45:39 -04:00
kingbri
6f73a0b388 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:06:20 -04:00
kingbri
5cb8f3ed2c Config: Fix comments for max_seq_len and cache_size
The default is the minimum of max_position_embeddings and cache_size.
On AMD GPUs and NVIDIA GPUs older than Ampere, cache_size is ignored
since exl2's batching does not support those devices.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:04:36 -04:00
kingbri
fdb86f4c63 ExllamaV2: Add max_seq_len empty case like ExllamaV3
Also remove the intermediate base_seq_len and target_seq_len variables
to make code clearer.

If paged mode is off, max_seq_len becomes the prime mover since batching
is unavailable.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 23:02:52 -04:00
kingbri
69a25d7fa6 Config + Endpoints: Make cache_size more prominent
Since cache_size is a more important parameter now for multi-user
setups, mark it as such by placing it below max_seq_len.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 21:53:33 -04:00
kingbri
62e9fa217a ExllamaV3: Handle max_seq_len defined and cache_size undefined case
The previous changes broke existing configs: max_seq_len was
force-overridden to 4096. This helps single-user setups, since they
do not really benefit from the split cache_size/max_seq_len mechanism
(except when batching).

cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.

Breakdown:
cache_size and max_seq_len specified -> values
only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 21:48:36 -04:00
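The breakdown in 62e9fa217a can be sketched in Python; names are illustrative placeholders, not the handler's actual code.

```python
# Sketch of the cache_size/max_seq_len fallback rules described above.
def resolve_lengths(cache_size, max_seq_len, max_position_embeddings):
    if cache_size is None and max_seq_len is None:
        # Neither specified: default cache, clamp seq len to the model limit
        cache_size = 4096
        max_seq_len = min(max_position_embeddings, cache_size)
    elif cache_size is None:
        # Only max_seq_len specified: mirror it into cache_size
        cache_size = max_seq_len
    elif max_seq_len is None:
        # Only cache_size specified: mirror it into max_seq_len
        max_seq_len = cache_size
    # Both specified: use the values as given
    return cache_size, max_seq_len
```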
turboderp
04ca346732 Fix formatting 2025-10-14 03:11:59 +02:00
turboderp
ec50ad17ea Merge branch 'main_seq' 2025-10-14 02:58:00 +02:00
turboderp
8abdfe7b13 Config: replace disable_output_chunking flag with output_chunking 2025-10-14 02:47:52 +02:00
turboderp
7eee3924c7 Merge remote-tracking branch 'origin/main_seq' into main_seq 2025-10-14 00:58:42 +02:00
turboderp
f73e88e9e9 Dependencies: update exllamav3 2025-10-14 00:58:14 +02:00
kingbri
85459ce600 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-09 22:33:53 -04:00
turboderp
01a5915a7b Dependencies: Pin Pydantic to version 2.11.0
For now. There appear to be breaking changes in 2.12.0 that affect both Formatron and FastAPI.
2025-10-08 20:43:26 +02:00
turboderp
4235f98e83 Model: Change cache_size/max_seq_len behavior
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
2025-10-05 22:16:01 +02:00
turboderp
d672dc2137 API: Fix race condition when client disconnects 2025-10-05 21:23:02 +02:00
turboderp
52e093ae6c Model: Enable max_rq_tokens (output chunking) 2025-10-05 18:54:45 +02:00
turboderp
e09a61969f Model: Fix NCCL detection 2025-10-05 18:52:37 +02:00
kingbri
7a0dddcbd9 Dependencies: Update exllamav3
v0.0.7

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-09-30 17:34:02 -04:00
turboderp
1d3a308709 Fix wiki link in README.md 2025-08-26 13:03:18 +02:00
kingbri
d7eb580e99 Start: Fix uv check
On Windows, checking for a command raises a FileNotFoundError if
the utility isn't installed. This led to complicated logic that can
be avoided by using which instead.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-21 18:23:42 -04:00
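The difference d7eb580e99 describes can be sketched with Python's stdlib; the function names are illustrative, not the start script's actual code.

```python
import shutil
import subprocess

# shutil.which returns None when the tool is absent, so no per-platform
# exception handling is needed
def has_uv() -> bool:
    return shutil.which("uv") is not None

# The fragile pattern this replaces: on Windows, spawning a missing
# executable raises FileNotFoundError instead of failing with a status code
def has_uv_by_running() -> bool:
    try:
        subprocess.run(["uv", "--version"], capture_output=True, check=True)
        return True
    except (FileNotFoundError, subprocess.CalledProcessError):
        return False
```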
kingbri
4036c70d75 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-19 22:59:26 -04:00
kingbri
bd3aa5bb04 Docs: Add uv section
uv is now supported first-party in tabbyAPI's start script, so add
a dedicated section for it and recommend it over miniconda.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-19 22:57:03 -04:00
kingbri
1f4186512e Start: Add check for uv
uv is the definitive package installation tool for Python, so add
a check for it to the start script.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-19 22:57:03 -04:00
kingbri
30a3cd75cf Start: Migrate options from cu121/118 to cu12
This covers more CUDA versions and makes installation easier for
new users.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-19 22:56:58 -04:00
kingbri
1344726936 Docs: Sampler overrides part 2
Actually commit the edits.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-19 21:19:12 -04:00
Brian
86f27c9c93 Merge pull request #377 from DocShotgun/main
Config: Enable safe sampler overrides by default
2025-08-18 23:12:34 -04:00
kingbri
e07df3951e Docs: Update sampler overrides
Change the sampling subsection to sampler overrides and add a warning
about the default preset.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-18 23:06:16 -04:00
kingbri
067d63773e Config: Move sampling higher in the list
This has become a bigger priority with the addition of the
safe_defaults noob proofing.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-18 22:55:03 -04:00
DocShotgun
6fb0c2cdbd Config: Update description for override_preset default
* We provide safe_defaults as a default in config_sample.yml but not internally
2025-08-18 12:39:52 -07:00
DocShotgun
998abe5ad1 Config: Enable safe sampler overrides by default
* Provides safe fallback samplers, intended for better out-of-the-box support for clients that do not pass sampler params
2025-08-18 12:32:28 -07:00
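The fallback behavior 998abe5ad1 describes can be sketched as follows; the parameter names and values are illustrative placeholders, not the actual safe_defaults preset.

```python
# Hedged sketch of a safe-defaults override: client-supplied values win,
# and the preset only fills in what the client omitted.
SAFE_DEFAULTS = {"temperature": 1.0, "top_p": 1.0, "repetition_penalty": 1.0}

def apply_sampler_overrides(request_params: dict, preset: dict) -> dict:
    resolved = dict(preset)
    # Only non-None client values override the preset
    resolved.update({k: v for k, v in request_params.items() if v is not None})
    return resolved
```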
kingbri
a4d02c2b70 Model: Add log messages for model loading
It's useful to know which split method the model is being loaded
with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 23:09:27 -04:00
kingbri
a3a32c30a4 Model: Add utils file
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:43:19 -04:00
Brian
05791a25a1 Merge pull request #375 from Ph0rk0z/patch-1
experimental: native exllamav3 TP, no fuss
2025-08-17 22:37:25 -04:00
kingbri
43f9483bc4 Model: Add tensor_parallel_backend option
This allows for users to use nccl or native depending on the GPU setup.
NCCL is only available with Linux built wheels.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
kingbri
b9952f319e Merge branch 'main' into exl3-tp 2025-08-17 21:21:40 -04:00
kingbri
f2a39e3a61 Dependencies: Update exllama, torch, and flash attention
Torch: 2.8
ExllamaV2: v0.3.2 torch 2.8
ExllamaV3: v0.0.6 torch 2.8
FA: v2.8.3

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 21:19:23 -04:00
Forkoz
60ae419746 Model.py TP changes 2025-08-12 21:01:54 +00:00
Brian
6623dbcd86 Merge pull request #373 from AUTOMATIC1111/exl3-logprobs
add logprobs support for exl3
2025-08-05 01:24:06 -04:00
kingbri
fe149489af Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-05 01:22:18 -04:00
Brian
83f778db2d Merge pull request #374 from DocShotgun/main
Templating: Support chat_template.jinja
2025-08-05 01:18:25 -04:00
DocShotgun
81a115b781 Templating: Support chat_template.jinja 2025-08-03 16:10:08 -07:00
AUTOMATIC
056527ceb3 add logprobs support for exl3 2025-08-03 11:42:32 +03:00
Brian
03d72a37be Merge pull request #371 from DocShotgun/main
Config: Remove developer arg cuda_malloc_backend
2025-08-01 14:02:57 -04:00
DocShotgun
102af306e5 Config: Remove developer arg cuda_malloc_backend
* cudaMallocAsync is now enabled by default on supported configurations
2025-08-01 10:59:13 -07:00
kingbri
113643c0df Main: Enable cudaMallocAsync backend by default
Works on CUDA 12.4 and up. If CUDA isn't available, the backend is
not enabled. This is an env var that must be set before startup, so
it can't really be set via config.yml.

This used to be experimental, but it's probably fine to keep it enabled
since it only provides a benefit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-27 22:31:38 -04:00
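The conditional enable in 113643c0df can be sketched as below. PYTORCH_CUDA_ALLOC_CONF is PyTorch's documented allocator env var, but the CUDA detection and function name here are illustrative, not TabbyAPI's exact code.

```python
import shutil

def enable_cuda_malloc_async(env: dict) -> bool:
    """Opt into cudaMallocAsync only when a CUDA toolchain is visible."""
    if shutil.which("nvidia-smi") is None:
        return False  # No CUDA on this machine: keep the default allocator
    # Must be set before torch initializes CUDA, hence an env var
    env.setdefault("PYTORCH_CUDA_ALLOC_CONF", "backend:cudaMallocAsync")
    return True
```

In practice the dict passed in would be os.environ, populated before torch is imported.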
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
kingbri
e77fa0b7a8 Docs: Edit inline loading for breaking changes
Add the model key for the YAML examples.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-24 18:11:42 -04:00
kingbri
ab04a6ed60 Dependencies: Bump ExllamaV3
v0.0.5

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-18 22:56:35 -04:00
kingbri
bf936f5c39 Dependencies: Update exllamav2
v0.3.2

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-13 23:33:12 -04:00