turboderp
52e093ae6c
Model: Enable max_rq_tokens (output chunking)
2025-10-05 18:54:45 +02:00
turboderp
e09a61969f
Model: Fix NCCL detection
2025-10-05 18:52:37 +02:00
kingbri
a4d02c2b70
Model: Add log messages for model loading
...
It's useful to know which split method the model is being loaded
with.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 23:09:27 -04:00
kingbri
a3a32c30a4
Model: Add utils file
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:43:19 -04:00
kingbri
43f9483bc4
Model: Add tensor_parallel_backend option
...
This allows users to choose nccl or native depending on the GPU setup.
NCCL is only available with Linux-built wheels.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
Forkoz
60ae419746
Model.py TP changes
2025-08-12 21:01:54 +00:00
kingbri
fe149489af
Tree: Format
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-05 01:22:18 -04:00
AUTOMATIC
056527ceb3
add logprobs support for exl3
2025-08-03 11:42:32 +03:00
kingbri
0b4ca567f8
API: Persist request IDs and append full_text to finish chunk
...
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
turboderp
0ae878712e
Exl3: Clear image embedding cache on unload
2025-06-25 23:56:21 +02:00
kingbri
a02d39de31
Model: Remove rogue print
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 23:09:07 -04:00
kingbri
2913ce29fc
API: Add timings to usage stats
...
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
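The per-request timings described in the commit above could be computed along these lines. This is a minimal sketch, not the TabbyAPI implementation: `GenerationTimings`, `build_timings`, and all field names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class GenerationTimings:
    """Per-request timing stats (hypothetical field names)."""

    total_time: float         # seconds spent on the whole request
    tokens_per_second: float  # generated tokens divided by generation time


def build_timings(
    completion_tokens: int, prompt_seconds: float, gen_seconds: float
) -> GenerationTimings:
    """Derive T/s and total time from raw measurements."""
    total = prompt_seconds + gen_seconds
    # Avoid dividing by zero when nothing was generated
    tps = completion_tokens / gen_seconds if gen_seconds > 0 else 0.0
    return GenerationTimings(
        total_time=round(total, 2),
        tokens_per_second=round(tps, 2),
    )
```

The resulting object could then be attached to the usage block of either a completion or a chat completion response.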
kingbri
5d94d4d022
Merge branch 'main' into breaking
2025-06-17 22:24:32 -04:00
turboderp
21c5af48e1
Tree: Format
2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04
Exl3: Add vision capability
2025-06-15 19:22:51 +02:00
turboderp
4605c0f6bd
Common: Refactor get_image to common functions
2025-06-15 19:20:36 +02:00
turboderp
d357f100d0
Dependencies: Bump ExllamaV3
2025-06-15 19:12:45 +02:00
turboderp
a0c16bba2a
Exl2: Fix banned_strings (move outside of assign_gen_params)
2025-06-15 16:51:42 +02:00
kingbri
2096c9bad2
Model: Default max_seq_len to 4096
...
A common problem in TabbyAPI is that users who want to get up and
running with a model often hit OOMs caused by max_seq_len.
This is because model devs set max context values in the millions, which
requires a lot of VRAM.
To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-13 14:57:24 -04:00
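The fallback described in the commit above can be sketched roughly as follows. This is an illustration under stated assumptions, not the actual TabbyAPI code: `resolve_max_seq_len` and its parameters are hypothetical names, and only the 4096 default and the -1 sentinel come from the commit message.

```python
from typing import Optional

DEFAULT_MAX_SEQ_LEN = 4096  # fallback value named in the commit message


def resolve_max_seq_len(user_value: Optional[int], model_max: int) -> int:
    """Pick the context length to allocate (hypothetical helper)."""
    if user_value is None:
        # Nothing configured: use the safe default instead of the model's max
        return DEFAULT_MAX_SEQ_LEN
    if user_value == -1:
        # Explicit opt-in to the model's own maximum context length
        return model_max
    return user_value
```

With this shape, a model that advertises a context of one million tokens still loads with 4096 unless the user explicitly asks for more.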
turboderp
691a080ac7
Dependencies: Bump ExllamaV3 and ExllamaV2
2025-05-31 23:55:04 +02:00
kingbri
0c4cc1eba3
Model: Add prompt logging to ExllamaV3
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 22:05:18 -04:00
gakada
ba6248eec0
Exl3: fix add_bos in generator
2025-05-17 19:10:49 +09:00
kingbri
17f3dca6fc
Packaging: Add agnostic method to check version of packages
...
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function to check the installed package and then report
to the user to upgrade.
This is also sent to requests for loading and unloading, so keep the
error short.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
kingbri
084916c04f
Model: Fix autosplit reserve crash with GPU split
...
ExllamaV3 does not accept autosplit_reserve and gpu_split at the same
time.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:51:14 -04:00
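One way to avoid passing both settings at once, as the commit above requires, is to build the load options so that only one of them is ever present. The sketch below is an assumption-laden illustration: `build_split_options` is a hypothetical helper, the dictionary keys simply mirror the option names in the commit message rather than real ExllamaV3 keyword arguments, and giving the manual split priority is an assumed policy.

```python
from typing import Any, Dict, List, Optional


def build_split_options(
    gpu_split: Optional[List[float]],
    autosplit_reserve: Optional[List[float]],
) -> Dict[str, Any]:
    """Return load options containing at most one of the two settings."""
    if gpu_split:
        # Assumed policy: a manual split wins and the reserve is dropped
        return {"gpu_split": list(gpu_split)}
    if autosplit_reserve:
        return {"autosplit_reserve": list(autosplit_reserve)}
    return {}
```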
kingbri
0858b6d4b2
Tree: Format
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:46:40 -04:00
kingbri
390daeb92f
Model: Create universal HFModel class
...
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.
Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the burden on backend
devs when the next model isn't supported.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
kingbri
bd3fec929c
Tree: Format
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:32:27 -04:00
kingbri
a524ac3c0f
Model: Fix cache mode again
...
If statements can be difficult to work with.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:30:47 -04:00
kingbri
20cad851e9
Model: Fix param call
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:52:28 -04:00
kingbri
d15eb55f20
Model: Fix exl2 cache mode check
...
FP16 was not included in the validation step.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
kingbri
656af41b5d
Model: Always enable decode_special_tokens
...
The frontend should handle the special tokens if they get emitted.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:25:50 -04:00
kingbri
42346c6b39
Sampling: Remove skip_special_tokens
...
This parameter is way too confusing and does not make sense in
the modern LLM space.
Change approved by all maintainers.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:11:33 -04:00
kingbri
25c77ebf77
Model: Remove exllamav2-specific version check
...
No longer necessary thanks to the agnostic check.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:08:15 -04:00
kingbri
638eef401a
Model: Move cache creation to a common function
...
Prevents repetitiveness while also creating a Cache class.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-08 23:10:03 -04:00
DocShotgun
9dcde59c57
Model: Check for unsupported cache mode in exllamav2
2025-05-06 01:18:15 -07:00
DocShotgun
45b966363e
Tree: Format
2025-05-03 21:01:03 -07:00
DocShotgun
a635a719d7
Model: Enable draft model q-cache in Exl3
...
* Remove unneeded default fp16 cache layer import
2025-05-03 20:59:36 -07:00
DocShotgun
58e34ba4c5
Model: Exl3 cache quant settings lenient with whitespace
2025-05-03 20:35:35 -07:00
DocShotgun
68a660bdb3
Model: Initial Exl3 cache quantization support
2025-05-03 20:35:35 -07:00
turboderp
92ea7ee7cd
Model: Add draft model/speculative decoding
2025-05-04 01:27:42 +02:00
turboderp
1db2cb99cb
Model: Avoid initializing class variables
2025-05-04 01:26:42 +02:00
turboderp
0405a94a89
Model: Cast penalty range to int
2025-05-03 22:28:36 +02:00
turboderp
58c380b8ca
Model: Create generator on load
2025-05-03 18:33:37 +02:00
turboderp
0d949d00b9
Model: Set default max_batch_size
2025-05-03 18:33:37 +02:00
turboderp
8c75b29923
Model: Fix some warnings
2025-05-03 18:33:36 +02:00
kingbri
15cc480cb0
Exl3: Simplify add_bos_token handling
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:50:42 -04:00
randoentity
d8a8ccfc2a
Model: fix add_bos_token
2025-05-02 21:33:25 -04:00
kingbri
0d02af3c81
Model: Set model_dir on init
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
c89bea030e
Model: Add template fetching to Exl3
...
Use the same functionality as exl2's loader.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
e8f00412f6
Model: Fetch from generation_config and tokenizer_config in Exl3
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00