Commit Graph

54 Commits

Author SHA1 Message Date
mefich
37aea9de83 Update exl3 backend model.py: fix for unloading vision models
This change ensures that when unloading a VLM, its vision component is also unloaded.
2025-10-30 12:30:23 +05:00
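A minimal sketch of the idea, assuming hypothetical attribute names (model, vision_model) and an assumed unload() method on each component; the actual exllamav3 objects may differ:

    # Hypothetical unload flow: free the vision component alongside the
    # language model so the VRAM for both parts is released.
    class ExllamaV3Container:
        def __init__(self, model, vision_model=None):
            self.model = model                # language model component
            self.vision_model = vision_model  # vision tower, present for VLMs

        def unload(self):
            if self.vision_model is not None:
                self.vision_model.unload()    # assumed method; frees vision weights
                self.vision_model = None
            self.model.unload()               # assumed method; frees text weights
            self.model = None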
turboderp
0af29d957a Fix #390 2025-10-15 10:40:19 +02:00
kingbri
62e9fa217a ExllamaV3: Handle max_seq_len defined and cache_size undefined case
The previous changes broke existing configs and max_seq_len was
force-overridden to 4096. This helps single-user setups, since they
do not really benefit from the split cache_size/max_seq_len mechanism
(except when batching).

cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.

Breakdown (see the sketch after this entry):
cache_size and max_seq_len specified -> use both values as given
only one of cache_size/max_seq_len specified -> the other defaults to the same value
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-14 21:48:36 -04:00
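A minimal sketch of the resolution logic in the breakdown above; the function and parameter names are illustrative, not the actual TabbyAPI code:

    def resolve_lengths(cache_size, max_seq_len, max_position_embeddings):
        # Resolve the cache_size/max_seq_len pair per the breakdown above
        if cache_size is None and max_seq_len is None:
            cache_size = 4096
            max_seq_len = min(max_position_embeddings, cache_size)
        elif cache_size is None:
            cache_size = max_seq_len      # only max_seq_len was given
        elif max_seq_len is None:
            max_seq_len = cache_size      # only cache_size was given
        return cache_size, max_seq_len

    # Example: a single-user config that only sets max_seq_len keeps working
    print(resolve_lengths(None, 16384, 131072))   # -> (16384, 16384)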
turboderp
8abdfe7b13 Config: replace disable_output_chunking flag with output_chunking 2025-10-14 02:47:52 +02:00
kingbri
85459ce600 Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-10-09 22:33:53 -04:00
turboderp
4235f98e83 Model: Change cache_size/max_seq_len behavior
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced so that it is no larger than the cache size
2025-10-05 22:16:01 +02:00
turboderp
52e093ae6c Model: Enable max_rq_tokens (output chunking) 2025-10-05 18:54:45 +02:00
turboderp
e09a61969f Model: Fix NCCL detection 2025-10-05 18:52:37 +02:00
kingbri
a4d02c2b70 Model: Add log messages for model loading
It's useful to know which split method the model is being loaded with.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 23:09:27 -04:00
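A sketch of the kind of message this adds, assuming a standard logging setup; the wording and variable names are illustrative:

    import logging

    logger = logging.getLogger(__name__)

    def log_split_method(gpu_split, autosplit):
        # Illustrative only: report which split method the model loads with
        if gpu_split:
            logger.info("Loading model with manual GPU split: %s", gpu_split)
        elif autosplit:
            logger.info("Loading model with autosplit")
        else:
            logger.info("Loading model on a single device")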
kingbri
a3a32c30a4 Model: Add utils file
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:43:19 -04:00
kingbri
43f9483bc4 Model: Add tensor_parallel_backend option
This allows users to choose nccl or native depending on their GPU setup.
NCCL is only available with Linux-built wheels.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-17 22:35:10 -04:00
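A hedged sketch of how the option might be validated, given that NCCL only ships in Linux wheels; the helper name and warning text are illustrative:

    import platform

    def resolve_tp_backend(requested: str) -> str:
        # Illustrative helper: pick "nccl" or "native" for tensor parallelism
        if requested == "nccl" and platform.system() != "Linux":
            # NCCL is only available with Linux-built wheels; fall back safely
            print("Warning: NCCL backend unavailable on this OS, using 'native'")
            return "native"
        return requested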
Forkoz
60ae419746 Model.py TP changes 2025-08-12 21:01:54 +00:00
kingbri
fe149489af Tree: Format
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-08-05 01:22:18 -04:00
AUTOMATIC
056527ceb3 add logprobs support for exl3 2025-08-03 11:42:32 +03:00
kingbri
0b4ca567f8 API: Persist request IDs and append full_text to finish chunk
Adding these to each generation chunk helps remove redundancy and
unnecessary request ID operations.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-07-25 12:27:44 -04:00
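A sketch of the chunk shape this implies, using hypothetical field names; the real response schema may differ:

    # Every streamed chunk carries the request ID; the finish chunk also
    # carries the accumulated full_text so clients need not re-join deltas.
    def make_chunk(request_id, delta, finished=False, full_text=None):
        chunk = {"request_id": request_id, "text": delta}
        if finished:
            chunk["finish_reason"] = "stop"
            chunk["full_text"] = full_text
        return chunk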
turboderp
0ae878712e Exl3: Clear image embedding cache on unload 2025-06-25 23:56:21 +02:00
kingbri
2913ce29fc API: Add timings to usage stats
It's useful for the client to know the tokens per second (T/s) and total
generation time for each request.

Works with both completions and chat completions.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-06-17 22:54:51 -04:00
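A sketch of the timing math, assuming a hypothetical usage dict; the field names are illustrative:

    import time

    def finish_usage(start_time, prompt_tokens, completion_tokens):
        # Illustrative: attach per-request timing stats to the usage block
        total_time = time.time() - start_time
        return {
            "prompt_tokens": prompt_tokens,
            "completion_tokens": completion_tokens,
            "total_tokens": prompt_tokens + completion_tokens,
            "total_time_s": round(total_time, 3),
            "tokens_per_second": round(completion_tokens / total_time, 2)
            if total_time > 0 else 0.0,
        }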
turboderp
21c5af48e1 Tree: Format 2025-06-15 19:30:38 +02:00
turboderp
1c9891bf04 Exl3: Add vision capability 2025-06-15 19:22:51 +02:00
turboderp
d357f100d0 Dependencies: Bump ExllamaV3 2025-06-15 19:12:45 +02:00
turboderp
691a080ac7 Dependencies: Bump ExllamaV3 and ExllamaV2 2025-05-31 23:55:04 +02:00
kingbri
0c4cc1eba3 Model: Add prompt logging to ExllamaV3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 22:05:18 -04:00
gakada
ba6248eec0 Exl3: fix add_bos in generator 2025-05-17 19:10:49 +09:00
kingbri
17f3dca6fc Packaging: Add agnostic method to check version of packages
Some packages, such as ExllamaV2 and V3, require specific versions for
the latest features. Rather than creating repetitive per-package functions,
create an agnostic function that checks the installed package and tells
the user to upgrade if it is too old.

This message is also returned from load and unload requests, so the
error is kept short.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 01:04:24 -04:00
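A minimal sketch of such an agnostic check using the standard library's importlib.metadata plus the packaging library; the function name and error text are illustrative:

    from importlib.metadata import PackageNotFoundError, version
    from packaging.version import parse

    def check_package_version(package: str, required: str) -> None:
        # Illustrative: raise a short error if a package is missing or too old
        try:
            installed = version(package)
        except PackageNotFoundError:
            raise RuntimeError(f"{package} is not installed")
        if parse(installed) < parse(required):
            raise RuntimeError(
                f"{package} {installed} is too old, please upgrade to >= {required}"
            )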
kingbri
084916c04f Model: Fix autosplit reserve crash with GPU split
ExllamaV3 does not accept autosplit_reserve and gpu_split at the same
time.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-17 00:51:14 -04:00
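A sketch of the fix's shape: only one of the two options is passed through; the kwarg names mirror the message, but the actual loader call may differ:

    def build_load_kwargs(gpu_split=None, autosplit_reserve=None):
        # Illustrative: never pass autosplit_reserve together with gpu_split
        kwargs = {}
        if gpu_split:
            kwargs["gpu_split"] = gpu_split          # manual split wins
        elif autosplit_reserve is not None:
            kwargs["autosplit_reserve"] = autosplit_reserve
        return kwargs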
kingbri
390daeb92f Model: Create universal HFModel class
The HFModel class serves to coalesce all of the config files that contain
the assorted keys required for model usage.

Adding this base class lets us adapt as HuggingFace changes its JSON
schemas over time, reducing the burden on backend devs when their next
model isn't supported.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-13 18:12:38 -04:00
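A minimal sketch of the idea behind such a base class, assuming it simply merges the common HF JSON files; the class shape and field names are illustrative:

    import json
    from pathlib import Path

    class HFModel:
        # Illustrative: coalesce the HF config files a backend needs
        def __init__(self, model_dir: str):
            self.model_dir = Path(model_dir)
            self.config = self._read("config.json")
            self.generation_config = self._read("generation_config.json")
            self.tokenizer_config = self._read("tokenizer_config.json")

        def _read(self, name: str) -> dict:
            path = self.model_dir / name
            return json.loads(path.read_text()) if path.exists() else {}

        def get(self, key, default=None):
            # Look a key up across configs so schema drift is handled in one place
            for source in (self.config, self.generation_config, self.tokenizer_config):
                if key in source:
                    return source[key]
            return default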
kingbri
638eef401a Model: Move cache creation to a common function
Prevents repetition and also introduces a Cache class.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-08 23:10:03 -04:00
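A hedged sketch of what a common cache factory could look like; the Cache wrapper and its fields here are placeholders, not the real exllamav2/exllamav3 cache types:

    class Cache:
        # Placeholder wrapper so both backends build caches the same way
        def __init__(self, size, quant=None):
            self.size = size      # total cache tokens
            self.quant = quant    # e.g. "q4"/"q8", or None for FP16

    def create_cache(cache_size, cache_mode="fp16"):
        # Illustrative: one place decides FP16 vs quantized cache creation
        mode = cache_mode.strip().lower()
        quant = None if mode == "fp16" else mode
        return Cache(cache_size, quant)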
DocShotgun
45b966363e Tree: Format 2025-05-03 21:01:03 -07:00
DocShotgun
a635a719d7 Model: Enable draft model q-cache in Exl3
* Remove unneeded default fp16 cache layer import
2025-05-03 20:59:36 -07:00
DocShotgun
58e34ba4c5 Model: Exl3 cache quant settings lenient with whitespace 2025-05-03 20:35:35 -07:00
DocShotgun
68a660bdb3 Model: Initial Exl3 cache quantization support 2025-05-03 20:35:35 -07:00
turboderp
92ea7ee7cd Model: Add draft model/speculative decoding 2025-05-04 01:27:42 +02:00
turboderp
1db2cb99cb Model: Avoid initializing class variables 2025-05-04 01:26:42 +02:00
turboderp
0405a94a89 Model: Cast penalty range to int 2025-05-03 22:28:36 +02:00
turboderp
58c380b8ca Model: Create generator on load 2025-05-03 18:33:37 +02:00
turboderp
0d949d00b9 Model: Set default max_batch_size 2025-05-03 18:33:37 +02:00
turboderp
8c75b29923 Model: Fix some warnings 2025-05-03 18:33:36 +02:00
kingbri
15cc480cb0 Exl3: Simplify add_bos_token handling
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:50:42 -04:00
randoentity
d8a8ccfc2a Model: fix add_bos_token 2025-05-02 21:33:25 -04:00
kingbri
0d02af3c81 Model: Set model_dir on init
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
c89bea030e Model: Add template fetching to Exl3
Use the same functionality as exl2's loader.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
e8f00412f6 Model: Fetch from generation_config and tokenizer_config in Exl3
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
eca403a0e4 Model: Add Exllamav3 sampler
File was not included in previous commit.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
bdc5189a4b Exl3: Add chunk size, cache size, and model info
Use the same algorithm for estimating and adjusting the cache size:
round to a multiple of 256 that is at or above max_seq_len (see the
sketch after this entry).

The same applies to chunk size.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
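A sketch of the rounding described above; the helper name is illustrative:

    def adjust_cache_size(requested, max_seq_len):
        # Illustrative: round up to a multiple of 256, at or above max_seq_len
        size = max(requested, max_seq_len)
        return -(-size // 256) * 256   # ceiling division back to whole tokens

    print(adjust_cache_size(10000, 8192))   # -> 10240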
kingbri
303e2dde12 Model: Correct exl3 generation, add concurrency, and cleanup
Fixes the application of sampler parameters by adding a new sampler
builder interface. Also exposes the generator class-wide and adds
wait_for_jobs.

Finally, allows inline loading to specify the backend.

Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
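A hedged sketch of a sampler builder interface in the spirit of this commit; the class and method names are invented for illustration:

    class SamplerBuilder:
        # Illustrative builder: collect sampler params, then emit a settings dict
        def __init__(self):
            self._params = {}

        def temperature(self, value):
            self._params["temperature"] = value
            return self

        def top_p(self, value):
            self._params["top_p"] = value
            return self

        def build(self):
            return dict(self._params)

    # Usage: settings = SamplerBuilder().temperature(0.7).top_p(0.9).build()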
randoentity
c744790f14 fixup: add sampler logs
This also passes the sampler to the job; unsure if this is correct.
2025-05-02 21:33:25 -04:00
randoentity
b35c48da37 fixup: some metrics 2025-05-02 21:33:25 -04:00
randoentity
c0f268f33e fixup: autosplit, start work on metrics 2025-05-02 21:33:25 -04:00
randoentity
306fc7cd15 fixup: autosplit reserve
This probably breaks v2 support.
2025-05-02 21:33:25 -04:00
randoentity
acb3adb953 fixup: auto split 2025-05-02 21:33:25 -04:00