kingbri
a524ac3c0f
Model: Fix cache mode again
...
If statements can be difficult to work with.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 11:30:47 -04:00
kingbri
20cad851e9
Model: Fix param call
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:52:28 -04:00
kingbri
d15eb55f20
Model: Fix exl2 cache mode check
...
FP16 was not included in the validation step.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-12 09:51:09 -04:00
kingbri
656af41b5d
Model: Always enable decode_special_tokens
...
The frontend should handle the special tokens if they get emitted.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:25:50 -04:00
kingbri
42346c6b39
Sampling: Remove skip_special_tokens
...
This parameter is way too confusing and does not make sense in
the modern LLM space.
Change approved by all maintainers.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:11:33 -04:00
kingbri
25c77ebf77
Model: Remove exllamav2-specific version check
...
No longer necessary thanks to the agnostic check.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-09 22:08:15 -04:00
kingbri
638eef401a
Model: Move cache creation to a common function
...
Prevents repetitiveness while also creating a Cache class.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-08 23:10:03 -04:00
DocShotgun
9dcde59c57
Model: Check for unsupported cache mode in exllamav2
2025-05-06 01:18:15 -07:00
DocShotgun
45b966363e
Tree: Format
2025-05-03 21:01:03 -07:00
DocShotgun
a635a719d7
Model: Enable draft model q-cache in Exl3
...
* Remove unneeded default fp16 cache layer import
2025-05-03 20:59:36 -07:00
DocShotgun
58e34ba4c5
Model: Exl3 cache quant settings lenient with whitespace
2025-05-03 20:35:35 -07:00
DocShotgun
68a660bdb3
Model: Initial Exl3 cache quantization support
2025-05-03 20:35:35 -07:00
turboderp
92ea7ee7cd
Model: Add draft model/speculative decoding
2025-05-04 01:27:42 +02:00
turboderp
1db2cb99cb
Model: Avoid initializing class variables
2025-05-04 01:26:42 +02:00
turboderp
0405a94a89
Model: Cast penalty range to int
2025-05-03 22:28:36 +02:00
turboderp
58c380b8ca
Model: Create generator on load
2025-05-03 18:33:37 +02:00
turboderp
0d949d00b9
Model: Set default max_batch_size
2025-05-03 18:33:37 +02:00
turboderp
8c75b29923
Model: Fix some warnings
2025-05-03 18:33:36 +02:00
kingbri
15cc480cb0
Exl3: Simplify add_bos_token handling
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:50:42 -04:00
randoentity
d8a8ccfc2a
Model: fix add_bos_token
2025-05-02 21:33:25 -04:00
kingbri
0d02af3c81
Model: Set model_dir on init
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
c89bea030e
Model: Add template fetching to Exl3
...
Use the same functionality as exl2's loader.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
e8f00412f6
Model: Fetch from generation_config and tokenizer_config in Exl3
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
eca403a0e4
Model: Add Exllamav3 sampler
...
File was not included in previous commit.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
kingbri
bdc5189a4b
Exl3: Add chunk size, cache size, and model info
...
Use the same algorithm for estimating and adjusting cache size based
on multiples of 256 and above max seq len.
Same applies for chunk size.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
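Note on bdc5189a4b: the commit body describes keeping the cache size at or above the max sequence length and on a multiple of 256. A minimal sketch of that idea follows. The function name `adjust_cache_size` and the choice to round up (rather than down) are assumptions made for illustration, not code taken from the repository.

```python
from typing import Optional


def adjust_cache_size(
    requested: Optional[int], max_seq_len: int, multiple: int = 256
) -> int:
    """Hypothetical helper: pick a cache size that is at least max_seq_len
    and lands on a multiple of `multiple`."""
    # Fall back to max_seq_len when no size is given, and never go below it.
    size = max_seq_len if requested is None else max(requested, max_seq_len)

    # Assumption: round up to the next multiple so no requested capacity is lost.
    remainder = size % multiple
    if remainder:
        size += multiple - remainder
    return size
```

Under these assumptions, a request of 5000 tokens with a 4096-token max sequence length is raised to 5120, and a request smaller than the max sequence length is lifted to 4096. The same rounding could be applied to the chunk size mentioned in the commit.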
kingbri
303e2dde12
Model: Correct exl3 generation, add concurrency, and cleanup
...
Fixes application of sampler parameters by adding a new sampler builder
interface. Also expose the generator class-wide and add wait_for_jobs.
Finally, allow inline loading to specify the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:33:25 -04:00
randoentity
c744790f14
fixup: add sampler logs
...
Also passing sampler to job with this, no idea if this is correct
2025-05-02 21:33:25 -04:00
randoentity
b35c48da37
fixup: some metrics
2025-05-02 21:33:25 -04:00
randoentity
c0f268f33e
fixup: autosplit, start work on metrics
2025-05-02 21:33:25 -04:00
randoentity
306fc7cd15
fixup: autosplit reserve
...
this probably breaks v2 support
2025-05-02 21:33:25 -04:00
randoentity
acb3adb953
fixup: auto split
2025-05-02 21:33:25 -04:00
randoentity
14fb573371
fixup: max_seq_len
...
Whoops
2025-05-02 21:33:25 -04:00
randoentity
daae9ec43d
Exl3: Couldn't wait
...
Just copied some stuff around and it ended up working for basic use.
2025-05-02 21:33:25 -04:00
kingbri
b4ff2f23cf
Exl3: Add token encode, decode, and special token fetch
...
Base class methods
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:53 -04:00
kingbri
0c1d794390
Model: Add exl3 and associated load functions
...
Initial exl3 compat and loading functionality.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:39 -04:00
kingbri
242f6b7d2a
Model: Simplify add_bos_token handling
...
Set add_bos_token to True by default in the tokenizer_config stub.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 21:32:28 -04:00
kingbri
4cb3e5d5b1
Tree: Format
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:23:15 -04:00
kingbri
47cb2a0de9
Model: Add TokenizerConfig stub and add_eos_token fallback
...
This stub fetches the add_eos_token field from the HF tokenizer config.
Ideally, this should be in the backend rather than tabby.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-02 00:08:01 -04:00
kingbri
aa657fa6e9
API: Ignore add_bos_token in chat completions
...
When fetching special tokens from the model, don't factor in the
add_bos_token and ban_eos_token parameters as switches.
In addition, change the internal handling of add_bos_token to an optional
boolean. This allows us to fall back to the model when selecting whether
or not to add the BOS token, especially for chat completions.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-05-01 22:51:15 -04:00
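Note on aa657fa6e9: the optional-boolean handling described above can be sketched as below. `resolve_add_bos_token` and its parameter names are hypothetical; the sketch only shows the three-state idea (explicit True, explicit False, or unset meaning "defer to the model").

```python
from typing import Optional


def resolve_add_bos_token(request_value: Optional[bool], model_default: bool) -> bool:
    """Hypothetical helper: decide whether to prepend the BOS token.

    None means the request did not specify a value, so the model's own
    setting (for example, from its tokenizer config) is used instead.
    """
    if request_value is None:
        return model_default
    return request_value
```

With a plain boolean there is no way to tell "the caller asked for False" apart from "the caller said nothing", which is why the unset state matters for chat completions.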
kingbri
b43f0983c8
Model: Fix max_seq_len fallbacks
...
The rope alpha calculation caused an error if max seq len isn't
provided. This is because the model's max sequence length was not
stored as the target for alpha calculation.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-28 14:09:31 -04:00
kingbri
f070587e9f
Model: Add proper jobs cleanup and fix var calls
...
Jobs should be started and immediately cleaned up when calling the
generation stream. Expose a stream_generate function and append
this to the base class since it's more idiomatic than generate_gen.
The exl2 container's generate_gen function is now internal.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:30:55 -04:00
kingbri
7e007f0761
Model: Handle finish chunks and logprobs in separate functions
...
Helps split up and trim the generate_gen function.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-24 21:19:03 -04:00
kingbri
f2c7da2faf
Tree: Format
...
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:21:26 -04:00
kingbri
3f09fcd8c9
Model: Make model params return a model card
...
The model card is a unified structure for sharing model params.
Rather than kwargs, use this instead.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-21 23:15:46 -04:00
kingbri
13beef8021
Model: Move find_template function to templating
...
Makes sense to extract to a utility function instead.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:27:53 -04:00
kingbri
8e238fa8f6
Model: Move calculate_rope_alpha from backend
...
Makes more sense to use as a utility function. Also clarify how the
vars are set.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 18:20:19 -04:00
kingbri
b751e0a1d5
Model: Move inline overrides to common
...
This is applied across containers. Doesn't make sense to put this method
in the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:51:57 -04:00
kingbri
034682fcf1
Backends: Add base model container
...
Base class for all model containers. Used in the shared model file
for interface.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-20 17:24:10 -04:00
kingbri
f15ac1f69d
Model: Reject model requests when unloading
...
If a model is being unloaded, that means it's being shut down and
no requests should be accepted from then on.
Also, remove model_is_loaded since we simply check if the container
is None now.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-19 22:34:06 -04:00
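Note on f15ac1f69d: a rough sketch of the request guard this commit describes, where "loaded" is inferred from the container not being None and requests are refused once unloading has begun. The class `ModelContainer`, its `unloading` attribute, and `check_model_available` are all hypothetical names used only for illustration.

```python
from typing import Optional


class ModelContainer:
    """Hypothetical stand-in for a backend model container."""

    def __init__(self) -> None:
        # Set to True once shutdown of the model has started.
        self.unloading = False


def check_model_available(container: Optional[ModelContainer]) -> None:
    """Raise if a request should not be served by this container."""
    # No separate "is loaded" flag: a missing container means no model.
    if container is None:
        raise RuntimeError("No model is loaded.")
    # An unloading model is shutting down, so refuse new requests.
    if container.unloading:
        raise RuntimeError("Model is unloading; request rejected.")
```

A request handler would call `check_model_available` before starting generation and translate the exception into an error response.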
kingbri
3f1d5d396e
Model: Store active jobs in tabby
...
Rather than relying on the generator, use tabby to store the active
job IDs.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-04-16 13:17:55 -04:00