Apply the default overrides after inline config has been merged.
Do not require an inline config to apply use_as_default and other
overrides.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Since cache_size is a more important parameter now for multi-user
setups, mark it as such by placing it below max_seq_len.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The previous changes broke existing configs and max_seq_len was
force-overriden to 4096. This helps single-user setups since they
do not really benefit from the split cache_size max_seq_len mechanism
(except if batching).
cache_size is still the prime mover in exl3 due to its paging mechanism.
Ideally, for multi-user setups, cache_size should take as much VRAM
as possible and max_seq_len should be limited.
Breakdown:
cache_size and max_seq_len specified -> values
only cache_size/max_seq_len specified -> max_seq_len = cache_size and vice versa
neither specified -> cache_size = 4096, max_seq_len = min(max_position_embeddings, cache_size)
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
- Cache size is now given only by the cache_size config option. Default is 4096 (user should always override to max out VRAM)
- max_seq_len, if not overridden in the config, will default to the model's config.json
- max_seq_len is reduced to be no larger than the cache
This allows for users to use nccl or native depending on the GPU setup.
NCCL is only available with Linux built wheels.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
When revisiting tool calls, the formats have more or less become standard.
For greater compatibility with templates, primarily use the message.tools
parameter and remove the extra custom metadata that is no longer required.
However, unlike other backends, tabbyAPI still uses template metadata
to declare what the tool start string is. This allows for template-level
customization along with giving more power to the user while the server
exists to consume rather than work on a case-by-case basis.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
use_as_default was not being properly applied into model overrides.
For compartmentalization's sake, apply all overrides in a single function
to avoid clutter.
In addition, fix where the traditional /v1/model/load endpoint checks
for draft options. These can be applied via an inline config, so let
any failures fallthrough.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Anything below the first level of kwargs was not being merged properly.
A more bulletproof solution would be to refactor the loading code
to separate draft and normal model parameters.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Rather than relying on Content-Length which can be unreliable, ping
the API to get file sizes and work from there.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Usually, the client and server both are aware of the file size by
sending a Content-Length header. However, HuggingFace has changed
their headers and now does not always send Content-Length.
In this case, show an indeterminate progressbar and mark as complete
once the download finishes.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
It's useful for the client to know what the T/s and total time for
generation are per-request.
Works with both completions and chat completions.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
A common problem in TabbyAPI is that users who want to get up and
running with a model always had issues with max_seq_len causing OOMs.
This is because model devs set max context values in the millions which
requires a lot of VRAM.
To idiot-proof first time setup, make the fallback default 4096 so
users can run their models. If a user still wants to use the model's
max_seq_len, set it to -1.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Some packages such as ExllamaV2 and V3 require specific versions for
the latest features. Rather than creating repetitive functions, create
an agnostic function to check the installed package and then report
to the user to upgrade.
This is also sent to requests for loading and unloading, so keep the
error short.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
The HFModel class serves to coalesce all config files that contain
random keys which are required for model usage.
Adding this base class allows us to expand as HuggingFace randomly
changes their JSON schemas over time, reducing the brunt that backend
devs need to feel when their next model isn't supported.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
This parameter is way too confusing and does not make sense in
the modern LLM space.
Change approved by all maintainers.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
If an inference dep isn't present, force exit the application. This
occurs after all subcommands have been appropriately processed.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Seemed out of place in the common load function. In addition, rename
the transformers utils signature which actually takes a directory
instead of a file.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Adding a comma in the description converts the string to a tuple,
which isn't parseable by argparse's help.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
Fixes application of sampler parameters by adding a new sampler builder
interface. Also expose the generator class-wide and add wait_for_jobs.
Finally, allow inline loading to specify the backend.
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>