The GPU reserve is used as a VRAM buffer to prevent GPU memory
overflow when automatically deciding how to split a model across
multiple GPUs.
Make this configurable.
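As a sketch of where the reserve gets applied, assuming exllamav2's
load_autosplit and a hypothetical gpu_split_auto_reserve config key
(config, model, and cache are assumed to be in scope):

    import torch

    # Hypothetical config key holding the reserve size in megabytes
    reserve_mb = config.get("gpu_split_auto_reserve", 96)
    reserve_bytes = [int(reserve_mb * 1024**2)] * torch.cuda.device_count()

    # exllamav2's autosplit loader takes a per-device byte reserve
    model.load_autosplit(cache, reserve_vram=reserve_bytes)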
Signed-off-by: kingbri <bdashore3@proton.me>
Adds chat completion logprob support using OAI's spec. Tokens are
not converted to tiktoken here since that would add an extra
dependency for no real reason.
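For reference, a choice with logprobs enabled carries roughly this
shape under OAI's chat spec (values made up, shown as a Python dict
instead of raw JSON):

    logprobs = {
        "content": [
            {
                "token": "Hello",
                "logprob": -0.31,
                "top_logprobs": [
                    {"token": "Hello", "logprob": -0.31},
                    {"token": "Hi", "logprob": -1.42},
                ],
            }
        ]
    }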
Signed-off-by: kingbri <bdashore3@proton.me>
Returns token offsets, the selected tokens, post-sampling token
probabilities, and the normalized pre-sampling probability of each
selected token (for efficiency purposes).
Only for text completions; chat completions come in a later commit.
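The legacy completions logprob object looks roughly like this
(values made up):

    logprobs = {
        "text_offset": [0, 5],
        "tokens": ["Hello", " world"],
        "token_logprobs": [-0.31, -0.05],   # post-sampling picks
        "top_logprobs": [
            {"Hello": -0.31, "Hi": -1.42},  # pre-sampling candidates
            {" world": -0.05},
        ],
    }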
Signed-off-by: kingbri <bdashore3@proton.me>
It makes more sense to use GPU split parameters when the user has
more than one GPU. Otherwise, set split and split_auto to False and
save the user some VRAM.
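The check is roughly this (torch is already a dependency through
exllamav2):

    import torch

    if torch.cuda.device_count() <= 1:
        # One GPU: skip the split machinery and the autosplit reserve
        gpu_split = None
        gpu_split_auto = False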
Signed-off-by: kingbri <bdashore3@proton.me>
The model loader was using more VRAM on a single GPU compared to
base exllamav2's loader. This was because single-GPU setups were
running with the autosplit config, which allocates an extra VRAM
buffer for safe loading. Turn this off for single-GPU setups (and
turn it off by default).
This change should let users run models that need the entire card,
with hopefully faster T/s. For example, Mixtral at 3.75bpw went from
~30 T/s to 50 T/s thanks to the extra VRAM headroom on Windows.
Signed-off-by: kingbri <bdashore3@proton.me>
Clean up how overrides are handled and classes are named, and adopt
exllamav2's model class to enforce latest stable version methods
rather than adding multiple backwards compatibility checks.
Signed-off-by: kingbri <bdashore3@proton.me>
Allow users to switch the currently overridden samplers via the API
so a restart isn't required to change the overrides.
Signed-off-by: kingbri <bdashore3@proton.me>
Unify API sampler params into a superclass, which should make them
easier to manage and lets them inherit generic functions.
Not all frontends expose every sampling parameter due to their ties
to OAI's API (which handles most sampling itself, apart from a few
sliders).
Add the ability for the user to customize fallback parameters
server-side.
In addition, parameters can be forced to a certain value server-side
in case a frontend automatically sets other sampler values in the
background that the user doesn't want.
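A rough sketch of the override resolution; the table shape and the
get_override name are illustrative, not the exact implementation:

    # Hypothetical table loaded from a server-side override file:
    # {"temperature": {"override": 0.8, "force": False}, ...}
    sampler_overrides = {}

    def get_override(name, user_value, default):
        entry = sampler_overrides.get(name, {})
        if entry.get("force"):
            return entry["override"]           # server always wins
        if user_value is not None:
            return user_value                  # client-supplied value
        return entry.get("override", default)  # server-side fallback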
Signed-off-by: kingbri <bdashore3@proton.me>
Move common functions into their own folder and refactor the
backends to use their own folder as well.
Also clean up imports and alphabetize the import statements
themselves.
Finally, move colab and docker into their own folders as well.
Signed-off-by: kingbri <bdashore3@proton.me>
Helps with understanding API aliases. These aliases should not be
used directly, but are helpful for developers who want frontend
compatibility.
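For example, a frontend-compatible alias on a field looks roughly
like this in pydantic (the alias name here is illustrative):

    from pydantic import BaseModel, Field

    class SamplerParams(BaseModel):
        # Some frontends send "rep_pen" instead of the canonical name
        repetition_penalty: float = Field(1.0, alias="rep_pen")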
Signed-off-by: kingbri <bdashore3@proton.me>
CFG, or classifier-free guidance, helps push a model in different
directions based on a negative prompt the user provides.
Currently, CFG is ignored if the negative prompt is blank (it
shouldn't be used that way anyway).
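For context, CFG blends the positive and negative prompt logits
with a guidance scale; the standard formulation is roughly:

    def apply_cfg(cond_logits, uncond_logits, cfg_scale):
        # cfg_scale = 1.0 reproduces the positive prompt's logits;
        # higher values push further away from the negative prompt
        return uncond_logits + cfg_scale * (cond_logits - uncond_logits)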
Signed-off-by: kingbri <bdashore3@proton.me>
All penalties can have a sustain (range) applied to them in exl2,
so clarify the parameter.
However, the default behavior changes based on whether frequency or
presence penalty is enabled. For the sanity of OAI users, have
frequency and presence penalty only apply to the output tokens when
range is -1 (the default).
Repetition penalty, however, still functions the old way, where -1
means the range is the max sequence length.
Doing this prevents gibberish output when using the more modern
frequency and presence penalties, similar to llama.cpp's behavior.
NOTE: This logic is still subject to change in the future, but I
believe it hits the happy medium between users who want defaults and
users who want to tinker with the sampling knobs.
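In pseudocode, the default resolution looks something like this
(names are illustrative):

    def resolve_penalty_range(requested_range, generated_len,
                              max_seq_len, freq_or_pres_pen_active):
        if requested_range != -1:
            return requested_range
        # -1 default: freq/pres penalties only cover generated output,
        # while repetition penalty keeps covering the full context
        return generated_len if freq_or_pres_pen_active else max_seq_len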
Signed-off-by: kingbri <bdashore3@proton.me>
Some models (such as Mistral and Mixtral) set their base sequence
length to 32k because they assume support for sliding window
attention.
Therefore, add a parameter to override the base sequence length of a
model, which helps with auto-calculation of rope alpha.
If auto-calculation of rope alpha isn't being used, the max_seq_len
parameter works fine as is.
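The auto-calculation keys off the ratio between the target and base
lengths; a commonly used quadratic NTK-alpha fit (coefficients from
community curve-fitting, shown here for illustration) looks like:

    def calculate_rope_alpha(base_seq_len, target_seq_len):
        ratio = target_seq_len / base_seq_len
        if ratio <= 1:
            return 1.0
        return -0.13436 + 0.80541 * ratio + 0.28833 * ratio ** 2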
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, the max sequence length was overridden by the user's
config and never took the model's config.json into account.
Now, set the default to 4096, but include config.prepare when
selecting the max sequence length. The YAML config and API request
now serve as overrides rather than the only source.
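The selection order becomes request value, then YAML value, then
the model's own config, then 4096. With the unwrap helper from the
None-coalescing commit at the end of this log (request_seq_len and
yaml_seq_len are illustrative names), that's roughly:

    config.prepare()  # exllamav2 reads the model's config.json here
    max_seq_len = unwrap(
        request_seq_len,
        unwrap(yaml_seq_len, config.max_seq_len or 4096),
    )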
Signed-off-by: kingbri <bdashore3@proton.me>
Use exllamav2's token bias, which is the functional equivalent of
OAI's logit bias parameter.
String keys are cast to integers on request, and an error is raised
if an invalid value is passed.
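The conversion itself is a small sketch (the exact error type may
differ):

    def parse_logit_bias(raw_bias: dict) -> dict:
        # OAI sends {"token_id_as_string": bias}; exllamav2's token
        # bias wants integer token IDs as keys
        parsed = {}
        for token, bias in raw_bias.items():
            try:
                parsed[int(token)] = bias
            except ValueError as exc:
                raise ValueError(
                    f"Invalid logit bias token: {token}"
                ) from exc
        return parsed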
Signed-off-by: kingbri <bdashore3@proton.me>
Append a generation prompt if given the flag on an OAI chat
completion request.
This appends the start of an "assistant" message to the instruct
prompt. Defaults to true since this is the intended behavior.
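With the jinja-style templates used for chat completions, this maps
to the add_generation_prompt switch (template and messages assumed
in scope):

    prompt = template.render(
        messages=messages,
        # appends e.g. "<|im_start|>assistant\n" for ChatML templates
        add_generation_prompt=True,
    )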
Signed-off-by: kingbri <bdashore3@proton.me>
Validation wasn't properly run on older pydantic versions, so a
ChatCompletionRespChoice was being sent instead of a
ChatCompletionMessage when streaming responses.
Signed-off-by: kingbri <bdashore3@proton.me>
Adding field descriptions shows which parameters are used solely
for OAI compliance and aren't actually parsed in the model code.
Signed-off-by: kingbri <bdashore3@proton.me>
Jinja2 is a lightweight template parser that's used in Transformers
for parsing chat completions. It's much more efficient than Fastchat
and can be installed as part of requirements.
This also unblocks Pydantic's version.
Users now have to provide their own template if needed. A separate
repo may be used for common prompt template storage.
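As an example of what a user-provided template might look like,
here's a minimal ChatML-style template:

    from jinja2 import Template

    chatml = Template(
        "{% for m in messages %}"
        "<|im_start|>{{ m.role }}\n{{ m.content }}<|im_end|>\n"
        "{% endfor %}"
        "{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    )
    prompt = chatml.render(
        messages=[{"role": "user", "content": "Hi"}],
        add_generation_prompt=True,
    )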
Signed-off-by: kingbri <bdashore3@proton.me>
New parameter that's safe to edit in exllamav2 v0.0.11. Only recommended
for people who know what they're doing.
Signed-off-by: kingbri <bdashore3@proton.me>
Rope alpha changes don't require removing the 1.0 default
from Rope scale.
Keep defaults when possible to avoid errors.
Signed-off-by: kingbri <bdashore3@proton.me>
Forgot that the user can choose which cache mode to use when
loading a model.
Also add it when fetching model info.
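Selection is a simple branch on the requested mode (a sketch,
assuming exllamav2's FP8 cache class):

    from exllamav2 import ExLlamaV2Cache, ExLlamaV2Cache_8bit

    cache_class = (
        ExLlamaV2Cache_8bit if cache_mode == "FP8" else ExLlamaV2Cache
    )
    cache = cache_class(model, lazy=True)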
Signed-off-by: kingbri <bdashore3@proton.me>
Generations can be logged in the console along with sampling
parameters if the user enables it in config.
Metrics are always logged at the end of each prompt. In addition,
the model endpoint tells the user whether they're being logged for
transparency purposes.
Signed-off-by: kingbri <bdashore3@proton.me>
Sometimes Fastchat may not be able to detect the prompt template
from the model path. Therefore, add the ability to set it in
config.yml or via the request object itself.
Also send the provided prompt template on model info requests.
Signed-off-by: kingbri <bdashore3@proton.me>
Python doesn't have proper handling of optionals. The only ways to
handle them are checking whether a value is None via an if statement
or using the "or" keyword to unwrap optionals.
Previously, I used the "or" method to unwrap, but this caused issues
because falsy values fell back to the default. This is especially
the case with booleans, where "False" changed to "True".
Instead, add two new functions: unwrap and coalesce. Both properly
implement a functional style of "None" coalescing.
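A sketch of the two helpers; this matches the intent described
above, though the shipped signatures may differ:

    def unwrap(wrapped, default=None):
        # Check identity against None rather than truthiness so that
        # False, 0, and "" survive unwrapping
        return wrapped if wrapped is not None else default

    def coalesce(*args):
        # Return the first non-None argument, like SQL's COALESCE
        return next((arg for arg in args if arg is not None), None)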
Signed-off-by: kingbri <bdashore3@proton.me>