Commit Graph

155 Commits

Author SHA1 Message Date
kingbri
5002617eac Model: Split cache creation into a common function
Unifies the switch statement across both draft and model caches.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
Ben Gitter
70b9fc95de [WIP] OpenAI Tools Support/Function calling (#154)
* returning stop str if exists from gen

* added chat template for firefunctionv2

* pulling tool vars from template

* adding parsing for tool inputs/outputs

* passing tool data from endpoint to chat template, adding tool_start to the stop list

* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format

* non streaming generation prototype

* cleaning template

* Continued work with type, ingestion into template, and chat template for fire func

* Correction: a streaming tool call comes back as a delta object, not inside the chat completion response choice, per chat_completion_chunk.py in the OAI lib.

* Ruff Formatting

* Moved stop string and tool updates out of prompt creation func

Updated tool pydantic to match OAI

Support for streaming

Updated generate tool calls to use flag within chat_template and insert tool reminder

* Llama 3.1 chat templates

Updated fire func template

* renamed llama3.1 to chatml_with_headers

* update name of template

* Support for calling a tool start token rather than the string.

Simplified tool_params

Warning when gen_settings are being overridden because the user set temp to 0

Corrected schema and tools to the correct types for function args (they were str for some reason)

* draft groq tool use model template

* changed headers to vars for readability (but mostly because some models are weird about newlines after headers, so this is an easier way to change globally)

* Clean up comments and code in chat comp

* Post-processed the tool call to meet the OAI spec rather than forcing the model to write JSON in a string in the middle of the call.

* changed example back to args as JSON rather than a string of JSON

* Standardize chat templates to each other

* cleaning/rewording

* stop elements can also be ints (tokens)

* Cleaning/formatting

* added special tokens for tools and tool_response as specified in description

* Cleaning

* removing aux templates - going to live in llm-promp-templates repo instead

* Tree: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Chat Completions: Don't include internal tool variables in OpenAPI

Use SkipJsonSchema to suppress inclusion in the OpenAPI JSON (see the
sketch after this entry). The location of these variables may need to
be changed in the future.

Signed-off-by: kingbri <bdashore3@proton.me>

* Templates: Deserialize metadata on template load

Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tools: Fix comments

Adhere to the format style of comments in the rest of the project.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-17 00:16:25 -04:00
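As a minimal sketch of the SkipJsonSchema approach from the commit above, assuming a hypothetical field name rather than TabbyAPI's actual internal variables:

    from pydantic import BaseModel
    from pydantic.json_schema import SkipJsonSchema

    class ChatCompletionRequest(BaseModel):
        model: str
        # Hypothetical internal tool variable; SkipJsonSchema keeps it
        # out of the JSON schema that OpenAPI generation reads.
        tool_precursor: SkipJsonSchema[str | None] = None

    # The skipped field never appears in the generated properties.
    assert "tool_precursor" not in ChatCompletionRequest.model_json_schema()["properties"]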
kingbri
63650d2c3c Model: Disable banned strings if grammar is used
ExllamaV2 filters don't allow for rewinding, which is what banned
strings rely on. Therefore, banned strings are not compatible with
constrained generation via LMFE or outlines for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-05 11:08:58 -04:00
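A rough sketch of that guard, with hypothetical names for the filter list and logger:

    import logging

    logger = logging.getLogger(__name__)

    def resolve_banned_strings(grammar_filters: list, banned_strings: list) -> list:
        # ExllamaV2 filters (LMFE/outlines grammars) can't rewind, which
        # banned strings require, so drop them when a grammar is active.
        if grammar_filters and banned_strings:
            logger.warning("Banned strings are disabled while a grammar filter is active.")
            return []
        return banned_strings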
kingbri
8ff2586d45 Start: Fix pip update, method calls, and logging
platform.system() was not called in some places, breaking the
ternary on Windows.

Pip's --upgrade flag does not update already-satisfied dependencies to
their latest versions. That's what the --upgrade-strategy eager flag is for.

Tell the user where their start preferences are coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-04 10:30:26 -04:00
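The platform.system() bug is easy to picture; a sketch with hypothetical script names:

    import platform

    # Broken: platform.system is a function object, so the comparison is
    # always False and Windows never gets the .bat path.
    script = "start.bat" if platform.system == "Windows" else "start.sh"

    # Fixed: call the function to get "Windows", "Linux", or "Darwin".
    script = "start.bat" if platform.system() == "Windows" else "start.sh"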
kingbri
b6d2676f1c Start: Give the user a hint when a module can't be imported
If an ImportError or ModuleNotFoundError is raised, tell the user
to run the update scripts.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 21:59:06 -04:00
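A minimal sketch of that hint (the message wording is an assumption):

    try:
        import exllamav2  # any backend dependency
    except ImportError as exc:  # ModuleNotFoundError is a subclass
        raise SystemExit(
            f"Could not import '{exc.name}'. "
            "Try running the update script to reinstall dependencies."
        ) from exc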
kingbri
2a33ebbf29 Model: Bypass lock checks when shutting down
Previously, when a SIGINT was emitted while a model load was running,
the API didn't shut down until the load finished due to waiting for
the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 16:05:34 -04:00
kingbri
0bcb4e4a7d Model: Attach request ID to logs
If multiple logs come in at once, track which log corresponds to
which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:25:54 -04:00
kingbri
9390d362dd Model: Log generation params and metrics after the prompt/response
A user's prompt and response can be large in the console. Therefore,
always log the smaller payloads (ex. gen params + metrics) after
the large chunks.

However, it's recommended to keep prompt logging off anyway since
it'll result in console spam.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:19:21 -04:00
Brian Dashore
1bf062559d Merge pull request #158 from AlpinDale/embeddings
feat: add embeddings support via Infinity-emb
2024-07-31 20:33:12 -04:00
kingbri
46304ce875 Model: Properly pass in max_batch_size from config
The override wasn't being passed in before. Also, the default is now
None since Exl2 can automatically calculate the max batch size.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 18:42:25 -04:00
kingbri
dc3dcc9c0d Embeddings: Update config, args, and parameter names
Use embeddings_device as the parameter for device to remove ambiguity.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 15:32:26 -04:00
kingbri
f13d0fb8b3 Embeddings: Add model load checks
Same as the normal model container.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 11:17:36 -04:00
kingbri
01c7702859 Signal: Fix async signal handling
Run unload async functions before exiting the program.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 11:11:05 -04:00
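One common pattern for running async cleanup on a signal, sketched with a hypothetical unload function (note add_signal_handler is Unix-only):

    import asyncio
    import signal

    async def unload_all():
        ...  # hypothetical: unload model, embeddings, etc.

    async def shutdown(loop: asyncio.AbstractEventLoop):
        await unload_all()  # run async unloads before exiting
        loop.stop()

    def install_handlers(loop: asyncio.AbstractEventLoop):
        for sig in (signal.SIGINT, signal.SIGTERM):
            loop.add_signal_handler(
                sig, lambda: asyncio.ensure_future(shutdown(loop))
            )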
kingbri
fbf1455db1 Embeddings: Migrate and organize Infinity
Use Infinity as a separate backend and handle the model within the
common module. This separates out the embeddings model from the endpoint
which allows for model loading/unloading in core.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 11:00:23 -04:00
kingbri
7522b1447b Model: Add support for HuggingFace config and bad_words_ids
This is necessary for Kobold's API. Current models use bad_words_ids
in generation_config.json, but for some reason, they're also present
in the model's config.json.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-26 18:23:22 -04:00
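A sketch of reading the key from either file (the helper name and fallback order are assumptions):

    import json
    from pathlib import Path

    def load_bad_words_ids(model_dir: Path) -> list:
        # Prefer generation_config.json but fall back to config.json,
        # since some models carry bad_words_ids in both.
        for name in ("generation_config.json", "config.json"):
            path = model_dir / name
            if path.exists():
                ids = json.loads(path.read_text()).get("bad_words_ids")
                if ids:
                    return ids
        return []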
kingbri
b7cb6f0b91 API: Add KoboldAI server
Used for interacting with applications that use KoboldAI's API
such as horde.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-26 16:37:30 -04:00
kingbri
3e8ffebdd3 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:32:50 -04:00
kingbri
9ad69e8ab6 API: Migrate universal routes to core
Place OAI-specific routes in the appropriate folder. This is in
preparation for adding new API servers that can be optionally enabled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-23 14:08:48 -04:00
kingbri
191600a150 Revert "Model: Skip empty token chunks"
This reverts commit 21516bd7b5.

This skips EOS, and implementing it the proper way seems more
costly than necessary.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 18:34:00 -04:00
kingbri
21516bd7b5 Model: Skip empty token chunks
This helps make the generation loop more efficient by skipping past
chunks that aren't providing any tokens anyway. The offset isn't
affected.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-22 12:23:49 -04:00
kingbri
cae94b920c API: Add ability to use request IDs
Identify which request is being processed to help users disambiguate
which logs correspond to which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-21 21:01:05 -04:00
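A minimal sketch of tagging logs with a per-request ID (logger name and format are assumptions):

    import logging
    import uuid

    logger = logging.getLogger("tabby")

    def new_request_id() -> str:
        return uuid.uuid4().hex

    def log_request(request_id: str, message: str) -> None:
        # Prefix each line so interleaved logs from concurrent
        # requests can be told apart.
        logger.info("[%s] %s", request_id, message)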
kingbri
933404c185 Model: Warn user if terminating jobs
If skip_wait is true, it's best to let the user know that all jobs
will be forcibly cancelled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-15 11:34:16 -04:00
kingbri
9dae461142 Model: Attempt to recreate generator on a fatal error
If a job causes the generator to error, tabby stops working until
a relaunch. It's better to build in some redundancy and remake the
generator in the event that it fails.

May replace this with an exit signal for a fatal error instead, but
not sure.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-15 01:09:49 -04:00
kingbri
073e9fa6f0 Dependencies: Bump ExllamaV2
v0.1.7

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
1f46a1130c OAI: Restrict list permissions for API keys
API keys are not allowed to view all the admin's models, templates,
draft models, loras, etc. Basically, anything that can be viewed on
the filesystem, other than what's currently loaded, is not returned
unless an admin key is present.

This change helps preserve user privacy while not erroring out on
list endpoints that the OAI spec requires.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-11 14:22:50 -04:00
kingbri
62e495fc13 Model: Grammar: Fix lru_cache clear function
It's cache_clear, not clear_cache.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-08 15:10:15 -04:00
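The fix is plain functools naming; functions wrapped in lru_cache expose cache_clear():

    from functools import lru_cache

    @lru_cache
    def tokenizer_data(model_path: str):
        ...  # expensive construction, cached per path

    tokenizer_data.cache_clear()   # correct: provided by lru_cache
    # tokenizer_data.clear_cache() # wrong: AttributeError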
turboderp
e97ad9cb27 RUFF 2024-07-08 03:51:14 +02:00
turboderp
8bbce3455c RUFF 2024-07-08 03:49:26 +02:00
turboderp
4cf79c5ae1 Clear tokenizer_data cache when unloading model 2024-07-08 03:31:05 +02:00
turboderp
b7e7df1220 Move tokenizer_data cache to global scope 2024-07-08 02:54:49 +02:00
turboderp
4d0bb1ffc3 Cache creation tokenizer_data in LMFE 2024-07-08 00:51:59 +02:00
turboderp
bb8b02a60a Wrap arch_compat_overrides in try block
Quick fix until exllamav2 0.1.7 releases, since the function isn't defined for 0.1.6.
2024-07-07 07:54:05 +02:00
kingbri
773639ea89 Model: Fix flash-attn checks
If flash attention is already turned off by ExllamaV2 itself, don't
try creating a paged generator. Also condense all the redundant
logic into one if statement.

Additionally, check arch_compat_overrides to see if flash attention
should be disabled for a model arch (ex. Gemma 2).

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-06 20:58:24 -04:00
kingbri
c5ea2abe24 Dependencies: Update ExllamaV2
v0.1.6

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-23 21:45:04 -04:00
kingbri
c575105e41 ExllamaV2: Cleanup log placements
Move the large import errors into the check functions themselves.
This makes it easier to tell where errors are coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-06-16 00:16:03 -04:00
Glenn Maynard
8da7644571 Fix exception unloading models. (#138)
self.generator is None if a model load fails or is cancelled.
2024-06-15 23:44:29 +02:00
DocShotgun
85387d97ad Fix disabling flash attention in exl2 config (#136)
* Model: Fix disabling flash attention in exl2 config

* Model: Pass no_flash_attn to draft config

* Model: Force torch flash SDP off in compatibility mode
2024-06-12 20:00:46 +02:00
DocShotgun
156b74f3f0 Revision to paged attention checks (#133)
* Model: Clean up paged attention checks

* Model: Move cache_size checks after paged attn checks
Cache size is only relevant in paged mode

* Model: Fix no_flash_attention

* Model: Remove no_flash_attention
Ability to use flash attention is auto-detected, so this flag is unneeded. Uninstall flash attention to disable it on supported hardware.
2024-06-09 17:28:11 +02:00
DocShotgun
55d979b7a5 Update dependencies, support Python 3.12, update for exl2 0.1.5 (#134)
* Dependencies: Add wheels for Python 3.12

* Model: Switch fp8 cache to Q8 cache

* Model: Add ability to set draft model cache mode

* Dependencies: Bump exllamav2 to 0.1.5

* Model: Support Q6 cache

* Config: Add Q6 cache and draft_cache_mode to config sample
2024-06-09 17:27:39 +02:00
DocShotgun
dcd9428325 Model: Warn if cache size is too small for CFG (#132) 2024-06-05 19:40:14 +02:00
DocShotgun
e391d84e40 More extensive checks for paged mode support (#121)
* Model: More extensive checks for paged attention
Previously, TabbyAPI only checked whether the user's hardware supports flash attention before deciding whether to enable paged mode.
This adds checks for whether no_flash_attention is set, whether flash-attn is installed, and whether the installed version supports paged attention.

* Tree: Format

* Tree: Lint

* Model: Check GPU architecture first
Check GPU arch prior to checking whether flash attention 2 is installed
2024-06-05 09:33:21 +02:00
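A sketch of layering those checks; the exact arch cutoff and minimum flash-attn version here are assumptions, not the project's values:

    from importlib.metadata import PackageNotFoundError, version

    import torch
    from packaging.version import parse

    def supports_paged_attn(no_flash_attn: bool = False) -> bool:
        if no_flash_attn:
            return False
        # Check GPU architecture first: flash attention 2 needs
        # compute capability 8.0+ (Ampere or newer).
        if torch.cuda.get_device_capability()[0] < 8:
            return False
        try:
            flash_ver = parse(version("flash-attn"))
        except PackageNotFoundError:
            return False  # flash-attn isn't installed
        return flash_ver >= parse("2.5.7")  # assumed paged-attn minimum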
turboderp
dbdcb38ad7 Allow either "[" or "{" prefix to support JSON grammar with top level arrays (#129) 2024-06-04 02:32:39 +02:00
turboderp
e889fa3efe Bump exllamav2 to v0.1.4 (#128) 2024-06-04 02:32:08 +02:00
Brian Dashore
516b52b341 Merge pull request #112 from DocShotgun/main
Separate new prompt tokens from those reused from cache in metric logging
2024-05-27 18:04:43 -04:00
kingbri
19961f4126 Dependencies: Update ExllamaV2
v0.1.1

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-27 13:38:07 -04:00
kingbri
116cf56c87 Model: Auto-round cache size on init
Cache size must be a multiple of 256 to work properly in ExllamaV2.
Take the config value and round the cache size up to the next multiple
of 256.

Rounding up matters because the cache size can never be lower than
max_seq_len. If max_seq_len isn't a multiple of 256, rounding up will
never yield a number lower than max_seq_len, even though the raw
config value is no longer the source of truth.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 21:24:54 -04:00
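The rounding itself is ceiling math; a sketch (the helper name is hypothetical, and the max_seq_len floor reflects the reasoning above):

    def round_cache_size(cache_size: int, max_seq_len: int) -> int:
        # Round up to the next multiple of 256 via ceiling division,
        # never returning less than max_seq_len.
        rounded = -(-cache_size // 256) * 256
        return max(rounded, max_seq_len)

    # e.g. round_cache_size(10000, 8192) -> 10240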
DocShotgun
ce5e2ec8de Logging: Clarify new vs cached tokens in prompt processing 2024-05-26 18:21:17 -07:00
DocShotgun
767e6a798a API + Model: Add support for specifying k/v cache size 2024-05-26 14:17:01 -07:00
kingbri
660f9b8432 OAI: Fix request cancellation behavior
Depending on the day of the week, Starlette can surface a CancelledError
or require await request.is_disconnected(). Run the same behavior for both
cases and allow cancellation.

Streaming requests now set an event to cancel the batched job and break
out of the generation loop.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 13:00:33 -04:00
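A sketch of covering both cancellation paths in a streaming response (the job handle is hypothetical):

    import asyncio

    from starlette.requests import Request

    async def stream_generation(request: Request, job):
        try:
            async for chunk in job:
                # Path 1: poll for a client disconnect between chunks.
                if await request.is_disconnected():
                    job.cancel()  # hypothetical batched-job handle
                    break
                yield chunk
        except asyncio.CancelledError:
            # Path 2: Starlette cancels the coroutine directly.
            job.cancel()
            raise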
kingbri
094c7b1734 Model: Fix paged and FA2 checks
If a user is using GPU split, check compute capability on only those
GPUs. Autosplit assumes that all GPUs will be used.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-05-26 11:29:31 -04:00