* Model: Implement basic lora support
* Add ability to load loras from config on launch
* Supports loading multiple loras and lora scaling
* Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
* Add /loras/ endpoint for querying available loras
* Add /model/lora endpoint for querying currently loaded loras
* Add /model/lora/load endpoint for loading loras
* Add /model/lora/unload endpoint for unloading loras
* Move lora config-checking logic to main.py for better compat with API endpoints
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses down endpoint and model load code. Also makes the routes
behave the same way as model routes to help not confuse the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures along with consolidating the request
to the rewritten load_loras function.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
Low_mem doesn't work in exl2 and it was an experimental option to
begin with. Keep the loading code commented out in case it gets fixed
in the future.
A better alternative is to use 8bit cache which works and helps save
VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
Some APIs require an OAI model to be sent against the models endpoint.
Fix this by adding a GPT 3.5 turbo entry as first in the list to cover
as many APIs as possible.
Signed-off-by: kingbri <bdashore3@proton.me>
This helps clarify things when users are configuring for the first
time. For example, some users were putting the model name in the
"model" block instead of the "model_name" field.
Signed-off-by: kingbri <bdashore3@proton.me>
Speculative decoding makes use of draft models that ingest the prompt
before forwarding it to the main model.
Add options in the config to support this. API options will occur
in a different commit.
Signed-off-by: kingbri <bdashore3@proton.me>
Models can be loaded and unloaded via the API. Also add authentication
to use the API and for administrator tasks.
Both types of authorization use different keys.
Also fix the unload function to properly free all used vram.
Signed-off-by: kingbri <bdashore3@proton.me>
YAML is a more flexible format when it comes to configuration. Commandline
arguments are difficult to remember and configure especially for
an API with complicated commandline names. Rather than using half-baked
textfiles, implement a proper config solution.
Also add a progress bar when loading models in the commandline.
Signed-off-by: kingbri <bdashore3@proton.me>