* Model: Implement basic lora support
  * Add ability to load loras from config on launch
  * Support loading multiple loras and lora scaling
  * Add function to unload loras
* Colab: Update for basic lora support
* Model: Test vram alloc after lora load, add docs
* Git: Add loras folder to .gitignore
* API: Add basic lora-related endpoints
  * Add /loras/ endpoint for querying available loras
  * Add /model/lora endpoint for querying currently loaded loras
  * Add /model/lora/load endpoint for loading loras
  * Add /model/lora/unload endpoint for unloading loras
  * Move lora config-checking logic to main.py for better compat with API endpoints
* Revert bad CRLF line ending changes
* API: Add basic lora-related endpoints (fixed)
  * Add /loras/ endpoint for querying available loras
  * Add /model/lora endpoint for querying currently loaded loras
  * Add /model/lora/load endpoint for loading loras
  * Add /model/lora/unload endpoint for unloading loras
  * Move lora config-checking logic to main.py for better compat with API endpoints
* Model: Unload loras first when unloading model
* API + Models: Cleanup lora endpoints and functions
Condenses the endpoint and model load code. Also makes the lora routes
behave the same way as the model routes to avoid confusing the end user.
Signed-off-by: kingbri <bdashore3@proton.me>
* Loras: Optimize load endpoint
Return successes and failures, and consolidate the request into the
rewritten load_loras function.
Signed-off-by: kingbri <bdashore3@proton.me>
---------
Co-authored-by: kingbri <bdashore3@proton.me>
Co-authored-by: DocShotgun <126566557+DocShotgun@users.noreply.github.com>
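A minimal sketch of the load flow described above, assuming each requested lora is a folder name under a lora directory plus a scaling factor. `LoraSpec`, `LoraContainer`, and the folder-existence check are hypothetical stand-ins for real adapter loading; the point is that each entry succeeds or fails independently, the result reports both lists, and loras are dropped before the base model is unloaded.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class LoraSpec:
    """One entry of a hypothetical load request: folder name plus scaling."""
    name: str
    scaling: float = 1.0


@dataclass
class LoraContainer:
    """Hypothetical stand-in for the model container's lora bookkeeping."""
    lora_dir: Path
    active: dict[str, float] = field(default_factory=dict)

    def load_loras(self, specs: list[LoraSpec]) -> dict[str, list[str]]:
        """Load each lora independently and report successes and failures."""
        result: dict[str, list[str]] = {"success": [], "failure": []}
        for spec in specs:
            lora_path = self.lora_dir / spec.name
            # A real loader would read adapter weights here; this sketch only
            # checks that the folder exists, so one bad entry can't abort the batch
            if lora_path.is_dir():
                self.active[spec.name] = spec.scaling
                result["success"].append(spec.name)
            else:
                result["failure"].append(spec.name)
        return result

    def unload_loras(self) -> None:
        """Drop all active loras; meant to run before the base model unloads."""
        self.active.clear()
```

Returning both lists lets a client see which entries of a multi-lora request were applied without treating a single bad name as a failed request.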
Draft wasn't being parsed correctly with the new changes, which removed
the draft_enabled bool. There's still more work to be done on
returning exceptions.
Signed-off-by: kingbri <bdashore3@proton.me>
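With draft_enabled gone, one way to decide whether speculative decoding is on is to look at the draft block itself. This sketch assumes a config dict with an optional `draft` mapping whose `draft_model_name` key (a hypothetical name) must be set; anything else disables the draft model.

```python
from typing import Any, Optional


def parse_draft_config(model_config: dict[str, Any]) -> Optional[dict[str, Any]]:
    """Return the draft settings, or None when no draft model is configured."""
    draft = model_config.get("draft")
    # No draft_enabled flag: a draft mapping that names a model enables it
    if not isinstance(draft, dict) or not draft.get("draft_model_name"):
        return None
    # Copy so later edits don't mutate the caller's config
    return dict(draft)
```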
Switching to the new fallback changes unearthed lots of bugs.
Fix them and make sure samplers are set properly.
Signed-off-by: kingbri <bdashore3@proton.me>
The OAI spec requires a streamed chat completion to provide a finish
reason once streaming completes. This differs from a non-streaming
chat completion response.
Also fix some errors that were raised by the endpoint.
References #15
Signed-off-by: kingbri <bdashore3@proton.me>
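A sketch of the streaming shape the OAI spec expects, using plain dicts instead of the project's response models (an assumption for self-containment): every text chunk carries `finish_reason: null`, and a final chunk with an empty `delta` carries the finish reason. The model name and ID format are illustrative.

```python
import json
import time
import uuid
from typing import Iterable, Iterator, Optional


def make_chunk(chunk_id: str, model: str, content: Optional[str],
               finish_reason: Optional[str]) -> dict:
    """Build one OAI-style chat.completion.chunk payload."""
    delta = {} if content is None else {"content": content}
    return {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": delta, "finish_reason": finish_reason}],
    }


def stream_chat(tokens: Iterable[str], model: str = "example-model") -> Iterator[str]:
    """Yield SSE frames: text chunks first, then one chunk carrying finish_reason."""
    chunk_id = f"chatcmpl-{uuid.uuid4().hex}"
    for text in tokens:
        # Intermediate chunks carry text and a null finish_reason
        yield f"data: {json.dumps(make_chunk(chunk_id, model, text, None))}\n\n"
    # Final chunk: empty delta plus the finish reason the spec expects
    yield f"data: {json.dumps(make_chunk(chunk_id, model, None, 'stop'))}\n\n"
```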
This reverts commit cad144126f.
Change this parameter back to repetition_decay. This is different from
rep_pen_slope, which is used in other backends such as kobold and NAI.
Still keep the fallback condition, though.
Signed-off-by: kingbri <bdashore3@proton.me>
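A sketch of reading repetition_decay with a fallback. The source does not say what the fallback is; this assumes, for illustration only, that an unset decay falls back to the repetition range (or 0 when the range is disabled). `rep_pen_slope` is deliberately not accepted as a synonym, since it means something different in other backends.

```python
from typing import Any


def coalesce(*values: Any) -> Any:
    """Return the first argument that is not None (0 counts as a real value)."""
    return next((v for v in values if v is not None), None)


def resolve_repetition_decay(kwargs: dict[str, Any], repetition_range: int) -> int:
    # Assumed fallback: reuse the repetition range, or 0 if the range is <= 0
    fallback_decay = repetition_range if repetition_range > 0 else 0
    # Only repetition_decay is read; rep_pen_slope is ignored on purpose
    return coalesce(kwargs.get("repetition_decay"), fallback_decay)
```

Using `coalesce` rather than `or` keeps an explicit `repetition_decay` of 0 from being replaced by the fallback.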
Alias repetition_penalty_range to repetition_range, since the latter is
used as an internal variable. In the future, there should perhaps be a
function that iterates through request aliases and assigns a default value.
Signed-off-by: kingbri <bdashore3@proton.me>
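The helper suggested above could look like the following sketch. `get_aliased` is hypothetical: it walks alias names in priority order and returns the first value that is not None, else a default. The priority order and the default of -1 used below are assumptions.

```python
from typing import Any, Iterable


def get_aliased(kwargs: dict[str, Any], names: Iterable[str], default: Any = None) -> Any:
    """Return the first non-None value among the alias names, else the default.

    Hypothetical helper: earlier names in `names` take priority.
    """
    for name in names:
        value = kwargs.get(name)
        if value is not None:
            return value
    return default
```

For this commit, a call might be `get_aliased(kwargs, ("repetition_range", "repetition_penalty_range"), -1)`, so either spelling from the request ends up in the internal variable.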
Documented in previous commits. Also, for version checking, check the
value of each kwarg instead of whether its key is present, since requests
pass default values.
Signed-off-by: kingbri <bdashore3@proton.me>
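A sketch of why checking the value matters: request models fill every field with a default, so `name in kwargs` is always true. The field names and the `backend_supports` flag are hypothetical, and this assumes the defaults for version-gated options are falsy (None, 0, or False).

```python
from typing import Any


def unsupported_options(kwargs: dict[str, Any], gated_fields: tuple[str, ...],
                        backend_supports: bool) -> list[str]:
    """List version-gated options the caller actually set on an older backend."""
    if backend_supports:
        return []
    # `name in kwargs` would flag every field, because defaults are always present;
    # a truthy value means the caller really opted in
    return [name for name in gated_fields if kwargs.get(name)]
```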
Models can be loaded with a child object called "draft" in the POST
request. Again, draft models need to be located within the draft model
dir to be loaded.
Signed-off-by: kingbri <bdashore3@proton.me>
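A sketch of resolving the draft model name from the request's `draft` object against the draft model directory. The function name is hypothetical; it rejects names that resolve to the directory itself or outside it, and names that don't exist as folders, which is one way to enforce the rule above.

```python
from pathlib import Path


def resolve_draft_path(draft_model_dir: str, draft_model_name: str) -> Path:
    """Return the folder for a draft model, requiring it to live in draft_model_dir."""
    base = Path(draft_model_dir).resolve()
    candidate = (base / draft_model_name).resolve()
    # "." resolves to the dir itself and "../other" escapes it; reject both
    if base not in candidate.parents:
        raise ValueError(f"{draft_model_name!r} is not inside {base}")
    if not candidate.is_dir():
        raise FileNotFoundError(f"draft model {draft_model_name!r} not found in {base}")
    return candidate
```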
Stop conditions were None, causing the model to error out when trying to
add the EOS token to them.
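A sketch of the fix, assuming stop conditions are a list of strings and token IDs: treat a missing list as empty before appending the EOS token. `build_stop_conditions` is a hypothetical name.

```python
from typing import Optional, Union

StopCondition = Union[str, int]


def build_stop_conditions(stop: Optional[list[StopCondition]],
                          eos_token_id: int) -> list[StopCondition]:
    """Normalize the request's stop list and always include the EOS token."""
    # None (no stop strings sent) becomes []; copying avoids mutating the request
    conditions = list(stop) if stop is not None else []
    if eos_token_id not in conditions:
        conditions.append(eos_token_id)
    return conditions
```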
Authentication failed when the Bearer token was an empty string. To fix
this, add a condition that checks the array length.
Signed-off-by: kingbri <bdashore3@proton.me>
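A sketch of the length check on a parsed `Authorization` header. The helper name is hypothetical; splitting `"Bearer "` yields a single element, so indexing the token without checking the length would raise an IndexError.

```python
from typing import Optional


def extract_bearer_token(authorization: Optional[str]) -> Optional[str]:
    """Return the token from 'Bearer <token>', or None if it is missing or empty."""
    if not authorization:
        return None
    parts = authorization.split()
    # Guard against "Bearer" with nothing after it before touching parts[1]
    if len(parts) != 2 or parts[0].lower() != "bearer":
        return None
    return parts[1]
```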
Responses were not being sent as proper JSON. Only run pydantic's
JSON function on streaming responses; FastAPI handles serialization of
static responses.
Signed-off-by: kingbri <bdashore3@proton.me>
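A sketch of the split described above, with a dataclass and `json.dumps` standing in for a pydantic model and its JSON method (an assumption to keep it self-contained). Only the streaming path encodes to JSON by hand, because each server-sent event frame must be a string; for a static response the endpoint returns the object and FastAPI encodes it.

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, Iterator


@dataclass
class CompletionChunk:
    """Stand-in for a pydantic response model (illustrative fields)."""
    id: str
    text: str


def to_sse(chunks: Iterable[CompletionChunk]) -> Iterator[str]:
    """Streaming path: encode each chunk to JSON by hand for its SSE frame."""
    for chunk in chunks:
        yield f"data: {json.dumps(asdict(chunk))}\n\n"
```

Encoding a static response by hand as well is one plausible cause of the original bug: the framework would then encode the already-encoded string again, so the client receives a quoted JSON string rather than an object.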
Chat completions is the endpoint OAI will use going forward. It makes
sense to support it, even though the completions endpoint will be used
more often.
Also unify the common parameters of the chat completion and completion
requests, since they're very similar.
Signed-off-by: kingbri <bdashore3@proton.me>
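A sketch of unifying the shared parameters through inheritance, with dataclasses standing in for the project's request models. The field list and defaults are illustrative, not the actual schema.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class CommonCompletionRequest:
    """Fields shared by both endpoints (illustrative subset and defaults)."""
    model: Optional[str] = None
    max_tokens: int = 150
    temperature: float = 1.0
    stream: bool = False
    stop: Optional[list[str]] = None


@dataclass
class CompletionRequest(CommonCompletionRequest):
    """Completions endpoint: a raw prompt string."""
    prompt: str = ""


@dataclass
class ChatCompletionRequest(CommonCompletionRequest):
    """Chat completions endpoint: OAI-style [{"role": ..., "content": ...}] list."""
    messages: list[dict[str, str]] = field(default_factory=list)
```

Sampling code can then accept a `CommonCompletionRequest` and serve both endpoints without duplicating the parameter handling.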
Models can be loaded and unloaded via the API. Also add authentication
for using the API and for administrator tasks.
The two types of authorization use different keys.
Also fix the unload function to properly free all used VRAM.
Signed-off-by: kingbri <bdashore3@proton.me>
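A sketch of the two-key scheme. `AuthKeys` is hypothetical, and letting the admin key also pass ordinary API checks is an assumption; the source only says the two kinds of authorization use different keys. Keys default to random hex strings when none are supplied.

```python
import hmac
import secrets
from dataclasses import dataclass, field


@dataclass
class AuthKeys:
    """Separate keys for ordinary API use and for admin tasks."""
    api_key: str = field(default_factory=lambda: secrets.token_hex(16))
    admin_key: str = field(default_factory=lambda: secrets.token_hex(16))

    def check_api(self, token: str) -> bool:
        # Assumption: the admin key may also call ordinary API routes
        return self._match(token, self.api_key) or self._match(token, self.admin_key)

    def check_admin(self, token: str) -> bool:
        # Admin routes such as model load/unload accept only the admin key
        return self._match(token, self.admin_key)

    @staticmethod
    def _match(token: str, expected: str) -> bool:
        # Constant-time comparison avoids leaking key contents through timing
        return hmac.compare_digest(token.encode(), expected.encode())
```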