Commit Graph

658 Commits

Author SHA1 Message Date
kingbri
21712578cf API: Add allowed_tokens support
This is the opposite of banned tokens: an Exllama-specific
implementation of #181.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-29 21:44:42 -04:00
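An allowed-tokens filter like the one this commit describes could be sketched as follows. This is a minimal illustration, not the actual implementation; the function name and the plain-list logits are hypothetical stand-ins for the sampler's tensors.

```python
import math

def apply_allowed_tokens(logits: list[float], allowed: set[int]) -> list[float]:
    # Inverse of a banned-token filter: instead of pushing a few IDs
    # to -inf, push every ID *except* the allowed set to -inf.
    if not allowed:  # no restriction configured; pass logits through
        return logits
    return [
        score if token_id in allowed else -math.inf
        for token_id, score in enumerate(logits)
    ]
```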
kingbri
10d9419f90 Model: Add BOS token to prompt logs
If add_bos_token is enabled, the BOS token is prepended to the logged
prompt when prompt logging is enabled.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-29 21:15:09 -04:00
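The idea reduces to mirroring the tokenizer's BOS behavior in the log output. A minimal sketch, with a hypothetical function name and a placeholder `<s>` BOS string:

```python
def format_prompt_log(prompt: str, add_bos_token: bool, bos_token: str = "<s>") -> str:
    # Mirror tokenization in the log: if a BOS token will be added to
    # the sequence, show it in the logged prompt as well so the log
    # matches what the model actually sees.
    return f"{bos_token}{prompt}" if add_bos_token else prompt
```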
kingbri
96fce34253 Dependencies: Update ExllamaV2
v0.2.0

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-28 18:34:00 -04:00
kingbri
a00d972054 Server: Remove unused comments
Leftovers from the new API server log system.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-27 21:45:51 -04:00
kingbri
4958c06813 Model: Remove and format comments
The comment in __init__ was outdated and all the kwargs are the
config options anyways.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-27 21:43:40 -04:00
TerminalMan
80198ca056 API: Add /v1/health endpoint (#178)
* Add healthcheck

- localhost only /healthcheck endpoint
- cURL healthcheck in docker compose file

* Update Healthcheck Response

- change endpoint to /health
- remove localhost restriction
- add docstring

* move healthcheck definition to top of the file

- make the healthcheck show up first in the openAPI spec

* Tree: Format
2024-08-27 21:37:41 -04:00
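A health endpoint of this kind is just a static liveness payload. The handler below is an illustrative sketch (the real response shape and route registration live in the PR, not here); the payload keys are assumptions.

```python
import asyncio

async def health_check() -> dict:
    # Static liveness payload for GET /health; an orchestrator (such
    # as the docker-compose cURL healthcheck mentioned above) polls
    # this. Defining it first puts it at the top of the OpenAPI spec.
    return {"status": "healthy"}
```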
Amgad Hasan
872eeed581 Build and push docker image (#171)
* Create docker-image.yml

* Update docker-image.yml
2024-08-26 16:18:10 -04:00
Ben Gitter
045bc98333 Remove rogue print statements within chat_completion.py (#174)
* rogue prompt print

* remove print pt2

* Print Removal Final
2024-08-23 21:28:37 -04:00
turboderp
fe3253f3a9 Model: Account for tokenizer lazy init 2024-08-23 23:51:53 +02:00
turboderp
a676c4bf38 Model: Formatting 2024-08-23 11:15:30 +02:00
turboderp
a3733caeda Model: Fix draft model cache initialization 2024-08-23 11:08:49 +02:00
kingbri
364032e39e Config: Remove development flag from tensor parallel
Exists in stable ExllamaV2 version.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
565b0300d6 Dependencies: Update Exllamav2
v0.1.9

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
078fbf1080 Model: Add quantized cache support for tensor parallel
Newer ExllamaV2 v0.1.9-dev builds have a quantized cache implemented.
Add those APIs.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
871c89063d Model: Add Tensor Parallel support
Use the tensor parallel loader when the flag is enabled. The new loader
has its own autosplit implementation, so gpu_split_auto isn't valid
here.

Also make it easier to determine which cache type to use rather than
multiple if/else statements.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
kingbri
5002617eac Model: Split cache creation into a common function
Unifies the switch statement across both draft and model caches.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-22 14:15:19 -04:00
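The unified switch this commit describes could look like the sketch below. The mode strings and the fallback behavior are assumptions; the class names mirror ExllamaV2's quantized cache types, returned here as strings so the sketch stays self-contained.

```python
def select_cache_class(cache_mode: str) -> str:
    # One mapping shared by the main model and the draft model,
    # replacing duplicated if/else chains. Unknown modes fall back
    # to the full-precision FP16 cache.
    cache_map = {
        "Q4": "ExLlamaV2Cache_Q4",
        "Q6": "ExLlamaV2Cache_Q6",
        "Q8": "ExLlamaV2Cache_Q8",
    }
    return cache_map.get(cache_mode.upper(), "ExLlamaV2Cache")
```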
kingbri
ecaddec48a Docker-compose: Add models to bind mounts
At least one bind mount is required in the volumes YAML block,
otherwise the docker build fails. Models is a safe default since the
directory always exists.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-19 22:07:53 -04:00
Amgad Hasan
dae394050e Improve docker deployment configuration (#163) 2024-08-18 15:19:18 -04:00
kingbri
a51acb9db4 Templates: Switch to async jinja engine
This prevents any possible blocking of the event loop due to template
rendering.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-17 12:03:41 -04:00
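Jinja2's async mode is the mechanism in play here. A minimal sketch (assuming jinja2 is available; the template string is illustrative): with `enable_async=True`, templates are rendered via an awaitable path so a long render yields to the event loop instead of blocking it.

```python
import asyncio
from jinja2 import Environment

# enable_async compiles templates with awaitable render paths
env = Environment(enable_async=True)
template = env.from_string("Hello, {{ name }}!")

async def render_prompt() -> str:
    # render_async is available because the environment is async
    return await template.render_async(name="world")
```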
kingbri
b4752c1e62 Templates: Revert to load metadata on runtime
Metadata is generated via a template's module. This requires a single
iteration through the template. If a template tries to access a passed
variable that doesn't exist, it will error.

Therefore, generate the metadata at runtime to prevent these errors
from happening. To optimize further, cache the metadata after the
first generation to prevent the expensive call of making a template
module.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-17 11:44:42 -04:00
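The lazy-plus-cached pattern described above can be sketched like this. The class and field names are hypothetical, and the metadata dict is a stub for the single expensive pass over the template:

```python
class PromptTemplate:
    def __init__(self, source: str):
        self.source = source
        self._metadata: dict | None = None

    def metadata(self) -> dict:
        # Building a template module is expensive and can raise if the
        # template reads an undefined variable, so extract metadata
        # lazily and cache the result after the first call.
        if self._metadata is None:
            self._metadata = self._extract_metadata()
        return self._metadata

    def _extract_metadata(self) -> dict:
        # Stand-in for the single iteration through the template
        return {"stop_strings": [], "tool_start": None}
```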
kingbri
617ac12150 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-17 00:35:42 -04:00
Ben Gitter
70b9fc95de [WIP] OpenAI Tools Support/Function calling (#154)
* returning stop str if exists from gen

* added chat template for firefunctionv2

* pulling tool vars from template

* adding parsing for tool inputs/outputs

* passing tool data from endpoint to chat template, adding tool_start to the stop list

* loosened typing on the response tool call, leaning more on the user supplying a quality schema if they want a particular format

* non streaming generation prototype

* cleaning template

* Continued work with type, ingestion into template, and chat template for fire func

* Correction: a streaming tool call comes back as a delta object, not inside ChatCompletionResponseChoice (per chat_completion_chunk.py in the OpenAI library).

* Ruff formatting

* Moved stop string and tool updates out of prompt creation func

Updated tool pydantic to match OAI

Support for streaming

Updated generate tool calls to use flag within chat_template and insert tool reminder

* Llama 3.1 chat templates

Updated fire func template

* renamed llama3.1 to chatml_with_headers

* update name of template

* Support for calling a tool start token rather than the string.

Simplified tool_params

Warn when gen_settings are overridden because the user set temp to 0

Corrected schema and tools to the proper types for function args (str, for some reason)

* draft groq tool use model template

* changed headers to vars for readability (but mostly because some models are weird about newlines after headers, so this is an easier way to change them globally)

* Clean up comments and code in chat comp

* Post-processed tool call to meet the OAI spec rather than forcing the model to write JSON in a string in the middle of the call.

* changed example back to args as JSON rather than a string of JSON

* Standardize chat templates to each other

* cleaning/rewording

* stop elements can also be ints (tokens)

* Cleaning/formatting

* added special tokens for tools and tool_response as specified in description

* Cleaning

* removing aux templates - going to live in llm-promp-templates repo instead

* Tree: Format

Signed-off-by: kingbri <bdashore3@proton.me>

* Chat Completions: Don't include internal tool variables in OpenAPI

Use SkipJsonSchema to suppress inclusion with the OpenAPI JSON. The
location of these variables may need to be changed in the future.

Signed-off-by: kingbri <bdashore3@proton.me>

* Templates: Deserialize metadata on template load

Since we're only looking for specific template variables that are
static in the template, it makes more sense to render when the template
is initialized.

Signed-off-by: kingbri <bdashore3@proton.me>

* Tools: Fix comments

Adhere to the format style of comments in the rest of the project.

Signed-off-by: kingbri <bdashore3@proton.me>

---------

Co-authored-by: Ben Gitter <gitterbd@gmail.com>
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-17 00:16:25 -04:00
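The SkipJsonSchema trick mentioned in the PR can be shown with a small Pydantic v2 model. The field names here are hypothetical, not the project's actual request schema; the point is that an annotated field stays validated but disappears from the generated JSON schema (and hence the OpenAPI page).

```python
from typing import Optional

from pydantic import BaseModel
from pydantic.json_schema import SkipJsonSchema

class ChatCompletionRequest(BaseModel):
    model: str = "default"
    # Internal tool state: still validated on input, but excluded
    # from the generated JSON schema / OpenAPI documentation
    tool_precursor: SkipJsonSchema[Optional[str]] = None
```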
kingbri
9cc0e70098 Actions: Build kobold docs subpage
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-08 16:40:50 -04:00
kingbri
685e3836e9 Args: Add api-servers to parser
Also run OpenAPI export after args/config are parsed.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-08 16:32:29 -04:00
kingbri
63650d2c3c Model: Disable banned strings if grammar is used
ExllamaV2 filters don't support rewinding, which is what banned
strings rely on. Therefore, banned strings aren't compatible with
constrained generation via LMFE or outlines for now.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-05 11:08:58 -04:00
kingbri
34281c2e14 Start: Add --force-reinstall argument
Forces a reinstall of dependencies in the event that one is corrupted
or broken.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-04 11:14:38 -04:00
kingbri
ab6c3a53b9 Start: Remove eager upgrade strategy
This will upgrade second-level pinned dependencies to their latest
versions, which is not ideal.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-04 10:50:57 -04:00
kingbri
8ff2586d45 Start: Fix pip update, method calls, and logging
platform.system() was not called in some places, breaking the
ternary on Windows.

Pip's --upgrade flag does not actually update dependencies to their
latest versions. That's what the --upgrade-strategy eager flag is for.

Tell the user where their start preferences are coming from.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-04 10:30:26 -04:00
kingbri
6a0cfd731b Main: Only import psutil when the experimental function is run
Experimental options shouldn't be imported at the top level until the
testing period is over.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 22:00:15 -04:00
kingbri
b6d2676f1c Start: Give the user a hint when a module can't be imported
If an ImportError or ModuleNotFoundError is raised, tell the user
to run the update scripts.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 21:59:06 -04:00
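The hint described above is a thin wrapper around the import. A sketch (the wrapper name and message wording are assumptions); note that `ModuleNotFoundError` subclasses `ImportError`, so one except clause covers both cases the commit mentions:

```python
import importlib
import sys

def import_with_hint(module_name: str):
    # Turn a bare dependency traceback into an actionable message
    try:
        return importlib.import_module(module_name)
    except ImportError as exc:
        print(
            f"Could not import '{module_name}' ({exc}). "
            "Try running the update scripts to repair dependencies.",
            file=sys.stderr,
        )
        raise
```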
kingbri
1aa934664c Issues: Update issue templates
Use forms instead of markdown templates.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 21:59:02 -04:00
kingbri
87b6a31fad Update .gitignore
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 20:59:28 -04:00
kingbri
4868fc6b10 Update README
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 20:58:26 -04:00
kingbri
5fb9cdc2b1 Dependencies: Add Python 3.12 specific dependencies
Install a prebuilt fastparquet wheel for Windows and add setuptools
since torch may require it for some reason.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 17:43:14 -04:00
kingbri
2a33ebbf29 Model: Bypass lock checks when shutting down
Previously, when a SIGINT was emitted while a model load was running,
the API didn't shut down until the load finished due to waiting for
the lock. However, when shutting down, the lock doesn't matter since
the process is being killed anyway.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 16:05:34 -04:00
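The shutdown path can bypass the held lock roughly as sketched below. Class, method, and flag names are hypothetical stand-ins for the project's model container; the demo simulates a load holding the lock while shutdown proceeds anyway.

```python
import asyncio

class ModelContainer:
    def __init__(self):
        self.load_lock = asyncio.Lock()

    async def unload(self, skip_wait: bool = False) -> str:
        # During shutdown the process dies anyway, so don't wait
        # for an in-flight load to release the lock.
        if skip_wait:
            return self._free()
        async with self.load_lock:
            return self._free()

    def _free(self) -> str:
        return "unloaded"

async def shutdown_demo() -> str:
    container = ModelContainer()
    await container.load_lock.acquire()  # simulate a load in progress
    return await container.unload(skip_wait=True)
```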
Brian Dashore
65c16f2a7c Merge pull request #161 from theroyallab/new-start-scripts
Fix pip index bandwidth costs and update start scripts
2024-08-03 15:21:02 -04:00
kingbri
8703b23f89 Start: Make linux scripts executable
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 15:19:31 -04:00
kingbri
b795bfc7b2 Start: Split some prints up
Newlines can be helpful at times.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 15:14:40 -04:00
kingbri
65e758e134 Tree: Format
Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 15:08:24 -04:00
kingbri
7ce46cc2da Start: Rewrite start scripts
Start scripts no longer update dependencies by default due to pip
mishandling its caches. Also add dedicated update scripts and save
options to a JSON file instead of a text one.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 13:03:24 -04:00
kingbri
e66d213aef Revert "Dependencies: Use hosted pip index instead of Github"
This reverts commit f111052e39.

This was a bad idea since the netlify server has limited bandwidth.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-03 11:35:26 -04:00
kingbri
7bf2b07d4c Signals: Exit on async cleanup
The async signal exit function should be the internal mechanism for
exiting the program. In addition, prevent the handler from being
called twice by adding a boolean; this may become an asyncio event
later on.

Also make sure to skip_wait when running model.unload.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-02 15:11:57 -04:00
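The double-invocation guard is just a module-level boolean, as sketched below (names are illustrative; the real handler would kick off the async cleanup where the comment sits):

```python
_exiting = False

def handle_exit_signal() -> bool:
    # Boolean guard so a second SIGINT during cleanup is ignored;
    # the commit notes this may become an asyncio event later.
    global _exiting
    if _exiting:
        return False
    _exiting = True
    # ...schedule the async exit/cleanup task here...
    return True
```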
kingbri
b124797949 Dependencies: Re-add sentence-transformers
This is actually required for infinity to load a model.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-02 14:35:58 -04:00
kingbri
56619810bf Dependencies: Switch sentence-transformers to infinity-emb
Leftover before the transition.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-02 13:34:47 -04:00
kingbri
3e42211c3e Config: Embeddings: Make embeddings_device a default when API loading
When loading from the API, the fallback for embeddings_device will be
the same as the config.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 13:59:49 -04:00
kingbri
54aeebaec1 API: Fix return of current embeddings model
Return a ModelCard instead of a ModelList.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 13:43:31 -04:00
kingbri
0bcb4e4a7d Model: Attach request ID to logs
If multiple logs come in at once, track which log corresponds to
which request.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:25:54 -04:00
kingbri
9390d362dd Model: Log generation params and metrics after the prompt/response
A user's prompt and response can be large in the console. Therefore,
always log the smaller payloads (ex. gen params + metrics) after
the large chunks.

However, it's recommended to keep prompt logging off anyway, since
it'll result in console spam.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-08-01 00:19:21 -04:00
Brian Dashore
1bf062559d Merge pull request #158 from AlpinDale/embeddings
feat: add embeddings support via Infinity-emb
2024-07-31 20:33:12 -04:00
kingbri
f111052e39 Dependencies: Use hosted pip index instead of Github
Installing directly from github causes pip's HTTP cache to not
recognize that the correct version of a package is already installed.
This causes a redownload.

When using the Start.bat script, it updates dependencies automatically
to keep users on the latest versions of a package for security reasons.

A simple pip cache website helps alleviate this problem and allows pip
to find the cached wheels when invoked with an upgrade argument.

Signed-off-by: kingbri <bdashore3@proton.me>
2024-07-30 20:46:37 -04:00