response_prefix adds a prefix before the next message is generated.
This is useful in many cases, such as continuing a prompt
(see #96).
Also, if a template specifies a BOS token, add_bos_token results in
two BOS tokens. Add a check that strips a leading BOS token
from the prompt if one exists.
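A minimal sketch of the two prompt adjustments described above, assuming
the rendered prompt is a plain string at this point. The helper name
finalize_prompt and its parameters are hypothetical and not taken from
the codebase.

```python
def finalize_prompt(prompt: str, bos_token: str = "", response_prefix: str = "") -> str:
    """Hypothetical helper: strip one leading BOS token, then append the prefix."""
    # If the template already emitted a BOS token, drop it so that
    # add_bos_token does not leave the prompt with two of them.
    if bos_token and prompt.startswith(bos_token):
        prompt = prompt[len(bos_token):]

    # The prefix goes at the end so generation continues from it.
    if response_prefix:
        prompt += response_prefix

    return prompt
```

For example, finalize_prompt("<s>Hi", bos_token="<s>") returns "Hi", and
the tokenizer is then free to add the single BOS token itself.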
Signed-off-by: kingbri <bdashore3@proton.me>
Having many utility functions for initialization doesn't make much sense.
Instead, handle everything related to template creation inside the
class, which reduces the number of function imports.
Signed-off-by: kingbri <bdashore3@proton.me>
A chat completion can now declare extra template_vars to pass when
a template is rendered, opening up the possibility of using state
outside of huggingface's parameters.
Signed-off-by: kingbri <bdashore3@proton.me>
Template modules grab all set vars, including ones that use runtime
vars. If a template var is set to a runtime var and a module is created,
an UndefinedError fires.
Use make_module instead to pass runtime vars when creating a template
module.
Resolves #92
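A small reproduction of the failure mode, as a sketch only. The template
text and the variable name "name" are made up for illustration, and
StrictUndefined is an assumption chosen so the error is easy to trigger;
the project's actual environment settings may differ.

```python
from jinja2 import Environment, StrictUndefined
from jinja2.exceptions import UndefinedError

# Assumption: a strict environment, so using an undefined var raises.
env = Environment(undefined=StrictUndefined)

# "greeting" is a template var that depends on the runtime var "name".
template = env.from_string("{% set greeting = 'Hello ' ~ name %}{{ greeting }}")


def default_module_fails() -> bool:
    """Return True if building the module without runtime vars raises."""
    try:
        template.module  # evaluated with an empty context
    except UndefinedError:
        return True
    return False


# make_module accepts the runtime vars, so the set statement can resolve.
module = template.make_module({"name": "tabby"})
```

Here module.greeting holds the resolved value, while the plain module
attribute fails because "name" is never supplied.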
Signed-off-by: kingbri <bdashore3@proton.me>
Best to move the inner workings within its inner function. Also fix
an edge case where stop strings can be a string rather than an array.
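The string-versus-array edge case could be handled roughly like this.
The function name normalize_stop is hypothetical and only illustrates
the idea.

```python
from typing import List, Optional, Sequence, Union


def normalize_stop(stop: Optional[Union[str, Sequence[str]]]) -> List[str]:
    """Hypothetical helper: always return stop strings as a list."""
    if stop is None:
        return []
    # A bare string must be wrapped; iterating it would yield characters.
    if isinstance(stop, str):
        return [stop]
    return list(stop)
```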
Signed-off-by: kingbri <bdashore3@proton.me>
When the model is processing a prompt, add the ability to abort
on request cancellation. This is also a catch for a SIGINT.
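One way to wire this up, sketched with the standard library only. The
names handle_request and abort_event are hypothetical, and the long sleep
stands in for prompt processing; the real code path is not shown here.

```python
import asyncio
import threading


async def handle_request(abort_event: threading.Event) -> str:
    """Hypothetical request handler that signals an abort when cancelled."""
    try:
        # Stand-in for awaiting prompt processing and generation.
        await asyncio.sleep(3600)
        return "done"
    except asyncio.CancelledError:
        # The task was cancelled (for example, the client disconnected),
        # so tell the worker to stop, then let the cancellation propagate.
        abort_event.set()
        raise


async def demo() -> bool:
    abort_event = threading.Event()
    task = asyncio.create_task(handle_request(abort_event))
    await asyncio.sleep(0)  # let the task start running
    task.cancel()
    try:
        await task
    except asyncio.CancelledError:
        pass
    return abort_event.is_set()
```

The processing loop would then check abort_event.is_set() between chunks
and return early when it is set.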
Signed-off-by: kingbri <bdashore3@proton.me>
OAI expects finish_reason to be "stop" or "length" (there are others,
but they're not in the current scope of this project).
Make all completions and chat completions responses return this
from the model generation itself rather than putting a placeholder.
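A sketch of how the generation step might decide between the two values.
The function name and its parameters are assumptions for illustration
and do not mirror the actual generation code.

```python
def get_finish_reason(generated_tokens: int, max_tokens: int, hit_stop: bool) -> str:
    """Hypothetical helper: map generation state to an OAI finish_reason."""
    # "length" means the token budget ran out before a stop condition
    # (EOS or a stop string) was reached; everything else is "stop".
    if not hit_stop and generated_tokens >= max_tokens:
        return "length"
    return "stop"
```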
Signed-off-by: kingbri <bdashore3@proton.me>
Async generation helps remove many roadblocks to managing tasks
using threads. It should allow for abortables and modern-day paradigms.
NOTE: Exllamav2 itself is not an asynchronous library. It's just
been added into tabby's async nature to allow for a fast and concurrent
API server. It's still being debated whether to run stream_ex in a
separate thread or to manage it manually using asyncio.sleep(0).
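The asyncio.sleep(0) option could look roughly like the sketch below.
The chunk lists stand in for a synchronous generator, and the names
stream_cooperatively and collect are hypothetical. Note that each
synchronous step still blocks the event loop while it runs; sleep(0)
only lets other tasks run between steps.

```python
import asyncio
from typing import AsyncIterator, Iterable, List


async def stream_cooperatively(chunks: Iterable[str]) -> AsyncIterator[str]:
    """Wrap a synchronous iterable and yield control after each chunk."""
    for chunk in chunks:
        yield chunk
        # Give other tasks (other requests, cancellation) a turn.
        await asyncio.sleep(0)


async def collect(chunks: Iterable[str]) -> List[str]:
    return [chunk async for chunk in stream_cooperatively(chunks)]


async def demo() -> List[List[str]]:
    # Two "requests" share the event loop without threads.
    return await asyncio.gather(collect(["a", "b"]), collect(["c"]))
```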
Signed-off-by: kingbri <bdashore3@proton.me>
Previously, generation functions were bundled with the request function,
causing the overall code structure and API to look ugly and unreadable.
Split these up and clean up many of the methods that were previously
overlooked in the API itself.
Signed-off-by: kingbri <bdashore3@proton.me>