mirror of https://github.com/turboderp-org/exllamav2.git
synced 2026-03-15 00:07:26 +00:00

Update README.md

This commit is contained in:
74  README.md
@@ -3,17 +3,57 @@
ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.

## New in v0.1.0:

- ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
- New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API
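The K/V cache deduplication mentioned above means that jobs whose prompts share a common prefix can point at the same cache pages instead of storing them twice. A toy, library-independent sketch of the idea (the page size and the prefix-keyed page table are illustrative assumptions, not ExLlamaV2's actual implementation):

```python
PAGE_SIZE = 4  # tokens per cache page (illustrative; not the real page size)

page_table = {}  # maps a page key -> shared page id

def assign_pages(tokens):
    """Assign cache pages to a token sequence, reusing pages for shared prefixes."""
    pages = []
    prefix = ()
    for i in range(0, len(tokens), PAGE_SIZE):
        # Key each page by the full prefix up to and including it, so that
        # only genuinely identical prefixes end up sharing a page
        prefix = prefix + tuple(tokens[i:i + PAGE_SIZE])
        if prefix not in page_table:
            page_table[prefix] = len(page_table)
        pages.append(page_table[prefix])
    return pages

a = assign_pages([1, 2, 3, 4, 5, 6, 7, 8])
b = assign_pages([1, 2, 3, 4, 9, 9, 9, 9])  # shares its first page with `a`
print(a, b)  # first page id is shared, second differs
```

Keying by the whole prefix (rather than by the page's own tokens) matters because attention state for a page depends on everything before it; two pages with the same tokens but different histories cannot be shared.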
![dynamic_gen](doc/dynamic_gen.gif)
## Dynamic generator examples
The dynamic generator supports all inference, sampling and speculative decoding features of the previous two
generators, consolidated into one API (with the exception of the FP8 cache, though the Q4 cache mode is supported and
performs better anyway; see [here](doc/qcache_eval.md)).
- Single generation:

  ```python
  output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
  ```
- Batched generation:

  ```python
  outputs = generator.generate(
      prompt = [
          "Hello, my name is",
          "Once upon a time,",
          "Large language models are",
      ],
      max_new_tokens = 200
  )
  ```
- Streamed generation with `asyncio`:

  ```python
  job = ExLlamaV2DynamicJobAsync(
      generator,
      input_ids = tokenizer.encode("You can lead a horse to water"),
      banned_strings = ["make it drink"],
      gen_settings = ExLlamaV2Sampler.Settings.greedy(),
      max_new_tokens = 200
  )

  async for result in job:
      text = result.get("text", "")
      print(text, end = "")
  ```
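Under the hood, a streaming job of this kind can be modeled as an async iterator fed by a background task through a queue. A minimal, self-contained sketch of that pattern (pure `asyncio`; an illustration of the idiom, not the library's actual internals):

```python
import asyncio

class ToyAsyncJob:
    """Stand-in for an async generation job: a background task produces
    result dicts, which the consumer reads with `async for`."""

    def __init__(self, chunks):
        self.chunks = chunks
        self.queue = asyncio.Queue()

    def __aiter__(self):
        # Start producing as soon as iteration begins
        asyncio.ensure_future(self._run())
        return self

    async def _run(self):
        for chunk in self.chunks:
            await self.queue.put({"text": chunk})
        await self.queue.put(None)  # sentinel: generation finished

    async def __anext__(self):
        result = await self.queue.get()
        if result is None:
            raise StopAsyncIteration
        return result

async def main():
    job = ToyAsyncJob(["Hello", ", ", "world"])
    out = ""
    async for result in job:
        out += result.get("text", "")
    return out

print(asyncio.run(main()))
```

Because the producer runs as its own task, the consumer can await other work between chunks, which is what lets many streamed jobs share one generator.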
See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).

- Faster, better kernels
- Cleaner and more versatile codebase
- Support for a new quant format (see below)
## Performance

Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:

| Model      | Mode         | Size  | grpsz | act | 3090Ti  | 4090        |
|------------|--------------|-------|-------|-----|---------|-------------|
@@ -33,13 +73,11 @@ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
## How to

To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
on Windows. Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/), then run:

```
git clone https://github.com/turboderp/exllamav2
cd exllamav2
# Optionally, create and activate a new conda environment
pip install -r requirements.txt
pip install .
```
@@ -50,13 +88,11 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"
|
||||
A simple console chatbot is included. Run it with:
|
||||
|
||||
```
|
||||
python examples/chat.py -m <path_to_model> -mode llama
|
||||
# Append the '--gpu_split auto' flag for multi-GPU inference
|
||||
python examples/chat.py -m <path_to_model> -mode llama -gs auto
|
||||
```
|
||||
|
||||
|
||||
The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
|
||||
probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
|
||||
The `-mode` argument chooses the prompt format to use. `raw` will produce a simple chatlog-style chat that works with base
|
||||
models and various other finetunes. Run with `-modes` for a list of all available prompt formats. You can also provide
|
||||
a custom system prompt with `-sp`.
|
||||
|
||||
@@ -100,8 +136,11 @@ C++ extension in the process. Instead, the extension will be built the first tim
### Method 2: Install from release (with prebuilt extension)

Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the extension binaries. Make sure to grab
the right version, matching your platform, Python version (`cp`) and CUDA version. Crucially, you must also match
the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of
PyTorch.

Either download an appropriate wheel or install directly from the appropriate URL:
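The interpreter and platform parts of a wheel name can be checked from the Python you intend to install into. A small sketch (the exact wheel naming scheme is an assumption; check the release page for the real file names):

```python
import sys
import platform

# Interpreter tag, e.g. "cp311" -- must match the "cp" part of the wheel name
cp_tag = f"cp{sys.version_info.major}{sys.version_info.minor}"

# Rough platform part, e.g. "linux_x86_64" or "win32_amd64" (assumed naming;
# real wheel platform tags follow PEP 425 and may differ in detail)
plat = f"{sys.platform}_{platform.machine().lower()}"

print(cp_tag, plat)
```

The CUDA and PyTorch parts of the wheel name have no such portable check; compare them against `torch.__version__` and `torch.version.cuda` in your environment.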
@@ -113,15 +152,12 @@ can also be installed this way, and it will build the extension while installing
### Method 3: Install from PyPI

A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with:

```
pip install exllamav2
```

## EXL2 quantization
BIN  doc/dynamic_gen.gif  (new file; binary not shown; size: 16 MiB)
@@ -77,10 +77,10 @@ prompts = [
    "Please guess the next 20 numbers in this sequence: " + ", ".join(str(n) for n in range(700)),
    "Write a short essay about cell membranes.",
    "What's up?",
    # "How do I open a can of beans?",
    # "How do I open a can of soup?",
    # "How do I open a can of strawberry jam?",
    # "How do I open a can of raspberry jam?",
    "What's the tallest building in Paris?",
    "What's the most populous nation on Earth?",
    "What's the most populous nation on Mars?",
@@ -90,25 +90,25 @@ prompts = [
    "Who is Waldo?",
    "Why is Waldo?",
    "Is it legal to base jump off the Eiffel Tower?",
    # "Is it legal to base jump into a volcano?",
    # "Why are cats better than dogs?",
    "Why is the Hulk so angry all the time?",
    "How do I build a time machine?",
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
    "Is it legal to grow your own catnip?",
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 160 else 420) for n in range(400)),
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 161 else 421) for n in range(400)),
    # "What's inside a black hole?",
    # "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
    # "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
    # "Is there life on Mars?",
    # "Hello!",
    # "Hi!",
    # "Boop!",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
]

term = Terminal()
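The sequence prompts in this list are built programmatically. As a quick sanity check of what one of those join expressions actually expands to (pure Python, no library needed):

```python
# Same expression as in the prompt list: count 0..199, but replace 123 with 69
s = "What seems out of place in this sequence: " + \
    ", ".join(str(n if n != 123 else 69) for n in range(200))

# The out-of-place value sits between its untouched neighbours
print(s[:60], "...", s.index("122, 69, 124"))
```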