Update README.md

turboderp
2024-05-25 22:50:36 +02:00
parent f6e8495e58
commit e6f230bf06
3 changed files with 72 additions and 36 deletions


@@ -3,17 +3,57 @@
ExLlamaV2 is an inference library for running local LLMs on modern consumer GPUs.
## New in v0.1.0:
- ExLlamaV2 now supports paged attention via [Flash Attention](https://github.com/Dao-AILab/flash-attention) 2.5.7+
- New generator with dynamic batching, smart prompt caching, K/V cache deduplication and simplified API
![alt_text](doc/dynamic_gen.gif)
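Smart prompt caching and K/V cache deduplication build on the paged cache: because the cache is split into fixed-size pages, sequences that share a prefix can point at the same physical pages. The following is a toy sketch of that idea only (not exllamav2's implementation; the class and names here are made up for illustration):

```python
# Toy illustration of page-level K/V cache deduplication: the cache is split
# into fixed-size pages, and a page whose entire preceding prefix matches one
# already stored can be shared between sequences instead of recomputed.

PAGE_SIZE = 4  # real implementations use much larger pages, e.g. 256 tokens

class PagedCacheSketch:
    def __init__(self):
        self.pages = {}      # prefix -> page id
        self.next_page = 0
        self.allocated = 0   # number of pages actually stored

    def assign_pages(self, tokens):
        """Map a token sequence to page ids, reusing any page whose full
        prefix is already cached."""
        page_ids = []
        full_pages = len(tokens) - len(tokens) % PAGE_SIZE
        for i in range(0, full_pages, PAGE_SIZE):
            # Key on the entire prefix, not just this page's tokens: a page is
            # only reusable if everything before it matched as well.
            key = tuple(tokens[:i + PAGE_SIZE])
            if key not in self.pages:
                self.pages[key] = self.next_page
                self.next_page += 1
                self.allocated += 1
            page_ids.append(self.pages[key])
        return page_ids

cache = PagedCacheSketch()
a = cache.assign_pages([1, 2, 3, 4, 5, 6, 7, 8])     # two full pages
b = cache.assign_pages([1, 2, 3, 4, 9, 10, 11, 12])  # shares the first page
print(a, b, cache.allocated)  # the first page id is shared; 3 pages stored
```

The practical effect is that many prompts with a common system prompt or shared history only pay for their divergent suffixes.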
## Dynamic generator examples
The dynamic generator supports all inference, sampling and speculative decoding features of the previous two
generators, consolidated into one API (with the exception of the FP8 cache; the Q4 cache mode is supported instead and
performs better anyway, see [here](doc/qcache_eval.md)).
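For context on why a 4-bit cache holds up well: group-wise quantization keeps a scale and minimum per small group of values, which bounds the rounding error within each group. A toy sketch of the general idea follows; this is an illustration only, not exllamav2's actual kernel or packed storage layout, and the group size is made up:

```python
# Toy sketch of group-wise 4-bit quantization: each group of values is mapped
# to 16 levels (4 bits) spanning that group's own min..max range, so outliers
# in one group do not degrade precision in any other group.

GROUP = 8  # elements per quantization group; real kernels use e.g. 32

def q4_roundtrip(values):
    """Quantize a list of floats to 4-bit codes per group, then dequantize."""
    out = []
    for g in range(0, len(values), GROUP):
        group = values[g:g + GROUP]
        lo, hi = min(group), max(group)
        scale = (hi - lo) / 15 or 1.0  # 4 bits -> 16 levels (codes 0..15)
        codes = [round((v - lo) / scale) for v in group]
        out.extend(lo + c * scale for c in codes)
    return out

vals = [0.0, 0.3, 1.0, 1.6, 2.0, 2.5, 3.1, 3.75]
deq = q4_roundtrip(vals)
err = max(abs(a - b) for a, b in zip(vals, deq))
print(err)  # worst-case error stays within half a quantization step
```

Because each group's error is bounded by half a step of that group's own range, the reconstruction stays close to the original even at 4 bits.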
- Single generation:
```python
output = generator.generate(prompt = "Hello, my name is", max_new_tokens = 200)
```
- Batched generation:
```python
outputs = generator.generate(
    prompt = [
        "Hello, my name is",
        "Once upon a time,",
        "Large language models are",
    ],
    max_new_tokens = 200
)
```
- Streamed generation with `asyncio`:
```python
job = ExLlamaV2DynamicJobAsync(
    generator,
    input_ids = tokenizer.encode("You can lead a horse to water"),
    banned_strings = ["make it drink"],
    gen_settings = ExLlamaV2Sampler.Settings.greedy(),
    max_new_tokens = 200
)
async for result in job:
    text = result.get("text", "")
    print(text, end = "")
See the full, updated examples [here](https://github.com/turboderp/exllamav2/tree/master/examples).
- Faster, better kernels
- Cleaner and more versatile codebase
- Support for a new quant format (see below)
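The scheduling win from dynamic batching can be sketched with a toy simulation (not exllamav2 code; job lengths and refill order here are arbitrary): with static batches every slot waits for the longest sequence in its batch, while dynamic batching refills a finished slot immediately from the queue.

```python
# Toy comparison of static vs. dynamic (continuous) batching, counting
# generation steps needed to finish a set of jobs with given output lengths.

def static_steps(lengths, batch_size):
    """Static batching: each batch runs until its longest job finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def dynamic_steps(lengths, batch_size):
    """Dynamic batching: a freed slot is refilled right away from the queue.
    (Refill order is a toy choice here; real schedulers differ.)"""
    queue = sorted(lengths, reverse = True)  # pop() takes shortest first
    slots = []
    steps = 0
    while queue or slots:
        while queue and len(slots) < batch_size:
            slots.append(queue.pop())
        steps += 1
        slots = [r - 1 for r in slots if r > 1]
    return steps

jobs = [5, 100, 7, 3, 90, 4, 6, 80]
print(static_steps(jobs, 4), dynamic_steps(jobs, 4))  # 190 vs. 106 steps
```

Short jobs no longer ride along for the full duration of the longest job in their batch, which is where most of the throughput gain comes from.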
## Performance
Some quick tests to compare performance with ExLlama V1. There may be more performance optimizations in the future,
and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
| Model | Mode | Size | grpsz | act | 3090Ti | 4090 |
|------------|--------------|-------|-------|-----|---------|-------------|
@@ -33,13 +73,11 @@ speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
## How to
To install from the repo you'll need the CUDA Toolkit and either gcc on Linux or (Build Tools for) Visual Studio
on Windows. Also make sure you have an appropriate version of [PyTorch](https://pytorch.org/get-started/locally/), then run:
```
git clone https://github.com/turboderp/exllamav2
cd exllamav2
# Optionally, create and activate a new conda environment
pip install -r requirements.txt
pip install .
```
@@ -50,13 +88,11 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"
A simple console chatbot is included. Run it with:
```
python examples/chat.py -m <path_to_model> -mode llama
# Append the '--gpu_split auto' flag for multi-GPU inference
python examples/chat.py -m <path_to_model> -mode llama -gs auto
```
The `-mode` argument chooses the prompt format to use. `raw` will produce a simple chatlog-style chat that works with base
models and various other finetunes. Run with `-modes` for a list of all available prompt formats. You can also provide
a custom system prompt with `-sp`.
@@ -100,8 +136,11 @@ C++ extension in the process. Instead, the extension will be built the first tim
### Method 2: Install from release (with prebuilt extension)
Releases are available [here](https://github.com/turboderp/exllamav2/releases), with prebuilt wheels that contain the extension binaries. Make sure to grab
the right version, matching your platform, Python version (`cp`) and CUDA version. Crucially, you must also match
the prebuilt wheel with your PyTorch version, since the Torch C++ extension ABI breaks with every new version of
PyTorch.
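Matching a wheel means checking several tags at once. The sketch below shows the kind of check involved; the filename used is hypothetical (see the release page for real names and their exact format), and `wheel_matches` is a made-up helper, not part of any packaging tool:

```python
# Hedged sketch: verifying that a wheel filename carries the tags matching
# the local environment. The example filename is hypothetical.

def wheel_matches(filename, py_tag, cuda_tag, torch_tag, platform_tag):
    """Return True if every required environment tag appears in the name."""
    return all(t in filename for t in (py_tag, cuda_tag, torch_tag, platform_tag))

name = "exllamav2-0.1.0+cu121.torch2.3.0-cp311-cp311-linux_x86_64.whl"
print(wheel_matches(name, "cp311", "cu121", "torch2.3.0", "linux_x86_64"))  # True
print(wheel_matches(name, "cp310", "cu121", "torch2.3.0", "linux_x86_64"))  # False
```

If any one of the four tags is off, the extension binaries will fail to load, so it is worth checking all of them rather than just the CUDA version.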
Either download an appropriate wheel or install directly from the appropriate URL:
```
@@ -113,15 +152,12 @@ can also be installed this way, and it will build the extension while installing
### Method 3: Install from PyPI
A PyPI package is available as well. This is the same as the JIT version (see above). It can be installed with:
```
pip install exllamav2
```
## EXL2 quantization

BIN doc/dynamic_gen.gif (new binary file, 16 MiB; not shown)

@@ -77,10 +77,10 @@ prompts = [
    "Please guess the next 20 numbers in this sequence: " + ", ".join(str(n) for n in range(700)),
    "Write a short essay about cell membranes.",
    "What's up?",
    # "How do I open a can of beans?",
    # "How do I open a can of soup?",
    # "How do I open a can of strawberry jam?",
    # "How do I open a can of raspberry jam?",
    "What's the tallest building in Paris?",
    "What's the most populous nation on Earth?",
    "What's the most populous nation on Mars?",
@@ -90,25 +90,25 @@ prompts = [
    "Who is Waldo?",
    "Why is Waldo?",
    "Is it legal to base jump off the Eiffel Tower?",
    # "Is it legal to base jump into a volcano?",
    # "Why are cats better than dogs?",
    "Why is the Hulk so angry all the time?",
    "How do I build a time machine?",
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
    "Is it legal to grow your own catnip?",
    "What seems out of place in this sequence: " + ", ".join(str(n if n != 161 else 421) for n in range(400)),
    # "What's inside a black hole?",
    # "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
    # "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
    # "Is there life on Mars?",
    # "Hello!",
    # "Hi!",
    # "Boop!",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
    # "Why are cats better than dogs?",
]
term = Terminal()