mirror of
https://github.com/turboderp-org/exllamav2.git
synced 2026-04-20 14:29:28 +00:00
Update README.md
This commit is contained in:
84
README.md
@@ -11,7 +11,27 @@ still needs a lot of testing and tuning, and a few key features are not yet implemented
- Support for a new quant format (see below)

## Performance

Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:

| Model      | Mode         | Size  | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090    |
|------------|--------------|-------|-------|-----|------------|----------|------------|-------------|
| Llama      | GPTQ         | 7B    | 128   | no  | 143 t/s    | 173 t/s  | 175 t/s    | **195** t/s |
| Llama      | GPTQ         | 13B   | 128   | no  | 84 t/s     | 102 t/s  | 105 t/s    | **110** t/s |
| Llama      | GPTQ         | 33B   | 128   | yes | 37 t/s     | 45 t/s   | 45 t/s     | **48** t/s  |
| OpenLlama  | GPTQ         | 3B    | 128   | yes | 194 t/s    | 226 t/s  | 295 t/s    | **321** t/s |
| CodeLlama  | EXL2 4.0 bpw | 34B   | -     | -   | -          | -        | 42 t/s     | **48** t/s  |
| Llama2     | EXL2 3.0 bpw | 7B    | -     | -   | -          | -        | 195 t/s    | **224** t/s |
| Llama2     | EXL2 4.0 bpw | 7B    | -     | -   | -          | -        | 164 t/s    | **197** t/s |
| Llama2     | EXL2 5.0 bpw | 7B    | -     | -   | -          | -        | 144 t/s    | **160** t/s |
| Llama2     | EXL2 2.5 bpw | 70B   | -     | -   | -          | -        | 30 t/s     | **35** t/s  |
| TinyLlama  | EXL2 3.0 bpw | 1.1B  | -     | -   | -          | -        | 536 t/s    | **635** t/s |
| TinyLlama  | EXL2 4.0 bpw | 1.1B  | -     | -   | -          | -        | 509 t/s    | **590** t/s |

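For a quick reading of the table, the V1-to-V2 speedups on the same GPU can be computed directly from the rows above (a small sketch, with throughput numbers copied from the table):

```python
# V2 over V1 throughput ratio, for table rows where both numbers exist.
rows = {
    "Llama 7B (3090Ti)":   (143, 175),
    "Llama 7B (4090)":     (173, 195),
    "OpenLlama 3B (4090)": (226, 321),
}
for name, (v1, v2) in rows.items():
    print(f"{name}: {v2 / v1:.2f}x")  # e.g. Llama 7B (3090Ti): 1.22x
```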
## How to

Clone the repository and install dependencies:

@@ -25,32 +45,15 @@ python test_inference.py -m <path_to_model> -p "Once upon a time,"

For now, a simple console chatbot is included. Run it with:

```
python examples/chat.py -m <path_to_model> -mode llama
```

The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama` probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base models and various other finetunes. You can also provide a custom system prompt with `-p`.

### Installation

Clone the repository and run `python setup.py install --user`. (A PyPI package is coming, be patient.)

@@ -62,29 +65,31 @@ for subsequent use.

## EXL2 quantization

ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. The format allows for mixing quantization levels within a model to achieve any average bitrate between 2 and 8 bits per weight.

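The average bitrate of such a mix is just the size-weighted mean of the per-layer bitrates. A minimal sketch, with hypothetical layer shapes and bit assignments purely to illustrate the arithmetic:

```python
# Hypothetical per-layer (weight count, bits per weight) assignments.
layers = [
    (4096 * 4096, 4.0),
    (4096 * 11008, 3.0),
    (11008 * 4096, 2.0),
]

# Average bitrate = total stored bits / total number of weights.
total_bits = sum(n * b for n, b in layers)
total_weights = sum(n for n, _ in layers)
avg_bpw = total_bits / total_weights

print(round(avg_bpw, 3))  # 2.735
```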
Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on performance.

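The remapping trick can be sketched in a few lines: reordering a weight matrix's columns (say, most important first, so the leading columns can be stored at a higher bitrate) is invisible to the matmul as long as the input features are permuted the same way. This is only an illustration of the idea, not the actual CUDA kernels:

```python
weights = [0.9, -0.05, 0.6, 0.1]   # one row of a weight matrix
importance = [3.0, 0.1, 2.0, 0.5]  # hypothetical per-column importance

# Permutation putting the most important columns first; in EXL2 the leading
# columns would then be quantized with more bits than the trailing ones.
perm = sorted(range(len(weights)), key=lambda j: -importance[j])
w_permuted = [weights[j] for j in perm]

# Permuting the input vector identically preserves the dot product exactly,
# so the kernel never has to undo the reordering at inference time.
x = [1.0, 2.0, 3.0, 4.0]
x_permuted = [x[j] for j in perm]

y_ref = sum(w * v for w, v in zip(weights, x))
y = sum(w * v for w, v in zip(w_permuted, x_permuted))
assert abs(y - y_ref) < 1e-12

print(perm)  # [0, 2, 3, 1]
```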
Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target average bitrate.

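The selection step can be sketched as a constrained search: each layer offers a few candidate settings with a measured error, and the combination with the smallest worst-case error is chosen subject to the bitrate budget. The brute-force search and all the numbers below are made up for illustration; the actual converter measures real errors and searches far more efficiently:

```python
from itertools import product

# Per layer: hypothetical list of (bits, measured error) options.
candidates = [
    [(2, 0.30), (4, 0.08), (6, 0.02)],
    [(2, 0.50), (4, 0.12), (6, 0.03)],
    [(2, 0.20), (4, 0.05), (6, 0.01)],
]
target_bpw = 4.0

best = None
for combo in product(*candidates):
    avg_bits = sum(b for b, _ in combo) / len(combo)
    if avg_bits > target_bpw:
        continue                      # over the bitrate budget
    max_err = max(e for _, e in combo)
    if best is None or max_err < best[0]:
        best = (max_err, [b for b, _ in combo])

print(best)  # (0.12, [4, 4, 4])
```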
In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a full (4k) context, producing coherent and mostly stable output with 2.55 bits per weight. 13B models run at 2.65 bits within 8 GB of VRAM, although currently none of them uses GQA which effectively limits the context size to 2048. In either case it's unlikely that the model will fit alongside a desktop environment, though. For now.

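A back-of-envelope check of those memory claims: weight storage alone is parameters times bits per weight, ignoring the cache, activations and per-group quantization overhead.

```python
def weight_gib(n_params, bits_per_weight):
    # Bits -> bytes -> GiB for the quantized weights only.
    return n_params * bits_per_weight / 8 / 2**30

print(round(weight_gib(70e9, 2.55), 1))  # ~20.8 GiB: a tight fit in 24 GB
print(round(weight_gib(13e9, 2.65), 1))  # ~4.0 GiB: room to spare in 8 GB
```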
[![Llama2 70B chat](doc/llama2_70b_chat_thumb.png)](doc/llama2_70b_chat.png)
[![CodeLlama 13B instruct](doc/codellama_13b_instruct_thumb.png)](doc/codellama_13b_instruct.png)

### Conversion

A script is provided to quantize models. Converting large models can be somewhat slow, so be warned. To use it:

```
python convert.py \
```

@@ -102,6 +107,9 @@ supplied (with the `-m` argument) to subsequent conversion jobs to skip the first step, when quantizing

the same model to different bitrates. Once complete, the quantized tensors will be compiled into `output.safetensors`, and this file can replace the safetensors file in the original HF model.

Roughly speaking, you'll need about 24 GB of VRAM to convert a 70B model, while 7B seems to require about 8 GB. There are optimizations planned to accelerate conversion, utilizing more or larger GPUs.

### HuggingFace repos

I've uploaded a few EXL2-quantized models to HuggingFace, [here](https://huggingface.co/turboderp).

BIN doc/codellama_13b_instruct.png (new file, 184 KiB)
BIN doc/codellama_13b_instruct_thumb.png (new file, 77 KiB)
BIN doc/llama2_70b_chat.png (new file, 173 KiB)
BIN doc/llama2_70b_chat_thumb.png (new file, 94 KiB)
BIN (two binary files removed; 207 KiB and 98 KiB before)