Update README.md

This commit is contained in:
turboderp
2023-09-12 06:41:36 +02:00
parent 7704a6877b
commit c240eb0b70
7 changed files with 46 additions and 38 deletions


@@ -11,7 +11,27 @@ still needs a lot of testing and tuning, and a few key features are not yet impl
- Support for a new quant format (see below)
## How to do stuff
## Performance
Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
| Model | Mode | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090 |
|------------|--------------|-------|-------|-----|------------|----------|------------|-------------|
| Llama | GPTQ | 7B | 128 | no | 143 t/s | 173 t/s | 175 t/s | **195** t/s |
| Llama | GPTQ | 13B | 128 | no | 84 t/s | 102 t/s | 105 t/s | **110** t/s |
| Llama | GPTQ | 33B | 128 | yes | 37 t/s | 45 t/s | 45 t/s | **48** t/s |
| OpenLlama | GPTQ | 3B | 128 | yes | 194 t/s | 226 t/s | 295 t/s | **321** t/s |
| CodeLlama | EXL2 4.0 bpw | 34B | - | - | - | - | 42 t/s | **48** t/s |
| Llama2 | EXL2 3.0 bpw | 7B | - | - | - | - | 195 t/s | **224** t/s |
| Llama2 | EXL2 4.0 bpw | 7B | - | - | - | - | 164 t/s | **197** t/s |
| Llama2 | EXL2 5.0 bpw | 7B | - | - | - | - | 144 t/s | **160** t/s |
| Llama2 | EXL2 2.5 bpw | 70B | - | - | - | - | 30 t/s | **35** t/s |
| TinyLlama | EXL2 3.0 bpw | 1.1B | - | - | - | - | 536 t/s | **635** t/s |
| TinyLlama | EXL2 4.0 bpw | 1.1B | - | - | - | - | 509 t/s | **590** t/s |
## How to
Clone the repository and install dependencies:
@@ -25,32 +45,15 @@ python test_inference -m <path_to_model> -p "Once upon a time,"
For now, a simple console chatbot is included. Run it with:
`python examples/chat.py -m <path_to_model> -mode llama`
```
python examples/chat.py -m <path_to_model> -mode llama
```
The `-mode` argument chooses the prompt format to use. `llama` is for the Llama(2)-chat finetunes, while `codellama`
probably works better for CodeLlama-instruct. `raw` will produce a simple chatlog-style chat that works with base
models and various other finetunes. You can also provide a custom system prompt with `-p`.
## Performance
Some quick tests to compare performance with V1. There may be more performance optimizations in the future, and
speeds will vary across GPUs, with slow CPUs still being a potential bottleneck:
| Model | Mode | Size | grpsz | act | V1: 3090Ti | V1: 4090 | V2: 3090Ti | V2: 4090 |
|------------|-------------|-------|-------|-----|------------|----------|------------|-------------|
| Llama | GPTQ | 7B | 128 | no | 143 t/s | 173 t/s | 175 t/s | **195** t/s |
| Llama | GPTQ | 13B | 128 | no | 84 t/s | 102 t/s | 105 t/s | **110** t/s |
| Llama | GPTQ | 33B | 128 | yes | 37 t/s | 45 t/s | 45 t/s | **48** t/s |
| OpenLlama | GPTQ | 3B | 128 | yes | 194 t/s | 226 t/s | 295 t/s | **321** t/s |
| CodeLlama | EXL2 4.0bpw | 34B | - | - | - | - | 42 t/s | **48** t/s |
| Llama2 | EXL2 3.0bpw | 7B | - | - | - | - | 195 t/s | **224** t/s |
| Llama2 | EXL2 4.0bpw | 7B | - | - | - | - | 164 t/s | **197** t/s |
| Llama2 | EXL2 5.0bpw | 7B | - | - | - | - | 144 t/s | **160** t/s |
| Llama2 | EXL2 2.5bpw | 70B | - | - | - | - | 30 t/s | **35** t/s |
| TinyLlama2 | EXL2 3.0bpw | 1.1B | - | - | - | - | 536 t/s | **635** t/s |
| TinyLlama2 | EXL2 4.0bpw | 1.1B | - | - | - | - | 509 t/s | **590** t/s |
### Installation
Clone the repository and run `python setup.py install --user`. (A PyPI package is coming, be patient.)
@@ -62,29 +65,31 @@ for subsequent use.
## EXL2 quantization
ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new format. The new format is based on the same GPTQ/OBQ
optimization approach, supporting 2, 3, 4, 5, 6 and 8-bit quantization. Most notably, by mixing them you can target any
*average* bitrate from 2 up to 8 bits. This also allows for multiple quantization settings within each linear
layer, producing something akin to sparse quantization wherein more important weights (columns) are quantized with more
bits. The same remapping trick that lets ExLlamaV1 work efficiently with act-order models also allows this mixing
of formats to happen with minimal impact on performance.
ExLlamaV2 supports the same 4-bit GPTQ models as V1, but also a new "EXL2" format. EXL2 is based on the same
optimization method as GPTQ and supports 2, 3, 4, 5, 6 and 8-bit quantization. The format allows for mixing quantization
levels within a model to achieve any average bitrate between 2 and 8 bits per weight.
The parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
error (with respect to the chosen calibration data) for each of a number of possible settings, and then finally creating
a quantization scheme for the entire model that minimizes the maximum quantization error while meeting a target average
bitrate.
Moreover, it's possible to apply multiple quantization levels to each linear layer, producing something akin to sparse
quantization wherein more important weights (columns) are quantized with more bits. The same remapping trick that lets
ExLlama work efficiently with act-order models allows this mixing of formats to happen with little to no impact on
performance.
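As a rough illustration of how mixing levels yields a fractional average bitrate, here is a minimal sketch. The column split is made up for the example, and the calculation ignores any per-group overhead (scales and similar metadata) that a real quantized file would carry:

```python
def average_bpw(groups):
    """Average bits per weight for a layer split into column groups.

    groups: list of (fraction_of_columns, bits) pairs; fractions must sum to 1.
    """
    total = sum(frac for frac, _ in groups)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(frac * bits for frac, bits in groups)


# Hypothetical split: the most important 10% of columns get 4 bits,
# the next 40% get 3 bits, and the remaining 50% get 2 bits.
mix = [(0.10, 4), (0.40, 3), (0.50, 2)]
print(f"{average_bpw(mix):.2f} bits per weight")
```

Shifting a few percent of the columns from one level to another moves the average in small steps, which is why targets such as 2.5 bits are reachable.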
In my tests, this allows Llama2 70B to run on a single 24 GB GPU at the full 4096-token context, producing at least
coherent output with 2.5 bits per weight. It can still be unstable, so it probably still needs a little optimization.
It also only *barely* fits in 24 GB, so it most likely won't work with a desktop environment running on the same GPU.
Parameter selection is done automatically by quantizing each matrix multiple times, measuring the quantization
error (with respect to the chosen calibration data) for each of a number of possible settings, per layer. Finally, a
combination is chosen that minimizes the maximum quantization error over the entire model while meeting a target
average bitrate.
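To make the selection step concrete, here is a toy sketch of one way to minimize the maximum error under a bitrate budget. It is not the code in `convert.py`; the function name, the data layout and the error numbers are all assumptions for illustration. It tries each measured error as a threshold, lets every layer take its cheapest setting under that threshold, and stops at the first threshold that fits the budget:

```python
def choose_settings(layers, target_bpw):
    """Pick one (bpw, error) option per layer, minimizing the maximum error.

    layers: list of dicts with 'numel' (weight count) and
            'options' (list of (bpw, error) tuples measured for that layer).
    Returns (max_error, picks) or None if the target cannot be met.
    """
    total_weights = sum(layer["numel"] for layer in layers)
    budget = target_bpw * total_weights  # total bits allowed

    # Every measured error is a candidate for the final maximum error.
    thresholds = sorted({err for layer in layers for _, err in layer["options"]})
    for t in thresholds:
        picks = []
        for layer in layers:
            ok = [opt for opt in layer["options"] if opt[1] <= t]
            if not ok:
                break  # this layer cannot reach error <= t
            picks.append(min(ok, key=lambda opt: opt[0]))  # cheapest option
        else:
            used = sum(l["numel"] * p[0] for l, p in zip(layers, picks))
            if used <= budget:
                return t, picks
    return None


# Made-up measurements for two equally sized layers.
layers = [
    {"numel": 100, "options": [(2.0, 0.30), (3.0, 0.10), (4.0, 0.05)]},
    {"numel": 100, "options": [(2.0, 0.08), (3.0, 0.04), (4.0, 0.02)]},
]
print(choose_settings(layers, target_bpw=2.5))
```

In this example the first layer is more sensitive, so it ends up with 3 bits while the second layer drops to 2 bits, meeting the 2.5-bit average.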
[![chat_screenshot](doc/screenshot_chat_2.5bit_thumb.png)](doc/screenshot_chat_2.5bit.png)
In my tests, this scheme allows Llama2 70B to run on a single 24 GB GPU with a full (4k) context, producing coherent
and mostly stable output at 2.55 bits per weight. 13B models run at 2.65 bits per weight within 8 GB of VRAM, although
currently none of them use GQA, which effectively limits the context size to 2048. In either case, it's unlikely that
the model will fit alongside a desktop environment, at least for now.
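A back-of-the-envelope estimate shows why these fits are so tight. The sketch below only counts the quantized weights, assuming round parameter counts of 70e9 and 13e9; the cache, activations and any unquantized tensors come on top of that:

```python
def weight_gib(n_params, bpw):
    """Approximate size of the quantized weights in GiB."""
    return n_params * bpw / 8 / 2**30


# Parameter counts are rounded assumptions, not exact model sizes.
for name, n_params, bpw in [("70B", 70e9, 2.55), ("13B", 13e9, 2.65)]:
    print(f"{name} at {bpw} bpw: ~{weight_gib(n_params, bpw):.1f} GiB of weights")
```

By this estimate the 70B weights alone take a little under 21 GiB and the 13B weights about 4 GiB, leaving only a few GiB for everything else on a 24 GB or 8 GB card.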
[![chat_screenshot](doc/llama2_70b_chat_thumb.png)](doc/llama2_70b_chat.png)
[![chat_screenshot](doc/codellama_13b_instruct_thumb.png)](doc/codellama_13b_instruct.png)
### Conversion
A script is provided to quantize models. Converting large models can be somewhat slow, so be warned.
To use it:
A script is provided to quantize models. Converting large models can be somewhat slow, so be warned. To use it:
```
python convert.py \
@@ -102,6 +107,9 @@ supplied (with the `-m` argument) to subsequent conversion jobs to skip the firs
the same model to different bitrates. Once complete, the quantized tensors will be compiled into `output.safetensors`,
and this file can replace the safetensors file in the original HF model.
Roughly speaking, you'll need about 24 GB of VRAM to convert a 70B model, while 7B seems to require about 8 GB. There
are optimizations planned to accelerate conversion, utilizing more or larger GPUs.
### HuggingFace repos
I've uploaded a few EXL2-quantized models to HuggingFace, [here](https://huggingface.co/turboderp).

Six binary image files changed (not shown): four added, including doc/llama2_70b_chat.png, and two removed.