# EXL3 conversion script

## Arguments

### Basic
- `-i / --in_dir directory`: The source model to convert, in unquantized HF format. The directory should contain at least a `config.json` file, a `tokenizer.json` file and one or more `.safetensors` files containing weights.
- `-o / --out_dir directory`: The destination directory for the converted EXL3 model. Will be created if it doesn't exist, or overwritten if it does.
- `-w / --work_dir directory`: Working directory for temporary files. It should have enough free space to store an entire copy of the output model. This is also where checkpoints are stored, and it is the only required argument if `-r / --resume` is specified.
- `-ss / --shard_size float`: Output shard size, in megabytes. Default is 8192. Set this to 0 to disable sharding. Note that writing very large `.safetensors` files can require a lot of system RAM.
- `-b / --bits float`: Target average number of bits per weight.
- `-hb / --head_bits int`: Number of bits per weight for the `lm_head` (output) layer of the model. Must be an integer from 1 to 8; default is 6.
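As a back-of-envelope check when choosing `-b` and `-ss`, the quantized weights take roughly parameters × bits / 8 bytes. The exact figure depends on scale overhead and the `lm_head` bitrate, so treat this helper (not part of the script) as an estimate only:

```python
def estimate_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Rough quantized model size in GiB: params * bpw / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1024**3

# A 70B-parameter model at 3.75 bpw:
size = estimate_size_gib(70e9, 3.75)
print(f"{size:.1f} GiB")  # prints "30.6 GiB"; about 4 shards at the default 8192 MB
```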
### Advanced (generally disregard these options)
- `--out_scales` | `--no_out_scales`: Force enable or disable output channel scales. By default, whether to apply this setting is autodetected during quantization, so don't set either of these flags unless there's a good reason to. Mostly they're for debug purposes.
### Checkpoints
- `-cpi / --checkpoint_interval int`: Interval (in seconds) between checkpoints.
- `-r / --resume`: Resume an interrupted job, pointed to by `-w / --work_dir`, from the latest checkpoint. When resuming a job, all other arguments such as input and output directories, bitrate etc. are restored from the old job, though some can be overridden. Note that resuming is now explicit, reversing the behavior from ExLlamaV2.
### Performance
- `-d / --devices list`: Comma-separated list of GPU device IDs to use during quantization. By default only the first visible device (device 0) is used. Adding more devices can speed up quantization if there is sufficient PCIe bandwidth between them. This does not affect memory usage on the first GPU, and very little memory is used on the others, since only the most compute-intensive operation (trellis encoding) is distributed.
- `-dr / --device_ratios list`: Comma-separated list of ratios. Determines how the encoding workload is distributed when using multiple devices. This is useful when GPUs have dissimilar compute performance, to prevent slower GPUs from becoming bottlenecks. Ratios are relative, i.e. `1,1,3` is the same ratio as `3,3,9`.
- `-pm / --parallel_mode`: Fully parallelize quantization across multiple GPUs when possible. By default, multi-GPU quantization works by splitting the trellis encoding workload across devices. For models with many small tensors (especially MoE models) this is inefficient, since the resulting tile slices end up being too small for efficient batched encoding. This mode instead distributes one linear layer to each GPU at a time, allowing larger encoding batches and higher overall throughput. It is still somewhat experimental but will likely become the default soon.
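To make the relative-ratio behavior concrete: a `-dr` list can be thought of as being normalized by its sum, giving each device a fraction of the workload. A minimal sketch (the parsing here is illustrative, not the script's actual code):

```python
from fractions import Fraction

def ratio_shares(dr: str) -> list[Fraction]:
    """Turn a -dr style ratio list into per-device fractions of the workload."""
    parts = [int(p) for p in dr.split(",")]
    return [Fraction(p, sum(parts)) for p in parts]

# 1,1,3 and 3,3,9 describe the same split:
assert ratio_shares("1,1,3") == ratio_shares("3,3,9")
print([str(f) for f in ratio_shares("3,4,5")])  # ['1/4', '1/3', '5/12']
```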
### Debug stuff (ignore these)
- `-lcpi / --last_checkpoint_index int`: If specified, don't save checkpoints after this module index.
- `-cr / --cal_rows int`: Number of rows of calibration data.
- `-cc / --cal_cols int`: Number of columns of calibration data.
- `-v / --verbose`: Extra debug output while quantizing.
- `--override_anyway`: Allow resuming even when overriding settings that will break the existing job.
- `-img / --image_dump`: Save all tensors as images in the working directory. May require a large amount of system memory and disk space, and can be slow.
## Examples

### Converting
```sh
python convert.py -i /mnt/models/llama3.1-70b-instruct \
    -o /mnt/models/llama3.1-70b-instruct-exl3-3.75bpw \
    -w /mnt/temp/exl3 \
    -b 3.75
```
### Resuming

Resume the job started above if it was interrupted:
```sh
python convert.py -w /mnt/temp/exl3 -r
```
### Multi-GPU quant

Convert a model on the first three devices:
```sh
python convert.py -i /mnt/models/llama3.1-70b-instruct \
    -o /mnt/models/llama3.1-70b-instruct-exl3-3.75bpw \
    -w /mnt/temp/exl3 \
    -b 3.75 \
    -d 0,1,2
```
Convert a model on the first three devices, using CUDA:2 as the primary device and distributing 3/12, 4/12 and 5/12 of the workload to CUDA:2, CUDA:0 and CUDA:1, respectively:
```sh
python convert.py -i /mnt/models/llama3.1-70b-instruct \
    -o /mnt/models/llama3.1-70b-instruct-exl3-3.75bpw \
    -w /mnt/temp/exl3 \
    -b 3.75 \
    -d 2,0,1 \
    -dr 3,4,5
```
To dial in the optimal ratio, monitor GPU usage while quantizing. Usage should periodically jump to 100% on at least one device, and ideally the other devices should be close to that as well while they are active. Increasing a device's relative split should increase its relative usage accordingly.
## Expected duration
Some rough estimates of the overall time to convert various sizes of models, as of v0.0.1. The quantization kernel is currently tuned for Ada GPUs, and Ampere performance in particular is likely to improve with further optimization.
| Model size | bpw | 3090 | 4090 | 5090 | 5090 + 2x4090 |
|---|---|---|---|---|---|
| 1B | 2.0 | 11m | 4m | 3m | 2m |
| 1B | 4.0 | 6m | 3m | 2m | 1m |
| 1B | 8.0 | 5m | 3m | 2m | 1m |
| 8B | 2.0 | 1h 08m | 21m | 16m | 10m |
| 8B | 4.0 | 34m | 14m | 12m | 8m |
| 8B | 8.0 | 31m | 14m | 11m | 8m |
| 70B | 2.0 | 9h 04m | 3h 07m | 2h 19m | 1h 24m |
| 70B | 4.0 | 4h 32m | 2h 00m | 1h 50m | 1h 05m |
| 70B | 8.0 | 4h 20m | 1h 58m | 1h 41m | 1h 03m |
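The table suggests that, at a fixed bitrate, conversion time grows roughly linearly with parameter count. A quick sanity check on the 4090 column (numbers taken from the table above), plus a rough, purely illustrative extrapolation to a size not listed:

```python
# 4090 column at 4.0 bpw: 8B takes 14 min, 70B takes 2h 0m = 120 min.
t_8b, t_70b = 14, 120           # minutes
size_ratio = 70 / 8             # 8.75x the parameters
time_ratio = t_70b / t_8b       # ~8.6x the time
assert abs(time_ratio / size_ratio - 1) < 0.15  # roughly linear

# Crude estimate for, say, a ~30B model at 4.0 bpw on a 4090:
est_30b = t_8b * (30 / 8)       # ~52 minutes
```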