# EXL3 conversion script

## Arguments

### Basic
- `-i / --in_dir directory`: The source model to convert, in unquantized HF format. The directory should contain at least a `config.json` file, a `tokenizer.json` file and one or more `.safetensors` files containing weights.
- `-o / --out_dir directory`: The destination directory for the converted EXL3 model. Will be created if it doesn't exist, or overwritten if it does.
- `-w / --work_dir directory`: Working directory for temporary files. It should have enough free space to store an entire copy of the output model. This is also where checkpoints are stored, and it is the only required argument if `-r / --resume` is specified.
- `-ss / --shard_size float`: Output shard size, in megabytes. Default is 8192. Set this to 0 to disable sharding. Note that writing very large `.safetensors` files can require a lot of system RAM.
- `-b / --bits float`: Target average number of bits per weight.
- `-hb / --head_bits int`: Number of bits per weight for the `lm_head` (output) layer of the model. Must be an integer from 1 to 8; default is 6.
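As a back-of-envelope check when choosing `-b` and `-ss`, the quantized weights take roughly parameters × bits / 8 bytes. The exact figure depends on scale overhead and the `lm_head` bitrate, so treat this helper (not part of the script) as an estimate only:

```python
def estimate_size_gib(num_params: float, bits_per_weight: float) -> float:
    """Rough quantized model size in GiB: params * bpw / 8 bytes."""
    return num_params * bits_per_weight / 8 / 1024**3

# A 70B-parameter model at 3.75 bpw:
size = estimate_size_gib(70e9, 3.75)
print(f"{size:.1f} GiB")  # prints "30.6 GiB"; about 4 shards at the default 8192 MB
```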
### Advanced (generally disregard these options)
- `--out_scales` | `--no_out_scales`: Force enable or disable output channel scales. By default, whether to apply this setting is autodetected during quantization, so don't set either of these flags unless there's a good reason to. Mostly they're for debug purposes.
### Checkpoints
- `-cpi / --checkpoint_interval int`: Interval (in seconds) between checkpoints.
- `-r / --resume`: Resume an interrupted job, pointed to by `-w / --work_dir`, from the latest checkpoint. When resuming a job, all other arguments such as input and output directories, bitrate etc. are restored from the old job, though some can be overridden. Note that resuming is now explicit, reversing the behavior from ExLlamaV2.
### Performance
- `-d / --devices list`: Comma-separated list of GPU device IDs to use during quantization. By default only the first visible device (device 0) is used. Adding more devices can speed up quantization if there is sufficient PCIe bandwidth between them. This does not affect memory usage on the first GPU, and very little memory is used on the others, since only the most compute-intensive operation (trellis encoding) is distributed.
- `-dr / --device_ratios list`: Comma-separated list of ratios. Determines how the encoding workload is distributed when using multiple devices. This is useful when GPUs have dissimilar compute performance, to prevent slower GPUs from becoming bottlenecks. Ratios are relative, i.e. `1,1,3` is the same ratio as `3,3,9`.
- `-pm / --parallel_mode`: Fully parallelize quantization across multiple GPUs when possible. By default, multi-GPU quantization works by splitting the trellis encoding workload across devices. For models with many small tensors (especially MoE models) this is inefficient, since the resulting tile slices end up being too small for efficient batched encoding. This mode instead distributes one linear layer to each GPU at a time, allowing larger encoding batches and higher overall throughput. It is still somewhat experimental but will likely become the default soon.
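To make the relative-ratio behavior concrete: a `-dr` list can be thought of as being normalized by its sum, giving each device a fraction of the workload. A minimal sketch (the parsing here is illustrative, not the script's actual code):

```python
from fractions import Fraction

def ratio_shares(dr: str) -> list[Fraction]:
    """Turn a -dr style ratio list into per-device fractions of the workload."""
    parts = [int(p) for p in dr.split(",")]
    return [Fraction(p, sum(parts)) for p in parts]

# 1,1,3 and 3,3,9 describe the same split:
assert ratio_shares("1,1,3") == ratio_shares("3,3,9")
print([str(f) for f in ratio_shares("3,4,5")])  # ['1/4', '1/3', '5/12']
```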
### Debug stuff (ignore these)
- `-lcpi / --last_checkpoint_index int`: If specified, don't save checkpoints after this module index.
- `-cr / --cal_rows int`: Number of rows of calibration data.
- `-cc / --cal_cols int`: Number of columns of calibration data.
- `-v / --verbose`: Extra debug output while quantizing.
- `--override_anyway`: Allow resuming even when overriding settings that will break the existing job.
- `-img / --image_dump`: Save all tensors as images in the working directory. May require a large amount of system memory and disk space, and can be slow.
## Examples

### Converting
```sh
python convert.py -i /mnt/models/llama3.1-70b-instruct \
    -o /mnt/models/llama3.1-70b-instruct-exl3-3.75bpw \
    -w /mnt/temp/exl3 \
    -b 3.75
```
### Resuming

Resume the job started above if it was interrupted:
```sh
python convert.py -w /mnt/temp/exl3 -r
```
### Multi-GPU quant

Convert a model on the first three devices:
```sh
python convert.py -i /mnt/models/llama3.1-70b-instruct \
    -o /mnt/models/llama3.1-70b-instruct-exl3-3.75bpw \
    -w /mnt/temp/exl3 \
    -b 3.75 \
    -d 0,1,2
```
Convert a model on the first three devices, using CUDA:2 as the primary device and distributing 3/12, 4/12 and 5/12 of the workload to CUDA:2, CUDA:0 and CUDA:1, respectively:
```sh
python convert.py -i /mnt/models/llama3.1-70b-instruct \
    -o /mnt/models/llama3.1-70b-instruct-exl3-3.75bpw \
    -w /mnt/temp/exl3 \
    -b 3.75 \
    -d 2,0,1 \
    -dr 3,4,5
```
To dial in the optimal ratio, monitor GPU usage while quantizing. Usage should periodically jump to 100% on at least one device, and ideally the other devices should be close to that as well while they are active. Increasing a device's relative split should increase its relative usage accordingly.
## Expected duration
Some rough estimates of the overall time to convert various sizes of models, as of v0.0.1. The quantization kernel is currently tuned for Ada GPUs, and Ampere performance in particular is likely to improve with further optimization.
| Model size | bpw | 3090 | 4090 | 5090 | 5090 + 2x4090 |
|---|---|---|---|---|---|
| 1B | 2.0 | 11m | 4m | 3m | 2m |
| 1B | 4.0 | 6m | 3m | 2m | 1m |
| 1B | 8.0 | 5m | 3m | 2m | 1m |
| 8B | 2.0 | 1h 08m | 21m | 16m | 10m |
| 8B | 4.0 | 34m | 14m | 12m | 8m |
| 8B | 8.0 | 31m | 14m | 11m | 8m |
| 70B | 2.0 | 9h 04m | 3h 07m | 2h 19m | 1h 24m |
| 70B | 4.0 | 4h 32m | 2h 00m | 1h 50m | 1h 05m |
| 70B | 8.0 | 4h 20m | 1h 58m | 1h 41m | 1h 03m |
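The table suggests that, at a fixed bitrate, conversion time grows roughly linearly with parameter count. A quick sanity check on the 4090 column (numbers taken from the table above), plus a rough, purely illustrative extrapolation to a size not listed:

```python
# 4090 column at 4.0 bpw: 8B takes 14 min, 70B takes 2h 0m = 120 min.
t_8b, t_70b = 14, 120           # minutes
size_ratio = 70 / 8             # 8.75x the parameters
time_ratio = t_70b / t_8b       # ~8.6x the time
assert abs(time_ratio / size_ratio - 1) < 0.15  # roughly linear

# Crude estimate for, say, a ~30B model at 4.0 bpw on a 4090:
est_30b = t_8b * (30 / 8)       # ~52 minutes
```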