Update convert.py instructions
@@ -39,8 +39,11 @@ particularly useful when quantizing the same model to multiple bitrates, since t
 to complete.
 
-- **-c / --cal_dataset *file***: (_required_) The calibration dataset in Parquet format. The quantizer concatenates all
-the data in this file into one long string and uses the first _r_ \* _l_ tokens for calibration.
+- **-c / --cal_dataset *file***: (_optional_) The calibration dataset in Parquet format. The quantizer concatenates all
+the data in this file into one long string and uses the first _r_ \* _l_ tokens for calibration. If this is not
+specified, the default, built-in calibration dataset is used, which contains a broad mix of different types of data. It's
+designed to prevent the quantized model from overfitting to any particular mode, language or style, and generally
+results in more robust, reliable outputs, especially at lower bitrates.
 
 
 - **-l / --length *int***: Length, in tokens, of each calibration row. Default is 2048.
 
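To make the calibration options concrete, here is a sketch of a run with an explicit calibration file, in the style of the examples further down (all paths are placeholders; `-i` and `-o` are assumed to be the script's usual input and working directories):

```
# Quantize with a custom calibration dataset; omit -c to fall back to the
# built-in dataset described above (paths are placeholders)
python convert.py \
    -i /mnt/models/llama2-7b-fp16/ \
    -o /mnt/temp/exl2/ \
    -c /mnt/datasets/cal_data.parquet \
    -b 4.0

# One possible way to produce such a Parquet file from a plain-text corpus,
# assuming the quantizer reads string data from a "text" column
# (hypothetical corpus.txt; requires pandas and pyarrow)
python -c "import pandas as pd; pd.DataFrame({'text': open('corpus.txt').read().splitlines()}).to_parquet('cal_data.parquet')"
```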
@@ -56,12 +59,6 @@ is 2048.
 - **-mr / --measurement_rows *int***: Number of rows in the calibration batch for the measuring pass. Default is 16.
 
 
-- **-gr / --gpu_rows *int***: Threshold for when to swap the calibration state to and from system RAM, saving a lot of
-VRAM at the cost of some speed. If this number is less than **-r**, system RAM will be used in the quantization pass.
-Likewise for the measurement pass if the number is less than **-mr**. Depending on your available VRAM you may be able
-to run without swapping for smaller models and have to set **-gr** to zero for larger ones. Default is 0.
-
-
 - **-b / --bits *float***: Target average number of bits per weight.
 
 
@@ -77,17 +74,21 @@ Note that writing a very large `.safetensors` file can require a lot of system RAM
 - **-ra / --rope_alpha *float***: RoPE (NTK) alpha to apply to base model for calibration.
 
 
-- **-rs / --rope_scale *float***: RoPE scaling factor to apply to base model for calibration.
+- **-rs / --rope_scale *float***: RoPE scaling factor to apply to base model for calibration. This setting is not
+automatically read from the model's config, so it's strongly recommended that you check what setting the model was
+trained/finetuned with. E.g.: deepseek-coder uses a scaling factor of 4, so it will be incorrectly calibrated if you
+convert it without `-rs 4`.
 
 
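As a concrete version of the deepseek-coder case above, a sketch of such a conversion (model path, working directory and bitrate are placeholders):

```
# deepseek-coder was trained with a RoPE scaling factor of 4, so pass -rs 4
python convert.py \
    -i /mnt/models/deepseek-coder-6.7b-fp16/ \
    -o /mnt/temp/exl2/ \
    -b 4.0 \
    -rs 4
```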
 ### Notes
 
-The converter works in two passes; first it measures how quantization impacts each matrix in the model, and then it
+The converter works in two passes; first it measures how quantization impacts each module of the model, and then it
 actually quantizes the model, choosing quantization parameters for each layer that minimize the overall error while
 also achieving the desired overall (average) bitrate.
 
-The first pass is slow, since it effectively quantizes the model about 20 times over, so make sure to save the
-`measurement.json` file so you can skip the measurement pass on subsequent quants of the same model.
+The first pass is slow, since it effectively quantizes the entire model about 12 times over (albeit with a less
+comprehensive sample of the calibration dataset), so make sure to save the `measurement.json` file so you can skip the
+measurement pass on subsequent quants of the same model.
 
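A sketch of the two-pass workflow this implies, assuming the `-m / --measurement` option documented earlier in this README accepts a previously saved measurement file (paths are placeholders):

```
# First quant: the slow measurement pass runs, and measurement.json is left
# in the working directory (-o)
python convert.py -i /mnt/models/llama2-7b-fp16/ -o /mnt/temp/exl2-6bpw/ -b 6.0

# Later quants of the same model: reuse the saved measurement and skip
# straight to the quantization pass
python convert.py -i /mnt/models/llama2-7b-fp16/ -o /mnt/temp/exl2-4bpw/ -b 4.0 \
    -m /mnt/temp/exl2-6bpw/measurement.json
```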
 ### Examples
 
@@ -137,5 +138,9 @@ python convert.py \
 
 ### Hardware requirements
 
-Roughly speaking, you'll need about 24 GB of VRAM to convert a 70B model, while 7B seems to require about 8 GB. Stay
-tuned for more details.
+Roughly speaking, you'll need about 64 GB of RAM and 24 GB of VRAM to convert a 70B model, while 7B seems to require
+about 16 GB of RAM and about 8 GB of VRAM.
+
+The deciding factor for the memory requirement is the *width* of the model rather than the depth, so 120B models that
+have the same hidden size as 70B models have the same hardware requirements. Mixtral 8x7B has much wider feed-forward
+layers, so it requires about 20 GB of VRAM.