114 Commits

| Author | SHA1 | Message | Date |
| --- | --- | --- | --- |
| turboderp | 5d5d57083e | Increase quant tolerance slightly (for small Qwen2 models esp.) | 2024-06-13 20:44:51 +02:00 |
| turboderp | 60eedf4622 | Add exit status code for quant error | 2024-06-13 20:43:49 +02:00 |
| turboderp | 56ae879187 | Merge branch 'refs/heads/202405-cached-states' into dev | 2024-06-03 22:45:05 +02:00 |
| Karl-Johan Alm | 0ece2f3006 | add layer GPU offloading for hidden/target states | 2024-05-22 15:09:57 +09:00 |
| Karl-Johan Alm | b428e239ed | optimization: put rfn_sum on cuda and do .item() call out of for loop | 2024-05-22 12:43:21 +09:00 |
| turboderp | 463373ba1e | Merge branch 'master' into dev | 2024-05-20 01:50:50 +02:00 |
| turboderp | 83baa98ed9 | Add machine-parseable output to convert script | 2024-05-20 01:49:34 +02:00 |
| turboderp | b1e092af10 | Update TODO: items | 2024-05-18 06:43:27 +02:00 |
| turboderp | a847f48720 | Allow quantizing models with max_seq_len < 2048 | 2024-05-09 17:25:28 +02:00 |
| turboderp | 750c85e2c7 | Fixes to allow quantizing Granite | 2024-05-09 02:31:21 +02:00 |
| turboderp | 0d8bac53ee | Cleanup | 2024-04-26 23:24:45 +02:00 |
| turboderp | e85404fbfd | Quant: Slight VRAM optimization, don't scale H needlessly | 2024-04-24 19:11:53 +02:00 |
| turboderp | b68c0bd89b | Fix checkpoint interval | 2024-04-18 23:18:58 +02:00 |
| turboderp | 893e73c360 | Fix scale rounding during quant | 2024-04-18 22:35:57 +02:00 |
| turboderp | b112b210aa | Quant: Add a little more damping | 2024-04-18 09:19:46 +02:00 |
| turboderp | 740a19a27c | Optimize: More robust solver | 2024-04-09 23:13:21 +02:00 |
| turboderp | 52bc008df9 | Don't add metadata when -cf not specified | 2024-04-06 17:28:57 +02:00 |
| turboderp | 63394ab8a5 | Optimizer: Add accuracy bias to first layer | 2024-04-06 08:01:43 +02:00 |
| turboderp | 5c1fcb693e | Quant: Ignore OoM error during second sanity check | 2024-04-05 21:49:42 +02:00 |
| turboderp | 672c7355a3 | Quant: Change snapshot to time instead of layer interval | 2024-04-05 21:49:10 +02:00 |
| turboderp | 5d9732165e | Quant: Swap some state to CPU and attempt to keep more VRAM available in places | 2024-04-05 21:48:08 +02:00 |
| turboderp | b8c267e224 | Quant: Perform H perm on CPU when H is very large | 2024-04-05 21:46:20 +02:00 |
| turboderp | 2a5533de3f | Quant: Option to load linear layer without allocating scratch space | 2024-04-05 21:45:47 +02:00 |
| turboderp | 88843a5633 | Quant: Offload quanting of very large layers to second GPU | 2024-04-05 21:44:21 +02:00 |
| turboderp | 97e8123c71 | Enable head (qk) norms for quantized models | 2024-04-05 21:35:23 +02:00 |
| turboderp | ff2ff0a407 | Fix typo | 2024-03-29 19:44:11 +01:00 |
| turboderp | c92ffcfc6a | Quant: Swap hessians and weights to system RAM | 2024-03-29 18:15:54 +01:00 |
| turboderp | 4845b1e89d | Quant: Save some memory when preparing quantizers for experts | 2024-03-29 18:14:48 +01:00 |
| turboderp | 7baf3d4198 | Adjust warning threshold for uncalibrated experts | 2024-03-29 18:12:34 +01:00 |
| turboderp | 762d1e4f25 | Fix typehints | 2024-03-29 18:11:55 +01:00 |
| turboderp | d8871e9ba1 | Quantize: Use RTN mode for tensors > 1e9 elements | 2024-03-19 18:25:41 +01:00 |
| turboderp | fe7be9ecef | MoE: Adjust calibration warning threshold | 2024-03-19 18:24:11 +01:00 |
| turboderp | a724caf978 | Quantize: bit of cleanup | 2024-03-19 18:23:33 +01:00 |
| turboderp | f3ed1dfed4 | Quantize: Memory optimizations | 2024-03-19 18:22:23 +01:00 |
| turboderp | 9c47269913 | Add parallel decoder block | 2024-03-19 18:20:44 +01:00 |
| turboderp | 46c59d0d42 | Quantize: RTN mode for very large head layers | 2024-03-19 17:45:29 +01:00 |
| turboderp | 6a0c5a5aa7 | Quantize: Perform very large act-order permutations on CPU | 2024-03-19 17:42:39 +01:00 |
| turboderp | 5fb2c679cb | Add quantization_config to config.json when compiling | 2024-03-12 09:09:30 +01:00 |
| turboderp | 0b05686e76 | Refactor, clean up and consolidate architecture logic | 2024-03-06 02:46:47 +01:00 |
| turboderp | dce84866e1 | Support for StarCoder2, initial | 2024-03-05 21:20:29 +01:00 |
| turboderp | 2044f8a31c | Set inference_mode when compiling model | 2024-02-22 10:48:44 +01:00 |
| turboderp | 7af6494afa | Drop device tensors for head layer during conversion | 2024-02-16 17:31:19 +01:00 |
| turboderp | cedeb616ce | Support Qwen2 | 2024-02-15 20:50:24 +01:00 |
| turboderp | 702dd9740a | VRAM optimizations during quant | 2024-02-15 20:03:47 +01:00 |
| turboderp | 0e9d9c1010 | Prevent tensors passed to save_file from sharing memory | 2024-02-01 10:14:36 +01:00 |
| turboderp | 8a0cb9e01d | Add last saved checkpoint to status box | 2024-02-01 04:56:33 +01:00 |
| turboderp | 4c93ce852f | Fix remaining time estimate | 2024-02-01 04:56:00 +01:00 |
| turboderp | 735807e800 | Use os.replace to swap checkpoint states in measure.py as well | 2024-02-01 04:39:34 +01:00 |
| turboderp | 1e70113de3 | Don't print avg accuracy, clarify "completed" -> "measured" | 2024-02-01 04:24:10 +01:00 |
| Ben Gorlick | 6c49870ec0 | Micro-optimization in file handling when saving checkpoints in quantize.py by using os.replace for atomic operations | 2024-01-31 03:22:08 -08:00 |
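
Commits 6c49870ec0 and 735807e800 mention using `os.replace` to swap checkpoint states atomically. The following is a minimal sketch of that general pattern, not the repository's actual code: the function name `save_checkpoint_atomic`, the JSON format, and the temp-file handling are all assumptions made for illustration.

```python
import json
import os
import tempfile


def save_checkpoint_atomic(state: dict, path: str) -> None:
    """Hypothetical helper: write `state` to `path` without exposing a partial file."""
    directory = os.path.dirname(os.path.abspath(path))
    # Create the temp file next to the target so the rename stays on one filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(state, f)
            f.flush()
            os.fsync(f.fileno())  # push the data to disk before the swap
        # os.replace overwrites an existing destination in a single rename step.
        os.replace(tmp_path, path)
    except BaseException:
        # On any failure, remove the temp file and leave the old checkpoint untouched.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```

With this approach, an interrupted save leaves the previous checkpoint in place, because the destination only changes at the final rename.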