turboderp | 5d5d57083e | Increase quant tolerance slightly (for small Qwen2 models esp.) | 2024-06-13 20:44:51 +02:00
turboderp | 60eedf4622 | Add exit status code for quant error | 2024-06-13 20:43:49 +02:00
turboderp | 56ae879187 | Merge branch 'refs/heads/202405-cached-states' into dev | 2024-06-03 22:45:05 +02:00
Karl-Johan Alm | 0ece2f3006 | add layer GPU offloading for hidden/target states | 2024-05-22 15:09:57 +09:00
Karl-Johan Alm | b428e239ed | optimization: put rfn_sum on cuda and do .item() call out of for loop | 2024-05-22 12:43:21 +09:00
turboderp | 463373ba1e | Merge branch 'master' into dev | 2024-05-20 01:50:50 +02:00
turboderp | 83baa98ed9 | Add machine-parseable output to convert script | 2024-05-20 01:49:34 +02:00
turboderp | b1e092af10 | Update TODO: items | 2024-05-18 06:43:27 +02:00
turboderp | a847f48720 | Allow quantizing models with max_seq_len < 2048 | 2024-05-09 17:25:28 +02:00
turboderp | 750c85e2c7 | Fixes to allow quantizing Granite | 2024-05-09 02:31:21 +02:00
turboderp | 0d8bac53ee | Cleanup | 2024-04-26 23:24:45 +02:00
turboderp | e85404fbfd | Quant: Slight VRAM optimization, don't scale H needlessly | 2024-04-24 19:11:53 +02:00
turboderp | b68c0bd89b | Fix checkpoint interval | 2024-04-18 23:18:58 +02:00
turboderp | 893e73c360 | Fix scale rounding during quant | 2024-04-18 22:35:57 +02:00
turboderp | b112b210aa | Quant: Add a little more damping | 2024-04-18 09:19:46 +02:00
turboderp | 740a19a27c | Optimize: More robust solver | 2024-04-09 23:13:21 +02:00
turboderp | 52bc008df9 | Don't add metadata when -cf not specified | 2024-04-06 17:28:57 +02:00
turboderp | 63394ab8a5 | Optimizer: Add accuracy bias to first layer | 2024-04-06 08:01:43 +02:00
turboderp | 5c1fcb693e | Quant: Ignore OoM error during second sanity check | 2024-04-05 21:49:42 +02:00
turboderp | 672c7355a3 | Quant: Change snapshot to time instead of layer interval | 2024-04-05 21:49:10 +02:00
turboderp | 5d9732165e | Quant: Swap some state to CPU and attempt to keep more VRAM available in places | 2024-04-05 21:48:08 +02:00
turboderp | b8c267e224 | Quant: Perform H perm on CPU when H is very large | 2024-04-05 21:46:20 +02:00
turboderp | 2a5533de3f | Quant: Option to load linear layer without allocating scratch space | 2024-04-05 21:45:47 +02:00
turboderp | 88843a5633 | Quant: Offload quanting of very large layers to second GPU | 2024-04-05 21:44:21 +02:00
turboderp | 97e8123c71 | Enable head (qk) norms for quantized models | 2024-04-05 21:35:23 +02:00
turboderp | ff2ff0a407 | Fix typo | 2024-03-29 19:44:11 +01:00
turboderp | c92ffcfc6a | Quant: Swap hessians and weights to system RAM | 2024-03-29 18:15:54 +01:00
turboderp | 4845b1e89d | Quant: Save some memory when preparing quantizers for experts | 2024-03-29 18:14:48 +01:00
turboderp | 7baf3d4198 | Adjust warning threshold for uncalibrated experts | 2024-03-29 18:12:34 +01:00
turboderp | 762d1e4f25 | Fix typehints | 2024-03-29 18:11:55 +01:00
turboderp | d8871e9ba1 | Quantize: Use RTN mode for tensors > 1e9 elements | 2024-03-19 18:25:41 +01:00
turboderp | fe7be9ecef | MoE: Adjust calibration warning threshold | 2024-03-19 18:24:11 +01:00
turboderp | a724caf978 | Quantize: bit of cleanup | 2024-03-19 18:23:33 +01:00
turboderp | f3ed1dfed4 | Quantize: Memory optimizations | 2024-03-19 18:22:23 +01:00
turboderp | 9c47269913 | Add parallel decoder block | 2024-03-19 18:20:44 +01:00
turboderp | 46c59d0d42 | Quantize: RTN mode for very large head layers | 2024-03-19 17:45:29 +01:00
turboderp | 6a0c5a5aa7 | Quantize: Perform very large act-order permutations on CPU | 2024-03-19 17:42:39 +01:00
turboderp | 5fb2c679cb | Add quantization_config to config.json when compiling | 2024-03-12 09:09:30 +01:00
turboderp | 0b05686e76 | Refactor, clean up and consolidate architecture logic | 2024-03-06 02:46:47 +01:00
turboderp | dce84866e1 | Support for StarCoder2, initial | 2024-03-05 21:20:29 +01:00
turboderp | 2044f8a31c | Set inference_mode when compiling model | 2024-02-22 10:48:44 +01:00
turboderp | 7af6494afa | Drop device tensors for head layer during conversion | 2024-02-16 17:31:19 +01:00
turboderp | cedeb616ce | Support Qwen2 | 2024-02-15 20:50:24 +01:00
turboderp | 702dd9740a | VRAM optimizations during quant | 2024-02-15 20:03:47 +01:00
turboderp | 0e9d9c1010 | Prevent tensors passed to save_file from sharing memory | 2024-02-01 10:14:36 +01:00
turboderp | 8a0cb9e01d | Add last saved checkpoint to status box | 2024-02-01 04:56:33 +01:00
turboderp | 4c93ce852f | Fix remaining time estimate | 2024-02-01 04:56:00 +01:00
turboderp | 735807e800 | Use os.replace to swap checkpoint states in measure.py as well | 2024-02-01 04:39:34 +01:00
turboderp | 1e70113de3 | Don't print avg accuracy, clarify "completed" -> "measured" | 2024-02-01 04:24:10 +01:00
Ben Gorlick | 6c49870ec0 | Micro-optimization in file handling when saving checkpoints in quantize.py by using os.replace for atomic operations | 2024-01-31 03:22:08 -08:00