turboderp
5d5d57083e
Increase quant tolerance slightly (for small Qwen2 models esp.)
2024-06-13 20:44:51 +02:00
turboderp
60eedf4622
Add exit status code for quant error
2024-06-13 20:43:49 +02:00
Karl-Johan Alm
b428e239ed
optimization: put rfn_sum on cuda and do .item() call out of for loop
2024-05-22 12:43:21 +09:00
turboderp
83baa98ed9
Add machine-parseable output to convert script
2024-05-20 01:49:34 +02:00
turboderp
750c85e2c7
Fixes to allow quantizing Granite
2024-05-09 02:31:21 +02:00
turboderp
b68c0bd89b
Fix checkpoint interval
2024-04-18 23:18:58 +02:00
turboderp
5c1fcb693e
Quant: Ignore OoM error during second sanity check
2024-04-05 21:49:42 +02:00
turboderp
672c7355a3
Quant: Change snapshot to time instead of layer interval
2024-04-05 21:49:10 +02:00
turboderp
5d9732165e
Quant: Swap some state to CPU and attempt to keep more VRAM available in places
2024-04-05 21:48:08 +02:00
turboderp
2a5533de3f
Quant: Option to load linear layer without allocating scratch space
2024-04-05 21:45:47 +02:00
turboderp
88843a5633
Quant: Offload quanting of very large layers to second GPU
2024-04-05 21:44:21 +02:00
turboderp
ff2ff0a407
Fix typo
2024-03-29 19:44:11 +01:00
turboderp
4845b1e89d
Quant: Save some memory when preparing quantizers for experts
2024-03-29 18:14:48 +01:00
turboderp
7baf3d4198
Adjust warning threshold for uncalibrated experts
2024-03-29 18:12:34 +01:00
turboderp
d8871e9ba1
Quantize: Use RTN mode for tensors > 1e9 elements
2024-03-19 18:25:41 +01:00
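The RTN (round-to-nearest) fallback mentioned in the commit above can be sketched as a minimal symmetric quantizer. This is an illustration of the general technique only, not exllamav2's actual implementation; the function names and the plain-list representation are hypothetical:

```python
def rtn_quantize(weights, bits=4):
    # Symmetric round-to-nearest: derive one scale from the largest
    # magnitude, then round each weight onto the integer grid
    # [-qmax, qmax]. Assumes at least one nonzero weight.
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    q = [round(w / scale) for w in weights]
    return q, scale

def rtn_dequantize(q, scale):
    # Reconstruction error per weight is bounded by scale / 2.
    return [v * scale for v in q]
```

RTN needs no calibration data or Hessian statistics, which is why it is a practical fallback for tensors too large to quantize with the more expensive error-minimizing procedure.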
turboderp
fe7be9ecef
MoE: Adjust calibration warning threshold
2024-03-19 18:24:11 +01:00
turboderp
a724caf978
Quantize: bit of cleanup
2024-03-19 18:23:33 +01:00
turboderp
f3ed1dfed4
Quantize: Memory optimizations
2024-03-19 18:22:23 +01:00
turboderp
9c47269913
Add parallel decoder block
2024-03-19 18:20:44 +01:00
turboderp
0b05686e76
Refactor, clean up and consolidate architecture logic
2024-03-06 02:46:47 +01:00
turboderp
dce84866e1
Support for StarCoder2, initial
2024-03-05 21:20:29 +01:00
turboderp
7af6494afa
Drop device tensors for head layer during conversion
2024-02-16 17:31:19 +01:00
turboderp
cedeb616ce
Support Qwen2
2024-02-15 20:50:24 +01:00
turboderp
702dd9740a
VRAM optimizations during quant
2024-02-15 20:03:47 +01:00
Ben Gorlick
6c49870ec0
Micro-optimization in file handling when saving checkpoints in quantize.py by using os.replace for atomic operations
2024-01-31 03:22:08 -08:00
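The atomic-save pattern referenced in the commit above can be sketched as follows. The function name and JSON payload are illustrative, not the actual quantize.py code; only the use of `os.replace` reflects the commit:

```python
import json
import os
import tempfile

def save_checkpoint_atomic(state, path):
    # Write to a temp file in the destination directory, then swap it
    # into place with os.replace(), which is atomic on both POSIX and
    # Windows. A crash mid-write never leaves a truncated checkpoint
    # at `path`; the old file survives until the new one is complete.
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
```

Creating the temp file in the same directory as the target matters: `os.replace` is only guaranteed atomic within a single filesystem.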
turboderp
9c3fd9df3a
Make quantizer sanity check slightly more forgiving
2024-01-30 20:24:40 +01:00
turboderp
7a9d12ae4c
Add non-RMS layernorm, support for Orion
2024-01-22 17:21:01 +01:00
turboderp
1f71d17b89
Use .union() for Python 3.8 compatibility
2024-01-20 06:22:14 +01:00
turboderp
41b15dd1c3
Refactor to consolidate attn params
2024-01-04 04:52:49 +01:00
turboderp
f4fe920a50
Reset snapshot interval
2023-12-27 17:23:58 +01:00
turboderp
02ce583318
Optimize VRAM usage a bit for quantizer
2023-12-26 00:00:37 +01:00
turboderp
b121ee418f
Fix typo
2023-12-17 10:40:11 +01:00
turboderp
d2753a29b8
Mixtral EXL2 support, initial
2023-12-16 16:50:50 +01:00
turboderp
104c367451
Quantizer experiments
2023-12-14 01:29:12 +01:00
turboderp
0d63d6479c
Rework quantization and optimization
2023-12-13 01:00:11 +01:00
turboderp
303d90b65e
Fix regression
2023-12-10 19:51:59 +01:00
turboderp
3c43bad57f
Revert some changes, calibrate to q state again (fixes 70B low bitrate)
2023-12-10 17:34:18 +01:00
turboderp
a89d85a803
Faster and more stable quant
2023-12-10 03:29:06 +01:00
turboderp
2e91239571
New quant optimization procedure
2023-12-08 20:19:57 +01:00
turboderp
644805adba
Reduce VRAM usage when quantizing
2023-12-02 17:18:53 +01:00
turboderp
f0c01a328b
Skip stats for head layer to save system RAM
2023-11-30 19:40:45 +01:00
kingbri
6bfcefe940
Tree: Force utf8 when opening files
The default encoding on Linux is UTF-8, but Windows uses cp1252, which
isn't compatible with some Unicode characters.
Signed-off-by: kingbri <bdashore3@proton.me>
2023-11-29 19:21:29 -05:00
turboderp
24f00214c9
Revert last commit
2023-11-27 05:29:04 +01:00
turboderp
4ff25ec6ef
Ensure inference_mode while quanting
2023-11-26 19:22:40 +01:00
turboderp
714a19ca8f
Allow irregular group sizes
2023-11-26 16:53:29 +01:00
turboderp
385965ed61
Fix padding for extended vocab models
2023-09-22 00:21:49 +02:00
turboderp
c0dd3412d5
Compute full error for lm_head
2023-09-20 10:25:58 +02:00
turboderp
6fd006b9d0
More options for converter to facilitate scripting
2023-09-18 18:25:30 +02:00
turboderp
5c247f93aa
More memory tweaks, made swapping states to CPU the default to accommodate quanting 70B on 24GB GPUs
2023-09-16 19:15:29 +02:00
turboderp
9e55e44bcb
Optimize memory usage when quantizing, increase damping factor
2023-09-16 15:08:55 +02:00