exllamav2

mirror of https://github.com/turboderp-org/exllamav2.git synced 2026-05-04 21:21:25 +00:00

Author	SHA1	Message	Date
turboderp	5d5d57083e	Increase quant tolerance slightly (for small Qwen2 models esp.)	2024-06-13 20:44:51 +02:00
turboderp	60eedf4622	Add exit status code for quant error	2024-06-13 20:43:49 +02:00
turboderp	56ae879187	Merge branch 'refs/heads/202405-cached-states' into dev	2024-06-03 22:45:05 +02:00
Karl-Johan Alm	0ece2f3006	add layer GPU offloading for hidden/target states	2024-05-22 15:09:57 +09:00
Karl-Johan Alm	b428e239ed	optimization: put rfn_sum on cuda and do .item() call out of for loop	2024-05-22 12:43:21 +09:00
turboderp	463373ba1e	Merge branch 'master' into dev	2024-05-20 01:50:50 +02:00
turboderp	83baa98ed9	Add machine-parseable output to convert script	2024-05-20 01:49:34 +02:00
turboderp	b1e092af10	Update TODO: items	2024-05-18 06:43:27 +02:00
turboderp	a847f48720	Allow quantizing models with max_seq_len < 2048	2024-05-09 17:25:28 +02:00
turboderp	750c85e2c7	Fixes to allow quantizing Granite	2024-05-09 02:31:21 +02:00
turboderp	0d8bac53ee	Cleanup	2024-04-26 23:24:45 +02:00
turboderp	e85404fbfd	Quant: Slight VRAM optimization, don't scale H needlessly	2024-04-24 19:11:53 +02:00
turboderp	b68c0bd89b	Fix checkpoint interval	2024-04-18 23:18:58 +02:00
turboderp	893e73c360	Fix scale rounding during quant	2024-04-18 22:35:57 +02:00
turboderp	b112b210aa	Quant: Add a little more damping	2024-04-18 09:19:46 +02:00
turboderp	740a19a27c	Optimize: More robust solver	2024-04-09 23:13:21 +02:00
turboderp	52bc008df9	Don't add metadata when -cf not specified	2024-04-06 17:28:57 +02:00
turboderp	63394ab8a5	Optimizer: Add accuracy bias to first layer	2024-04-06 08:01:43 +02:00
turboderp	5c1fcb693e	Quant: Ignore OoM error during second sanity check	2024-04-05 21:49:42 +02:00
turboderp	672c7355a3	Quant: Change snapshot to time instead of layer interval	2024-04-05 21:49:10 +02:00
turboderp	5d9732165e	Quant: Swap some state to CPU and attempt to keep more VRAM available in places	2024-04-05 21:48:08 +02:00
turboderp	b8c267e224	Quant: Perform H perm on CPU when H is very large	2024-04-05 21:46:20 +02:00
turboderp	2a5533de3f	Quant: Option to load linear layer without allocating scratch space	2024-04-05 21:45:47 +02:00
turboderp	88843a5633	Quant: Offload quanting of very large layers to second GPU	2024-04-05 21:44:21 +02:00
turboderp	97e8123c71	Enable head (qk) norms for quantized models	2024-04-05 21:35:23 +02:00
turboderp	ff2ff0a407	Fix typo	2024-03-29 19:44:11 +01:00
turboderp	c92ffcfc6a	Quant: Swap hessians and weights to system RAM	2024-03-29 18:15:54 +01:00
turboderp	4845b1e89d	Quant: Save some memory when preparing quantizers for experts	2024-03-29 18:14:48 +01:00
turboderp	7baf3d4198	Adjust warning threshold for uncalibrated experts	2024-03-29 18:12:34 +01:00
turboderp	762d1e4f25	Fix typehints	2024-03-29 18:11:55 +01:00
turboderp	d8871e9ba1	Quantize: Use RTN mode for tensors > 1e9 elements	2024-03-19 18:25:41 +01:00
turboderp	fe7be9ecef	MoE: Adjust calibration warning threshold	2024-03-19 18:24:11 +01:00
turboderp	a724caf978	Quantize: bit of cleanup	2024-03-19 18:23:33 +01:00
turboderp	f3ed1dfed4	Quantize: Memory optimizations	2024-03-19 18:22:23 +01:00
turboderp	9c47269913	Add parallel decoder block	2024-03-19 18:20:44 +01:00
turboderp	46c59d0d42	Quantize: RTN mode for very large head layers	2024-03-19 17:45:29 +01:00
turboderp	6a0c5a5aa7	Quantize: Perform very large act-order permutations on CPU	2024-03-19 17:42:39 +01:00
turboderp	5fb2c679cb	Add quantization_config to config.json when compiling	2024-03-12 09:09:30 +01:00
turboderp	0b05686e76	Refactor, clean up and consolidate architecture logic	2024-03-06 02:46:47 +01:00
turboderp	dce84866e1	Support for StarCoder2, initial	2024-03-05 21:20:29 +01:00
turboderp	2044f8a31c	Set inference_mode when compiling model	2024-02-22 10:48:44 +01:00
turboderp	7af6494afa	Drop device tensors for head layer during conversion	2024-02-16 17:31:19 +01:00
turboderp	cedeb616ce	Support Qwen2	2024-02-15 20:50:24 +01:00
turboderp	702dd9740a	VRAM optimizations during quant	2024-02-15 20:03:47 +01:00
turboderp	0e9d9c1010	Prevent tensors passed to save_file from sharing memory	2024-02-01 10:14:36 +01:00
turboderp	8a0cb9e01d	Add last saved checkpoint to status box	2024-02-01 04:56:33 +01:00
turboderp	4c93ce852f	Fix remaining time estimate	2024-02-01 04:56:00 +01:00
turboderp	735807e800	Use os.replace to swap checkpoint states in measure.py as well	2024-02-01 04:39:34 +01:00
turboderp	1e70113de3	Don't print avg accuracy, clarify "completed" -> "measured"	2024-02-01 04:24:10 +01:00
Ben Gorlick	6c49870ec0	Micro-optimization in file handling when saving checkpoints in quantize.py by using os.replace for atomic operations	2024-01-31 03:22:08 -08:00

114 Commits