Commit Graph

792 Commits

Author SHA1 Message Date
turboderp  ea87af6ea8  Bump to v0.0.28 (tag: v0.0.28)  2026-03-30 22:03:01 +02:00
turboderp  8b15b32af6  MoE kernel: Include instances for dims not divisible by 256, addresses  2026-03-30 21:58:23 +02:00
turboderp  db082c5d32  Cleanup  2026-03-30 21:17:19 +02:00
turboderp  d5ad174d8f  Quantize: Retry cholesky if H not positive-definite  2026-03-30 21:17:04 +02:00
turboderp  3423357bc6  Sampling: Fix argmax for sorted logits, remove redundant norm before log_gumbel  2026-03-30 21:16:12 +02:00
turboderp  863f96bcae  Bump to v0.0.27 (tag: v0.0.27)  2026-03-26 02:15:00 +01:00
turboderp  4b58e05fdc  Merge branch 'refs/heads/dev' into fork/Katehuuh/nanochat-ve-scalars (conflicts: exllamav3/modules/transformer.py)  2026-03-25 22:47:16 +01:00
turboderp  2a99bbe35f  Nanochat: Fix quantization  2026-03-25 22:43:53 +01:00
turboderp  317086503d  Norm: Fix unweighted norm if input dtype != at::kHalf  2026-03-25 21:09:36 +01:00
turboderp  8daabfc207  Nanochat: Rework/refactor new features implementation  2026-03-25 21:09:02 +01:00
turboderp  b381e15ccb  Generator: Give requeued jobs priority on the pending list  2026-03-25 02:05:59 +01:00
turboderp  7d80c39a45  Add IFBench eval  2026-03-25 01:58:45 +01:00
Katehuuh   8aca86c4a3  nanochat: VE, residual scalars, backout; auto-detect key format  2026-03-24 14:20:00 +01:00
turboderp  03d9aaf3f8  Generator: Ensure recurrent checkpoint after every prefill chunk, even if chunks aren't aligned with checkpoint intervals  2026-03-23 21:49:11 +01:00
turboderp  97b0bbc5c0  Docs: Remove outdated quant duration estimates  2026-03-22 23:18:13 +01:00
turboderp  77a42495a5  Conversion: Rework allocation strategy for noninteger bitrates, add --hq mode  2026-03-22 23:17:49 +01:00
turboderp  936483ece2  Generator: Decrease defrag frequency  2026-03-22 18:16:58 +01:00
turboderp  1592d04ffd  Tests: Fix up generator stress test  2026-03-22 18:16:33 +01:00
turboderp  0b898f2cc0  Generator: Fix last recurrent checkpoints not hitting page boundary  2026-03-22 18:02:48 +01:00
turboderp  3e18c72d9e  Generator: Enforce recurrent_checkpoint_interval <= max_chunk_size  2026-03-22 18:02:48 +01:00
turboderp  15647d98d7  Sampling: Fix possible divide-by-zero in rep. penalty kernels  2026-03-22 18:02:48 +01:00
turboderp  d706467d85  recompile.py: Allow overriding tensors defined by the model/architecture but missing from an incomplete input model's SafetensorsCollection  2026-03-22 02:57:54 +01:00
turboderp  8a36ee8b9a  chat.py: Add Qwen3.5-specific ChatML template  2026-03-21 14:16:09 +01:00
turboderp  4fa57eaaeb  Tokenizer: Fix regression in HF template helper  2026-03-17 02:36:16 +01:00
turboderp  ba1ad9ac66  Bump to v0.0.26 (tag: v0.0.26)  2026-03-16 19:55:04 +01:00
turboderp  a31d2187fc  chat.py: Add probs option  2026-03-16 02:30:14 +01:00
turboderp  2cac1d612d  Sampler: Make sure probs are normalized before log gumbel  2026-03-16 02:28:57 +01:00
turboderp  517c2db5a0  BlockSparseMLP: Work around NVCC constexpr quirk  2026-03-15 20:01:44 +01:00
turboderp  e54c1b8b7a  BlockSparseMLP: Tune kernel size  2026-03-15 17:27:59 +01:00
turboderp  05e2541bb8  BlockSparseMLP: Allow fused path for module with mixed bitrates  2026-03-15 01:46:27 +01:00
turboderp  48de29c05b  BlockSparseMLP: Fix regression when loading to single device  2026-03-15 01:39:30 +01:00
turboderp  5f54aa5f57  convert.py: Fix overflow when mixing bitrates for expert-heavy models  2026-03-15 00:29:37 +01:00
turboderp  cd94bf8f8f  Step3.5: Fix negative activation limit  2026-03-14 23:14:27 +01:00
turboderp  0e61e43e0f  BlockSparseMLP: Add fused MoE kernel  2026-03-14 23:14:27 +01:00
turboderp  fff187224b  Model: Drop all refs to shared tensors after model load  2026-03-14 22:31:36 +01:00
turboderp  3f3c0bc325  Mixtral: Fix MoE out_dtype  2026-03-14 22:29:11 +01:00
turboderp  ebd2efb6bd  chat.py: Random benchmark question feature  2026-03-13 04:31:19 +01:00
turboderp  aaf6337f12  Add OlmoHybridForCausalLM  2026-03-13 00:59:10 +01:00
turboderp  f674142ff3  Qwen3.5: Don't set mrope flag if no vision tower  2026-03-13 00:46:27 +01:00
turboderp  bdfb7929f4  GatedDeltaNet: Support split qkv and conv1d weights  2026-03-13 00:45:59 +01:00
turboderp  f4c56f8c6d  GatedDeltaNet: Handle head sizes up to 256, divisible by down to 32, support beta scale (linear_allow_neg_eigval)  2026-03-13 00:44:37 +01:00
turboderp  f83a9ae242  GatedRMSNorm: Use single warp for head size up to 256  2026-03-13 00:40:05 +01:00
turboderp  42d0854c39  convert.py: Compactify display of module tree  2026-03-13 00:37:39 +01:00
turboderp  1404f7aa48  Bump to v0.0.25 (tag: v0.0.25)  2026-03-11 23:48:47 +01:00
turboderp  9db029ded5  Separate transpose options for fused expert weights (account for differences between Qwen3Moe and Qwen3_5Moe)  2026-03-11 21:43:45 +01:00
turboderp  e05f4636ee  Add Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM  2026-03-11 21:00:23 +01:00
turboderp  1b9e58c9b5  BlockSparseMLP: Skip redundant gather  2026-03-11 20:25:56 +01:00
turboderp  d52c49c17f  GatedDeltaNet: Allow bfloat16 a_log  2026-03-11 20:24:04 +01:00
turboderp  ad546f7937  Bump to v0.0.24 (tag: v0.0.24)  2026-03-08 20:39:35 +01:00
turboderp  63ba4d005c  Generator: If model is recurrent, run last page of prompt in a separate forward pass to create checkpoint (ensures at most 255 tokens have to be reingested per request)  2026-03-07 23:32:42 +01:00