Commit Graph

792 Commits

Author SHA1 Message Date
turboderp  ea87af6ea8  Bump to v0.0.28 (tag: v0.0.28)  2026-03-30 22:03:01 +02:00
turboderp  8b15b32af6  MoE kernel: Include instances for dims not divisible by 256, addresses  2026-03-30 21:58:23 +02:00
turboderp  db082c5d32  Cleanup  2026-03-30 21:17:19 +02:00
turboderp  d5ad174d8f  Quantize: Retry cholesky if H not positive-definite  2026-03-30 21:17:04 +02:00
turboderp  3423357bc6  Sampling: Fix argmax for sorted logits, remove redundant norm before log_gumbel  2026-03-30 21:16:12 +02:00
turboderp  863f96bcae  Bump to v0.0.27 (tag: v0.0.27)  2026-03-26 02:15:00 +01:00
turboderp  4b58e05fdc  Merge branch 'refs/heads/dev' into fork/Katehuuh/nanochat-ve-scalars (conflicts: exllamav3/modules/transformer.py)  2026-03-25 22:47:16 +01:00
turboderp  2a99bbe35f  Nanochat: Fix quantization  2026-03-25 22:43:53 +01:00
turboderp  317086503d  Norm: Fix unweighted norm if input dtype != at::kHalf  2026-03-25 21:09:36 +01:00
turboderp  8daabfc207  Nanochat: Rework/refactor new features implementation  2026-03-25 21:09:02 +01:00
turboderp  b381e15ccb  Generator: Give requeued jobs priority on the pending list  2026-03-25 02:05:59 +01:00
turboderp  7d80c39a45  Add IFBench eval  2026-03-25 01:58:45 +01:00
Katehuuh   8aca86c4a3  nanochat: VE, residual scalars, backout; auto-detect key format  2026-03-24 14:20:00 +01:00
turboderp  03d9aaf3f8  Generator: Ensure recurrent checkpoint after every prefill chunk, even if chunks aren't aligned with checkpoint intervals  2026-03-23 21:49:11 +01:00
turboderp  97b0bbc5c0  Docs: Remove outdated quant duration estimates  2026-03-22 23:18:13 +01:00
turboderp  77a42495a5  Conversion: Rework allocation strategy for noninteger bitrates, add --hq mode  2026-03-22 23:17:49 +01:00
turboderp  936483ece2  Generator: Decrease defrag frequency  2026-03-22 18:16:58 +01:00
turboderp  1592d04ffd  Tests: Fix up generator stress test  2026-03-22 18:16:33 +01:00
turboderp  0b898f2cc0  Generator: Fix last recurrent checkpoints not hitting page boundary  2026-03-22 18:02:48 +01:00
turboderp  3e18c72d9e  Generator: Enforce recurrent_checkpoint_interval <= max_chunk_size  2026-03-22 18:02:48 +01:00
turboderp  15647d98d7  Sampling: Fix possible divide-by-zero in rep. penalty kernels  2026-03-22 18:02:48 +01:00
turboderp  d706467d85  recompile.py: Allow overriding tensors defined by the model/architecture but missing from an incomplete input model's SafetensorsCollection  2026-03-22 02:57:54 +01:00
turboderp  8a36ee8b9a  chat.py: Add Qwen3.5-specific ChatML template  2026-03-21 14:16:09 +01:00
turboderp  4fa57eaaeb  Tokenizer: Fix regression in HF template helper  2026-03-17 02:36:16 +01:00
turboderp  ba1ad9ac66  Bump to v0.0.26 (tag: v0.0.26)  2026-03-16 19:55:04 +01:00
turboderp  a31d2187fc  chat.py: Add probs option  2026-03-16 02:30:14 +01:00
turboderp  2cac1d612d  Sampler: Make sure probs are normalized before log gumbel  2026-03-16 02:28:57 +01:00
turboderp  517c2db5a0  BlockSparseMLP: Work around NVCC constexpr quirk  2026-03-15 20:01:44 +01:00
turboderp  e54c1b8b7a  BlockSparseMLP: Tune kernel size  2026-03-15 17:27:59 +01:00
turboderp  05e2541bb8  BlockSparseMLP: Allow fused path for module with mixed bitrates  2026-03-15 01:46:27 +01:00
turboderp  48de29c05b  BlockSparseMLP: Fix regression when loading to single device  2026-03-15 01:39:30 +01:00
turboderp  5f54aa5f57  convert.py: Fix overflow when mixing bitrates for expert-heavy models  2026-03-15 00:29:37 +01:00
turboderp  cd94bf8f8f  Step3.5: Fix negative activation limit  2026-03-14 23:14:27 +01:00
turboderp  0e61e43e0f  BlockSparseMLP: Add fused MoE kernel  2026-03-14 23:14:27 +01:00
turboderp  fff187224b  Model: Drop all refs to shared tensors after model load  2026-03-14 22:31:36 +01:00
turboderp  3f3c0bc325  Mixtral: Fix MoE out_dtype  2026-03-14 22:29:11 +01:00
turboderp  ebd2efb6bd  chat.py: Random benchmark question feature  2026-03-13 04:31:19 +01:00
turboderp  aaf6337f12  Add OlmoHybridForCausalLM  2026-03-13 00:59:10 +01:00
turboderp  f674142ff3  Qwen3.5: Don't set mrope flag if no vision tower  2026-03-13 00:46:27 +01:00
turboderp  bdfb7929f4  GatedDeltaNet: Support split qkv and conv1d weights  2026-03-13 00:45:59 +01:00
turboderp  f4c56f8c6d  GatedDeltaNet: Handle head sizes up to 256, divisible by down to 32, support beta scale (linear_allow_neg_eigval)  2026-03-13 00:44:37 +01:00
turboderp  f83a9ae242  GatedRMSNorm: Use single warp for head size up to 256  2026-03-13 00:40:05 +01:00
turboderp  42d0854c39  convert.py: Compactify display of module tree  2026-03-13 00:37:39 +01:00
turboderp  1404f7aa48  Bump to v0.0.25 (tag: v0.0.25)  2026-03-11 23:48:47 +01:00
turboderp  9db029ded5  Separate transpose options for fused expert weights (account for differences between Qwen3Moe and Qwen3_5Moe)  2026-03-11 21:43:45 +01:00
turboderp  e05f4636ee  Add Qwen3_5ForCausalLM and Qwen3_5MoeForCausalLM  2026-03-11 21:00:23 +01:00
turboderp  1b9e58c9b5  BlockSparseMLP: Skip redundant gather  2026-03-11 20:25:56 +01:00
turboderp  d52c49c17f  GatedDeltaNet: Allow bfloat16 a_log  2026-03-11 20:24:04 +01:00
turboderp  ad546f7937  Bump to v0.0.24 (tag: v0.0.24)  2026-03-08 20:39:35 +01:00
turboderp  63ba4d005c  Generator: If model is recurrent, run last page of prompt in a separate forward pass to create checkpoint (ensures at most 255 tokens have to be reingested per request)  2026-03-07 23:32:42 +01:00