Commit Graph

732 Commits

Author SHA1 Message Date
turboderp
89b841dd8a safetensors_alt: Allow writing bfloat16 tensors 2026-02-10 17:47:44 +01:00
turboderp
6e4202eade Bump to v0.0.21 v0.0.21 2026-02-09 22:19:02 +01:00
turboderp
f9a7448366 Merge branch 'refs/heads/st_test' into dev 2026-02-09 04:35:00 +01:00
turboderp
d3e02500e0 Sigmoid+proj kernel: fix regression (Qwen3-Next) 2026-02-09 04:34:37 +01:00
turboderp
d85690204a Replacement safetensors lib for quantization 2026-01-27 00:52:54 +01:00
turboderp
428a082276 Add performance test 2026-01-22 23:28:53 +01:00
turboderp
91a11853cd Update README.md 2026-01-22 23:27:23 +01:00
turboderp
96ba966ad9 Bump to v0.0.20 v0.0.20 2026-01-19 23:21:59 +01:00
turboderp
0ecc37bf97 Fix ComboSampler init when initializing as greedy 2026-01-19 22:57:19 +01:00
turboderp
75ee2c78c3 Add Qwen2_5_VLForConditionalGeneration, refactor HCXVisionV2VisionModel as subclass of Qwen2_5VLVisionModel 2026-01-19 22:48:49 +01:00
turboderp
5a6975747f Bump to v0.0.19 v0.0.19 2026-01-16 23:28:09 +01:00
turboderp
c39616a7b5 Merge pull request #125 from amanwalksdownthestreet/fix-arch-suffix-parsing
arch_list: Strip NVIDIA arch suffixes (sm_120a, sm_90a, etc.)
2026-01-14 22:11:43 +01:00
turboderp
f21b92e978 Add Adaptive-P sampler 2026-01-14 21:58:34 +01:00
turboderp
0d09af403a Diversity test: use greedy sampling for extraction 2026-01-14 21:40:31 +01:00
Jo-Philipp Wich
4845c8fa25 Add tensor parallel support for MiniMax M2 Q/K norms
MiniMax M2 uses Q/K RMSNorm with span_heads=True, which normalizes
across ALL heads at each sequence position. When using tensor
parallelism, heads are split across devices, so each device only
sees a subset of heads and computes an incorrect local variance.

The fix follows vLLM's approach:
- Compute local sum of squares on each TP rank
- All-reduce the sum across ranks
- Divide by global dimension to get true global mean
- Apply normalization with corrected global variance

Key changes:
- attn.py: Add apply_qk_norms_tp() method with variance all-reduce
- attn.py: Modify tp_export/tp_import to handle span_heads norms
- rmsnorm.py: Preserve span_heads in tp_export, handle 1D tensors in split
- minimax_m2.py: Enable TP support (supports_tp: True)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 10:38:59 +00:00
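The variance all-reduce described in the commit above can be sketched in pure Python. This is a minimal illustration, not the repository's actual code: ranks are simulated as list shards, the all-reduce is an ordinary sum, and all names (`rmsnorm_tp`, `x_shards`) are hypothetical.

```python
import math

def rmsnorm_tp(x_shards, weight_shards, eps=1e-6):
    """Span-heads RMSNorm under simulated tensor parallelism.

    Each entry of x_shards is the slice of the hidden vector held by
    one TP rank. The RMS statistic must cover the FULL dimension, so
    each rank contributes a local sum of squares that is all-reduced
    (here: a plain sum) before dividing by the global dimension.
    """
    global_dim = sum(len(s) for s in x_shards)
    # 1) local sum of squares on each "rank"
    local_sq = [sum(v * v for v in s) for s in x_shards]
    # 2) all-reduce (sum) across ranks
    global_sq = sum(local_sq)
    # 3) divide by the global dimension for the true mean of squares
    inv_rms = 1.0 / math.sqrt(global_sq / global_dim + eps)
    # 4) apply normalization locally with the corrected statistic
    return [[v * inv_rms * w for v, w in zip(s, ws)]
            for s, ws in zip(x_shards, weight_shards)]
```

Splitting the input across two simulated ranks yields the same result as running on a single rank, which is exactly the invariant the fix restores.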
turboderp
e839152802 Add diversity test 2026-01-11 19:12:04 +01:00
turboderp
3186dca9da generator: Pad token mask when output layer is padded 2026-01-11 19:11:26 +01:00
turboderp
9043690801 generator: Free recurrent state after job completed (prevent memory leak with large job queue) 2026-01-11 17:38:15 +01:00
turboderp
e69d91b12b model_init: Add sampling args default overrides 2026-01-11 16:38:33 +01:00
turboderp
6b31fc00f5 Add HF tokenizer helper, refactor example 2026-01-11 12:49:12 +01:00
turboderp
288a98f5e3 Refactor sampler args for examples 2026-01-11 12:33:27 +01:00
turboderp
27c68d4e65 Update README.md 2026-01-10 15:59:46 +01:00
turboderp
539410a2a3 Support NanoChatForCausalLM 2026-01-10 15:59:08 +01:00
turboderp
3ecb9f54fb Merge pull request #136 from mindkrypted/feature/support-solar-open-moe
Add support for SolarOpenMoE architecture
2026-01-10 10:55:36 +01:00
mindkrypted
fd8659a6c3 Add support for SolarOpenMoE architecture 2026-01-07 13:45:23 -05:00
turboderp
703b05ab52 Update README.md 2026-01-06 16:08:23 +01:00
turboderp
a17d1a4334 Add HCXVisionV2ForCausalLM architecture 2026-01-06 16:01:54 +01:00
turboderp
7de8641fce Attn: Add varlen mode 2026-01-06 16:01:54 +01:00
turboderp
a026b32df3 Support IQuestCoderForCausalLM 2026-01-04 12:31:58 +01:00
turboderp
6e75e7b151 chat.py: Fix for models with eos_token_id=null 2026-01-04 02:02:10 +01:00
turboderp
227621e49e Support HyperCLOVAXForCausalLM 2026-01-03 03:22:50 +01:00
turboderp
a92cf0a13a Attn: Support custom softmax scale in SDPA mode 2026-01-03 03:22:13 +01:00
turboderp
cff5fd542c Embedding: Support embedding multiplier 2026-01-03 03:21:55 +01:00
turboderp
452803e73d Olmo3: Use default RoPE type for SWA layers 2025-12-26 21:38:56 +01:00
turboderp
195d01657a RoPE: Allow RoPE type override 2025-12-26 21:38:34 +01:00
turboderp
e8b77bba4a chat.py: Fix prompt tokens/s display 2025-12-25 23:18:50 +01:00
turboderp
80907797a5 chat.py: Add debug mode 2025-12-25 23:18:25 +01:00
turboderp
f0ea2ca858 Linear: Support new FP8 scale format 2025-12-23 21:05:05 +01:00
turboderp
2698a83022 RoPE: Let arch override theta key name 2025-12-23 21:04:41 +01:00
amanwalksdownthestreet
65cfaf3c60 arch_list: Strip NVIDIA arch suffixes (sm_120a, sm_90a, etc.) 2025-12-16 23:06:34 -07:00
turboderp
a32e2219af Allow -hb 16 while quantizing 2025-12-13 20:55:25 +01:00
turboderp
104268521c Support Olmo3ForCausalLM 2025-12-13 20:49:03 +01:00
turboderp
bd0f26cd0e Fix comments 2025-12-10 21:47:30 +01:00
turboderp
1b7009c5b8 Merge remote-tracking branch 'origin/master' v0.0.18 2025-12-10 10:43:17 +01:00
turboderp
f9d0e6038f Bump to v0.0.18 2025-12-10 10:42:41 +01:00
turboderp
d8be5d638f chat.py: Read all stop conditions from config.json 2025-12-10 00:53:45 +01:00
turboderp
9b75bc5f58 Support Ministral3ForCausalLM 2025-12-10 00:53:22 +01:00
turboderp
9663357c4f Convert: Print some more RoPE debug info 2025-12-10 00:52:49 +01:00
turboderp
24caf2c762 RoPE: Accept partial_rotary_factor in rope_parameters 2025-12-10 00:52:29 +01:00
kingbri
e49c02a3aa Actions: Add builds for torch 2.9
Signed-off-by: kingbri <8082010+kingbri1@users.noreply.github.com>
2025-12-09 15:03:25 -05:00