turboderp
93695e9a7d
RMSNorm/RoPE kernels: Allow BF16/FP32 norm weights
2026-03-02 03:49:13 +01:00
turboderp
e2f4198406
Formatting
2026-03-02 00:53:23 +01:00
turboderp
08ca454ec0
Step 3.5: Fix TP split
2026-03-01 21:32:59 +01:00
turboderp
6386de7a9b
Add Step3p5ForCausalLM
2026-03-01 17:59:28 +01:00
turboderp
76937421ec
convert.py: Make out_scales the default, with options for auto and disable
2026-03-01 17:57:55 +01:00
turboderp
c8c2e6178c
chat.py: Catbench shortcut
2026-03-01 17:57:55 +01:00
turboderp
99f792dce0
Add custom activation limits
2026-03-01 17:57:55 +01:00
turboderp
b272ea3515
Remove C-style conditionals
2026-03-01 15:12:33 +01:00
turboderp
18b2a23d8a
chat.py: Fix error message
2026-03-01 15:10:22 +01:00
turboderp
b0cfe46702
Config: Allow for interpreting config key with incorrect data type as missing key (for weirdly implemented layerwise RoPE settings in some models)
2026-03-01 03:16:32 +01:00
turboderp
489b3aab12
BlockSparseMLP: Allow loading combined experts tensors also when gate and up are not fused
2026-03-01 03:13:56 +01:00
turboderp
4bdd22ea77
BlockSparseMLP: Make sure bias is always applied during calibration
2026-03-01 03:13:03 +01:00
turboderp
f7ccb524e7
Attn: Support headwise gate
2026-03-01 03:12:03 +01:00
turboderp
447c8bb522
Build actions: Add torch 2.10.0 wheels
2026-02-28 23:53:15 +01:00
turboderp
8ef7f4b5dd
Linear: Allow fusing linear layers during unquantized model load
2026-02-22 22:43:34 +01:00
turboderp
c1b16d2fc9
Loader: Allow checking for lists of tensor groups
2026-02-22 22:42:30 +01:00
turboderp
ea1fe0ccea
Cleanup
2026-02-22 15:14:57 +01:00
turboderp
ed5bad7235
Alias __nv_bfloat16 -> bfloat16
2026-02-17 21:24:41 +01:00
turboderp
b2b6f37e12
perf.py: Error out if test length > cache size
2026-02-17 20:04:13 +01:00
turboderp
3f9c053227
Merge pull request #141
Add tensor parallel support for MiniMax M2 Q/K norms
2026-02-16 01:24:34 +01:00
turboderp
abb083ceb8
Merge pull request #103 from mratsim/patch-1
Add size estimation script for model tensors size
2026-02-15 17:58:50 +01:00
turboderp
ae3645c455
Merge pull request #147 from lesj0610/feat/hf-chat-template-compat
Tokenizer: robust HF chat template kwargs and output compatibility
2026-02-15 17:58:03 +01:00
turboderp
eca621af79
Merge remote-tracking branch 'origin/dev' into dev
2026-02-15 17:56:31 +01:00
turboderp
1744361cc2
Merge pull request #148 from lesj0610/fix/exaone4-swa-layer-types
exaone4: use layer_types as source of truth for SWA layer mapping
2026-02-15 17:55:08 +01:00
turboderp
44f70da0f9
Merge pull request #149 from MikeRoz47/dev
Add optional arg to compare_q.py for saving plot files
2026-02-15 17:53:55 +01:00
MikeRoz47
52c2f5794d
Add optional arg to compare_q to allow it to save plots rather than show them
2026-02-15 16:41:18 +00:00
lesj0610
5c076e5f2a
exaone4: prefer layer_types over pattern for SWA layer mapping
2026-02-12 01:48:52 +09:00
lesj0610
019d965eb6
tokenizer: harden HF chat template compatibility and kwargs passthrough
2026-02-12 01:25:30 +09:00
turboderp
701afb9294
Bump to v0.0.22
v0.0.22
2026-02-10 17:48:24 +01:00
turboderp
89b841dd8a
safetensors_alt: Allow writing bfloat16 tensors
2026-02-10 17:47:44 +01:00
turboderp
6e4202eade
Bump to v0.0.21
v0.0.21
2026-02-09 22:19:02 +01:00
turboderp
f9a7448366
Merge branch 'refs/heads/st_test' into dev
2026-02-09 04:35:00 +01:00
turboderp
d3e02500e0
Sigmoid+proj kernel: fix regression (Qwen3-Next)
2026-02-09 04:34:37 +01:00
turboderp
d85690204a
Replacement safetensors lib for quantization
2026-01-27 00:52:54 +01:00
turboderp
428a082276
Add performance test
2026-01-22 23:28:53 +01:00
turboderp
91a11853cd
Update README.md
2026-01-22 23:27:23 +01:00
turboderp
96ba966ad9
Bump to v0.0.20
v0.0.20
2026-01-19 23:21:59 +01:00
turboderp
0ecc37bf97
Fix ComboSampler init when initializing as greedy
2026-01-19 22:57:19 +01:00
turboderp
75ee2c78c3
Add Qwen2_5_VLForConditionalGeneration, refactor HCXVisionV2VisionModel as subclass of Qwen2_5VLVisionModel
2026-01-19 22:48:49 +01:00
turboderp
5a6975747f
Bump to v0.0.19
v0.0.19
2026-01-16 23:28:09 +01:00
turboderp
c39616a7b5
Merge pull request #125 from amanwalksdownthestreet/fix-arch-suffix-parsing
...
arch_list: Strip NVIDIA arch suffixes (sm_120a, sm_90a, etc.)
2026-01-14 22:11:43 +01:00
turboderp
f21b92e978
Add Adaptive-P sampler
2026-01-14 21:58:34 +01:00
turboderp
0d09af403a
Diversity test: use greedy sampling for extraction
2026-01-14 21:40:31 +01:00
Jo-Philipp Wich
4845c8fa25
Add tensor parallel support for MiniMax M2 Q/K norms
MiniMax M2 uses Q/K RMSNorm with span_heads=True, which normalizes
across ALL heads at each sequence position. When using tensor
parallelism, heads are split across devices, so each device only
sees a subset of heads and computes incorrect local variance.
The fix follows vLLM's approach:
- Compute local sum of squares on each TP rank
- All-reduce the sum across ranks
- Divide by global dimension to get true global mean
- Apply normalization with corrected global variance
Key changes:
- attn.py: Add apply_qk_norms_tp() method with variance all-reduce
- attn.py: Modify tp_export/tp_import to handle span_heads norms
- rmsnorm.py: Preserve span_heads in tp_export, handle 1D tensors in split
- minimax_m2.py: Enable TP support (supports_tp: True)
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
2026-01-12 10:38:59 +00:00
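The TP fix described in the commit above (local sum of squares, all-reduce across ranks, divide by the global dimension, then normalize) can be sketched as follows. This is a minimal illustration, not the actual `apply_qk_norms_tp()` implementation; the function name, the `all_reduce` callable, and the sharding layout are assumptions for the example.

```python
import torch

def rmsnorm_tp_span_heads(x_local, weight_local, global_dim, eps=1e-6, all_reduce=None):
    """Sketch of span-heads RMSNorm under tensor parallelism.

    x_local      -- this rank's shard of the activations (last dim is the local slice)
    weight_local -- this rank's shard of the norm weights
    global_dim   -- full, unsharded size of the normalized dimension
    all_reduce   -- callable that sums a tensor across TP ranks
                    (e.g. a wrapper around torch.distributed.all_reduce);
                    None means single-rank operation
    """
    # 1) Local sum of squares on this rank only.
    sumsq = x_local.float().pow(2).sum(dim=-1, keepdim=True)
    # 2) All-reduce the sums so every rank holds the global total.
    if all_reduce is not None:
        sumsq = all_reduce(sumsq)
    # 3) Divide by the global dimension to get the true global mean square;
    #    using the local dimension here is exactly the bug the commit fixes.
    mean_sq = sumsq / global_dim
    # 4) Normalize with the corrected global variance, then apply the weights.
    return x_local * torch.rsqrt(mean_sq + eps) * weight_local
```

With two simulated ranks, each rank's output matches the corresponding slice of a single-device RMSNorm over the full tensor, which is the invariant the fix restores.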
turboderp
e839152802
Add diversity test
2026-01-11 19:12:04 +01:00
turboderp
3186dca9da
generator: Pad token mask when output layer is padded
2026-01-11 19:11:26 +01:00
turboderp
9043690801
generator: Free recurrent state after job completed (prevent memory leak with large job queue)
2026-01-11 17:38:15 +01:00
turboderp
e69d91b12b
model_init: Add sampling args default overrides
2026-01-11 16:38:33 +01:00
turboderp
6b31fc00f5
Add HF tokenizer helper, refactor example
2026-01-11 12:49:12 +01:00
turboderp
288a98f5e3
Refactor sampler args for examples
2026-01-11 12:33:27 +01:00