exllamav3

Mirror of https://github.com/turboderp-org/exllamav3.git (last synced 2026-04-20 14:29:51 +00:00)

Author	SHA1	Message	Date
turboderp	36a636b478	Refactor and rework Gemma4 implementation: - Remove custom quant cache layer stuff for now (cache quant needs to be tested with all the new changes) - Move preprocessing to separate util module - Replace dedicated Gemma4 modules with existing generic modules, make necessary adjustments: - SDPA fallback triggers whenever head_dim > 512 (xformers also added, but its GQA impl. is buggy and needs an annoying workaround that slows it down a lot) - Add necessary extra norms, new transpose args and second residual channel to BlockSparseMLP (dense_mlp becomes shared expert instead) - Add layer scalar per decoder block - Don't apply embedding multiplier to embedded MM tokens - Ensure embedding scaling exactly matches HF bfloat16 version Vision stuff: - Handle non-causal attention in multimodal spans with multiple (flash) attn passes rather than custom mask. - Avoid extending chunk size past the first MM span (allow small amount of redundant processing to keep VRAM overhead relatively constant.) - Fold Gemma4VisionStandardize into Gemma4VisionPooler - Replace Gemma4VisionProjector with RMSNorm+Linear modules - Use 2D RoPE in kernel instead of precomputed sin,cos tensors - Use non-causal attention with no mask (HF reference pads all embeddings to the same size of 280 tokens and then has to apply a custom attn mask to make that work, but the padding tokens are discarded anyway so there's no point)	2026-04-08 03:59:52 +02:00
turboderp	aaf6337f12	Add OlmoHybridForCausalLM	2026-03-13 00:59:10 +01:00
turboderp	60afe8d983	Update README.md	2026-03-07 20:48:24 +01:00
turboderp	e12e6bd759	Update README.md	2026-03-02 15:51:58 +01:00
turboderp	88dcdf782d	Update README.md	2026-03-02 03:49:28 +01:00
turboderp	91a11853cd	Update README.md	2026-01-22 23:27:23 +01:00
turboderp	27c68d4e65	Update README.md	2026-01-10 15:59:46 +01:00
turboderp	703b05ab52	Update README.md	2026-01-06 16:08:23 +01:00
turboderp	a026b32df3	Support IQuestCoderForCausalLM	2026-01-04 12:31:58 +01:00
turboderp	227621e49e	Support HyperCLOVAXForCausalLM	2026-01-03 03:22:50 +01:00
turboderp	104268521c	Support Olmo3ForCausalLM	2025-12-13 20:49:03 +01:00
turboderp	9b58b45999	Update README.md	2025-11-13 17:23:41 +01:00
turboderp	08a82f36a3	Add Glm4V architecture	2025-11-13 13:00:19 +01:00
turboderp	98e1c4017c	Update README.md	2025-11-09 23:02:08 +01:00
turboderp	3533b307c3	Update README.md	2025-11-01 19:42:18 +01:00
turboderp	8098d619f6	Update README.md	2025-09-28 18:23:39 +02:00
turboderp	8c71b0aa57	GatedDeltaNet: Fused kernel for splitting inputs, casting, applying sigmoid etc.	2025-09-21 05:03:43 +02:00
turboderp	b25082a0be	Update README	2025-09-19 19:24:01 +02:00
turboderp	8f28558eed	Update README.md	2025-09-04 02:50:37 +02:00
turboderp	d8213dc04c	Update README.md	2025-09-03 02:16:43 +02:00
turboderp	d8167b0cf4	Update README.md	2025-08-26 13:34:01 +02:00
turboderp	26d89e36e1	Update README	2025-08-01 19:27:42 +02:00
turboderp	2957e20cad	Update README.md	2025-07-12 18:33:18 +02:00
turboderp	0e8ad5418d	Update readme	2025-06-02 02:31:04 +02:00
turboderp	d1e3b2b20e	Update README	2025-05-14 17:53:44 +02:00
turboderp	3b4c3d4dde	Update readme	2025-04-27 01:28:19 +02:00
turboderp	cf84811485	Add cache quantization	2025-04-22 21:52:33 +02:00
turboderp	c0e9242315	Update installation instructions	2025-04-17 19:37:07 +02:00
turboderp	5b900442fe	Fix typo	2025-04-12 00:38:31 +02:00
turboderp	ae9f910be2	Help for conversion script	2025-04-11 00:26:03 +02:00
turboderp	08cf8e91ce	Cleanup and comments	2025-04-09 14:18:23 +02:00
turboderp	f579de4923	Link to model collection	2025-04-06 22:13:32 +02:00
turboderp	90db82667e	Typo	2025-04-06 15:21:55 +02:00
turboderp	942ecaf18b	Formatting	2025-04-06 15:08:25 +02:00
turboderp	543c4b2771	Initial commit	2025-04-06 14:42:49 +02:00

35 Commits
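The Gemma4 refactor commit (36a636b478) drops the HF reference's padding of vision embeddings to 280 tokens and its custom attention mask. The stated reason is that the padding tokens are discarded anyway. The minimal pure-Python sketch below illustrates why that is safe for non-causal attention. It compares two cases: attending over the real tokens alone with no mask, and padding the sequence, masking out the padded keys, then dropping the padded rows. Both give the same outputs for the real tokens. This is not exllamav3 code. The helpers `softmax`, `attention` and `rand_vecs`, the single-head setup, and the small sizes (5 real tokens padded to 12 rather than 280) are hypothetical choices made for illustration.

```python
import math
import random

def softmax(scores):
    # Numerically stable softmax; entries of -inf get weight exactly 0.0
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(q, k, v, key_mask=None):
    """Hypothetical helper: single-head, non-causal scaled dot-product attention.

    q, k, v are lists of equal-length float vectors. If key_mask is given,
    key j is excluded whenever key_mask[j] is False."""
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for qi in q:
        scores = []
        for j, kj in enumerate(k):
            if key_mask is not None and not key_mask[j]:
                scores.append(float("-inf"))  # masked key: zero weight after softmax
            else:
                scores.append(scale * sum(a * b for a, b in zip(qi, kj)))
        weights = softmax(scores)
        # Output row is the attention-weighted sum of the value vectors
        out.append([sum(w * vj[c] for w, vj in zip(weights, v)) for c in range(len(v[0]))])
    return out

def rand_vecs(n, d, rng):
    # Hypothetical helper: n random vectors of dimension d
    return [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]

rng = random.Random(0)
d, n_real, n_padded = 8, 5, 12  # illustrative sizes; the HF reference pads to 280
q, k, v = (rand_vecs(n_real, d, rng) for _ in range(3))

# Approach 1: attend over the real tokens only, no mask
unpadded = attention(q, k, v)

# Approach 2: append arbitrary padding tokens, mask them out as keys,
# then discard the padding rows of the output
n_pad = n_padded - n_real
qp = q + rand_vecs(n_pad, d, rng)
kp = k + rand_vecs(n_pad, d, rng)
vp = v + rand_vecs(n_pad, d, rng)
mask = [True] * n_real + [False] * n_pad
padded = attention(qp, kp, vp, key_mask=mask)[:n_real]

max_diff = max(abs(a - b) for ra, rb in zip(unpadded, padded) for a, b in zip(ra, rb))
print(f"max abs difference: {max_diff:.3e}")
```

The printed difference should be no larger than floating-point rounding error. Masked keys get zero attention weight, so the padding values never reach the real tokens' outputs. The padded query rows are simply thrown away. This sketch covers only the attention step inside a single non-causal span.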