exllamav3

mirror of https://github.com/turboderp-org/exllamav3.git synced 2026-04-19 22:08:58 +00:00

Author	SHA1	Message	Date
JwinPBE	f7da9c58e1	update setup.py with current repository URL	2026-04-18 02:29:04 -04:00
turboderp	36a636b478	Refactor and rework Gemma4 implementation: Remove custom quant cache layer stuff for now (cache quant needs to be tested with all the new changes); Move preprocessing to separate util module; Replace dedicated Gemma4 modules with existing generic modules, make necessary adjustments: SDPA fallback triggers whenever head_dim > 512 (xformers also added, but its GQA impl. is buggy and needs an annoying workaround that slows it down a lot); Add necessary extra norms, new transpose args and second residual channel to BlockSparseMLP (dense_mlp becomes shared expert instead); Add layer scalar per decoder block; Don't apply embedding multiplier to embedded MM tokens; Ensure embedding scaling exactly matches HF bfloat16 version. Vision stuff: Handle non-causal attention in multimodal spans with multiple (flash) attn passes rather than custom mask; Avoid extending chunk size past the first MM span (allow small amount of redundant processing to keep VRAM overhead relatively constant); Fold Gemma4VisionStandardize into Gemma4VisionPooler; Replace Gemma4VisionProjector with RMSNorm+Linear modules; Use 2D RoPE in kernel instead of precomputed sin,cos tensors; Use non-causal attention with no mask (HF reference pads all embeddings to the same size of 280 tokens and then has to apply a custom attn mask to make that work, but the padding tokens are discarded anyway so there's no point)	2026-04-08 03:59:52 +02:00
turboderp	da2d335233	Attn: Add paged-attn fallbacks using xformers or SDPA for head_dim > 256	2026-04-07 22:46:12 +02:00
turboderp	57389c5b21	Refactor architecture-specific modules into own directory	2026-04-07 22:46:12 +02:00
turboderp	d908a6c439	Convert: Increase default calibration to 250 rows, add more cal data	2025-10-12 14:12:59 +02:00
turboderp	4356527867	Pin pydantic to 2.11.0	2025-10-09 11:00:25 +02:00
turboderp	9933736be6	TP: Split AVX2 code from .cu objects	2025-09-25 01:47:49 +02:00
turboderp	beac5dc47e	Remove explicit -gencode args again	2025-08-17 19:23:47 +02:00
turboderp	b302438234	Try explicitly setting architectures on nvcc command line	2025-08-17 18:53:47 +02:00
turboderp	5b29ef5008	Try, try again	2025-08-17 18:34:30 +02:00
turboderp	f1f05a7732	Stop Windows Torch from disabling half operators	2025-08-17 18:15:51 +02:00
turboderp	33a4f7bc81	Rework compiler flags (should be correct for Windows now)	2025-08-17 08:26:23 +02:00
turboderp	f3d6f467a5	TP: New AVX2 all-reduce	2025-08-16 23:24:46 +02:00
turboderp	69750c8a56	Fix duplicate subpackage	2025-08-08 06:54:43 +02:00
turboderp	7bb943bc09	Merge branch 'dev' into setup_py_submodule_renames	2025-08-08 06:53:04 +02:00
turboderp	0c5399bdd1	Refactoring	2025-08-08 06:51:11 +02:00
MikeRoz47	31d8af9bbe	Account for renamed/added submodules in setup.py	2025-08-08 02:07:50 +00:00
turboderp	db533103b1	Fix #62, include new directory in packages	2025-07-17 20:58:07 +02:00
turboderp	327d1f99d6	Revert to flash_attn>=2.7.4.post1 until the wheel situation is sorted out	2025-07-16 19:12:46 +02:00
turboderp	ba4304a44b	Pin flash-attn at 2.7.4.post1	2025-07-15 20:42:36 +02:00
turboderp	08dde73e66	Add Formatron support and improved logit masking	2025-07-11 21:29:40 +02:00
turboderp	e370ed289d	safetensors: Add trie search for tensor file map (marisa_trie)	2025-07-08 19:52:00 +02:00
turboderp	6341b119ef	Loader: Add tensor override script	2025-07-08 18:58:43 +02:00
turboderp	2f12246ec3	Fix requirements	2025-04-07 17:30:33 +02:00
Async0x42	5567364846	Fix Issue #2, Error: setup script specifies an absolute path	2025-04-06 22:44:41 -04:00
turboderp	543c4b2771	Initial commit	2025-04-06 14:42:49 +02:00

26 Commits