894 Commits

Author SHA1 Message Date
turboderp
0ff0e32203 Add Branch Decode demo 2026-04-13 23:40:14 +02:00
turboderp
b0932071dd Tokenizer: Ensure special tokens are decoded when requested, even if marked as unspecial 2026-04-13 23:36:53 +02:00
turboderp
82f0b633aa TP: Add AVX-512 host accumulation, increase chunk buffer to 64 kB
Much thanks to Carousel Aether
2026-04-13 00:58:12 +02:00
turboderp
149e3e5e85 EXL3 GEMM: Resolve bank conflicts
Much thanks to Carousel Aether
2026-04-13 00:55:48 +02:00
turboderp
8877b99855 QCache: Skip dequant when possible outside of SWA window 2026-04-13 00:52:46 +02:00
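The commit above only pays the dequantization cost for cache pages that a sliding-window-attention (SWA) layer can still see. A minimal sketch of the page-selection idea, with hypothetical page size, window size, and function names (not the actual exllamav3 code):

```python
# Hypothetical sketch: only dequantize K/V cache pages that overlap the
# sliding-window attention window; constants and names are illustrative.
PAGE_SIZE = 256      # tokens per quantized cache page (assumed)
SWA_WINDOW = 4096    # sliding-window size for this layer (assumed)

def pages_to_dequant(seq_len: int) -> list[int]:
    """Indices of cache pages overlapping the window [seq_len - SWA_WINDOW, seq_len)."""
    window_start = max(0, seq_len - SWA_WINDOW)
    first_page = window_start // PAGE_SIZE
    last_page = (seq_len - 1) // PAGE_SIZE
    return list(range(first_page, last_page + 1))

# Pages before first_page can never be attended to by this SWA layer,
# so no dequant kernel needs to be launched for them.
print(pages_to_dequant(10_000))   # pages 23..39 with the constants above
```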
turboderp
4d405228ef All-reduce: Fix assumptions about CPUREDUCE_CHUNK_SIZE in kernel 2026-04-12 21:32:25 +02:00
turboderp
a956e04dbd Attn: Add xformers fallback path for non-cached fwd (primarily avoids excessive VRAM usage during autosplit load) 2026-04-12 18:13:32 +02:00
turboderp
2fc4fa9843 Build actions: Remove Torch 2.7-2.8 wheels for Python 3.14 2026-04-12 13:56:50 +02:00
turboderp
cb1a436f8f Add Python 3.14 and Torch 2.11 build actions v0.0.29 2026-04-12 02:31:55 +02:00
turboderp
87830de104 Bump to v0.0.29 2026-04-12 02:27:23 +02:00
turboderp
6b8225ff41 Add bighead-attn and bighead-attn-paged kernels 2026-04-12 01:34:54 +02:00
turboderp
405028b774 Suppress some warnings 2026-04-12 01:31:51 +02:00
turboderp
029e50f004 chat.py: Don't list tokens with near-zero probability 2026-04-10 20:59:43 +02:00
turboderp
e90fe55e89 compare_q.py: Add option to format test dataset with chat template (stable measurements for Gemma4) 2026-04-10 20:59:43 +02:00
Ilia Malanin
ec3d86939c LoRA: Add PEFT adapter support 2026-04-09 20:16:44 +02:00
turboderp
6271c1de93 Refactor: Collect other processing funcs in mm_processing 2026-04-08 21:25:01 +02:00
turboderp
2e52f80480 Optimizer: Fix bits->final_bits regression 2026-04-08 19:02:38 +02:00
turboderp
0ab97b1568 Fix regression 2026-04-08 04:12:35 +02:00
turboderp
5a09da86c7 Update multimodal example 2026-04-08 04:01:24 +02:00
turboderp
36a636b478 Refactor and rework Gemma4 implementation:
- Remove custom quant cache layer stuff for now (cache quant needs to be tested with all the new changes)
- Move preprocessing to separate util module
- Replace dedicated Gemma4 modules with existing generic modules, make necessary adjustments:
   - SDPA fallback triggers whenever head_dim > 512 (xformers also added, but its GQA impl. is buggy and needs an annoying workaround that slows it down a lot)
   - Add necessary extra norms, new transpose args and second residual channel to BlockSparseMLP (dense_mlp becomes shared expert instead)
   - Add layer scalar per decoder block
   - Don't apply embedding multiplier to embedded MM tokens
- Ensure embedding scaling exactly matches HF bfloat16 version

Vision stuff:
- Handle non-causal attention in multimodal spans with multiple (flash) attn passes rather than a custom mask (see the sketch after this commit).
- Avoid extending chunk size past the first MM span (allows a small amount of redundant processing to keep VRAM overhead relatively constant).
- Fold Gemma4VisionStandardize into Gemma4VisionPooler
- Replace Gemma4VisionProjector with RMSNorm+Linear modules
- Use 2D RoPE in kernel instead of precomputed sin,cos tensors
- Use non-causal attention with no mask (HF reference pads all embeddings to the same size of 280 tokens and then has to apply a custom attn mask to make that work, but the padding tokens are discarded anyway so there's no point)
2026-04-08 03:59:52 +02:00
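The "multiple (flash) attn passes rather than a custom mask" item above can be illustrated with plain per-span attention calls: one causal pass covers the text positions, and each multimodal span gets its own non-causal pass whose keys and values extend up to the end of that span. The sketch below uses torch SDPA in place of flash attention, and the span boundaries and shapes are made up; it shows the idea, not the exllamav3 implementation.

```python
# Illustrative only: image tokens in a span [s, e) may attend to every
# position < e (bidirectional inside the span, causal toward the prefix),
# while text tokens remain strictly causal.
import torch
import torch.nn.functional as F

def mixed_attention(q, k, v, image_spans):
    """q, k, v: [batch, heads, seq, head_dim]; image_spans: list of (start, end)."""
    # Pass 1: ordinary causal attention, correct for all text positions.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # One extra non-causal pass per image span: queries restricted to the span,
    # keys/values restricted to everything up to the span's end.
    for s, e in image_spans:
        out[:, :, s:e] = F.scaled_dot_product_attention(
            q[:, :, s:e], k[:, :, :e], v[:, :, :e], is_causal=False
        )
    return out

q = k = v = torch.randn(1, 4, 32, 64)
print(mixed_attention(q, k, v, image_spans=[(8, 16)]).shape)  # [1, 4, 32, 64]
```

This keeps the fast causal kernel for the bulk of the sequence and never materializes a full seq-by-seq attention mask.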
turboderp
8641cad407 RoPE: Add multidimensional RoPE to kernel 2026-04-08 03:05:12 +02:00
turboderp
89f0302c41 Attn: Add workaround for xformers GQA bug 2026-04-07 22:46:12 +02:00
turboderp
27641630e2 Attn: Add use_k_as_v and v_norm options 2026-04-07 22:46:12 +02:00
turboderp
da2d335233 Attn: Add paged-attn fallbacks using xformers or SDPA for head_dim > 256 2026-04-07 22:46:12 +02:00
turboderp
8864f9213e RMSNorm: Allow weight key without .weight suffix (Gemma4 kludge) 2026-04-07 22:46:12 +02:00
turboderp
9a89c7cf8e safetensors: Fix indexing of weights without .weight suffix 2026-04-07 22:46:12 +02:00
turboderp
306c3af85f safetensors_alt: Fix bug causing writes > 2GB to sometimes fail, prevent C++ backend from segfaulting on failure 2026-04-07 22:46:12 +02:00
turboderp
cbeedc37a4 BlockSparseMLP: Pad intermediate temp tensors for intermediate_size not divisible by 128, quantized path 2026-04-07 22:46:12 +02:00
turboderp
c0fe804d29 BlockSparseMLP: Silu/gelu switch 2026-04-07 22:46:12 +02:00
turboderp
86b6486bee BlockSparseMLP: Allow per-expert scales (standard softmax routing) 2026-04-07 22:46:12 +02:00
turboderp
c0e0e71879 Embedding: Allow embeddings to stay in BF16 format and apply constant scale in BF16 to match rounding behavior of Gemma 2026-04-07 22:46:12 +02:00
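Gemma multiplies its token embeddings by a constant scale (sqrt(hidden_size) in the HF reference). Keeping the table in BF16 and applying the scale while still in BF16, as in the commit above, reproduces the reference rounding; scaling after upcasting does not. A small sketch of the difference, with an assumed hidden size:

```python
# Sketch of why the cast/scale order matters; sizes are illustrative.
import torch

hidden_size = 1152
scale = torch.tensor(hidden_size ** 0.5, dtype=torch.bfloat16)

emb = torch.randn(4, hidden_size, dtype=torch.bfloat16)  # BF16 embedding rows

ref = (emb * scale).to(torch.float32)               # scale in BF16, then upcast
alt = emb.to(torch.float32) * hidden_size ** 0.5    # upcast first, scale in FP32

print(torch.equal(ref, alt))      # generally False
print((ref - alt).abs().max())    # small but nonzero rounding difference
```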
turboderp
c045873365 RMSNorm: Add constant scale factor 2026-04-07 22:46:12 +02:00
turboderp
4220efb225 MoE kernel: Add GELU 2026-04-07 22:46:12 +02:00
turboderp
57389c5b21 Refactor architecture-specific modules into own directory 2026-04-07 22:46:12 +02:00
turboderp
67853a3f81 Examples: Add Gemma4 basic template 2026-04-07 22:46:12 +02:00
turboderp
7046a5c739 perf.py: Fix cache overflow 2026-04-06 01:58:27 +02:00
turboderp
38da9ba65c convert.py: Free up unnecessary system RAM alloc on resumed job 2026-04-06 01:31:41 +02:00
turboderp
019408a51b convert.py: Little more feedback 2026-04-06 01:30:29 +02:00
turboderp
0fd31f6609 chat.py: Add Gemma4 template 2026-04-05 19:44:46 +02:00
turboderp
514389a2b5 ppl.py: No default datatype for HF mode 2026-04-05 19:43:46 +02:00
turboderp
3cb0f4381f Merge branch 'dev' into fork/lesj0610/feat/gemma4-support 2026-04-04 23:31:05 +02:00
turboderp
476ad297ec BBEH eval: Fix results display 2026-04-04 23:29:19 +02:00
turboderp
5bb4e0d32b MMLU eval: Fix confidence interval 2026-04-04 22:49:51 +02:00
turboderp
46ea669d80 Tokenizer: Prioritize EOS token ID from tokenizer_config.json if present 2026-04-04 22:31:26 +02:00
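A rough sketch of the priority order the commit above implies: if tokenizer_config.json names an EOS token, resolve that string through the vocabulary and prefer it over config.json's numeric eos_token_id. Key names follow the common HF convention; the actual exllamav3 code may differ.

```python
# Hypothetical sketch of EOS resolution order, not the actual exllamav3 code.
import json, os

def resolve_eos_id(model_dir, vocab):
    # 1) tokenizer_config.json usually stores the EOS *string*, e.g. "<end_of_turn>".
    path = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.exists(path):
        with open(path) as f:
            eos = json.load(f).get("eos_token")
        if isinstance(eos, dict):       # sometimes wrapped as {"content": ...}
            eos = eos.get("content")
        if isinstance(eos, str) and eos in vocab:
            return vocab[eos]
    # 2) Fall back to the numeric eos_token_id from config.json.
    with open(os.path.join(model_dir, "config.json")) as f:
        eos_id = json.load(f).get("eos_token_id")
    return eos_id[0] if isinstance(eos_id, list) else eos_id
```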
turboderp
76acd9c140 Eval: Add BigBench Extra Hard 2026-04-04 22:31:26 +02:00
turboderp
d19844bd79 Generator: Add loop detection 2026-04-04 22:31:26 +02:00
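One simple way to detect the degenerate repetition a loop detector targets is to check whether the tail of the output is a short token cycle repeated several times. The heuristic and thresholds below are only an illustration, not the generator's actual method:

```python
# Illustrative loop heuristic: flag the output if its tail is the same token
# cycle repeated at least `min_repeats` times. Thresholds are arbitrary.
def looks_like_loop(token_ids, max_cycle=32, min_repeats=4):
    for cycle in range(1, max_cycle + 1):
        span = cycle * min_repeats
        if len(token_ids) < span:
            break
        tail = token_ids[-span:]
        if all(tail[i] == tail[i % cycle] for i in range(span)):
            return True
    return False

print(looks_like_loop([7, 8, 9] + [1, 2] * 10))  # True  (cycle "1 2" repeats)
print(looks_like_loop(list(range(100))))         # False
```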
turboderp
9bd2b5ea4d ppl eval: Combine HF and EXL3 evals into single module, add mode that attempts to replicate default llama.cpp eval tokenization and scoring 2026-04-03 23:37:00 +02:00
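For reference, llama.cpp's default perplexity evaluation cuts the token stream into fixed-size chunks and scores only the second half of each chunk, so every scored token has at least half a context window behind it. The sketch below follows that description under stated assumptions and is not the module added in this commit:

```python
# Rough sketch of llama.cpp-style chunked perplexity scoring (defaults assumed).
import math
import torch

def chunked_ppl(logprob_fn, token_ids, n_ctx=512):
    """logprob_fn(chunk) -> tensor of len(chunk)-1 log-probs of each next token."""
    nll, count = 0.0, 0
    for c in range(len(token_ids) // n_ctx):
        chunk = token_ids[c * n_ctx:(c + 1) * n_ctx]
        logprobs = logprob_fn(chunk)              # one forward pass per chunk
        scored = logprobs[n_ctx // 2 - 1:]        # score positions n_ctx/2 .. n_ctx-1
        nll -= scored.sum().item()
        count += scored.numel()
    return math.exp(nll / count)

# Dummy usage: a "model" assigning uniform probability over a 32k vocab.
uniform = lambda chunk: torch.full((len(chunk) - 1,), -math.log(32000.0))
print(chunked_ppl(uniform, list(range(2048))))    # ~32000 (uniform PPL = vocab size)
```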
lesj0610
cacd03842e fix(gemma4): stabilize 26b quantized kv cache profile 2026-04-03 23:30:45 +09:00
lesj0610
22d57bf503 fix(gemma4): stabilize 31b q4 cache path 2026-04-03 22:41:25 +09:00
lesj0610
92955b80d0 fix(gemma4): recover q4 kv cache generation 2026-04-03 19:05:01 +09:00