894 Commits

Author SHA1 Message Date
turboderp
0ff0e32203 Add Branch Decode demo 2026-04-13 23:40:14 +02:00
turboderp
b0932071dd Tokenizer: Ensure special tokens are decoded when requested, even if marked as unspecial 2026-04-13 23:36:53 +02:00
turboderp
82f0b633aa TP: Add AVX-512 host accumulation, increase chunk buffer to 64 kB
Much thanks to Carousel Aether
2026-04-13 00:58:12 +02:00
turboderp
149e3e5e85 EXL3 GEMM: Resolve bank conflicts
Much thanks to Carousel Aether
2026-04-13 00:55:48 +02:00
turboderp
8877b99855 QCache: Skip dequant when possible outside of SWA window 2026-04-13 00:52:46 +02:00
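The commit above only pays the dequantization cost for cache pages that a sliding-window-attention (SWA) layer can still see. A minimal sketch of the page-selection idea, with hypothetical page size, window size, and function names (not the actual exllamav3 code):

```python
# Hypothetical sketch: only dequantize K/V cache pages that overlap the
# sliding-window attention window; constants and names are illustrative.
PAGE_SIZE = 256      # tokens per quantized cache page (assumed)
SWA_WINDOW = 4096    # sliding-window size for this layer (assumed)

def pages_to_dequant(seq_len: int) -> list[int]:
    """Indices of cache pages overlapping the window [seq_len - SWA_WINDOW, seq_len)."""
    window_start = max(0, seq_len - SWA_WINDOW)
    first_page = window_start // PAGE_SIZE
    last_page = (seq_len - 1) // PAGE_SIZE
    return list(range(first_page, last_page + 1))

# Pages before first_page can never be attended to by this SWA layer,
# so no dequant kernel needs to be launched for them.
print(pages_to_dequant(10_000))   # pages 23..39 with the constants above
```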
turboderp
4d405228ef All-reduce: Fix assumptions about CPUREDUCE_CHUNK_SIZE in kernel 2026-04-12 21:32:25 +02:00
turboderp
a956e04dbd Attn: Add xformers fallback path for non-cached fwd (primarily avoids excessive VRAM usage during autosplit load) 2026-04-12 18:13:32 +02:00
turboderp
2fc4fa9843 Build actions: Remove Torch 2.7-2.8 wheels for Python 3.14 2026-04-12 13:56:50 +02:00
turboderp
cb1a436f8f Add Python 3.14 and Torch 2.11 build actions v0.0.29 2026-04-12 02:31:55 +02:00
turboderp
87830de104 Bump to v0.0.29 2026-04-12 02:27:23 +02:00
turboderp
6b8225ff41 Add bighead-attn and bighead-attn-paged kernels 2026-04-12 01:34:54 +02:00
turboderp
405028b774 Suppress some warnings 2026-04-12 01:31:51 +02:00
turboderp
029e50f004 chat.py: Don't list tokens with near-zero probability 2026-04-10 20:59:43 +02:00
turboderp
e90fe55e89 compare_q.py: Add option to format test dataset with chat template (stable measurements for Gemma4) 2026-04-10 20:59:43 +02:00
Ilia Malanin
ec3d86939c LoRA: Add PEFT adapter support 2026-04-09 20:16:44 +02:00
turboderp
6271c1de93 Refactor: Collect other processing funcs in mm_processing 2026-04-08 21:25:01 +02:00
turboderp
2e52f80480 Optimizer: Fix bits->final_bits regression 2026-04-08 19:02:38 +02:00
turboderp
0ab97b1568 Fix regression 2026-04-08 04:12:35 +02:00
turboderp
5a09da86c7 Update multimodal example 2026-04-08 04:01:24 +02:00
turboderp
36a636b478 Refactor and rework Gemma4 implementation:
- Remove custom quant cache layer stuff for now (cache quant needs to be tested with all the new changes)
- Move preprocessing to separate util module
- Replace dedicated Gemma4 modules with existing generic modules, make necessary adjustments:
   - SDPA fallback triggers whenever head_dim > 512 (xformers also added, but its GQA impl. is buggy and needs an annoying workaround that slows it down a lot)
   - Add necessary extra norms, new transpose args and second residual channel to BlockSparseMLP (dense_mlp becomes shared expert instead)
   - Add layer scalar per decoder block
   - Don't apply embedding multiplier to embedded MM tokens
- Ensure embedding scaling exactly matches HF bfloat16 version

Vision stuff:
- Handle non-causal attention in multimodal spans with multiple (flash) attn passes rather than a custom mask (see the sketch after this commit).
- Avoid extending chunk size past the first MM span (allows a small amount of redundant processing to keep VRAM overhead relatively constant).
- Fold Gemma4VisionStandardize into Gemma4VisionPooler
- Replace Gemma4VisionProjector with RMSNorm+Linear modules
- Use 2D RoPE in kernel instead of precomputed sin,cos tensors
- Use non-causal attention with no mask (HF reference pads all embeddings to the same size of 280 tokens and then has to apply a custom attn mask to make that work, but the padding tokens are discarded anyway so there's no point)
2026-04-08 03:59:52 +02:00
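The "multiple (flash) attn passes rather than a custom mask" item above can be illustrated with plain per-span attention calls: one causal pass covers the text positions, and each multimodal span gets its own non-causal pass whose keys and values extend up to the end of that span. The sketch below uses torch SDPA in place of flash attention, and the span boundaries and shapes are made up; it shows the idea, not the exllamav3 implementation.

```python
# Illustrative only: image tokens in a span [s, e) may attend to every
# position < e (bidirectional inside the span, causal toward the prefix),
# while text tokens remain strictly causal.
import torch
import torch.nn.functional as F

def mixed_attention(q, k, v, image_spans):
    """q, k, v: [batch, heads, seq, head_dim]; image_spans: list of (start, end)."""
    # Pass 1: ordinary causal attention, correct for all text positions.
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
    # One extra non-causal pass per image span: queries restricted to the span,
    # keys/values restricted to everything up to the span's end.
    for s, e in image_spans:
        out[:, :, s:e] = F.scaled_dot_product_attention(
            q[:, :, s:e], k[:, :, :e], v[:, :, :e], is_causal=False
        )
    return out

q = k = v = torch.randn(1, 4, 32, 64)
print(mixed_attention(q, k, v, image_spans=[(8, 16)]).shape)  # [1, 4, 32, 64]
```

This keeps the fast causal kernel for the bulk of the sequence and never materializes a full seq-by-seq attention mask.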
turboderp
8641cad407 RoPE: Add multidimensional RoPE to kernel 2026-04-08 03:05:12 +02:00
turboderp
89f0302c41 Attn: Add workaround for xformers GQA bug 2026-04-07 22:46:12 +02:00
turboderp
27641630e2 Attn: Add use_k_as_v and v_norm options 2026-04-07 22:46:12 +02:00
turboderp
da2d335233 Attn: Add paged-attn fallbacks using xformers or SDPA for head_dim > 256 2026-04-07 22:46:12 +02:00
turboderp
8864f9213e RMSNorm: Allow weight key without .weight suffix (Gemma4 kludge) 2026-04-07 22:46:12 +02:00
turboderp
9a89c7cf8e safetensors: Fix indexing of weights without .weight suffix 2026-04-07 22:46:12 +02:00
turboderp
306c3af85f safetensors_alt: Fix bug causing writes > 2GB to sometimes fail, prevent C++ backend from segfaulting on failure 2026-04-07 22:46:12 +02:00
turboderp
cbeedc37a4 BlockSparseMLP: Pad intermediate temp tensors for intermediate_size not divisible by 128, quantized path 2026-04-07 22:46:12 +02:00
turboderp
c0fe804d29 BlockSparseMLP: Silu/gelu switch 2026-04-07 22:46:12 +02:00
turboderp
86b6486bee BlockSparseMLP: Allow per-expert scales (standard softmax routing) 2026-04-07 22:46:12 +02:00
turboderp
c0e0e71879 Embedding: Allow embeddings to stay in BF16 format and apply constant scale in BF16 to match rounding behavior of Gemma 2026-04-07 22:46:12 +02:00
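Gemma multiplies its token embeddings by a constant scale (sqrt(hidden_size) in the HF reference). Keeping the table in BF16 and applying the scale while still in BF16, as in the commit above, reproduces the reference rounding; scaling after upcasting does not. A small sketch of the difference, with an assumed hidden size:

```python
# Sketch of why the cast/scale order matters; sizes are illustrative.
import torch

hidden_size = 1152
scale = torch.tensor(hidden_size ** 0.5, dtype=torch.bfloat16)

emb = torch.randn(4, hidden_size, dtype=torch.bfloat16)  # BF16 embedding rows

ref = (emb * scale).to(torch.float32)               # scale in BF16, then upcast
alt = emb.to(torch.float32) * hidden_size ** 0.5    # upcast first, scale in FP32

print(torch.equal(ref, alt))      # generally False
print((ref - alt).abs().max())    # small but nonzero rounding difference
```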
turboderp
c045873365 RMSNorm: Add constant scale factor 2026-04-07 22:46:12 +02:00
turboderp
4220efb225 MoE kernel: Add GELU 2026-04-07 22:46:12 +02:00
turboderp
57389c5b21 Refactor architecture-specific modules into own directory 2026-04-07 22:46:12 +02:00
turboderp
67853a3f81 Examples: Add Gemma4 basic template 2026-04-07 22:46:12 +02:00
turboderp
7046a5c739 perf.py: Fix cache overflow 2026-04-06 01:58:27 +02:00
turboderp
38da9ba65c convert.py: Free up unnecessary system RAM alloc on resumed job 2026-04-06 01:31:41 +02:00
turboderp
019408a51b convert.py: Little more feedback 2026-04-06 01:30:29 +02:00
turboderp
0fd31f6609 chat.py: Add Gemma4 template 2026-04-05 19:44:46 +02:00
turboderp
514389a2b5 ppl.py: No default datatype for HF mode 2026-04-05 19:43:46 +02:00
turboderp
3cb0f4381f Merge branch 'dev' into fork/lesj0610/feat/gemma4-support 2026-04-04 23:31:05 +02:00
turboderp
476ad297ec BBEH eval: Fix results display 2026-04-04 23:29:19 +02:00
turboderp
5bb4e0d32b MMLU eval: Fix confidence interval 2026-04-04 22:49:51 +02:00
turboderp
46ea669d80 Tokenizer: Prioritize EOS token ID from tokenizer_config.json if present 2026-04-04 22:31:26 +02:00
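A rough sketch of the priority order the commit above implies: if tokenizer_config.json names an EOS token, resolve that string through the vocabulary and prefer it over config.json's numeric eos_token_id. Key names follow the common HF convention; the actual exllamav3 code may differ.

```python
# Hypothetical sketch of EOS resolution order, not the actual exllamav3 code.
import json, os

def resolve_eos_id(model_dir, vocab):
    # 1) tokenizer_config.json usually stores the EOS *string*, e.g. "<end_of_turn>".
    path = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.exists(path):
        with open(path) as f:
            eos = json.load(f).get("eos_token")
        if isinstance(eos, dict):       # sometimes wrapped as {"content": ...}
            eos = eos.get("content")
        if isinstance(eos, str) and eos in vocab:
            return vocab[eos]
    # 2) Fall back to the numeric eos_token_id from config.json.
    with open(os.path.join(model_dir, "config.json")) as f:
        eos_id = json.load(f).get("eos_token_id")
    return eos_id[0] if isinstance(eos_id, list) else eos_id
```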
turboderp
76acd9c140 Eval: Add BigBench Extra Hard 2026-04-04 22:31:26 +02:00
turboderp
d19844bd79 Generator: Add loop detection 2026-04-04 22:31:26 +02:00
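One simple way to detect the degenerate repetition a loop detector targets is to check whether the tail of the output is a short token cycle repeated several times. The heuristic and thresholds below are only an illustration, not the generator's actual method:

```python
# Illustrative loop heuristic: flag the output if its tail is the same token
# cycle repeated at least `min_repeats` times. Thresholds are arbitrary.
def looks_like_loop(token_ids, max_cycle=32, min_repeats=4):
    for cycle in range(1, max_cycle + 1):
        span = cycle * min_repeats
        if len(token_ids) < span:
            break
        tail = token_ids[-span:]
        if all(tail[i] == tail[i % cycle] for i in range(span)):
            return True
    return False

print(looks_like_loop([7, 8, 9] + [1, 2] * 10))  # True  (cycle "1 2" repeats)
print(looks_like_loop(list(range(100))))         # False
```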
turboderp
9bd2b5ea4d ppl eval: Combine HF and EXL3 evals into single module, add mode that attempts to replicate default llama.cpp eval tokenization and scoring 2026-04-03 23:37:00 +02:00
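For reference, llama.cpp's default perplexity evaluation cuts the token stream into fixed-size chunks and scores only the second half of each chunk, so every scored token has at least half a context window behind it. The sketch below follows that description under stated assumptions and is not the module added in this commit:

```python
# Rough sketch of llama.cpp-style chunked perplexity scoring (defaults assumed).
import math
import torch

def chunked_ppl(logprob_fn, token_ids, n_ctx=512):
    """logprob_fn(chunk) -> tensor of len(chunk)-1 log-probs of each next token."""
    nll, count = 0.0, 0
    for c in range(len(token_ids) // n_ctx):
        chunk = token_ids[c * n_ctx:(c + 1) * n_ctx]
        logprobs = logprob_fn(chunk)              # one forward pass per chunk
        scored = logprobs[n_ctx // 2 - 1:]        # score positions n_ctx/2 .. n_ctx-1
        nll -= scored.sum().item()
        count += scored.numel()
    return math.exp(nll / count)

# Dummy usage: a "model" assigning uniform probability over a 32k vocab.
uniform = lambda chunk: torch.full((len(chunk) - 1,), -math.log(32000.0))
print(chunked_ppl(uniform, list(range(2048))))    # ~32000 (uniform PPL = vocab size)
```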
lesj0610
cacd03842e fix(gemma4): stabilize 26b quantized kv cache profile 2026-04-03 23:30:45 +09:00
lesj0610
22d57bf503 fix(gemma4): stabilize 31b q4 cache path 2026-04-03 22:41:25 +09:00
lesj0610
92955b80d0 fix(gemma4): recover q4 kv cache generation 2026-04-03 19:05:01 +09:00