turboderp
0ff0e32203
Add Branch Decode demo
2026-04-13 23:40:14 +02:00
turboderp
b0932071dd
Tokenizer: Ensure special tokens are decoded when requested, even if marked as unspecial
2026-04-13 23:36:53 +02:00
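The gist of this fix: `decode()` should emit the string form of special tokens when the caller asks for them, even when vocab metadata elsewhere marks them as not-special. A minimal pure-Python sketch of that decision logic (function and vocab shapes are hypothetical, not the actual exllamav3 tokenizer API):

```python
def decode(token_ids, vocab, decode_special_tokens=False):
    """Decode ids to text; include special-token strings when requested.

    vocab maps id -> (piece, is_special). When decode_special_tokens is
    True, special pieces are emitted verbatim even if other metadata
    marked the token "unspecial".
    """
    pieces = []
    for tid in token_ids:
        piece, is_special = vocab[tid]
        if is_special and not decode_special_tokens:
            continue  # default behavior: strip control/special tokens
        pieces.append(piece)
    return "".join(pieces)
```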
turboderp
82f0b633aa
TP: Add AVX-512 host accumulation, increase chunk buffer to 64 kB
...
Many thanks to Carousel Aether
2026-04-13 00:58:12 +02:00
turboderp
149e3e5e85
EXL3 GEMM: Resolve bank conflicts
...
Many thanks to Carousel Aether
2026-04-13 00:55:48 +02:00
turboderp
8877b99855
QCache: Skip dequant when possible outside of SWA window
2026-04-13 00:52:46 +02:00
turboderp
4d405228ef
All-reduce: Fix assumptions about CPUREDUCE_CHUNK_SIZE in kernel
2026-04-12 21:32:25 +02:00
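The fix concerns kernel-side assumptions about CPUREDUCE_CHUNK_SIZE. The host-side chunked reduction it feeds can be sketched as below — a pure-Python stand-in for what is really a C++/CUDA path, with an artificially small chunk size; note the ragged final chunk, which is exactly the case a kernel must not assume away:

```python
CPUREDUCE_CHUNK_SIZE = 8  # illustrative; the real chunk buffer is far larger

def allreduce_sum(rank_buffers):
    """Sum equal-length buffers from all ranks, chunk by chunk.

    Fixed-size chunks bound staging-buffer memory; the last chunk may
    be shorter than CPUREDUCE_CHUNK_SIZE and must be handled as such.
    """
    n = len(rank_buffers[0])
    out = [0.0] * n
    for start in range(0, n, CPUREDUCE_CHUNK_SIZE):
        end = min(start + CPUREDUCE_CHUNK_SIZE, n)  # ragged tail
        for buf in rank_buffers:
            for i in range(start, end):
                out[i] += buf[i]
    return out
```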
turboderp
a956e04dbd
Attn: Add xformers fallback path for non-cached fwd (primarily, avoids excessive VRAM usage during autosplit load)
2026-04-12 18:13:32 +02:00
turboderp
2fc4fa9843
Build actions: Remove Torch 2.7-2.8 wheels for Python 3.14
2026-04-12 13:56:50 +02:00
turboderp
cb1a436f8f
Add Python 3.14 and Torch 2.11 build actions
v0.0.29
2026-04-12 02:31:55 +02:00
turboderp
87830de104
Bump to v0.0.29
2026-04-12 02:27:23 +02:00
turboderp
6b8225ff41
Add bighead-attn and bighead-attn-paged kernels
2026-04-12 01:34:54 +02:00
turboderp
405028b774
Suppress some warnings
2026-04-12 01:31:51 +02:00
turboderp
029e50f004
chat.py: Don't list tokens with near-zero probability
2026-04-10 20:59:43 +02:00
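For the token-candidate display, entries below a small probability threshold are simply dropped. A sketch of that filter (the threshold value and function name are guesses, not taken from chat.py):

```python
def visible_candidates(probs, threshold=1e-4):
    """Filter a {token: prob} mapping down to entries worth displaying,
    sorted by descending probability."""
    kept = [(t, p) for t, p in probs.items() if p >= threshold]
    return sorted(kept, key=lambda tp: -tp[1])
```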
turboderp
e90fe55e89
compare_q.py: Add option to format test dataset with chat template (stable measurements for Gemma4)
2026-04-10 20:59:43 +02:00
Ilia Malanin
ec3d86939c
LoRA: Add PEFT adapter support
2026-04-09 20:16:44 +02:00
turboderp
6271c1de93
Refactor: Collect other processing funcs in mm_processing
2026-04-08 21:25:01 +02:00
turboderp
2e52f80480
Optimizer: Fix bits->final_bits regression
2026-04-08 19:02:38 +02:00
turboderp
0ab97b1568
Fix regression
2026-04-08 04:12:35 +02:00
turboderp
5a09da86c7
Update multimodal example
2026-04-08 04:01:24 +02:00
turboderp
36a636b478
Refactor and rework Gemma4 implementation:
...
- Remove custom quant cache layer stuff for now (cache quant needs to be tested with all the new changes)
- Move preprocessing to separate util module
- Replace dedicated Gemma4 modules with existing generic modules, make necessary adjustments:
- SDPA fallback triggers whenever head_dim > 512 (xformers was also added, but its GQA implementation is buggy and needs an annoying workaround that slows it down considerably)
- Add necessary extra norms, new transpose args and second residual channel to BlockSparseMLP (dense_mlp becomes shared expert instead)
- Add layer scalar per decoder block
- Don't apply embedding multiplier to embedded MM tokens
- Ensure embedding scaling exactly matches HF bfloat16 version
Vision stuff:
- Handle non-causal attention in multimodal spans with multiple (flash) attn passes rather than a custom mask
- Avoid extending chunk size past the first MM span (allows a small amount of redundant processing to keep VRAM overhead relatively constant)
- Fold Gemma4VisionStandardize into Gemma4VisionPooler
- Replace Gemma4VisionProjector with RMSNorm+Linear modules
- Use 2D RoPE in kernel instead of precomputed sin,cos tensors
- Use non-causal attention with no mask (the HF reference pads all embeddings to a fixed size of 280 tokens and then has to apply a custom attn mask to compensate, but the padding tokens are discarded anyway, so there's no point)
2026-04-08 03:59:52 +02:00
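The "multiple (flash) attn passes rather than custom mask" point amounts to splitting the sequence into alternating causal text spans and non-causal multimodal spans, then running a plain attention call per span. A much-simplified sketch of just the span splitting (pure Python; names hypothetical, and the real scheme also governs what each span may attend to):

```python
def split_attn_spans(is_mm):
    """Split positions into (start, end, causal) spans.

    is_mm[i] is True where position i belongs to a multimodal (image)
    embedding. Contiguous MM runs become non-causal spans; everything
    else stays causal, so each span can use an ordinary flash-attn
    call instead of one pass with a custom mask.
    """
    spans, start = [], 0
    for i in range(1, len(is_mm) + 1):
        if i == len(is_mm) or is_mm[i] != is_mm[start]:
            spans.append((start, i, not is_mm[start]))  # causal iff text
            start = i
    return spans
```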
turboderp
8641cad407
RoPE: Add multidimensional RoPE to kernel
2026-04-08 03:05:12 +02:00
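Multidimensional RoPE rotates disjoint halves of each head vector by separate position coordinates (e.g. image row and column). A minimal 2D sketch in pure Python — not the CUDA kernel, and the frequency layout is a generic assumption:

```python
import math

def rope_2d(vec, pos_y, pos_x, base=10000.0):
    """Apply 2D rotary embedding to a flat vector.

    Pairs in the first half are rotated by pos_y, pairs in the second
    half by pos_x; within each half, pair j uses angular frequency
    base ** (-2j / half_dim).
    """
    d = len(vec)
    assert d % 4 == 0
    half = d // 2
    out = list(vec)
    for axis, pos in ((0, pos_y), (1, pos_x)):
        off = axis * half
        for j in range(half // 2):
            theta = pos * base ** (-2 * j / half)
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[off + 2 * j], vec[off + 2 * j + 1]
            out[off + 2 * j] = a * c - b * s
            out[off + 2 * j + 1] = a * s + b * c
    return out
```

At position (0, 0) the rotation is the identity, and like 1D RoPE it preserves vector norm.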
turboderp
89f0302c41
Attn: Add workaround for xformers GQA bug
2026-04-07 22:46:12 +02:00
turboderp
27641630e2
Attn: Add use_k_as_v and v_norm options
2026-04-07 22:46:12 +02:00
turboderp
da2d335233
Attn: Add paged-attn fallbacks using xformers or SDPA for head_dim > 256
2026-04-07 22:46:12 +02:00
turboderp
8864f9213e
RMSNorm: Allow weight key without .weight suffix (Gemma4 kludge)
2026-04-07 22:46:12 +02:00
turboderp
9a89c7cf8e
safetensors: Fix indexing of weights without .weight suffix
2026-04-07 22:46:12 +02:00
turboderp
306c3af85f
safetensors_alt: Fix bug causing writes > 2GB to sometimes fail, prevent C++ backend from segfaulting on failure
2026-04-07 22:46:12 +02:00
turboderp
cbeedc37a4
BlockSparseMLP: Pad intermediate temp tensors for intermediate_size not divisible by 128, quantized path
2026-04-07 22:46:12 +02:00
turboderp
c0fe804d29
BlockSparseMLP: Silu/gelu switch
2026-04-07 22:46:12 +02:00
turboderp
86b6486bee
BlockSparseMLP: Allow per-expert scales (standard softmax routing)
2026-04-07 22:46:12 +02:00
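"Standard softmax routing" softmaxes the router logits over all experts and weights each selected expert's output by its probability. A sketch of top-k selection with per-expert scales (pure Python; whether the real path renormalizes over the selected k is not stated here, so raw probabilities are an assumption):

```python
import math

def route(logits, top_k):
    """Return [(expert_index, scale)] under standard softmax routing.

    Softmax over all router logits, pick the top_k experts, and use
    each one's softmax probability as its per-expert output scale.
    """
    m = max(logits)                          # shift for numerical stability
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    probs = [e / z for e in exps]
    top = sorted(range(len(logits)), key=lambda i: -probs[i])[:top_k]
    return [(i, probs[i]) for i in top]
```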
turboderp
c0e0e71879
Embedding: Allow embeddings to stay in BF16 format and apply constant scale in BF16 to match rounding behavior of Gemma
2026-04-07 22:46:12 +02:00
turboderp
c045873365
RMSNorm: Add constant scale factor
2026-04-07 22:46:12 +02:00
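RMSNorm with a constant scale factor multiplies the normalized activations by a fixed constant in addition to the learned weight (Gemma-style models need such a scale in a few places). A pure-Python sketch under that reading; the exact weight convention (e.g. Gemma's `1 + weight`) is not shown here:

```python
import math

def rms_norm(x, weight, const_scale=1.0, eps=1e-6):
    """y_i = x_i / rms(x) * weight_i * const_scale"""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w * const_scale for v, w in zip(x, weight)]
```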
turboderp
4220efb225
MoE kernel: Add GELU
2026-04-07 22:46:12 +02:00
turboderp
57389c5b21
Refactor architecture-specific modules into own directory
2026-04-07 22:46:12 +02:00
turboderp
67853a3f81
Examples: Add Gemma4 basic template
2026-04-07 22:46:12 +02:00
turboderp
7046a5c739
perf.py: Fix cache overflow
2026-04-06 01:58:27 +02:00
turboderp
38da9ba65c
convert.py: Free up unnecessary system RAM alloc on resumed job
2026-04-06 01:31:41 +02:00
turboderp
019408a51b
convert.py: Little more feedback
2026-04-06 01:30:29 +02:00
turboderp
0fd31f6609
chat.py: Add Gemma4 template
2026-04-05 19:44:46 +02:00
turboderp
514389a2b5
ppl.py: No default datatype for HF mode
2026-04-05 19:43:46 +02:00
turboderp
3cb0f4381f
Merge branch 'dev' into fork/lesj0610/feat/gemma4-support
2026-04-04 23:31:05 +02:00
turboderp
476ad297ec
BBEH eval: Fix results display
2026-04-04 23:29:19 +02:00
turboderp
5bb4e0d32b
MMLU eval: Fix confidence interval
2026-04-04 22:49:51 +02:00
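The commit doesn't show which interval formula the eval uses; for reference, a generic normal-approximation (Wald) confidence interval for benchmark accuracy looks like this:

```python
import math

def accuracy_ci(correct, total, z=1.96):
    """95% normal-approximation confidence interval for an accuracy
    estimate over `total` independent questions, clamped to [0, 1]."""
    p = correct / total
    half = z * math.sqrt(p * (1 - p) / total)
    return max(0.0, p - half), min(1.0, p + half)
```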
turboderp
46ea669d80
Tokenizer: Prioritize EOS token ID from tokenizer_config.json if present
2026-04-04 22:31:26 +02:00
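The prioritization described here — take the EOS declared in `tokenizer_config.json` when the file has one, otherwise fall back to the model config's `eos_token_id` — can be sketched as follows (the file names follow HF conventions; the helper itself is hypothetical):

```python
import json
import os

def resolve_eos_id(model_dir, vocab):
    """Prefer the EOS token declared in tokenizer_config.json; fall
    back to eos_token_id from config.json."""
    tcfg_path = os.path.join(model_dir, "tokenizer_config.json")
    if os.path.exists(tcfg_path):
        with open(tcfg_path) as f:
            tok = json.load(f).get("eos_token")
        if isinstance(tok, dict):        # sometimes an AddedToken dict
            tok = tok.get("content")
        if tok in vocab:
            return vocab[tok]
    with open(os.path.join(model_dir, "config.json")) as f:
        return json.load(f).get("eos_token_id")
```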
turboderp
76acd9c140
Eval: Add BigBench Extra Hard
2026-04-04 22:31:26 +02:00
turboderp
d19844bd79
Generator: Add loop detection
2026-04-04 22:31:26 +02:00
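One common form of generation loop detection checks whether the tail of the token stream is the same n-gram repeated several times. A sketch along those lines (the window and repeat parameters are guesses, not the generator's actual settings):

```python
def is_looping(tokens, max_ngram=8, min_repeats=3):
    """True if the stream ends with the same n-gram repeated at least
    min_repeats times, for any n-gram length up to max_ngram."""
    for n in range(1, max_ngram + 1):
        need = n * min_repeats
        if len(tokens) < need:
            break
        tail = tokens[-need:]
        if all(tail[i] == tail[i % n] for i in range(need)):
            return True
    return False
```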
turboderp
9bd2b5ea4d
ppl eval: Combine HF and EXL3 evals into single module, add mode that attempts to replicate default llama.cpp eval tokenization and scoring
2026-04-03 23:37:00 +02:00
lesj0610
cacd03842e
fix(gemma4): stabilize 26b quantized kv cache profile
2026-04-03 23:30:45 +09:00
lesj0610
22d57bf503
fix(gemma4): stabilize 31b q4 cache path
2026-04-03 22:41:25 +09:00
lesj0610
92955b80d0
fix(gemma4): recover q4 kv cache generation
2026-04-03 19:05:01 +09:00