26 Commits

Author SHA1 Message Date
JwinPBE f7da9c58e1 update setup.py with current repository URL 2026-04-18 02:29:04 -04:00
turboderp 36a636b478 Refactor and rework Gemma4 implementation:
- Remove the custom quant-cache layer logic for now (cache quantization needs to be retested with all the new changes)
- Move preprocessing to a separate util module
- Replace dedicated Gemma4 modules with existing generic modules, with the necessary adjustments:
   - SDPA fallback triggers whenever head_dim > 512 (xformers is also added, but its GQA implementation is buggy and needs an annoying workaround that slows it down a lot)
   - Add the necessary extra norms, new transpose args and a second residual channel to BlockSparseMLP (dense_mlp becomes the shared expert instead)
   - Add a layer scalar per decoder block
   - Don't apply the embedding multiplier to embedded MM tokens
- Ensure embedding scaling exactly matches the HF bfloat16 version

Vision stuff:
- Handle non-causal attention in multimodal spans with multiple (flash) attn passes rather than a custom mask
- Avoid extending the chunk size past the first MM span (allow a small amount of redundant processing to keep VRAM overhead relatively constant)
- Fold Gemma4VisionStandardize into Gemma4VisionPooler
- Replace Gemma4VisionProjector with RMSNorm+Linear modules
- Use 2D RoPE in the kernel instead of precomputed sin/cos tensors
- Use non-causal attention with no mask (the HF reference pads all embeddings to the same size of 280 tokens and then has to apply a custom attn mask to make that work, but the padding tokens are discarded anyway, so there's no point)
2026-04-08 03:59:52 +02:00
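The 2D RoPE change in the commit above replaces precomputed sin/cos tensors with rotations computed on the fly from a token's 2D position. A minimal sketch of that idea, assuming the common layout where half the head dimensions rotate by the row index and half by the column index (the function name, split and base are assumptions, not taken from the repo):

```python
import math

# Hypothetical sketch of 2D RoPE for vision tokens: rotate the first half
# of the head dims by the row position and the second half by the column
# position, computing sin/cos on the fly instead of reading precomputed
# tensors. Pure-Python for clarity; a kernel would do this per element.
def rope_2d(vec, row, col, base=10000.0):
    half = len(vec) // 2
    out = list(vec)
    for axis_off, pos in ((0, row), (half, col)):
        for i in range(0, half, 2):
            theta = pos / (base ** (i / half))
            c, s = math.cos(theta), math.sin(theta)
            a, b = vec[axis_off + i], vec[axis_off + i + 1]
            out[axis_off + i] = a * c - b * s
            out[axis_off + i + 1] = a * s + b * c
    return out
```

Since each pair is a plain 2D rotation, the vector's norm is preserved, which is the property that makes computing the angles in-kernel equivalent to applying precomputed sin/cos tables.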
turboderp da2d335233 Attn: Add paged-attn fallbacks using xformers or SDPA for head_dim > 256 2026-04-07 22:46:12 +02:00
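The fallback rule this commit describes can be sketched as a small backend selector. This is a hypothetical sketch based only on the commit messages: the function name is invented, the 256 limit comes from "head_dim > 256" here, and SDPA is treated as the safer default because the Gemma4 commit notes that xformers' GQA path needs a slow workaround.

```python
# Assumed flash-attn head_dim limit, per "head_dim > 256" in the commit.
FLASH_MAX_HEAD_DIM = 256

def pick_paged_attn_backend(head_dim: int, xformers_usable: bool = False) -> str:
    """Choose an attention backend for paged attention (hypothetical sketch).

    flash-attn covers head_dim up to the assumed limit; larger heads fall
    back to xformers when its GQA workaround is acceptable, else to SDPA.
    """
    if head_dim <= FLASH_MAX_HEAD_DIM:
        return "flash_attn"
    return "xformers" if xformers_usable else "sdpa"
```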
turboderp 57389c5b21 Refactor architecture-specific modules into own directory 2026-04-07 22:46:12 +02:00
turboderp d908a6c439 Convert: Increase default calibration to 250 rows, add more cal data 2025-10-12 14:12:59 +02:00
turboderp 4356527867 Pin pydantic to 2.11.0 2025-10-09 11:00:25 +02:00
turboderp 9933736be6 TP: Split AVX2 code from .cu objects 2025-09-25 01:47:49 +02:00
turboderp beac5dc47e Remove explicit -gencode args again 2025-08-17 19:23:47 +02:00
turboderp b302438234 Try explicitly setting architectures on nvcc command line 2025-08-17 18:53:47 +02:00
turboderp 5b29ef5008 Try, try again 2025-08-17 18:34:30 +02:00
turboderp f1f05a7732 Stop Windows Torch from disabling half operators 2025-08-17 18:15:51 +02:00
turboderp 33a4f7bc81 Rework compiler flags (should be correct for Windows now) 2025-08-17 08:26:23 +02:00
turboderp f3d6f467a5 TP: New AVX2 all-reduce 2025-08-16 23:24:46 +02:00
turboderp 69750c8a56 Fix duplicate subpackage 2025-08-08 06:54:43 +02:00
turboderp 7bb943bc09 Merge branch 'dev' into setup_py_submodule_renames 2025-08-08 06:53:04 +02:00
turboderp 0c5399bdd1 Refactoring 2025-08-08 06:51:11 +02:00
MikeRoz47 31d8af9bbe Account for renamed/added submodules in setup.py 2025-08-08 02:07:50 +00:00
turboderp db533103b1 Fix #62, include new directory in packages 2025-07-17 20:58:07 +02:00
turboderp 327d1f99d6 Revert to flash_attn>=2.7.4.post1 until the wheel situation is sorted out 2025-07-16 19:12:46 +02:00
turboderp ba4304a44b Pin flash-attn at 2.7.4.post1 2025-07-15 20:42:36 +02:00
turboderp 08dde73e66 Add Formatron support and improved logit masking 2025-07-11 21:29:40 +02:00
turboderp e370ed289d safetensors: Add trie search for tensor file map (marisa_trie) 2025-07-08 19:52:00 +02:00
turboderp 6341b119ef Loader: Add tensor override script 2025-07-08 18:58:43 +02:00
turboderp 2f12246ec3 Fix requirements 2025-04-07 17:30:33 +02:00
Async0x42 5567364846 Fix Issue #2, Error: setup script specifies an absolute path 2025-04-06 22:44:41 -04:00
turboderp 543c4b2771 Initial commit 2025-04-06 14:42:49 +02:00