🔀 #27 - Faster Gemma2
| Author | ikawrakow |
|---|---|
| State | ❌ Closed |
| Created | 2024-08-27 |
| Updated | 2024-08-27 |
Description
In a previous PR I fused the scale → tanh → scale sequence used for "soft-capping" activations into a single GGML_OP_SOFTCAP operation. This PR further fuses GGML_OP_SOFTCAP with GGML_OP_SOFT_MAX into a new GGML_OP_SOFT_CAP_MAX operation. This is useful for, e.g., self-attention in the Gemma-2 series of models, and leads to a significant performance increase.
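Roughly, the fused operation computes the following per row of attention logits (a minimal CPU sketch, assuming softcap(x) = s_after·tanh(s_before·x) as in the scale → tanh → scale sequence above; `soft_cap_max_row` is an illustrative name, not the actual ggml kernel):

```c
#include <math.h>
#include <float.h>

/* Single-row sketch of a fused soft-cap + softmax pass: the capped logits are
 * produced and consumed in one sweep instead of being written back to memory
 * as an intermediate tensor between two separate graph ops. */
static void soft_cap_max_row(float * x, int n, float s_before, float s_after) {
    float max = -FLT_MAX;
    for (int i = 0; i < n; ++i) {
        x[i] = s_after * tanhf(s_before * x[i]);   // soft-cap the logit
        if (x[i] > max) max = x[i];                // track row maximum for stability
    }
    float sum = 0.0f;
    for (int i = 0; i < n; ++i) {
        x[i] = expf(x[i] - max);                   // exponentiate shifted logits
        sum += x[i];
    }
    const float inv_sum = 1.0f/sum;
    for (int i = 0; i < n; ++i) {
        x[i] *= inv_sum;                           // normalize to probabilities
    }
}
```

For Gemma-2 the attention logit cap is 50, i.e. s_before = 1/50 and s_after = 50; the benefit of the fusion is one pass over the data instead of two ops with an intermediate tensor in between.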
In addition, "soft-capping" is added to flash attention. I see this has also been done in mainline llama.cpp in PR-8542 and PR-9159.
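With flash attention the full KQ matrix is never materialized, so the cap has to be applied block by block before the running (online) softmax statistics are updated. The sketch below only illustrates where the soft-cap slots into such an update; the function and variable names are made up and this is not the actual CUDA kernel:

```c
#include <math.h>

/* Illustrative online-softmax block update with soft-capping applied to the raw
 * K·Q scores before the running maximum and running sum are refreshed. The caller
 * initializes running_max to -FLT_MAX and running_sum to 0, and rescales its
 * accumulated V-weighted output by expf(m_old - m_new) whenever the maximum grows. */
static void online_softcap_softmax_block(
        const float * s_block, int nb,   // nb raw K·Q scores for the current block
        float s_before, float s_after,   // softcap(x) = s_after * tanh(s_before * x)
        float * p_block,                 // out: unnormalized probabilities for this block
        float * running_max,             // in/out: running row maximum
        float * running_sum) {           // in/out: running softmax denominator
    const float m_old = *running_max;
    float m_new = m_old;
    for (int i = 0; i < nb; ++i) {
        const float s = s_after * tanhf(s_before * s_block[i]); // soft-cap before the update
        if (s > m_new) m_new = s;
        p_block[i] = s;                                         // stash capped score
    }
    float sum = *running_sum * expf(m_old - m_new);             // rescale previous partial sum
    for (int i = 0; i < nb; ++i) {
        p_block[i] = expf(p_block[i] - m_new);
        sum += p_block[i];
    }
    *running_max = m_new;
    *running_sum = sum;
}
```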
Here are some performance comparisons with llama.cpp (build 3631) for Gemma-2-2b on CUDA (RTX-4080), Metal (30-core M2-Max GPU), AVX2 (Ryzen-7950X) and ARM_NEON (M2-Max CPU). The model is quantized with Q4_K_S (the performance gap between this repo and mainline llama.cpp is smaller for this quantization type than for most other quants).
No Flash attention
| backend | ngl | threads | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|---|---|
| CUDA | 100 | 1 | tg128 | 239.20 ± 0.27 | 244.47 ± 0.42 | 1.022 |
| CUDA | 100 | 1 | pp512 | 18413.90 ± 566 | 18824.91 ± 480 | 1.022 |
| CUDA | 100 | 1 | pp2048 | 17827.18 ± 106 | 18307.66 ± 77 | 1.027 |
| CUDA | 100 | 1 | pp8192 | 8814.67 ± 7.27 | 11673.96 ± 8.07 | 1.324 |
| CUDA | 100 | 1 | pp32768 | 2827.13 ± 12.12 | 4634.12 ± 4.84 | 1.639 |
| AVX2 | 0 | 4 | tg128 | 32.68 ± 0.08 | 35.26 ± 0.05 | 1.079 |
| AVX2 | 0 | 16 | pp512 | 278.34 ± 1.04 | 620.40 ± 3.24 | 2.229 |
| AVX2 | 0 | 16 | pp2048 | 217.57 ± 0.70 | 562.58 ± 2.31 | 2.586 |
| AVX2 | 0 | 16 | pp8192 | 111.29 ± 0.15 | 414.44 ± 0.83 | 3.724 |
| AVX2 | 0 | 16 | pp32768 | 35.78 ± 0.00 | 199.58 ± 0.00 | 5.578 |
| Metal | 100 | 8 | tg128 | 88.82 ± 0.19 | 91.06 ± 0.18 | 1.025 |
| Metal | 100 | 8 | pp512 | 1427.74 ± 1.44 | 1512.66 ± 0.59 | 1.059 |
| Metal | 100 | 8 | pp2048 | 1363.51 ± 0.62 | 1456.12 ± 0.73 | 1.068 |
| Metal | 100 | 8 | pp8192 | 1093.02 ± 0.86 | 1224.56 ± 0.52 | 1.120 |
| Metal | 100 | 8 | pp32768 | 572.65 ± 1.13 | 728.75 ± 5.56 | 1.272 |
| ARM_NEON | 0 | 8 | tg128 | 54.06 ± 0.15 | 62.49 ± 0.18 | 1.156 |
| ARM_NEON | 0 | 8 | pp512 | 148.92 ± 0.15 | 243.09 ± 0.06 | 1.632 |
| ARM_NEON | 0 | 8 | pp2048 | 130.66 ± 1.84 | 226.46 ± 5.41 | 1.733 |
| ARM_NEON | 0 | 8 | pp8192 | 97.95 ± 3.57 | 189.65 ± 4.30 | 1.936 |
For very large prompts (pp32768) the performance difference is striking, reaching 5.5X for AVX2!
Flash attention
Flash attention is only useful on CUDA (on the other 3 platforms I have available, performance is lower with flash attention enabled), so only CUDA results are shown here:
| backend | ngl | threads | fa | test | t/s (llama.cpp) | t/s (PR) | Speedup |
|---|---|---|---|---|---|---|---|
| CUDA | 100 | 1 | 1 | tg128 | 251.86 ± 0.56 | 256.15 ± 0.76 | 1.017 |
| CUDA | 100 | 1 | 1 | pp512 | 19127.14 ± 529.58 | 19712.11 ± 167.06 | 1.031 |
| CUDA | 100 | 1 | 1 | pp2048 | 18641.99 ± 72.13 | 19823.18 ± 91.26 | 1.063 |
| CUDA | 100 | 1 | 1 | pp8192 | 13566.85 ± 111.75 | 16108.68 ± 30.32 | 1.187 |
| CUDA | 100 | 1 | 1 | pp32768 | 6472.16 ± 4.43 | 9053.46 ± 9.68 | 1.399 |
40% faster for 32k tokens is quite nice.