ik_llama.cpp/ggml
commit e4959f9e46, Iwan Kawrakow, 2024-08-30 08:42:34 +03:00
Experimenting with flash attention on Zen4
This version outperforms the no-FA path up to 16k tokens, but is still slower at 32k.
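
For context, a flash-attention kernel fuses the softmax into the K·Q/V accumulation using a streaming ("online") softmax, so the full N×N score matrix is never materialized. Below is a minimal single-query-row sketch of that accumulation in plain C++; it is illustrative only, not the ggml implementation, and `flash_attn_row` is a hypothetical name:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// One query row q[0..D) attended over N key/value rows k, v (row-major, stride D).
// Streaming softmax: keep a running max M, a running normalizer S, and a running
// value accumulator, rescaling the old state whenever the max grows.
void flash_attn_row(const float * q, const float * k, const float * v,
                    int N, int D, float scale, float * out) {
    float M = -INFINITY;             // running max of scaled logits
    float S = 0.0f;                  // running sum of exp(logit - M)
    std::vector<float> acc(D, 0.0f); // running sum of exp(logit - M) * v[j]

    for (int j = 0; j < N; ++j) {
        float logit = 0.0f;
        for (int d = 0; d < D; ++d) logit += q[d]*k[j*D + d];
        logit *= scale;

        const float M_new = std::max(M, logit);
        const float c = std::exp(M - M_new);   // rescales previously accumulated state
        const float p = std::exp(logit - M_new);

        S = S*c + p;
        for (int d = 0; d < D; ++d) acc[d] = acc[d]*c + p*v[j*D + d];
        M = M_new;
    }
    for (int d = 0; d < D; ++d) out[d] = acc[d]/S; // final normalization
}
```

The per-step rescaling is the bookkeeping a real kernel has to amortize with blocking and SIMD, which is where the tuning work at long contexts comes in.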

Here are the t/s results for LLaMA-3.1-8B on a Ryzen 7950X:

|      test |      no FA (t/s) | Georgi's FA (t/s) | this commit's FA (t/s) |
| --------: | ---------------: | ----------------: | ---------------------: |
|     pp256 |    193.46 ± 2.40 |  193.19 ± 5.07 |   197.73 ± 0.72 |
|     pp512 |    192.23 ± 1.83 |  188.14 ± 0.63 |   194.38 ± 0.69 |
|    pp1024 |    189.06 ± 0.72 |  170.81 ± 4.82 |   191.12 ± 1.47 |
|    pp2048 |    181.92 ± 1.21 |  140.36 ± 1.77 |   184.57 ± 1.20 |
|    pp4096 |    165.10 ± 0.95 |  117.50 ± 0.35 |   168.79 ± 0.50 |
|    pp8192 |    137.48 ± 0.75 |   68.54 ± 1.00 |   148.21 ± 0.64 |
|   pp16384 |    100.35 ± 0.93 |                |   105.14 ± 0.00 |
|   pp32768 |     64.44        |                |    57.36        |

Didn't have the patience to run Georgi's FA at 16k tokens and beyond.
No error estimate on the 32k results as I ran only a single sample.
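
For reference when reading the table, the "a ± b" entries are the mean and sample standard deviation over repeated runs, which is why a single run cannot produce an error bar. A minimal sketch of that computation (the sample values are made up):

```cpp
#include <cmath>
#include <cstdio>
#include <vector>

int main() {
    std::vector<double> samples = {137.1, 138.2, 136.9, 137.7}; // made-up t/s runs

    double mean = 0.0;
    for (double s : samples) mean += s;
    mean /= samples.size();

    double var = 0.0;
    for (double s : samples) var += (s - mean)*(s - mean);
    // Sample variance needs n >= 2; with a single run there is no estimate,
    // which is why the 32k row above carries no "±" term.
    double sd = samples.size() > 1 ? std::sqrt(var/(samples.size() - 1)) : 0.0;

    std::printf("%.2f ± %.2f t/s\n", mean, sd);
}
```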