mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-07 04:20:03 +00:00
This version outperforms no-FA up to 16k tokens, but still becomes slower at 32k. Here is the t/s for LLaMA-3.1-8B on a Ryzen-7950X:

| test | t/s no FA | Georgi FA | This commit FA |
| --------: | ---------------: | -------------: | --------------: |
| pp256 | 193.46 ± 2.40 | 193.19 ± 5.07 | 197.73 ± 0.72 |
| pp512 | 192.23 ± 1.83 | 188.14 ± 0.63 | 194.38 ± 0.69 |
| pp1024 | 189.06 ± 0.72 | 170.81 ± 4.82 | 191.12 ± 1.47 |
| pp2048 | 181.92 ± 1.21 | 140.36 ± 1.77 | 184.57 ± 1.20 |
| pp4096 | 165.10 ± 0.95 | 117.50 ± 0.35 | 168.79 ± 0.50 |
| pp8192 | 137.48 ± 0.75 | 68.54 ± 1.00 | 148.21 ± 0.64 |
| pp16384 | 100.35 ± 0.93 | | 105.14 ± 0.00 |
| pp32768 | 64.44 | | 57.36 |

Didn't have the patience to run Georgi's FA at 16k tokens. No error estimate on the 32k result as I only ran 1 sample.
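The crossover described in this message can be checked from the table's mean values. The snippet below is only a sketch for reading the numbers, not part of the commit: the names `results`, `speedup`, and `first_slower_length` are hypothetical, the error bars are ignored, and the 32k row rests on a single sample.

```python
# Mean t/s copied from the table: prompt length -> (no FA, this commit FA).
# Error bars are dropped; the 32768 row comes from a single run.
results = {
    256: (193.46, 197.73),
    512: (192.23, 194.38),
    1024: (189.06, 191.12),
    2048: (181.92, 184.57),
    4096: (165.10, 168.79),
    8192: (137.48, 148.21),
    16384: (100.35, 105.14),
    32768: (64.44, 57.36),
}


def speedup(no_fa, fa):
    """Ratio of FA throughput to no-FA throughput (> 1 means FA is faster)."""
    return fa / no_fa


def first_slower_length(table):
    """Smallest prompt length where FA is slower than no FA, or None."""
    for n_tokens in sorted(table):
        no_fa, fa = table[n_tokens]
        if fa < no_fa:
            return n_tokens
    return None


if __name__ == "__main__":
    for n_tokens in sorted(results):
        no_fa, fa = results[n_tokens]
        change = 100.0 * (speedup(no_fa, fa) - 1.0)
        print(f"pp{n_tokens:<6} {change:+6.1f}% vs no FA")
    print("first length where FA is slower:", first_slower_length(results))
```

With the means above, FA stays ahead through pp16384 and the first length where it falls behind is pp32768, which matches the summary in the first sentence.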