### 🔀 [#172](https://github.com/ikawrakow/ik_llama.cpp/pull/172) - CPU Flash Attention improvements

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-15 |
| **Updated** | 2025-01-15 |

---

#### Description

This PR
* Improves FA CPU performance for long contexts
* Fixes K-cache quantized to `Q8_0` when not using FA. This was broken because online `Q8_0` quantization packed quants into blocks of 128 (`block_q8_0_x4`), so `K*Q` became garbage when using `Q8_0` quantized K-cache without FA.

FA performance improvements are for `AVX2/Zen4`. The following table shows a prompt-processing (`pp128` to `pp32768`) comparison between the main branch and this PR with FA using `bf16` or `Q8_0` for KV cache. Model is LLaMA-3.1-8B quantized to `IQ4_XS` and run-time-repacked to `IQ4_XS_R4`. The CPU is Ryzen 7950X. When the quoted uncertainty in the table is zero, I have run just a single repetition in `llama-bench` (it takes quite a while to process 16k or even 32k tokens).

| type_k | type_v | fa | rtr |          test |       t/s (main) |         t/s (pr) | Speedup |
| -----: | -----: | -: | --: | ------------: | ---------------: | ---------------: | ------: |
|   bf16 |   bf16 |  1 |   1 |         pp128 |    275.27 ± 1.63 |    278.40 ± 1.60 |   1.011 |
|   bf16 |   bf16 |  1 |   1 |         pp256 |    276.16 ± 3.46 |    283.51 ± 1.22 |   1.027 |
|   bf16 |   bf16 |  1 |   1 |         pp512 |    274.71 ± 0.51 |    276.83 ± 0.36 |   1.008 |
|   bf16 |   bf16 |  1 |   1 |        pp1024 |    265.81 ± 1.65 |    270.05 ± 0.41 |   1.016 |
|   bf16 |   bf16 |  1 |   1 |        pp2048 |    256.95 ± 0.39 |    260.11 ± 0.14 |   1.012 |
|   bf16 |   bf16 |  1 |   1 |        pp4096 |    237.97 ± 0.37 |    242.29 ± 0.75 |   1.018 |
|   bf16 |   bf16 |  1 |   1 |        pp8192 |    206.34 ± 1.25 |    213.98 ± 0.35 |   1.037 |
|   bf16 |   bf16 |  1 |   1 |       pp16384 |    156.40 ± 0.00 |    173.44 ± 0.00 |   1.109 |
|   bf16 |   bf16 |  1 |   1 |       pp32768 |     82.97 ± 0.00 |    122.47 ± 0.00 |   1.476 |
|   q8_0 |   q8_0 |  1 |   1 |         pp128 |    273.44 ± 1.04 |    279.27 ± 1.43 |   1.021 |
|   q8_0 |   q8_0 |  1 |   1 |         pp256 |    278.57 ± 1.03 |    283.00 ± 0.63 |   1.016 |
|   q8_0 |   q8_0 |  1 |   1 |         pp512 |    271.56 ± 0.05 |    275.97 ± 0.79 |   1.016 |
|   q8_0 |   q8_0 |  1 |   1 |        pp1024 |    264.31 ± 0.89 |    269.35 ± 0.33 |   1.019 |
|   q8_0 |   q8_0 |  1 |   1 |        pp2048 |    253.70 ± 0.24 |    258.22 ± 0.36 |   1.018 |
|   q8_0 |   q8_0 |  1 |   1 |        pp4096 |    232.07 ± 0.88 |    236.83 ± 1.38 |   1.021 |
|   q8_0 |   q8_0 |  1 |   1 |        pp8192 |    199.90 ± 1.37 |    204.74 ± 0.34 |   1.024 |
|   q8_0 |   q8_0 |  1 |   1 |       pp16384 |    153.62 ± 0.00 |    164.50 ± 0.00 |   1.071 |
|   q8_0 |   q8_0 |  1 |   1 |       pp32768 |    103.48 ± 0.00 |    113.35 ± 0.00 |   1.095 |
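
To illustrate why the `Q8_0` K-cache broke without FA, here is a minimal sketch of the layout mismatch described above. It is not the code from this PR. The structs `BlockQ8_0` and `BlockQ8_0x4` are simplified, hypothetical stand-ins: they use `float` scales instead of fp16, and they assume the interleaved block stores four scales followed by 128 quants. The helper names (`quantize_x4`, `dot_x4`, `dot_as_plain`, `example_row`) are likewise invented for illustration.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <vector>

constexpr int QK8_0 = 32;  // quants per plain Q8_0 block

// Simplified stand-ins (assumption: float scales instead of fp16).
struct BlockQ8_0   { float d;    int8_t qs[QK8_0];     };  // 1 scale + 32 quants
struct BlockQ8_0x4 { float d[4]; int8_t qs[4 * QK8_0]; };  // 4 scales, then 128 quants
static_assert(sizeof(BlockQ8_0x4) == 4 * sizeof(BlockQ8_0),
              "same number of bytes, different layout");

// Quantize 32 floats: scale d = max|x| / 127, quants = round(x / d).
static float quantize_block(const float* x, int8_t* qs) {
    float amax = 0.0f;
    for (int i = 0; i < QK8_0; ++i) amax = std::max(amax, std::fabs(x[i]));
    const float d = amax / 127.0f;
    for (int i = 0; i < QK8_0; ++i)
        qs[i] = d > 0.0f ? static_cast<int8_t>(std::lround(x[i] / d)) : 0;
    return d;
}

// Pack a row (length a multiple of 128) in the interleaved x4 layout.
std::vector<BlockQ8_0x4> quantize_x4(const std::vector<float>& x) {
    std::vector<BlockQ8_0x4> out(x.size() / (4 * QK8_0));
    for (size_t b = 0; b < out.size(); ++b)
        for (int j = 0; j < 4; ++j)
            out[b].d[j] = quantize_block(&x[(4 * b + j) * QK8_0],
                                         out[b].qs + j * QK8_0);
    return out;
}

// K*Q as computed by a consumer that understands the x4 layout.
float dot_x4(const std::vector<BlockQ8_0x4>& k, const std::vector<float>& q) {
    float sum = 0.0f;
    for (size_t b = 0; b < k.size(); ++b)
        for (int j = 0; j < 4; ++j)
            for (int i = 0; i < QK8_0; ++i)
                sum += k[b].d[j] * k[b].qs[j * QK8_0 + i]
                     * q[(4 * b + j) * QK8_0 + i];
    return sum;
}

// K*Q as computed by a consumer that assumes plain Q8_0 blocks
// and reads the very same bytes: scales and quants get mixed up.
float dot_as_plain(const std::vector<BlockQ8_0x4>& k, const std::vector<float>& q) {
    std::vector<BlockQ8_0> plain(4 * k.size());
    std::memcpy(plain.data(), k.data(), plain.size() * sizeof(BlockQ8_0));
    float sum = 0.0f;
    for (size_t b = 0; b < plain.size(); ++b)
        for (int i = 0; i < QK8_0; ++i)
            sum += plain[b].d * plain[b].qs[i] * q[b * QK8_0 + i];
    return sum;
}

// Hypothetical test row: 128 values, where group j holds the constant j + 1.
std::vector<float> example_row() {
    std::vector<float> x(4 * QK8_0);
    for (size_t i = 0; i < x.size(); ++i) x[i] = static_cast<float>(i / QK8_0 + 1);
    return x;
}
```

With `example_row()` as K and an all-ones Q, the exact dot product is 320. `dot_x4` recovers it to within rounding, while `dot_as_plain` reads quant bytes as scales (and scale bytes as quants) and returns a value nowhere near it. Both layouts occupy the same number of bytes, so nothing fails loudly; the result is simply wrong.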