Add GitHub data (#637)

2026-03-11 14:30:02 +00:00 · 2025-07-22 18:18:40 +02:00
parent 9513222ba5
commit 94aa54df76
626 changed files with 175142 additions and 0 deletions
--- a/github-data/pull_requests/136-Q2_K_R4.md
+++ b/github-data/pull_requests/136-Q2_K_R4.md
@@ -0,0 +1,40 @@
+### 🔀 [#136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4
+
+| **Author** | `ikawrakow` |
+| :--- | :--- |
+| **State** | ❌ **Closed** |
+| **Created** | 2024-12-11 |
+| **Updated** | 2024-12-11 |
+
+---
+
+#### Description
+
+Follow-up to #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for `Q2_K`.
+
+This completes the R4 implementation for k-quants on `ARM_NEON`, `AVX2`, and `Zen4`.
+
+We get significant performance gains on all platforms. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max), and `AVX2` (Ryzen-5975WX):
+
+| Platform |  Threads | Q2_K_S | Q2_K_R4 | Speedup |
+| ---: | ---: | ---: | ---: | ---: |
+| ARM_NEON |  8 |  73.79 ± 1.92  | 109.07 ± 0.58 | 1.478 |
+| Zen4            | 16 | 205.95 ± 0.77  | 256.19 ± 0.26  | 1.244 |
+| AVX2           | 32 | 214.42 ± 0.54 |  286.91 ± 0.63  | 1.338 |
+
+Because `Q2_K` is smaller than the other k-quants, the CPU can do more work before the available memory bandwidth saturates when running TG. Hence, we also get non-negligible performance gains for TG on all platforms.
+Here are the results for `TG-128` on LLaMA-3.1-8B with different numbers of threads:
+
+| Platform |  Threads | Q2_K_S | Q2_K_R4 | Speedup |
+| ---: | ---: | ---: | ---: | ---: |
+| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
+|                      | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
+|                      | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
+| Zen4            | 1 |  6.60 ± 0.02  | 9.08 ± 0.12  |  1.376 |
+|                      | 2 |  12.12 ± 0.01 | 16.40 ± 0.00  |  1.353 |
+|                      | 4 |  19.12 ± 0.56  | 20.72 ± 0.19  |  1.084 |
+| AVX2           | 2 | 5.93 ± 0.02   | 10.16 ± 0.30  | 1.713 |
+|                     | 4 | 11.24 ± 0.00    |  17.59 ± 0.01 | 1.565 |
+|                     | 8 |  18.62 ± 0.03  | 21.44 ± 0.00  | 1.151 |
+
+It is actually too bad that `Q2_K` is such a low-quality quantization, as its performance is really good. Perhaps I should try to improve it? When I was developing it back then, it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.
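
The Speedup columns in the two tables above appear to be the ratio of the `Q2_K_R4` mean to the `Q2_K_S` mean. The following is a minimal sketch that recomputes that ratio for a few of the TG-128 rows. The function name `speedup` and the `tg_128` list are hypothetical, introduced only for illustration; the numbers are copied from the table, and the ± uncertainties are ignored.

```python
def speedup(baseline: float, candidate: float) -> float:
    """Return candidate / baseline, the ratio assumed to be reported as Speedup."""
    return candidate / baseline


# (platform, threads, Q2_K_S mean, Q2_K_R4 mean), copied from the TG-128 table.
tg_128 = [
    ("ARM_NEON", 2, 10.34, 12.81),
    ("ARM_NEON", 8, 32.36, 36.02),
    ("Zen4", 1, 6.60, 9.08),
    ("Zen4", 4, 19.12, 20.72),
    ("AVX2", 2, 5.93, 10.16),
    ("AVX2", 8, 18.62, 21.44),
]

for platform, threads, q2_k_s, q2_k_r4 in tg_128:
    # Print the ratio with three decimals, matching the precision used in the table.
    print(f"{platform:<8} {threads:>2} threads: {speedup(q2_k_s, q2_k_r4):.3f}")
```

Read this way, the ratio shrinks on every platform as the thread count grows, which is consistent with the memory-bandwidth explanation given for TG.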