mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-26 17:39:37 +00:00
Add GitHub data (#637)
This commit is contained in:
23
github-data/pull_requests/123-IQ4_XS_R4.md
Normal file
23
github-data/pull_requests/123-IQ4_XS_R4.md
Normal file
@@ -0,0 +1,23 @@
|
||||
### 🔀 [#123](https://github.com/ikawrakow/ik_llama.cpp/pull/123) - IQ4_XS_R4
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2024-12-04 |
|
||||
| **Updated** | 2024-12-04 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Follow up of #118, #119, #120, #121, #122 for `IQ4_XS`.
|
||||
|
||||
I was curious to see if one can make the interleaved rows strategy work for i- and k-quants with their super-blocks & blocks and two levels of scales. `IQ4_XS` seemed easiest, so I tackled that one first. We get a massive speedup on `ARM_NEON` and a more modest (but still significant) gain on `AVX2/Zen4`. I'm not 100% happy with the `Zen4` implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.
|
||||
|
||||
Anyway, here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
|
||||
|
||||
| Platform | Threads | IQ4_XS | IQ4_XS_R4 | Speedup |
|
||||
| ---: | ---: | ---: | ---: | ---: |
|
||||
| ARM_NEON | 8 | 68.23 ± 1.06 | 115.43 ± 0.57 | 1.692 |
|
||||
| Zen4 | 16 | 183.43 ± 0.60 | 223.98 ± 0.12 | 1.221 |
|
||||
| AVX2 | 32 | 195.20 ± 0.40 | 248.25 ± 0.43 | 1.272 |
|
||||
Reference in New Issue
Block a user