Files
ik_llama.cpp/github-data/pull_requests/99 - Enable IQ4_NL for KV-cache in token generation using Flash Attention.md
2025-07-23 13:31:53 +02:00

41 lines
1.4 KiB
Markdown

### 🔀 [#99](https://github.com/ikawrakow/ik_llama.cpp/pull/99) - Enable IQ4_NL for KV-cache in token generation using Flash Attention
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-20 |
| **Updated** | 2024-10-21 |
---
#### Description
Only added for head size = 128 for now, we can add other head sizes if needed.
For me `-ctk q8_0 -ctv iq4_nl` is the most useful combination in terms of the compromise between generation quality and KV-cache size.
**Update**
Based on @Nexesenex comment in #92, added `IQ4_NL + IQ4_NL` as a possible KV-cache combination for head size of 128. Hopefully this is a better alternative than `Q4_0 + Q4_0` for the VRAM poor.
---
#### 💬 Conversation
👤 **saood06** commented the **2024-10-20** at **18:48:37**:<br>
Since you're enabling q8_0/iq4_nl by default you should update the on_no_fattn_vec_case function in fattn-common.cuh to mention it.
---
👤 **ikawrakow** commented the **2024-10-21** at **08:10:33**:<br>
> Since you're enabling q8_0/iq4_nl by default you should update the on_no_fattn_vec_case function in fattn-common.cuh to mention it.
Thanks for pointing out. It is now updated to reflect the possible quantized cache combinations.
---
👤 **Nexesenex** commented the **2024-10-21** at **09:47:46**:<br>
It works. In the name of the VRAM poor that I do so well represent, thanks! xD