Mirror of https://github.com/ikawrakow/ik_llama.cpp.git (synced 2026-02-03 13:04:59 +00:00)

Add GitHub data (#637)
@@ -0,0 +1,26 @@

### 🔀 [#1](https://github.com/ikawrakow/ik_llama.cpp/pull/1) - Offload Bitnet token embeddings to the GPU

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-07-26 |
| **Updated** | 2024-07-26 |

---

#### Description

This PR puts the `token_embedding` tensor on the GPU for the Bitnet-1.58b model. This results in significantly improved performance on CUDA/Metal, as can be seen in the table. `CUDA` is an RTX-4080 (with a Ryzen-7950X host CPU), `Metal` is a 30-core M2-Max GPU.

| model | backend | test | t/s (PR) | t/s (main) | Speedup |
| ------ | ---------: | ------: | ------------: | ------------: | ------: |
| IQ2_BN | CUDA | tg128 | 322.10 ± 0.07 | 241.34 ± 0.27 | 1.325 |
| IQ1_BN | CUDA | tg128 | 301.44 ± 0.12 | 229.21 ± 0.89 | 1.315 |
| IQ2_BN | CUDA | pp512 | 10780 ± 164 | 9811 ± 25 | 1.099 |
| IQ1_BN | CUDA | pp512 | 10661 ± 172 | 9655 ± 21 | 1.104 |
| IQ2_BN | Metal | pp512 | 723.19 ± 0.53 | 722.66 ± 0.47 | 1.001 |
| IQ1_BN | Metal | pp512 | 698.25 ± 1.91 | 697.59 ± 2.12 | 1.000 |
| IQ2_BN | Metal | tg128 | 110.39 ± 0.13 | 95.22 ± 0.55 | 1.159 |
| IQ1_BN | Metal | tg128 | 76.70 ± 0.05 | 69.33 ± 0.07 | 1.106 |

Bitnet uses the same tensor for token embeddings and for output. When the token embedding tensor is placed on the CPU, as done in mainline `llama.cpp` and here before this PR, the final matrix multiplication with the output tensor is performed on the CPU even when using a GPU backend, which results in a significant drop in performance (the larger the performance differential between the GPU and the host CPU, the larger the effect). As this might affect other models as well (e.g., Gemma), it would be useful to find a more general solution, but I'm finding the back-end stuff in `llama.cpp` to be opaque and hard to understand, so I solved it in a hacky way just for Bitnet for now.
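For illustration, here is a minimal sketch of the tied-embeddings pattern described above (illustrative only, not the actual `build_bitnet()` graph code; the helper name and signature are made up):

```
#include "ggml.h"

// Hypothetical sketch: Bitnet reuses tok_embd both as the input embedding table
// and as the output projection. Whichever buffer tok_embd lives in (CPU or GPU)
// is where the scheduler will run this final matmul.
static struct ggml_tensor * build_output_logits(struct ggml_context * ctx,
                                                struct ggml_tensor * tok_embd, // [n_embd, n_vocab]
                                                struct ggml_tensor * cur) {    // [n_embd, n_tokens]
    // logits: [n_vocab, n_tokens]
    return ggml_mul_mat(ctx, tok_embd, cur);
}
```

Placing `tok_embd` in a GPU buffer therefore moves this (large) matmul off the host CPU.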
@@ -0,0 +1,13 @@

### 🔀 [#10](https://github.com/ikawrakow/ik_llama.cpp/pull/10) - iq4_k: speedup quantization by a factor of ~2

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-03 |
| **Updated** | 2024-08-03 |

---

#### Description

It is interesting to observe that `clang` produces code that is ~6X faster than the `GCC` result on a simple benchmark that measures the speed of the `best_index_iq4n` function (which is the bottleneck during `IQ4_K` quantization). But when this is used in practice in `quantize_row_iq4_k_impl_bs16`, the `clang` executable is actually worse than the `GCC` executable. Either way, both compilers need a hand, so this PR gives it to them. This gives us a ~2X speedup in the `IQ4_K` quantization.
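For context, the function in question finds, for every model weight, the nearest entry in a small non-uniform quantization grid, so it sits in the innermost quantization loop. A naive scalar version of that kind of search (an assumption for illustration, not the code from this repository) looks like this:

```
#include <cmath>

// Return the index of the grid entry closest to x (naive linear scan).
// Real implementations replace this with branchless/SIMD variants, which is
// exactly where compiler code generation can diverge wildly.
static int best_index_nl(const float * grid, int n_grid, float x) {
    int   best   = 0;
    float best_d = std::fabs(x - grid[0]);
    for (int i = 1; i < n_grid; ++i) {
        const float d = std::fabs(x - grid[i]);
        if (d < best_d) { best_d = d; best = i; }
    }
    return best;
}
```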
@@ -0,0 +1,33 @@

### 🔀 [#101](https://github.com/ikawrakow/ik_llama.cpp/pull/101) - Enable q6_0 in flash attention

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-22 |

---

#### Description

As with `IQ4_NL`, just for a head size of 128 for now. Without `GGML_CUDA_FA_ALL_QUANTS` set, only `Q6_0 + Q5_0` and `Q8_0 + Q6_0` are included. With this, the VRAM-poor have better options for selecting the best possible quantized KV cache (as allowed by VRAM, model size, and context length) from

| K-cache | V-cache | BPV |
| -------: | --------: | ----: |
| Q4_0 | Q4_0 | 4.5 |
| IQ4_NL | IQ4_NL | 4.5 |
| Q6_0 | Q5_0 | 6.0 |
| Q8_0 | IQ4_NL | 6.5 |
| Q8_0 | Q6_0 | 7.5 |
| Q8_0 | Q8_0 | 8.5 |
| F16 | F16 | 16.0 |
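To put the BPV (bits per value, averaged over K and V) column into perspective, here is a rough back-of-the-envelope estimate of total KV-cache size; the model dimensions below are assumptions for illustration, not taken from a particular model:

```
#include <cstdio>

// KV-cache bytes ≈ n_layer * n_ctx * n_head_kv * head_dim * 2 * BPV / 8
// (the factor 2 accounts for storing both K and V).
int main() {
    const double n_layer = 32, n_ctx = 32768, n_head_kv = 8, head_dim = 128; // assumed dimensions
    const double bpv[] = { 4.5, 6.0, 7.5, 16.0 }; // Q4_0+Q4_0, Q6_0+Q5_0, Q8_0+Q6_0, F16+F16
    for (double b : bpv) {
        const double gib = n_layer*n_ctx*n_head_kv*head_dim*2.0*b/8.0 / (1024.0*1024.0*1024.0);
        printf("BPV %4.1f -> %.2f GiB\n", b, gib);
    }
    return 0;
}
```

With these assumed dimensions, F16+F16 needs 4 GiB while Q6_0+Q5_0 needs 1.5 GiB, which is the kind of trade-off the table above is about.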
---

#### 💬 Conversation

👤 **Nexesenex** commented the **2024-10-21** at **18:14:38**:<br>

Merged in my fork of Kobold CPP. K q6_0 / V q5_0 works like a charm. I also activated 16/6, 6/iq4_nl, as well as 8/6 and 6/6; I'll test them tonight or tomorrow.

Thank you (very, very much) and congratulations for this, IK. I'm delighted to have these options and thus the best inference quality I can get right now. I'm going to release an updated version of my fork soon, with the proper credits of course, so everyone interested and not too scared of downloading my patchwork can enjoy the fruit of your labors on these KV quants, as some already enjoyed a bit more speed on CPU thanks to some of your commits that I was able to merge a few months ago!
@@ -0,0 +1,13 @@

### 🔀 [#102](https://github.com/ikawrakow/ik_llama.cpp/pull/102) - Add support for Granite and GraniteMoE models

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-22 |
| **Updated** | 2024-10-22 |

---

#### Description

On CUDA, GraniteMoE-1b suffers from precision issues in the attention portion, so I became curious to see why. One way to avoid the NaNs is to set the precision of the `K*Q` matrix multiplication to `F32`. What also fixes it is to apply the attention scale to `Q` before the `K*Q` multiplication (the solution I went with in this PR). One can apply the scale before or after RoPE; it works in both cases, so this really narrows it down to the `K*Q` multiplication suffering from precision issues when done in `f16`. It is strange how these models were trained in the first place.
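A minimal sketch of the approach using the public `ggml` API (illustrative only; the actual change lives in the Granite build function, and the variable names here are assumptions):

```
#include "ggml.h"

// Instead of computing KQ = K*Q and scaling afterwards (which lets the f16
// accumulation of K*Q grow large), scale Q first so the K*Q products stay in a
// numerically safer range.
static struct ggml_tensor * scaled_kq(struct ggml_context * ctx,
                                      struct ggml_tensor * k,   // [head_dim, n_kv,     n_head]
                                      struct ggml_tensor * q,   // [head_dim, n_tokens, n_head]
                                      float                kq_scale) {
    struct ggml_tensor * q_scaled = ggml_scale(ctx, q, kq_scale); // attention scale applied to Q
    return ggml_mul_mat(ctx, k, q_scaled);                        // K*Q, no forced F32 precision needed
}
```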
@@ -0,0 +1,16 @@

### 🐛 [#105](https://github.com/ikawrakow/ik_llama.cpp/pull/105) - Fix quantized k-cache without FA

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-24 |
| **Updated** | 2024-10-24 |

---

#### Description

Ref https://github.com/ggerganov/llama.cpp/pull/10032
Ref https://github.com/ggerganov/llama.cpp/pull/10021

Closes #103
@@ -0,0 +1,25 @@

### 🔀 [#106](https://github.com/ikawrakow/ik_llama.cpp/pull/106) - Bitnet changes

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-24 |
| **Updated** | 2024-10-25 |

---

#### Description

* Change `IQ1_BN` and `IQ2_BN` to have per-row scales. This way we can handle Bitnet models with and without separate tensor scales.
* Remove `IQ1_TN` and `IQ2_TN`. With the above change these are now redundant. `IQ1_BN` and `IQ2_BN` are also faster, so there is no reason to keep them around.
* Change `build_bitnet()` to use the standard `llm_build_kv()` function for the self-attention portion. I was hoping this would also allow using FA, but nope: the Bitnet models have a strange head size of 100 that is not supported by the FA implementations.

Everything works except - can you guess? - Metal. There is something wrong with the dot product kernels and I simply don't see what. I have to fix Metal before merging.

On CUDA (RTX-4080) we now get 368 t/s for TG-128 with the 3.3B Bitnet model (`IQ2_BN`). When I first added Bitnet support we were at ~320 t/s, so quite an improvement since then.

**Update**

I wasted quite some time trying to figure out why the Bitnet changes don't work on Metal. In the end it turned out that it is PR #98 that breaks the Metal back-end. So, this PR reverts #98.

@agray3 Do you have the ability to investigate why #98 breaks the Metal back-end?
@@ -0,0 +1,18 @@

### 🔀 [#107](https://github.com/ikawrakow/ik_llama.cpp/pull/107) - Faster IQ1_BN Metal implementation

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |

---

#### Description

On my 30-core M2-Max, TG-128 for Bitnet-1.58b-3.3B improves from 82 t/s to 94.7 t/s.
PP-512 goes from 686 t/s to 702 t/s.

Integer multiplications are expensive, so the trick used is to replace them with shifts and additions.

There is also a minor `IQ2_BN` PP-512 improvement (710 -> 714 t/s).
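The multiply-free trick is standard strength reduction: a multiplication by a small constant is rewritten as a few shifts and adds, which are cheap on the GPU's integer units. A hedged illustration (not the actual Metal kernel, which works on the packed ternary data):

```
// Strength reduction: replace integer multiplications by small constants
// with shifts and additions (illustrative only).
static inline int mul3(int x)  { return (x << 1) + x; }                        //  3*x =  2x + x
static inline int mul5(int x)  { return (x << 2) + x; }                        //  5*x =  4x + x
static inline int mul9(int x)  { return (x << 3) + x; }                        //  9*x =  8x + x
static inline int mul27(int x) { return (x << 4) + (x << 3) + (x << 1) + x; }  // 27*x = 16x+8x+2x+x
```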
@@ -0,0 +1,17 @@

### 🔀 [#108](https://github.com/ikawrakow/ik_llama.cpp/pull/108) - Another Bitnet performance improvement on Metal

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |

---

#### Description

This time just the dequantize function.

For Bitnet-1.58b-3B on a 30-core M2-Max GPU:
* `IQ1_BN` goes from 702 t/s to 716 t/s
* `IQ2_BN` goes from 714 t/s to 743 t/s
@@ -0,0 +1,14 @@

### 🔀 [#109](https://github.com/ikawrakow/ik_llama.cpp/pull/109) - Bitnet CUDA improvements

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |

---

#### Description

`IQ1_BN` TG-128 on RTX-4080 goes to 340 t/s, up from 318 t/s.
On the front page the performance listed for `IQ1_BN` on CUDA is 301 t/s, so a pretty nice improvement since then.
@@ -0,0 +1,7 @@

### 🔀 [#11](https://github.com/ikawrakow/ik_llama.cpp/pull/11) - Faster iq3_k and iq5_k quantization

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-05 |
| **Updated** | 2024-08-05 |
@@ -0,0 +1,15 @@

### 🔀 [#110](https://github.com/ikawrakow/ik_llama.cpp/pull/110) - Bitnet: use the fused mul-silu in the FFN network

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |

---

#### Description

I had forgotten that `build_bitnet()` does not use the standard `llm_build_ffn` function, so the fused mul-silu didn't get used automatically for Bitnet when I added it to `llm_build_ffn`.

This gives us another ~1% speedup for TG-128 on Metal and CUDA.
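For reference, the point of the fused op is to compute `silu(gate) * up` in a single pass over memory instead of a separate SILU op followed by an element-wise multiply. A plain scalar sketch of the fused operation (illustrative only, not the ggml kernel):

```
#include <cmath>

// Fused "mul-silu": y[i] = silu(gate[i]) * up[i], computed in one sweep.
static void fused_mul_silu(float * y, const float * gate, const float * up, int n) {
    for (int i = 0; i < n; ++i) {
        const float g = gate[i];
        y[i] = (g / (1.0f + std::exp(-g))) * up[i];
    }
}
```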
@@ -0,0 +1,13 @@

### 🔀 [#111](https://github.com/ikawrakow/ik_llama.cpp/pull/111) - Use fused mul - unary op also for MoE models

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-26 |
| **Updated** | 2024-10-26 |

---

#### Description

This gives us a ~1% speedup for MoE models on CUDA and Metal.
@@ -0,0 +1,45 @@

### 🔀 [#112](https://github.com/ikawrakow/ik_llama.cpp/pull/112) - Faster MoE inference

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-31 |
| **Updated** | 2025-06-23 |

---

#### Description

This PR
* Adds a new op `GGML_MULTI_ADD`, used to sum up the contributions of the selected experts (sketched below). It results in, e.g., a 7% improvement of token generation speed for Granite-1B-MoE on CUDA (RTX-4080).
* Fixes a massive inefficiency in the Metal implementation of MoE matrix multiplications (`kernel_mul_mm_id`). This leads to a nearly 6-fold prompt processing speedup for Granite-1B-MoE on Metal. But even for a much larger model such as Mixtral-8x7B the speedup is nearly a factor of 2 compared to current mainline `llama.cpp` (build: `8f275a7c (3989)`).
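A scalar sketch of what such a multi-add computes; the semantics are inferred from the description and from the snippet quoted in the conversation below, and the layout assumption is illustrative rather than the actual ggml implementation:

```
#include <cstddef>

// Sum the outputs of the n_expert_used selected experts for every token.
// Assumed layout: experts is [n_embd, n_expert_used, n_tokens], i.e. element
// (j, e, t) lives at index j + e*n_embd + t*n_embd*n_expert_used.
static void multi_add(float * dst, const float * experts,
                      int n_embd, int n_expert_used, int n_tokens) {
    for (int t = 0; t < n_tokens; ++t) {
        float       * out = dst     + (size_t)t * n_embd;
        const float * src = experts + (size_t)t * n_embd * n_expert_used;
        for (int j = 0; j < n_embd; ++j) out[j] = src[j];        // expert 0
        for (int e = 1; e < n_expert_used; ++e) {
            const float * x = src + (size_t)e * n_embd;
            for (int j = 0; j < n_embd; ++j) out[j] += x[j];     // accumulate the remaining experts
        }
    }
}
```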
---

#### 💬 Conversation

👤 **Nexesenex** commented the **2025-06-23** at **12:59:59**:<br>

Hey IK.

```
if (n_expert_used == 1) {
    return ggml_cont(ctx, ggml_view_2d(ctx, experts, n_embd, n_tokens, experts->nb[2], 0));
}
if (n_expert_used == 2) {
    return ggml_add(ctx, ggml_view_2d(ctx, experts, n_embd, n_tokens, experts->nb[2], 0),
                         ggml_view_2d(ctx, experts, n_embd, n_tokens, experts->nb[2], experts->nb[1]));
}
return ggml_multi_add(ctx, ggml_view_2d(ctx, experts, n_embd, n_tokens, experts->nb[2], 0), n_expert_used);
```

What about the case where `n_expert_used` >= 3?

For example, on Mistral 8x22b, there's a perplexity benefit to using 3 experts instead of 2 (-2% PPL 512).

---

👤 **Nexesenex** commented the **2025-06-23** at **13:08:58**:<br>

Oh, silly me, I just read the code too fast; I understand now.
Sorry!
@@ -0,0 +1,262 @@

### 🔀 [#113](https://github.com/ikawrakow/ik_llama.cpp/pull/113) - Trellis quantization

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-11-15 |
| **Updated** | 2025-06-01 |

---

#### Description

The latest quantization hype is `QTIP` - [paper](https://arxiv.org/pdf/2406.11235), [repository](https://github.com/Cornell-RelaxML/qtip). They use a Trellis approach and report impressive results, so I decided to look into this more closely.

This PR implements what they call "3INST" in their paper. Basically, if we have a seed `seed`, we generate `N` quantized values `q_i` via

```
uint32_t u32;
const float16_t * h = reinterpret_cast<const float16_t *>(&u32); // u32 viewed as two fp16 values
for (int i = 0; i < N; ++i) {
    seed = a * seed + b;            // LCG state update
    u32  = (mask1 & seed) ^ mask2;
    q[i] = h[0] + h[1];             // sum of the two fp16 halves
}
```

where `a, b, mask1` and `mask2` are suitable constants. This generates values that are (nearly) normally distributed. One uses this to describe a group of `N` quants with a single `L`-bit seed (index). Apart from borrowing the "3INST" algorithm from the QTIP paper, the implementation here has nothing else in common with QTIP - there are no Hadamard transforms, and no (tail-biting) [Viterbi algorithm](https://en.wikipedia.org/wiki/Viterbi_algorithm) is utilized during quantization. Instead, in the usual i- and k-quants style, quants are organized in blocks and super-blocks with suitable block scales, and the search for the best seed during quantization is done via a clustering algorithm.

The PR adds 3 new quantization types:
* `IQ2_KT`: `L=16` bits for groups of `N=8` quants. Block size is 32 with a 4-bit block scale, plus a single float scale per tensor row (the 32 bits added by this scale can be safely neglected for typical tensor row sizes), so we end up using 2.125 bpw (16/8 + 4/32).
* `IQ3_KT`: `L=12` bits for groups of `N=4` quants. Block size is also 32 with a 4-bit block scale, so 3.125 bpw.
* `IQ4_KT`: `L=15` bits for groups of `N=4` quants. Blocks of 32 with 8-bit block scales, so 4.0 bpw.
### Quantization accuracy

This figure shows the quantization error `PPL(Q)/PPL(bf16)-1` for LLaMA-3.1-8B-Instruct (context length of 8192 tokens). The blue symbols are k-quants, the black symbols are i-quants, cyan symbols are iqk-quants (not available in mainline `llama.cpp`), and the orange symbols are the Trellis quants added by this PR. We do see a small but noticeable improvement compared to i- and iqk-quants, with about 0.2 fewer bpw required to achieve the same quantization error.



How does this compare to the QTIP paper? Unfortunately they report results without fine tuning only for LLaMA-v2. The table shows a comparison between the 2-bit quantizations for LLaMA-v2-7B (the QTIP results are taken from Table 3 in their paper, context length is 4096 tokens)

| Quantization | PPL(f16) | PPL (Q) | Quantization error |
|------------: | ----: | ----: | ---: |
| QTIP 2 bpw | 5.12 | 6.82 | 33.2% |
| IQ2_KT | 4.94 | 6.36 | 28.7% |

Although there are small differences between the PPL computed by `llama.cpp` and by the tools used by the QTIP authors, the quantization error as defined above is basically independent of the specifics of the PPL calculation, so we see that the 2 bpw quantization implemented here slightly outperforms QTIP without fine tuning (at the expense of using 0.125 bpw more). Given this, and the above graph, my conclusion is that Trellis-based quantization is a small improvement compared to i-, k-, and iqk-quants, but nowhere near the hype observed around the Internet.

### Performance

The QTIP authors give TG speed for their 2 bpw variant on an RTX-6000 Ada GPU (see [here](https://github.com/Cornell-RelaxML/qtip?tab=readme-ov-file#fast-inference)) and a 7B LLaMA model. My GPU is an RTX-4080 (so same generation as theirs, but lower specs). I made a quick attempt to get QTIP going in my environment to have an apples-to-apples performance comparison, but it was not successful, so I will use the ratio between their `f16` performance on the RTX-6000 (55.9 t/s) and my `fp16` performance on the RTX-4080 (46.2 t/s) to translate QTIP performance on the RTX-6000 (188 t/s) into estimated performance on the RTX-4080:
```
QTIP (2 bpw, RTX-4080) = fp16(RTX-4080)/fp16(RTX-6000) * QTIP (2 bpw, RTX-6000) = 46.2/55.9*188 = 155.4 t/s
```
In comparison, I get 194 t/s for `IQ2_KT` (with flash attention enabled, which I assume they also use). These results are with the output tensor left as `f16` (which is what is done in QTIP). `IQ2_XXS` achieves 208 t/s (output as `f16`) or 216 t/s (output as `Q5_K`), so QTIP performance is far behind the performance of a model of similar size using a more efficient quantization.

### Caveats

* Quantization is only implemented for a CPU with `AVX2` support. The search for the optimum seed is extremely expensive (the QTIP authors say "prohibitive" for `L >= 12` without their tail-biting search space reduction), so I had to SIMDify it to not have to wait forever for a quantization to finish. This PR being mostly a POC for now, I did not want to spend the time implementing it for other instruction sets (or even porting it to run on a GPU).
* Even with `AVX2`, quantization is slow - depending on quantization type it takes between 2.5 and 4.5 minutes to quantize LLaMA-3.1-8B on a 32-core Ryzen-5975WX CPU.
* Inference is only implemented on CUDA. Due to the "3INST" algorithm, I expect low performance on the CPU and on the Apple GPU, so I did not bother to implement it for those.
* There are no quantized matrix-vector kernels, so the implementation is via the `DMMV` mechanism in `llama.cpp`. The algorithm outputs float values, so one needs to convert to `int8_t` to use the usual quantized dot products. The cost of this conversion is likely to (more than) offset any advantage one might gain by using SIMD `int8_t` dot products.
---

#### 💬 Conversation

👤 **saood06** commented the **2025-04-07** at **03:27:46**:<br>

Turboderp was also inspired by QTIP when redoing quantization for their new inference engine found [here](https://github.com/turboderp-org/exllamav3).

There are graphs and more details showing the performance of their quants [here](https://github.com/turboderp-org/exllamav3/blob/master/doc/exl3.md).

I'm interested and will look into it (maybe when the inference engine matures a bit), as I haven't tested using just my 3090 for a 70B model in a long while (the few recent times I wanted to use a 70B I used quants that are too big to fit on my 3090 and thus need to be only partially offloaded).

---

👤 **compilade** commented the **2025-04-07** at **12:17:42**:<br>

> There are graphs and more details showing the performance of their quants [here](https://github.com/turboderp-org/exllamav3/blob/master/doc/exl3.md).

Note that [they did not quantize the embeddings with EXL3](https://old.reddit.com/comments/1jt08di/comment/mlse6qg), while they might have with GGUF (not sure, still needs verification), and this might affect the perplexity graphs since they did not include the size of that tensor in the graphs.

(But since they also untie tied embeddings (to quantize the output tensor), it might be hard to compare fairly depending on the model architecture)

Still looks very promising, though!

---

👤 **saood06** commented the **2025-04-07** at **12:43:17**:<br>

> > There are graphs and more details showing the performance of their quants [here](https://github.com/turboderp-org/exllamav3/blob/master/doc/exl3.md).
>
> Note that [they did not quantize the embeddings with EXL3](https://old.reddit.com/comments/1jt08di/comment/mlse6qg), while they might have with GGUF (not sure, still needs verification), and this might affect the perplexity graphs since they did not include the size of that tensor in the graphs.
>
> (But since they also untie tied embeddings (to quantize the output tensor), it might be hard to compare fairly depending on the model architecture)
>
> Still looks very promising, though!

The linked doc page says "Accounting for quantization of the output layer can make a huge difference in practice, especially for smaller models. So I am including two versions of each perplexity graph, one with bitrate on the horizontal axis, and one that measures the entire VRAM footprint of the weights (not counting the embedding layer which for most inference tasks can be relegated to system RAM.)"

So the bpw chart includes the embeddings layer, it seems, and the VRAM one does not (both of which are useful, so I'm glad they offered both).

> Still looks very promising, though!

Yes.

---

👤 **saood06** commented the **2025-04-07** at **13:25:24**:<br>

> I don't like these plots too much. The y-axis needs to be logarithmic, and it needs to be difference to unquantized, not absolute values (else we are chasing differences between possibly different ways of computing perplexity). Also, they massively overemphasize the low bpw range. If you plot on a log scale, you get a more realistic picture.

Yes, but they are good enough for just looking at a VRAM amount and seeing the expected quality for it with the different quants.

> Either way, yes, trellis quantization can bring a 0.1-0.2 bpw reduction in quantized size for the same model quality.

The gain is bigger going from exllamaV2 to V3, since EXL2 was much worse at low bpw than i-quants. (People did say it offered a better KV cache than llama.cpp, due to the Hadamard transform added [here](https://github.com/turboderp-org/exllamav2/commit/324404ebe4e3c4dd0447ffc1290c312de1df02be), even if the model quantization was not as good.)

Even though the performance on ik_llama.cpp is lower for CUDA, I still prefer it to exllamaV2 because of iqk quants (and also the side benefit of one API implementation) when running models that fit solely on my 3090.

> But is there any indication of performance? I could get my implementation here to be reasonably performant on CUDA, but expect the CPU implementation to be a disaster performance wise.

Exllama is designed for GPUs (and right now only CUDA, with ROCm planned) and they are previewing this alongside a new version of their inference software.

The Readme says,

"Aside from lifting a few of the most successful features from V2 (such as the generator), ExLlamaV3 is largely rewritten from scratch to provide a cleaner, more modular framework for supporting newer architectures. It also introduces a new SOTA quantization format based on [QTIP](https://github.com/Cornell-RelaxML/qtip)"

"The framework is not yet fully optimized. Performance is lacking, especially on Ampere [...]"

> but expect the CPU implementation to be a disaster performance wise.

That is unfortunate.

---

👤 **saood06** commented the **2025-04-08** at **07:21:43**:<br>

Also, I forgot to mention it, but I did mention your PR to the QTIP authors shortly after you made this draft PR. They said "It seems like they didn't bother making the weights Gaussian first (the IP part of QTIP) before quantizing with a Gaussian codebook (3INST)."

You say in the PR "This generates values that are (nearly) normally distributed." and in a commit message "I also notices that the 3INST generator is not actually generating a Gaussian distribution." Do you think that following the authors' suggestion would result in a meaningful difference in quality, or is that something you would expect to not matter as much? (I'm not asking you to implement it if you don't know; I know this PR took a long time, and the fact that it is not CPU friendly means it has limited utility for this repo.)

---

👤 **ikawrakow** commented the **2025-04-08** at **07:38:55**:<br>

It depends on what the QTIP authors mean by "they didn't bother making the weights Gaussian first". If they mean that I did not apply a Hadamard transform first, I did try that (QuIP/QuIP#/QTIP all insist on applying Hadamard transforms to the model weights before quantization), but it did not improve the result in any way. The thing about Hadamard transforms and imatrix is that they do not mix well - one needs a special imatrix for that. But I have also tried this, without much success. If they mean that I have missed something in the 3INST implementation, and hence the generated sequence is not normally distributed and it would be better otherwise, I cannot confirm that either. I did a lot of Monte Carlo stuff in the past, so I know a thing or two about random number sequences. I tried an implementation that produces a perfect Gaussian distribution (and quite a bit more efficiently than theirs), but that made results worse.

I was planning to try a sequence that generates quantized values, so CPU inference will be more efficient. But then I started doing other stuff, so that never materialized.

But do the QTIP authors believe theirs is much better than what I have done? My impression was that it was about the same, give or take.
---

👤 **saood06** commented the **2025-04-08** at **08:02:15**:<br>

> I was planning to try a sequence that generates quantized values, so CPU inference will be more efficient. But then I started doing other stuff, so that never materialized.

That sounds interesting.

> It depends on what the QTIP authors mean by ...
> ...
> But do the QTIP authors believe theirs is much better than what I have done? My impression was that it was about the same, give or take.

I don't know, the one line I quoted ("It seems ...") is the only thing they said to me. I was merely asking out of my own curiosity. I have no intention of testing their implementation, but I may end up testing the EXL3 implementation once it has matured.

---

👤 **louiehelm** commented the **2025-04-17** at **20:00:44**:<br>

The Hadamard Bros and other people fixated on rotations aren't doing it primarily to improve LLM weight quantization. It's for eliminating downstream outliers in run-time activations + KV-cache so they can successfully quantize those more aggressively down to 4 bits without scrambling model fidelity.

Activations and KV-cache are only more sensitive to quantization because of 5-10 tokens per model that represent attention sinks (like [BOS] or "\n"), which typically have activation values >100,000x larger than all the other tokens. This is why, even though 4-bit activations only cause ~0.0001% average error, it still breaks most models: the error is all concentrated in these 5-10 essential tokens. This can cause models to glitch out or loop when they're over-quantized. Activation values for attention sinks (outlier tokens) end up very finely calibrated during training, so most models immediately become flakey when they're perturbed.

There's another way to resolve this besides submitting to the Hadamard cult. [PrefixQuant](https://arxiv.org/abs/2410.05265) is a fairly small patch to KV-cache and activation handling that marks the 5-10 largest outlier tokens and just always pre-caches them into the KV-cache in full f32. Then 4-bit quantize all the other activations and KV-cache for huge speed and memory benefits and no quality trade-off.

---

👤 **saood06** commented the **2025-04-18** at **23:11:20**:<br>

> There's another way to resolve this besides submitting to the Hadamard cult.

The author of ExllamaV3 reported that they will attempt other ideas as well and only go back to Hadamard if they don't work better.

---

👤 **saood06** commented the **2025-04-19** at **11:07:35**:<br>

> [PrefixQuant](https://arxiv.org/abs/2410.05265)

Finally got a chance to read the paper.

> is a fairly small patch

Look at "Table 5: Ablation study on quantization techniques used in PrefixQuant" and "Appendix D. More Ablation Results": the blockwise finetune took 17 hours on Llama-3-70B with an NVIDIA-A100-80GB GPU, and it had to use the correct dataset with exactly the right training parameters, which contributed to their results.

> KV-cache and activation handling that marks the 5-10 largest outlier tokens and just always pre-caches them into KV-cache in full f32.

This still sounds useful; they reported this took 13 minutes on Llama-3-70B with an NVIDIA-A100-80GB GPU.

"Appendix H. More Visualizations" was really interesting to me. Thanks for the paper link.

---

👤 **louiehelm** commented the **2025-04-22** at **22:37:09**:<br>

It's fascinating how well your quants track optimal limits from rate-distortion theory.

Optimal R(D) = 2^(-2*bitrate)



Some of your new quants actually dip down to only ~1.25 bits of overhead.

That's really good considering "optimal" = infinite codebook (which would probably hurt t/s).

---

👤 **ikawrakow** commented the **2025-04-23** at **07:01:57**:<br>

Where does the equation for the optimal R(D) come from?

LLaMA-3 requires about ~1 bpw more to achieve the same quantization error compared to other models (see https://github.com/ikawrakow/ik_llama.cpp/discussions/8). Does this mean that the coding overhead there is < 0.5 bpw? Or does it rather mean that the model weights in LLaMA-3 do contain more information (which is my interpretation)?

---

👤 **saood06** commented the **2025-04-24** at **00:23:38**:<br>

> essentially what LLMs might become in the limit once they're trained hard enough to reach 100% entropy levels (a full 8.0 bits per byte)

Only some recent models are trained at FP8 (such as Deepseek V3/R1); most tend to be BF16, with FP4 training currently in the research stage, see [this](https://arxiv.org/pdf/2501.17116).

---

👤 **saood06** commented the **2025-04-24** at **07:15:28**:<br>

Exllama-V3 added cache quantization:

https://github.com/turboderp-org/exllamav3/commit/cf848114852240a51fb6b9e77c686051c39302b2

They also explain their reasoning in an issue, copied below:

> So cache quantization is implemented now. It's a variant of the same technique used in V2, but now with separate bitrates (2-8 bpw plus 0.5 bpw of overhead) for K and V channels. Works a little better than in V2, and it's more flexible.
>
> I experimented with realtime trellis quantization, learned channel scales, autoencoders and more, but so far with little success, and not enough benefit to justify the overhead and complexity. There's still much to explore, though. For instance, I think it should be possible to learn an optimal rotation for the keys in a given layer, under a quantization constraint, then bake the same transformation into the Q and K projections, preserving their dot product.
>
> But for the time being, it's too much of a side quest, and I need to focus on some other stuff first. In the meantime you can get very usable results from k4v3 quantization, and more-or-less lossless quantization with k5v4. And it's "usable" down to k3v2, depending on the use case. Might make the model more creative or something, who knows (:. I still have to rig up some tests to see if it holds up over long contexts.

---

👤 **ikawrakow** commented the **2025-04-24** at **07:29:50**:<br>

> Does your new Trellis quant also have a +1.1bit gap between L2 70b and L3 70b?

I have not tried it for 70B models. It is too slow for the amount of patience I have. I know some people are OK spending 2 days quantizing a model on a GPU, but I'm not one of those.

---

👤 **ikawrakow** commented the **2025-04-24** at **08:18:08**:<br>

> Worst-case model weights can be approximated as maximally unpredictable Gaussian data -- essentially what LLMs might become in the limit once they're trained hard enough to reach 100% entropy levels

I'm not sure I can follow. In my book, LLMs only work because there are patterns encoded in the model weights, i.e., the model weights of an LLM are pretty much the opposite of a memoryless signal as required for these equations to hold. We also know that the model weights are definitely not Gaussian, and the so-called "outliers" (i.e., weights that do not fall within the expectation of a normal distribution) are more important than the others. Also, the rate-distortion equation tells us something about the difference between the signal (model weights) and its approximate representation (quantized model weights), but it tells us nothing about how this will affect observations (predicted token probabilities), which are the result of a complex set of linear and non-linear operations on the signal.
@@ -0,0 +1,43 @@

### 🔀 [#114](https://github.com/ikawrakow/ik_llama.cpp/pull/114) - MMQ Kernel for Q6_0 (pretty please!)

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-11-20 |
| **Updated** | 2024-11-20 |

---

#### Description

Q6_0 MMQ kernel attempt.

Of course, while I can reproduce the formatting, compile it and run it, I don't understand anything of the maths involved within the main template, and thus perplexity jumps by a factor of 30000 on a pure Q6_0 quant. :D

I used q5_0 as a base.

I know you're not very much into making MMQ CUDA kernels, but could you please do this one if it's not too bothersome, IK? Qwen2 models are quite popular and good, but their ffn_down tensors have a reversed shape and thus need either Q5_1 or Q8_0 as a fallback, which is unsatisfactory in both cases for the quality/size ratio of an overall 5-6 bpw quant.

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [ ] Low
  - [ ] Medium
  - [x] High (it runs, but with forced MMQ perplexity is 200k on a pure Q6_0 Sheared Llama 2 2.7b instead of the expected 7-8, and it's way above my league to fix that)
---

#### 💬 Conversation

👤 **ikawrakow** submitted a review the **2024-11-20** at **09:24:50**: 💬 `COMMENTED`

---

👤 **Nexesenex** submitted a review the **2024-11-20** at **15:21:54**: 💬 `COMMENTED`

---

👤 **Nexesenex** commented during a code review the **2024-11-20** at **15:21:54** on `ggml/src/ggml-cuda/mmq.cuh`:<br>

It's hard. Too hard for me still. :)

I can't find a similar template for Q5_0 cuBLAS in convert.cu, or anything remotely close, so I kept digging to see if I could find similar and sufficient patterns in another quant, or in common.cuh, to get a delta and understand how to transpose. I didn't find what I needed. I am sorry. ^^
@@ -0,0 +1,31 @@

### 🔀 [#115](https://github.com/ikawrakow/ik_llama.cpp/pull/115) - MMQ for Q6_0

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-11-20 |
| **Updated** | 2024-11-21 |

---

#### Description

Add an MMQ kernel for `Q6_0`.

@Nexesenex

---

#### 💬 Conversation

👤 **Nexesenex** commented the **2024-11-20** at **19:42:01**:<br>

Tested successfully on ik_llama.cpp; PPL is 0.1% above Q6_K on a pure quant of Sheared Llama 2.7b.
Thanks IK. I'll play with the Qwen models in the next days.
@@ -0,0 +1,23 @@

### 🔀 [#116](https://github.com/ikawrakow/ik_llama.cpp/pull/116) - Use Q6_0 instead of Q5_1 for tensors incompatible with IQ5_K/Q5_K

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-11-20 |
| **Updated** | 2024-11-21 |

---

#### Description

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [x] Low
  - [ ] Medium
  - [ ] High

---

#### 💬 Conversation

👤 **ikawrakow** submitted a review the **2024-11-21** at **06:12:49**: ✅ `APPROVED`
@@ -0,0 +1,79 @@

### 🔀 [#117](https://github.com/ikawrakow/ik_llama.cpp/pull/117) - Some minor quant strategies tweaks

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-11-22 |
| **Updated** | 2024-11-23 |

---

#### Description

Here's what I'd suggest for starters:

- Rationalize Q2_K_S ffn_down and attn_v (+1% size, -2.5% ppl).

- Bump attn_v and attn_k for Q2_K_S and Q2_K if GQA >= 2. Uncripple attn_k for IQ3_XXS / IQ3_XS if GQA >= 2.
  -> Gemma v2 (GQA2) is popular and sensitive to both. L3 models as well.

- Apply the 8-expert rules to:
  - MoEs with more than 8 experts.
  - MoEs with 4 experts, which should be treated as 8, considering that their shared tensors' relative size is already low compared to their ffn tensors.
  - models with 2 or more experts (such Frankenstein hybrids are published on HF with 2 experts; let them have MoE quants equivalent in bpw to standard models).

- Rationalize MoE attn_k and attn_v for the 1- and 2-bit IQ quants, and attn_q for 1, 2 and small 3 bpw quants.

- Rationalize attn_output for IQ2_XXS, IQ2_XS, IQ2_S and IQ2_M (IQ3_XXS is sufficient), in line with what was done for the IQ1 quants, themselves shrunk to IQ2_KS (no tests made today except for IQ2_S and M; it's mere common sense).

- Rationalize the ffn_down on IQ2_S and IQ2_M (size is equivalent with the attn_output shrink; ppl drops by 0.5%).

Tests were made today on Sheared Llama 2.7b, but I have been using those recipes, among others, for a long time already.

Further ideas for a subsequent PR:

- IQ and IQ_K should maybe not be mixed together unless they are switchable 1:1 on all the supported hardware, accounting also for those having a CUDA MMQ kernel available and those which don't.

- Maybe also the IQ1/IQ2 tree should be dismantled and spread into the tensor trees like every other quant.

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [x] Low
  - [ ] Medium
  - [ ] High
---

#### 💬 Conversation

👤 **ikawrakow** commented the **2024-11-22** at **15:30:05**:<br>

Can you provide some data to support these changes?

---

👤 **Nexesenex** commented the **2024-11-22** at **16:53:59**:<br>

Not really, IK, I'd have to redo all the tests I made during the previous months. I never learned how to properly log llama.cpp data, so I accumulated knowledge and edits along the way and am just passing the simplest part of it on to you. I submit it in a "trust me bro" fashion because I suppose that you know what I know and then some, and have more interesting things to do with your skill set than to mess hamster-style with quant strategies like I have since early 2024.

Broadly, there are a few principles that I discovered through your work:

- Most models respond well to the following structure around a base GGML type (with -2 (lower bpw quant) to +2 (higher bpw quant) degrees of quantization around the base ggml_type):

  - Attn_q: basetype -1 or -2.
  - Attn_k: basetype or +1 (you sometimes go -1; I tend to disagree with that).
  - Attn_v: basetype +1 or +2. The higher the GQA, the more interesting the bump is, nothing new.
  - Attn_output: basetype +1 for 1-2 bpw, basetype for 3 bpw, basetype -1 for 4 bpw or more (e.g., a 3.5 bpw attn_output for a 2.5 bpw ftype doesn't show any benefit compared to a slight bump of ffn_down).
  - Ffn_down: basetype +1 as much as possible, especially for the first and last eighth of the layers; model architectures differ vastly in sensitivity for the intermediate layers. Going +1 or +1.5 bpw for 1/8 of the layers, instead of +0.5 bpw for 3/8 of the layers (the first two eighths and the last eighth, or the opposite), is overkill, especially if the attention tensors are not calibrated for that on the affected layers.
  - Ffn_gate and up are more tricky, but nevertheless the first/last layers bump applies too, especially since the L3 models, which are more "dense" than their predecessors.
  - Embedding and output: the bigger the base weight is, the more you can quantize it, nothing new. High-vocab and monolithic embed/output respond to this.

MoEs: 2 experts already allow a bump on the attn tensors, including q and output.
4-expert models should really be treated like 8-expert models; there's no reason at all to discriminate against them because they operate the very same way (2 experts active). I noticed that on those Pivot/Solar 4-expert models.

So, without any disrespect, pick what you like (I'm sure that some of it makes sense to you) and ditch what's "too much" for your taste.

And if you'd like me to go on with the quant strategies, please tell me; I'd be glad to help with something that I can actually grasp and have experience with.

Here's something for you to eventually take a look at, so you can check how far I went with some of my experiments: 07ad6c6f321ea3643cff5d38766ce8f13a785bfcmaster_loot_2/
@@ -0,0 +1,23 @@

### 🔀 [#118](https://github.com/ikawrakow/ik_llama.cpp/pull/118) - IQ4_NL_X4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-02 |
| **Updated** | 2024-12-02 |

---

#### Description

In mainline `llama.cpp` they have added various types where `Q4_0` or `IQ4_NL` are repacked by interleaving quants from 4 or 8 consecutive rows. They get a significant improvement in prompt processing speed on `ARM`, so I decided to see if interleaved rows can further improve the `iqk_mul_mat` matrix-matrix multiplication speed.

This PR adds `IQ4_NL_X4`, a repacked variant of `IQ4_NL`. The table below shows a `PP-512` comparison between `IQ4_NL` and `IQ4_NL_X4` for LLaMA-3.1-8B-Instruct on `ARM` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Ryzen-5975WX). Somewhat surprisingly, the speedup on Zen4 is larger than the speedup on M2-Max. On `Zen4`, `IQ4_NL_X4` is now the fastest quantization type for prompt processing, beating even `bf16` (237 t/s on the Ryzen-7950X CPU, which has native `bf16` support).

| Platform | Threads | IQ4_NL | IQ4_NL_X4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 85.11 ± 0.47 | 110.32 ± 0.53 | 1.296 |
| Zen4 | 16 | 168.21 ± 0.60 | 262.69 ± 0.65 | 1.562 |
| AVX2 | 32 | 186.81 ± 0.17 | 231.45 ± 0.61 | 1.240 |

For reference: On my M2-Max, mainline `llama.cpp` (build: `3420909d`) achieves 92.3 t/s for `IQ4_NL_4_4`.
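To illustrate the repacking idea (a sketch only, with an assumed generic block type; the actual `IQ4_NL_X4` layout also interleaves data within the blocks): take the same block index from 4 consecutive rows and store those 4 blocks contiguously, so the matrix multiplication kernel can work on 4 rows per pass with contiguous loads.

```
#include <cstddef>

// Hypothetical sketch of 4-row interleaving at block granularity.
// src: n_rows x blocks_per_row blocks, row-major; n_rows must be a multiple of 4.
// dst: for every group of 4 rows, block ib of rows r..r+3 is stored contiguously.
template <typename Block>
static void repack_x4(const Block * src, Block * dst, int n_rows, int blocks_per_row) {
    for (int row4 = 0; row4 < n_rows/4; ++row4) {
        for (int ib = 0; ib < blocks_per_row; ++ib) {
            for (int r = 0; r < 4; ++r) {
                dst[((size_t)row4*blocks_per_row + ib)*4 + r] =
                    src[((size_t)(4*row4 + r))*blocks_per_row + ib];
            }
        }
    }
}
```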
@@ -0,0 +1,25 @@

### 🔀 [#119](https://github.com/ikawrakow/ik_llama.cpp/pull/119) - Q4_0_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-02 |
| **Updated** | 2024-12-02 |

---

#### Description

`Q4_0` repacked with 4 interleaved rows as in `IQ4_NL_X4` (see PR #118).

PP-512 for LLaMA-3.1-8B for `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q4_0 | Q4_0_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 84.57 ± 0.94 | 115.79 ± 0.86 | 1.369 |
| Zen4 | 16 | 185.89 ± 0.84 | 278.15 ± 0.39 | 1.496 |
| AVX2 | 32 | 190.73 ± 0.39 | 251.00 ± 0.51 | 1.316 |

On `Zen4`, `Q4_0_R4` is now the prompt processing champion.

Here the hand-written assembly for `Q4_0_4_4` in mainline `llama.cpp` achieves 122.8 t/s on my M2-Max, so it beats `Q4_0_R4` by a small margin. My guess is that `Q4_0_4_4` is slightly better because there the `0x88` xor mask (which converts the unsigned 4-bit quants to signed 4-bit quants shifted 4 bits to the left) is already applied. But this trick is only useful for the `ARM` instruction set, and is absolutely not useful on `x86_64`, so I did not use it.
@@ -0,0 +1,42 @@

### 🔀 [#12](https://github.com/ikawrakow/ik_llama.cpp/pull/12) - q2_K: allow it to detect ternary nets and quantize accordingly

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-05 |
| **Updated** | 2024-08-05 |

---

#### Description

It looks like they have abandoned the Bitnet quants in PR-8151 in `llama.cpp` and are now going for quantization types in blocks of 256, similar to k- and i-quants. This of course removes support for the 3B Bitnet model (its number of columns is not a multiple of 256) without clunky stuff such as padding, so they are going for [TriLM](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2) instead, being excited about the newly added `TQ1_0` and `TQ2_0` quantizations, with `TQ2_0` being the fastest quant around on `AVX2`. So, I decided to check how it compares to the CPU implementation here.

The `IQ1_BN` and `IQ2_BN` quants in this repo rely on the tensors in the model converted to `GGUF` being prepared as ternary, with separate tensors holding the scales. Instead of adding yet another hack to the `convert_hf_to_gguf.py` conversion script, for a quick comparison I added a ternary net detection to the `Q2_K` quantization function. If a ternary net is detected, the quants only take the values `0, 1, 2`, all block scales and mins are set to one, and the super-block scale/min are set to the max value found in the row. But to be able to quantize to `Q2_K_S` without an imatrix, I also needed the ability to ignore the built-in imatrix rules, which I added to the `llama-quantize` tool and to `llama.cpp`. With these changes, a `Q2_K_S` quantization of the 3.9B TriLM model matches `fp16` perplexity (using `Q6_K` for `output.weight` and `Q4_K` for `token_embedding.weight`). It is actually even slightly better than `fp16`: I'm getting `PPL = 11.1531` for `fp16` and `PPL = 11.1240` for `Q2_K_S`.
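A hypothetical sketch of the ternary-net detection idea described above (an assumption for illustration, not the code added by the PR): a row of weights is treated as ternary if every value is, numerically, one of {-max, 0, +max} for the row's maximum absolute value.

```
#include <cmath>

// Returns true if every element of x is (within a tolerance) either 0 or ±amax,
// where amax is the largest absolute value in the row.
static bool row_is_ternary(const float * x, int n, float rel_eps = 1e-5f) {
    float amax = 0.0f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    if (amax == 0.0f) return true;  // all zeros
    const float eps = rel_eps * amax;
    for (int i = 0; i < n; ++i) {
        const float a = std::fabs(x[i]);
        if (a > eps && std::fabs(a - amax) > eps) return false;  // neither 0 nor ±amax
    }
    return true;
}
```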
We can now compare the performance of `Q2_K_S` to the new `TQ2_0` quantization in `llama.cpp`. I'm using the 3.9B TriLM variant. The command line to quantize with this PR is
```
./bin/llama-quantize --pure --output-weight-type q6_K --token-embedding-type q4_K --ignore-imatrix-rules $trilm_model $output_file q2_K_S
```

Here is what I find for `PR-8151` on my Ryzen-7950X CPU:

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama ?B TQ2_0 - 2.06 bpw ternary | 1.08 GiB | 3.99 B | CPU | 16 | pp512 | 275.78 ± 0.68 |
| llama ?B TQ2_0 - 2.06 bpw ternary | 1.08 GiB | 3.99 B | CPU | 2 | tg128 | 29.69 ± 0.07 |
| llama ?B TQ2_0 - 2.06 bpw ternary | 1.08 GiB | 3.99 B | CPU | 4 | tg128 | 46.65 ± 0.07 |
| llama ?B TQ2_0 - 2.06 bpw ternary | 1.08 GiB | 3.99 B | CPU | 8 | tg128 | 48.15 ± 0.03 |
| llama ?B TQ2_0 - 2.06 bpw ternary | 1.08 GiB | 3.99 B | CPU | 16 | tg128 | 46.13 ± 0.03 |

And here is what I get for `Q2_K_S` in this repo:

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama ?B Q2_K - Small | 1.33 GiB | 3.99 B | CPU | 16 | pp512 | 360.60 ± 0.92 |
| llama ?B Q2_K - Small | 1.33 GiB | 3.99 B | CPU | 2 | tg128 | 25.81 ± 0.04 |
| llama ?B Q2_K - Small | 1.33 GiB | 3.99 B | CPU | 4 | tg128 | 39.91 ± 0.35 |
| llama ?B Q2_K - Small | 1.33 GiB | 3.99 B | CPU | 8 | tg128 | 38.77 ± 2.11 |
| llama ?B Q2_K - Small | 1.33 GiB | 3.99 B | CPU | 16 | tg128 | 38.55 ± 0.02 |

So, despite wasting time on unnecessary block scale multiplications, we still outperform `TQ2_0` by 30% for prompt processing. Token generation is of course memory bound and, with the `Q2_K_S` quantized model being ~25% larger than `TQ2_0`, peak TG performance is ~15% lower.
@@ -0,0 +1,21 @@

### 🔀 [#120](https://github.com/ikawrakow/ik_llama.cpp/pull/120) - Q8_0_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-03 |
| **Updated** | 2024-12-03 |

---

#### Description

Following PR #118, #119: `Q8_0` repacked with 4 interleaved rows.

PP-512 for LLaMA-3.1-8B for `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q8_0 | Q8_0_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 83.69 ± 1.53 | 112.95 ± 0.17 | 1.350 |
| Zen4 | 16 | 175.61 ± 0.71 | 268.98 ± 0.31 | 1.532 |
| AVX2 | 32 | 213.95 ± 0.44 | 234.40 ± 0.60 | 1.096 |
@@ -0,0 +1,37 @@

### 🔀 [#121](https://github.com/ikawrakow/ik_llama.cpp/pull/121) - Q5_0_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-03 |
| **Updated** | 2024-12-03 |

---

#### Description

Follow-up of #118, #119, #120 for `Q5_0`.

Here is PP-512 for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q5_0 | Q5_0_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 71.04 ± 0.83 | 99.59 ± 1.06 | 1.402 |
| Zen4 | 16 | 157.46 ± 0.50 | 256.70 ± 0.42 | 1.630 |
| AVX2 | 32 | 171.99 ± 0.50 | 236.33 ± 0.56 | 1.374 |

Here I see a benefit even for TG. E.g., on the Ryzen-7950X I get for TG-128:

| Threads | Q5_0 | Q5_0_R4 | Speedup |
| ---: | ---: | ---: | ---: |
| 2 | 9.06 ± 0.00 | 9.87 ± 0.00 | 1.089 |
| 4 | 11.06 ± 0.15 | 11.73 ± 0.00 | 1.061 |

It is worth comparing `Q5_0_R4` to mainline `llama.cpp` (`build: 3420909d (4234)`) on the M2-Max:

| Task | Threads | t/s mainline | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| pp512 | 8 | 26.49 ± 0.61 | 99.59 ± 1.06 | 3.758 |
| tg128 | 2 | 6.38 ± 0.01 | 8.75 ± 0.01 | 1.371 |
| tg128 | 4 | 12.27 ± 0.10 | 16.46 ± 0.08 | 1.341 |
| tg128 | 8 | 20.60 ± 0.14 | 22.07 ± 0.32 | 1.071 |
@@ -0,0 +1,21 @@

### 🔀 [#122](https://github.com/ikawrakow/ik_llama.cpp/pull/122) - Q6_0_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-03 |
| **Updated** | 2024-12-03 |

---

#### Description

Follow-up of #118, #119, #120, #121 for `Q6_0`.

Here is PP-512 for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q6_0 | Q6_0_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 73.21 ± 1.10 | 94.96 ± 0.90 | 1.297 |
| Zen4 | 16 | 159.04 ± 0.58 | 257.25 ± 0.26 | 1.638 |
| AVX2 | 32 | 174.19 ± 0.58 | 231.53 ± 0.60 | 1.329 |
@@ -0,0 +1,23 @@

### 🔀 [#123](https://github.com/ikawrakow/ik_llama.cpp/pull/123) - IQ4_XS_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-04 |
| **Updated** | 2024-12-04 |

---

#### Description

Follow-up of #118, #119, #120, #121, #122 for `IQ4_XS`.

I was curious to see if one can make the interleaved rows strategy work for i- and k-quants with their super-blocks & blocks and two levels of scales. `IQ4_XS` seemed easiest, so I tackled that one first. We get a massive speedup on `ARM_NEON` and a more modest (but still significant) gain on `AVX2/Zen4`. I'm not 100% happy with the `Zen4` implementation, but shuffling scale bits for 4 rows at once is tricky, so for now I have settled on a sub-optimal solution.

Anyway, here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ4_XS | IQ4_XS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 68.23 ± 1.06 | 115.43 ± 0.57 | 1.692 |
| Zen4 | 16 | 183.43 ± 0.60 | 223.98 ± 0.12 | 1.221 |
| AVX2 | 32 | 195.20 ± 0.40 | 248.25 ± 0.43 | 1.272 |
|
||||
@@ -0,0 +1,39 @@
### 🔀 [#124](https://github.com/ikawrakow/ik_llama.cpp/pull/124) - iq2_bn_r4: fastest Bitnet CPU implementation on the planet

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-06 |
| **Updated** | 2024-12-06 |

---

#### Description

In the footsteps of #118, #119, #120, #121, #122, #123, this PR adds `IQ2_BN_R4`, a 4-rows interleaved packing of the 2-bit Bitnet quantization type `IQ2_BN`.

Here is `PP-512` for Bitnet-1.58b-3B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ2_BN | IQ2_BN_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 246.57 ± 1.66 | 304.68 ± 0.77 | 1.236 |
| Zen4 | 16 | 631.27 ± 2.81 | 834.46 ± 2.77 | 1.322 |
| AVX2 | 32 | 694.17 ± 0.60 | 704.62 ± 0.60 | 1.0125 |

There aren't enough vector registers on AVX2 for all the necessary accumulators when processing 8 right-matrix columns at once. Hence, one needs two passes per interleaved row of the left matrix, so the gain on AVX2 is very minor. But on Zen4 we now achieve 834 t/s! In comparison, [T-MAC](https://github.com/microsoft/T-MAC), a repository with currently 607 stars that makes bold claims about being the fastest Bitnet CPU implementation, achieves 300 t/s on the same Ryzen-7950X system.

TG is of course memory bound, but for a small number of threads I also observe a speedup. The table shows measurements for TG-128 on the above 3 platforms (only up to the number of threads that achieves maximum performance):

| Platform | Threads | IQ2_BN | IQ2_BN_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 1 | 21.01 ± 0.08 | 24.75 ± 0.08 | 1.178 |
| | 2 | 39.15 ± 0.02 | 45.48 ± 0.08 | 1.162 |
| | 4 | 64.39 ± 0.17 | 71.82 ± 1.84 | 1.115 |
| | 8 | 99.60 ± 0.53 | 100.74 ± 1.13 | 1.011 |
| Zen4 | 1 | 25.91 ± 0.12 | 30.35 ± 0.15 | 1.171 |
| | 2 | 45.03 ± 0.22 | 50.93 ± 0.18 | 1.131 |
| | 4 | 57.42 ± 0.08 | 57.40 ± 0.06 | 1.000 |
| AVX2 | 1 | 16.39 ± 0.00 | 18.42 ± 0.11 | 1.124 |
| | 2 | 29.94 ± 0.03 | 31.56 ± 0.01 | 1.054 |
| | 4 | 44.09 ± 0.02 | 45.26 ± 0.03 | 1.027 |
| | 8 | 47.28 ± 0.04 | 49.25 ± 0.02 | 1.042 |
@@ -0,0 +1,18 @@
### 🔀 [#125](https://github.com/ikawrakow/ik_llama.cpp/pull/125) - R4 improvements on ARM_NEON

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-08 |
| **Updated** | 2024-12-08 |

---

#### Description

This PR accomplishes two things:
* Reduces bloat by using a template for the `ARM_NEON` matrix multiplication implementation of the interleaved-rows quants `Q4_0_R4, Q5_0_R4, Q6_0_R4, IQ4_NL_X4, IQ4_XS_R4, Q8_0_R4` (and I should do the same for `AVX2/Zen4`); a sketch of the idea follows this list
* Achieves a ~7% PP speedup for all `R4` quants except `IQ4_XS_R4`. With this
  - `Q4_0_R4` now outperforms the hand-written assembly in mainline `llama.cpp` by a small margin (125 t/s vs 123 t/s)
  - `Q8_0_R4` becomes the fastest type for prompt processing on `ARM_NEON` (PP-512 = 128 t/s for LLaMA-3.1-8B on M2-Max)
  - All `R4` quants achieve PP-512 > 100 t/s for LLaMA-3.1-8B on M2-Max
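
As a rough illustration of the templated approach (the names and block layout below are made up for this sketch and are not the actual `ik_llama.cpp` identifiers): the per-type code shrinks to an "unpack one interleaved chunk into 4 vectors" step, while the dot-product loop is shared by all `R4` types.

```cpp
#include <arm_neon.h>
#include <cstdint>

// Hypothetical 8-bit interleaved format: 64-byte chunks holding 16 quants
// for each of the 4 interleaved rows.
struct UnpackQ8R4 {
    static constexpr int kChunkBytes = 64;
    static inline void unpack(const int8_t *x, int8x16_t q[4]) {
        for (int r = 0; r < 4; ++r) q[r] = vld1q_s8(x + 16*r);
    }
};

// Shared matrix x vector loop: one activation chunk feeds 4 row accumulators.
// Requires the ARMv8.2 dot-product extension (available on the M2-Max).
template <typename Unpacker>
void mul_vec_r4(int n_chunks, const int8_t *x, const int8_t *y, int32_t sums[4]) {
    int32x4_t acc[4] = {vdupq_n_s32(0), vdupq_n_s32(0), vdupq_n_s32(0), vdupq_n_s32(0)};
    for (int i = 0; i < n_chunks; ++i) {
        int8x16_t q[4];
        Unpacker::unpack(x + i*Unpacker::kChunkBytes, q);
        int8x16_t v = vld1q_s8(y + 16*i);
        for (int r = 0; r < 4; ++r) acc[r] = vdotq_s32(acc[r], q[r], v);
    }
    for (int r = 0; r < 4; ++r) sums[r] = vaddvq_s32(acc[r]);
}
```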
@@ -0,0 +1,15 @@
### 🔀 [#126](https://github.com/ikawrakow/ik_llama.cpp/pull/126) - Rename iq4_nl_x4 to iq4_nl_r4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-08 |
| **Updated** | 2024-12-08 |

---

#### Description

To be consistent with the other quants interleaving 4 rows.

I started the interleaved-rows experiment with `IQ4_NL` and named the packing `IQ4_NL_X4`. But then I thought that `_X4` is actually ambiguous: 4 times what? We already have quants where 4 consecutive blocks are packed together into a larger "X4" block. Because of that I named all following interleaved-rows quants using `_R4` (as in 4 rows). To be consistent with this naming convention, this PR renames `IQ4_NL_X4` to `IQ4_NL_R4`.
@@ -0,0 +1,13 @@
### 🔀 [#127](https://github.com/ikawrakow/ik_llama.cpp/pull/127) - Q4_0_R4 on CUDA

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ✅ **Open** |
| **Created** | 2024-12-08 |
| **Updated** | 2025-01-09 |

---

#### Description

With the massive improvements in prompt processing speed on the CPU achieved via interleaving 4 tensor rows (see #118, #119, #120, #121, #122, #123, #124), I was curious to see if one can get a good implementation for the `X_R4` quants on CUDA. This PR is a POC that implements CUDA dequantization and matrix x vector multiplication for `Q4_0_R4`. It achieves the same TG speed as `Q4_0`. It was disappointing not to get a speedup via row interleaving, but at least there is no performance regression. To make this a full PR I should also implement quantized matrix x matrix multiplication for `Q4_0_R4` (here it is done via dequantization to `f16` and cuBLAS, so it is slower than `Q4_0` MMQ).
@@ -0,0 +1,13 @@
### 🔀 [#128](https://github.com/ikawrakow/ik_llama.cpp/pull/128) - Faster IQ4_XS_R4 on Zen4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-08 |
| **Updated** | 2024-12-08 |

---

#### Description

We now get PP-512(LLaMA-3.1-8B) = 254 t/s on a Ryzen-7950X CPU, up from 224 t/s.
@@ -0,0 +1,37 @@
### 🔀 [#129](https://github.com/ikawrakow/ik_llama.cpp/pull/129) - Q4_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-09 |
| **Updated** | 2024-12-09 |

---

#### Description

Follow up of #118, #119, #120, #121, #122, #123 for `Q4_K`.

After having demonstrated interleaved rows with blocks and super-blocks for `IQ4_XS` in #123, here is the corresponding implementation for `Q4_K`. To avoid an explosion of quantization types, `Q4_K_R4` corresponds to `Q4_K_S` (there is no `_R4` variant for `Q4_K_M`).

We get a massive speedup on `ARM_NEON` and a quite significant gain on `AVX2/Zen4`. The `Zen4` implementation could probably be optimized further. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q4_K_S | Q4_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 68.73 ± 0.88 | 110.02 ± 1.31 | 1.601 |
| Zen4 | 16 | 198.92 ± 0.69 | 259.19 ± 0.24 | 1.303 |
| AVX2 | 32 | 206.39 ± 0.28 | 282.78 ± 0.54 | 1.370 |

We gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q4_K_S | Q4_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 11.38 ± 0.00 | 12.17 ± 0.01 | 1.069 |
| | 4 | 18.08 ± 0.44 | 21.56 ± 0.06 | 1.192 |
| | 8 | 25.02 ± 0.17 | 25.39 ± 0.14 | 1.015 |
| Zen4 | 1 | 5.73 ± 0.01 | 8.95 ± 0.00 | 1.562 |
| | 2 | 10.47 ± 0.01 | 13.37 ± 0.00 | 1.277 |
| | 4 | 13.38 ± 0.63 | 14.03 ± 0.01 | 1.049 |
| AVX2 | 2 | 4.60 ± 0.00 | 7.61 ± 0.00 | 1.370 |
| | 4 | 8.55 ± 0.00 | 12.01 ± 0.00 | 1.403 |
| | 8 | 11.67 ± 0.00 | 13.83 ± 0.00 | 1.185 |
@@ -0,0 +1,99 @@
### 🔀 [#13](https://github.com/ikawrakow/ik_llama.cpp/pull/13) - Adding IQ2_TN for use with ternary models

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-06 |
| **Updated** | 2024-08-07 |

---

#### Description

They have abandoned the `Q1_3` and `Q2_2` quants in [PR-8151](https://github.com/ggerganov/llama.cpp/pull/8151) in `llama.cpp`, and have moved on to `TQ1_0` and `TQ2_0`. Like k-quants, these use blocks of 256 weights and utilize `Q8_K` for quantized dot products on the CPU. This removes support for [Bitnet b1.58](https://huggingface.co/1bitLLM/bitnet_b1_58-3B) (unless one adds padding to a multiple of 256), so they are now focusing on the [TriLM models](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2). Unlike the previous `Q1_3` and `Q2_2`, where the quantized data only holds the ternary `-1/0/+1` values and the tensor scale is added via a separate `ggml_scale` operation, the new `TQ1_0` and `TQ2_0` include a scale in each block of 256. This basically wastes 0.0625 bpw (a 16-bit scale per 256 weights), but has the advantage that one can simply reuse the standard `llama.cpp` computation graphs.

Based on the `PP-512` and `TG-128` figures posted in [PR-8151](https://github.com/ggerganov/llama.cpp/pull/8151), `TQ2_0` performance is much better than the earlier `Q2_2` attempt, so I became curious to see how @compilade's implementation compares to what we can do with `iqk_mul_mat` in this repo, and here we are.

The PR adds `IQ2_TN` (`TN` as in TriNet). Implementations for `Zen4`, `AVX2`, `ARM_NEON`, `CUDA` and `Metal` are provided.

Let's look at some performance comparisons. We will focus on the largest TriLM model, which has ~4B parameters. Quantized with 2.0625 bpw, the model size is 1.08 GiB.

### AVX2

`AVX2` was tested on a 32-core Ryzen-5975WX CPU. Not everybody has a 32-core CPU handy, so I have added performance values for fewer threads.

| threads | test | t/s (PR-8151) | t/s (this PR) | Speedup |
| ------: | ------------: | ---------------: | -------------: | ------: |
| 32 | pp512 | 430.18 ± 0.56 | 490.73 ± 0.62 | 1.141 |
| 16 | pp512 | 258.47 ± 0.21 | 306.37 ± 0.03 | 1.185 |
| 8 | pp512 | 141.94 ± 0.04 | 175.45 ± 0.06 | 1.236 |
| 4 | pp512 | 74.72 ± 0.02 | 91.78 ± 0.01 | 1.228 |
| 1 | tg128 | 15.75 ± 0.01 | 15.71 ± 0.01 | 1.000 |
| 2 | tg128 | 24.22 ± 0.02 | 26.50 ± 0.00 | 1.094 |
| 4 | tg128 | 33.66 ± 0.14 | 41.63 ± 0.04 | 1.237 |
| 8 | tg128 | 44.34 ± 0.07 | 48.62 ± 0.03 | 1.097 |
| 16 | tg128 | 49.58 ± 0.05 | 48.09 ± 0.03 | 0.970 |

I would say @compilade has done remarkably well here, coming to within ~14% for PP performance. For fewer than 32 threads, the performance gap increases to ~23%. My guess is that 23% is the more realistic value for the performance difference, and that as the number of threads increases we see more of the effect of `ggml` inefficiencies (thread synchronization, operations that do not scale with the number of threads, etc.), which then narrows the gap. Nevertheless, even 23% is remarkable considering the performance differences for other quants (see main page). For TG the performance is the same for 1 thread (not much one can do there; the bit arrangement is so simple that there aren't many different ways to implement it effectively with `AVX2`). The implementation in this PR then becomes faster, I guess due to better cache utilization. But this better per-thread performance leads to too much memory bandwidth contention above 8 threads, so `TQ2_0` is able to arrive at a slightly better performance at 16 threads.

### Zen4

I have also tested on a `Zen4` CPU (16-core Ryzen-7950X). `Zen4` implements some of the `AVX512` instruction set, and there is a dedicated implementation for that for `IQ2_TN`. The `TQ2_0` quants are implemented in pure `AVX2`, so one might think the performance comparison is unfair. But, at least as far as I know, the `Zen4` core implements 512-bit instructions as two separate 256-bit instructions in hardware, so one does not gain much by operating on 512-bit wide vectors. The main advantage comes from having more vector registers (32 vs 16 on `AVX2`), but the way matrix multiplications are done in `ggml` (a series of vector x vector dot products), one cannot really take advantage of that. Anyway, here is the performance comparison on the Ryzen-7950X CPU:

| threads | test | t/s (PR-8151) | t/s (this PR) | Speedup |
| ------: | ------------: | ---------------: | ---------------: | ------: |
| 16 | pp512 | 276.74 ± 0.75 | 429.97 ± 1.41 | 1.553 |
| 8 | pp512 | 151.50 ± 0.46 | 250.88 ± 0.31 | 1.656 |
| 4 | pp512 | 78.82 ± 0.64 | 131.29 ± 0.23 | 1.665 |
| 1 | tg128 | 18.76 ± 0.40 | 20.11 ± 0.05 | 1.072 |
| 2 | tg128 | 29.38 ± 0.05 | 35.69 ± 0.07 | 1.215 |
| 4 | tg128 | 46.39 ± 0.04 | 48.62 ± 0.01 | 1.048 |
| 8 | tg128 | 47.94 ± 0.03 | 48.28 ± 0.04 | 1.007 |

Here the PP performance gap is more significant at around 66%, reducing to 55% at 16 threads. If we look at TG performance for 1 thread, the ~7% performance difference comes from using `_mm512_dpbusd_epi32`, which is a fused multiply-add operation, whereas on `AVX2` one needs to use `_mm256_maddubs_epi16` followed by `_mm256_add_epi16` to accumulate the result. The TG performance gap then widens due to better cache utilization, and then decreases towards zero with increasing number of threads as the memory bandwidth is saturated. The 66% PP performance gap is hence the combination of ~7% due to the use of a fused multiply-add, and ~60% due to better utilization of vector registers while multiplying a row of the left matrix with several columns of the right matrix, where the unpacked quants for a block are held in vector registers.
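
To make the instruction-level difference concrete, here is a minimal sketch of the two accumulation patterns for u8 x s8 dot products (simplified: the real kernels also fold in block scales and widen the 16-bit accumulator periodically):

```cpp
#include <immintrin.h>

#if defined(__AVX512VNNI__)
// AVX512-VNNI (Zen4) path: one fused multiply-accumulate per 64 quants.
static inline __m512i dot_step_zen4(__m512i acc, __m512i u8, __m512i s8) {
    return _mm512_dpbusd_epi32(acc, u8, s8);
}
#endif

// AVX2 path: multiply-add pairs to 16 bit, then a separate add to accumulate.
// The 16-bit accumulator must be widened to 32 bit often enough to avoid overflow.
static inline __m256i dot_step_avx2(__m256i acc16, __m256i u8, __m256i s8) {
    return _mm256_add_epi16(acc16, _mm256_maddubs_epi16(u8, s8));
}
```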
### ARM_NEON

Here @compilade's implementation does not do very well, at least not on the M2-Max laptop where I have tested. But perhaps this is just due to the fact that @compilade used a Cortex-A72 CPU for development, and that CPU may well behave very differently from the M2-Max.

| threads | test | t/s (PR-8151) | t/s (this PR) | Speedup |
| ------: | ------------: | ---------------: | ---------------: | ------: |
| 8 | pp512 | 79.15 ± 0.21 | 206.60 ± 0.14 | 2.610 |
| 2 | tg128 | 17.61 ± 0.01 | 28.42 ± 0.05 | 1.614 |
| 4 | tg128 | 32.40 ± 0.02 | 49.23 ± 0.09 | 1.519 |
| 8 | tg128 | 51.64 ± 0.70 | 76.37 ± 0.22 | 1.479 |

### CUDA and Metal

There is no GPU implementation in PR-8151, so here are just the performance values for this PR. `CUDA` is tested on an RTX-4080, `Metal` on a 30-core M2-Max GPU.

| backend | test | t/s (this PR) |
| ------: | ------------: | ---------------: |
| CUDA | pp512 | 9937 ± 81 |
| CUDA | tg128 | 299.19 ± 0.15 |
| Metal | pp512 | 891.52 ± 0.49 |
| Metal | tg128 | 98.52 ± 0.16 |

I have not bothered implementing the MMQ stuff, so CUDA PP performance is via dequantize and cuBLAS gemm.

---

#### 💬 Conversation

👤 **compilade** commented the **2024-08-06** at **17:00:57**:<br>

This is great!

> ARM_NEON
> Here @compilade's implementation does not do very well

Yeah, I did not particularly optimize the ARM_NEON implementation for recent ARM CPUs (yet), especially since I did not use `vdotq_s32` (although I was planning to), because the Cortex-A72 and the Cortex-A53 in the CPUs of my test machines do not support that and were faster with `vmlal_s8` than with `ggml_vdotq_s32`.

---

I see `IQ2_TN` mostly has the same format as `TQ2_0`, except that the float16 scale is before the packed weights instead of after.
But if I understand it correctly, both store the packed values in the same order and packed in the same way (same offset). Does that mean the Metal and CUDA implementations for `IQ2_TN` would also work for `TQ2_0`?

Do you have plans for `IQ2_TN` to replace `TQ2_0`, or is this something done in parallel to see how fast it can get with better matrix multiplication than lots of dot products?

Either way, I really appreciate your work on this. This was a pleasant surprise to see in my notifications.
@@ -0,0 +1,39 @@
### 🔀 [#130](https://github.com/ikawrakow/ik_llama.cpp/pull/130) - Q6_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |

---

#### Description

Follow up of #118, #119, #120, #121, #122, #123, #129 for `Q6_K`.

If nothing else, `Q6_K` is routinely used for the output tensor, so having better `Q6_K` performance would be useful.

We get a large speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q6_K | Q6_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 57.57 ± 0.61 | 83.25 ± 0.81 | 1.446 |
| Zen4 | 16 | 195.20 ± 0.74 | 243.25 ± 0.31 | 1.246 |
| AVX2 | 32 | 194.51 ± 0.35 | 264.16 ± 0.44 | 1.358 |

Except on `ARM_NEON`, where TG performance is slightly lower for small numbers of threads, we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q6_K | Q6_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 7.46 ± 0.03 | 7.35 ± 0.01 | 0.985 |
| | 4 | 13.88 ± 0.02 | 13.80 ± 0.01 | 0.994 |
| | 8 | 18.31 ± 0.16 | 18.57 ± 0.14 | 1.014 |
| Zen4 | 1 | 5.38 ± 0.00 | 7.94 ± 0.00 | 1.476 |
| | 2 | 8.93 ± 0.00 | 10.38 ± 0.00 | 1.162 |
| | 4 | 9.97 ± 0.27 | 10.18 ± 0.01 | 1.021 |
| AVX2 | 2 | 4.75 ± 0.00 | 5.78 ± 0.01 | 1.217 |
| | 4 | 7.57 ± 0.00 | 8.47 ± 0.00 | 1.119 |
| | 8 | 8.23 ± 0.00 | 9.14 ± 0.00 | 1.111 |

With this Zen4 implementation, for TG the available memory bandwidth is fully saturated with just 2 threads!
@@ -0,0 +1,13 @@
### 🔀 [#131](https://github.com/ikawrakow/ik_llama.cpp/pull/131) - Slightly faster Q4_K_R4 and IQ4_XS_R4 on Zen4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |

---

#### Description

~1-2% speedup.
@@ -0,0 +1,57 @@
### 🔀 [#132](https://github.com/ikawrakow/ik_llama.cpp/pull/132) - Q5_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-10 |
| **Updated** | 2024-12-10 |

---

#### Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130 for `Q5_K`.

We get a large speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 61.07 ± 0.95 | 96.13 ± 2.38 | 1.574 |
| Zen4 | 16 | 188.73 ± 0.75 | 248.30 ± 0.29 | 1.316 |
| AVX2 | 32 | 188.11 ± 0.29 | 269.18 ± 0.40 | 1.431 |

On `AVX2/Zen4` we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q5_K | Q5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| Zen4 | 1 | 5.12 ± 0.00 | 7.07 ± 0.01 | 1.380 |
| | 2 | 9.31 ± 0.00 | 11.54 ± 0.00 | 1.240 |
| | 4 | 11.33 ± 0.37 | 11.89 ± 0.00 | 1.049 |
| AVX2 | 2 | 4.04 ± 0.00 | 6.40 ± 0.00 | 1.584 |
| | 4 | 7.57 ± 0.00 | 9.95 ± 0.00 | 1.314 |
| | 8 | 9.75 ± 0.00 | 11.00 ± 0.00 | 1.128 |

I decided to check the current state of mainline `llama.cpp` for `Q5_K_S`.

Hahaha - here is what we get on my M2-Max (`build: 7736837d (4274)`):

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | pp512 | 27.69 ± 0.09 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 6.39 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 12.18 ± 0.02 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 19.68 ± 0.64 |

The performance gap in prompt processing for `Q5_K` has now grown to 3.5X, and mainline is ~30% slower for TG with 2 threads.

Here is what I get on my Ryzen-7950X (`build: 26a8406b (4295)`):

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 16 | pp512 | 75.88 ± 0.26 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 1 | tg128 | 4.10 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 2 | tg128 | 7.66 ± 0.01 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 4 | tg128 | 11.26 ± 0.00 |
| llama 8B Q5_K - Small | 5.21 GiB | 8.03 B | CPU | 8 | tg128 | 11.20 ± 0.22 |

3.26X slower for prompt processing, 72%/51% slower for TG at 1/2 threads.
@@ -0,0 +1,32 @@
### 🔀 [#134](https://github.com/ikawrakow/ik_llama.cpp/pull/134) - Q3_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |

---

#### Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132 for `Q3_K`.

We get a massive speedup on `ARM_NEON` and non-negligible gains on `AVX2/Zen4`. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q3_K | Q3_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 55.42 ± 1.00 | 106.89 ± 1.14 | 1.929 |
| Zen4 | 16 | 193.89 ± 0.43 | 236.77 ± 0.35 | 1.221 |
| AVX2 | 32 | 199.22 ± 0.41 | 262.34 ± 0.50 | 1.317 |

On `AVX2/Zen4` we gain even for TG. Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q3_K | Q3_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| Zen4 | 1 | 5.47 ± 0.01 | 6.78 ± 0.00 | 1.239 |
| | 2 | 10.25 ± 0.00 | 12.46 ± 0.00 | 1.216 |
| | 4 | 15.21 ± 0.59 | 17.02 ± 0.09 | 1.119 |
| AVX2 | 2 | 5.02 ± 0.01 | 8.21 ± 0.00 | 1.635 |
| | 4 | 9.33 ± 0.00 | 13.67 ± 0.00 | 1.465 |
| | 8 | 14.85 ± 0.02 | 16.67 ± 0.00 | 1.123 |
@@ -0,0 +1,32 @@
### 🔀 [#135](https://github.com/ikawrakow/ik_llama.cpp/pull/135) - Better ARM_NEON implementation for R4 quants

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |

---

#### Description

We get improved performance for `IQ4_XS_R4`, `Q4_K_R4`, `Q5_K_R4`, `Q6_K_R4`. The trick was to accumulate super-blocks in `int32_t`, thus avoiding expensive `int -> float` conversions.
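
A minimal sketch of what this looks like for a single row (simplified to one 8-bit block scale per 32 quants and a single float super-block scale; the actual kernels do this for 4 interleaved rows at once, and the real scale layouts differ):

```cpp
#include <arm_neon.h>
#include <cstdint>

// Requires the ARMv8.2 dot-product extension (available on the M2-Max).
static inline float dot_superblock(const int8_t *q, const int8_t *y,
                                   const int8_t *block_scales, float d) {
    int32x4_t isum = vdupq_n_s32(0);
    for (int ib = 0; ib < 8; ++ib) {                     // 8 blocks of 32 quants
        int32x4_t p = vdupq_n_s32(0);
        for (int j = 0; j < 32; j += 16)
            p = vdotq_s32(p, vld1q_s8(q + 32*ib + j), vld1q_s8(y + 32*ib + j));
        isum = vmlaq_n_s32(isum, p, block_scales[ib]);   // stay in int32
    }
    return d * (float)vaddvq_s32(isum);                  // one int -> float per super-block
}
```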
Here are performance comparisons for LLaMA-3.1-8B on M2-Max between the previous implementation and this PR:

| Quant | Task | Threads | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: | ---: | ---: |
| IQ4_XS_R4 | pp512 | 8 | 115.43 ± 0.57 | 131.28 ± 0.51 | 1.137 |
| | tg128 | 2 | 12.71 ± 0.01 | 13.44 ± 0.01 | 1.057 |
| | tg128 | 4 | 22.35 ± 0.17 | 22.98 ± 0.05 | 1.028 |
| Q4_K_R4 | pp512 | 8 | 110.02 ± 1.31 | 122.12 ± 1.28 | 1.110 |
| | tg128 | 2 | 12.17 ± 0.01 | 13.72 ± 0.01 | 1.127 |
| | tg128 | 4 | 21.56 ± 0.06 | 22.46 ± 0.20 | 1.042 |
| Q5_K_R4 | pp512 | 8 | 96.90 ± 0.79 | 108.66 ± 0.27 | 1.121 |
| | tg128 | 2 | 8.22 ± 0.01 | 8.66 ± 0.01 | 1.054 |
| | tg128 | 4 | 15.54 ± 0.09 | 16.13 ± 0.05 | 1.038 |
| Q6_K_R4 | pp512 | 8 | 83.25 ± 0.81 | 104.19 ± 1.96 | 1.252 |
| | tg128 | 2 | 7.35 ± 0.01 | 8.05 ± 0.00 | 1.095 |
| | tg128 | 4 | 13.80 ± 0.01 | 14.92 ± 0.03 | 1.081 |

TG results are shown only up to 4 threads because at 8 threads the result is 100% memory bound, so the same within noise.
@@ -0,0 +1,40 @@
### 🔀 [#136](https://github.com/ikawrakow/ik_llama.cpp/pull/136) - Q2_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |

---

#### Description

Follow up of #118, #119, #120, #121, #122, #123, #129, #130, #132, #134 for `Q2_K`.

This completes the R4 implementation for k-quants on `ARM_NEON`, `AVX2`, and `Zen4`.

We get significant performance gains on all platforms. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 73.79 ± 1.92 | 109.07 ± 0.58 | 1.478 |
| Zen4 | 16 | 205.95 ± 0.77 | 256.19 ± 0.26 | 1.244 |
| AVX2 | 32 | 214.42 ± 0.54 | 286.91 ± 0.63 | 1.338 |

As `Q2_K` is smaller than other k-quants, the CPU can do more work before the available memory bandwidth saturates when running TG. Hence, we get non-negligible performance gains on all platforms for TG as well.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | Q2_K_S | Q2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 10.34 ± 0.01 | 12.81 ± 0.01 | 1.239 |
| | 4 | 19.32 ± 0.02 | 23.40 ± 0.08 | 1.211 |
| | 8 | 32.36 ± 0.59 | 36.02 ± 0.40 | 1.113 |
| Zen4 | 1 | 6.60 ± 0.02 | 9.08 ± 0.12 | 1.376 |
| | 2 | 12.12 ± 0.01 | 16.40 ± 0.00 | 1.353 |
| | 4 | 19.12 ± 0.56 | 20.72 ± 0.19 | 1.084 |
| AVX2 | 2 | 5.93 ± 0.02 | 10.16 ± 0.30 | 1.713 |
| | 4 | 11.24 ± 0.00 | 17.59 ± 0.01 | 1.565 |
| | 8 | 18.62 ± 0.03 | 21.44 ± 0.00 | 1.151 |

It is actually too bad that `Q2_K` is such a low-quality quantization, as the performance is really good. Perhaps I should try to improve it? When I was developing it back then, it was much better than any other 2-bit attempt at the time, so I was quite pleased with the result. But with today's knowledge that we can do much better at 2 bpw, perhaps a fresh look could be useful.
@@ -0,0 +1,13 @@
### 🐛 [#137](https://github.com/ikawrakow/ik_llama.cpp/pull/137) - Fix AVX2 implementation of iq4_nl_r4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-11 |
| **Updated** | 2024-12-11 |

---

#### Description

The implementation was using `_mm256_maddubs_epi16`, which overflows (and gets saturated) with the unsigned version of the non-linear `IQ4_NL` lookup table. This PR fixes it without a noticeable performance loss.
@@ -0,0 +1,46 @@
### 🔀 [#138](https://github.com/ikawrakow/ik_llama.cpp/pull/138) - IQ4_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-12 |
| **Updated** | 2024-12-12 |

---

#### Description

On to the R4 implementation of the new iqk quants.

First `IQ4_K`.

We get very significant performance gains on `ARM_NEON` and more modest gains on `AVX2/Zen4`. I suspect my `AVX2/Zen4` implementation is not optimal, but I did not see a better way for now.

Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 58.20 ± 1.03 | 108.02 ± 1.10 | 1.856 |
| Zen4 | 16 | 182.20 ± 0.38 | 232.63 ± 0.39 | 1.277 |
| AVX2 | 32 | 206.43 ± 0.49 | 227.60 ± 0.46 | 1.103 |

We get decent performance gains for TG as well.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ4_K | IQ4_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 8.44 ± 0.02 | 10.56 ± 0.01 | 1.251 |
| | 4 | 15.90 ± 0.05 | 19.32 ± 0.14 | 1.215 |
| | 8 | 24.54 ± 0.15 | 25.16 ± 0.03 | 1.025 |
| Zen4 | 1 | 5.26 ± 0.00 | 6.73 ± 0.00 | 1.279 |
| | 2 | 9.71 ± 0.01 | 12.43 ± 0.00 | 1.269 |
| | 4 | 13.48 ± 0.06 | 14.00 ± 0.03 | 1.039 |
| AVX2 | 2 | 4.02 ± 0.00 | 6.91 ± 0.00 | 1.719 |
| | 4 | 8.03 ± 0.00 | 11.13 ± 0.00 | 1.386 |
| | 8 | 11.81 ± 0.00 | 12.75 ± 0.00 | 1.079 |

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [ ] Low
  - [ ] Medium
  - [ ] High
@@ -0,0 +1,34 @@
### 🔀 [#139](https://github.com/ikawrakow/ik_llama.cpp/pull/139) - Faster R4 quants on Zen4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-13 |
| **Updated** | 2024-12-13 |

---

#### Description

Use integer accumulators for dot products within superblocks. I did not use this originally because according to [this Intel reference](https://www.intel.com/content/www/us/en/docs/intrinsics-guide/index.html#ig_expand=6440,3715,4851,465,488,6424,488,4200,6554,83,4843,5760,5740,6548,6548,852,3669,6205,6205,3669,3675,5750,6375,6437,3869,2675,2675,3850,3869,2946,2946,308,1741,6044,6073,6585,7030,4851,4874,6196,6068,1741,4760,6077,4236,3667,4236,488,4044,3669,5741,6009,3869,691,5303,3843,3667,4843,110,5743,4772,1741,4046,4044,6077,4860,4860,3715,1866,1866,1866,4044,1863,1866,1866,3707,3715,5114,3667,3667,3667,5831,5738,3669,92,2692,4110,4203,4239,3869,94,853,856,1598,4953,6068,5997,4851,5997,4953,4931,6571,420,5068,488,488,4998,5010,3847,3842,4897,114,6007,4863,4761,6005,6008,3910,882,3921,6008,5002,6007,6598,1159,1159,144,828,486,823,299,337,823,4838,4239,2692,1607,6077,6006,4860,828,486,5704,6007,6007,6009,882,2692,2705,473,6007,3866,6007,4239,114,84,344,6006,5002,3869,5824,4690,143,4874,5234,5251,823,5234,2103,2662,2936,3670,2124,1664,5234,2632,5256,5234,5234,1622,461,1583,2252,4772,823,674,344,5234,2629,4175,5506,5512,5500,6189,6424,2692,2705,2671,5997,4986,679,2943,4960,4990,6068,6059,3667,6068,1750,1753,6189,2962,6053,4949,7003,7021,2930,3667,6077,782,6604,5086,6000,6047,6000,5997,6006,6000,6009,6000,6411,770,2938,4236,2965,6053,1753,1866,463,6050,2932,5798,6050,2932,6050,2930,5997,5053,4953,5994,6000,5056,2962,5056,6053,613,6000,6000,5056,2962,4642,4772,6601,1619,4772,6053,5041,4772&text=_mm256_mullo_epi32) the `_mm256_mullo_epi32()` instruction has an extremely high latency. But given that on `ARM_NEON` the use of integer dot product accumulation resulted in significant performance boost (see #135), I decided to still try. Outcome: it is faster, despite the high latency of the integer multiplication.
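
A minimal sketch of the pattern (illustrative only; the real kernels operate on 4 interleaved rows and use different scale layouts): the per-block integer sums of one super-block are multiplied by the block scales with `_mm256_mullo_epi32`, and only one float conversion/fma is done per super-block instead of one per block.

```cpp
#include <immintrin.h>

// block_sums and block_scales hold the 8 per-block int32 values of one super-block.
static inline __m256 accumulate_superblock(__m256 acc, __m256i block_sums,
                                           __m256i block_scales, float d) {
    __m256i isum = _mm256_mullo_epi32(block_sums, block_scales);  // high latency, but stays in int
    return _mm256_fmadd_ps(_mm256_set1_ps(d), _mm256_cvtepi32_ps(isum), acc);
}
```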
Here are PP-512 and TG-128 measurements for LLaMA-3.1-8B on Zen4 (Ryzen-7950X CPU):

| Quant | Threads | Task | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: | ---: | ---: |
| Q2_K_R4 | 16 | pp512 | 256.19 ± 0.26 | 272.69 ± 0.13 | 1.064 |
| | 1 | tg128 | 9.08 ± 0.12 | 9.95 ± 0.00 | 1.096 |
| | 2 | tg128 | 16.40 ± 0.00 | 17.44 ± 0.01 | 1.063 |
| | 4 | tg128 | 20.72 ± 0.12 | 20.97 ± 0.08 | 1.012 |
| Q3_K_R4 | 16 | pp512 | 236.77 ± 0.35 | 255.84 ± 0.20 | 1.081 |
| | 1 | tg128 | 6.78 ± 0.00 | 7.16 ± 0.07 | 1.056 |
| | 2 | tg128 | 12.46 ± 0.00 | 13.00 ± 0.01 | 1.043 |
| | 4 | tg128 | 17.02 ± 0.09 | 17.20 ± 0.24 | 1.012 |
| Q4_K_R4 | 16 | pp512 | 262.40 ± 0.28 | 268.09 ± 0.12 | 1.022 |
| IQ4_XS_R4 | 16 | pp512 | 256.80 ± 0.35 | 271.95 ± 0.39 | 1.059 |
| Q5_K_R4 | 16 | pp512 | 248.30 ± 0.29 | 256.68 ± 0.31 | 1.034 |
| Q6_K_R4 | 16 | pp512 | 243.25 ± 0.31 | 261.33 ± 0.38 | 1.074 |
| | 1 | tg128 | 7.94 ± 0.00 | 8.34 ± 0.00 | 1.050 |
| | 2 | tg128 | 10.38 ± 0.00 | 10.38 ± 0.00 | 1.000 |

For `Q4_K_R4`, `Q5_K_R4` and `IQ4_XS_R4`, matrix-vector multiplications are done with a different implementation where this change is not applicable, so there are no TG results for those.
@@ -0,0 +1,32 @@
### 🔀 [#14](https://github.com/ikawrakow/ik_llama.cpp/pull/14) - Adding IQ6_K

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-09 |
| **Updated** | 2024-08-09 |

---

#### Description

This PR

* Adds `IQ6_K` - see #8 for motivation
* Fixes the Zen4 implementation of `IQ3_K`, `IQ4_K` and `IQ5_K`

### New IQ6_K

The graph below is a copy of the graph in #8 with the quantization error of the new `IQ6_K` non-linear quantization type added (cyan circle near 6.6 bpw). We observe a significant improvement compared to `Q6_K` (0.4% vs 0.65%). The LLaMA-3.1-8B quantization error is better too (0.15% vs 0.26%), so I think this is a worthwhile addition.

![iq6k](https://github.com/user-attachments/assets/9b7b6bd1-2a32-4a83-b10c-d45c2f9f1aee)

### Fixing the Zen4 implementation of `IQ3_K`, `IQ4_K` and `IQ5_K`

While working on `IQ6_K`, I noticed that there is a problem with the Zen4 implementation of the `IQ3,4,5_K` quants. I was using the standard k-quants matrix multiplication template (`mul_mat_qX_K_q8_K_AVX512`). On Zen4, this template uses the `_mm512_dpbusd_epi32` instruction to perform the dot product between the quants of the left matrix and the `Q8_K` quants of the right matrix, which produces a SIMD vector containing 32-bit integer results. For k-quants these 32-bit integers fall within `int16_t` range, so they get packed to 16 bit and are then multiplied with the block scales. But for the 3+ bit non-linear quants, `_mm512_dpbusd_epi32` may go outside the `int16_t` range, which leads to truncation and a wrong result. I have now corrected the implementation. This results in a small performance regression. The table below shows a performance comparison for LLaMA-3.1-8B between the original and the corrected Zen4 implementation for `IQ3_K` on a Ryzen-7950X (using 16 threads for PP-512 and 4 threads for TG-128):

| | t/s (PP-512) | t/s (TG-128) |
| ---: | ----: | ----: |
| Before fix | 180.77 ± 0.62 | 16.10 ± 0.16 |
| After fix | 167.69 ± 0.69 | 15.84 ± 0.33 |
| Ratio | 0.940 | 0.984 |
@@ -0,0 +1,23 @@
### 🔀 [#141](https://github.com/ikawrakow/ik_llama.cpp/pull/141) - Q8_K_R8: Fastest quantized matrix multiplications

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-14 |
| **Updated** | 2024-12-14 |

---

#### Description

This PR adds `Q8_K_R8` - an 8-rows interleaved version of `Q8_K`. With that, we break the world record in prompt processing speed. Here is what we get for PP-512 with LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `AVX2` (Ryzen-5975WX) and `ARM_NEON` (M2-Max):

| Platform | PP-512 (Q8_0_R4) | PP-512 (Q8_K_R8) | Speedup |
| ---: | ---: | ---: | ---: |
| ARM_NEON | 128.29 ± 1.50 | 172.52 ± 4.17 | 1.345 |
| Zen4 | 268.98 ± 0.31 | 368.85 ± 0.73 | 1.371 |
| AVX2 | 234.40 ± 0.60 | 293.72 ± 0.34 | 1.253 |

On the Ryzen-7950X, which provides native `bf16` support, this is nearly 60% faster than `bf16`. On the M2-Max, which has native `fp16` support, `Q8_K_R8` is 87% faster than `fp16`!

**Note on AVX2**: In the `AVX2` implementation one needs to use the `_mm256_maddubs_epi16(x, y)` instruction, where `x` holds unsigned 8-bit integers and `y` holds signed 8-bit integers. In the initial implementation I forgot for the 177'th time that the unsigned integers still need to be within `0...127`, else adding up two adjacent products (as the instruction does) may overflow the `int16_t` range (and gets silently truncated if it does), so I was simply making the `Q8_K_R8` quants unsigned (`xor 0x80`). This implementation resulted in 354 t/s on the Ryzen-5975WX. Sadly, one needs to "unsign" the `Q8_K_R8` quants with `_mm256_sign_epi8(x, x)` and then apply the sign to the activation quants before taking the dot product. This is quite costly, and `AVX2` performance drops to 293 t/s. Being curious about the effect the `int16_t` overflow might have, I computed LLaMA-3.1-8B-Instruct perplexity (context of 512 tokens) with the original and with the correct implementation. I get `PPL = 7.3725` with the overflowing variant and `PPL = 7.3443` with the correct implementation. I.e., the effect is small but noticeable.
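
For reference, the correct `AVX2` pattern described above looks roughly like this (a sketch, not the actual kernel): the first operand is made non-negative and its sign is moved onto the second operand, so the two adjacent products summed by `_mm256_maddubs_epi16` cannot overflow `int16_t`.

```cpp
#include <immintrin.h>

static inline __m256i dot_step_i8(__m256i acc16, __m256i x, __m256i y) {
    __m256i ux = _mm256_sign_epi8(x, x);   // |x|, guaranteed within 0...127
    __m256i sy = _mm256_sign_epi8(y, x);   // y with the sign of x applied (zeroed where x == 0)
    return _mm256_add_epi16(acc16, _mm256_maddubs_epi16(ux, sy));
}
```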
@@ -0,0 +1,22 @@
### 🔀 [#142](https://github.com/ikawrakow/ik_llama.cpp/pull/142) - BF16_R16 - 16 interleaved bf16 rows

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-14 |
| **Updated** | 2024-12-15 |

---

#### Description

After breaking the world record for 8-bit quantized matrix multiplications with `Q8_K_R8` in PR #141, I got excited to try to speed up `bf16` CPU inference. This PR is the somewhat disappointing result. I tried interleaving 4, 8, and 16 rows; 16 is fastest (but only very slightly faster than 8). It is disappointing because we only gain about 11% in prompt processing speed compared to the `bf16` implementation in `iqk_mul_mat` (but that one is already ~3X faster than mainline `llama.cpp`). On the bright side, we do get a TG speedup - 3.12 t/s vs 2.5 t/s for LLaMA-3.1-8B with 1 thread on a Ryzen-7950X, and 4.25 t/s vs 3.9 t/s with 2 threads (2 threads fully saturate the memory bandwidth when using `BF16_R16`).

Anyway, here is a table with the `BF16_R16` PP-512 and TG-128 speeds on the Ryzen-7950X:

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 8B BF16_R16 | 14.96 GiB | 8.03 B | CPU | 16 | pp512 | 263.15 ± 0.19 |
| llama 8B BF16_R16 | 14.96 GiB | 8.03 B | CPU | 1 | tg128 | 3.12 ± 0.00 |
| llama 8B BF16_R16 | 14.96 GiB | 8.03 B | CPU | 2 | tg128 | 4.25 ± 0.00 |
| llama 8B BF16_R16 | 14.96 GiB | 8.03 B | CPU | 4 | tg128 | 4.14 ± 0.00 |
@@ -0,0 +1,15 @@
### 🔀 [#143](https://github.com/ikawrakow/ik_llama.cpp/pull/143) - Slightly faster IQ4_XS_R4 on AVX2

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-16 |
| **Updated** | 2024-12-16 |

---

#### Description

PP-512(LLaMA-3.1-8B) on the Ryzen-5975WX goes to 262.2 t/s, up from 248.2 t/s.

On AVX2/Zen4 it is much better to interleave 8 rows - see [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/iq4_xs_r8). We get 284 t/s on Zen4 and 275 t/s on AVX2. But the `ARM_NEON` implementation becomes extremely messy, and we get ~1-2% lower performance there. Hence, sticking with 4 interleaved rows for now.
@@ -0,0 +1,13 @@
### 🔀 [#144](https://github.com/ikawrakow/ik_llama.cpp/pull/144) - Slightly faster IQ4_K_R4 on AVX2/Zen4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-16 |
| **Updated** | 2024-12-16 |

---

#### Description

We get PP-512(LLaMA-3.1-8B) = 251 t/s (Ryzen-7950X) or 249 t/s (Ryzen-5975WX), up from 232/227 t/s.
@@ -0,0 +1,38 @@
### 🔀 [#145](https://github.com/ikawrakow/ik_llama.cpp/pull/145) - IQ3_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |

---

#### Description

Adding `IQ3_K` with 4 interleaved rows.

We get very significant performance gains on `ARM_NEON` and more modest gains on `AVX2/Zen4`. Overall slower than other `_R4` quants, which is expected, as 3-bit quantization is always kind of slow.

Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ3_K | IQ3_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 54.94 ± 0.79 | 93.83 ± 0.09 | 1.708 |
| Zen4 | 16 | 180.13 ± 0.48 | 230.33 ± 0.13 | 1.279 |
| AVX2 | 32 | 197.59 ± 0.43 | 253.36 ± 0.50 | 1.282 |

We get decent performance gains for TG as well.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ3_K | IQ3_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 5.84 ± 0.00 | 6.71 ± 0.05 | 1.149 |
| | 4 | 11.14 ± 0.00 | 12.83 ± 0.01 | 1.152 |
| | 8 | 20.59 ± 0.17 | 23.07 ± 0.16 | 1.120 |
| Zen4 | 1 | 5.06 ± 0.00 | 5.64 ± 0.00 | 1.115 |
| | 2 | 9.58 ± 0.01 | 10.50 ± 0.01 | 1.096 |
| | 4 | 16.56 ± 0.05 | 16.77 ± 0.32 | 1.013 |
| AVX2 | 2 | 4.45 ± 0.00 | 6.83 ± 0.00 | 1.535 |
| | 4 | 8.24 ± 0.00 | 12.51 ± 0.00 | 1.518 |
| | 8 | 14.59 ± 0.04 | 16.23 ± 0.00 | 1.112 |
@@ -0,0 +1,38 @@
### 🔀 [#146](https://github.com/ikawrakow/ik_llama.cpp/pull/146) - IQ2_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |

---

#### Description

Adding `IQ2_K` with 4 interleaved rows.

We get very significant performance gains on `ARM_NEON` and more modest gains on `AVX2/Zen4`.

Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ2_K | IQ2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 59.71 ± 0.91 | 107.93 ± 0.75 | 1.808 |
| Zen4 | 16 | 198.79 ± 0.58 | 250.19 ± 0.42 | 1.259 |
| AVX2 | 32 | 209.02 ± 0.16 | 287.17 ± 0.64 | 1.374 |

We get decent performance gains for TG as well.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ2_K | IQ2_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 8.22 ± 0.01 | 9.79 ± 0.00 | 1.191 |
| | 4 | 15.12 ± 0.01 | 18.25 ± 0.02 | 1.207 |
| | 8 | 28.01 ± 0.13 | 32.33 ± 0.26 | 1.154 |
| Zen4 | 1 | 6.56 ± 0.00 | 7.13 ± 0.11 | 1.087 |
| | 2 | 11.89 ± 0.00 | 13.35 ± 0.01 | 1.123 |
| | 4 | 19.37 ± 1.84 | 21.55 ± 0.86 | 1.113 |
| AVX2 | 2 | 5.06 ± 0.00 | 8.83 ± 0.00 | 1.745 |
| | 4 | 9.63 ± 0.00 | 16.28 ± 0.00 | 1.691 |
| | 8 | 17.45 ± 0.08 | 22.11 ± 0.00 | 1.267 |
@@ -0,0 +1,19 @@
### 🔀 [#147](https://github.com/ikawrakow/ik_llama.cpp/pull/147) - Be able to repack tensors at run time

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |

---

#### Description

It is a bit of a hack, as I didn't see a good way to figure out if tensors may be uploaded to a GPU later on. But if running on the CPU it works fine. Just use
```
-rtr or --run-time-repack
```
and all tensor types that have a corresponding type with interleaved rows will be repacked.

**Note**: turning on run-time repacking will automatically turn off `mmap`.
@@ -0,0 +1,7 @@
### 🔀 [#148](https://github.com/ikawrakow/ik_llama.cpp/pull/148) - Slightly better matrix x vector on Zen4/AVX2 for iq2_k_r4, iq3_k_r4, iq4_k_r4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-17 |
| **Updated** | 2024-12-17 |
@@ -0,0 +1,42 @@
### 🔀 [#149](https://github.com/ikawrakow/ik_llama.cpp/pull/149) - IQ5_K_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-18 |
| **Updated** | 2025-03-27 |

---

#### Description

Adding `IQ5_K` with 4 interleaved rows.

We get very significant performance gains on `ARM_NEON` and more modest gains on `AVX2/Zen4`.

Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ5_K | IQ5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 53.80 ± 1.08 | 93.33 ± 2.02 | 1.735 |
| Zen4 | 16 | 168.09 ± 0.58 | 230.23 ± 0.23 | 1.370 |
| AVX2 | 32 | 177.16 ± 0.31 | 231.50 ± 0.43 | 1.307 |

TG does not look good on AVX2/Zen4. On ARM_NEON we get a decent performance gain.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ5_K | IQ5_K_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 5.92 ± 0.07 | 6.98 ± 0.00 | 1.179 |
| | 4 | 11.53 ± 0.01 | 13.35 ± 0.01 | 1.158 |
| | 8 | 20.29 ± 0.46 | 21.17 ± 0.18 | 1.043 |

---

#### 💬 Conversation

👤 **saood06** commented the **2025-03-27** at **06:53:47**:<br>

> TG does not look good on AVX2/Zen4

Does this mean a regression compared to non-interleaved, or just no benefit?
@@ -0,0 +1,38 @@
### 🔀 [#150](https://github.com/ikawrakow/ik_llama.cpp/pull/150) - IQ4_KS_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-18 |
| **Updated** | 2024-12-18 |

---

#### Description

Adding `IQ4_KS` with 4 interleaved rows.

We get very significant performance gains on `ARM_NEON` and good gains on `AVX2/Zen4`.

Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ4_KS | IQ4_KS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 67.29 ± 1.02 | 124.91 ± 0.62 | 1.856 |
| Zen4 | 16 | 180.42 ± 0.68 | 266.05 ± 0.45 | 1.475 |
| AVX2 | 32 | 201.79 ± 0.48 | 245.37 ± 0.52 | 1.216 |

We get decent performance gains for TG as well.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ4_KS | IQ4_KS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 10.84 ± 0.01 | 12.55 ± 0.00 | 1.158 |
| | 4 | 19.81 ± 0.12 | 22.06 ± 0.06 | 1.114 |
| | 8 | 25.74 ± 0.47 | 26.47 ± 0.21 | 1.039 |
| Zen4 | 1 | 6.18 ± 0.00 | 7.97 ± 0.11 | 1.290 |
| | 2 | 11.73 ± 0.02 | 13.43 ± 0.00 | 1.145 |
| | 4 | 13.09 ± 1.13 | 14.46 ± 0.00 | 1.105 |
| AVX2 | 2 | 4.74 ± 0.00 | 7.30 ± 0.00 | 1.540 |
| | 4 | 8.75 ± 0.00 | 11.39 ± 0.00 | 1.302 |
| | 8 | 12.38 ± 0.01 | 12.73 ± 0.00 | 1.028 |
@@ -0,0 +1,23 @@
### 🐛 [#151](https://github.com/ikawrakow/ik_llama.cpp/pull/151) - fix typo

| **Author** | `Nexesenex` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |

---

#### Description

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [x] Low
  - [ ] Medium
  - [ ] High

---

#### 💬 Conversation

👤 **ikawrakow** submitted a review the **2024-12-20** at **11:02:09**: ✅ `APPROVED`
@@ -0,0 +1,43 @@
### 🔀 [#152](https://github.com/ikawrakow/ik_llama.cpp/pull/152) - IQ3_XXS_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |

---

#### Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see if we can improve that by interleaving rows.

This PR adds `IQ3_XXS_R4`, a 4-row interleaved version of `IQ3_XXS`.

We get decent performance gains, but still remain much slower than k- or legacy quants. I think there is still potential for optimization, but I was getting constantly confused about shuffling signs and scales, so in the end I gave up with this result.
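
To give an idea of the sign handling that makes these quants awkward (a hypothetical scalar sketch, not the actual tables or layout used in the repo): each group of 8 grid magnitudes carries 7 explicit sign bits, with the 8th sign chosen so that the number of negative values is even.

```cpp
#include <cstdint>

// mags:  8 unsigned magnitudes looked up from the codebook grid
// sign7: the 7 explicit sign bits for this group
static inline void apply_signs(const uint8_t mags[8], uint8_t sign7, int8_t out[8]) {
    int parity = 0;
    for (int j = 0; j < 7; ++j) {
        int s = (sign7 >> j) & 1;
        parity ^= s;
        out[j] = s ? -(int8_t)mags[j] : (int8_t)mags[j];
    }
    out[7] = parity ? -(int8_t)mags[7] : (int8_t)mags[7];  // implicit 8th sign (even parity)
}
```

Vectorizing this, together with the two-level scales, for 4 interleaved rows at once is where the shuffling gets confusing.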
Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ3_XXS | IQ3_XXS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 48.18 ± 0.69 | 67.45 ± 0.78 | 1.400 |
| Zen4 | 16 | 107.42 ± 0.33 | 141.62 ± 0.30 | 1.318 |
| AVX2 | 32 | 142.38 ± 0.48 | 184.42 ± 0.26 | 1.295 |

We get decent performance gains for TG as well, especially on `AVX2`.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ3_XXS | IQ3_XXS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 3.46 ± 0.02 | 4.79 ± 0.00 | 1.384 |
| | 4 | 6.65 ± 0.01 | 8.78 ± 0.04 | 1.320 |
| | 8 | 10.83 ± 0.18 | 15.95 ± 0.25 | 1.473 |
| Zen4 | 2 | 5.18 ± 0.00 | 6.53 ± 0.00 | 1.261 |
| | 4 | 9.70 ± 0.00 | 12.15 ± 0.00 | 1.253 |
| | 8 | 17.19 ± 0.18 | 17.93 ± 0.00 | 1.044 |
| AVX2 | 2 | 2.04 ± 0.00 | 4.07 ± 0.00 | 1.995 |
| | 4 | 4.04 ± 0.00 | 7.94 ± 0.00 | 1.965 |
| | 8 | 7.40 ± 0.01 | 14.16 ± 0.06 | 1.914 |
| | 16 | 13.64 ± 0.00 | 17.92 ± 0.01 | 1.314 |

We now manage to saturate the available memory bandwidth on the Ryzen CPUs at 8 (Ryzen-7950X) or 16 (Ryzen-5975WX) threads, but are far from being memory bound on the M2-Max.
@@ -0,0 +1,43 @@
### 🔀 [#153](https://github.com/ikawrakow/ik_llama.cpp/pull/153) - IQ3_XXS_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-20 |
| **Updated** | 2024-12-20 |

---

#### Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see if we can improve that by interleaving rows.

This PR adds `IQ3_XXS_R4`, a 4-row interleaved version of `IQ3_XXS`.

We get decent performance gains, but still remain much slower than k- or legacy quants. I think there is still potential for optimization, but I was getting constantly confused about shuffling signs and scales, so in the end I gave up with this result.

Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX):

| Platform | Threads | IQ3_XXS | IQ3_XXS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 48.18 ± 0.69 | 67.45 ± 0.78 | 1.400 |
| Zen4 | 16 | 107.42 ± 0.33 | 141.62 ± 0.30 | 1.318 |
| AVX2 | 32 | 142.38 ± 0.48 | 184.42 ± 0.26 | 1.295 |

We get decent performance gains for TG as well, especially on `AVX2`.
Here are results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ3_XXS | IQ3_XXS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 3.46 ± 0.02 | 4.79 ± 0.00 | 1.384 |
| | 4 | 6.65 ± 0.01 | 8.78 ± 0.04 | 1.320 |
| | 8 | 10.83 ± 0.18 | 15.95 ± 0.25 | 1.473 |
| Zen4 | 2 | 5.18 ± 0.00 | 6.53 ± 0.00 | 1.261 |
| | 4 | 9.70 ± 0.00 | 12.15 ± 0.00 | 1.253 |
| | 8 | 17.19 ± 0.18 | 17.93 ± 0.00 | 1.044 |
| AVX2 | 2 | 2.04 ± 0.00 | 4.07 ± 0.00 | 1.995 |
| | 4 | 4.04 ± 0.00 | 7.94 ± 0.00 | 1.965 |
| | 8 | 7.40 ± 0.01 | 14.16 ± 0.06 | 1.914 |
| | 16 | 13.64 ± 0.00 | 17.92 ± 0.01 | 1.314 |

We now manage to saturate the available memory bandwidth on the Ryzen CPUs at 8 (Ryzen-7950X) or 16 (Ryzen-5975WX) threads, but are far from being memory bound on the M2-Max.
@@ -0,0 +1,43 @@
|
||||
### 🔀 [#154](https://github.com/ikawrakow/ik_llama.cpp/pull/154) - IQ2_XXS_R4
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2024-12-20 |
|
||||
| **Updated** | 2024-12-20 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Sub-4 bpw i-quants have a terrible CPU performance, so I was curious to see if we can improve by interleaving rows.
|
||||
|
||||
This PR adds `IQ2_XXS_R4`, a 4-row interleaved version of `IQ2_XXS`.
|
||||
|
||||
We get decent performance gains, but still remain much slower than k- or legacy quants. I think there is still potential for optimization, but I was getting constantly confused about shuffling signs and scales, so at the end gave up with this result.
|
||||
|
||||
Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)
|
||||
|
||||
| Platform | Threads | IQ2_XXS | IQ2_XXS_R4 | Speedup |
|
||||
| ---: | ---: | ---: | ---: | ---: |
|
||||
| ARM_NEON | 8 | 56.40 ± 0.99 | 76.34 ± 0.58 | 1.354 |
|
||||
| Zen4 | 16 | 134.68 ± 0.31 | 153.60 ± 0.23 | 1.140 |
|
||||
| AVX2 | 32 | 155.48 ± 0.17 | 195.72 ± 0.20 | 1.259 |
|
||||
|
||||
We get very decent performance gains for TG as well, especially on `AVX2`.
|
||||
Here results for TG-128 on LLaMA-3.1-8B with different numbers of threads:
|
||||
|
||||
| Platform | Threads | IQ2_XXS | IQ2_XXS_R4 | Speedup |
|
||||
| ---: | ---: | ---: | ---: | ---: |
|
||||
| ARM_NEON | 2 | 4.40 ± 0.03 | 6.65 ± 0.00 | 1.511 |
|
||||
| | 4 | 8.61 ± 0.01 | 12.20 ± 0.02 | 1.417 |
|
||||
| | 8 | 15.84 ± 0.34 | 21.76 ± 0.31 | 1.374 |
|
||||
| Zen4 | 2 | 6.59 ± 0.00 | 8.66 ± 0.00 | 1.314 |
|
||||
| | 4 | 11.62 ± 0.81 | 15.49 ± 0.36 | 1.333 |
|
||||
| | 8 | 20.40 ± 0.70 | 23.37 ± 0.03 | 1.146 |
|
||||
| AVX2 | 2 | 2.62 ± 0.00 | 5.54 ± 0.00 | 2.115 |
|
||||
| | 4 | 5.17 ± 0.00 | 10.81 ± 0.00 | 2.091 |
|
||||
| | 8 | 9.49 ± 0.02 | 18.93 ± 0.08 | 1.995 |
|
||||
| | 16 | 16.97 ± 0.00 | 25.70 ± 0.01 | 1.514 |
|
||||
|
||||
We now manage to saturate the available memory bandwidth on the Ryzen CPUs at 8 (Ryzen-7950X) or 16 (Ryzen-5975WX) threads, but are far from being memory bound on the M2-Max.
@@ -0,0 +1,41 @@
### 🔀 [#155](https://github.com/ikawrakow/ik_llama.cpp/pull/155) - IQ2_XS_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-21 |
| **Updated** | 2024-12-21 |

---

#### Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see if we can improve it by interleaving rows.

This PR adds `IQ2_XS_R4`, a 4-row interleaved version of `IQ2_XS`.

We get very modest performance gains. I guess the combination of loading data from a large table, blocks of 16 quants, and perhaps me not having found the optimum bit packing kills the performance.

Anyway, here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)

| Platform | Threads | IQ2_XS | IQ2_XS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 45.55 ± 0.28 | 54.13 ± 0.19 | 1.188 |
| Zen4 | 16 | 135.43 ± 0.65 | 156.55 ± 0.51 | 1.156 |
| AVX2 | 32 | 157.34 ± 0.27 | 192.60 ± 0.37 | 1.224 |

We get some performance gains for TG as well, especially on `AVX2`.
Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ2_XS | IQ2_XS_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 5.10 ± 0.02 | 5.91 ± 0.01 | 1.159 |
| | 4 | 9.71 ± 0.09 | 10.90 ± 0.03 | 1.123 |
| | 8 | 17.21 ± 0.77 | 19.30 ± 0.56 | 1.121 |
| Zen4 | 2 | 6.54 ± 0.01 | 6.90 ± 0.00 | 1.055 |
| | 4 | 12.23 ± 0.02 | 12.79 ± 0.00 | 1.046 |
| | 8 | 21.19 ± 0.01 | 22.12 ± 0.01 | 1.044 |
| AVX2 | 2 | 3.16 ± 0.00 | 4.54 ± 0.00 | 1.437 |
| | 4 | 6.13 ± 0.00 | 8.75 ± 0.00 | 1.427 |
| | 8 | 11.31 ± 0.05 | 15.67 ± 0.05 | 1.385 |
| | 16 | 19.41 ± 0.01 | 22.28 ± 0.00 | 1.148 |
@@ -0,0 +1,41 @@
### 🔀 [#156](https://github.com/ikawrakow/ik_llama.cpp/pull/156) - IQ2_S_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-21 |
| **Updated** | 2024-12-21 |

---

#### Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see if we can improve it by interleaving rows.

This PR adds `IQ2_S_R4`, a 4-row interleaved version of `IQ2_S`.

We get very modest performance gains. I guess the combination of loading data from a large table, blocks of 16 quants, and perhaps me not having found the optimum bit packing kills the performance.

Anyway, here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)

| Platform | Threads | IQ2_S | IQ2_S_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 44.68 ± 0.20 | 50.40 ± 0.18 | 1.128 |
| Zen4 | 16 | 117.47 ± 0.47 | 148.51 ± 0.51 | 1.264 |
| AVX2 | 32 | 150.92 ± 0.25 | 177.59 ± 0.40 | 1.177 |

We get some performance gains for TG as well, especially on `AVX2`.
Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ2_S | IQ2_S_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 4.30 ± 0.00 | 4.56 ± 0.01 | 1.084 |
| | 4 | 8.20 ± 0.03 | 8.64 ± 0.02 | 1.054 |
| | 8 | 15.07 ± 0.35 | 16.12 ± 0.17 | 1.070 |
| Zen4 | 2 | 5.31 ± 0.01 | 5.56 ± 0.0 | 1.047 |
| | 4 | 9.53 ± 0.29 | 10.52 ± 0.02 | 1.104 |
| | 8 | 17.80 ± 0.03 | 18.66 ± 0.05 | 1.048 |
| AVX2 | 2 | 2.60 ± 0.00 | 3.83 ± 0.0 | 1.473 |
| | 4 | 5.02 ± 0.00 | 7.40 ± 0.00 | 1.474 |
| | 8 | 9.69 ± 0.04 | 13.97 ± 0.03 | 1.442 |
| | 16 | 16.70 ± 0.00 | 19.52 ± 0.00 | 1.169 |
@@ -0,0 +1,36 @@
### 🔀 [#157](https://github.com/ikawrakow/ik_llama.cpp/pull/157) - R4 i-quants improvements

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-22 |
| **Updated** | 2024-12-22 |

---

#### Description

Unpacking k- and i-quants is computationally expensive. Because of this, it is useful to re-use the unpacked quants for multiplication with as many columns in the right matrix as possible. At the same time one also needs to restrict the number of columns being used to some maximum number so that accumulated results can remain in vector registers, so in `iqk_mul_mat` up to 8 columns are used. But unpacking `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS` is computationally so expensive that it is cheaper to load/unload accumulated results to/from vector registers so that unpacked quants can be reused more than 8 times.

This PR adds this change using 16 columns. We get non-negligible performance gains for `IQ2_XXS`, `IQ2_XS`, `IQ2_S`, `IQ3_XXS`, and even gain somewhat for `IQ3_K`, `IQ4_K`, `IQ4_KS`, and `IQ5_K`.
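
As an illustration of the idea, here is a minimal scalar sketch of multiplying one unpacked row by 16 right-hand columns. `unpack_row` is a hypothetical stand-in for the expensive i-quant decoding (sign/scale shuffling); the real `iqk_mul_mat` kernels are SIMD and keep the 16 accumulators in vector registers.

```c++
#include <cstdint>
#include <vector>

// Stand-in for the expensive i-quant decoding step (the real code shuffles
// signs, scales and lookup tables; here we just apply a scale to int8 data).
static void unpack_row(const int8_t* packed, float scale, float* out, int k) {
    for (int i = 0; i < k; ++i) out[i] = scale * packed[i];
}

constexpr int NCOLS = 16;

// Multiply one packed left-hand row by NCOLS right-hand columns: the row is
// unpacked once and reused, amortizing the unpacking cost over 16 columns.
void mul_row_x16(const int8_t* a_packed, float a_scale, const float* B,
                 int k, int ldb, float* C, int ldc) {
    std::vector<float> a(k);
    unpack_row(a_packed, a_scale, a.data(), k);   // unpack once
    for (int j = 0; j < NCOLS; ++j) {             // reuse 16 times
        float acc = 0.0f;
        const float* b = B + j * ldb;
        for (int i = 0; i < k; ++i) acc += a[i] * b[i];
        C[j * ldc] = acc;
    }
}
```
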
The table shows PP-512 performance comparisons between the main branch and this PR for LLaMA-3.1-8B and the affected quants on `ARM_NEON` (M2-Max), `Zen4` (Ryzen-7950X) and `AVX2` (Ryzen-5975WX). When a given quantization/platform combination is missing in the table, the change did not improve performance, so it was not enabled for the given combination.

| Quantization | Platform | Threads | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: | ---: | ---: |
| IQ2_XXS_R4 | ARM_NEON | 8 | 76.34 ± 0.58 | 85.33 ± 1.59 | 1.118 |
| | Zen4 | 16 | 151.08 ± 0.22 | 162.72 ± 0.49 | 1.077 |
| | AVX2 | 32 | 195.72 ± 0.20 | 221.85 ± 0.38 | 1.134 |
| IQ2_XS_R4 | ARM_NEON | 8 | 54.13 ± 0.19 | 67.99 ± 0.22 | 1.256 |
| | AVX2 | 32 | 192.60 ± 0.37 | 220.56 ± 0.48 | 1.145 |
| IQ2_M_R4 | ARM_NEON | 8 | 50.40 ± 0.18 | 62.29 ± 0.21 | 1.236 |
| | Zen4 | 16 | 148.51 ± 0.51 | 169.49 ± 0.53 | 1.141 |
| | AVX2 | 32 | 176.76 ± 0.27 | 203.35 ± 0.46 | 1.150 |
| IQ3_XXS_R4 | ARM_NEON | 8 | 67.45 ± 0.78 | 73.56 ± 1.26 | 1.091 |
| | Zen4 | 16 | 141.62 ± 0.30 | 149.41 ± 0.49 | 1.055 |
| | AVX2 | 32 | 184.42 ± 0.26 | 206.96 ± 0.44 | 1.122 |
| IQ3_K_R4 | Zen4 | 16 | 230.33 ± 0.13 | 243.34 ± 0.50 | 1.056 |
| IQ4_KS_R4 | AVX2 | 32 | 245.37 ± 0.52 | 250.76 ± 0.50 | 1.022 |
| IQ4_K_R4 | AVX2 | 32 | 249.11 ± 0.38 | 264.23 ± 0.41 | 1.061 |
| IQ5_K_R4 | Zen4 | 16 | 230.23 ± 0.23 | 240.65 ± 0.58 | 1.045 |
| | AVX2 | 32 | 231.50 ± 0.43 | 245.98 ± 0.37 | 1.063 |
@@ -0,0 +1,22 @@
### 🔀 [#158](https://github.com/ikawrakow/ik_llama.cpp/pull/158) - Faster R4 legacy quants

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-22 |
| **Updated** | 2024-12-22 |

---

#### Description

It seems converting `fp16` to `fp32` is extremely slow on the Ryzen-5975WX CPU (or `ggml`'s `GGML_FP16_TO_FP32` is inadequate), so it is better to convert the `fp16` `Q8_1_x4` block scales using `AVX2` intrinsics, store the result, and then use the converted `fp32` scales when performing the dot product. This PR does that on `AVX2` for `Q4_0_R4, Q5_0_R4, Q6_0_R4` and `Q8_0_R4`. There was no benefit on the Ryzen-7950X (`Zen4`), so not implemented there.
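
A minimal sketch of such a scale pre-conversion, assuming one fp16 scale per block and using the F16C `_mm256_cvtph_ps` intrinsic (compile with `-mavx2 -mf16c`); the actual `Q8_1_x4` layout and kernel code in the repository differ.

```c++
#include <immintrin.h>
#include <cstdint>

// Convert a batch of fp16 block scales to fp32 up front, so the dot product
// loop only reads ready-to-use fp32 scales.
void convert_scales_f16_to_f32(const uint16_t* d16, float* d32, int nblocks) {
    int i = 0;
    for (; i + 8 <= nblocks; i += 8) {
        __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(d16 + i));
        _mm256_storeu_ps(d32 + i, _mm256_cvtph_ps(h));   // 8 scales at a time
    }
    for (; i < nblocks; ++i) {                            // scalar tail
        __m128i h = _mm_cvtsi32_si128(d16[i]);
        d32[i] = _mm_cvtss_f32(_mm_cvtph_ps(h));
    }
}
```
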
The table shows PP-512 comparison between the main branch and this PR for LLaMA-3.1-8B on the Ryzen-5975WX

| Quant | t/s (main) | t/s (PR) | Speedup |
| ---: | ---: | ---: | ---: |
| Q4_0_R4 | 251.00 ± 0.51 | 283.61 ± 0.50 | 1.130 |
| Q5_0_R4 | 236.33 ± 0.56 | 271.57 ± 0.52 | 1.149 |
| Q6_0_R4 | 231.53 ± 0.60 | 260.22 ± 0.53 | 1.124 |
| Q8_0_R4 | 234.40 ± 0.60 | 246.11 ± 0.54 | 1.050 |
@@ -0,0 +1,13 @@
### 🐛 [#16](https://github.com/ikawrakow/ik_llama.cpp/pull/16) - Fix Makefile

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-09 |
| **Updated** | 2024-08-09 |

---

#### Description

I always use cmake, so I had forgotten to pay attention to the Makefile.
@@ -0,0 +1,50 @@
### 🐛 [#161](https://github.com/ikawrakow/ik_llama.cpp/pull/161) - MSVC fixes

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-22 |
| **Updated** | 2024-12-23 |

---

#### Description

@Nexesenex Does this fix #160?

---

#### 💬 Conversation

👤 **Nexesenex** commented the **2024-12-22** at **16:44:51**:<br>



Sadly not.

---

👤 **ikawrakow** commented the **2024-12-22** at **17:15:34**:<br>

And now?

---

👤 **Nexesenex** commented the **2024-12-22** at **17:47:25**:<br>



Same.

---

👤 **ikawrakow** commented the **2024-12-22** at **17:51:20**:<br>

Did you pull? These errors are from the previous version, and not what is currently on this branch.

---

👤 **Nexesenex** commented the **2024-12-23** at **06:18:47**:<br>

I apologize, I didn't compile the updated branch indeed. (-*-)
It works now, thank you.
@@ -0,0 +1,39 @@
### 🔀 [#162](https://github.com/ikawrakow/ik_llama.cpp/pull/162) - IQ3_S_R4

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-23 |
| **Updated** | 2024-12-23 |

---

#### Description

Sub-4 bpw i-quants have terrible CPU performance, so I was curious to see if we can improve it by interleaving rows.

This PR adds `IQ3_S_R4`, a 4-row interleaved version of `IQ3_S`.

We get significant performance gains. Here is `PP-512` for LLaMA-3.1-8B on `Zen4` (Ryzen-7950X), `ARM_NEON` (M2-Max) and `AVX2` (Ryzen-5975WX)

| Platform | Threads | IQ3_S | IQ3_S_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 8 | 42.97 ± 1.28 | 80.61 ± 0.41 | 1.876 |
| Zen4 | 16 | 104.66 ± 0.68 | 159.08 ± 0.57 | 1.520 |
| AVX2 | 32 | 132.50 ± 0.37 | 231.41 ± 0.45 | 1.746 |

We get decent performance gains for TG as well, especially on `AVX2`.
Here are the results for TG-128 on LLaMA-3.1-8B with different numbers of threads:

| Platform | Threads | IQ3_S | IQ3_S_R4 | Speedup |
| ---: | ---: | ---: | ---: | ---: |
| ARM_NEON | 2 | 3.00 ± 0.00 | 3.40 ± 0.00 | 1.133 |
| | 4 | 5.74 ± 0.02 | 6.60 ± 0.01 | 1.150 |
| | 8 | 9.25 ± 0.83 | 12.27 ± 0.33 | 1.326 |
| Zen4 | 2 | 4.17 ± 0.00 | 4.38 ± 0.01 | 1.050 |
| | 4 | 7.82 ± 0.05 | 8.14 ± 0.01 | 1.041 |
| | 8 | 14.29 ± 0.02 | 14.41 ± 0.02 | 1.008 |
| AVX2 | 2 | 1.98 ± 0.00 | 3.31 ± 0.00 | 1.672 |
| | 4 | 3.87 ± 0.00 | 6.49 ± 0.00 | 1.677 |
| | 8 | 7.13 ± 0.01 | 11.63 ± 0.02 | 1.631 |
| | 16 | 12.97 ± 0.00 | 15.81 ± 0.00 | 1.219 |
@@ -0,0 +1,13 @@
### 🔀 [#163](https://github.com/ikawrakow/ik_llama.cpp/pull/163) - q4_0_r4: Use AVX2 version for matrix x vector

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-12-23 |
| **Updated** | 2024-12-23 |

---

#### Description

Performance is better. Packing quants into 512-bit registers is costly, and when we have just 1 column to multiply, using the `AVX512` version becomes slower. I had already done this for most (all?) other quants, but somehow missed `Q4_0`.
@@ -0,0 +1,78 @@
### 🔀 [#168](https://github.com/ikawrakow/ik_llama.cpp/pull/168) - Falcon3 changes

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-10 |
| **Updated** | 2025-01-10 |

---

#### Description

Two changes:
* Add pre-tokenizer for `Falcon3` (same as `llama3`)
* Use integer arithmetic to perform the summation of a row of activations for `Q8_K16`

The second change is required for the `IQ2_BN_R4` 4-row interleaved variant. The existing implementation just sums up the `f32` values. This is fine with the original BitNet models and also with the TriLM ternary models, but with the Falcon3 ternary models I observe too large of a difference between the GPU and the CPU perplexity result. With this change the difference is greatly reduced and `IQ2_BN_R4` actually arrives at a slightly lower PPL than Microsoft's BitNet implementation (which is claimed to be "lossless").
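
A scalar sketch of the integer-summation idea, using a hypothetical 16-quant block layout (the real `Q8_K16` type and its SIMD implementation differ):

```c++
#include <cstdint>

// Hypothetical block: 16 int8 quants with one fp32 scale.
struct BlockQ8x16 {
    float  d;
    int8_t qs[16];
};

// Row sum of the activations: accumulate the quants exactly in int32 and
// apply the scale once per block, instead of summing dequantized f32 values.
float row_sum(const BlockQ8x16* blocks, int nblocks) {
    float sum = 0.0f;
    for (int ib = 0; ib < nblocks; ++ib) {
        int32_t isum = 0;
        for (int j = 0; j < 16; ++j) isum += blocks[ib].qs[j];
        sum += blocks[ib].d * isum;
    }
    return sum;
}
```
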
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-10** at **12:56:49**:<br>
|
||||
|
||||
Oh, here some performance figures for `IQ2_BN` and Microsoft's [Bitnet](https://github.com/microsoft/BitNet) `I2_S` quants, which claim to be the fastest CPU implementation of ternary transformer models. Tests run on a Ryzen-7950X CPU.
|
||||
|
||||
After following the Bitnet instructions:
|
||||
```
|
||||
git clone --recursive https://github.com/microsoft/BitNet.git
|
||||
cd BitNet
|
||||
conda create -n bitnet-cpp python=3.9
|
||||
conda activate bitnet-cpp
|
||||
pip install -r requirements.txt
|
||||
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s
|
||||
```
|
||||
I'm finding that their `e2e_benchmark.py` Python script is not really working. Or, more precisely, it is working but giving a dismal performance. With
|
||||
```
|
||||
python3 utils/e2e_benchmark.py -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -n 0 -p 512 -t 16
|
||||
```
|
||||
I get this:
|
||||
| model | size | params | backend | threads | n_batch | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
|
||||
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 16 | 1 | pp512 | 22.15 ± 0.07 |
|
||||
|
||||
Hahaha. 22 t/s for PP-512? Fortunately for us, BitNet is just a thin wrapper around `llama.cpp`, so we can run the `llama-bench` tool, which the `e2e_benchmark.py ` uses under the hood, directly:
|
||||
```
|
||||
./build/bin/llama-bench -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -p 512 -n 128 -t 16
|
||||
```
|
||||
and we get
|
||||
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 16 | pp512 | 187.90 ± 0.99 |
|
||||
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 8 | tg128 | 23.39 ± 0.05 |
|
||||
|
||||
In comparison, here is what we get with `IQ2_BN` (using `-rtr 1` to interleave 4 rows when loading the model):
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
|
||||
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 16 | pp512 | 465.85 ± 1.91 |
|
||||
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 8 | tg128 | 28.03 ± 0.04 |
|
||||
|
||||
So, 2.5X for PP-512, and ~20% better for TG-128 (both achieve maximum performance at 8 threads). TG-128 is of course memory bound and the BitNet authors make claims about energy efficiency, so let's look at TG with fewer threads:
|
||||
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 1 | tg128 | 9.64 ± 0.05 |
|
||||
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 2 | tg128 | 15.45 ± 0.04 |
|
||||
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 4 | tg128 | 22.21 ± 0.20 |
|
||||
|
||||
vs
|
||||
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
|
||||
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 1 | tg128 | 12.83 ± 0.24 |
|
||||
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 2 | tg128 | 22.46 ± 0.03 |
|
||||
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 4 | tg128 | 27.62 ± 0.05 |
|
||||
|
||||
OK. Now I can claim that `IQ2_BN` is almost 4X more energy efficient than BitNet as we get (almost) the same performance at 2 threads as their maximum performance at 8 threads.
@@ -0,0 +1,31 @@
### 🔀 [#169](https://github.com/ikawrakow/ik_llama.cpp/pull/169) - Be able to re-quantize MS BitNet I2_S models

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-10 |
| **Updated** | 2025-01-10 |

---

#### Description

Closes #167

I also saw requests for `Falcon3-10B-1.58b` being made in the mainline `llama.cpp` and `llamafile` repositories, so decided to add the ability to use this model with `ik_llama.cpp`.

1. Get a ternary model in Microsoft's `I2_S` format. E.g., for `Falcon3-10B-1.58b`
```
huggingface-cli download tiiuae/Falcon3-10B-Instruct-1.58bit-GGUF
```

2. Re-quantize to one of the ternary quantization types in this repository. E.g.,
```
./bin/llama-quantize --allow-requantize path_to_model/ggml-model-i2_s.gguf output.gguf iq2_bn
```

Works on the CPU **and** GPU (CUDA or Metal)

Enjoy!

I see perplexity is quite high (higher than the Falcon3 7B Instruct ternary model), so not sure how useful this model is in practice.
@@ -0,0 +1,15 @@
### 🔀 [#17](https://github.com/ikawrakow/ik_llama.cpp/pull/17) - Merge mainline - Aug 12 2024

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-08-12 |
| **Updated** | 2024-08-12 |

---

#### Description

Mainly for the LLaMA-3.1 RoPE related changes, not much else of interest.

Mainline commit hash: 4134999e01f31256b15342b41c4de9e2477c4a6c
@@ -0,0 +1,17 @@
### 🐛 [#170](https://github.com/ikawrakow/ik_llama.cpp/pull/170) - MoE fix for R4 quants

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-12 |
| **Updated** | 2025-01-12 |

---

#### Description

This PR adds two fixes:
* Make sure the number of tensor rows being processed by one thread is a multiple of the number of interleaved rows when using `R4` quants also in `iqk_mul_mat_mow` (see the sketch after this list)
* Fix logic when we have a matrix multiplication kernel that processes 16 columns of the right matrix per kernel call (introduced in 907cde6be). The bug shows up when the number of columns in the right matrix is greater than 16 (so this kernel gets used), and the number of columns is not divisible by 16 (so there are leftover columns to be processed), so it did not get caught by the usual `TG-128` and `PP-512` testing.
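
A minimal sketch of the first fix, using a hypothetical `thread_row_range` helper: each thread's row range is rounded to a multiple of the interleave factor so no thread starts inside a 4-row group. The actual partitioning code in the repository is organized differently.

```c++
#include <algorithm>

// Assign thread `ith` of `nthreads` a [first, last) row range whose start is
// always a multiple of the interleave factor (4 for the _R4 types).
void thread_row_range(int nrows, int nthreads, int ith, int interleave,
                      int& first, int& last) {
    const int ngroups    = (nrows + interleave - 1) / interleave;   // 4-row groups
    const int per_thread = (ngroups + nthreads - 1) / nthreads;     // groups per thread
    first = std::min(nrows,  ith      * per_thread * interleave);
    last  = std::min(nrows, (ith + 1) * per_thread * interleave);
}
```
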

If quantized to `R4` quants, MoE models now work. But if run-time-repacking is used (`-rtr` command line option) to repack non-`R4` quants to `R4`, something goes wrong for MoE models that I'm not able to figure out. It is really bizarre because in the former case (quantize directly into `R4`) four rows are quantized to the corresponding non-`R4` quant in a temporary buffer and then repacked to `R4`. In the latter case, 4 rows are copied into a temporary buffer and then repacked, storing the repacked data into the memory from where the data was copied. The exact same repacking function is used in both cases, so I don't see how `rtr` can fail. What is even more bizarre is that `rtr` always works for non-MoE models, and also works for some quantization types for MoE models.
@@ -0,0 +1,19 @@
### 🐛 [#171](https://github.com/ikawrakow/ik_llama.cpp/pull/171) - Fix lower FA performance for even batch sizes

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-12 |
| **Updated** | 2025-01-12 |

---

#### Description

This PR fixes the lower performance for even batch sizes reported in #164. The graph shows a t/s comparison between the main branch and this PR using
```
./bin/llama-batched-bench -m some_model.gguf -pps -t 16 -npp 256 -ntg 128 -npl 1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16 -c 4096 -rtr -fa
```
for LLaMA-3.1-8B-Instruct quantized with `IQ4_XS` on a Ryzen-7950X CPU. We see the strange zig-zag behavior with FA enabled is no longer there. For fun I have also added the latest `llama.cpp` performance for this model on this CPU (`llama.cpp` build: `4465 (9a483999)`). The performance difference for a batch size of 16 is a factor of 2.7X.


@@ -0,0 +1,38 @@
### 🔀 [#172](https://github.com/ikawrakow/ik_llama.cpp/pull/172) - CPU Flash Attention improvements

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-15 |
| **Updated** | 2025-01-15 |

---

#### Description

This PR
* Improves FA CPU performance for long contexts
* Fixes K-cache quantized to `Q8_0` when not using FA. This was broken because online `Q8_0` quantization packed quants into blocks of 128 (`block_q8_0_x4`), so `K*Q` became garbage when using `Q8_0` quantized K-cache without FA.

FA performance improvements are for `AVX2/Zen4`. The following table shows `PP-512` comparison between the main branch and this PR with FA using `bf16` or `Q8_0` for KV cache. Model is LLaMA-3.1-8B quantized to `IQ4_XS` and run-time-repacked to `IQ4_XS_R4`. The CPU is Ryzen 7950X. When the quoted uncertainty in the table is zero, I have run just a single repetition in `llama-bench` (it takes quite a while to process 16k or even 32k tokens)

| type_k | type_v | fa | rtr | test | t/s (main) | t/s (pr) | Speedup |
| -----: | -----: | -: | --: | ------------: | ---------------: | ---------------: | ------: |
| bf16 | bf16 | 1 | 1 | pp128 | 275.27 ± 1.63 | 278.40 ± 1.60 | 1.011 |
| bf16 | bf16 | 1 | 1 | pp256 | 276.16 ± 3.46 | 283.51 ± 1.22 | 1.027 |
| bf16 | bf16 | 1 | 1 | pp512 | 274.71 ± 0.51 | 276.83 ± 0.36 | 1.008 |
| bf16 | bf16 | 1 | 1 | pp1024 | 265.81 ± 1.65 | 270.05 ± 0.41 | 1.016 |
| bf16 | bf16 | 1 | 1 | pp2048 | 256.95 ± 0.39 | 260.11 ± 0.14 | 1.012 |
| bf16 | bf16 | 1 | 1 | pp4096 | 237.97 ± 0.37 | 242.29 ± 0.75 | 1.018 |
| bf16 | bf16 | 1 | 1 | pp8192 | 206.34 ± 1.25 | 213.98 ± 0.35 | 1.037 |
| bf16 | bf16 | 1 | 1 | pp16384 | 156.40 ± 0.00 | 173.44 ± 0.00 | 1.109 |
| bf16 | bf16 | 1 | 1 | pp32768 | 82.97 ± 0.00 | 122.47 ± 0.00 | 1.476 |
| q8_0 | q8_0 | 1 | 1 | pp128 | 273.44 ± 1.04 | 279.27 ± 1.43 | 1.021 |
| q8_0 | q8_0 | 1 | 1 | pp256 | 278.57 ± 1.03 | 283.00 ± 0.63 | 1.016 |
| q8_0 | q8_0 | 1 | 1 | pp512 | 271.56 ± 0.05 | 275.97 ± 0.79 | 1.016 |
| q8_0 | q8_0 | 1 | 1 | pp1024 | 264.31 ± 0.89 | 269.35 ± 0.33 | 1.019 |
| q8_0 | q8_0 | 1 | 1 | pp2048 | 253.70 ± 0.24 | 258.22 ± 0.36 | 1.018 |
| q8_0 | q8_0 | 1 | 1 | pp4096 | 232.07 ± 0.88 | 236.83 ± 1.38 | 1.021 |
| q8_0 | q8_0 | 1 | 1 | pp8192 | 199.90 ± 1.37 | 204.74 ± 0.34 | 1.024 |
| q8_0 | q8_0 | 1 | 1 | pp16384 | 153.62 ± 0.00 | 164.50 ± 0.00 | 1.071 |
| q8_0 | q8_0 | 1 | 1 | pp32768 | 103.48 ± 0.00 | 113.35 ± 0.00 | 1.095 |
@@ -0,0 +1,31 @@
### 🔀 [#173](https://github.com/ikawrakow/ik_llama.cpp/pull/173) - More Flash Attention improvements

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-19 |
| **Updated** | 2025-01-20 |

---

#### Description

This PR further improves the Flash Attention implementation as follows:
* Slightly faster `V * softmax(K * Q)` implementation. This benefits all V-cache types
* Faster implementation when the K-cache is quantized with `Q8_0` via run-time-repacking to `Q8_0_R4`.

The following graph shows prompt processing speed as a function of prompt length for LLaMA-3.1-8B quantized with `IQ4_XS` on a Ryzen-7950X CPU. The PR results are shown with black (`BF16` KV-cache) and red (`Q8_0` KV-cache) triangles; circles are used for the main branch. I have reused the graph from the last post in #25 by just adding the results for this PR, so mainline `llama.cpp` performance is shown as well. I'm particularly pleased with the fact that `Q8_0` KV-cache is now on par with or even slightly better than the natively supported 16-bit float type, as `Q8_0` quantized KV-cache is basically lossless while reducing required memory by 2X.

For reference, with a `Q8_K_R8`-quantized model we achieve 380 t/s for 512 tokens, and 150 t/s for 32k tokens.



---

#### 💬 Conversation

👤 **ikawrakow** commented the **2025-01-20** at **06:57:27**:<br>

Here is the performance relative to a GPU (RTX-4080) for the above graph. We observe the ratio now decreasing with increasing prompt length $\Rightarrow$ the utilization of available FLOPs in the FA implementation is now better on the CPU compared to the GPU.
@@ -0,0 +1,15 @@
### 🔀 [#174](https://github.com/ikawrakow/ik_llama.cpp/pull/174) - On Zen4 repack fp16 models to bf16_r16

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-21 |
| **Updated** | 2025-01-21 |

---

#### Description

...when run-time-repacking is requested via `-rtr`

This massively improves performance. As this is opt-in, we do not worry about possible precision loss in the `f16 -> bf16` conversion.
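
For reference, a minimal sketch of the final conversion step, assuming the `fp16` weight has already been widened to `fp32` (as `ggml` does); the 16-row interleaving of the actual `bf16_r16` layout is omitted here.

```c++
#include <cstdint>
#include <cstring>

// fp32 -> bf16 with round-to-nearest-even; NaN is kept quiet instead of being
// rounded into an infinity.
static uint16_t fp32_to_bf16(float f) {
    uint32_t u;
    std::memcpy(&u, &f, sizeof u);
    if ((u & 0x7fffffff) > 0x7f800000) return static_cast<uint16_t>((u >> 16) | 64);
    u += 0x7fff + ((u >> 16) & 1);   // round to nearest, ties to even
    return static_cast<uint16_t>(u >> 16);
}
```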
@@ -0,0 +1,21 @@
### 🔀 [#175](https://github.com/ikawrakow/ik_llama.cpp/pull/175) - Better BF16 support on AVX2

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-22 |
| **Updated** | 2025-01-22 |

---

#### Description

On the main branch `bf16` models are computed via `ggml`, which results in a horrible performance. This PR adds much better `GEMM` and `GEMV` for `bf16 x fp32`. The table shows a performance comparison between the main branch and this PR for LLaMA-3.1-8B-Instruct on a Ryzen-5975WX CPU

| model | size | params | threads | test | t/s (main) | t/s (PR) | Speedup |
| ------------- | ---------: | ---------: | ------: | --------: | ---------------: | ------------: | -------: |
| llama 8B BF16 | 14.96 GiB | 8.03 B | 32 | pp512 | 47.17 ± 0.04 | 152.80 ± 0.12 | 3.239 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | 1 | tg128 | 1.37 ± 0.00 | 2.06 ± 0.00 | 1.504 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | 2 | tg128 | 2.53 ± 0.00 | 3.21 ± 0.00 | 1.269 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | 4 | tg128 | 3.19 ± 0.00 | 3.64 ± 0.00 | 1.141 |
| llama 8B BF16 | 14.96 GiB | 8.03 B | 8 | tg128 | 3.39 ± 0.00 | 3.64 ± 0.00 | 1.074 |
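
A minimal sketch of the core trick such a `bf16 x fp32` kernel can use on `AVX2`: widen eight `bf16` weights to `fp32` with a 16-bit left shift and accumulate with FMA. This is an illustrative dot product (compile with `-mavx2 -mfma`), not the blocked GEMM/GEMV code from this PR.

```c++
#include <immintrin.h>
#include <cstdint>

// Widen 8 bf16 values (stored as uint16_t) to 8 fp32 values: bf16 is the top
// 16 bits of the corresponding fp32, so a left shift by 16 is enough.
static inline __m256 bf16_to_f32(const uint16_t* x) {
    __m128i h = _mm_loadu_si128(reinterpret_cast<const __m128i*>(x));
    return _mm256_castsi256_ps(_mm256_slli_epi32(_mm256_cvtepu16_epi32(h), 16));
}

// Dot product of a bf16 weight row with an fp32 activation row; n % 8 == 0 assumed.
float dot_bf16_f32(const uint16_t* w, const float* x, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8)
        acc = _mm256_fmadd_ps(bf16_to_f32(w + i), _mm256_loadu_ps(x + i), acc);
    __m128 s = _mm_add_ps(_mm256_castps256_ps128(acc), _mm256_extractf128_ps(acc, 1));
    s = _mm_add_ps(s, _mm_movehl_ps(s, s));
    s = _mm_add_ss(s, _mm_movehdup_ps(s));
    return _mm_cvtss_f32(s);
}
```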
@@ -0,0 +1,59 @@
### 🔀 [#176](https://github.com/ikawrakow/ik_llama.cpp/pull/176) - Deepseek V3 support added
|
||||
|
||||
| **Author** | `saood06` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-01-23 |
|
||||
| **Updated** | 2025-01-23 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Very direct port of https://github.com/ggerganov/llama.cpp/pull/11049.
|
||||
|
||||
Tested working with IQ4_K_R4 and IQ4_K. No tests so far on any quant that llama.cpp also supports, so performance cannot be compared with mainline yet.
|
||||
|
||||
Tested on dual socket Xeon E5-2690 v3
|
||||
Prompt processing: 11.5 t/s for IQ4_K, 9.8 t/s for IQ4_K_R4
|
||||
Token generation: 2.75 t/s for IQ4_K, 3.10 t/s for IQ4_K_R4
|
||||
|
||||
- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
|
||||
- Self-reported review complexity:
|
||||
- [X] Low
|
||||
- [ ] Medium
|
||||
- [ ] High
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** submitted a review the **2025-01-23** at **16:09:41**: ✅ `APPROVED`
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-23** at **17:00:50**:<br>
|
||||
|
||||
@saood06
|
||||
|
||||
Quick question: current `llama.cpp` has this check for Deepseek-V3:
|
||||
```c++
|
||||
} else if (tmpl_contains(LU8("<|Assistant|>")) && tmpl_contains(LU8("<|User|>")) && tmpl_contains(LU8("<|end▁of▁sentence|>"))) {
|
||||
return LLM_CHAT_TEMPLATE_DEEPSEEK_3;
|
||||
```
|
||||
while the check you added with this PR is
|
||||
```c++
|
||||
else if (tmpl == "deepseek3" || tmpl_contains(LU8("'<|Assistant|>' + message['content'] + '<|end▁of▁sentence|>'"))) {
|
||||
```
|
||||
The check for `tmpl == "deepseek3"` is done before in `llama.cpp`, so this is not an issue, but the remainder is not the same. Is this a problem? Or would it be a problem if I just made it the same as `llama.cpp` ?
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-23** at **18:00:03**:<br>
|
||||
|
||||
The change you are referencing happened in https://github.com/ggerganov/llama.cpp/commit/ec7f3ac9ab33e46b136eb5ab6a76c4d81f57c7f1 I was not aware of that till now.
|
||||
|
||||
|
||||
>Is this a problem? Or would it be a problem if I just made it the same as llama.cpp ?
|
||||
|
||||
You can change it if you want but both work, based on the chat_templates for the models that have been released.
@@ -0,0 +1,13 @@
### 🔀 [#177](https://github.com/ikawrakow/ik_llama.cpp/pull/177) - Update chat templates

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-23 |
| **Updated** | 2025-01-24 |

---

#### Description

Basically sync with `llama.cpp`
@@ -0,0 +1,182 @@
|
||||
### 🔀 [#178](https://github.com/ikawrakow/ik_llama.cpp/pull/178) - Interleave 8 rows (Q8_0, IQ4_XS)
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-01-26 |
|
||||
| **Updated** | 2025-01-31 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
One can get better performance on `AVX2/Zen4` by interleaving 8 instead of 4 rows. I did not do it earlier because in my previous attempts performance on `ARM` suffered significantly. But in this PR I found an `ARM_NEON` implementation for 8 interleaved rows for `Q8_0` and `IQ4_XS` that is not slower or is even slightly faster than 4 interleaved rows.
|
||||
|
||||
Run-time-repacking from `Q8_0/IQ4_XS` will of course work, but models quantized to `Q8_0_R4` or `IQ4_XS_R4` will stop working, so putting it out there for testing and feedback.
|
||||
|
||||
I did not rename the types to `_R8` yet but will in case this gets merged.
|
||||
|
||||
Below is a graph showing prompt processing (a.k.a. prefill) performance for LLaMA-3.1-8B quantized with `IQ4_XS` on a Ryzen-7950X CPU. The cyan symbols are the results with this PR. We now get over 300 t/s for prompts less than 1000 tokens.
|
||||
|
||||

|
||||
|
||||
@saood06 Can you test if this improves `IQ4_XS_R4` performance on your system?
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **saood06** commented the **2025-01-26** at **17:03:11**:<br>
|
||||
|
||||
@ikawrakow
|
||||
|
||||
Tested on my Xeon E5-2683 v4 machine via llama-bench.
|
||||
|
||||
| model | size | params | fa | rtr | test | master t/s | PR t/s |
|
||||
| ------------------------------ | ---------: | ---------- | -: | --: | ------------: | ---------------: | ---------------: |
|
||||
| llama 70B IQ4_XS - 4.25 bpw | 34.30 GiB | 68.98 B | 1 | 1 | pp512 | 7.00 | 7.10 |
|
||||
|
||||
|
||||
If you want me to test on my other machine (dual socket Xeon E5-2690 v3) or other models let me know.
|
||||
|
||||
Also any chance you can sync the RPC code (mostly care about #11047 and to a lesser degree #9389 and #11424/#9296), if not I'll do it when I have some free time and submit a PR.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-27** at **13:06:04**:<br>
|
||||
|
||||
Testing the batch performance difference showing the peak numbers
|
||||
|
||||
|
||||
IQ4_XS_R8:
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 128 | 14 | 1920 | 18.944 | 6.76 | 272.880 | 6.57 | 291.824 | 6.58 |
|
||||
|
||||
IQ4_XS_R4:
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 128 | 11 | 1536 | 19.367 | 6.61 | 220.288 | 6.39 | 239.655 | 6.41 |
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-27** at **13:28:46**:<br>
|
||||
|
||||
So, it looks like a small (~2%) improvement. OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is `IQ4_XS_R4`.)
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-27** at **14:12:11**:<br>
|
||||
|
||||
> So, it looks like a small (~2%) improvement.
|
||||
|
||||
Yes, it is an improvement, (there is an edge case where R4 was better and that was at batch size 4).
|
||||
|
||||
>OK to merge? (IIRC, you had this giant R1 model that will become useless after the merge if it is `IQ4_XS_R4`.
|
||||
|
||||
Yes, it is okay to merge. That model is an IQ4_K_R4 (and IQ4_K), not IQ4_XS, as I prefer your quants over the mainline ones. Which is why I didn't have comparison data for it to mainline.
|
||||
|
||||
On the note of the R1 quant this PR [llama.cpp/pull/11446](https://github.com/ggerganov/llama.cpp/pull/11446) will make me reconvert anyway, I want to use it and also it is easy to grab it now before the KV refactor it is waiting for to implement MLA KV cache. I was going to bring that up anyway in the Deepseek PR because it is a change to the GGUF for Deepseek.
|
||||
|
||||
#11397 is also showing significant improvements to Deepseek.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-27** at **15:41:40**:<br>
|
||||
|
||||
> On the note of R1, this PR 11446 will make me reconvert anyway
|
||||
|
||||
What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-01-27** at **19:42:36**:<br>
|
||||
|
||||
> > On the note of R1, this PR 11446 will make me reconvert anyway
|
||||
>
|
||||
> What is being measured in the graph in this PR? It says "Token generation rate", but what tool is being used?
|
||||
|
||||
That would be my modified llama-bench from this PR: https://github.com/ggerganov/llama.cpp/pull/11126
|
||||
It allows to measure token generation rate after processing a prompt of given size.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-28** at **14:06:19**:<br>
|
||||
|
||||
@fairydreaming Thanks for the clarification.
|
||||
|
||||
I played a bit with your PR 11466. TG after a long prompt looks great compared to `llama.cpp`, but it seems this comes at the expense of a much reduced prompt processing speed? Here is what I get on my Ryzen-7950X
|
||||
|
||||
* **llama.cpp**
|
||||
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp256 | 150.29 ± 0.31 |
|
||||
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp512 | 153.23 ± 0.13 |
|
||||
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp1024 | 149.27 ± 0.22 |
|
||||
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp4096 | 133.74 ± 0.20 |
|
||||
| deepseek2 16B F16 | 29.26 GiB | 15.71 B | CPU | 16 | pp8192 | 117.74 ± 0.03 |
|
||||
|
||||
* **PR 11466**
|
||||
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp256 | 142.08 ± 0.27 |
|
||||
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp512 | 140.53 ± 0.03 |
|
||||
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp1024 | 133.17 ± 0.12 |
|
||||
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp4096 | 101.17 ± 0.10 |
|
||||
| deepseek2 16B F16 | 29.37 GiB | 15.76 B | CPU | 16 | pp8192 | 77.08 ± 0.08 |
|
||||
|
||||
(I did not have the patience to wait for the 16k tokens benchmark to finish).
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-01-28** at **14:12:33**:<br>
|
||||
|
||||
@ikawrakow Yup, I noticed this. I'm planning to reorganize tensor dimensions for the prompt processing in the PR, hopefully this will fix the issue.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-29** at **09:03:52**:<br>
|
||||
|
||||
@fairydreaming
|
||||
> It allows to measure token generation rate after processing a prompt of given size.
|
||||
|
||||
Can't this be done already with batched-bench by setting a batch size of 1, and it has the benefit of showing PP speed as well.
|
||||
|
||||
>it helped, but only a bit (pp rate is 6-8% higher with these changes), it's still slower than the original implementation.
|
||||
|
||||
Can you push that change? For my use cases the TG benefits outweigh the loss in PP, I'll try looking into the performance as well.
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-01-29** at **10:09:22**:<br>
|
||||
|
||||
@saood06
|
||||
|
||||
> @fairydreaming
|
||||
>
|
||||
> > It allows to measure token generation rate after processing a prompt of given size.
|
||||
>
|
||||
> Can't this be done already with batched-bench by setting a batch size of 1, and it has the benefit of showing PP speed as well.
|
||||
|
||||
That is correct.
|
||||
|
||||
> > it helped, but only a bit (pp rate is 6-8% higher with these changes), it's still slower than the original implementation.
|
||||
>
|
||||
> Can you push that change? For my use cases the TG benefits outweigh the loss in PP, I'll try looking into the performance as well.
|
||||
|
||||
Pushed.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-30** at **19:32:55**:<br>
|
||||
|
||||
@ikawrakow
|
||||
>I did not rename the types to _R8 yet but will in case this gets merged.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-31** at **06:31:03**:<br>
|
||||
|
||||
Will do when I come back from FOSDEM.
@@ -0,0 +1,31 @@
### 🔀 [#179](https://github.com/ikawrakow/ik_llama.cpp/pull/179) - Minor performance improvements

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-27 |
| **Updated** | 2025-01-27 |

---

#### Description

This PR does two things
1. It changes `Q4_0_R4` to 8 interleaved rows
2. It adds the ability to apply platform specific transformations of the tensor data while repacking

Examples for the usage of 2. (a sketch follows after the list):
* On `ARM_NEON` it is useful to apply a `XOR` operation with a mask `0x88` to `Q4_0` quants. In this way one does not need to subtract `8` during run time. This tweak improves `Q4_0` PP performance by nearly 5% on my M2-Max CPU. This is absolutely not useful on `AVX2/Zen4`, so this becomes a platform specific transformation when run-time-repacking on an `ARM_NEON` CPU.
* On `Zen4` one can add `128` to the signed `Q8` quants to make them unsigned (so they can be used directly in `_mmXXX_dpbusd_epi32()`). This improves `Q8_0` and `Q8_K_R8` performance by about 3%. The transformation is not useful on `ARM_NEON` (one needs signed `int8_t`'s) or vanilla `AVX2` (the `_mm256_maddubs_epi16` dot product may overflow), so it only gets applied when repacking on `Zen4`.
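
A scalar illustration of the two transformations (the identities only; the actual kernels use NEON/AVX-512 intrinsics and different data layouts):

```c++
#include <cstdint>

// ARM_NEON transform: XOR each Q4_0 nibble with 0x8 (0x88 for a byte holding
// two nibbles). The stored nibble then encodes q - 8 in 4-bit two's complement,
// so the kernel no longer subtracts 8 at run time.
inline int nibble_as_signed(uint8_t q) {      // q in 0..15 encodes q - 8
    const int v = q ^ 0x8;                    // value stored after repacking
    return v >= 8 ? v - 16 : v;               // read back as signed 4-bit == q - 8
}

// Zen4 transform: store u = q + 128 so the quants are unsigned and can go
// straight into the dpbusd dot product; the extra 128*sum(y) is removed once
// per block using the activation block sum the Q8 format already carries.
inline int32_t dot_with_offset(const uint8_t* u, const int8_t* y, int n, int32_t ysum) {
    int32_t acc = 0;
    for (int i = 0; i < n; ++i) acc += int32_t(u[i]) * int32_t(y[i]);  // what dpbusd accumulates
    return acc - 128 * ysum;                                           // == sum of q[i]*y[i]
}
```
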
The table shows some comparisons for `PP-512` LLaMA-3.1-8B for the affected quantization types using Flash Attention and `Q8_0` KV-cache.

| model | backend | test | t/s (main) | t/s (PR) | Speedup |
| ---------------- | ---------- | ------------: | ---------------: | -------------: | -------: |
| llama 8B Q4_0 | NEON | pp512 | 130.92 ± 0.10 | 137.39 ± 0.32 | 1.049 |
| llama 8B Q8_K_R8 | Zen4 | pp512 | 380.75 ± 1.52 | 390.40 ± 0.88 | 1.025 |
| llama 8B Q8_0 | Zen4 | pp512 | 295.62 ± 0.80 | 307.80 ± 0.34 | 1.041 |
| llama 8B Q4_0 | Zen4 | pp512 | 281.38 ± 0.73 | 294.43 ± 0.68 | 1.046 |
| llama 8B Q4_0 | AVX2 | pp512 | 302.61 ± 0.29 | 316.23 ± 0.31 | 1.045 |

I really wanted to hit 400 t/s for `Q8_K_R8`, but that will have to wait for another day.
@@ -0,0 +1,340 @@
### 🔀 [#180](https://github.com/ikawrakow/ik_llama.cpp/pull/180) - Deepseek MLA Optimizations
|
||||
|
||||
| **Author** | `saood06` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-01-29 |
|
||||
| **Updated** | 2025-02-10 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Very direct port of https://github.com/ggerganov/llama.cpp/pull/11446
|
||||
|
||||
Tested working with Q4_K_S on dual socket Xeon E5-2690 v3, performance compared with llama.cpp below.
|
||||
| model | size | params | test | llama.cpp t/s | ik_llama.cpp t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ------------: | ---------------: | ---------------: |
|
||||
| deepseek2 671B Q4_K - Small | 355.33 GiB | 672.05 B | pp512 | 7.63 | 8.53 |
|
||||
| deepseek2 671B Q4_K - Small | 355.33 GiB | 672.05 B | tg128 | 2.74 | 3.11 |
|
||||
|
||||
Tests in: https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2624940338
|
||||
|
||||
This PR also contains things I missed in my last PR in the convert_hf_to_gguf.py.
|
||||
|
||||
@ikawrakow
|
||||
Is there any chance to convert old imatrix files (such as [this](https://huggingface.co/mradermacher/DeepSeek-R1-i1-GGUF/blob/main/imatrix.dat)) so that they include the components you get from splitting kv_b? I'm not sure how impactful missing them would be as right now it obviously prints "did not find weights for attn_k_b.weight/attn_v_b.weight". I do not have the capability to generate new imatrix.dat files, and it would be nice if it wasn't needed as it is quite resource intensive to do.
|
||||
|
||||
|
||||
- Self-reported review complexity:
|
||||
- [X] Low
|
||||
- [ ] Medium
|
||||
- [ ] High
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-29** at **09:16:02**:<br>
|
||||
|
||||
Here is how much time is being spent in the various matrix multiplications in the attention part when processing a prompt of 8192 tokens:
|
||||
|
||||
| result tensor | time (s) |
|
||||
| ------------: | ---------: |
|
||||
| kq | 4.116 |
|
||||
| kqv | 2.372 |
|
||||
| kqv_out | 0.458 |
|
||||
| kv | 0.253 |
|
||||
| kv_pe_compresseed | 0.219 |
|
||||
| q | 0.687 |
|
||||
| total | 8.107 |
|
||||
|
||||
And here is with this PR:
|
||||
|
||||
| result tensor | time (s) |
|
||||
| ------------: | ---------: |
|
||||
| kq_nope | 8.343 |
|
||||
| kq_pe | 2.495 |
|
||||
| kqv | 0.401 |
|
||||
| kqv_compressed | 7.120 |
|
||||
| kqv_out | 0.473 |
|
||||
| kv_pe_compresseed | 0.224 |
|
||||
| q | 0.693 |
|
||||
| q_nope2 | 0.240 |
|
||||
| total | 19.989 |
|
||||
|
||||
I.e., attention is 2.5X slower with the PR. In addition, I'm finding that on the main branch `0.114` seconds are spent in `GGML_OP_ADD` operations, and `0.194` seconds for `GGML_OP_CONT`. In this PR `3.320` seconds go into `GGML_OP_ADD`, and `2.701` seconds into `GGML_OP_CONT` (basically making copies). For reference, total processing time is `27.73` seconds on main and `45.47` seconds with the PR.
|
||||
|
||||
Maybe this can be useful when trying to optimize.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-29** at **09:28:49**:<br>
|
||||
|
||||
>This hurts prompt processing (a.k.a prefill) speed very significantly.
|
||||
>[...]
|
||||
>I think we need to either try to understand why the attention part is so much slower when processing batches of tokens and fix it, or simply wait for @fairydreaming to fix their PR.
|
||||
|
||||
Changed to draft. PP does seem to have regressions, I'll have direct comparisons against old version soon, generating an iq4_k_r4 quant now (PP in main for me was 11.5 t/s for iq4_k and 9.8 t/s for iq4_k_r4 at pp512, 9.22 t/s at PP1024 for IQ4_K).
|
||||
|
||||
>Maybe this can be useful when trying to optimize.
|
||||
|
||||
Thank you for the op time breakdown.
|
||||
|
||||
I was drawn in to this PR for the TG benefits, it should have also been a draft for the reason that it would mean GGUF's wouldn't be cross compatible, as this is also a draft in llama.cpp. I just want to have it here because it does optimize for a workload where TG dominates, and R1 as a reasoning model it often does.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-29** at **09:33:33**:<br>
|
||||
|
||||
@saood06 Perhaps a good way to move forward is to add an additional architecture (`deepseek-mla` or similar), but keep the original `deepseek2/3`. In this way, depending on use case, one can choose the improved TG speed after long prompts or the better PP speed when generating a few tokens after processing a long prompt.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-29** at **10:21:32**:<br>
|
||||
|
||||
>Perhaps a good way to move forward is to add an additional architecture (deepseek-mla or similar), but keep the original deepseek2/3. In this way, depending on use case, one can choose the improved TG speed after long prompts or the better PP speed when generating a few tokens after processing a long prompt.
|
||||
|
||||
I'll do that. I'll still leave it in a draft as I'm waiting to see how it progresses in llama.cpp, and for me to more thoroughly evaluate how it performs at long prompt lengths vs main.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-29** at **11:40:16**:<br>
|
||||
|
||||
So, as far as I can tell, the attention implementation in this PR leads to ~3X more multiply-adds (madds) when performing matrix multiplications. For prompt processing here we need `2 x 512 x 16 x n_token^2` madds, whereas the original implementation requires `(192 + 128) x 16 x n_token^2` madds. For TG, the PR still requires 3X more madds, namely `2 x 512 x 16 x n_prompt` madds here vs `(192 + 128) x 16 x n_prompt` on main. The only reason TG ends up being faster here is the shape of the tensors: On main it is 16 matrix multiplications each being `192 x n_prompt * 192 x 1` (`K*Q`) or `n_prompt x 128 * n_prompt x 1` (`V*softmax(K*Q)`). I.e., we have 16 GEMVs, which are 100% memory bound on modern CPU's. In this PR the TG shapes are `512 x n_prompt * 512 x 16` and `n_prompt x 512 * n_prompt x 16`, so real GEMMs with much higher FLOPs, so we end up needing less time despite doing more work. Hence, the way it is implemented, there is no way one can recover PP performance.
|
||||
|
||||
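
As a quick cross-check of the "~3X" figure implied by the counts above:

$$\frac{2 \cdot 512 \cdot 16}{(192 + 128) \cdot 16} = \frac{1024}{320} = 3.2$$
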
These figures are of course specific to the Deepseek2-Lite model. It may be different for a much larger model where rank-512 decomposition may really be "low-rank". It isn't for this model relative to the head sizes, number of heads, and hidden dimension.
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-01-29** at **12:49:35**:<br>
|
||||
|
||||
@ikawrakow I think applying the trick with "absorbing" matrices mentioned in the DeepSeek V2 paper shall fix this, I'm working on that.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-29** at **13:14:33**:<br>
|
||||
|
||||
@fairydreaming
|
||||
|
||||
Great!
|
||||
|
||||
Btw, I observe that `attn_kv_b.weight` is still present in the model. Is it needed, given that we now have `attn_k_b.weight` and `attn_v_b.weight` ?
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-01-30** at **11:23:08**:<br>
|
||||
|
||||
@ikawrakow Unfortunately the idea with speeding things up thanks to the matrix absorption is wrong: https://github.com/ggerganov/llama.cpp/pull/11446#issuecomment-2624177134
|
||||
|
||||
I'm not sure why they mentioned it in the DeepSeek paper.
|
||||
|
||||
Regarding other possible optimizations do you know how much work is needed to add support for multiplication of transposed matrices to ggml_mul_mat()? The problem is that I use kv cache for multiplication both directly and then in transposed form. I got around this problem by storing kv cache in both regular and transposed forms, but it doubles the amount of required memory.
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-01-30** at **12:39:37**:<br>
|
||||
|
||||
> @fairydreaming
|
||||
|
||||
> Out of curiosity, did you ever try this repository with your Epyc CPU?
|
||||
|
||||
Sure, I checked it a while ago (before the optimization work):
|
||||
|
||||
Regular llama.cpp:
|
||||
|
||||
```
|
||||
$ ./build/bin/llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
|
||||
| deepseek2 671B Q4_K - Small | 353.90 GiB | 671.03 B | CPU | 32 | pp512 | 26.08 ± 0.23 |
|
||||
| deepseek2 671B Q4_K - Small | 353.90 GiB | 671.03 B | CPU | 32 | tg128 | 9.57 ± 0.03 |
|
||||
```
|
||||
|
||||
ik_llama.cpp:
|
||||
|
||||
```
|
||||
$ ./llama-bench --numa distribute -t 32 -m /mnt/md0/models/deepseek-v3-Q4_K_S.gguf
|
||||
| model | size | params | backend | threads | test | t/s |
|
||||
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
|
||||
| deepseek2 671B Q4_K - Small | 353.90 GiB | 671.03 B | CPU | 32 | pp512 | 49.47 ± 0.11 |
|
||||
| deepseek2 671B Q4_K - Small | 353.90 GiB | 671.03 B | CPU | 32 | tg128 | 10.01 ± 0.09 |
|
||||
```
|
||||
|
||||
Generation was ~4.6% faster, while prompt processing was ~90% faster, impressive!
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-30** at **13:42:04**:<br>
|
||||
|
||||
10 t/s TG for Deepseek-R1 - wow!
|
||||
|
||||
PP should be ~50% faster now for `Q4_K_S`.
|
||||
|
||||
I'm playing with Deepseek-Lite and I'm finding that the CUDA performance is pretty bad - 3500 t/s for PP-512 and 142 t/s for TG-128 on an RTX-4080. This is for `IQ4_XS` fully offloaded to the GPU. On my Ryzen-7950X CPU I'm getting PP-512 = 525 t/s, TG-128 = 36 t/s. So, less than 7X slower for PP (normally the RTX-4080 is ~25X faster) and less than 4X slower for TG (despite the paltry 64 GB/s memory bandwidth for the Ryzen-7950X). So, I guess, your Epyc system wipes the floor with any GPU setup using partial GPU offload of Deepseek-R1.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-01-30** at **16:15:26**:<br>
|
||||
|
||||
I ran batched-bench at batch size 1 with TG at 32 at various PP to show PP performance and TG performance at different context lengths. Batched-bench numbers are noisy because they do not use repetitions like llama-bench does, and this model on this machine seems to have some variance, but all data is shown after dropping the caches and running the model until it is fully in the page cache.
|
||||
|
||||
IQ4_K_R4 with this PR:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 22.569 | 5.67 | 10.237 | 3.13 | 32.806 | 4.88 |
|
||||
| 256 | 32 | 1 | 288 | 38.648 | 6.62 | 10.699 | 2.99 | 49.347 | 5.84 |
|
||||
| 512 | 32 | 1 | 544 | 76.447 | 6.70 | 10.793 | 2.96 | 87.240 | 6.24 |
|
||||
| 1024 | 32 | 1 | 1056 | 144.100 | 7.11 | 10.788 | 2.97 | 154.888 | 6.82 |
|
||||
| 2048 | 32 | 1 | 2080 | 312.306 | 6.56 | 12.624 | 2.53 | 324.930 | 6.40 |
|
||||
| 4096 | 32 | 1 | 4128 | 745.760 | 5.49 | 12.929 | 2.48 | 758.688 | 5.44 |
|
||||
| 8192 | 32 | 1 | 8224 | 2023.859 | 4.05 | 16.017 | 2.00 | 2039.877 | 4.03 |
|
||||
|
||||
IQ4_K_R4 on main:
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 20.958 | 6.11 | 10.999 | 2.91 | 31.956 | 5.01 |
|
||||
| 256 | 32 | 1 | 288 | 38.777 | 6.60 | 11.780 | 2.72 | 50.558 | 5.70 |
|
||||
| 512 | 32 | 1 | 544 | 63.574 | 8.05 | 12.474 | 2.57 | 76.047 | 7.15 |
|
||||
| 1024 | 32 | 1 | 1056 | 118.630 | 8.63 | 14.462 | 2.21 | 133.092 | 7.93 |
|
||||
| 2048 | 32 | 1 | 2080 | 258.999 | 7.91 | 18.241 | 1.75 | 277.239 | 7.50 |
|
||||
| 4096 | 32 | 1 | 4128 | 574.593 | 7.13 | 26.023 | 1.23 | 600.616 | 6.87 |
|
||||
| 8192 | 32 | 1 | 8224 | 1391.722 | 5.89 | 43.056 | 0.74 | 1434.778 | 5.73 |
|
||||
|
||||
|
||||
Looking at the 8K context results, PP does drop from 5.89 to 4.05, but TG jumps from 0.74 to 2.00. At q8_0 (results below) PP again drops from 6.06 to 4.03, but TG benefits, going from 0.99 to 1.94. I would test/run this model at even higher context, but I would either need a smaller quant or to use RPC (for reference, the KV cache at n_ctx = 8224 is 40,233.55 MiB).
|
||||
|
||||
<details>
|
||||
<summary>Expand to see more runs with q8_0 and q6_0 K cache tested as well</summary>
|
||||
|
||||
PR with q6_0 K cache:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 14.948 | 8.56 | 10.498 | 3.05 | 25.446 | 6.29 |
|
||||
| 256 | 32 | 1 | 288 | 35.061 | 7.30 | 10.430 | 3.07 | 45.491 | 6.33 |
|
||||
| 512 | 32 | 1 | 544 | 69.842 | 7.33 | 10.936 | 2.93 | 80.778 | 6.73 |
|
||||
| 1024 | 32 | 1 | 1056 | 142.141 | 7.20 | 11.083 | 2.89 | 153.224 | 6.89 |
|
||||
| 2048 | 32 | 1 | 2080 | 313.431 | 6.53 | 11.415 | 2.80 | 324.846 | 6.40 |
|
||||
| 4096 | 32 | 1 | 4128 | 763.385 | 5.37 | 12.964 | 2.47 | 776.349 | 5.32 |
|
||||
| 8192 | 32 | 1 | 8224 | 2076.578 | 3.94 | 16.371 | 1.95 | 2092.948 | 3.93 |
|
||||
|
||||
|
||||
PR with q8_0 K cache:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 15.804 | 8.10 | 10.288 | 3.11 | 26.092 | 6.13 |
|
||||
| 256 | 32 | 1 | 288 | 34.806 | 7.35 | 10.436 | 3.07 | 45.242 | 6.37 |
|
||||
| 512 | 32 | 1 | 544 | 69.839 | 7.33 | 10.597 | 3.02 | 80.437 | 6.76 |
|
||||
| 1024 | 32 | 1 | 1056 | 141.519 | 7.24 | 10.909 | 2.93 | 152.428 | 6.93 |
|
||||
| 2048 | 32 | 1 | 2080 | 310.669 | 6.59 | 11.430 | 2.80 | 322.099 | 6.46 |
|
||||
| 4096 | 32 | 1 | 4128 | 751.935 | 5.45 | 12.970 | 2.47 | 764.905 | 5.40 |
|
||||
| 8192 | 32 | 1 | 8224 | 2031.924 | 4.03 | 16.499 | 1.94 | 2048.424 | 4.01 |
|
||||
|
||||
Second run of PR without K cache quantization:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 20.898 | 6.12 | 10.378 | 3.08 | 31.276 | 5.12 |
|
||||
| 256 | 32 | 1 | 288 | 40.503 | 6.32 | 10.407 | 3.07 | 50.910 | 5.66 |
|
||||
| 512 | 32 | 1 | 544 | 70.978 | 7.21 | 10.629 | 3.01 | 81.607 | 6.67 |
|
||||
| 1024 | 32 | 1 | 1056 | 144.713 | 7.08 | 10.879 | 2.94 | 155.592 | 6.79 |
|
||||
| 2048 | 32 | 1 | 2080 | 311.658 | 6.57 | 11.718 | 2.73 | 323.376 | 6.43 |
|
||||
| 4096 | 32 | 1 | 4128 | 754.120 | 5.43 | 12.996 | 2.46 | 767.116 | 5.38 |
|
||||
| 8192 | 32 | 1 | 8224 | 2037.022 | 4.02 | 16.437 | 1.95 | 2053.458 | 4.00 |
|
||||
|
||||
main with q6_0 K cache:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 18.503 | 6.92 | 10.480 | 3.05 | 28.983 | 5.52 |
|
||||
| 256 | 32 | 1 | 288 | 31.320 | 8.17 | 10.858 | 2.95 | 42.178 | 6.83 |
|
||||
| 512 | 32 | 1 | 544 | 57.909 | 8.84 | 11.459 | 2.79 | 69.368 | 7.84 |
|
||||
| 1024 | 32 | 1 | 1056 | 118.199 | 8.66 | 12.679 | 2.52 | 130.878 | 8.07 |
|
||||
| 2048 | 32 | 1 | 2080 | 250.592 | 8.17 | 15.486 | 2.07 | 266.078 | 7.82 |
|
||||
| 4096 | 32 | 1 | 4128 | 541.938 | 7.56 | 20.315 | 1.58 | 562.253 | 7.34 |
|
||||
| 8192 | 32 | 1 | 8224 | 1353.169 | 6.05 | 30.144 | 1.06 | 1383.313 | 5.95 |
|
||||
|
||||
|
||||
|
||||
|
||||
main with q8_0 K cache:
|
||||
|
||||
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|
||||
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
|
||||
| 128 | 32 | 1 | 160 | 16.825 | 7.61 | 10.586 | 3.02 | 27.411 | 5.84 |
|
||||
| 256 | 32 | 1 | 288 | 33.362 | 7.67 | 10.894 | 2.94 | 44.255 | 6.51 |
|
||||
| 512 | 32 | 1 | 544 | 54.048 | 9.47 | 11.869 | 2.70 | 65.917 | 8.25 |
|
||||
| 1024 | 32 | 1 | 1056 | 109.381 | 9.36 | 13.128 | 2.44 | 122.509 | 8.62 |
|
||||
| 2048 | 32 | 1 | 2080 | 238.006 | 8.60 | 15.567 | 2.06 | 253.574 | 8.20 |
|
||||
| 4096 | 32 | 1 | 4128 | 553.239 | 7.40 | 21.099 | 1.52 | 574.339 | 7.19 |
|
||||
| 8192 | 32 | 1 | 8224 | 1351.138 | 6.06 | 32.240 | 0.99 | 1383.377 | 5.94 |
|
||||
|
||||
|
||||
|
||||
</details>
|
||||
|
||||
>I think one should make Flash Attention work with different K and V head sizes.
|
||||
|
||||
If that happened it would also have the benefit of allowing V cache quantization (not sure why FA is needed for that), which this model could really benefit from in its current implementation, which uses as much space as MHA. A proper MLA implementation would take up far less space.
|
||||
|
||||
>I'm playing with Deepseek-Lite and I'm finding that the CUDA performance is pretty bad
|
||||
|
||||
Other people have reported poor performance even for the larger Deepseek models, with TG at 10-14 t/s (although with an IQ1-based quant) even when fully offloaded to datacenter GPUs, and around the same performance for a 192GB Mac.
|
||||
|
||||
>So, I guess, your Epyc system wipes the floor with any GPU setup using partial GPU offload of Deepseek-R1.
|
||||
|
||||
Partial offload is reported to benefit from this: https://github.com/ggerganov/llama.cpp/pull/11397 and it is something I plan to test/use.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-01-30** at **17:12:27**:<br>
|
||||
|
||||
> not sure why FA is needed for that
|
||||
|
||||
Because without FA `V` gets transposed, which would break the quantization blocks if `V` was quantized. It gets transposed because in that way the matrix multiplication with `softmax(K*Q^T)` is much faster. With FA, `V` is not transposed, which allows it to be quantized. But, at least on the CPU, performance suffers quite a bit because of that. E.g., for a large context where all this matters, I see about 37% of the FA compute time being spent on `K*Q^T`, about 10% on `softmax(K*Q^T)`, and the remaining 53% on `V*softmax(K*Q^T)`. I.e., the matrix multiplication with the non-transposed `V` is ~50% slower compared to `K*Q^T`, although both multiplications require the same number of multiply-adds.
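One way to picture the problem, as a toy sketch using generic 32-value quantization blocks rather than a specific `ggml` type: in the transposed layout each new token writes a single value into every row, so a complete block of 32 consecutive values is never produced in one go, while the FA layout appends whole contiguous rows that can be quantized block by block as they arrive.

```cpp
// Toy illustration only (not ggml code). A quantization block ties one scale
// to 32 consecutive values of a row, so quantizing at write time needs whole
// contiguous runs of new values.
#include <cstddef>
#include <vector>

// Un-transposed V (as used with FA): token t appends head_dim consecutive
// values, i.e. head_dim/32 complete blocks that could be quantized immediately.
void append_v(std::vector<float> & v, const std::vector<float> & v_token) {
    v.insert(v.end(), v_token.begin(), v_token.end());
}

// Transposed V (the non-FA layout): token t contributes one value to each of
// the head_dim rows, a strided scatter that leaves every touched block only
// partially updated, so nothing can be quantized as it is written.
void append_v_transposed(std::vector<std::vector<float>> & v_t,
                         const std::vector<float> & v_token) {
    for (std::size_t d = 0; d < v_token.size(); ++d) {
        v_t[d].push_back(v_token[d]);
    }
}
```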
|
||||
|
||||
> Other people have reported poor performance even for the larger Deepseek models with TG at 10-14 t/s (although with an IQ1 based quant) even fully offloaded with datacenter GPU's, and around the same performance for a 192GB Mac.
|
||||
|
||||
I just made Deepseek-Lite also work on my Mac (M2-Max). I get TG-128 = 70 t/s on the CPU using `IQ4_NL_R4`, so basically half of an RTX-4080. Mainline `llama.cpp` gets 80 t/s on the M2-Max GPU (30 core version) and 63 t/s on the CPU for `IQ4_NL`. PP-512 is even more interesting: I get 292 t/s on the CPU, mainline `llama.cpp` manages 205 t/s on the CPU, but just 60 t/s on the GPU! So, there is some very serious bottleneck there, both on `CUDA` and `Metal`, for the Deepseek models.
|
||||
|
||||
---
|
||||
|
||||
👤 **fairydreaming** commented the **2025-02-01** at **08:09:20**:<br>
|
||||
|
||||
> So, as far as I can tell, the attention implementation in this PR leads to ~3X more multiply-adds (madds) when performing matrix multiplications. For prompt processing here we need `2 x 512 x 16 x n_token^2` madds, whereas the original implementation requires `(192 + 128) x 16 x n_token^2` madds. For TG, the PR still requires 3X more madds, namely `2 x 512 x 16 x n_prompt` madds here vs `(192 + 128) x 16 x n_prompt` on main. The only reason TG ends up being faster here is the shape of the tensors: On main it is 16 matrix multiplications each being `192 x n_prompt * 192 x 1` (`K*Q`) or `n_prompt x 128 * n_prompt x 1` (`V*softmax(K*Q)`). I.e., we have 16 GEMVs, which are 100% memory bound on modern CPUs. In this PR the TG shapes are `512 x n_prompt * 512 x 16` and `n_prompt x 512 * n_prompt x 16`, i.e., real GEMMs with much higher FLOPs, so we end up needing less time despite doing more work. Hence, the way it is implemented, there is no way one can recover PP performance.
|
||||
|
||||
This is something that I kind of intuitively expected; I mean, the whole point of DeepSeek MLA is to reduce KV cache memory size by storing the "compressed" latent representation of the KV vectors, but we still have to perform additional calculations to "decompress" them and use them to calculate attention scores and attention output.
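For reference, the ~3X figure in the quoted analysis follows directly from the stated dimensions (a 512-dimensional shared latent used twice, versus 192-dimensional K heads and 128-dimensional V heads):

$$\frac{2 \times 512 \times 16}{(192 + 128) \times 16} = \frac{1024}{320} = 3.2,$$

and the same ratio carries over to the per-token TG counts.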
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-02-09** at **15:02:19**:<br>
|
||||
|
||||
This is superseded by #188. Closing
|
||||
|
||||
---
|
||||
|
||||
👤 **jukofyork** commented the **2025-02-10** at **16:48:36**:<br>
|
||||
|
||||
@saood06
|
||||
|
||||
Just saw your linked post.
|
||||
|
||||
I see you have a slightly faster prompt processing speed, but what I'm confused about is why, when I have everything on the GPU apart from the 3 sets of non-shared experts' tensors, batch processing is hardly gaining anything, e.g.:
|
||||
|
||||
- I can get 3.5-5 tokens per second for token generation with careful NUMA placement and 30 threads of a 2-CPU system with ~78GB/s per node.
|
||||
- I can only get 9-10 tokens per second when using a batch of 1024+ and it should be pulling each set of tensors from RAM to VRAM and doing the work for the 1024 tokens in parallel. IMO this should be showing speeds like what KTransformers gets, but it's nothing like this and I'm near 100% sure there will be some glaring flaw in the way this is handled ***if*** I could actually profile the GGML stuff and see clearly WTF is going on to cause this!
|
||||
|
||||
---
|
||||
|
||||
👤 **jukofyork** commented the **2025-02-10** at **17:15:49**:<br>
|
||||
|
||||
> > I can only get 9-10 tokens per second for prompt processing when using a batch of 1024+ and it should be pulling each set of tensors from RAM to VRAM and doing the work for the 1024 tokens in parallel with 15x the memory bandwidth and 100x+ the compute. IMO this should be showing speeds like what KTransformers gets, but it's nothing like this and I'm near 100% sure there will be some glaring flaw in the way this is handled if I could actually profile the GGML stuff and see clearly WTF is going on to cause this!
|
||||
>
|
||||
> Can you try this fork, without MLA and this PR: #200 which adds FA support. This should be the fastest prompt processing you can do. Fairydreaming on his system with this fork without MLA and without FA and more optimizations reported 50 tok/s. [#180 (comment)](https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2624398627)
|
||||
>
|
||||
> If you want to try MLA, just use the -mla flag, which will turn MLA on.
|
||||
|
||||
Thanks - I will do, but it will probably be a couple of days due to running another experiment.
|
||||
@@ -0,0 +1,17 @@
|
||||
### 🔀 [#181](https://github.com/ikawrakow/ik_llama.cpp/pull/181) - Various
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-01-29 |
|
||||
| **Updated** | 2025-01-29 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
PR started by me adding the `-gp` option to `llama-bench` as per https://github.com/ggerganov/llama.cpp/pull/11126 because I wanted to test TG performance after a long prompt to be able to compare to the MLA attention implementation in https://github.com/ggerganov/llama.cpp/pull/11446.
|
||||
|
||||
But then I noticed that the repacked `Q8_0` and `Q4_0` quants do not work for tensor row sizes that are not a multiple of 128 (4 x block size of 32), which is the case for some of the tensors in Deepseek2-Lite that I used for testing, so I fixed that.
|
||||
|
||||
And then I was comparing performance after the fix on `Llama-3.2-1B`, and noticed that FA with `Q8_0` K-cache does not work. `Llama-3.2-1B` has a head size of 64 and there was a comment in the code that `Q8_0` does not work for head sizes less than 128, so I fixed that as well.
|
||||
@@ -0,0 +1,22 @@
|
||||
### 🔀 [#182](https://github.com/ikawrakow/ik_llama.cpp/pull/182) - Faster Q4_K_R4 and Q5_K_R4 on AVX2/Zen4
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-01-30 |
|
||||
| **Updated** | 2025-01-30 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
TG is about the same. PP-512 comparison between main and this PR for LLaMA-3.1-8B on a Ryzen-5975WX (`AVX2`) and a Ryzen-7950X (`Zen4`)
|
||||
|
||||
| model | backend | threads | test | t/s (main) | t/s (PR) | Speedup |
|
||||
| ---------------- | ---------- | ------: | ------: | ---------------: | ------------: | -------: |
|
||||
| llama 8B Q4_K_S | AVX2 | 32 | pp512 | 291.90 ± 0.64 | 327.98 ± 0.51 | 1.124 |
|
||||
| llama 8B Q5_K_S | AVX2 | 32 | pp512 | 273.59 ± 0.37 | 302.13 ± 0.61 | 1.104 |
|
||||
| llama 8B Q4_K_S | Zen4 | 16 | pp512 | 258.78 ± 1.05 | 267.69 ± 0.31 | 1.034 |
|
||||
| llama 8B Q5_K_S | Zen4 | 16 | pp512 | 246.19 ± 0.65 | 249.12 ± 0.42 | 1.012 |
|
||||
|
||||
The improvement on `Zen4` is very minor. The benefit there is bloat reduction as I'm now reusing the same implementation as `AVX2`.
|
||||
@@ -0,0 +1,18 @@
|
||||
### 🔀 [#184](https://github.com/ikawrakow/ik_llama.cpp/pull/184) - Deepseek-Lite
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-01-30 |
|
||||
| **Updated** | 2025-01-30 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
I was playing with Deepseek-Lite and noticed that
|
||||
* Quantization mixes are inadequate, so added a few quick changes to that
|
||||
* As some of the tensors' row sizes are not divisible by 256, we get quite a few tensors quantized with `IQ4_NL`, and I noticed that after repacking to `IQ4_NL_R4` it does not work for row sizes that are not a multiple of 128 (4 blocks). So, I fixed that (AVX2 and Zen4).
|
||||
* While at it, I also fixed `Q5_0_R4` and `Q6_0_R4`.
|
||||
|
||||
Quantization error as measured by PPL is surprisingly low for the low-bit quants; even `IQ1_S` is kind of semi-usable. It is not a "true" `IQ1_S` quantization as quite a few tensors get quantized to `IQ4_NL`, and I changed the attention tensors, which represent a tiny fraction of the overall model size, to be quantized with much higher bpw. We end up using 2.525 bpw for the repeating layers, and `PPL(IQ1_S)/PPL(fp16) - 1 = 49.4%`. But I now understand the hype around the Internet when, the other day, somebody was pretending to have invented 1-bit quantization and quantization mixes by using `IQ1_S` in `llama.cpp` for Deepseek-R1.
|
||||
github-data/pull_requests/185-IQ1_S_R4_ better 1.5 bpw quants.md
File diff suppressed because it is too large
@@ -0,0 +1,20 @@
|
||||
### 🔀 [#186](https://github.com/ikawrakow/ik_llama.cpp/pull/186) - iq1_s_r4: slightly faster NEON gemm/gemv
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-05 |
|
||||
| **Updated** | 2025-02-05 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
DeepSeek-Lite on M2-Max CPU:
|
||||
|
||||
| model | threads | test | t/s (main) | t/s (PR) | Speedup |
|
||||
| ---------------------- | ------: | -------: | ---------------: | ---------------: | -------: |
|
||||
| deepseek2 16B IQ1_S_R4 | 2 | tg128 | 22.76 ± 0.15 | 24.07 ± 0.19 | 1.058 |
|
||||
| deepseek2 16B IQ1_S_R4 | 4 | tg128 | 37.83 ± 0.00 | 39.58 ± 0.02 | 1.046 |
|
||||
| deepseek2 16B IQ1_S_R4 | 8 | tg128 | 62.01 ± 0.02 | 65.26 ± 0.82 | 1.052 |
|
||||
| deepseek2 16B IQ1_S_R4 | 8 | pp512 | 251.97 ± 0.09 | 283.20 ± 0.54 | 1.124 |
|
||||
@@ -0,0 +1,39 @@
|
||||
### 🔀 [#187](https://github.com/ikawrakow/ik_llama.cpp/pull/187) - IQ1_M_R4: better 1.75 bpw quants
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-06 |
|
||||
| **Updated** | 2025-02-06 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Following in the footsteps of #185, this PR adds `IQ1_M_R4`, a 4-row interleaved version of `IQ1_M`.
|
||||
|
||||
* I have removed the `f16` super-block scale (replaced with an `f16` per-row scale) and have changed the 3-bit `IQ1_M` block scales to 4 bits. Hence, we end up using the same 1.75 bpw as `IQ1_M`.
|
||||
* The above change allows implementing `IQ1_M_R4` with a block size of 32. I wanted to have this because DeepSeek-Lite, the model I'm testing with, has a lot of tensors with row sizes not divisible by 256, so a significant fraction of tensors gets quantized to `IQ4_NL` when using `IQ1_M`.
|
||||
* Quantization mixes for MoE models are adjusted. Today's mainline `llama.cpp` arrives at a context-512 perplexity (`PPL(512)` in what follows) of 20.75 for DeepSeek-Lite using 2.74 bpw with `IQ1_M`. The `IQ1_M_R4` quantization in this PR gets `PPL-512 = 8.85` with 1.966 bpw for the repeating layers.
|
||||
* `IQ1_M_R4` is **much faster** on the CPU compared to `IQ1_M` (see tables below). I never implemented iqk-style GEMM for `IQ1_S/IQ1_M`, so these quantization types run at the snail speed of mainline `llama.cpp`.
|
||||
* Caveat: it is CPU only for now.
|
||||
|
||||
The following table compares prompt processing (pp512) and token generation (tg128) speed for LLaMA-3.1-8B on `AVX2` (Ryzen-5975WX), `Zen4` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). I didn't use DeepSeek-Lite for this comparison to avoid the difference in quantization types one ends up with due to not all tensors having row sizes that are multiple of 256.
|
||||
|
||||
| platform | threads | test | t/s (IQ1_M) | t/s (IQ1_M_R4) | Speedup |
|
||||
| ---------- | ------: | ------------: | ---------------: | ---------------: | -------: |
|
||||
| AVX2 | 32 | pp512 | 43.98 ± 0.09 | 187.94 ± 0.21 | 4.273 |
|
||||
| Zen4 | 16 | pp512 | 26.70 ± 0.03 | 149.57 ± 0.31 | 5.602 |
|
||||
| NEON | 8 | pp512 | 17.61 ± 0.03 | 95.04 ± 0.16 | 5.397 |
|
||||
| AVX2 | 2 | tg128 | 2.66 ± 0.00 | 3.96 ± 0.00 | 1.489 |
|
||||
| | 4 | tg128 | 5.25 ± 0.00 | 7.76 ± 0.00 | 1.478 |
|
||||
| | 8 | tg128 | 9.93 ± 0.16 | 13.71 ± 0.01 | 1.381 |
|
||||
| | 16 | tg128 | 17.14 ± 0.00 | 22.60 ± 0.01 | 1.319 |
|
||||
| | 32 | tg128 | 23.91 ± 0.01 | 25.39 ± 0.02 | 1.062 |
|
||||
| Zen4 | 2 | tg128 | 3.39 ± 0.00 | 5.29 ± 0.00 | 1.560 |
|
||||
| | 4 | tg128 | 6.50 ± 0.00 | 10.19 ± 0.00 | 1.568 |
|
||||
| | 8 | tg128 | 11.68 ± 0.01 | 17.54 ± 0.01 | 1.502 |
|
||||
| | 16 | tg128 | 19.13 ± 0.05 | 25.91 ± 0.43 | 1.354 |
|
||||
| NEON | 2 | tg128 | 4.16 ± 0.00 | 5.27 ± 0.01 | 1.267 |
|
||||
| | 4 | tg128 | 7.88 ± 0.00 | 9.99 ± 0.01 | 1.268 |
|
||||
| | 8 | tg128 | 14.74 ± 0.26 | 19.19 ± 0.01 | 1.302 |
|
||||
@@ -0,0 +1,91 @@
|
||||
### 🔀 [#188](https://github.com/ikawrakow/ik_llama.cpp/pull/188) - Add optional MLA
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-06 |
|
||||
| **Updated** | 2025-02-11 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
This PR is derived from #180. The difference to #180 is that MLA is made optional. It is off by default, and can be turned on using the added `-mla` or `--use-mla` command line option.
|
||||
|
||||
Rationale: MLA improves TG speed, especially when there is a long context. But it also makes prompt processing significantly slower. Hence, MLA is made optional since advantage/disadvantage is use case dependent.
|
||||
|
||||
Being able to select or deselect MLA at run time is possible due to the fact that #180 leaves the original `wkv_b` tensor and its decomposition into `wk_b` and `wv_b` in the model. This is somewhat wasteful, but these tensors are not very large and now come in handy for easily selecting between the two attention implementations.
|
||||
|
||||
In addition:
|
||||
* It is now possible to use a model converted without this PR, i.e., one where the `wk_b` and `wv_b` tensors are missing. In this case MLA will be disabled even if requested on the command line
|
||||
* Eliminated some unnecessary copies (`ggml_cont`). This repo has supported non-contiguous RoPE for a while and non-contiguous RMS norm on CUDA was added in #190 (the CPU has always supported non-contiguous RMS norm).
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **saood06** commented the **2025-02-08** at **11:23:52**:<br>
|
||||
|
||||
There were some other changes in gguf-py/gguf/tensor_mapping.py in that branch that I missed porting over earlier.
|
||||
|
||||
The next thing I was going to do was to stop the old KV cache from being allocated. I hadn't gotten around to it, as I had a workaround from the mmap KV cache feature, but it should be a relatively simple fix; when I have more time I'll look into it.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-02-08** at **19:51:36**:<br>
|
||||
|
||||
@ikawrakow I made #195 to merge into this with the things mentioned.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-09** at **11:09:23**:<br>
|
||||
|
||||
I think we can merge this now.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** submitted a review the **2025-02-09** at **17:28:01**: ✅ `APPROVED`<br>
|
||||
|
||||
LGTM, good catch on applying cache quantization; it was something I had missed. BF16 makes sense when it is faster, but I never bothered as I assumed it would come with a large quality loss.
|
||||
|
||||
Once this is merged I'll make PRs for the warmup MoE fix and then the mmap KV allocator.
|
||||
|
||||
Testing was a bit of a pain without the warmup MoE fix, as loading in experts takes much longer (and it is already quite long as this server has no SSD, only HDD) and takes many runs instead of just one warmup. PP seems slightly lower compared to my local testing branch, but that might just be variance, or come from the mmap KV allocator that I have yet to make a PR for.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-09** at **17:48:32**:<br>
|
||||
|
||||
> BF16 makes sense when it is faster, but I never bothered as I'm assuming it would come with a large quality loss.
|
||||
|
||||
Why? Most modern models are trained in `bf16`, so `bf16` will be better than `fp16`. But if the CPU does not have native `bf16` support it will be somewhat slower.
|
||||
|
||||
> Once this is merged I'll make PRs for the warmup MoE fix and then the mmap KV allocator.
|
||||
|
||||
Sounds good.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-02-09** at **18:28:01**:<br>
|
||||
|
||||
> > BF16 makes sense when it is faster, but I never bothered as I'm assuming it would come with a large quality loss.
|
||||
>
|
||||
> Why? Most modern models are trained in `bf16`, so `bf16` will be better than `fp16`. But if the CPU does not have native `bf16` support it will be somewhat slower.
|
||||
>
|
||||
I misspoke; I meant I never bothered quantizing the MLA version down to Q4 or Q6 as I did with the non-MLA solution. I know most models are bf16 native (Deepseek was FP8 native, which I had to upscale to BF16 before making the GGUF), and I would use BF16 if I had a modern processor with support for it.
|
||||
|
||||
The old solution was MHA, which quantizes down very well and is large enough to warrant it. Heavy GQA does not; MLA is sized like GQA and is also small enough that I'm fine leaving it in F16, as my CPU is old and doesn't do BF16, but if I had a modern CPU I would use BF16.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** submitted a review the **2025-02-11** at **20:15:12**: 💬 `COMMENTED`
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented during a code review the **2025-02-11** at **20:20:39** on `src/llama.cpp`:<br>
|
||||
|
||||
With the above change only one of these should be allocated, so that is the only one that should be displayed as the KV self size.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** submitted a review the **2025-02-11** at **20:20:40**: 💬 `COMMENTED`
|
||||
@@ -0,0 +1,13 @@
|
||||
### 🔀 [#189](https://github.com/ikawrakow/ik_llama.cpp/pull/189) - Rename q4_0_r4, q8_0_r4 and iq4_xs_r4 to _r8
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-06 |
|
||||
| **Updated** | 2025-02-06 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
to reflect the actual number of interleaved rows.
|
||||
@@ -0,0 +1,26 @@
|
||||
### 🔀 [#19](https://github.com/ikawrakow/ik_llama.cpp/pull/19) - Skip barriers of noops
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2024-08-14 |
|
||||
| **Updated** | 2024-08-14 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
`GGML_OP_RESHAPE, GGML_OP_VIEW, GGML_OP_PERMUTE, GGML_OP_TRANSPOSE`, along with `GGML_OP_NONE`, are all noops in `ggml`. I.e., nothing happens. But `ggml` still has a thread barrier after them, which wastes time. The waste is not too bad for large models where computations are long compared to the time taken for thread synchronization. But for small models skipping those unnecessary waits makes a noticeable difference.
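A minimal sketch of the scheduling idea, with toy types and C++20 threading standing in for the actual `ggml` scheduler:

```cpp
// Sketch only: threads walk the graph in lockstep and synchronize after each
// op, but the barrier is skipped after ops that do no work, since a reshape/
// view/permute/transpose only changes tensor metadata and produces nothing
// the other threads need to wait for.
#include <barrier>
#include <thread>
#include <vector>

enum class Op { None, Reshape, View, Permute, Transpose, MulMat };

static bool is_noop(Op op) {
    return op == Op::None || op == Op::Reshape || op == Op::View ||
           op == Op::Permute || op == Op::Transpose;
}

int main() {
    const int n_threads = 4;
    const std::vector<Op> graph = { Op::View, Op::MulMat, Op::Permute, Op::MulMat };
    std::barrier sync(n_threads);

    auto worker = [&](int /*tid*/) {
        for (Op op : graph) {
            // ... each thread would compute its slice of `op` here (noops do nothing) ...
            if (!is_noop(op)) {
                sync.arrive_and_wait();   // only pay for synchronization after real work
            }
        }
    };

    std::vector<std::jthread> pool;
    for (int t = 0; t < n_threads; ++t) {
        pool.emplace_back(worker, t);
    }
}
```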
|
||||
|
||||
Let's look at a really tiny model - the [99M parameter TriLM ternary model](https://huggingface.co/SpectraSuite/TriLM_99M_Unpacked) quantized with `IQ2_TN`. The following table compares performance for PP-512 and TG-128 with and without the change in this PR
|
||||
|
||||
| CPU | threads | test | t/s (main) | t/s (PR) | Speedup |
|
||||
| ---------- | ------: | ------------: | ---------------: | ---------------: | -------: |
|
||||
| Ryzen-7950X| 16 | pp512 | 11386.75 ± 19.08 | 11587.58 ± 34.26 | 1.018 |
|
||||
| Ryzen-7950X| 8 | tg128 | 1312.25 ± 1.02 | 1460.80 ± 1.69 | 1.113 |
|
||||
| M2-Max | 8 | pp512 | 7642.81 ± 22.07 | 7680.29 ± 9.29 | 1.005 |
|
||||
| M2-Max | 8 | tg128 | 992.83 ± 18.17 | 1096.47 ± 14.45 | 1.104 |
|
||||
|
||||
So, basically, for such a small model `ggml` spends 10% of its time waiting for threads to pass through a barrier after a noop when generating tokens.
|
||||
|
||||
There are other barriers that can be eliminated. E.g., the typical attention block involves matrix multiplications of the `Q, K` and `V` tensors with the **same** activations, so there is no need to synchronize threads after each such matrix multiplication. In a similar way, in the feed-forward portion of the network the `ffn_up` and `ffn_gate` tensors multiply the same activations, so one can save another barrier there. This is left for a future PR.
|
||||
@@ -0,0 +1,15 @@
|
||||
### 🔀 [#190](https://github.com/ikawrakow/ik_llama.cpp/pull/190) - cuda: non-contiguous rms norm
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-06 |
|
||||
| **Updated** | 2025-02-07 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Derived from https://github.com/ggerganov/llama.cpp/pull/11659
|
||||
|
||||
Minor benefit for DeepSeek-Lite (~2% faster TG).
|
||||
@@ -0,0 +1,13 @@
|
||||
### 🔀 [#191](https://github.com/ikawrakow/ik_llama.cpp/pull/191) - Add additional checks for iq1_s_r4 quantization
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-07 |
|
||||
| **Updated** | 2025-02-07 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Something goes wrong when quantizing DeepSeek-R1 with `IQ1_S_R4` (see #185), so adding additional checks in the quantization.
|
||||
@@ -0,0 +1,19 @@
|
||||
### 🔀 [#192](https://github.com/ikawrakow/ik_llama.cpp/pull/192) - Revert [#79](https://github.com/ikawrakow/ik_llama.cpp/issues/79)
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-07 |
|
||||
| **Updated** | 2025-02-08 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
While testing potential improvements of `IQ1_S_R4` quantization, I ran into NaNs while running a DeepSeek-Lite perplexity calculation. I did a `grep -r` on a folder with many big files while running the calculation and suddenly I got a NaN PPL. I repeated the calculation without doing anything else at the same time and the NaN did not happen. I then ran with 32 threads on a 16-core system and was able to reliably get a NaN at some random chunk.
|
||||
|
||||
This means there is a race.
|
||||
|
||||
The race was most likely introduced in #79 (avoid repeating already done quantizations of activations). I honestly do not understand why there could be a race, let alone why it would only happen for DeepSeek-Lite quantized with `IQ1_S_R4`. I have done countless runs since #79 and never observed anything suspicious.
|
||||
|
||||
Either way, this PR reverts #79. After doing so, there aren't any NaNs no matter how busy I make the system while running DeepSeek-Lite inference. Hopefully this will also fix the NaNs @saood06 gets with `IQ1_S_R4` quantized DeepSeek-R1 (see discussion in #185).
|
||||
@@ -0,0 +1,79 @@
|
||||
### 🔀 [#193](https://github.com/ikawrakow/ik_llama.cpp/pull/193) - RPC sync
|
||||
|
||||
| **Author** | `saood06` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-08 |
|
||||
| **Updated** | 2025-06-15 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
I grabbed all of the changes needed for [llama.cpp/pull/11047](https://github.com/ggerganov/llama.cpp/pull/11047) , which was https://github.com/ggerganov/llama.cpp/pull/9912 and https://github.com/ggerganov/llama.cpp/pull/9040
|
||||
|
||||
This compiles, but has not been tested yet.
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-08** at **13:23:08**:<br>
|
||||
|
||||
I never use RPC, have never looked into the RPC code, so I'll have to rely on you for self-review and testing.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-02-10** at **16:40:34**:<br>
|
||||
|
||||
@jukofyork
|
||||
>I strongly suspect something funky is going on
|
||||
|
||||
There is, see this comment: https://github.com/ikawrakow/ik_llama.cpp/pull/180#issuecomment-2625090660
|
||||
|
||||
|
||||
This fork has much faster PP speeds, has Deepseek MLA support behind a flag (-mla), this PR should allow RPC to work, and I'm working on porting over the option to override model tensor buffers.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-02-27** at **23:11:54**:<br>
|
||||
|
||||
This has been tested, and does not currently work. I'm not sure why as the errors I'm getting seem to have never been encountered by people on llama.cpp.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** submitted a review the **2025-02-27** at **23:14:23**: 💬 `COMMENTED`
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented during a code review the **2025-02-27** at **23:14:23** on `ggml/src/ggml-rpc.cpp`:<br>
|
||||
|
||||
The RPC client crashes here, which happens as the RPC server hits an issue.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** submitted a review the **2025-02-27** at **23:17:32**: 💬 `COMMENTED`
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented during a code review the **2025-02-27** at **23:17:32** on `ggml/src/ggml-rpc.cpp`:<br>
|
||||
|
||||
I'm fairly certain this is where the RPC server is crashing, although it doesn't print the message as I never ran with GGML_DEBUG on.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-04-12** at **04:39:37**:<br>
|
||||
|
||||
> @saood06
|
||||
>
|
||||
> I just came across another [llama.cpp fork called prima.cpp](https://github.com/Lizonghang/prima.cpp?tab=readme-ov-file#-key-features) which claims to have improved support for multi-device distributed inferencing.
|
||||
>
|
||||
> I haven't tried it, just saw it on reddit today. Might be worth a shot given your GPU is in a different system than your big RAM box.
|
||||
|
||||
Thanks for the link, it is interesting. I think it would work for dense models but not as well for MoE because, as far as I can tell, it doesn't handle `-ot` ([this](https://github.com/Lizonghang/prima.cpp/commit/631daadd92bfd27504c89d14ff6cd3d4ae007d53) commit looks relevant). I'd also need Windows support, which is on the roadmap (but I might see what the issue is by trying to build it on my machine, and see if I can fix it), and the GPU machine has to run Windows (my big RAM box runs Clear Linux, and I have other servers that run FreeBSD and Proxmox).
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-06-15** at **11:26:50**:<br>
|
||||
|
||||
Closed as superseded by #480 / #506
|
||||
@@ -0,0 +1,45 @@
|
||||
### 🔀 [#194](https://github.com/ikawrakow/ik_llama.cpp/pull/194) - Use Q8_K_128 for IQ1_S_R4 and IQ1_M_R4 matrix multiplications
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-08 |
|
||||
| **Updated** | 2025-02-09 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
@saood06 is still observing NaNs for DeepSeek-R1 quantized with `IQ1_S_R4`. As I don't see what else could be wrong, I'm making the following hypothesis:
|
||||
|
||||
1. Given the discussions about DeepSeek-R1 becoming "dumb" when `fp16` is used for some of the attention tensors, I hypothesize that there are activations that go beyond the range of `fp16` floats, which get truncated when converted from `fp32` to `fp16` for multiplications with some `fp16` model tensor.
|
||||
2. If this is the case, using `Q8_1` as quantization type for activations, as `IQ1_S_R4` does, can be futile:
|
||||
* Suppose there is some block of 32 activations that has a maximum $x_{\rm max} > {\rm f16}_{\rm max}$
|
||||
* Suppose that the block scale $d = x_{\rm max}/127$ is in the `f16` range. This is likely to be the case as `Q8_0` attention tensors are reported to behave better than `fp16`.
|
||||
* In `Q8_1` we also compute $s = d \sum q_i$, where $q_i$ are the 8-bit quants. The scaled sum $s$ is also stored as `fp16`. If one gets unlucky, it can overflow despite $d$ being in range (see the numeric example after this list).
|
||||
* If this occurs, we will get a completely bogus result for the `IQ1_S_R4` dot product with this block. To make the calculation more efficient on `AVX2`, we use ternary quants $0, 1, 2$ (instead of $-1, 0, 1$) to multiply the Q8 quants (so we can use `_mm256_maddubs_epi16`) , and then recover the correct result by subtracting $s$ from the result. But if $s$ is wrong (truncated because outside the `fp16` range), this does not work and we get a wrong result.
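To put illustrative numbers on the overflow scenario (values picked for the example, not measured): take a block of 32 activations whose maximum is $x_{\rm max} = 10^5$, just above the `fp16` maximum of 65504. Then

$$d = \frac{x_{\rm max}}{127} \approx 787 \quad\text{(fits comfortably in fp16)}, \qquad s = d \sum_i q_i \le 787 \times 32 \times 127 \approx 3.2\times 10^6,$$

so while $d$ round-trips through `fp16` fine, $s$ leaves the `fp16` range as soon as $\sum_i q_i$ exceeds about 84, and the stored (truncated) $s$ then corrupts the correction step described above.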
|
||||
|
||||
To test this hypothesis, this draft PR uses `Q8_K_128` for `IQ1_S_R4` and `IQ1_M_R4` matrix multiplications. `Q8_K_128` is a new 8-bit quantization type similar to `Q8_K` but with blocks of 128 (so I can test with DeepSeek-Lite). It is draft because I haven't done the `ARM_NEON` implementation. `Q8_K_128` uses a 32-bit float scale, and the sums over blocks of 32 are stored as `int16_t` without multiplying with $d$, hence we cannot run into 16-bit float range issues. Perplexity for DeepSeek-Lite is slightly lower compared to using `Q8_1`, which indicates that there may be non-fatal truncation effects also there (normally one expects a slightly higher accuracy from using `Q8_0` or `Q8_1` because of the smaller block size).
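A hedged sketch of what such a block might look like, following the description above (field names and ordering are illustrative, not the actual `ggml` definition):

```cpp
// Illustrative layout only: one fp32 scale per 128 quants, plus raw integer
// sub-block sums that are never multiplied by the scale, so no intermediate
// value ever has to fit into the fp16 range.
#include <cstdint>

struct block_q8_K128 {
    float   d;          // 32-bit float scale for the whole 128-value block
    int8_t  qs[128];    // 8-bit quants
    int16_t bsums[4];   // per-32 sub-block sums of the quants (at most 32*127 = 4064, fits easily)
};
static_assert(sizeof(block_q8_K128) == 4 + 128 + 4 * 2, "unexpected padding");
```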
|
||||
|
||||
Would appreciate if this gets tested with DeepSeek-R1.
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **saood06** commented the **2025-02-08** at **21:39:38**:<br>
|
||||
|
||||
@ikawrakow
|
||||
>Would appreciate if this gets tested with DeepSeek-R1.
|
||||
|
||||
Done.
|
||||
|
||||
[1]3.7099,[2]4.6162,[3]3.5438,[4]3.4199,[5]3.5375,[6]3.5710,[7]3.5428,[8]3.6748,[9]3.7417,[10]3.6724,[11]3.7879,[12]3.9602,[13]4.0477,[14]4.1439,[15]4.2809,[16]4.1981,[17]4.3853,[18]4.5141,[19]4.4493,[20]4.3848,[21]4.4664,[22]4.3290,[23]4.1912,[24]4.1799,[25]4.0693,[26]4.0135,[27]4.0672,[28]4.0459,[29]4.1110,[30]4.1116,[31]4.1261,[32]4.1192,[33]4.1756,[34]4.2340,[35]4.3112,[36]4.3722,[37]4.3822,[38]4.4260,[39]4.4568,[40]4.5164,[41]4.5661,[42]4.5563,[43]4.5975,[44]4.5821,[45]4.6738,[46]4.7199,[47]4.7029,[48]4.6934,[49]4.6900,[50]4.7087,[51]4.7637,[52]4.7736,[53]4.8515,[54]4.8776,[55]4.9119,[56]4.9504,[57]4.9769,[58]5.0124,[59]5.0024,[60]5.0545,[61]5.1015,[62]5.1639,[63]5.2095,[64]5.2599,
|
||||
|
||||
No more `NaN`'s, nice! It's impressive how quickly you found the race condition and this issue.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-09** at **06:02:29**:<br>
|
||||
|
||||
Thank you for this! The decisive hint to solve it was the discussion about DeepSeek-R1 being dumb with `fp16` attention tensors that you alerted me to.
|
||||
@@ -0,0 +1,32 @@
|
||||
### 🔀 [#195](https://github.com/ikawrakow/ik_llama.cpp/pull/195) - Deepseek MLA Optimizations V2
|
||||
|
||||
| **Author** | `saood06` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-08 |
|
||||
| **Updated** | 2025-02-09 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
@ikawrakow
|
||||
|
||||
This PR contains the following things
|
||||
- A fairydreaming commit that is supposed to increase PP
|
||||
- Avoid allocating the MHA KV cache in MLA mode
|
||||
- Adds a change I originally missed that is used for gguf-py.
|
||||
|
||||
I will follow up with:
|
||||
- Having all the MoE experts load during warmup; that can be placed in this PR if you want, or in a separate one. It is a very large QoL feature for large MoE models. Without it the model is slowly loaded in on use; with it, the model is loaded immediately and at a faster rate.
|
||||
- The mmap-based KV cache buffer; it is functional but I have yet to make it a CLI option.
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** submitted a review the **2025-02-09** at **07:36:43**: ✅ `APPROVED`<br>
|
||||
|
||||
Looks good. I added a minor change to check if `wk_b` and `wv_b` are available before turning on MLA (so we don't crash if someone is using an old model and asked for MLA).
|
||||
|
||||
PP-4096 for `Q8_0_R8` quantized DeepSeek-Lite with `-mla` goes up to 292 t/s from 275 t/s with this change.
|
||||
@@ -0,0 +1,22 @@
|
||||
### 🔀 [#197](https://github.com/ikawrakow/ik_llama.cpp/pull/197) - FA: Add option to build all FA kernels
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-09 |
|
||||
| **Updated** | 2025-02-09 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Similar to the CUDA situation.
|
||||
It is OFF by default.
|
||||
If OFF, only `F16, Q8_0, Q6_0`, and, if the CPU provides native `BF16` support, `BF16` CPU FA kernels will be included.
|
||||
To enable all,
|
||||
```
|
||||
cmake -DGGML_IQK_FA_ALL_QUANTS=1 ...
|
||||
```
|
||||
|
||||
This cuts compilation time for `iqk_mul_mat.cpp` by almost half (45 seconds vs 81 seconds on my Ryzen-7950X).
|
||||
This is a poor man's solution to the long build time until #183 is tackled.
|
||||
@@ -0,0 +1,33 @@
|
||||
### 🔀 [#198](https://github.com/ikawrakow/ik_llama.cpp/pull/198) - Load all MoE experts during warmup and make warmup 1 token
|
||||
|
||||
| **Author** | `saood06` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-09 |
|
||||
| **Updated** | 2025-02-10 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
First commit is a port of: https://github.com/ggerganov/llama.cpp/pull/11571
|
||||
|
||||
The second commit is based on what fairydreaming has reported here: https://github.com/ggerganov/llama.cpp/discussions/11733 and also unifies warmup to always be one token.
|
||||
|
||||
This allows warmup to actually warmup an MoE model as all experts are exercised.
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** submitted a review the **2025-02-10** at **07:12:56**: ✅ `APPROVED`<br>
|
||||
|
||||
LGTM, but it does nothing on the single-socket computers I currently have available, so I'm relying on the comments in the linked PR and issue that this really improves things on NUMA systems.
|
||||
|
||||
---
|
||||
|
||||
👤 **saood06** commented the **2025-02-10** at **14:52:48**:<br>
|
||||
|
||||
> LGTM, but it does nothing on the single socket computers I have currently available, so relying on the comments in the linked PR and issue that this really improves things on NUMA systems.
|
||||
|
||||
The first commit should work on any system to help MoE loading (Deepseek is the most noticeable because of its large size and expert count, but all MoE models should benefit). It is only the second commit that is designed to benefit NUMA systems.
|
||||
@@ -0,0 +1,13 @@
|
||||
### 🔀 [#2](https://github.com/ikawrakow/ik_llama.cpp/pull/2) - Offload Bitnet token embeddings to the GPU - the right way
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2024-07-26 |
|
||||
| **Updated** | 2024-07-26 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
OK, I should have checked how it was done for Gemma and done the same for Bitnet. But better late than never.
|
||||
@@ -0,0 +1,15 @@
|
||||
### 🔀 [#20](https://github.com/ikawrakow/ik_llama.cpp/pull/20) - iq2_k: slightly better bpw - accuracy compromise
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2024-08-19 |
|
||||
| **Updated** | 2024-08-19 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
For LLaMA-3.1 models:
|
||||
* It is better to quantize all of attn_v with iq3_k instead of half of attn_v with iq4_k
|
||||
* Quantizing attn_output with iq3_k results in a larger PPL decrease compared to what one expects from the added bpw.
|
||||
@@ -0,0 +1,41 @@
|
||||
### 🔀 [#200](https://github.com/ikawrakow/ik_llama.cpp/pull/200) - DeepSeek FA support (CPU only)
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-10 |
|
||||
| **Updated** | 2025-02-11 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
This PR adds FA support for models where K and V head sizes are different, such as DeepSeek-R1 and DeepSeek-Lite. It only works with the standard attention mechanism, I have yet to look into FA with MLA.
|
||||
|
||||
We get a nice speedup for PP, increasing with context length, but TG is not faster. I want to play some more with it, but I'm throwing it out there in case someone wants to try. For sure this allows longer contexts to be processed, as `-ctk q8_0 -ctv q8_0` seems perfectly adequate.
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-11** at **09:08:44**:<br>
|
||||
|
||||
So, I did get some minor FA speed improvements for TG, but I don't see what else one could do, so I'll merge it.
|
||||
|
||||
Here is a performance comparison between baseline (`Q8_0` K-cache, no FA, no MLA), MLA (`Q8_0` K-cache) and FA (`Q8_0` K and V cache) for DeepSeek-Lite running on a Ryzen-7950X CPU. Both graphs show the MLA and FA performance ratio to baseline.
|
||||
|
||||
First graph is prompt processing speed. We see FA giving a ~40% performance boost at 16k tokens compared to baseline. MLA is 2X slower than baseline and 2.8X slower than FA at 16k tokens.
|
||||
|
||||

|
||||
|
||||
The second graph is token generation speed (TG-64) after a prompt of a given length (i.e., TG speed as a function of the number of tokens in the KV cache). We do get some performance gains for very long prompts from FA (~10% at 16k tokens), but by far not as much as from MLA. MLA is 1.57X faster than baseline and 1.43X faster than FA at 16k tokens.
|
||||
|
||||

|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-11** at **10:33:34**:<br>
|
||||
|
||||
Recently I read somewhere that for the "common enterprise workflow" (whatever that means) the number of generated tokens is typically only about 10% of the prompt tokens. I don't know if that is true, but for the sake of argument, let's assume for a moment that it is. In that case the best way to measure overall model performance is to use `llama-bench -pg Npp,Ntg`, where `Ntg=0.1*Npp` is the number of generated tokens and `Npp` is the number of prompt tokens. The following graph shows `PG` performance as a function of prompt length. The black symbols are mainline `llama.cpp build b9ab0a4d (4687)` (most current version as of today), the red symbols are for baseline `ik_llama.cpp` (no FA, no MLA), the green symbols are for MLA, and the blue symbols are for FA from this PR. The model is DeepSeek-Lite quantized with `IQ4_XS`. All use `Q8_0` for K cache, FA uses `Q8_0` also for V cache. All runs are on a Ryzen-7950X CPU. If we buy the claim that `Ntg ~ 0.1*Npp` in the "typical enterprise workflow", then there is no benefit from MLA over baseline, while FA is ~26% better for long prompts. Mainline `llama.cpp` is, as usual, slower. 1.45X for short prompts, increasing to 1.7X slower for prompts with 16k tokens.
|
||||
|
||||

|
||||
@@ -0,0 +1,46 @@
|
||||
### 🐛 [#202](https://github.com/ikawrakow/ik_llama.cpp/pull/202) - Fix imatrix overprotectiveness
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-11 |
|
||||
| **Updated** | 2025-02-12 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
I hear reports that people are having trouble creating imatrix data for models with many experts (e.g., DeepSeek-R1, Arctic). For such models it may be very hard to activate all experts in all layers, which, it turns out, leads to the data for **the entire** tensor containing experts with missing data not being stored in the imatrix file. This then prevents usage of the imatrix data for low-bit quantization of such models.
|
||||
|
||||
It wasn't like this when I added the imatrix to `llama.cpp`, but it turns out the protection police has been at work and has added these checks, which I then inherited when syncing with upstream. Thanks to @saood06 for making me aware of this unfortunate situation.
|
||||
|
||||
This PR reduces the powers of the protection police. If a tensor is found that has partial contributions to the imatrix data, instead of simply skipping it, we now
|
||||
* Check if it is a tensor containing experts
|
||||
* If so, count how many experts are missing data
|
||||
* If less than 5% of the experts are missing data, we
|
||||
- Warn the user, but still store the data in the imatrix file
|
||||
- Set the imatrix weights to 1 for the experts missing data
|
||||
|
||||
The rationale behind this approach is that if an expert was never activated after processing a significant amount of calibration data, this expert cannot be very important, so we can afford to quantize it with low bpw quants even without guidance on the importance of columns of this expert.
|
||||
|
||||
Strictly speaking it would be better to leave the zeros in the imatrix data of experts that have never been activated. But this would require to go and add proper protection against all-zeros imatrices, along with the appropriate corrective action, for all quants, and not just for `IQ1_S_R4` as I did in #191. So, for now we go with same-importance columns for never activated experts.
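A short sketch of the relaxed check described above (illustrative function and variable names, not the actual imatrix code):

```cpp
// Sketch only: for an experts tensor, keep the imatrix entry if at most ~5% of
// experts never received any data, defaulting the weights of those experts to 1.
#include <cstddef>
#include <cstdio>
#include <vector>

bool keep_experts_entry(std::vector<float> & values,           // imatrix sums, n_expert * n_per_expert entries
                        const std::vector<int> & expert_hits,  // activation count per expert
                        int n_per_expert) {
    const int n_expert = (int) expert_hits.size();
    int missing = 0;
    for (int e = 0; e < n_expert; ++e) {
        if (expert_hits[e] == 0) ++missing;
    }
    if (missing == 0) return true;
    if (missing > n_expert / 20) return false;   // more than ~5% missing: skip as before
    fprintf(stderr, "warning: %d of %d experts have no data; using weight 1 for them\n",
            missing, n_expert);
    for (int e = 0; e < n_expert; ++e) {
        if (expert_hits[e] != 0) continue;
        for (int j = 0; j < n_per_expert; ++j) {
            values[(std::size_t) e * n_per_expert + j] = 1.0f;  // same importance for all columns
        }
    }
    return true;
}
```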
|
||||
|
||||
---
|
||||
|
||||
#### 💬 Conversation
|
||||
|
||||
👤 **saood06** commented the **2025-02-11** at **17:09:17**:<br>
|
||||
|
||||
>for the entire tensor containing experts
|
||||
|
||||
Not entirely related to this, but do you know why GGUF stores all the experts together? (I just checked the initial PR in mainline for an MoE and no rationale was given for this).
|
||||
|
||||
I plan to port over code that lets you override where certain tensors are allocated, which allows you to store non-shared experts in RAM and everything else in VRAM. If the experts were not consolidated into one large tensor, this could easily allow for expert parallelism, which would benefit NUMA systems.
|
||||
|
||||
---
|
||||
|
||||
👤 **ikawrakow** commented the **2025-02-11** at **17:16:38**:<br>
|
||||
|
||||
> but do you know why GGUF stores all the experts together?
|
||||
|
||||
No I don't. The initial MoE implementation was not like that, and then it got changed. I have kept the ability to use the original version in my fork (so I don't need to re-download MoE models that were created before the change).
|
||||
@@ -0,0 +1,13 @@
|
||||
### 🐛 [#204](https://github.com/ikawrakow/ik_llama.cpp/pull/204) - Fix iqk_mul_mat on AVX512 systems that are missing BF16 support
|
||||
|
||||
| **Author** | `ikawrakow` |
|
||||
| :--- | :--- |
|
||||
| **State** | ❌ **Closed** |
|
||||
| **Created** | 2025-02-12 |
|
||||
| **Updated** | 2025-02-12 |
|
||||
|
||||
---
|
||||
|
||||
#### Description
|
||||
|
||||
Fixes #203
|
||||
Some files were not shown because too many files have changed in this diff.