Add GitHub data (#637)

Author: Thomas
Date: 2025-07-22 18:18:40 +02:00
Committed by: GitHub
Parent: 9513222ba5
Commit: 94aa54df76
626 changed files with 175142 additions and 0 deletions


@@ -0,0 +1,26 @@
### 🗣️ [#100](https://github.com/ikawrakow/ik_llama.cpp/discussions/100) - New argument / env variable for GGML_SCHED_MAX_COPIES?
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-10-21 |
| **Updated** | 2024-10-21 |
---
#### Description
@ikawrakow, could you set up a CLI argument (or at least an env variable, which is much simpler I guess, but I'm failing to do it right) to determine GGML_SCHED_MAX_COPIES without recompiling? It impacts VRAM occupation and performance, and it'd be great to be able to set it conveniently for benchmarking and customized use.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-10-21** at **08:29:25**:<br>
I haven't looked into this at all. What is it good for?
---
👤 **Nexesenex** replied the **2024-10-21** at **09:36:22**:<br>
It's supposed to speed up inference on multi-GPU setups, I guess. Mainline sets it at 4; I set it at 1 because I didn't notice much improvement back in the day, but I did notice more VRAM consumption and GPU load.
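A minimal sketch of what such a runtime override could look like, assuming a hypothetical `sched_max_copies()` helper on top of the existing compile-time define (this is not actual ggml code):
```cpp
#include <cstdlib>

#ifndef GGML_SCHED_MAX_COPIES
#define GGML_SCHED_MAX_COPIES 4   // compile-time default (mainline uses 4)
#endif

// Hypothetical helper: prefer an environment variable override and fall back
// to the compile-time value, clamped to a sane range.
static int sched_max_copies(void) {
    if (const char * env = std::getenv("GGML_SCHED_MAX_COPIES")) {
        int n = std::atoi(env);
        if (n >= 1 && n <= 16) {
            return n;
        }
    }
    return GGML_SCHED_MAX_COPIES;
}
```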


@@ -0,0 +1,22 @@
### 🗣️ [#104](https://github.com/ikawrakow/ik_llama.cpp/discussions/104) - Convenience improvements for llama-quantize
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-10-23 |
| **Updated** | 2024-10-23 |
---
#### Description
Hey IK.
Here are some ideas for potential features for llama-quantize that I'm not capable of coding myself:
- Create the output directory when it doesn't exist.
- Interrupt the quantization, or even **quantize each tensor into a directory**, so that the quantization can be resumed after a crash, or a single series of tensors can be requantized (the attn_q weights for example, or the tensors selected by a use_more_bits-style condition when you change the quant in one branch of the ternary statement deciding a given tensor's quantization but not in the other). The monolithic approach makes a pretty monstrous file and at the same time wastes a lot of space, time and compute.
- Integrate formulas like use_more_bits (we have one, and I intend to PR more of those) into the tensors that we manually specify with CLI arguments to customize an FTYPE.
- A pre-check of the available disk space before the quantization, ideally coupled with a dry run giving the final size of the desired quant (a rough sketch of such a pre-check follows this list).
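A rough sketch of what the directory-creation and free-space pre-checks (the first and last items above) could look like with `std::filesystem`; the helper name and error handling are illustrative, not ik_llama.cpp code:
```cpp
#include <cstdint>
#include <cstdio>
#include <filesystem>
#include <system_error>

namespace fs = std::filesystem;

// Hypothetical pre-check: create the output directory if needed and verify
// that the target volume has at least `needed_bytes` of free space.
static bool precheck_output(const fs::path & out_file, std::uintmax_t needed_bytes) {
    std::error_code ec;
    fs::path dir = out_file.parent_path();
    if (dir.empty()) dir = fs::current_path(ec);
    if (!fs::exists(dir, ec) && !fs::create_directories(dir, ec)) {
        std::fprintf(stderr, "cannot create output directory %s: %s\n",
                     dir.string().c_str(), ec.message().c_str());
        return false;
    }
    const fs::space_info si = fs::space(dir, ec);
    if (!ec && si.available < needed_bytes) {
        std::fprintf(stderr, "not enough free space: need %ju bytes, have %ju\n",
                     (std::uintmax_t) needed_bytes, (std::uintmax_t) si.available);
        return false;
    }
    return true;
}
```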


@@ -0,0 +1,276 @@
### 🗣️ [#140](https://github.com/ikawrakow/ik_llama.cpp/discussions/140) - Questions about weight[j]
| **Author** | `DavidZyy` |
| :--- | :--- |
| **Created** | 2024-12-13 |
| **Updated** | 2025-02-11 |
---
#### Description
Hi @ikawrakow, your work on quantization is amazing and I really admire it. Recently I have been reading the code around this and have some questions.
For example, in the function `quantize_row_q4_0_impl` and other places, `weight[j]` is:
```cpp
weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);
```
I have already seen some discussion [here](https://github.com/ggerganov/llama.cpp/discussions/5263#discussioncomment-11511794), but I still don't quite understand. Can you give me some guidance? Why not use the following directly?
```cpp
weight[j] = qw[j]
```
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-12-14** at **08:13:19**:<br>
Hi @DavidZyy,
this is simply an empirical correction, there is no science behind it (and it was amusing to observe people trying to make scientific sense out of it). From the pre-imatrix days we have learned that it is better to assign higher weights (importance) to model weights with larger magnitudes in a weighted RMSE minimization. As there is no precise science behind that, it was just a matter of experimentation to determine how this higher importance should look like ($x^2$, $|x|$, $\sigma^2 + x^2$, $\sigma + |x|$, etc., are all variations that have been tried). When I introduced the imatrix, the hope was of course that one can get rid of such non-scientific stuff and just use the diagonal elements of the Hessian. But in practice it is rarely as simple as that. Having the $\sqrt{\sigma^2 + x^2}$ in there does improve quantization accuracy, at least as measured by perplexity or KL-divergence.
Why $\sqrt{\sigma^2 + x^2}$ and not something else?
* As the Hessian already gives a lot of information about model weight importance, at some level it should be clear that the empirical correction cannot be as strongly magnitude dependent as it was without the imatrix
* We definitely do not want to have the importance of small-magnitude weights become (nearly) zero
* Based on the above two bullet points, and the experience from pre-imatrix quantization, $\sqrt{\sigma^2 + x^2}$ was an obvious choice that turned out to work better than anything else I tried
Why the need for correcting the Hessian in the first place?
* We are using just the diagonal elements, which is an approximation. In my experience adding a correction to an approximation often improves things
* From a more conceptual point of view, even if we did use the full Hessian, we still don't know if RMSE between the quantized and the full model weights is the similarity measure that we should be minimizing. RMSE is of course very convenient (expressions are very simple), so not knowing what to minimize we just use that. But in reality another similarity measure may be better, and it will have a different Hessian, so a different importance matrix, so we are back to square one where the importances being used are just a matter of empirical experimentation.
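Putting the above into a formula (my notation, not code from the repository): for a block of model weights $x_1, \dots, x_N$ quantized to integers $q_j$ with a scale $d$, the scale is chosen to minimize a weighted error of roughly the form

$$F(d) = \sum_j w_j \left(x_j - d\, q_j\right)^2, \qquad w_j = \mathrm{qw}_j \,\sqrt{\sigma^2 + x_j^2},$$

where $\mathrm{qw}_j$ are the imatrix (diagonal Hessian) importances and $\sigma^2$ is the mean of $x_j^2$ over the row or super-block (depending on the quant type). Setting $w_j = \mathrm{qw}_j$ is the "pure Hessian" weighting the question asks about.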
---
👤 **DavidZyy** replied the **2024-12-14** at **13:58:43**:<br>
Thanks for taking time to answer this question and share information, I learned a lot from your answers.
Yes, it's very interesting :)
> (and it was amusing to observe people trying to make scientific sense out of it)
---
👤 **jukofyork** replied the **2025-02-10** at **17:03:34**:<br>
Oh shit, I just realised I totally forgot to reply to this post! @ikawrakow Thanks for the explanation!
FWIW, I actually tested a couple of different schemes that were more grounded in regularisation theory, but they performed worse than your empirical method. It would still be nice to find some way to interpolate between the two extremes; the recent 256-expert MoEs being a good case in point!
I did manage to fix some of this back when `dbrx` first dropped:
https://github.com/ggerganov/llama.cpp/pull/7099
IIRC, all the main discussion is in this PR:
https://github.com/ggerganov/llama.cpp/pull/6387#issuecomment-2094926182
but I still suspect that for these new very-high-expert-MoEs it should really be down-regularised compared to non-MoE or older low-expert-count-MoEs.
---
👤 **ikawrakow** replied the **2025-02-10** at **18:07:55**:<br>
@jukofyork So, I have used regularization in a variety of contexts. Sadly, having spent the better part of my career in Medical Device where everything is closed source, there aren't many examples of that in the open. [This repository](https://github.com/ikawrakow/mnist) uses Tikhonov regularization for the training of an SVM model to recognize hand-written digits. I put it out there because I find it funny that with fewer lines of code I can beat the [ggml mnist example](https://github.com/ggml-org/ggml/tree/master/examples/mnist) by a huge margin (0.4% vs 2% error rate, so 5X lower). But having used regularization techniques in deformable image registration, large scale optimization of radiation therapy treatments, real-time target and/or critical organ tracking on live MRI images, MR and PET image reconstruction, etc., I think I know quite well when regularization is required, and LLM quantization is not one of the cases where it is, at least not in the classical sense of adding penalty term(s) to the optimization objective. For instance, the Tikhonov regularization that was being proposed in one of the discussions is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term. At some level, one can consider i-quants as using "regularization" via forcing groups of quants to fall on a finite set of grid points, the set being much smaller than all possible grid points for the given number of bits per quant. E.g., `IQ2_XXS` uses 256 out of 6561 points on the E8 lattice. This prevents overfitting, thus can be considered as "regularization".
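For reference, the classical (ridge) form of Tikhonov regularization adds a quadratic penalty to the least-squares objective,

$$\min_q \; \lVert x - q \rVert^2 + \lambda \lVert q \rVert^2,$$

which by construction shrinks the solution $q$ towards zero as $\lambda$ grows.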
The other thing I have learned is that theories are rarely useful in their pure form. More often than not, you start with this beautiful theory to only find that it does not work very well in practice. So, you start adding fudge factors, and things get better. And then you add even more fudge factors and it gets better. When you are done with it you have something that works really well, but you barely recognize the beautiful pure theory you started from.
Just my 2 cents
> 👤 **jukofyork** replied the **2025-02-10** at **19:26:00**:<br>
> > For instance, Tikhonov regularization that was being proposed in one of the discussions, is pretty much the last thing we want to do when quantizing because we definitely do not want to make the quantized values as small as possible, which is the goal of the Tikhonov regularization term.
>
> I was late to that discussion, but it was possibly me who mentioned this.
>
> If it was, then I wouldn't have been proposing to use Tikhonov regularization on the weighting factors themselves to drive them to zero, as I agree this makes no sense. I would have suggested regularising the log of the weighting factors towards zero, which in turn regularises the weighting factors to 1 (ie: all equally weighted), whilst retaining the multiplicative symmetry around 1 and enforcing the non-negativity.
>
> From a Bayesian perspective:
>
> - Tikhonov regularization of the weights assumes some Gaussian prior centred around zero with lambda controlling the scale (which is obviously not what we want here).
> - Tikhonov regularization of the log of the weights assumes some [log-normal](https://en.wikipedia.org/wiki/Log-normal_distribution) prior centred around 1 with lambda controlling the (log) scale.
>
> I'm pretty sure I tried this way back when I mentioned this in that thread and it did turn out to be slightly worse than your empirically derived method on whatever model I tried it on.
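>
> In formula form (just a sketch of the idea): instead of penalizing the weighting factors $w_j$ directly, penalize their logs,
>
> $$\min_{w}\; E(w) + \lambda \sum_j \left(\log w_j\right)^2,$$
>
> where $E(w)$ is whatever quantization error is being minimized. The penalty vanishes at $w_j = 1$ (all weights equal), keeps every $w_j$ positive, and corresponds to the log-normal prior described above.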
>
> ---
>
> I still think this is an important area to consider (whatever the chosen regularization method is):
>
> #### (A) I see people still using bartowski's same ~250kb `calibration_datav3.txt` file on `Deepseek-V3` as on fully-dense models.
>
> IMO, this has two huge problems:
>
> 1. The effective sample size is *at best* 1/32 = ~3% compared to a dense model.
> 2. If the router penalty hasn't done a good job during training, the effective sample size is potentially (much) lower than 3%.
>
> This can be corrected by either increasing the sample size, or, where that is not possible (say due to the model being too large), adjusting the regularisation factor appropriately.
>
> #### (B) I see people using `wiki.train.raw` for the `imatrix` and then testing on `wiki.test.raw` (not so much now thankfully).
>
> Thinking they are getting an unbiased estimate of the `imatrix`'s perplexity improvement:
>
> ##### wiki.train.raw
> ```
> = Valkyria Chronicles III =
>
> Senjō no Valkyria 3 : Unrecorded Chronicles ( Japanese : 戦場のヴァルキュリア3 , lit . Valkyria of the Battlefield 3 ) , commonly referred to as Valkyria Chronicles III outside Japan , is a tactical role @-@ playing video game developed by Sega and Media.Vision for the PlayStation Portable . Released in January 2011 in Japan , it is the third game in the Valkyria series . Employing the same fusion of tactical and real @-@ time gameplay as its predecessors , the story runs parallel to the first game and follows the " Nameless " , a penal military unit serving the nation of Gallia during the Second Europan War who perform secret black operations and are pitted against the Imperial unit " Calamaty Raven " .
> The game began development in 2010 , carrying over a large portion of the work done on Valkyria Chronicles II . While it retained the standard features of the series , it also underwent multiple adjustments , such as making the game more forgiving for series newcomers . Character designer Raita Honjou and composer Hitoshi Sakimoto both returned from previous entries , along with Valkyria Chronicles II director Takeshi Ozawa . A large team of writers handled the script . The game 's opening theme was sung by May 'n .
> It met with positive sales in Japan , and was praised by both Japanese and western critics . After release , it received downloadable content , along with an expanded edition in November of that year . It was also adapted into manga and an original video animation series . Due to low sales of Valkyria Chronicles II , Valkyria Chronicles III was not localized , but a fan translation compatible with the game 's expanded edition was released in 2014 . Media.Vision would return to the franchise with the development of Valkyria : Azure Revolution for the PlayStation 4 .
>
> = = Gameplay = =
> ```
>
> ##### wiki.test.raw
>
> ```
> = Robert Boulter =
>
> Robert Boulter is an English film , television and theatre actor . He had a guest @-@ starring role on the television series The Bill in 2000 . This was followed by a starring role in the play Herons written by Simon Stephens , which was performed in 2001 at the Royal Court Theatre . He had a guest role in the television series Judge John Deed in 2002 . In 2004 Boulter landed a role as " Craig " in the episode " Teddy 's Story " of the television series The Long Firm ; he starred alongside actors Mark Strong and Derek Jacobi . He was cast in the 2005 theatre productions of the Philip Ridley play Mercury Fur , which was performed at the Drum Theatre in Plymouth and the Menier Chocolate Factory in London . He was directed by John Tiffany and starred alongside Ben Whishaw , Shane Zaza , Harry Kent , Fraser Ayres , Sophie Stanton and Dominic Hall .
> In 2006 , Boulter starred alongside Whishaw in the play Citizenship written by Mark Ravenhill . He appeared on a 2006 episode of the television series , Doctors , followed by a role in the 2007 theatre production of How to Curse directed by Josie Rourke . How to Curse was performed at Bush Theatre in the London Borough of Hammersmith and Fulham . Boulter starred in two films in 2008 , Daylight Robbery by filmmaker Paris Leonti , and Donkey Punch directed by Olly Blackburn . In May 2008 , Boulter made a guest appearance on a two @-@ part episode arc of the television series Waking the Dead , followed by an appearance on the television series Survivors in November 2008 . He had a recurring role in ten episodes of the television series Casualty in 2010 , as " Kieron Fletcher " . Boulter starred in the 2011 film Mercenaries directed by Paris Leonti .
>
> = = Career = =
> ```
>
> It should be really clear why this is a bad idea.
>
> #### (C) I see people running the `imatrix` calculation on only the first `512` tokens of models with huge contexts.
>
> This is clearly a *very* bad idea for several reasons related to the transformer architecture, likely biases the weighting factors to short sequences and also under-represents (part of) the tensors in the transformer blocks vs the MLP blocks.
>
> ---
>
> I am certainly no "Bayesian purist" and will happily tune the prior to get the best observed results too!
>
> BUT: I strongly believe the effectiveness of the `imatrix` calculations could be vastly improved by adding some method of interpolation/regularisation/whatever to allow for informed tuning of the weighting factors! :smile:
>
> 👤 **saood06** replied the **2025-02-10** at **20:23:18**:<br>
> > I still think this is an important area to consider (whatever the chosen regularization method is):
> > #### (A) I see people still using bartowski's same ~250kb `calibration_datav3.txt` file on `Deepseek-V3` as on fully-dense models.
> >
> > IMO, this has two huge problems:
> >
> > 1. The effective sample size is _at best_ 1/32 = ~3% compared to a dense model.
> >
> > 2. If the router penalty hasn't done a good job during training, the effective sample size is potentially (much) lower than 3%.
> >
> >
> > This can be corrected by either increasing the sample size, or, where that is not possible (say due to the model being too large), adjusting the regularisation factor appropriately.
>
> There is some discussion among a huggingface quant maker about computing an imatrix for arctic-instruct (another large MoE), where they talked about how, since the experts are stored together in one tensor, the entire layer can't be quantized if even one expert of that layer is missing. Also, while investigating this and trying to get that expert to activate, they observed something showing that size alone doesn't matter as much as the diversity of the data.
>
> "the only ones that has 127 out of 128 experts other than yours was "calibration_datav3" from bartowski and " imatrix-with-rp-format-data". Many datasets got way less experts than that. It clearly is the quality of training data and not the amount that matters. 4chan pol_062016-112019_labeled is massive but when I aborted it, it only had 122 out of 128 experts on layer 0. MMLU which I though is really diverse only managed to trigger 121 out of 121 experts on layer 0. "Tech-Awesome-Hub/mix-data" was with just 120 out of 128 experts on layer 0 even worse than that."
>
> From: https://huggingface.co/mradermacher/BabyHercules-4x150M-GGUF/discussions/3#6758d52499eea0c4b65d0475
>
> They do discuss the idea of needing more data because of MoE in that thread. I use their imatrix.dat files, and the ppl numbers I gave you are for IQ4_K_R4.
>
> 👤 **ikawrakow** replied the **2025-02-11** at **06:01:32**:<br>
> Is the inability to activate all experts observed just for layer 0 or for all layers?
>
> Are people aware of the fact that one can run the model with more active experts than specified by the meta data?
> ```
> ./bin/llama-imatrix -m some_model -f some_training --override-kv deepseek2.expert_used_count=int:N
> ```
> I think doing that will likely help activate more experts.
>
> I also don't understand why the entire experts tensor cannot be imatrix-quantized if just one expert is missing. If that's what we ended up with, it definitely needs fixing.
>
> 👤 **saood06** replied the **2025-02-11** at **15:17:30**:<br>
> > Is the inability to activate al experts observed just for layer 0 or for all layers?
>
> Just layer 0.
>
> > Are people aware of the fact that one can run the model with more active experts than specified by the meta data?
> >
> > ```
> > ./bin/llama-imatrix -m some_model -f some_training --override-kv deepseek2.expert_used_count=int:N
> > ```
> >
> > I think doing that will likely help activate more experts.
>
> Yes, people are aware of that (not sure if these people are), since I've seen plenty of testing with that override set to various values every time a popular MoE comes out, but are you sure that is recommended? LLM performance tends to drop if you activate more or fewer experts than the trained-upon amount.
>
>
> > I also don't understand why the entire experts tensor cannot be imatrix-quantized if just one expert is missing. If that's what we ended up with, it definitely needs fixing.
>
> That is what happens. When computing the imatrix they hit this (it happened with other layers and tensors too, but this is the only one that persisted through the entire imatrix run):
>
> ```save_imatrix: entry ' blk.0.ffn_gate_exps.weight' has partial data (99.22%) - skipping```
>
> This led to them not releasing IQ1 quants, as it runs into this:
>
> ```llama_model_quantize: failed to quantize: Missing importance matrix for tensor blk.0.ffn_gate_exps.weight in a very low-bit quantization```
>
>
> They never reported that for any of the Deepseek models, so I'm assuming they only encountered it with arctic. No matter what they did, they were never able to activate that expert, so I'm giving some credence to their theory that "There indeed could be an issue in the model router that makes it impossible to ever get routed to this specific expert which would be really unfortunate."
>
> Looking at the safetensors files, each expert is stored separately, but with a GGUF that is not the case and they are all stored together.
>
> 👤 **ikawrakow** replied the **2025-02-11** at **16:33:38**:<br>
> Thanks for making me aware of this situation. I prepared PR #202 to deal with it.
>
> 👤 **ikawrakow** replied the **2025-02-11** at **17:11:08**:<br>
> > but are you sure that is recommended?
>
> I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counter productive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per meta data):
> * For 7 experts PPL is slightly lower (-0.2%)
> * For 8 and 9 experts it is about the same
> * For 10 experts PPL is ~0.3% higher.
>
> 👤 **saood06** replied the **2025-02-11** at **17:27:49**:<br>
> With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs." and this same person tested 2,3,4,6,8,16 of unsloth's IQ1_M. His results below.
>
> Experts | PPL
> -- | --
> 8 | 3.4155, 4.2311, 3.0817, 2.8601, 2.6933, 2.5792, 2.5123, 2.5239
> 16 | 3.5350, 4.3594, 3.0307, 2.8619, 2.7227, 2.6664, 2.6288, 2.6568
> 6 | 3.4227, 4.2400, 3.1610, 2.9933, 2.8307, 2.7110, 2.6253, 2.6488
> 4 | 3.5790, 4.5984, 3.5135, 3.4490, 3.2952, 3.2563, 3.1883, 3.2978
> 3 | 3.9209, 4.9318, 4.0944, 4.2450, 4.2071, 4.3095, 4.3150, 4.6082
> 2 | 6.2387, 7.7455
>
> Here's another user who reported only lower expert usage.
>
>
> Model | [1] | [2] | [3] | [4] | [5] | [6] | [7] | [8]
> -- | -- | -- | -- | -- | -- | -- | -- | --
> IQ2_XXS | 3.39 | 4.56 | 3.44 | 3.27 | 3.27 | 3.20 | 3.12 | 3.12
> IQ3_XXS (exp=3) | 3.12 | 4.03 | 2.93 | 2.63 | 2.52 | 2.48 | 2.45 | 2.48
> IQ3_XXS (exp=4) | 2.87 | 3.61 | 2.60 | 2.25 | 2.09 | 1.97 | 1.89 | 1.87
> IQ3_XXS (exp=6) | 2.67 | 3.53 | 2.53 | 2.13 | 1.94 | 1.80 | 1.71 | 1.65
> IQ3_XXS (def) | 2.69 | 3.53 | 2.51 | 2.11 | 1.91 | 1.78 | 1.69 | 1.62
>
> 👤 **jukofyork** replied the **2025-02-11** at **19:22:47**:<br>
> > > but are you sure that is recommended?
> >
> > I don't know if it is recommended. What I do know is that one can improve low bpw quantization by using a slightly higher number of active experts. E.g., for DeepSeek-Lite, 8 instead of 6 active experts is distinctly better for `IQ1_S` and `IQ1_M`. IIRC, 3 instead of 2 active experts did improve `IQ1_S` and `IQ1_M` quantized Mixtral8x7. As you increase the bpw the advantage goes away and eventually becomes counter productive. Using 3 instead of 2 experts for Mixtral8x7 was futile at 4+ bpw. But these new models have way more experts and more active experts, so activating additional experts is more forgiving. A quick check with DeepSeek-Lite (6 active experts as per meta data):
> >
> > * For 7 experts PPL is slightly lower (-0.2%)
> >
> > * For 8 and 9 experts it is about the same
> >
> > * For 10 experts PPL is ~0.3% higher.
>
> Yeah, I managed to do this with `dbrx` before the PR that fixes the divisors for the experts separately. IIRC, I actually activated all the experts for `dbrx` and it got a better resulting `imatrix` than the pre-PR code did, and was quite usable.
>
> 👤 **jukofyork** replied the **2025-02-11** at **19:24:47**:<br>
> > With R1 I've come across a person saying "I tried with 10 and 12 experts and generating perplexity failed with NaNs." and this same person tested 2,3,4,6,8,16 of unsloth's IQ1_M. His results below.
>
> This could be because most previous MoEs use softmax to gate/weight with, so as you add more experts it scales down the weights, but `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger (you can probably also hack the weights and bias to counter this though).
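>
> In symbols (my paraphrase of the contrast above): with softmax gating the selected experts' weights are normalized,
>
> $$w_i = \frac{e^{s_i}}{\sum_{j \in \text{top-}k} e^{s_j}}, \qquad \sum_i w_i = 1 \text{ for any } k,$$
>
> whereas an un-normalized sigmoid gate gives each expert $w_i = \sigma(s_i) \in (0, 1)$ independently, so the expected sum of the selected weights grows roughly linearly with $k$.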
>
> EDIT:
>
> ```
> INFO:hf-to-gguf:blk.11.exp_probs_b.bias, torch.float32 --> F32, shape = {256}
> INFO:hf-to-gguf:blk.11.ffn_gate_inp.weight, torch.bfloat16 --> F32, shape = {7168, 256}
> ```
>
> 👤 **saood06** replied the **2025-02-11** at **20:24:39**:<br>
> > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
>
> Then why does 16 experts work, but not 10/12?
>
> 👤 **jukofyork** replied the **2025-02-11** at **20:33:32**:<br>
> > > `deepseek-v3` uses sigmoids, so the sum getting added into the hidden state will get larger and larger
> >
> > Then why does 16 experts work, but not 10/12?
>
> Not sure - seems very strange!
>
> The only thing I can think of is that some experts have negatively correlated outputs, and the sum of 16 cancels out the error that overflows, whereas with 10 or 12 it doesn't?


@@ -0,0 +1,293 @@
### 🗣️ [#15](https://github.com/ikawrakow/ik_llama.cpp/discussions/15) - Will LQER improve k- and i-quants?
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-08-09 |
| **Updated** | 2025-07-12 |
---
#### Description
[LQER/L²QER](https://arxiv.org/pdf/2402.02446) is the latest hype about LLM quantization. Promptly, there is an [issue](https://github.com/ggerganov/llama.cpp/discussions/8831) in `llama.cpp` to use that to improve the existing quantization methods because, you know, the grass is always greener on the other side of the road. But, unlike many earlier calls to improve quantization with the latest "SOTA" quantization advertisement, err, scientific paper, on arXiv, there are already efforts underway to actually implement this. E.g., [this PR](https://github.com/ggerganov/llama.cpp/pull/8939) adds Numpy dequantization so one can use Numpy to do the SVD of the difference between the full model and a quantized model.
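For reference, the idea (as I read the paper) is to approximate each weight matrix as the quantized matrix plus a low-rank correction of the quantization residual,

$$W \approx W_q + A B^\top, \qquad A B^\top \approx \mathrm{SVD}_r\!\left(W - W_q\right),$$

with the rank-$r$ factors $A, B$ kept in higher precision and applied as an extra low-rank matrix multiplication at inference time; L²QER additionally scales the residual by activation statistics before taking the SVD.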
People are of course free to spend their energy any way they see fit, and I should rather mind my own business, but I couldn't help myself but put this prediction on the record:
**LQER/L²QER will not help to improve any of the k- or I-quants in `llama.cpp`.**
Why do I think so?
Having spent so much time on developing all k- and i-quants in `llama.cpp`, I basically remember perplexity (PPL) values for a lot of models, especially the early ones such as LLaMA-v1 and LLaMA-v2. And these are exactly the models the LQER authors compare their quantization against in Table 3 of the paper. So, for me, just a quick look was sufficient to see that the results of the paper are nowhere near as SOTA as they are being advertised. But let's do the comparison. I reproduce Table 3.1 here for convenience:
<img width="1561" alt="Screenshot 2024-08-09 at 1 39 34 PM" src="https://github.com/user-attachments/assets/92f1e85f-83a8-4f51-bc6a-0f7ebfeb21d8">
Activation quantization is not quite there yet in `llama.cpp`, so we will focus on the upper part of the table, which shows results when only the model weights are quantized. Let us do some comparisons. I'll use `Q4_K_S`, `IQ4_XS`, and the newly added `IQ4_K` and `IQ3_K`. The L²QER quantization is 4.3 bpw, so it is in the same range as `IQ4_XS` (4.25 bpw) and `Q4_K_S/IQ4_K` (4.5 bpw). `IQ3_K` (3.4 bpw) is there to put things into perspective.
I have archived my LLaMA-v1 models and didn't feel like restoring (or re-downloading) the 33B and 65B models, so we will look at 7B and 13B. The PPL results in the paper are computed with standard Python tooling, and it is known that perplexities computed with `llama.cpp` can be quite different from what people get in the Python Universe. But the ratio of the quantized PPL to the PPL of the `f16` model is nearly independent of the way PPL has been computed. The authors of the LQER paper have chosen to use the difference `PPL(Q) - PPL(f16)` (the ∆PPL column in Table 3), which is basically the same thing. Nevertheless, let's put some effort into making `llama.cpp` PPL more comparable to Python tooling. As far as I can tell, there are two main differences in how PPL is computed:
* In `llama.cpp` PPL is evaluated by sequentially going over the provided evaluation text, while in Python samples of the given context length are selected at random. This should not result in a different result, at least not beyond the statistical uncertainty of the PPL estimate, so I did not change `llama.cpp`.
* In `llama.cpp` the mean log probability is evaluated over the second half of the context window `n_ctx`, while in Python the whole context window is used. Both are approximations to PPL for a context of `n_ctx`. The `llama.cpp` approximation is better (to first order, it reports PPL for `3/4 n_ctx`, while the Python estimate is for `1/2 n_ctx`). Nevertheless, let's just change it in `llama.cpp` by adjusting [this line](https://github.com/ikawrakow/ik_llama.cpp/blob/a9f302ebe2373321c12b01d8760904901aa064a4/examples/perplexity/perplexity.cpp#L567). But instead of just using `first = 1`, I experimented a bit and ended up using `first = std::max(1, n_ctx/128)`, which gave the closest match between `llama.cpp` and the values reported in Table 3 of the LQER paper (which are for a context of 2048; I know this from other quantization papers, which quote the same `f16` `PPL` values and explicitly state the context window used). The adjustment is sketched right after this list.
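Schematically, the change amounts to something like this (function and variable names are illustrative; the linked line is the actual code being adjusted):
```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Schematic perplexity over one window of size n_ctx, given the probability
// assigned to each correct next token. llama.cpp normally skips the first
// n_ctx/2 tokens of each window; the adjustment skips only a short prefix.
static double window_ppl(const std::vector<double> & prob_of_next_token, int n_ctx) {
    const int first = std::max(1, n_ctx/128);   // was: n_ctx/2
    double nll = 0.0;
    int count = 0;
    for (int j = first; j < n_ctx - 1; ++j) {
        nll += -std::log(prob_of_next_token[j]);
        ++count;
    }
    return std::exp(nll / count);
}
```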
The following table shows the `llama.cpp` `f16` perplexities for the full models computed with this modification:
| LLaMA-v1-7B | LLaMA-v1-13B | LLaMA-v2-7B | LLaMA-v2-13B |
| -------------- | -------------- | -------------- | --------------- |
| 5.6291 +/- 0.02202 | 5.0172 +/- 0.01893 | 5.4802 +/- 0.02128 | 4.8706 +/- 0.01824 |
OK, we can now do the comparison. The table shows ∆PPL for the 4 LLaMA models and the 4 different quantization types. For more convenient comparison I have also added the L²QER result.
| Quantization | bpw | LLaMA-v1-7B | LLaMA-v1-13B | LLaMA-v2-7B | LLaMA-v2-13B |
| ------- | ----- | --- | ---- | ---- | ---- |
| L²QER | 4.30 | 0.220 | 0.100 | 0.100 | 0.060 |
| IQ3_K | 3.43 | 0.220 | 0.142 | 0.114 | 0.144 |
| IQ4_XS | 4.25 | 0.075 | 0.054 | 0.065 | 0.048 |
| Q4_K_S | 4.50 | 0.065 | 0.041 | 0.063 | 0.044 |
| IQ4_K | 4.50 | 0.041 | 0.033 | 0.043 | 0.034 |
I think the difference in performance is clear, and no further discussion is required.
I made [this comment](https://github.com/ggerganov/llama.cpp/pull/729#issuecomment-1519038289) back in April of 2023. I had just gotten involved with `llama.cpp` and had started thinking about the quantization of LLMs. With SVD being a standard tool in the toolbox of an ML practitioner, it was one of the first things that came to mind. Did I try? Of course I did - with disappointing results: one needed way too many terms to be competitive with block-wise quantization (I had already started working on k-quants). It is of course possible that my SVD attempts weren't good, and the LQER authors were able to get something out of SVD. But my guess is it is a matter of the quality of the quantization to begin with: if the quality is low, then perhaps one can improve with just the first few components of the singular value decomposition. But if one still has a 2X - 5X larger quantization error **after** having done that, it is extremely unlikely that one can improve the much better quants by using just a few SVD terms. So, based on this, I reach the above conclusion.
Pinging @compilade who seems to be the main driving force behind implementing LQER in `llama.cpp` just in case this is somehow useful.
---
#### 🗣️ Discussion
👤 **compilade** replied the **2024-08-09** at **15:12:32**:<br>
Thanks for pinging me, it's interesting to learn about your past attempts with SVD.
In the LQER paper they don't seem to use it on top of SOTA quantization methods (they seem to use it on top of MXINT), so I'm simply curious to see if it's viable to apply it on top of k-quants and i-quants.
It might not be worth it, though, as you say.
But there's also something else which they did not try in the paper: subtracting a low-rank decomposition of the weights to then quantize only what remains, while the LoRA adapter of the quantization error should be able to recover it. I did not yet experiment with different ranks for both of these low-rank approximations.
And in my preliminary tests this *does* help with pure `Q2_K` compared to plain LQER, but wasn't really better than the default `Q2_K` mix (which also uses `Q3_K` in some places), at least on a small model (OpenELM-270M), and with F16 LoRA and a rank of 32.
It's possible that a specialized quantization type for the not-low-rank part of weights could be useful, but I did not yet study how the distribution is changed when subtracting a low-rank approximation. My hypothesis is that non-linear asymmetric quant types have an advantage for this, so the new `IQ2_K` and `IQ3_K` *might* be well suited for this.
I did not yet implement L²QER, so I don't know how it would perform yet. You're likely very right that it won't be good, but I want to try, because it will enable other experiments like different error-minimization objectives for the quantized dense tensor and the low-rank adapter.
Also, I have not yet implemented Numpy dequantization for most of the `IQ` types, only `IQ4_NL` and `IQ4_XS`, because the grids for the others are a bit large. Ideally, they should be generated at runtime with a minimal amount of magic numbers. Is that possible?
---
👤 **ikawrakow** replied the **2024-08-09** at **16:01:22**:<br>
> Also, I have not yet implemented Numpy dequantization for most of the IQ types, only IQ4_NL and IQ4_XS, because the grids for the others are a bit large. Ideally, they should be generated at runtime with a minimal amount of magic numbers. Is that possible?
Perhaps you should ask Georgi? According to `git blame` he is the author of most of the `IQ` tables.
But more seriously: the short answer is 'no'. To generate these tables, I quantized a bunch of models using the full E8 or D4 lattice, and collected statistics on how often each lattice point is being used. This data is already orders of magnitude larger than the final `IQ` tables (and it takes quite some time to generate). I then ran an optimization that attempts to a) maximize the use count of selected lattice points and b) minimize the maximum (or count-averaged) distance from unselected lattice points to the nearest selected lattice point. I haven't published the code that does these things. But even if I had, the run time of the optimization is much too long to be invoked each time (and the lattice point use statistics are much bigger than the tables). I'm also not sure why you think the tables are too large? The data fits in L1 cache, no? Or are we running this on computers with 16 kB of RAM?
> And in my preliminary tests this does help with pure Q2_K compared to plain LQER, but wasn't really better than the default Q2_K mix (which also uses Q3_K in some places), at least on a small model (OpenELM-270M), and with F16 LoRA and a rank of 32.
If you use enough principal components you will eventually get an improvement, of course. But the question is, given the extra bits spent, is the improvement better than what is achievable by using a different quant, using quantization mixes, etc., with the same extra bits spent. Also, as demonstrated by `IQ2_S` (and `IQ2_K` in this repo), `Q2_K` is far from optimal in terms of the compromise between quantization accuracy and quantized model size, so perhaps one could get something there.
> But there's also something else which they did not try in the paper: subtracting a low-rank decomposition of the weights to then quantize only what remains, while the LoRA adapter of the quantization error should be able to recover it. I did not yet experiment with different ranks for both of theses low-rank approximations.
This is the first thing I tried. If that had been successful, we would have gotten not just a model compression, but a massive increase in performance too as matrix multiplications with a low rank decomposition are much faster than using the full matrix. I did have moderate success with the `K` and `Q` tensors in the early layers of LLaMA-1, but anything else was just hopeless until you started approaching full SVD.
But then again, I'm one of those people suffering from the NIH syndrome, so used my own hand-rolled tools for this investigation. Perhaps you will be more lucky just using standard tooling.
---
👤 **ikawrakow** replied the **2024-08-27** at **15:11:01**:<br>
Btw, on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/try_svd) there is some exploration of using SVD before or after the quantization. I have misused the `quantize-stats` tool to look at how the root-mean-square-error (rmse) behaves as a function of the number of SVD components. One can do the SVD before or after quantization. Certainly not production quality, AVX2-only vectorization, very simple multi-threading, but still enough to see that SVD does not add any value to LLMs quantization when the quantization works reasonably well. I know it works because full SVD reduces rmse to zero.
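For reference (a standard fact, not something specific to that branch): if the matrix being decomposed has singular values $\sigma_1 \ge \sigma_2 \ge \dots$, its best rank-$r$ approximation in the Frobenius norm leaves a residual of $\sqrt{\sum_{i > r} \sigma_i^2}$ (Eckart–Young), so the rmse decreases monotonically with the number of SVD components and reaches zero at full rank.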
> 👤 **compilade** replied the **2024-08-27** at **16:59:19**:<br>
> Thanks!
>
> I see that when `SVD_BEFORE` is `false`, the initial output fed into `try_svd` is non-zero, and SVD is [done on the subtraction of input and output](https://github.com/ikawrakow/ik_llama.cpp/blob/63fc8014a25e5192b618e0d8f869f8c507c99793/examples/quantize-stats/quantize-stats.cpp#L317), which means this does look similar to LQER (while also quantizing the low-rank tensor?) if I understand it correctly. Still feels like a good proof of concept, even though it doesn't test using SVD both before quantization (to remove low-rank components from the input) *and* after (to then correct both the additional low-rank error and the quantization error) at the same time. It's helpful to know that plain LQER is worse than better quantization.
>
> I didn't really do any experiment lately towards LQER (and L²QER) because I was busy with other things, but this SVD implementation could likely be eventually useful for control vectors according to <https://github.com/ggerganov/llama.cpp/discussions/8831#discussioncomment-10227359> (cc @ngxson)
>
> For L²QER, I think `imatrix` files will probably need to use a less bespoke format, which means I think they could be GGUF files with `general.type` equal to `imatrix` (a bit like LoRA adapters have `general.type` equal to `adapter` since <https://github.com/ggerganov/llama.cpp/pull/8332>).
---
👤 **ikawrakow** replied the **2024-09-11** at **14:31:14**:<br>
@compilade With your PR-9400 in `llama.cpp` I now have to write GGUF loading and link against `ggml` when I want to take a quick look at an imatrix? Instead of just copy/pasting the 20 LOC of imatrix structure definition and (de-)serialization into a `.cpp` file and being done in 5 minutes? Ouch. And no, HF tools will with 99.99% probability not help me with what I'm interested in. I mean, having a Python imatrix to GGUF converter is I guess great for those who want to look at imatrix files on HF, but changing the imatrix tool to output GGUFs is a bit too much afaik.
Oh well, I'll need to keep my own copy of the `imatrix` and `quantize` tools.
> 👤 **ngxson** replied the **2024-09-11** at **15:17:56**:<br>
> Hi and sorry if this change disrupts your workflow.
>
> The main reason behind this change was that we want to unify file formats in llama.cpp. From the perspective of software engineering, it is needed because it could help abstract out some parts of the implementation, thus providing a better code base for more features to come in the future.
>
> Contrary to what you said (about having HF visualize the GGUF file), this change does in fact introduce a headache for the HF backend, since now we have to distinguish between GGUF model files and GGUF other-files (i.e. imatrix, cvector, lora). This is just to clarify that the main motivation of the change is refactoring code in llama.cpp.
>
> Besides that, I'm wondering if this could help you: there is the `gguf-py` package that allows GGUF files to be loaded into Python. You can then use `torch` to investigate the imatrix tensors.
>
> Another option would be to have a CLI arg in imatrix to select the output file format, although this may make the code a bit harder to maintain.
>
> In any case, I appreciate your work and would love to know if we can do anything to help you.
>
> 👤 **ikawrakow** replied the **2024-09-11** at **16:01:09**:<br>
> > In any case, I appreciate your work and would love to know if we can do anything to help you.
>
> Not merge PR-9400? Or just merge the imatrix to GGUF Python conversion script?
>
> I have written many tools that are not for public consumption but that I have used (and still use occasionally) to investigate various quantization strategies. They are nice, simple, stand-alone programs where I don't even need a Makefile or a CMakeLists.txt but can just do `g++ -O3 some_tool.cpp && ./a.out some_imatrix some_other_input`. They all become useless with this commit.
>
> > The main reason behind this change was that we want to unify file formats in llama.cpp.
> > Contrary to what you said (to have HF to visualize the GGUF file), in fact, this change does introduce a headache to HF backend,
>
> I see. We make a change that introduces headaches, triples or quadruples the code required to load/save such files thus magnifying the probability for bugs, and mandates linking against `libggml.so` for any tool that wants to operate with such files, to gain the benefit of "unifying file formats in llama.cpp"? Where the thing being unified is not some monstrous code with thousands of lines of code and massive maintenance burden but a 20 LOC thing that defines the format and implements (de-)serialization? Cool.
>
> 👤 **ikawrakow** replied the **2024-09-11** at **16:19:12**:<br>
> > From the perspective of software engineering, is needed because it could help abstract out some parts of the implementation, thus provide a better code base for more features to come in the future.
> ```
> ls -al ./ggml/src/libggml.so
> -rwxrwxr-x 1 iwan iwan 369408304 Sep 9 20:11 ./ggml/src/libggml.so
> ```
> Don't know about you, but having to link against a 370 MB `.so` to abstract 20 LoC does not add up afaik.
>
> 👤 **ngxson** replied the **2024-09-11** at **16:57:26**:<br>
> Regarding the merge decision, I can't determine whether it will be merged or not. My role is to provide clarity and explore options to help.
>
> The abstraction here isn't just about code length, but about creating a unified approach for tensor save/load operations within llama.cpp. In the future, this could also make it easier to add more parameters to imatrix.gguf file. It also allows more users to experiment with imatrix directly in the GGUF format, without needing conversions.
>
> I completely agree that linking against a 370 MB .so file is not desirable. However, it's worth noting that your `libggml.so` is likely built with CUDA support, which significantly increases its size. Also, the GGUF-related code is actually a small fraction of the whole ggml library.
>
> To address your specific workflow needs, I have a suggestion that might help: What if I provide you a header-only GGUF loader? This could potentially allow you to work with GGUF files without the need for linking against the full `libggml.so`. I've been considering this idea for a while, but couldn't find a valid usage for it.
>
> 👤 **compilade** replied the **2024-09-12** at **02:48:39**:<br>
> @ikawrakow Thanks for expressing concern about the format change.
>
> The main reason for it is that there doesn't seem to be a backward-compatible way to make the non-GGUF-based `imatrix` format work with many ubatches per chunk, or many chunks per ubatch (in the simple format, ncalls is tied to the ubatch size but is also somehow used as the number of chunks). It's also impossible to get the chunk size used to make a non-GGUF `imatrix` file from its metadata. (The convert script assumes 512 was used, but that's not always true. This is mostly relevant when merging `imatrix` files with `--in-file`)
>
> The non-GGUF `imatrix` files *are* simpler to deserialize, *but* that format has no way to be extended backward-compatibly, except by adding more stuff at the end and never ever removing any field. (And that format also doesn't have any magic number at the start, so not particularly easy to identify)
>
> I don't really want to break your scripts, though. Would a reverse convert script, from new to old format help (round-trip conversion tests can be used to test for correctness), or do you categorically oppose using GGUF for `imatrix` files? Should `llama-quantize` be able to load both formats instead of only one?
---
👤 **ikawrakow** replied the **2024-09-12** at **13:16:15**:<br>
@compilade Thank you for responding to my concerns.
> The main reason for it is that there doesn't seem to be a backward-compatible way to make the non-GGUF-based imatrix format work with many ubatches per chunk, or many chunks per ubatches (in the simple format, ncalls is tied to the ubatch size but is also somehow used as the number of chunks).
I must admit I don't understand the concerns. The issue is that one cannot (correctly) combine imatrices computed with different `u_batch` sizes? (One can always combine them, but the files will not contribute to the combined imatrix with the correct weight). Why would one want to do that? AFAIK, not needing to worry about batch and u-batch sizes is a feature, not a bug.
> It's also impossible to get the chunk size used to make a non-GGUF imatrix file from its metadata. (The convert script assumes 512 was used, but that's not always true. This is mostly relevant when merging imatrix files with --in-file)
Here is what I do
```
./bin/llama-imatrix -m some_model -f some_training_data -c some_context --chunks N -o some_imatrix_c${some_context}.out
```
I.e., my imatrix files always carry the context length that was used in their name. Worth noting that a) The context length has a surprisingly small influence on the quantization results b) One may want to combine imatrices computed with a different context length to see what happens (what context length are you going to record for the combined imatrix file?)
> The non-GGUF imatrix files are simpler to deserialize, but that format has no way to be extended backward-compatibly, except by adding more stuff at the end and never ever removing any field. (And that format also doesn't have any magic number at the start, so not particularly easy to identify)
The imatrix is one and only one thing. I wouldn't know how one wants to "extend" it without it no longer being an imatrix. But suppose we **really** wanted to extend it. Here is what I would do
```
void read_imatrix(std::istream& in, ...) {
    int n_entries;
    VersionInfo vinfo = {}; // default constructor makes sure we are dealing with a "legacy" imatrix file
    in.read((char *)&n_entries, sizeof(n_entries));
    if (n_entries == std::numeric_limits<int>::max()) {
        // imatrices with that many entries definitely do not exist
        // => we are dealing with an "extended" imatrix
        // read actual number of entries
        in.read((char *)&n_entries, sizeof(n_entries));
        // read version info
        read_version_info(vinfo);
    }
    ...
}
```
Voila, all existing imatrices continue to work, you can add whatever extensions you like (anywhere you like, not just at the end), we don't need to include `ggml/gguf` headers and link against a 370 MB `libggml.so`, etc.
> 👤 **compilade** replied the **2024-09-13** at **01:56:41**:<br>
> > I must admit I don't understand the concerns. The issue is that one cannot (correctly) combine imatrices computed with different `u_batch` sizes? (One can always combine them, but the files will not contribute to the combined imatrix with the correct weight). Why would one want to do that? AFAIK, not needing to worry about batch and u-batch sizes is a feature, not a bug.
>
> The sanest way to both not worry about batch sizes and correctly combine `imatrix` files is to store the number of tokens (or activations in this case) instead of the number of "chunks". This is what is done in the GGUF-based format. You're right that the chunk size in the metadata isn't really necessary. I *think* it would be possible to make it work that way in the simpler format, but there would still be some weirdness with MoE tensors.
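>
> Concretely (my notation): if two `imatrix` files hold per-weight means of squared activations $m_1$ and $m_2$, accumulated over $n_1$ and $n_2$ tokens respectively, the correctly merged mean is the token-weighted average
>
> $$m = \frac{n_1 m_1 + n_2 m_2}{n_1 + n_2},$$
>
> which is exact when token counts are stored, but only approximate when all that is available is a chunk count whose meaning depends on the batch size that was used.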
>
> I know using GGUF would make the `imatrix` format more complicated, but interoperability with existing and future GGUF tooling would be useful. For example, I'm working on some kind of `gguf-diff` tool which will compare tensors between GGUF files (dequantizing if necessary), and storing `imatrix` data as GGUF would make that tool work on `imatrix` files too without having to specially handle them.
>
> > what context length are you going to record for the combined imatrix file?
>
> The one used at the time of merging them (the last one). It seems like there is no good choice for the context length in that case.
>
> > Voila, all existing imatrices continue to work, you can add whatever extensions you like (anywhere you like, not just at the end)
>
> But the extensions would still break your scripts, so I don't see how it makes it better? It seems like all you want from this is that `imatrix` remains a trivially parsable format, even if it's changed?
>
> > we don't need to include `ggml/gguf` headers and link against a 370 MB `libggml.so`, etc.
>
> You're still using `llama-imatrix` (which does link against `libggml.so`) to generate those files.
>
> You know what, I think you're right to want to keep it simple. But GGUF-based `imatrix` also enables a bunch of stuff which is otherwise not possible. I will set <https://github.com/ggerganov/llama.cpp/pull/9400> as a draft, and then I'll try to make a compromise by making `llama-imatrix` *both* able to output the simple (somewhat backward-compatibly, but by storing the number of tokens as `ncall` instead of the number of chunks (the division by `ncall` will still result in a mean (of squares), so your existing scripts *should* continue to work)) *and* GGUF-based format (so that bidirectional conversion will be directly possible with `--in-file`. The GGUF-based `imatrix` format would only be used when the `--output` ends with `.gguf`, which it will by default), while I'll also try to make `llama-quantize` read both (basically falling back when loading as GGUF fails).
>
> It's gonna take me *at least* another week to implement that (not much free time this month, and lots of good conferences in my area).
>
> Not sure if supporting both formats will be viable long-term, though. But from this discussion I gather that both have reasons to exist.
>
> Basically, I think these are the arguments **for** each format:
>
> - Keeping the "simpler" `imatrix.dat` format
> - Simpler to parse
> - Only 20 LOC (closer to 50 LOC with proper error handling)
> - No need to link to `ggml` to load it
> - Allows self-contained programs to do experiments with it
> - Using GGUF for `imatrix`
> - Reduces the need for special-purpose formats
> - More interoperability with existing and future GGUF tooling
> - `gguf_dump.py`
> - HF previews
> - (eventually) `gguf-diff`
> - Trivially extensible
> - More metadata
> - Easier type changes for metadata (e.g. `int32` vs `int64`)
> - Counts are multi-dimensional
> - For stacked MoE tensors, each expert gets its own activation count
> - Allows keeping the sums intact
> - Great for merging `imatrix` files
>
> And the arguments against:
>
> - Keeping the "simpler" `imatrix.dat` format
> - Not trivially identifiable (no magic number as file header)
> - Weird serialization of MoE activation sums (scaled to use the same chunk count for the whole tensor)
> - Hard to backward-compatibly extend
> - (although some kind of extension *is* possible, it will pretty much always cause breaking changes)
> - Need to write a special-purpose `imatrix_reader` in `gguf-py`
> - Using GGUF for `imatrix`:
> - Depends on more code to load/save such files
> - which means more probability for bugs
> - (although since that code is shared with model loading, noticing/fixing bugs there benefit everything which uses it)
> - Can't make stand-alone programs for quantization experiments like before
> - Need to link to `libggml.so` to use GGUF-based `imatrix` files
> - Or need to include some `gguf.h` header-only library
>
> 👤 **compilade** replied the **2025-07-12** at **14:18:22**:<br>
> @ikawrakow
>
> I made changes to <https://github.com/ggml-org/llama.cpp/pull/9400> since last time.
>
> Is it sufficient for `llama-imatrix` to use the GGUF format only when the output filename ends with `.gguf`, so that if you keep using old output names, you'll still get the same format your scripts can work with?
>
> Similarly, conversion back to the previous format is now implemented, and is used like resuming an `imatrix` file but without a dataset, and where the output format ends with anything else than `.gguf`:
>
> ```console
> $ ./bin/llama-imatrix --in-file imatrix.gguf -o imatrix.dat
> ```
>
> `imatrix.gguf` files can always be converted to the `imatrix.dat` format, but the reverse lacks some shape information for 3d tensor evaluation counts (which is necessary to handle partial data gracefully in stacked MoE tensors). Both directions still work, though. `llama-quantize` can read both formats.
>
> I've had some complaints regarding using the filename extension to select the imatrix format. The alternative would be a format flag, but you would need to know about it (especially if the default isn't the format you're used to).
>
> It's still not completely clear to me what or how strict your requirements are. Is it closer to "GGUF imatrix files should not exist", "GGUF imatrix should only be used deliberately" (e.g. by using the `.gguf` suffix), or "a format flag for the previous format would be enough, even if the default is GGUF"?
>
> 👤 **ikawrakow** replied the **2025-07-12** at **17:19:43**:<br>
> @compilade
>
> Thank you for letting me know. I basically never use `llama.cpp` now, so the imatrix GG-ification is no longer relevant for my needs. The imatrix tool in mainline has been broken for MLA models for quite some time now, so I guess it is time to fix that by merging your PR.
>
> I'm of course looking forward to all the imatrix improvements that have been discussed, but never materialized because their implementation was inhibited by the inferior data format. Now, with the imatrix GG-ified, its future is looking really bright!


@@ -0,0 +1,766 @@
### 🗣️ [#164](https://github.com/ikawrakow/ik_llama.cpp/discussions/164) - Latest CPU performance comparison with llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-12-24 |
| **Updated** | 2025-04-28 |
---
#### Description
There has been quite a bit of development here and in mainline `llama.cpp` since the performance results on the front page were generated, so I decided to make a new CPU performance comparison.
* Using `llama.cpp` build `14b699ec (4384)` (latest as of December 23 2024)
* Quantization is performed with mainline `llama.cpp`
* Performance is evaluated using the `llama-bench` tool for `PP-512` and `TG-128`
* For the results of `ik_llama.cpp` the command-line option `-rtr 1` is used when running `llama-bench`. This causes all model weights to be repacked into row-interleaved format (if available)
* `AVX2/Zen4` performance is on a Ryzen-7950X, `ARM` is on `M2-Max`
* LLaMA-3.1-8B-Instruct is used in all cases
* For non-quantized variants the respective native 16-bit float type is used (`fp16` on the M2-Max, `bf16` on the Ryzen-7950X)
### AVX2
| model | size | threads | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
| ------------------------ | ---------: | ------: | ------------: | -------------------: | -----------------: | -------: |
| 8B BF16 | 14.96 GiB | 16 | pp512 | 78.58 ± 0.10 | 256.90 ± 0.36 | 3.269 |
| 8B BF16 | 14.96 GiB | 2 | tg128 | 4.05 ± 0.00 | 4.27 ± 0.00 | 1.054 |
| 8B Q8_0 | 7.95 GiB | 16 | pp512 | 147.92 ± 0.52 | 268.19 ± 0.19 | 1.813 |
| 8B Q8_0 | 7.95 GiB | 2 | tg128 | 4.95 ± 0.01 | 7.63 ± 0.00 | 1.541 |
| 8B Q5_0 | 5.22 GiB | 16 | pp512 | 111.68 ± 0.36 | 251.21 ± 0.41 | 2.249 |
| 8B Q5_0 | 5.22 GiB | 2 | tg128 | 5.30 ± 0.00 | 11.14 ± 0.00 | 2.102 |
| 8B Q4_0 | 4.35 GiB | 16 | pp512 | 153.52 ± 0.21 | 273.54 ± 0.33 | 1.782 |
| 8B Q4_0 | 4.35 GiB | 2 | tg128 | 11.23 ± 0.01 | 12.92 ± 0.00 | 1.150 |
| 8B Q2_K - Small | 2.78 GiB | 16 | pp512 | 122.37 ± 0.31 | 269.96 ± 0.29 | 2.206 |
| 8B Q2_K - Small | 2.78 GiB | 2 | tg128 | 11.33 ± 0.00 | 17.10 ± 0.01 | 1.509 |
| 8B Q3_K - Small | 3.41 GiB | 16 | pp512 | 85.19 ± 0.32 | 255.30 ± 0.24 | 2.997 |
| 8B Q3_K - Small | 3.41 GiB | 2 | tg128 | 8.80 ± 0.00 | 12.99 ± 0.01 | 1.476 |
| 8B Q4_K - Small | 4.36 GiB | 16 | pp512 | 108.40 ± 0.25 | 269.60 ± 0.27 | 2.487 |
| 8B Q4_K - Small | 4.36 GiB | 2 | tg128 | 9.57 ± 0.00 | 13.48 ± 0.00 | 1.409 |
| 8B Q5_K - Small | 5.21 GiB | 16 | pp512 | 75.52 ± 0.19 | 254.68 ± 0.36 | 3.372 |
| 8B Q5_K - Small | 5.21 GiB | 2 | tg128 | 7.51 ± 0.00 | 11.41 ± 0.00 | 1.519 |
| 8B Q6_K | 6.14 GiB | 16 | pp512 | 82.56 ± 0.28 | 259.21 ± 0.37 | 3.140 |
| 8B Q6_K | 6.14 GiB | 2 | tg128 | 7.62 ± 0.00 | 10.05 ± 0.00 | 1.319 |
| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 16 | pp512 | 123.36 ± 0.27 | 265.88 ± 0.52 | 2.155 |
| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 2 | tg128 | 5.96 ± 0.01 | 9.30 ± 0.00 | 1.560 |
| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 16 | pp512 | 74.39 ± 0.18 | 269.91 ± 0.37 | 3.628 |
| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 2 | tg128 | 8.15 ± 0.00 | 13.58 ± 0.00 | 1.666 |
| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | 16 | pp512 | 45.78 ± 0.09 | 164.37 ± 0.48 | 3.590 |
| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | 2 | tg128 | 5.47 ± 0.00 | 8.74 ± 0.01 | 1.598 |
| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 16 | pp512 | 49.72 ± 0.06 | 156.50 ± 0.26 | 3.148 |
| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 2 | tg128 | 5.87 ± 0.00 | 6.87 ± 0.00 | 1.170 |
| 8B IQ2_M - 2.7 bpw | 2.74 GiB | 16 | pp512 | 43.80 ± 0.09 | 181.64 ± 0.62 | 4.147 |
| 8B IQ2_M - 2.7 bpw | 2.74 GiB | 2 | tg128 | 5.24 ± 0.00 | 5.57 ± 0.00 | 1.063 |
| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 16 | pp512 | 34.17 ± 0.06 | 149.68 ± 0.14 | 4.380 |
| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 2 | tg128 | 4.18 ± 0.01 | 6.23 ± 0.00 | 1.490 |
| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | 16 | pp512 | 30.20 ± 0.05 | 156.47 ± 0.34 | 5.181 |
| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | 2 | tg128 | 3.71 ± 0.00 | 4.47 ± 0.00 | 1.205 |
### ARM_NEON
| model | size | threads | test | t/s (llama.cpp) | t/s (ik_llama.cpp) | Speedup |
| ------------------------ | ---------: | ------: | ------------: | -------------------: | -----------------: | -------: |
| 8B F16 | 14.96 GiB | 8 | pp512 | 28.96 ± 0.27 | 91.24 ± 0.24 | 3.151 |
| 8B F16 | 14.96 GiB | 4 | tg128 | 7.89 ± 0.02 | 7.89 ± 0.02 | 1.000 |
| 8B Q8_0 | 7.95 GiB | 8 | pp512 | 54.54 ± 1.35 | 129.70 ± 1.33 | 2.378 |
| 8B Q8_0 | 7.95 GiB | 3 | tg128 | 14.04 ± 0.02 | 14.29 ± 0.05 | 1.017 |
| 8B Q5_0 | 5.22 GiB | 8 | pp512 | 25.15 ± 0.92 | 103.94 ± 0.62 | 4.133 |
| 8B Q5_0 | 5.22 GiB | 4 | tg128 | 12.20 ± 0.01 | 16.63 ± 0.04 | 1.363 |
| 8B Q4_0 | 4.35 GiB | 8 | pp512 | 114.63 ± 2.08 | 122.52 ± 0.15 | 1.069 |
| 8B Q4_0 | 4.35 GiB | 4 | tg128 | 23.89 ± 0.13 | 23.43 ± 0.22 | 0.981 |
| 8B Q2_K - Small | 2.78 GiB | 8 | pp512 | 33.02 ± 0.05 | 108.98 ± 0.24 | 3.300 |
| 8B Q2_K - Small | 2.78 GiB | 4 | tg128 | 13.91 ± 0.01 | 23.49 ± 0.12 | 1.689 |
| 8B Q3_K - Small | 3.41 GiB | 8 | pp512 | 24.95 ± 0.02 | 107.16 ± 0.64 | 4.295 |
| 8B Q3_K - Small | 3.41 GiB | 4 | tg128 | 11.10 ± 0.00 | 15.29 ± 0.04 | 1.377 |
| 8B Q4_K - Small | 4.36 GiB | 8 | pp512 | 43.30 ± 0.57 | 126.53 ± 0.45 | 2.922 |
| 8B Q4_K - Small | 4.36 GiB | 4 | tg128 | 17.55 ± 0.01 | 22.49 ± 0.07 | 1.281 |
| 8B Q5_K - Small | 5.21 GiB | 8 | pp512 | 27.82 ± 0.52 | 108.44 ± 0.19 | 3.898 |
| 8B Q5_K - Small | 5.21 GiB | 4 | tg128 | 12.26 ± 0.01 | 16.15 ± 0.05 | 1.317 |
| 8B Q6_K | 6.14 GiB | 8 | pp512 | 26.73 ± 0.46 | 106.15 ± 1.22 | 3.971 |
| 8B Q6_K | 6.14 GiB | 4 | tg128 | 11.62 ± 0.01 | 14.86 ± 0.05 | 1.279 |
| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 8 | pp512 | 92.64 ± 2.46 | 121.59 ± 1.41 | 1.313 |
| 8B IQ4_NL - 4.5 bpw | 4.35 GiB | 4 | tg128 | 23.45 ± 0.06 | 22.97 ± 0.01 | 0.980 |
| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 8 | pp512 | 37.90 ± 0.59 | 134.02 ± 0.66 | 3.536 |
| 8B IQ4_XS - 4.25 bpw | 4.13 GiB | 4 | tg128 | 16.03 ± 0.02 | 23.36 ± 0.18 | 1.457 |
| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | 8 | pp512 | 18.50 ± 0.53 | 87.89 ± 0.76 | 4.751 |
| 8B IQ2_XXS - 2.0625 bpw | 2.23 GiB | 4 | tg128 | 8.67 ± 0.02 | 12.28 ± 0.10 | 1.416 |
| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 8 | pp512 | 20.40 ± 0.37 | 70.09 ± 0.12 | 3.436 |
| 8B IQ2_XS - 2.3125 bpw | 2.42 GiB | 4 | tg128 | 9.49 ± 0.01 | 11.12 ± 0.09 | 1.172 |
| 8B IQ2_M - 2.7 bpw | 2.74 GiB | 8 | pp512 | 14.61 ± 0.02 | 67.56 ± 0.41 | 4.624 |
| 8B IQ2_M - 2.7 bpw | 2.74 GiB | 4 | tg128 | 6.77 ± 0.01 | 8.90 ± 0.02 | 1.315 |
| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 8 | pp512 | 13.42 ± 0.14 | 78.29 ± 0.33 | 5.833 |
| 8B IQ3_XXS - 3.0625 bpw | 3.04 GiB | 4 | tg128 | 6.26 ± 0.01 | 8.54 ± 0.07 | 1.364 |
| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | 8 | pp512 | 11.49 ± 0.01 | 80.89 ± 0.25 | 7.040 |
| 8B IQ3_S - 3.4375 bpw | 3.42 GiB | 4 | tg128 | 5.34 ± 0.01 | 6.61 ± 0.02 | 1.238 |
* We see that the CPU performance gap has widened significantly since July when I made the comparison on the front page.
* Only `llama.cpp's` low-quality 4-bit quantization `Q4_0` on `ARM_NEON` (which gets repacked to a 4-row interleaved format, formerly known as `Q4_0_4_4`) is competitive.
* The largest gap is for `IQ3_S` (7X faster on the M2-Max, 5.2X faster on the Ryzen-7950X)
* Even the mainstream k-quants are now very significantly faster here
* On the Ryzen-7950X the slowest quantization type in `ik_llama.cpp` is faster than the fastest type in `llama.cpp` for prompt processing
* On the M2-Max the slowest `ik_llama.cpp` type outperforms all `llama.cpp` types except `Q4_0` and `IQ4_NL`.
### Prompt processing (prefill) champion
The fastest way to do prompt processing with `ik_llama.cpp` is the new 8-bit, 8-row interleaved `Q8_K_R8` type. Getting 370 t/s for LLaMA-3.1-8B (~7.5 billion parameters excluding token embeddings) corresponds to ~5.5 TFLOPS!
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 8B Q8_K_R8 | 7.56 GiB | 8.03 B | Zen4 | 16 | pp512 | 370.11 ± 0.58 |
| llama 8B Q8_K_R8 | 7.56 GiB | 8.03 B | ARM_NEON | 8 | pp512 | 170.68 ± 0.56 |
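As a rough sanity check of that figure (counting 2 FLOPs per multiply-add): 2 × 7.5·10⁹ weights × 370 t/s ≈ 5.5·10¹² FLOP/s, i.e. ~5.5 TFLOPS.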
---
#### 🗣️ Discussion
👤 **saood06** replied the **2025-01-10** at **23:34:54**:<br>
I ran some benchmarks on an AVX2 machine (Xeon E5-2683 v4, 32 cores, quad-channel Broadwell) with an IQ4_XS quant of Midnight Miqu 70B v1.5 via batched-bench, using the arguments `-pps -fa -t 32 -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 -c 32768` (the context only needed to be set for llama.cpp, which otherwise skips some tests; ik_llama.cpp defaulted to 32768). llama.cpp was build 4404, and no runtime repacking was used for ik_llama.cpp.
I was curious about batch performance since there is inference software like arrows or loom which would definitely benefit from it.
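For reference, the full command was along these lines (binary name and model path are my best guess; the flags are exactly those listed above):

```console
$ ./bin/llama-batched-bench -m midnight-miqu-70b-v1.5.IQ4_XS.gguf -pps -fa -t 32 \
      -npp 128,256,512 -ntg 128,256 -npl 1,2,4,8,16,32 -c 32768
```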
| PP | TG | B | N_KV | T_TG s (llama.cpp) | S_TG t/s (llama.cpp) | T_TG s (ik_llama.cpp) | S_TG t/s (ik_llama.cpp) | Speedup |
|---------|----------|--------|----------|-----------------------|-----------------------|--------------------------|--------------------------|---------|
| 128 | 128 | 1 | 256 | 92.1 | 1.39 | 90.247 | 1.42 | 1.02 |
| 128 | 128 | 2 | 384 | 115.871 | 2.21 | 93.563 | 2.74 | 1.24 |
| 128 | 128 | 4 | 640 | 209.851 | 2.44 | 111.702 | 4.58 | 1.88 |
| 128 | 128 | 8 | 1152 | 399.978 | 2.56 | 209.249 | 4.89 | 1.91 |
| 128 | 128 | 16 | 2176 | 783.003 | 2.62 | 427.421 | 4.79 | 1.83 |
| 128 | 128 | 32 | 4224 | 1556.121 | 2.63 | 896.142 | 4.57 | 1.74 |
| 128 | 256 | 1 | 384 | 184.753 | 1.39 | 181.031 | 1.41 | 1.02 |
| 128 | 256 | 2 | 640 | 233.044 | 2.2 | 185.192 | 2.76 | 1.26 |
| 128 | 256 | 4 | 1152 | 423.01 | 2.42 | 227.289 | 4.51 | 1.86 |
| 128 | 256 | 8 | 2176 | 807.7 | 2.54 | 434.213 | 4.72 | 1.86 |
| 128 | 256 | 16 | 4224 | 1578.773 | 2.59 | 908.93 | 4.51 | 1.74 |
| 128 | 256 | 32 | 8320 | 3143.512 | 2.61 | 2024.429 | 4.05 | 1.55 |
| 256 | 128 | 1 | 384 | 92.622 | 1.38 | 90.92 | 1.41 | 1.02 |
| 256 | 128 | 2 | 512 | 118.038 | 2.17 | 92.551 | 2.77 | 1.28 |
| 256 | 128 | 4 | 768 | 212.751 | 2.41 | 113.572 | 4.51 | 1.87 |
| 256 | 128 | 8 | 1280 | 404.917 | 2.53 | 211.062 | 4.85 | 1.92 |
| 256 | 128 | 16 | 2304 | 789.767 | 2.59 | 428.125 | 4.78 | 1.84 |
| 256 | 128 | 32 | 4352 | 1569.485 | 2.61 | 899.613 | 4.55 | 1.74 |
| 256 | 256 | 1 | 512 | 186.991 | 1.37 | 181.844 | 1.41 | 1.03 |
| 256 | 256 | 2 | 768 | 237.34 | 2.16 | 186.438 | 2.75 | 1.27 |
| 256 | 256 | 4 | 1280 | 428.1 | 2.39 | 229.219 | 4.47 | 1.87 |
| 256 | 256 | 8 | 2304 | 815.064 | 2.51 | 437.482 | 4.68 | 1.86 |
| 256 | 256 | 16 | 4352 | 1591.762 | 2.57 | 911.641 | 4.49 | 1.75 |
| 256 | 256 | 32 | 8448 | 3170.023 | 2.58 | 2058.671 | 3.98 | 1.54 |
| 512 | 128 | 1 | 640 | 93.876 | 1.36 | 92.345 | 1.39 | 1.02 |
| 512 | 128 | 2 | 768 | 118.683 | 2.16 | 93.867 | 2.73 | 1.26 |
| 512 | 128 | 4 | 1024 | 215.082 | 2.38 | 114.616 | 4.47 | 1.88 |
| 512 | 128 | 8 | 1536 | 411.704 | 2.49 | 215.892 | 4.74 | 1.91 |
| 512 | 128 | 16 | 2560 | 803.455 | 2.55 | 439.992 | 4.65 | 1.83 |
| 512 | 128 | 32 | 4608 | 1595.727 | 2.57 | 928.049 | 4.41 | 1.72 |
| 512 | 256 | 1 | 768 | 188.209 | 1.36 | 183.237 | 1.4 | 1.03 |
| 512 | 256 | 2 | 1024 | 238.668 | 2.15 | 191.19 | 2.68 | 1.25 |
| 512 | 256 | 4 | 1536 | 435.484 | 2.35 | 233.338 | 4.39 | 1.87 |
| 512 | 256 | 8 | 2560 | 828.696 | 2.47 | 443.92 | 4.61 | 1.87 |
| 512 | 256 | 16 | 4608 | 1618.7 | 2.53 | 927.963 | 4.41 | 1.74 |
| 512 | 256 | 32 | 8704 | 3222.905 | 2.54 | 2082.961 | 3.93 | 1.55 |
The table does not include PP results because they did not vary much between tests (the prompt is shared, which is more aligned with my use case), but even there ik_llama.cpp was faster (~5.05 t/s vs ~2.70 t/s).
I manually repacked the IQ4_XS and tested the R4 version of the quant on ik_llama.cpp more thoroughly; results below.
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 128 | 128 | 1 | 256 | 19.497 | 6.56 | 92.423 | 1.38 | 111.921 | 2.29 |
| 128 | 128 | 2 | 384 | 19.332 | 6.62 | 92.578 | 2.77 | 111.910 | 3.43 |
| 128 | 128 | 3 | 512 | 19.325 | 6.62 | 94.344 | 4.07 | 113.669 | 4.50 |
| 128 | 128 | 4 | 640 | 19.342 | 6.62 | 96.776 | 5.29 | 116.119 | 5.51 |
| 128 | 128 | 5 | 768 | 19.345 | 6.62 | 106.289 | 6.02 | 125.634 | 6.11 |
| 128 | 128 | 6 | 896 | 19.358 | 6.61 | 124.053 | 6.19 | 143.412 | 6.25 |
| 128 | 128 | 7 | 1024 | 19.344 | 6.62 | 145.853 | 6.14 | 165.197 | 6.20 |
| 128 | 128 | 8 | 1152 | 19.374 | 6.61 | 169.257 | 6.05 | 188.631 | 6.11 |
| 128 | 128 | 9 | 1280 | 19.340 | 6.62 | 188.213 | 6.12 | 207.553 | 6.17 |
| 128 | 128 | 10 | 1408 | 19.354 | 6.61 | 210.678 | 6.08 | 230.033 | 6.12 |
| 128 | 128 | 11 | 1536 | 19.349 | 6.62 | 219.492 | 6.41 | 238.841 | 6.43 |
| 128 | 128 | 12 | 1664 | 19.341 | 6.62 | 251.357 | 6.11 | 270.697 | 6.15 |
| 128 | 128 | 13 | 1792 | 19.341 | 6.62 | 258.946 | 6.43 | 278.287 | 6.44 |
| 128 | 128 | 14 | 1920 | 19.355 | 6.61 | 299.999 | 5.97 | 319.354 | 6.01 |
| 128 | 128 | 15 | 2048 | 19.345 | 6.62 | 302.160 | 6.35 | 321.505 | 6.37 |
| 128 | 128 | 16 | 2176 | 19.362 | 6.61 | 339.064 | 6.04 | 358.426 | 6.07 |
| 128 | 256 | 1 | 384 | 19.365 | 6.61 | 180.876 | 1.42 | 200.241 | 1.92 |
| 128 | 256 | 2 | 640 | 19.382 | 6.60 | 189.188 | 2.71 | 208.570 | 3.07 |
| 128 | 256 | 3 | 896 | 19.359 | 6.61 | 191.263 | 4.02 | 210.621 | 4.25 |
| 128 | 256 | 4 | 1152 | 19.372 | 6.61 | 197.427 | 5.19 | 216.798 | 5.31 |
| 128 | 256 | 5 | 1408 | 19.373 | 6.61 | 219.152 | 5.84 | 238.525 | 5.90 |
| 128 | 256 | 6 | 1664 | 19.370 | 6.61 | 258.357 | 5.95 | 277.727 | 5.99 |
| 128 | 256 | 7 | 1920 | 19.370 | 6.61 | 303.584 | 5.90 | 322.954 | 5.95 |
| 128 | 256 | 8 | 2176 | 19.372 | 6.61 | 349.893 | 5.85 | 369.265 | 5.89 |
| 128 | 256 | 9 | 2432 | 19.327 | 6.62 | 386.352 | 5.96 | 405.680 | 5.99 |
| 128 | 256 | 10 | 2688 | 19.337 | 6.62 | 444.917 | 5.75 | 464.255 | 5.79 |
| 128 | 256 | 11 | 2944 | 19.341 | 6.62 | 451.427 | 6.24 | 470.768 | 6.25 |
| 128 | 256 | 12 | 3200 | 19.345 | 6.62 | 528.326 | 5.81 | 547.671 | 5.84 |
| 128 | 256 | 13 | 3456 | 19.546 | 6.55 | 532.030 | 6.26 | 551.576 | 6.27 |
| 128 | 256 | 14 | 3712 | 19.333 | 6.62 | 646.512 | 5.54 | 665.845 | 5.57 |
| 128 | 256 | 15 | 3968 | 19.335 | 6.62 | 619.687 | 6.20 | 639.021 | 6.21 |
| 128 | 256 | 16 | 4224 | 19.328 | 6.62 | 732.538 | 5.59 | 751.866 | 5.62 |
| 256 | 128 | 1 | 384 | 38.431 | 6.66 | 92.778 | 1.38 | 131.209 | 2.93 |
| 256 | 128 | 2 | 512 | 38.513 | 6.65 | 93.080 | 2.75 | 131.592 | 3.89 |
| 256 | 128 | 3 | 640 | 38.412 | 6.66 | 95.364 | 4.03 | 133.776 | 4.78 |
| 256 | 128 | 4 | 768 | 38.417 | 6.66 | 98.235 | 5.21 | 136.652 | 5.62 |
| 256 | 128 | 5 | 896 | 38.448 | 6.66 | 107.889 | 5.93 | 146.337 | 6.12 |
| 256 | 128 | 6 | 1024 | 38.443 | 6.66 | 125.778 | 6.11 | 164.221 | 6.24 |
| 256 | 128 | 7 | 1152 | 38.437 | 6.66 | 149.730 | 5.98 | 188.167 | 6.12 |
| 256 | 128 | 8 | 1280 | 38.462 | 6.66 | 170.487 | 6.01 | 208.949 | 6.13 |
| 256 | 128 | 9 | 1408 | 38.433 | 6.66 | 189.718 | 6.07 | 228.151 | 6.17 |
| 256 | 128 | 10 | 1536 | 38.438 | 6.66 | 213.574 | 5.99 | 252.011 | 6.09 |
| 256 | 128 | 11 | 1664 | 38.455 | 6.66 | 222.606 | 6.33 | 261.061 | 6.37 |
| 256 | 128 | 12 | 1792 | 38.445 | 6.66 | 252.863 | 6.07 | 291.308 | 6.15 |
| 256 | 128 | 13 | 1920 | 38.443 | 6.66 | 260.814 | 6.38 | 299.257 | 6.42 |
| 256 | 128 | 14 | 2048 | 38.438 | 6.66 | 305.763 | 5.86 | 344.202 | 5.95 |
| 256 | 128 | 15 | 2176 | 38.475 | 6.65 | 303.104 | 6.33 | 341.579 | 6.37 |
| 256 | 128 | 16 | 2304 | 38.469 | 6.65 | 342.793 | 5.97 | 381.262 | 6.04 |
| 256 | 256 | 1 | 512 | 38.455 | 6.66 | 183.865 | 1.39 | 222.320 | 2.30 |
| 256 | 256 | 2 | 768 | 38.479 | 6.65 | 187.584 | 2.73 | 226.063 | 3.40 |
| 256 | 256 | 3 | 1024 | 38.463 | 6.66 | 192.895 | 3.98 | 231.358 | 4.43 |
| 256 | 256 | 4 | 1280 | 38.399 | 6.67 | 199.713 | 5.13 | 238.111 | 5.38 |
| 256 | 256 | 5 | 1536 | 38.439 | 6.66 | 223.437 | 5.73 | 261.875 | 5.87 |
| 256 | 256 | 6 | 1792 | 38.427 | 6.66 | 260.056 | 5.91 | 298.482 | 6.00 |
| 256 | 256 | 7 | 2048 | 38.398 | 6.67 | 307.312 | 5.83 | 345.710 | 5.92 |
| 256 | 256 | 8 | 2304 | 38.415 | 6.66 | 355.564 | 5.76 | 393.979 | 5.85 |
| 256 | 256 | 9 | 2560 | 38.497 | 6.65 | 387.482 | 5.95 | 425.979 | 6.01 |
| 256 | 256 | 10 | 2816 | 38.498 | 6.65 | 451.367 | 5.67 | 489.865 | 5.75 |
| 256 | 256 | 11 | 3072 | 38.493 | 6.65 | 452.656 | 6.22 | 491.149 | 6.25 |
| 256 | 256 | 12 | 3328 | 38.669 | 6.62 | 534.248 | 5.75 | 572.917 | 5.81 |
| 256 | 256 | 13 | 3584 | 38.485 | 6.65 | 534.845 | 6.22 | 573.330 | 6.25 |
| 256 | 256 | 14 | 3840 | 38.486 | 6.65 | 649.772 | 5.52 | 688.257 | 5.58 |
| 256 | 256 | 15 | 4096 | 39.294 | 6.51 | 624.510 | 6.15 | 663.804 | 6.17 |
| 256 | 256 | 16 | 4352 | 38.648 | 6.62 | 745.863 | 5.49 | 784.511 | 5.55 |
| 512 | 128 | 1 | 640 | 77.207 | 6.63 | 91.468 | 1.40 | 168.674 | 3.79 |
| 512 | 128 | 2 | 768 | 76.844 | 6.66 | 94.375 | 2.71 | 171.219 | 4.49 |
| 512 | 128 | 3 | 896 | 77.835 | 6.58 | 97.286 | 3.95 | 175.120 | 5.12 |
| 512 | 128 | 4 | 1024 | 76.964 | 6.65 | 100.195 | 5.11 | 177.159 | 5.78 |
| 512 | 128 | 5 | 1152 | 76.998 | 6.65 | 110.516 | 5.79 | 187.514 | 6.14 |
| 512 | 128 | 6 | 1280 | 77.134 | 6.64 | 128.599 | 5.97 | 205.733 | 6.22 |
| 512 | 128 | 7 | 1408 | 77.085 | 6.64 | 153.659 | 5.83 | 230.744 | 6.10 |
| 512 | 128 | 8 | 1536 | 77.157 | 6.64 | 174.060 | 5.88 | 251.217 | 6.11 |
| 512 | 128 | 9 | 1664 | 77.074 | 6.64 | 192.851 | 5.97 | 269.925 | 6.16 |
| 512 | 128 | 10 | 1792 | 77.079 | 6.64 | 219.608 | 5.83 | 296.688 | 6.04 |
| 512 | 128 | 11 | 1920 | 78.024 | 6.56 | 224.332 | 6.28 | 302.356 | 6.35 |
| 512 | 128 | 12 | 2048 | 77.056 | 6.64 | 258.370 | 5.94 | 335.426 | 6.11 |
| 512 | 128 | 13 | 2176 | 76.931 | 6.66 | 264.692 | 6.29 | 341.624 | 6.37 |
| 512 | 128 | 14 | 2304 | 77.061 | 6.64 | 310.472 | 5.77 | 387.533 | 5.95 |
| 512 | 128 | 15 | 2432 | 77.067 | 6.64 | 305.914 | 6.28 | 382.981 | 6.35 |
| 512 | 128 | 16 | 2560 | 77.067 | 6.64 | 352.858 | 5.80 | 429.925 | 5.95 |
| 512 | 256 | 1 | 768 | 77.023 | 6.65 | 183.489 | 1.40 | 260.512 | 2.95 |
| 512 | 256 | 2 | 1024 | 77.015 | 6.65 | 190.038 | 2.69 | 267.052 | 3.83 |
| 512 | 256 | 3 | 1280 | 77.911 | 6.57 | 196.900 | 3.90 | 274.811 | 4.66 |
| 512 | 256 | 4 | 1536 | 76.980 | 6.65 | 204.269 | 5.01 | 281.249 | 5.46 |
| 512 | 256 | 5 | 1792 | 76.875 | 6.66 | 226.576 | 5.65 | 303.451 | 5.91 |
| 512 | 256 | 6 | 2048 | 77.435 | 6.61 | 267.788 | 5.74 | 345.223 | 5.93 |
| 512 | 256 | 7 | 2304 | 76.984 | 6.65 | 315.387 | 5.68 | 392.370 | 5.87 |
| 512 | 256 | 8 | 2560 | 76.968 | 6.65 | 362.447 | 5.65 | 439.416 | 5.83 |
| 512 | 256 | 9 | 2816 | 76.947 | 6.65 | 393.626 | 5.85 | 470.573 | 5.98 |
| 512 | 256 | 10 | 3072 | 76.959 | 6.65 | 463.783 | 5.52 | 540.742 | 5.68 |
| 512 | 256 | 11 | 3328 | 76.890 | 6.66 | 458.811 | 6.14 | 535.701 | 6.21 |
| 512 | 256 | 12 | 3584 | 77.875 | 6.57 | 544.833 | 5.64 | 622.708 | 5.76 |
| 512 | 256 | 13 | 3840 | 77.002 | 6.65 | 542.172 | 6.14 | 619.174 | 6.20 |
| 512 | 256 | 14 | 4096 | 77.088 | 6.64 | 668.595 | 5.36 | 745.683 | 5.49 |
| 512 | 256 | 15 | 4352 | 77.021 | 6.65 | 629.146 | 6.10 | 706.168 | 6.16 |
| 512 | 256 | 16 | 4608 | 78.044 | 6.56 | 758.943 | 5.40 | 836.987 | 5.51 |
Performance is good, but I don't understand why odd batch sizes seem to perform better. Also, is converting from IQ4_XS to IQ4_XS_R4 via the quantize command not recommended? I did it just for the test above and it went from:
type f32: 161 tensors
type q5_K: 80 tensors
type q6_K: 1 tensors
type iq4_xs: 481 tensors
And after conversion:
type f32: 161 tensors
type q5_K: 10 tensors
type q6_K: 1 tensors
type iq4_xs: 1 tensors
type iq5_k: 80 tensors
type iq4_xs_r4: 470 tensors
I only ask because I'm not sure if the 80 tensors going from q5_K to iq5_k is lossy.
---
👤 **ikawrakow** replied the **2025-01-11** at **07:28:46**:<br>
@saood06 Thanks for testing.
> Performance is good, but I don't understand why odd batch sizes seem to perform better.
Neither do I. I'll have to look into it.
> Also is converting from IQ4_XS to IQ4_XS_R4 via the quantize command not reccomended? I did it just for the test above and it went from:
Sorry, the goal was to make the `_R4` quants use the same quantization mixes, but apparently I have not quite succeeded. The function where the quantization type is selected is quite messy. But instead of re-quantizing to `*_R4`, you can use the `-rtr` command line option, which will make your model use the exact same mix of quantization types (but those where an `_R4` variant is available will be repacked to that).
> I only ask because I'm not sure if the 80 tensors going from q5_K to iq5_k is lossy.
`IQ5_K` is normally quite a bit better than `Q5_K`, so most of the time I would expect this to perform better.
> 👤 **saood06** replied the **2025-01-11** at **09:59:16**:<br>
> >Sorry, the goal was to make the _R4 quants use the same quantization mixes, but apparently I have not quite succeeded. The function where the quantization type is selected is quite messy. But instead of re-quantizing to *_R4, you can use the -rtr command line option, which will make your model use the exact same mix of quantization types (but those where an _R4 variant is available will be repacked to that).
>
> No worries, I only made the quant to test (for actual use I'd make an IQK quant), and I didn't realize batched-bench supported rtr. It also didn't matter for this machine and test, but I also wasn't sure how runtime repacking and NUMA would interact, i.e. whether the runtime repacking would interfere with the benefits from POSIX_MADV_RANDOM.
>
> >IQ5_K is normally quite a bit better than Q5_K, so most of the time I would expect this to perform better.
>
> Yes, but if the tensor was originally Q5_K converting it can't recover accuracy, it can only maintain it or lose more.
>
> On another note, I also got Deepseek V3 working with ik_llama.cpp. I don't have direct comparisons to llama.cpp (and I don't know if I will, since making a quant takes 4 hours), but I'm running IQ4_K (on different hardware than the Midnight Miqu test above; this one is a dual-socket Xeon E5-2690 v3). Indirectly comparing against what people were posting on Reddit, with either machines far better than mine or smaller quants, the performance I get seems a lot better. The only caveat is that this model, based on both my experience and some issues filed against llama.cpp, takes a LOT of tokens to get fully faulted into RAM, which might be why people were posting such low performance numbers.
>
> Once almost all of the model was in the system cache, it did prompt processing at 11.5 t/s and token generation at 2.75 t/s. I still couldn't get it to fully fault in, but it basically stopped paging and performance stopped improving once it hit those numbers.
>
> I couldn't get it to run with an _R4 quant (it hit the `GGML_ASSERT(nrc_x%4 == 0)`), but even without that I'm still happy with the performance.
>
> 👤 **ikawrakow** replied the **2025-01-11** at **10:38:23**:<br>
> > I couldn't get it to run with an _R4 quant it hit the GGML_ASSERT(nrc_x%4 == 0), but even without that I'm still happy with the performance of it.
>
> Can you post the assert you see? I was hoping to have covered all places where one needs to check for divisibility by 4 before using `_R4` quants, but apparently I'm still missing checks somewhere. What are the tensor dimensions of this model?
>
> 👤 **saood06** replied the **2025-01-11** at **11:03:54**:<br>
> >Can you post the assert you see?
>
> Here's the full error output I got when trying to run it. I put it in a details block as it is long.
>
> <details>
>
> ```
> /home/saood06/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:5242: GGML_ASSERT(nrc_x%4 == 0) failed
> [... the same assertion failure repeated once per worker thread; repeats omitted ...]
> [... repeated gdb/ptrace attach noise omitted ("process ... is already traced", "ptrace: Operation not permitted", "No stack.", "The program is not being run.", [New LWP ...]) ...]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/usr/lib64/libthread_db.so.1".
> 0x000055770a10e177 in __GI___wait4 () at ../sysdeps/unix/sysv/linux/wait4.c:30
> warning: 30 ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
> #0 0x000055770a10e177 in __GI___wait4 () at ../sysdeps/unix/sysv/linux/wait4.c:30
> 30 in ../sysdeps/unix/sysv/linux/wait4.c
> #1 0x000055770a817f7a in ggml_print_backtrace () at /home/saood06/ik_llama.cpp/ggml/src/ggml.c:241
> 241 waitpid(pid, &wstatus, 0);
> #2 0x000055770a840bc8 in ggml_abort (file=0x55770abb91f0 "/home/saood06/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp", line=5242, fmt=0x55770abb4051 "GGML_ASSERT(%s) failed") at /home/saood06/ik_llama.cpp/ggml/src/ggml.c:268
> 268 ggml_print_backtrace();
> #3 0x000055770aa0814a in (anonymous namespace)::mul_mat_iq4_k_r4_q8_k<1> (n=<optimized out>, vx=<optimized out>, bx=<optimized out>, info=..., nrc_x=<optimized out>) at /home/saood06/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:5242
> 5242 GGML_ASSERT(nrc_x%4 == 0);
> #4 0x000055770ab7454c in (anonymous namespace)::MulMat::mul_mat_NxM (this=0x7ffe16539de0, n=7168, vx=0x551fe175a500, bx=<optimized out>, info=..., nrc_x=<optimized out>, nrc_y=7168) at /home/saood06/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:183
> 183 funcs[n_left-1](n, vx, bx, info, nrc_x);
> #5 (anonymous namespace)::MulMat::mul_mat_NxM (this=0x7ffe16539de0, n=7168, vx=0x551fe175a500, bx=<optimized out>, info=..., nrc_x=6, nrc_y=7168) at /home/saood06/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:123
> 123 inline void mul_mat_NxM(int n, const void * vx, size_t bx, DataInfo& info, int nrc_x, int nrc_y) {
> #6 iqk_mul_mat_moe (Nx=Nx@entry=2048, Ny=Ny@entry=1, ne00=ne00@entry=7168, ne11=ne11@entry=1, typeA=<optimized out>, A=A@entry=0x551fe175a500, strideA=<optimized out>, typeB=15, B=0x55770ff8ef60, strideB=8176, C=0x551d8392b820, nb1=8192, nb2=655 36, vrow_mapping=0x55770ff937e0, ith=0, nth=48) at /home/saood06/ik_llama.cpp/ggml/src/iqk/iqk_mul_mat.cpp:265
> 265 mm.mul_mat_NxM(ne00, (const char *)A + row_size_qx*first_x, row_size_qx, info, nrc_x, Ny);
> #7 0x000055770a82e9a5 in ggml_compute_forward_mul_mat_id (params=<optimized out>, dst=0x557709930930) at /home/saood06/ik_llama.cpp/ggml/src/ggml.c:14281
> 14281 if (!iqk_mul_mat_moe(nr0, nr1, ne00, ne11,
> #8 0x000055770a85c1e7 in ggml_graph_compute_thread (data=data@entry=0x7ffe1653a150) at /home/saood06/ik_llama.cpp/ggml/src/ggml.c:21029
> 21029 ggml_compute_forward(&params, node);
> #9 0x000055770a85c335 in ggml_graph_compute._omp_fn.0 () at /home/saood06/ik_llama.cpp/ggml/src/ggml.c:21080
> 21080 ggml_graph_compute_thread(&worker);
> #10 0x000055770a3b7dc6 in GOMP_parallel () from /usr/lib64/libgomp.so.1
> #11 0x000055770a85f984 in ggml_graph_compute (cgraph=cgraph@entry=0x55770fdda578, cplan=cplan@entry=0x7ffe1653a230) at /home/saood06/ik_llama.cpp/ggml/src/ggml.c:21066
> 21066 #pragma omp parallel num_threads(n_threads)
> #12 0x000055770a86f272 in ggml_backend_cpu_graph_compute (backend=<optimized out>, cgraph=0x55770fdda578) at /home/saood06/ik_llama.cpp/ggml/src/ggml-backend.c:815
> 815 return ggml_graph_compute(cgraph, &cplan);
> #13 0x000055770a872f7a in ggml_backend_graph_compute_async (backend=0x5577104efd20, cgraph=0x55770fdda578) at /home/saood06/ik_llama.cpp/ggml/src/ggml-backend.c:282
> 282 return backend->iface.graph_compute(backend, cgraph);
> #14 ggml_backend_sched_compute_splits (sched=0x55770ff4a860) at /home/saood06/ik_llama.cpp/ggml/src/ggml-backend.c:1795
> 1795 enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
> #15 0x000055770ad9d036 in llama_graph_compute (lctx=..., gf=0x5577098df030, n_threads=48) at /home/saood06/ik_llama.cpp/src/llama.cpp:14917
> 14917 ggml_backend_sched_graph_compute_async(lctx.sched, gf);
> #16 llama_decode_internal (batch_all=..., lctx=...) at /home/saood06/ik_llama.cpp/src/llama.cpp:15133
> 15133 llama_graph_compute(lctx, gf, n_threads);
> #17 llama_decode (ctx=0x55770fde9e00, batch=...) at /home/saood06/ik_llama.cpp/src/llama.cpp:19318
> 19318 const int ret = llama_decode_internal(*ctx, batch);
> #18 0x000055770ae99991 in llama_init_from_gpt_params (params=...) at /home/saood06/ik_llama.cpp/common/common.cpp:2179
> 2179 llama_decode(lctx, llama_batch_get_one(tmp.data(), std::min(tmp.size(), (size_t) params.n_batch), 0, 0));
> #19 0x000055770ae6bbac in main (argc=<optimized out>, argv=<optimized out>) at /home/saood06/ik_llama.cpp/examples/main/main.cpp:210
> 210 llama_init_result llama_init = llama_init_from_gpt_params(params);
> Aborted (core dumped)
> [Inferior 1 (process 2173336) detached]
>
> ```
> </details>
>
> >What are the tensor dimensions of this model?
>
> https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q2_K_L?show_file_info=DeepSeek-V3-Q2_K_L%2FDeepSeek-V3-Q2_K_L-00001-of-00005.gguf
>
> That link should list them in a relatively nice format. You'll have to click through to view all 5 parts though.
>
> 👤 **ikawrakow** replied the **2025-01-11** at **11:17:30**:<br>
> Thanks! This explains it. It is a MoE model, so I must have forgotten to make sure the number of rows is a multiple of 4 when splitting work between threads in the MoE matrix multiplication implementation. I'll try to fix it.
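> For context, the fix boils down to making the per-thread work split respect the 4-row interleaving. A simplified sketch of the idea (not the actual `iqk_mul_mat.cpp` code):
>
> ```cpp
> #include <algorithm>
>
> // Illustrative only: split nrows rows among nth threads such that every
> // chunk boundary is a multiple of 4 (the _R4 interleaving factor), so no
> // thread ever receives a partial group of interleaved rows.
> void thread_row_range(int nrows, int nth, int ith, int & first, int & last) {
>     const int ngroups = nrows / 4;                   // groups of 4 interleaved rows
>     const int per_th  = (ngroups + nth - 1) / nth;   // ceil(ngroups / nth)
>     first = 4 * std::min( ith      * per_th, ngroups);
>     last  = 4 * std::min((ith + 1) * per_th, ngroups);
> }
> ```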
>
> 👤 **saood06** replied the **2025-01-12** at **18:08:54**:<br>
> >Thanks! This explains it.
>
> I'm glad you were able to figure out the issue.
>
> >I'll try to fix it.
>
> I see you did, with #170. The _R4 quant now works for Deepseek V3, but the performance is different from what I was expecting: I am pleasantly surprised by token generation going from 2.75 t/s to 3.10 t/s, but prompt processing dropped from 11.5 t/s to 9.8 t/s.
>
> Either way thanks for the quick fix. The bump in TG speeds is nice, even if PP speed went down for me.
>
> 👤 **ikawrakow** replied the **2025-01-13** at **05:54:15**:<br>
> > Prompt processing on the other hand dropped from 11.5 t/s to 9.8 t/s.
>
> This is strange. In my testing with Mixtral8x7B, after the fix `IQ4_XS_R4` is about 30% faster than `IQ4_XS` for prompt processing. Deepseek V3 is beyond my compute capabilities, so not able to investigate.
>
> 👤 **saood06** replied the **2025-01-19** at **13:00:33**:<br>
> >after the fix IQ4_XS_R4 is about 30% faster than IQ4_XS for prompt processing
>
> I've been testing IQ4_K_R4 vs IQ4_K, but I will also test both IQ4_XS variants on Mixtral-8x22B, as I plan to test that model anyway, and I'll give some numbers against llama.cpp.
>
> >Deepseek V3 is beyond my compute capabilities, so not able to investigate.
>
> I understand; it is a large model, which is also why I have yet to test IQ4_XS and compare perplexity against both of these and against llama.cpp. But even if you can't test the implementation yourself, I got permission from the author of the Deepseek PR to create a PR here. Would you accept it?
---
👤 **ikawrakow** replied the **2025-01-11** at **07:58:35**:<br>
> > Performance is good, but I don't understand why odd batch sizes seem to perform better.
> Neither do I. I'll have to look into it.
It is related to flash attention (FA). Here is a graph that shows t/s as a function of batch size with and without FA (LLaMA-3.1-8B-Instruct, Ryzen-7950X CPU)
![batches](https://github.com/user-attachments/assets/2c2e6020-4bea-41f9-9b56-f51bcfd3c61a)
Clearly I'm doing something there that works better for an odd number of queries. I'll need to investigate.
---
👤 **saood06** replied the **2025-01-19** at **13:33:06**:<br>
>We see that the CPU performance gap has widened significantly since July when I made the comparison on the front page.
Do you plan to update the README.md with these numbers? The R4 quants are very impressive.
> 👤 **ikawrakow** replied the **2025-01-19** at **15:30:36**:<br>
> I should, I know. It is just that I prefer to solve problems rather than write about how I solved the problem and what came out.
>
> 👤 **saood06** replied the **2025-04-27** at **09:33:26**:<br>
> You made a good list of things [here](https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828); the "Why?" section can be updated with newer models like the official BitNet release, Deepseek, and Llama-4. Updating the benchmarks, though, I know is a lot of work.
>
> 👤 **ikawrakow** replied the **2025-04-28** at **14:29:33**:<br>
> Something like PR #352 ?
---
👤 **bartowski1182** replied the **2025-01-23** at **02:58:19**:<br>
Out of curiosity, do you intend to maintain this fork as an alternative to llama.cpp perpetually, or is it more of a testing ground before upstreaming?
I'm wondering whether it's worth recommending that people run this specifically for better performance, or whether it's more of a "bleeding edge" kind of project that people should wait on until it's more ready.
> 👤 **ikawrakow** replied the **2025-01-23** at **08:18:58**:<br>
> > Out of curiousity, do you intend to maintain this fork as an alternative to llama.cpp perpetually? or is it more of a testing grounds before upstreaming?
>
> Nothing is perpetual in this world :smiley:
>
> But no, I have no intention to be upstreaming to `llama.cpp`.
>
> It is also a bit of a chicken-and-egg game: I'll only get a more significant number of users if people know (or at least expect) that I'm seriously committed to this project and the project gets advertised around social networks, but I can only know whether I want to seriously commit to maintaining this project long term for a significant number of users if I already have many users and have dealt with the associated bug reports and feature requests :smiley:
>
> As it stands, this project is only useful for technical users who are not scared to build the project themselves (no Docker images or pre-built binaries), and who are using one of the platforms I develop/test on (Linux and macOS, `AVX2` or `ARM_NEON` CPUs, newer Nvidia GPUs). It may or may not work on Windows/Android/etc., old Nvidia or AMD GPUs, etc. I absolutely don't have the bandwidth (or desire) to support every operating system and computing platform under the sun, including 10+ year old CPUs and GPUs, and obscure platforms used by exactly 3 people in the world, as `llama.cpp` does.
>
> 👤 **bartowski1182** replied the **2025-01-23** at **15:12:49**:<br>
> Yeah, that makes sense! It would be cool to see someone attempt to upstream some improvements, but I understand your lack of desire considering it's probably quite the headache.
>
> Good to know, though, that you intend to keep this going for at least a while.
---
👤 **saood06** replied the **2025-01-30** at **22:48:57**:<br>
Due to Deepseek's design, I was curious to test the MHA 35B c4ai-command-r-v01.Q8_0 on my Xeon E5-2683 v4. I ran as much context as I had RAM for. TG is set to 5 rather than 32 as it was slow.
| PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
|-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
| 128 | 5 | 1 | 133 | 20.344 | 6.29 | 5.500 | 0.91 | 25.843 | 5.15 |
| 256 | 5 | 1 | 261 | 34.275 | 7.47 | 30.895 | 0.16 | 65.170 | 4.00 |
| 512 | 5 | 1 | 517 | 56.097 | 9.13 | 31.850 | 0.16 | 87.947 | 5.88 |
| 1024 | 5 | 1 | 1029 | 112.460 | 9.11 | 21.224 | 0.24 | 133.684 | 7.70 |
| 2048 | 5 | 1 | 2053 | 218.188 | 9.39 | 32.941 | 0.15 | 251.130 | 8.18 |
| 4096 | 5 | 1 | 4101 | 448.955 | 9.12 | 31.231 | 0.16 | 480.186 | 8.54 |
| 8192 | 5 | 1 | 8197 | 977.908 | 8.38 | 42.563 | 0.12 | 1020.471 | 8.03 |
| 16384 | 5 | 1 | 16389 | 2339.461 | 7.00 | 39.989 | 0.13 | 2379.450 | 6.89 |
| 22000 | 5 | 1 | 22005 | 3484.923 | 6.31 | 44.705 | 0.11 | 3529.628 | 6.23 |

View File

@@ -0,0 +1,28 @@
### 🗣️ [#165](https://github.com/ikawrakow/ik_llama.cpp/discussions/165) - Norm RMS Epsilon
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-12-25 |
| **Updated** | 2024-12-27 |
---
#### Description
While it crosses my mind...
@Ikawrakow: a while ago, you made some measurements with variations of the norm RMS epsilon which showed some small benefits from offsetting it for <2 bpw quants. It was on LLaMA-2 I believe, and I wonder if it applies to other architectures and, if yes, whether there is some sort of "formula" that would come with it to improve the low-bitrate quants themselves.
Just layman's thoughts.
And merry Xmas btw, if you celebrate it!
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-12-27** at **17:44:24**:<br>
I'm travelling, so just quickly from the phone.
Yes, there is a small benefit from increasing rms_eps also for LLaMA-3, but only for very low-bit quants (IQ2_XXS). No, I haven't done any kind of systematic investigation.

View File

@@ -0,0 +1,49 @@
### 🗣️ [#166](https://github.com/ikawrakow/ik_llama.cpp/discussions/166) - Learning more LLM quantization
| **Author** | `robinnarsinghranabhat` |
| :--- | :--- |
| **Created** | 2025-01-05 |
| **Updated** | 2025-03-13 |
---
#### Description
As a beginner to ML, I wanted to learn which research papers guided the quantization implementation in llama.cpp.
It might sound silly, but we have separate tricks for quantization during training and during evaluation, right?
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-01-05** at **10:37:28**:<br>
> As a beginner to ML, I wanted to learn which research papers guided the quantization implementation in llama.cpp.
I developed all quantization types in `llama.cpp` apart from the legacy quants `Q4_0, Q4_1, Q5_0, Q5_1, Q8_0` (but these are very simple round-to-nearest block-wise quants). I did not read any research papers, just went ahead and experimented. Rarely reading papers has always been my approach to research. I have found that reading what others have done influences my thinking direction and hence may prevent finding a better approach. I only go and read papers if I was not able to find a meaningful solution to a problem on my own.
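For the curious, here is a minimal sketch of what such a round-to-nearest block-wise quantization looks like. This is illustrative only, not the actual ggml code; the block size of 32 and the `Q8_0`-style layout (one float scale plus 8-bit integers per block) are chosen for simplicity.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// One block: 32 weights share a single float scale d = max|x| / 127,
// and each weight is stored as round(x / d) in an int8_t.
struct BlockQ8 {
    float  d;
    int8_t qs[32];
};

BlockQ8 quantize_block_rtn(const float * x) {
    BlockQ8 out{};
    float amax = 0.0f;
    for (int j = 0; j < 32; ++j) amax = std::max(amax, std::fabs(x[j]));
    out.d = amax / 127.0f;
    const float id = out.d > 0.0f ? 1.0f / out.d : 0.0f;     // inverse scale
    for (int j = 0; j < 32; ++j) {
        const float q = std::clamp(x[j] * id, -127.0f, 127.0f);
        out.qs[j] = (int8_t)std::lround(q);                   // round to nearest
    }
    return out;
}

// Dequantization is simply x[j] ≈ d * qs[j]. The legacy quants mentioned
// above follow this pattern with narrower bit widths, optionally adding a
// per-block offset (Q4_1, Q5_1).
```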
> It might sound silly, but we have separate tricks for quantization during training and during evaluation, right?
`llama.cpp` does not do any training, so it is always post-training quantization (PTQ). But in general there is quantization-aware training (QAT), where the model is not actually quantized during training but model weights are forced to stay within a specified range with the hope that this will give better PTQ results. The only actually quantized model training approach I'm aware of is Bitnet from Microsoft Research, where a ternary model is trained (weights are -1, 0, 1, plus a per tensor float scaling factor). More recently researchers have been utilizing fine-tuning for PTQ, where some corpus of training data is used to guide PTQ (look for, e.g., Quip#, AQLM, QTIP). This is quite different from the simple quantization approaches used in `llama.cpp` and also here in this repository, requires a full-fledged training framework such as PyTorch, powerful GPU(s), and many hours/days of GPU time.
---
👤 **robinnarsinghranabhat** replied the **2025-01-10** at **21:38:11**:<br>
Thank you for this humble response!
Now I understand it's doing inference on quantized weights.
But I get lost trying to understand the llama.cpp codebase. How should I navigate it?
I am comfortable with Python and machine learning concepts, and I understand pointers in C.
But I have never written complex programs in C/C++.
Do I need to understand fundamental concepts of operating systems, computer architecture, memory management, etc.?
I want to be a programmer like you.
Sorry... lots of questions all over the place :(
> 👤 **arnfaldur** replied the **2025-03-13** at **02:10:31**:<br>
> Trying to understand this codebase isn't attacking the wall where it's lowest. You're probably best off finding some beginner/intermediate C++ courses online. I imagine that there are plenty available for free. You don't strictly need to understand all these fundamentals to understand what this project is doing, but you sound like you're in the *don't know what you don't know* phase and a general Computer Science course would likely get you the farthest at this point.

View File

@@ -0,0 +1,96 @@
### 🗣️ [#18](https://github.com/ikawrakow/ik_llama.cpp/discussions/18) - CPU beating GPU in token generation speed
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-08-13 |
| **Updated** | 2025-04-03 |
---
#### Description
The [TriLM](https://huggingface.co/collections/SpectraSuite/trilms-unpacked-668d5f62afe0f4036925b1d2) ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with `IQ2_TN`. So, without further ado, lets look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:
| backend | threads | test | t/s |
| ---------- | ------: | ------------: | ---------------: |
| Ryzen-7950X | 16 | pp1500 | 8268.11 ± 48.34 |
| Ryzen-7950X | 4 | tg500 | 1016.65 ± 22.17 |
| Ryzen-7950X | 8 | tg500 | 1224.83 ± 32.28 |
| Ryzen-7950X | 16 | tg500 | 1240.54 ± 25.74 |
| RTX-4080 | - | pp1500 | 110388 ± 250 |
| RTX-4080 | - | tg500 | 1136.64 ± 4.99 |
The GPU is still much faster than the CPU for prompt processing (although the difference, which is typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
I also have an M2-Max laptop (the version with a 30-core GPU). Here is what we get:
| backend | threads | test | t/s |
| ---------- | ------: | ------------: | ---------------: |
| M2-Max CPU | 8 | pp1500 | 5209.27 ± 21.48 |
| M2-Max CPU | 2 | tg500 | 692.87 ± 1.74 |
| M2-Max CPU | 4 | tg500 | 841.48 ± 5.96 |
| M2-Max CPU | 8 | tg500 | 894.73 ± 10.03 |
| M2-Max GPU | 4 | pp1500 | 25824 ± 562 |
| M2-Max GPU | 4 | tg500 | 464.86 ± 3.85 |
Also here the GPU is faster for PP (but just 5X faster), but the CPU wipes the floor with the GPU for TG, beating it close to 2X using all 8 threads, and 1.5X with just 2 threads!
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-09-02** at **13:20:54**:<br>
Now that we have efficient Flash Attention (FA) implementation on the CPU via PR #32, we can compare again performance between the CPU and GPU for this tiny 99M parameter model. We get
| model | size | params | backend | ngl | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ------------: | ---------------: |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | pp1500 | 156827.38 ± 727 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | tg500 | 1496.37 ± 36.79 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | pp1500 | 12133.80 ± 51.45 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | tg500 | 1509.52 ± 9.65 |
TG speed is now about the same, which is still quite remarkable.
FA has improved CPU prompt processing speed by almost 50%, TG by 22%.
> 👤 **saood06** replied the **2025-04-02** at **10:36:44**:<br>
> Is there a chance SpargeAttn could be implemented here. Code [here](https://github.com/thu-ml/SpargeAttn), Paper [here](https://arxiv.org/abs/2502.18137).
>
> If it could would it benefit speed on CPU?
>
> 👤 **ikawrakow** replied the **2025-04-02** at **13:44:09**:<br>
> Other than the paper, is there any evidence that this works as advertised? If I did nothing else but implementing breakthroughs announced on arXiv, the day still wouldn't have enough hours.
>
> 👤 **saood06** replied the **2025-04-03** at **00:24:39**:<br>
> >Other than the paper, is there any evidence that this works as advertised?
>
> Not really (there are multiple ComfyUI custom nodes that port support for it, but not much on people actually using it). The paper looked interesting to me and the idea makes sense, but the implementation they have looks premature. The same group put out SageAttention/SageAttention2, which has been widely adopted (mostly for image/video models) and where the performance matched the paper, but SpargeAttn has gotten interest without much adoption because of the state of the implementation.
>
> >If I did nothing else but implementing breakthroughs announced on arXiv, the day still wouldn't have enough hours.
>
> Sorry.
---
👤 **ikawrakow** replied the **2024-09-08** at **07:16:59**:<br>
With PR #42 we get this
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | ---------------: |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 12906.95 ± 61.04 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg512 | 1563.62 ± 12.55 |
I.e., 56% improvement for PP and 26% improvement for TG since the original post from Aug 13!
I see [PR-8151](https://github.com/ggerganov/llama.cpp/pull/8151), which provides dedicated quantization for the TriLM ternary models in mainline `llama.cpp`, has been merged. Here is what we get for `TQ2_0` that corresponds to our `IQ2_TN`
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | -------------------: |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 5187.34 ± 11.69 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | pp1500 | 5281.54 ± 53.33 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg500 | 1156.25 ± 18.14 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | tg500 | 1041.27 ± 21.30 |
Our version is 2.44X faster for PP and 35% faster for TG.

View File

@@ -0,0 +1,657 @@
### 🗣️ [#201](https://github.com/ikawrakow/ik_llama.cpp/discussions/201) - What is the NUMA situation ?
| **Author** | `bhugueney` |
| :--- | :--- |
| **Created** | 2025-02-11 |
| **Updated** | 2025-05-21 |
---
#### Description
It seems to me that, with output generation being memory-bandwidth bound and LLMs requiring a lot of RAM, a cheap way to increase both RAM capacity and bandwidth is to go for NUMA.
For instance, a dual-Epyc server can have 16 or 24 memory channels, and each CPU can also have up to 4 NUMA domains for the best theoretical performance (also, on Gen 2 Epyc at least, the L3 cache is shared only amongst cores on the same CCX).
However, there are many pitfalls to efficient NUMA programming, especially around minimizing cross-NUMA-domain memory and PCIe access.
It is my understanding that llama.cpp tries to avoid the most basic problems (e.g. allocating everything in one NUMA domain), but more work needs to be done.
[KTransformers](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md#some-explanations) just duplicates matrices on each NUMA domain !
[vLLM](https://docs.vllm.ai/en/latest/getting_started/installation/cpu/index.html#other-considerations) can do tensor parallelism on NUMA : «In general each NUMA node is treated as one GPU card. »
Is ik_llama.cpp NUMA aware? If not, are there plans to make it NUMA aware?
Thx!
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-02-11** at **06:09:03**:<br>
In `ik_llama.cpp`, being a fork of `llama.cpp`, the NUMA situation is the same as in `llama.cpp`.
Improving performance on NUMA systems is something I would be interested in looking into, but I don't have a dual socket system available (with enough memory bandwidth to make it interesting), and I'm just a lonely guy hacking here for fun without the resources to go and rent/buy such a system.
> 👤 **bhugueney** replied the **2025-02-11** at **10:56:00**:<br>
> Thx !
> I sure hope my message didn't come off as complaining: I'm very grateful for what you've already done!
> If you are interested I will try to provide you full access to my dual Epyc server with 16 × 64 GB of DDR4 @3200.
>
> 👤 **ikawrakow** replied the **2025-02-11** at **14:47:10**:<br>
> This would be of course great, but I'm hesitant to promise to tackle the NUMA issue right away.
>
> When you say "full access", you mean you are not going to be using the system while I'm using it? Which Epycs do you have?
>
> 👤 **bhugueney** replied the **2025-02-11** at **23:17:06**:<br>
> I'm not expecting any promises, especially as I'm afraid llama.cpp cannot be patched to become NUMA efficient. My (very) limited understanding is that people ran the llama.cpp CPU backend on NUMA and got bad performance because one thread was doing all the memory allocation (so in one NUMA domain), and they started trying to address that by patching the CPU backend. Unfortunately, such an approach seems doomed to hit a wall, as NUMA efficiency probably requires a different architecture, more like a multi-GPU backend with tensor parallelism, where each NUMA domain would be treated like a GPU with respect to minimizing inter-GPU communication and maximizing parallelism. This is the vLLM approach for NUMA, if I'm not mistaken.
>
> When I say "full access", I mean IPMI access while I'm not using it. But I have to figure things out first. Epycs would be 7R32 (same as AWS c5a instances).
>
> 👤 **saood06** replied the **2025-02-11** at **23:58:26**:<br>
> Regarding the current state of llama.cpp/ik_llama.cpp NUMA performance, I don't think it's that bad. I've seen reports from a few users on more modern NUMA machines than mine comparing multiple isolated instances of llama.cpp, one per NUMA domain, against one larger instance spanning all NUMA domains; although there was gain to be had, it wasn't a dramatic difference. My older NUMA machine also gets decent performance for its bandwidth.
>
> I'm looking into expert parallelism for the Deepseek V3/R1 MoE model, which should benefit NUMA systems. The plan for that is to port over the PR which allows you to specify which tensor is loaded onto which backend, and to change the tensor representation of this model so it does not consolidate the experts. At that point I'd test performance with each NUMA node on a separate RPC backend, since changing ik_llama.cpp to create a backend for each NUMA domain might require a lot more work, but I'd look into that once I get there.
---
👤 **saood06** replied the **2025-03-13** at **05:53:54**:<br>
There is actually a good discussion on mainline: https://github.com/ggml-org/llama.cpp/discussions/12088
They did test ik_llama.cpp (but only with a single NUMA node on a single CPU at Q8_0), where it still outperformed mainline for CPU-only inference.
Also you can look at zts9989's comment [here](https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2716225570) where he talks about NUMA and what llama.cpp could improve on after he found that "approximately 50% of CPU usage is spent on thread synchronization" when running Deepseek R1 with multiple numa nodes.
> 👤 **ikawrakow** replied the **2025-03-13** at **07:27:34**:<br>
> > They did test ik_llama.cpp (but in only with a single NUMA Node on a single CPU at Q8_0) where it still outperformed mainline for CPU only.
>
> Where can I find the test results?
>
> 👤 **saood06** replied the **2025-03-13** at **07:44:42**:<br>
> In the linked post the second table under 6980P Benchmarks has it, but pasting it here for reference:
>
> Quantization | Tokens/Second | NUMA Configuration
> -- | -- | --
> Q8_0 | 6.6 | 1x NUMA Node on 1x CPU ik_llama
> Q8_0 | 6.2 | 1x NUMA Node on 1x CPU
>
> This is the only published result for ik_llama but they do state "Keep an eye on [ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp) fork which has interesting optimizations." so they may run more.
>
> 👤 **saood06** replied the **2025-03-13** at **08:45:24**:<br>
> I forgot he had much more detailed results under Methodology and Notes; there is a section for ik_llama.cpp showing the command and bench numbers. Interestingly, ik_llama.cpp performance peaked at 128 threads for both PP and TG, compared to peaking at 86 threads for TG and 128 threads for PP in mainline. He also shares PP numbers, where ik_llama again shows better performance than mainline. He explicitly states TODO for testing ik_llama.cpp for 2x CPU Q8_0.
>
> Again pasting the segment of his post featuring ik_llama.cpp for reference:
>
> ```
> numactl -N 0 -m 0 \
> ./build/bin/llama-bench \
> --model /mnt/ai/models/unsloth/DeepSeek-R1-GGUF/DeepSeek-R1-Q8_0/DeepSeek-R1.Q8_0-00001-of-00015.gguf \
> --cache-type-k f16 \
> --cache-type-v f16 \
> --numa numactl \
> --threads 64,43,64,86,128,172
> ```
>
> **Results**
>
> model | size | params | backend | threads | test | t/s
> -- | -- | -- | -- | -- | -- | --
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | pp512 | 56.86 ± 7.21
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | tg128 | 4.86 ± 0.01
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 43 | pp512 | 40.62 ± 0.02
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 43 | tg128 | 3.69 ± 0.00
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | pp512 | 57.67 ± 4.62
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 64 | tg128 | 4.89 ± 0.00
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 86 | pp512 | 62.21 ± 13.63
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 86 | tg128 | 5.69 ± 0.00
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 128 | pp512 | 78.89 ± 21.46
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 128 | tg128 | 6.60 ± 0.00
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 172 | pp512 | 70.63 ± 0.58
> deepseek2 671B Q8_0 | 664.29 GiB | 671.03 B | CPU | 172 | tg128 | 5.05 ± 0.00
---
👤 **ikawrakow** replied the **2025-03-13** at **11:55:55**:<br>
@saood06
Thanks for alerting me to this thread.
They have tested the lowest performing configuration in https://github.com/ggml-org/llama.cpp/discussions/12088 (but this is also to be expected as I don't have any documentation on the new features, so one needs to go through the PRs to discover them).
For instance, here is a table for DeepSeek-Lite `pp512` performance on my Ryzen-7950X using `Q8_0`. The first row is the configuration used in https://github.com/ggml-org/llama.cpp/discussions/12088, the last is the best possible result for `pp512`. There is a 50% difference, so I wouldn't be surprised if it is possible to get 100+ t/s on their test system considering the 78 t/s they got with the vanilla settings.
| model | threads | fa | rtr | fmoe | test | t/s |
| ------------------- | ------: | -: | --: | ---: | ------------: | ---------------: |
| deepseek2 16B Q8_0 | 16 | 0 | 0 | 0 | pp512 | 433.04 ± 1.44 |
| deepseek2 16B Q8_0 | 16 | 1 | 0 | 0 | pp512 | 440.25 ± 2.54 |
| deepseek2 16B Q8_0 | 16 | 0 | 0 | 1 | pp512 | 441.58 ± 3.34 |
| deepseek2 16B Q8_0 | 16 | 1 | 0 | 1 | pp512 | 452.19 ± 1.21 |
| deepseek2 16B Q8_0 | 16 | 0 | 1 | 0 | pp512 | 607.32 ± 5.09 |
| deepseek2 16B Q8_0 | 16 | 1 | 1 | 0 | pp512 | 625.10 ± 7.66 |
| deepseek2 16B Q8_0 | 16 | 0 | 1 | 1 | pp512 | 627.87 ± 4.54 |
| deepseek2 16B Q8_0 | 16 | 1 | 1 | 1 | pp512 | 652.81 ± 3.52 |
TG is a very different story. There, performance is clearly dominated by memory access patterns and thread synchronization, and I cannot look into optimizing this aspect without having access to such a system. As it stands, the achieved performance is nowhere near the maximum theoretical performance. The tested 6980P has a theoretical bandwidth of 512? GiB/s, so 8X my Ryzen-7950X. I get `tg128 = 22.3 t/s` for `Q8_0` DeepSeek-Lite, which has ~15X fewer active parameters, so per napkin math we expect `8*22.3/15 = 11.9 t/s`, nearly 2X of what is being measured. In contrast, the 22.3 t/s for the `Q8_0` quantized DeepSeek-Lite on my Ryzen-7950X correspond to fetching model weights at a rate of about 57 GiB/s, so pretty close to the theoretical maximum (and I have never seen anything more than 60 GiB/s on the Ryzen-7950X, even for dense models, which is probably due to the few percent synchronization overhead).
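Writing that napkin estimate out explicitly (my shorthand: $BW$ = theoretical memory bandwidth, $A$ = active parameters, $t$ = TG speed; comparing `Q8_0` DeepSeek-Lite on the Ryzen to `Q8_0` R1 on the 6980P):

$$
t_{\text{R1, 6980P}} \approx t_{\text{Lite, 7950X}} \cdot \frac{BW_{\text{6980P}}}{BW_{\text{7950X}}} \cdot \frac{A_{\text{Lite}}}{A_{\text{R1}}} \approx 22.3 \cdot 8 \cdot \frac{1}{15} \approx 11.9 \ \text{t/s}
$$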
@ubergarm
Very interesting results, thank you for posting and including my little LLM inference playground in the results. I have seen a higher than usual amount of stars added to my repository in the last few days, I guess this must be due to your post.
I'm curious which `AVX512` extensions are supported by this CPU to understand if vanilla `AVX2` is being used, or the code optimized for the Zen4 core (requires `AVX512F, AVX512VNNI, AVX512VL, AVX512BW, AVX512DQ`).
Playing with some of the more advanced options that mainline `llama.cpp` does not have would be of course very interesting too.
> 👤 **saood06** replied the **2025-03-13** at **21:20:04**:<br>
> >I'm curious which AVX512 extensions are supported by this CPU to understand if vanilla AVX2 is being used, or the code optimized for the Zen4 core (requires AVX512F, AVX512VNNI, AVX512VL, AVX512BW, AVX512DQ).
>
> All of those extensions are supported (and also AVX512_fp16, which AMD does not support even on Zen 5). None of the normal sources I use for this have been updated to show Granite Rapids, but I did find [this](https://www.phoronix.com/image-viewer.php?id=intel-xeon-6980p-performance&image=intel_xeon_6980p_2_lrg). Granite Rapids was supposed to have support for Intel AVX10 (Version 1, or Intel AVX10.1), but that apparently did not happen.
>
> >I have seen a higher than usual amount of stars added to my repository in the last few days, I guess this must be due to your post.
>
> I've also seen an uptick in organic mentions of ik_llama.cpp recently and have done my best to help people understand all the new features and benefits.
>
> 👤 **ubergarm** replied the **2025-03-13** at **22:15:00**:<br>
> @ikawrakow
>
> > Very interesting results, thank you for posting and including my little LLM inference playground in the results.
>
> My pleasure, thanks for sharing your work. I've been tracking progress across various inference engines and stumbled onto yours from [this github pr discussion](https://github.com/ggml-org/llama.cpp/pull/12227#issuecomment-2708219642) about MLA and flash attention.
>
> > The tested 6980P has a theoretical bandwidth of 512? GiB/s
>
> Your back-of-the-napkin math is good: this machine, tested with `mlc` (Intel Memory Latency Checker), shows almost exactly 512 GiB/s per CPU socket within the same NUMA node. This is shown below with 1x NUMA node per CPU socket, i.e. BIOS set to `SNC=Disable`. Otherwise it has 3x nodes per CPU with an uneven number of cores hah...
>
> ```
> Measuring Memory Bandwidths between nodes within system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
> Numa node
> Numa node 0 1
> 0 554843.5 247793.1
> 1 247281.1 552385.5
> ```
>
> > Playing with some of the more advanced options that mainline llama.cpp does not have would be of course very interesting too.
>
> Yes, I'm playing with [ktransformers](https://github.com/ubergarm/r1-ktransformers-guide/) as well, but it has a hard requirement on GPU. Unfortunately, this 6980P rig has no GPU so I'm limited to CPU only testing.
>
> > so one needs to go through the PRs to discover them
>
> Correct, I have not gone through your branches and PRs to figure out the best combination of code and options for pure CPU inference using the various unsloth R1 671B GGUF quants.
>
> @saood06
>
> > Also you can look at zts9989's comment https://github.com/ggml-org/llama.cpp/pull/11397#issuecomment-2716225570 where he talks about NUMA and what llama.cpp could improve on after he found that "approximately 50% of CPU usage is spent on thread synchronization" when running Deepseek R1 with multiple numa nodes.
>
> Yes, this is the most optimized CPU implementation I've heard of to date. It seems unlikely they will release code directly to GitHub; they might share files via email, but I haven't asked.
>
> > All of those extensions are supported (and also AVX512_fp16
>
> Correct, I have the output of `lscpu` buried in the `Methodology and Notes` `<detail>` as you discovered. Copy pasted below for ease of reference. The three AMX Extensions specific flags unique to newer Intel Xeon are `amx_bf16` `amx_int8` `amx_tile`. Very interesting for DeepSeek is that Intel's next generation Diamond Rapids may support [`amx_fp8`](https://www.phoronix.com/news/Intel-AMX-FP8-In-LLVM). It's mildly annoying that older NVIDIA GPUs with capability <8.9 don't natively support fp8e4nv. This is required for [DeepSeek's Triton fp8_gemm implementation](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/kernel.py). Then the official DeepGemm implementation seems limited to [only 9.0 (H100s) hardware](https://github.com/deepseek-ai/DeepGEMM/issues/6) currently too afaict.
>
> Funny to see [guys with Dual 5090s whining](https://github.com/vllm-project/vllm/issues/14628#issuecomment-2720369467) that their stuff doesn't work yet haha....
>
> It seems llama.cpp main has some support for these, however I'm not completely sure that it speeds up token generation or if it needs a specific quant. It does seem to at least be compiled in and doing *something* on the `Q8_0` test:
>
> ```
> load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead
> ...
> load_tensors: AMX model buffer size = 18214.39 MiB
> load_tensors: CPU_Mapped model buffer size = 45565.90 MiB
> ...
> ```
>
> I don't believe I noticed these debug logs when I tested `ik_llama.cpp@a48e1632` by simply compiling main branch with no special new arguments.
>
> Quoting [@aubreyli](https://github.com/ggml-org/llama.cpp/discussions/12088#discussioncomment-12469251)
> > AMX tile config is [here](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cpu/amx/mmq.cpp#L168) in llama.cpp And AMX MUL_MAT is [here](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cpu/amx/mmq.cpp#L2369)
> >
> > If the tensor OP type is GGML_OP_MUL_MAT, it will be invoked on Intel AMX supported platform.
>
> I have more time soon with access to this dual 6980P if you have a specific branch, feature, or quant configuration suggestion for me to try or point me to a branch or PR and I can read-up on it to test and benchmark.
>
> Thanks!
>
> ```
> ## CPU
> $ lscpu | grep Xeon
> Model name: Intel(R) Xeon(R) 6980P
>
> ## CPU Flags
> $ lscpu | grep Flags
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cat_l2 cdp_l3 intel_ppin cdp_l2 ssbd mba ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid cqm rdt_a avx512f avx512dq rdseed adx smap avx512ifma clflushopt clwb intel_pt avx512cd sha_ni avx512bw avx512vl xsaveopt xsavec xgetbv1 xsaves cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local split_lock_detect user_shstk avx_vnni avx512_bf16 wbnoinvd dtherm ida arat pln pts hwp hwp_act_window hwp_epp hwp_pkg_req vnmi avx512vbmi umip pku ospke waitpkg avx512_vbmi2 gfni vaes vpclmulqdq avx512_vnni avx512_bitalg tme avx512_vpopcntdq la57 rdpid bus_lock_detect cldemote movdiri movdir64b enqcmd fsrm md_clear serialize tsxldtrk pconfig arch_lbr ibt amx_bf16 avx512_fp16 amx_tile amx_int8 flush_l1d arch_capabilities
> ```
>
> 👤 **saood06** replied the **2025-03-13** at **22:51:58**:<br>
> > > Playing with some of the more advanced options that mainline llama.cpp does not have would be of course very interesting too.
> >
> > Yes, I'm playing with [ktransformers](https://github.com/ubergarm/r1-ktransformers-guide/) as well, but it has a hard requirement on GPU. Unfortunately, this 6980P rig has no GPU so I'm limited to CPU only testing.
>
> When you do have a machine with a GPU, ik_llama.cpp can also make use of it in a similar way by offloading select tensors to the GPU. The implementation here is a lot more flexible, but that comes at the cost of knowing what tensors to offload. I would be really interested to see how performance stacks up against ktransformers on the same machine, with both offloading to the GPU.
>
> > Correct, I have not gone through your branches and PRs to figure out the best combination of code and options for pure CPU inference using the various unsloth R1 671B GGUF quants.
>
> There is no single best configuration. MLA offers significantly better TG performance at long contexts, but it comes at the cost of PP (as MLA is inherently more compute intensive). ikawrakow has done a lot of optimizations to help recover that PP performance, and I think the best for MLA currently is `-mla 2 -fa`. The `-fmoe` and `-rtr` flags also improve performance. (There might be a caveat with `-rtr`: it disables mmap and may do non-optimal things with where memory is allocated. I personally repack my quants and do not use the `-rtr` flag.)
>
> >It's mildly annoying that older NVIDIA GPUs with capability <8.9 don't natively support fp8e4nv. This is required for [DeepSeek's Triton fp8_gemm implementation](https://github.com/deepseek-ai/DeepSeek-V3/blob/main/inference/kernel.py). Then the official DeepGemm implementation seems limited to [only 9.0 (H100s) hardware](https://github.com/deepseek-ai/DeepGEMM/issues/6) currently too afaict.
>
> I'm also annoyed by that as I have a 3090 and torch compile on fp8 stuff just errors instead of up casting.
>
>
> > It seems llama.cpp main has some support for these, however I'm not completely sure that it speeds up token generation or if it needs a specific quant. It does seem to at least be compiled in and doing _something_ on the `Q8_0` test:
> >
> > ```
> > load_tensors: tensor 'token_embd.weight' (q8_0) (and 54 others) cannot be used with preferred buffer type AMX, using CPU instead
> > ...
> > load_tensors: AMX model buffer size = 18214.39 MiB
> > load_tensors: CPU_Mapped model buffer size = 45565.90 MiB
> > ...
> > ```
> >
> > I don't believe I noticed these debug logs when I tested `ik_llama.cpp@a48e1632` by simply compiling main branch with no special new arguments.
> >
> > Quoting [@aubreyli](https://github.com/ggml-org/llama.cpp/discussions/12088#discussioncomment-12469251)
> >
> > > AMX tile config is [here](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cpu/amx/mmq.cpp#L168) in llama.cpp And AMX MUL_MAT is [here](https://github.com/ggml-org/llama.cpp/blob/master/ggml/src/ggml-cpu/amx/mmq.cpp#L2369)
> > > If the tensor OP type is GGML_OP_MUL_MAT, it will be invoked on Intel AMX supported platform.
> >
>
> AMX support was added to llama.cpp after ik_llama.cpp last merged mainline. Some things are easy to port into ik_llama.cpp, others are more difficult. I have not looked into it, and I also don't know how much value it would add given how much ik_llama.cpp overhauls the backend anyway.
>
> > I have more time soon with access to this dual 6980P if you have a specific branch, feature, or quant configuration suggestion for me to try or point me to a branch or PR and I can read-up on it to test and benchmark.
>
> I'll leave requests to @ikawrakow, but I think his table above showing off `-fa`, `-rtr`, and `-fmoe` shows the benefits of those arguments. This PR https://github.com/ikawrakow/ik_llama.cpp/pull/246 has a good summary of the MLA and FA options, and this latest PR shows the most recent numbers and latest optimization: https://github.com/ikawrakow/ik_llama.cpp/pull/253
---
👤 **saood06** replied the **2025-03-25** at **03:29:01**:<br>
@ubergarm (thought you might also be interested in this).
>[KTransformers](https://github.com/kvcache-ai/ktransformers/blob/main/doc/en/DeepseekR1_V3_tutorial.md#some-explanations) just duplicates matrices on each NUMA domain !
Someone has shared code that can duplicate the model for NUMA benefits on llama.cpp:
https://github.com/ggml-org/llama.cpp/discussions/12289
>TLDR: Replicate models on each NUMA. On my platform, pure CPU inference of QwQ-32B FP16 improved from ~6.6 token/s to ~10.7 token/s, and DeepSeek R1 671B Q8 from ~7.2 token/s to ~9.7 token/s. You can find the modified llama.cpp version [here](https://github.com/vproxy-tools/llama.cpp).
The downside of duplicating the model is pretty heavy, but this approach obviously avoids any non-local memory access, and it shows the upper bound on the performance that could be gained from other solutions that reduce or remove non-local memory access.
Looking at the codebase, I think it currently only works for dual-socket nodes. I would have been more interested in testing it, but none of my machines (even the very unstable quad-socket 1 TB node that I haven't turned on in a long time) has enough RAM to replicate my preferred quant of R1; I'd have to use one under 192 GB (I do still have my IQ1_S_R4 V2, which is 129 GB).
> 👤 **ubergarm** replied the **2025-03-25** at **15:58:04**:<br>
> Super, I just fetched this fork and will take a peek.
>
> > The downside of duplicating the model is pretty heavy
>
> Yeah, it is *so much* RAM!
>
> Probably easiest to set BIOS `NPS1` on dual-socket AMD Epyc, or `SNC=Disable` on newer Intel Xeon, to get exactly 2 big NUMA nodes (one per CPU socket). Ideally you would have the largest number of individual NUMA nodes to maximize performance, but then the RAM per node is too small to fit the bigger models.
>
> Also [mingfeima](https://github.com/mingfeima) left an [interesting comment](https://github.com/ggml-org/llama.cpp/issues/12003#issuecomment-2731572966) recently discussing some of the intel specific optimizations and work he's doing on sglang.
>
> Finally, I recently saw Wendell of [level1techs youtube channel do a video](https://www.youtube.com/watch?v=kOh04PhXqmY) about quad socket Intel Xeon. Seems like it could be configured into 8 individual NUMA nodes with 1TB each possibly? Talk about wasting RAM, but would be fun to try haha...
>
> 👤 **saood06** replied the **2025-03-27** at **07:24:15**:<br>
> >Super, I just fetched this fork and will take a peek.
>
> Did you ever test it?
---
👤 **ikawrakow** replied the **2025-03-25** at **16:06:42**:<br>
> Ideally you would have the most number of individual NUMA nodes to maximize performance,
Why?
> 👤 **ubergarm** replied the **2025-03-25** at **16:14:54**:<br>
> Intel Memory Latency Checker (`mlc`) benchmarks suggest that memory local to the compute on a specific NUMA node gives the best bandwidth and latency.
>
> My thinking is that duplicating weights into each NUMA node and having local threads working with that RAM would maximize performance.
>
> However, I'm not fully aware of the other implications of combining computations for the final results in this "data parallel" situation. I've only read about "all reduce" in GPU specific implementations suggesting `nvlink` or `p2p` or RDMA infiniband networking is required for those "tensor parallel" implementations.
>
> For now I'd be happy to configure each CPU socket as a single numa node in BIOS as that would probably be good enough and more likely to have enough RAM to fit bigger models. So data parallel = number CPU sockets = (probably 2 for most folks)
---
👤 **ikawrakow** replied the **2025-03-25** at **16:24:17**:<br>
Sure, that would be if you wanted to squeeze out the last bit of performance. But we are not at that stage. Instead, we are a factor of 2 or more away from what should be possible. Having 2 big NUMA nodes would make the distribution of weights much easier: simply change the weight loading to use two threads, each pinned to a specific NUMA node, and each loading half of the tensor data. During inference pin half the threads to run on the 1st NUMA node, and the other half to the second NUMA node. My thinking is that this should give a significant boost in performance without replicating the model on both NUMA nodes. It is of course possible to do stuff such as this with several NUMA nodes, but it makes things way more complicated. So, I'm thinking that the 1st step should be to get better performance with 2 NUMA nodes. But if you are telling me that this is very far from ideal, and that the only way to get better performance is to enable and utilize all NUMA nodes, then it is a waste of time to implement the simple approach described above.
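To make that concrete, here is a minimal sketch of the idea, assuming Linux with libnuma and C++ threads; the buffer and the "loading" are stand-ins, not actual `ik_llama.cpp` code:

```cpp
// Sketch only: load half of a big weight buffer from a thread pinned to each NUMA node,
// relying on the kernel's first-touch policy so each half ends up in memory local to
// the node that will later compute on it.
// Build with: g++ -std=c++17 numa_sketch.cpp -lnuma -lpthread
#include <numa.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>
#include <thread>

static void pin_to_node(int node) {
    // Restrict the calling thread to the CPUs of `node` and prefer its memory.
    struct bitmask *cpus = numa_allocate_cpumask();
    numa_node_to_cpus(node, cpus);
    numa_sched_setaffinity(0, cpus);
    numa_set_preferred(node);
    numa_free_cpumask(cpus);
}

int main() {
    if (numa_available() < 0) { std::fprintf(stderr, "NUMA not available\n"); return 1; }

    const size_t n = size_t(1) << 28;   // stand-in for a big tensor (1 GiB of floats)
    float *weights = static_cast<float *>(std::malloc(n * sizeof(float)));  // pages not yet touched

    auto load_half = [weights](int node, size_t begin, size_t end) {
        pin_to_node(node);
        // "Loading" = first touch; real code would read the tensor data from the GGUF file here.
        std::memset(weights + begin, 0, (end - begin) * sizeof(float));
    };

    std::thread t0(load_half, 0, size_t(0), n / 2);
    std::thread t1(load_half, 1, n / 2, n);
    t0.join(); t1.join();

    // During inference: pin half the worker threads to node 0 and give them only rows
    // living in weights[0 .. n/2), the other half to node 1 and rows in weights[n/2 .. n),
    // so each thread mostly reads node-local memory.
    std::free(weights);
    return 0;
}
```

The only point of the sketch is the first-touch pattern: whichever pinned thread touches a page first determines which memory bank it lands in.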
> 👤 **ubergarm** replied the **2025-03-25** at **16:36:46**:<br>
> > that would be if you wanted to squeeze out the last bit of performance. But we are not at that stage.
>
> Yes, I agree on both points.
>
> > I'm thinking that the 1st step should be to get better performance with 2 NUMA nodes
>
> Again, I agree. My understanding is ktransformers `USE_NUMA=1` compilation flag is for 2 NUMA nodes. Also the [discussion/fork saood06 linked](https://github.com/ggml-org/llama.cpp/discussions/12289) seems to be specific to 2 NUMA nodes.
>
> Going for exactly 2 NUMA nodes is also good because:
> 1. Most AMD Epyc BIOS dual socket boards likely support `NPS1` for exactly 2 NUMA Nodes
> 2. Newer Intel Xeon BIOS dual socket boards supports `SNC=Disable`for exactly 2 NUMA Nodes
>
> No need to worry about rare brand new quad socket intel xeon boards or more smaller NUMA nodes currently imo.
>
> I'll try to find my `mlc` benchmarks and post here, as the bandwidth is still pretty good converting a single CPU into 1 NUMA node.
>
> 👤 **ubergarm** replied the **2025-03-25** at **16:52:11**:<br>
> #### intel `mlc`
>
> Configuring BIOS to `SNC=Disable` to collapse 3x NUMA nodes per CPU socket into a single NUMA node per 6980P socket gives similar enough RAM bandwidth/latency performance.
>
> So probably not worth trying to support more than 2 NUMA nodes "data parallel" type feature assuming other systems perform similarly.
>
> <details>
>
> <summary>Dual Socket Intel Xeon 6980P `SNC=Auto/Enabled`</summary>
> This gives 6x total NUMA nodes (3x per CPU socket).
>
> ```
> Intel(R) Memory Latency Checker - v3.11b
> Measuring idle latencies for sequential access (in ns)...
> Numa node
> Numa node 0 1 2 3 4 5
> 0 138.7 168.0 208.5 394.1 475.2 445.1
> 1 160.3 134.4 170.4 415.2 448.2 479.7
> 2 156.2 123.6 106.5 507.8 513.2 452.5
> 3 396.0 476.0 445.6 102.0 129.4 157.5
> 4 419.7 452.6 421.2 122.1 102.4 130.2
> 5 445.4 449.5 392.4 148.3 122.3 103.8
>
> Measuring Peak Injection Memory Bandwidths for the system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using traffic with the following read-write ratios
> ALL Reads : 1126026.6
> 3:1 Reads-Writes : 972377.5
> 2:1 Reads-Writes : 933247.3
> 1:1 Reads-Writes : 927164.2
> Stream-triad like: 939630.2
>
> Measuring Memory Bandwidths between nodes within system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
> Numa node
> Numa node 0 1 2 3 4 5
> 0 187911.4 188622.8 188716.9 94137.8 93596.5 93730.5
> 1 188260.8 188176.4 188653.1 94495.4 90659.3 93774.2
> 2 188624.6 188626.7 188129.6 94509.6 27886.4 93792.7
> 3 94161.1 93415.7 94558.3 187851.4 188418.6 188691.9
> 4 94201.1 91712.7 94546.8 188169.2 188067.6 188544.2
> 5 94183.2 44861.0 94241.8 188416.4 188380.0 187933.8
>
> Measuring Loaded Latencies for the system
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
> Inject Latency Bandwidth
> Delay (ns) MB/sec
> ==========================
> 00000 378.26 1125007.8
> 00002 381.36 1125706.3
> 00008 382.90 1125594.5
> 00015 381.40 1128101.6
> 00050 377.79 1129501.1
> 00100 296.51 1117783.2
> 00200 301.72 1122699.0
> 00300 207.87 1017250.0
> 00400 170.76 782113.4
> 00500 157.40 665276.4
> 00700 138.25 488635.4
> 01000 128.65 349546.6
> 01300 125.55 271876.5
> 01700 123.93 209644.5
> 02500 116.19 143990.9
> 03500 120.17 103477.5
> 05000 119.53 72875.8
> 09000 113.89 40898.3
> 20000 115.14 18113.6
>
> Measuring cache-to-cache transfer latency (in ns)...
> Local Socket L2->L2 HIT latency 80.5
> Local Socket L2->L2 HITM latency 80.9
> Remote Socket L2->L2 HITM latency (data address homed in writer socket)
> Reader Numa Node
> Writer Numa Node 0 1 2 3 4 5
> 0 - 99.3 124.9 376.2 401.7 429.5
> 1 108.8 - 100.9 452.1 425.7 422.2
> 2 131.0 103.8 - 435.5 407.4 378.1
> 3 372.3 393.3 423.4 - 101.2 125.6
> 4 444.2 414.2 413.5 106.3 - 100.9
> 5 429.5 399.3 374.0 130.3 106.1 -
> Remote Socket L2->L2 HITM latency (data address homed in reader socket)
> Reader Numa Node
> Writer Numa Node 0 1 2 3 4 5
> 0 - 109.6 140.2 381.2 444.0 440.0
> 1 106.9 - 110.8 405.8 414.7 411.6
> 2 137.1 103.8 - 436.3 442.6 381.2
> 3 380.8 441.6 439.1 - 110.6 139.5
> 4 406.3 412.7 411.6 105.8 - 110.7
> 5 436.7 440.5 381.2 136.3 105.9 -
>
> ```
>
> </details>
>
> ---
>
> <details>
>
> <summary>Dual Socket Intel Xeon 6980P `SNC=Disabled`</summary>
>
> This gives 2x total NUMA nodes (1x per CPU socket).
>
> ```
> Intel(R) Memory Latency Checker - v3.11b
> Measuring idle latencies for sequential access (in ns)...
> Numa node
> Numa node 0 1
> 0 130.7 449.2
> 1 410.0 129.4
>
> Measuring Peak Injection Memory Bandwidths for the system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using traffic with the following read-write ratios
> ALL Reads : 1108235.0
> 3:1 Reads-Writes : 972151.5
> 2:1 Reads-Writes : 940099.8
> 1:1 Reads-Writes : 928269.2
> Stream-triad like: 918997.2
>
> Measuring Memory Bandwidths between nodes within system
> Bandwidths are in MB/sec (1 MB/sec = 1,000,000 Bytes/sec)
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
> Numa node
> Numa node 0 1
> 0 554843.5 247793.1
> 1 247281.1 552385.5
>
> Measuring Loaded Latencies for the system
> Using all the threads from each core if Hyper-threading is enabled
> Using Read-only traffic type
> Inject Latency Bandwidth
> Delay (ns) MB/sec
> ==========================
> 00000 357.28 1106966.8
> 00002 362.94 1108392.3
> 00008 363.07 1107547.6
> 00015 360.97 1104844.6
> 00050 359.09 1102679.2
> 00100 307.11 1099803.6
> 00200 320.42 1105411.1
> 00300 231.07 1007100.3
> 00400 188.93 789261.0
> 00500 174.05 665122.5
> 00700 158.95 487463.0
> 01000 150.90 349530.7
> 01300 148.47 271576.2
> 01700 146.67 209392.6
> 02500 144.40 143857.9
> 03500 142.66 103386.9
> 05000 140.57 72810.8
> 09000 139.24 40768.0
> 20000 138.79 18002.4
>
> Measuring cache-to-cache transfer latency (in ns)...
> Local Socket L2->L2 HIT latency 179.7
> Local Socket L2->L2 HITM latency 180.2
> Remote Socket L2->L2 HITM latency (data address homed in writer socket)
> Reader Numa Node
> Writer Numa Node 0 1
> 0 - 433.3
> 1 413.7 -
> Remote Socket L2->L2 HITM latency (data address homed in reader socket)
> Reader Numa Node
> Writer Numa Node 0 1
> 0 - 425.0
> 1 422.4 -
> ```
>
> </details>
>
> ## References
> * [Additional Benchmarks and discussions on Phoronix](https://www.phoronix.com/review/xeon-6980p-snc3-hex)
>
> 👤 **saood06** replied the **2025-03-25** at **18:09:30**:<br>
> > During inference pin half the threads to run on the 1st NUMA node, and the other half to the second NUMA node.
>
> The problem is not splitting the model, it is ensuring that the data any given thread works on is stored local to its NUMA node.
>
> This PR: https://github.com/ggml-org/llama.cpp/pull/6915 made it difficult as mentioned here: https://github.com/ggml-org/llama.cpp/issues/1437#issuecomment-2095809308
>
> Maybe you could use [this](https://www.intel.com/content/www/us/en/docs/dpcpp-cpp-compiler/developer-guide-reference/2023-0/thread-affinity-interface.html#LOW_LEVEL_AFFINITY_API) so that each thread could change its affinity to a core on the correct NUMA node (this would also work, since I don't think this would otherwise be compatible with `--numa interleave`; but I'm not sure, it has been a long time since I looked into that).
>
> 👤 **ikawrakow** replied the **2025-03-25** at **18:17:01**:<br>
> There is no dynamic thread scheduling here. No thread pools either.
>
> In my experience from the past, touching memory from a thread running on a NUMA node automatically makes the actual data get stored in a memory bank local to that node. The difficulty will be more in fighting with the almighty `ggml` backend than anything else.
>
> 👤 **ikawrakow** replied the **2025-03-25** at **18:26:08**:<br>
> Dynamic thread scheduling does help for PP with big enough batch sizes. It would also help on systems with a mix of P/E cores (although, if mainline `llama.cpp` has that, I notice absolutely zero benefit on my M2-Max. Performance there is still best with 8 threads, not 12). But for TG with all same cores the overhead of thread synchronization for work stealing is typically too high to have benefit. Maybe it is different for a humongous model such as DeepSeek-R1? But then again, it has nearly 4X the number of nodes in the compute graph, so the work per node is not that much higher than DeepSeek-Lite.
>
> 👤 **saood06** replied the **2025-03-25** at **18:36:09**:<br>
> > There is no dynamic thread scheduling here. No thread pools either.
>
> @bmtwl
>
> You said
>
> >The problem at that time was the thread allocation code didn't have any way to ascertain which numa node it was running on or what numa node the tensors it was going to be working on was pinned to.
> >[...]
> >I'm still very interested in this and want to take another stab at it, but haven't been able to work up the will to try again yet.
>
> Do you think you'd want to attempt it in this repo as there is no dynamic scheduling or threadpool here?
---
👤 **ubergarm** replied the **2025-03-30** at **17:25:05**:<br>
Oh I see a benchmark in the wild attempting to benchmark that [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp) NUMA data parallel code against ik fork: https://github.com/ggml-org/llama.cpp/discussions/12289#discussioncomment-12668490
> It seems clear that porting the mirror impl. to the ik fork should make the best available version.
Not sure the details of how they are running it though...
> 👤 **saood06** replied the **2025-03-30** at **20:58:05**:<br>
> > Oh I see a benchmark in the wild attempting to benchmark that [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp) NUMA data parallel code against ik fork: [ggml-org/llama.cpp#12289 (comment)](https://github.com/ggml-org/llama.cpp/discussions/12289#discussioncomment-12668490)
> >
> > Not sure the details of how they are running it though...
>
> Thanks for the link, I agree it would be nice if they included more details.
>
> 👤 **ubergarm** replied the **2025-03-30** at **21:14:31**:<br>
> Yeah, I gave it a try and while it did run it wasn't allocating threads on both NUMA nodes so I gave up for now after posting my logs.
>
> 👤 **saood06** replied the **2025-03-30** at **21:34:22**:<br>
> > Yeah, I gave it a try and while it did run it wasn't allocating threads on both NUMA nodes so I gave up for now after posting my logs.
>
> Did you try running it with numactl on just 2 NUMA nodes? There is also an issue tracker for [vproxy-tools/llama.cpp](https://github.com/vproxy-tools/llama.cpp/issues) where you could report that.
---
👤 **bhugueney** replied the **2025-04-08** at **10:24:55**:<br>
I currently settle for running my DeepSeek V3 model on just one NUMA node / socket of my dual-socket system. However, while investigating the draft-model situation, it occurred to me that it should be relatively easy to specify cores for the main model (on one socket) and other cores (in my case on the other socket/NUMA node) for the draft model, as communication between the two should be minimal.
What do people think about it?
---
👤 **saood06** replied the **2025-05-20** at **08:37:01**:<br>
On my dual-socket machine, using https://github.com/intel/pcm, this is what it looks like during PP:
| | READ (GB) | WRITE (GB) | LOCAL | CPU energy | DIMM energy | LLCRDMISSLAT (ns) | UncFREQ (Ghz) |
|------------|-------|-------|-------|------------|-------------|-------------------|---------------|
| Socket - 0 | 7.93 | 3.60 | 49 % | 96.90 | 23.78 | 365.82 | 2.30 |
| Socket - 1 | 2.56 | 1.55 | 46 % | 89.43 | 18.93 | 436.65 | 2.21 |
| Total | 10.50 | 5.15 | 48 % | 186.32 | 42.71 | 400.13 | 2.25 |
And during TG:
| | READ (GB) | WRITE (GB) | LOCAL | CPU energy | DIMM energy | LLCRDMISSLAT (ns) | UncFREQ (Ghz) |
|------------|-------|-------|-------|------------|-------------|-------------------|---------------|
| Socket - 0 | 16.22 | 0.55 | 90 % | 134.39 | 26.05 | 219.40 | 2.68 |
| Socket - 1 | 14.74 | 0.15 | 95 % | 133.64 | 25.46 | 214.65 | 2.77 |
| Total | 30.96 | 0.70 | 92 % | 268.02 | 51.52 | 216.97 | 2.73 |
---
👤 **VinnyG9** replied the **2025-05-21** at **04:15:29**:<br>
Just sharing: I tried all snoop modes on my X99 dual board and got a 200-300% boost vs. stock BIOS settings. This setting is also available on Xeon Scalable, FWIW.
## stock bios
| model | size | params | backend | ngl | threads | fa | rtr | fmoe | test | t/s |
| ----------------------------------- | ----------: | --------: | --------- | ----: | --------: | ---: | ----: | -----: | -------: | ---------------: |
| ============ Repacked 337 tensors | | | | | | | | | | |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp256 | 108.42 ± 1.82 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp512 | 123.10 ± 1.64 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp1024 | 118.61 ± 1.67 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg128 | 12.28 ± 0.03 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg256 | 12.17 ± 0.06 |
## home snoop w/ dir OSB
| model | size | params | backend | ngl | threads | fa | rtr | fmoe | test | t/s |
| ----------------------------------- | ----------: | --------: | --------- | ----: | --------: | ---: | ----: | -----: | ------: | ----------------: |
| ============ Repacked 337 tensors | | | | | | | | | | |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp64 | 173.70 ± 16.62 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp128 | 235.53 ± 19.14 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp256 | 270.99 ± 7.79 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | pp512 | 263.82 ± 6.02 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg64 | 31.61 ± 1.01 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg128 | 34.76 ± 1.54 |
| qwen3moe ?B Q4_K - Medium | 16.49 GiB | 30.53 B | CUDA | 0 | 31 | 1 | 1 | 1 | tg256 | 35.70 ± 0.34 |
> 👤 **ubergarm** replied the **2025-05-21** at **14:26:30**:<br>
> Wow, big gains! I'd never heard of "snoop" mode, but don't have a lot of intel server experience:
>
> > DIR+OSB mode allows for low local memory latency, high local memory bandwidth and I/O directory cache to reduce directory update overheads for I/O accesses.
>
> Are you running hybrid CPU+GPU CUDA offloading some layers? I forget your exact system specs and VRAM, but if you can offload the whole thing it can go quite faster psure. Also, if I'm running CPU/RAM *only* I generally recompile and disable CUDA backend fwiw.
>
> Glad you're having fun tweaking and tuning!
>
> 👤 **VinnyG9** replied the **2025-05-21** at **18:07:27**:<br>
> > Wow, big gains! I'd never heard of "snoop" mode, but don't have a lot of intel server experience:
> >
> > > DIR+OSB mode allows for low local memory latency, high local memory bandwidth and I/O directory cache to reduce directory update overheads for I/O accesses.
> >
> > Are you running hybrid CPU+GPU CUDA offloading some layers? I forget your exact system specs and VRAM, but if you can offload the whole thing it can go quite faster psure. Also, if I'm running CPU/RAM _only_ I generally recompile and disable CUDA backend fwiw.
> >
> > Glad you're having fun tweaking and tuning!
>
> I saw ik recommending it, so I tried disabling the CUDA build for CPU inference, but up to the 2k tokens max I tested it was slower, no idea why.
> Snoop mode is a NUMA thing, but it also helped single-CPU inference by ~10-30%; I see a nice boost in Intel MLC too, like 116 -> 140 GB/s.
> Hybrid inference only saw a ~10% TG increase (offloading about 40% of the weights).
>
> Qwen3 dense got a 90% boost
View File
@@ -0,0 +1,283 @@
### 🗣️ [#211](https://github.com/ikawrakow/ik_llama.cpp/discussions/211) - help me create an importance matrix primer
| **Author** | `robbiemu` |
| :--- | :--- |
| **Created** | 2025-02-19 |
| **Updated** | 2025-02-22 |
---
#### Description
this primer, if I am honest is mostly about the related main stream llama.cpp project, but the details are so general I think it generally applies. I was hoping @ikawrakow you might review this and help me to track down gaps and errors, before I release a final version. (I'm the [llama-gguf-optimize](https://github.com/robbiemu/llama-gguf-optimize) guy interested in language preservation, btw -- hello again! ).
(version: 0.3)
# importance matrices in Llama.cpp
## Architectural Design of Importance Matrices in Llama.cpp
Quantization reduces the precision of neural network weights and activations, lowering memory usage and computational costs. Early calibration methods, such as min-max scaling, determined quantization ranges based on observed activation values. Modern calibration-based methods typically select quantization parameters, such as scaling factors and offsets, by analyzing the network's data distributions to improve accuracy.
### Background: On Quantization
The development of techniques to quantify weight importance in neural networks has roots in **network pruning**. This will introduce a Hessian related to the model's weights and performance, so it should be defined first.
The Hessian matrix $H$ is defined as the matrix of **second-order partial derivatives** of the loss $\mathcal{L}$ (like MSE, minimized during training, which compares _model outputs_ to target values) with respect to the model's weights, with entries $H_{ij} = \frac{\partial^2 \mathcal{L}}{\partial w_i \partial w_j}$. This Hessian effectively measures the local curvature of the error surface during training. Its eigenvalues and eigenvectors reveal the directions of greatest sensitivity in parameter space. A large value means the loss changes rapidly when that weight is modified (high curvature), while a small value indicates the loss is relatively flat with respect to that weight.
#### Network Pruning: Optimal Brain Damage and Optimal Brain Surgeon
Network pruning aims to remove redundant or non-essential weights without significantly degrading model performance. Early foundational work, such as **Optimal Brain Damage (OBD)** (LeCun et al., 1990) and **Optimal Brain Surgeon (OBS)** (Hassibi & Stork, 1993), formalized this process using second-order derivatives of the loss function.
1. **Optimal Brain Damage (OBD):**
OBD approximates the sensitivity of the loss to weight removal by leveraging a **diagonal Hessian matrix**. The importance of a weight $w_i$ is computed as:
$$
\mathcal{I}_i = \frac{1}{2} w_i^2 \cdot H_{ii},
$$
where $H_{ii}$ is the second derivative of the loss with respect to $w_i$. This diagonal approximation assumes that interactions between weights (off-diagonal Hessian terms) are negligible, drastically reducing computational complexity.
2. **Optimal Brain Surgeon (OBS):**
OBS generalizes OBD by incorporating the **full Hessian matrix**, capturing cross-interactions between weights. The saliency $\mathcal{S}_q$ of removing weight $w_q$ is given by:
$$
\mathcal{S}_q = \frac{w_q^2}{2 [H^{-1}]_{qq}},
$$
where $[H^{-1}]_{qq}$ is the inverse Hessian's diagonal entry for $w_q$. While more accurate, computing and inverting the full Hessian is computationally prohibitive for modern deep networks, limiting OBS's practicality.
Both methods link weight importance to the curvature of the loss landscape in a global matrix of model weights. A weight with a large $H_{ii}$ (steep curvature) is highly sensitive—even small perturbations may destabilize the model. Conversely, a flat curvature ($H_{ii} \approx 0$) implies robustness to changes.
#### Hessian-Based Sensitivity Analysis
Exact Hessian computation is often infeasible for large networks due to its $O(N^2)$ memory cost (where $N$ is the number of weights).
In quantization, the goal is analogous to pruning: allocate higher precision (bits) to weights that most influence model output.
- **Sensitivity Metric for Quantization:**
The expected change to the loss from quantizing $w_i$ can be approximated as:
$$
\Delta \mathcal{L} \approx \frac{1}{2} \sum_i H_{ii} (\Delta w_i)^2,
$$
where $\Delta w_i$ is the quantization error (essentially $q_i - w_i$ in the llama.cpp-specific formulation discussed later). To minimize $\Delta \mathcal{L}$, weights with large $H_{ii}$ (high sensitivity) should have smaller $\Delta w_i$, achieved by allocating more bits.
In practice, gradient methods such as the **Fisher information matrix** (computed from first-order gradients as $F = \mathbb{E}[\nabla \mathcal{L} \nabla \mathcal{L}^T]$) are often used instead. The FIM avoids second-derivative computations but assumes the loss is well-approximated by a probabilistic model (it equals the Hessian exactly when the loss is the negative log-likelihood of a probabilistic model, like cross-entropy loss; for other losses, it is an approximation). In such a framework, a small gradient for a given weight indicates that even a large change in that weight has little effect on the model's performance. Conversely, a large gradient suggests that even a small change could have a significant impact. Squaring these gradients provides a measure of importance for each weight. However, there are two major drawbacks when applying this approach to llama.cpp:
1. **Limited Training Capabilities:**
llama.cpp does not currently support the full training regime required to reliably compute these gradients, which includes both the activations and the loss's error signal.
2. **Memory Overhead:**
The resulting importance matrix is large — at minimum, its size matches that of the model, and when using fp32 gradients, it can be nearly twice as large.
## Llama.cpp fundamentals
To overcome these challenges, llama.cpp employs an alternative that leverages readily available activation statistics rather than gradients. Consider a single row from a model tensor, whose weights are denoted by $w_j$. This row interacts with a column of activations (or embeddings) $a_j$ produced by preceding network layers. The dot product of the weight row with the activation column yields one element of the subsequent activation matrix.
Now, suppose we quantize this tensor row to obtain quantized weights $q_j$. To minimize the quantization error on the resulting activations, we define an error function:
$$
F = \left(\sum_{j} (q_j - w_j) \, a_j\right)^2.
$$
Taking the derivative of $F$ with respect to a particular quantized weight $q_i$ gives:
$$
\frac{\partial F}{\partial q_i} = \sum_{j} a_i \, a_j \, (q_j - w_j).
$$
Averaging this expression over a representative dataset, we obtain:
$$
\sum_{j} \langle a_i a_j \rangle \, (q_j - w_j),
$$
where $\langle \cdot \rangle$ denotes the expectation value over the data.
Because activations can take on both positive and negative values, the cross terms $\langle a_i a_j \rangle$ for $i \neq j$ are likely to cancel out (unless there is a strong correlation). This means the diagonal elements $\langle a_i^2 \rangle$ dominate. Therefore, the approach can be simplified by using:
$$
\mathcal{I}_i = \langle a_i^2 \rangle,
$$
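As an illustrative sketch (this is not the actual `llama-imatrix` implementation; the struct and its names are hypothetical), accumulating $\langle a_i^2 \rangle$ for one tensor amounts to keeping a running mean of squared activations per column:

```cpp
// Sketch only: per-column running mean of squared activations for one tensor.
// Each call corresponds to one evaluation of the matmul that uses this tensor.
#include <cstddef>
#include <vector>

struct ImatrixAccumulator {
    std::vector<double> sum_sq;   // sum over calls of a_i^2
    size_t n_calls = 0;

    explicit ImatrixAccumulator(size_t n_cols) : sum_sq(n_cols, 0.0) {}

    // `a` has n_cols entries: the activation vector that this tensor's rows
    // are dotted with during the forward pass.
    void observe(const std::vector<float> &a) {
        for (size_t i = 0; i < a.size(); ++i) {
            sum_sq[i] += double(a[i]) * double(a[i]);
        }
        ++n_calls;
    }

    // Importance of column i: <a_i^2> over the calibration data.
    double importance(size_t i) const {
        return n_calls ? sum_sq[i] / double(n_calls) : 0.0;
    }
};
```

Each tensor gets its own accumulator of this kind, with one entry per column, i.e. per element of the activation vector that the tensor's rows are dotted with.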
This design enables hardware-aware optimizations while maintaining model accuracy through these core mechanisms:
- **Importance Matrix**:
As discussed above, this is a mathematical construct that assigns **sensitivity scores** to columns of neural network weights, repeated row by row. Columns with higher scores (indicating greater impact on model outputs) retain higher numerical precision during quantization, while less critical columns undergo more aggressive compression.
- **Precision Allocation Strategy**:
A base strategy to adjust is required. The standard quantization methods in `llama.cpp` (like `Q4_0`, `Q5_K`, etc.) generally use a linear mapping, i.e. $x = a * q$ or $x = a*q + b$ (see [Even more quantization types?](https://github.com/ggml-org/llama.cpp/discussions/5063)). More details on this approach are provided later in this article. Some _i-quants_ in llama.cpp employ **3rd-order polynomial dequantization**:
$$
W_{quant} = aq^3 + bq^2 + cq + d
$$
This non-linear mapping can provide better compression than equivalent linear methods while maintaining accuracy. The use of importance matrices introduces a more sophisticated strategy, biasing the quantization scale for blocks of weights.
### Matrix Representation
A naive approach to constructing an importance matrix would be to divide the entire model up into columns per weight, as if it were one giant matrix, thus producing a single importance matrix. For the reasons previously mentioned, this is not what is done. Instead, each layer in the network is given its own importance matrix.
- **1D Tensor of Weights**:
- Each layer in a neural network can be thought of as a vector (1D tensor) of weights. This is essentially a flat list of all the weights in that layer.
- **Block-Wise Grouping**:
- For quantization, weights are logically partitioned into **fixed-size blocks**. These blocks are not a literal reshaping of the tensor into 2D space but instead represent computational groupings.
- **Columns in the Importance Matrix**:
- Each column in the importance matrix corresponds to one of these groups of weights.
- The importance score for a column is derived from the **variance of the weight's associated activations**.
#### Application
The framework introduces a bias for each weight's parameters (e.g., _scale_) based on each value (also called a "weight" in the source code) in the importance matrix. This is implemented through an abstracted SIMD interface (**hardware-agnostic vectorization**) that leverages compile-time intrinsics to generate optimized code paths for multiple instruction sets: x86 (AVX2), ARM (NEON), and RISC-V (V extension).
## Quantization Workflow Implementation
_A comparison of the approaches used in all of the different quantizations available in llama.cpp is beyond the scope of this article. Here, approaches similar to some Q4 approaches are discussed. This is partially applicable to many other bit depths and quantization types._
### Core Algorithmic Steps
1. **Importance matrix column scores**
2. **Block-Wise Processing**
- 32-element blocks are used to reduce quantization error; 32 is a good choice because all transformer models in existence have row sizes that are divisible by 32, so one does not need to deal with partial blocks.
- 256-element superblocks used in k-quants
#### Block-level quantization of the row
Quantization maps a range of floating-point values to a smaller set of integers. This process relies on two key parameters:
1. **Scale** (multiplier): Determines how much to multiply quantized integers to approximate original values.
2. **Minimum** (offset): Defines the starting point of the quantization range. _In symmetric quantization (e.g., Q4_0), the minimum is omitted, as the range is centered at zero._
The reconstructed value is calculated as:
`original ≈ q * scale + minimum`
##### Example: Q4_0 Quantization
In llama.cpp's **Q4_0** format, quantization simplifies to **symmetric scaling** (no minimum term):
`original ≈ q * scale`.
**Key Properties of Q4_0**:
- **Per block of 32 weights**:
- Each weight is stored as a 4-bit integer (`q`).
- A single **6-bit scale** (`d`) is shared across the block.
- Total overhead: 6 bits (scale) + 0 bits (minimum) = **6 bits per block**.
- **Optimization objective**:
Minimize the weighted reconstruction error:
$$
\sum_{i} w_i (x_i - \text{scale} \cdot q_i)^2
$$
- $x_i$: Original floating-point weights.
- $q_i$: 4-bit integers (range: -8 to 7).
- $w_i$: Importance weights (derived from the importance matrix).
**Role of the Importance Matrix**:
When provided, the algorithm prioritizes minimizing errors for high-importance weights by:
1. **Weighting the error terms**: Errors at positions with larger `quant_weights[i]` contribute more to the loss.
2. **Iterative scale refinement**: Tests candidate scales to find the one that minimizes importance-weighted error (see the `make_qx_quants` code); a simplified sketch follows below this list.
- Without an importance matrix, the scale is determined by the **maximum absolute weight** in the block (`d = max / -8`), treating all weights equally.
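As referenced in the list above, here is a simplified sketch of an importance-weighted scale search for one symmetric 4-bit block. This is not the actual `make_qx_quants` code; the candidate grid and function name are illustrative, and `imp[i]` plays the role of `quant_weights[i]`:

```cpp
// Sketch only: importance-weighted scale search for one symmetric 4-bit block.
// x[]   : the 32 original float weights of the block
// imp[] : importance values for these 32 positions (use all 1.0 without an imatrix)
// q[]   : output 4-bit codes in [-8, 7]; reconstruction is x[i] ~= scale * q[i]
#include <algorithm>
#include <cmath>
#include <cstdint>

constexpr int kBlockSize = 32;

float quantize_block_q4_sym(const float *x, const float *imp, int8_t *q) {
    // Find the value with the largest magnitude (keeping its sign).
    float amax = 0.0f, max_val = 0.0f;
    for (int i = 0; i < kBlockSize; ++i) {
        const float ax = std::fabs(x[i]);
        if (ax > amax) { amax = ax; max_val = x[i]; }
    }
    if (amax == 0.0f) { std::fill(q, q + kBlockSize, int8_t(0)); return 0.0f; }

    const float naive_scale = max_val / -8.0f;   // the "no imatrix" choice
    float best_scale = naive_scale;
    float best_err   = -1.0f;

    // Try candidate scales around the naive one; keep the one with the
    // smallest importance-weighted reconstruction error.
    for (int step = -8; step <= 8; ++step) {
        const float scale = naive_scale * (1.0f + 0.05f * step);
        float err = 0.0f;
        for (int i = 0; i < kBlockSize; ++i) {
            const int   qi   = std::clamp((int)std::lround(x[i] / scale), -8, 7);
            const float diff = x[i] - scale * qi;
            err += imp[i] * diff * diff;
        }
        if (best_err < 0.0f || err < best_err) { best_err = err; best_scale = scale; }
    }

    for (int i = 0; i < kBlockSize; ++i) {
        q[i] = (int8_t)std::clamp((int)std::lround(x[i] / best_scale), -8, 7);
    }
    return best_scale;   // stored once per block
}
```

The naive `d = max / -8` choice is step 0 of the candidate grid; the importance values bias which candidate ends up winning, pulling the block toward a scale that protects its high-importance positions.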
##### Comparison with Q4_K quants
Briefly, **Q4_K** introduces additional complexity to improve accuracy at the cost of storage, using both the scale and minimum parameters and 256 weight _superblocks_ with their own parameters (the importance matrix biases error minimization at **both levels** in this case).
### Execution Flow
#### Phase 1: Importance Matrix Generation
The workflow initiates with `llama-imatrix` execution, which performs forward passes through the model using calibration data. Key implementation steps include:
1. **Chunk Processing**: Input text is divided into configurable-length segments (default 512 tokens, configurable to match context size) to be processed sequentially. Each chunk undergoes full model inference while tracking activation patterns.
2. **Tensor Significance Accumulation**: The `llama-imatrix` tool aggregates importance metrics across all processed chunks, maintaining running totals for each weight tensor. GPU offloading via the `-ngl` parameter accelerates this computation through parallel processing.
3. **Output Serialization**: Final importance values are normalized and stored in binary format (`imatrix.dat` by default) with metadata including processing timestamps and chunk statistics.
#### Phase 2: Quantization Application
The `llama-quantize` tool consumes the generated *imatrix* through several critical code paths:
1. **Matrix Loading**: During quantization initialization, the specified imatrix file is memory-mapped and validated against the target model architecture. The `prepare_imatrix()` function handles format compatibility checks and memory allocation.
2. **Weight Prioritization**: The quantization algorithm uses quantized weights modified by parameters such as scale that are adjusted with importance scores. High-importance weights receive larger bit allocations within mixed-precision quantization blocks.
## Calibration Process Specifications
### Data Selection Recommendations
Users define the calibration corpora. Discussions around llama.cpp's implementation suggest:
- **Domain Alignment**
- Technical models: 40% code (GitHub), 30% math (arXiv), 30% general text
- Conversational models: 60% dialogue datasets, 40% Wikipedia
- **Entropy Filtering**
- Some form of filtering of data may improve quality.
---
This documentation introduces general approaches to quantization and then llama.cpp's approach to importance-based quantization, emphasizing major technical implementation details. This approach demonstrates quantization efficiency across several hardware platforms, with calibration data selection remaining the primary user-controlled quality factor.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-02-21** at **06:51:45**:<br>
1. Many equations do not show in my Browsers (Firefox, Safari)
2. You are trying to describe the imatrix as used in llama.cpp. Hence, it would be better to use the mathematical foundation of that instead of the LeanQuants paper.
3. You could start by referring to the imatrix PR in `llama.cpp` (https://github.com/ggml-org/llama.cpp/pull/4861)
4. Only `IQ4_XS` and `IQ4_NL` use a non-linear mapping from quantized values to dequantized model weights. All other i-quants in `llama.cpp` use points on a lattice to map a group of 8 (`IQ2_XXS, IQ2_XS, IQ2_S`, E8 lattice) or 4 (`IQ3_XXS, IQ3_S`, D4 lattice) quants to corresponding model values.
5. Blocks of 32 have nothing to do with `AVX2`. They are there to reduce quantization error, and 32 is a good choice because all transformer models in existence have row sizes that are divisible by 32, so one does not need to deal with partial blocks. Blocks of 256 are there to reduce the storage requirements spent on block scales. E.g., `Q4_K` uses 6 bits for scale/minimum in blocks of 32, ending up with `256/32*(6+6) = 96` bits for the block scales/minima. Add `2*16` bits for the super-block `fp16` scale/minimum and you end up with 128 bits, or 0.5 bits per weight. In comparison, `Q4_1`, which would be the corresponding legacy quantization type, uses 5 bits per weight.
6. Legacy quants do not support imatrix: wrong. See e.g. [this function](https://github.com/ggml-org/llama.cpp/blob/ee02ad02c56ff36a5edd22d8617ab3f9546ce7fe/ggml/src/ggml-quants.c#L1849), which gets called when quantizing a model to `Q4_0`. From there one goes to [this function](https://github.com/ggml-org/llama.cpp/blob/ee02ad02c56ff36a5edd22d8617ab3f9546ce7fe/ggml/src/ggml-quants.c#L1821), which explicitly uses an importance matrix.
7. Phase 2: wrong
8. Dynamic bitwidth allocation: wrong
9. Chunk processing: the division is not "for sequential processing" but to have the ability to generate imatrix data for different **context lengths**.
Etc. Sorry @robbiemu, but this is just too far from representing the actual imatrix fundamentals and the imatrix use for guiding quantization.
> 👤 **robbiemu** replied the **2025-02-21** at **11:55:47**:<br>
> thank you for that :) It's a draft, so of course some things are going to be wrong; it's a big project that I've worked _with_ much more than _in_, and I need and appreciate the help identifying what I need to correct.
>
> Especially simple errata, like GitHub's markdown not rendering the LaTeX, or my confusing at one point blocks of 32 with superblocks of 256 vis-a-vis AVX2, are little burden. But there were a couple of points that I don't feel confident about how to process.
>
> At the beginning, I did transclude sections from another document I have on LeanQuants, specifically because in our conversation I felt you were the one to equate the imatrix to the Hessian approach. And they have a very natural way of expressing the relationship to quantization decisions, so I took pains to show the approximate relationship. That, and if you search/read about llama.cpp importance matrices online now, you will often see this relationship indicated. Reading your PR comment, I see that you don't even explicitly mention it, so maybe the inclusion was misguided. Yet you also don't directly ground quantization decisions in the importance matrix. In other words, the "how did we get here" that this section currently provides... I'll still need to add that. Do you prefer another formulation rather than what I used from LeanQuant? If I were to keep it: what is glossed over as essentially a given, that you can calculate only the diagonal, and the fact that you can treat a block-diagonal matrix here as a collection of smaller matrices (so you can break up the model's quantization row-wise, as is done in llama.cpp) -- those can be simplified or removed and replaced with the derivation you spell out in your PR.
>
> What really interests me is #7: after generating your imatrix, the next step in practice is to use the quantization tool, so it must be in the details that it is incorrect. I got this from Perplexity (I've not been working very much in the llama.cpp source code, except in regard to YaRN). If it is not too much to ask, could you help me correct that into a high-level description? I'm trying to avoid an exact correspondence here (phase 1 also does not live up to that); I just want a simple conceptual description of the execution graph.
>
> 👤 **robbiemu** replied the **2025-02-21** at **12:28:24**:<br>
> On one other point:
>
> "for sequential processing" -- this is just a lack of clarity, it I guess should be "to then be processed sequentially" maybe. I was never describing the reasoning, just the application, not getting into the details. Maybe I could add something about matching the max_positional_embeddings though, sure. batch and ubatch currently under the lens for change, there's a draft PR to make ubatch functionally different from batch in imatrix generation (ie computing multiple chunks per batch in https://github.com/ggml-org/llama.cpp/pull/9400 ) - as the nature and intent are perhaps changing, describing the intent is something I am not interested in adding to the document.
---
👤 **ikawrakow** replied the **2025-02-21** at **16:20:18**:<br>
If this was a draft that had the occasional mistake here or there, I would try to help you. But the content is so far away from reality that I wouldn't know where to begin (short of completely rewriting it).
As an example, let's look at the section "Phase 2" (point 7 in my initial response, the one that really interests you):
> During quantization initialization, the specified imatrix file is memory-mapped
No, it isn't. It is small and there is no need to complicate things with `mmap`. The data is simply loaded into memory using a standard C++ file stream.
> The quantization algorithm scales compression aggressiveness inversely with importance scores...
Absolutely not. Everything is quantized with the same number of bits, so the "compression aggressiveness" is the same. Instead, when the difference between the original and the quantized model is minimized, the importance matrix enters as a weighting factor in the optimization objective (a.k.a. "loss" these days).
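Put as a formula: for a fixed bit width (say 4 bits, $q_i \in \{-8,\dots,7\}$), the imatrix only changes the weights $w_i$ in the per-block objective; the number of bits per model weight stays the same:
$$
\min_{d,\; \{q_i\}} \sum_i w_i \left( x_i - d\, q_i \right)^2
$$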
> the quantization resolution R is determined by: [followed by bogus equation]
Where did you even get this equation from? It certainly is not used anywhere in `llama.cpp` or `ik_llama.cpp`
> High-importance weights receive larger bit allocations within mixed-precision quantization ...
No. All model weights in a tensor use the exact same amount of bits per weight.
> 👤 **robbiemu** replied the **2025-02-21** at **19:03:42**:<br>
> Ok, hold on. Please understand I'm just trying to essentially describe this; using tools to help me avoid reading the code was probably a mistake, but, in my defense, it's a big project that I am trying to elaborate. :) I'll apply the changes, and this will get better. Maybe I should seek help from others instead... if so, my apologies. I don't want to address the entire reply you gave me there just now, but something you said really gave me doubt.
>
> >> The quantization algorithm scales compression aggressiveness inversely with importance scores...
> >
> > Absolutely not. Everything is quantized with the same number of bits, so the "compression aggressiveness" is the same. Instead, when the difference between the original and the quantized model is minimized, the importance matrix enters as a weighting factor in the optimization objective (a.k.a. "loss" these days).
>
> Wow that is a surprise. So for example, in your earlier reference to the `quantize_row_q4_0_impl()` function, the loop is not assigning a different number of bits to each column of weights within the row? If it is applying the same value throughout, why is it using a for loop for each column of weights from the row?
>
> edit: ooh, I forgot about this! I had known it at some level before, but it was never necessary in discussing it, so I forgot and went back to my original understanding. It is basically a lot more computation to use a different number of bits, but there are other details that go into extracting the original value: the multiplier and the offset.

View File

@@ -0,0 +1,278 @@
### 🗣️ [#223](https://github.com/ikawrakow/ik_llama.cpp/discussions/223) - Recent performance testing with DeepSeek R1
| **Author** | `bitbottrap` |
| :--- | :--- |
| **Created** | 2025-02-22 |
| **Updated** | 2025-03-14 |
---
#### Description
I'm open to a more rigorous set of tests using accepted benchmark files. Just point me to them. I can run this periodically if it's scripted. Available are 2x24GB GPUs and 1TB of RAM on an Epyc CPU.
Tested with:
commit 4b45b82e67d9362e7522e5c7107e9d99219e0432 (HEAD -> main, origin/main, origin/HEAD)
Author: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Date: Thu Feb 20 17:42:07 2025 +0200
Honor attn_output specified in the command line also for low-bit quants
DeepSeek R1 Q4_K_M
Only the MLA configuration worked at 163840 token context. Everything else was OOM.
Attention Type | rtr | CUDA | Context Size | KV Quant | Load Time (ms) | Tokens/Second (Prompt Eval) | Tokens/Second (Eval) | Notes
-- | -- | -- | -- | -- | -- | -- | -- | --
flash | | | 8192 | Q8 | 87751 | 43.22 | 1.68 |  
flash | X | | 8192 | Q8 | 249508 | 58.58 | 1.89 |  
flash | | | 8192 |   | 146536 | 44.26 | 2.18 |  
flash | X | | 8192 |   | 259598 | 52.65 | 2.18 |  
mla | | | 8192 |   | 74651 | 32.76 | 5.21 |  
mla | X | | 8192 |   | 0 | 0 | 0 | FAIL, core dump
standard | | | 8192 |   | 94564 | 39.74 | 4.86 |  
standard | X | | 8192 |   | 254080 | 48.15 | 4.87 |  
flash | | | 65536 |   | 249237 | 43.44 | 2.05 |  
flash | X | | 65536 |   | 422931 | 55.18 | 2.06 |  
flash | | | 128000 |   | 416902 | 41.61 | 2.1 |  
flash | X | | 128000 |   | 593555 | 50.35 | 2.12 |  
mla | | | 128000 |   | 274483 | 32.18 | 5.24 |  
standard | | | 128000 |   | 612123 | 39.96 | 4.81 |  
standard | X | | 128000 |   | 731429 | 49.46 | 4.7 |  
flash | | | 163840 | Q8 | 413241 | 47.44 | 1.74 |  
flash | X | | 163840 | Q8 | 444949 | 57.90 | 1.75 |  
mla | | | 163840 |   | 83955 | 31.3 | 5.25 |  
mla | X | | 163840 |   | 0 | 0 | 0 | FAIL
flash | | X | 8192 |   | 0 | 0 | 0 | fail: ggml_cuda_flash_attn_ext_wmma_f16: Unhandled head size 192
flash | X | X | 8192 |   | 397501 | 49.35 | 2.16 |  
mla | | X | 8192 |   | 95964 | 22.77 | 5.22 | FAIL, garbage output
mla | X | X | 8192 |   | 0 | 0 | 0 | FAIL, core dump
standard | X | X | 8192 |   | 396659 | 50.17 | 4.84 |  
standard |   | X | 8192 |   | 126521 | 21.5 | 4.68 |
---
#### 🗣️ Discussion
👤 **saood06** replied the **2025-02-23** at **01:03:00**:<br>
Thank you so much for these results.
Also was the test conducted the same as before with a 500 token prompt and a 300 token response, or something different?
>I'm open to a more rigorous set of tests using accepted benchmark files.
I can make a branch containing what fairydreaming used to evaluate PP and TG performance.
From its readme:
>Benchmark the prompt processing and token generation performance of `llama.cpp`
by doing a sweep over a whole context size and gathering performance metrics
in each ubatch-sized window. Only a single token sequence is used.
>[...]
>The purpose of the benchmark is to visualize how the performance changes with
the context size without averaging the metrics values over the whole context.
> 👤 **bitbottrap** replied the **2025-02-23** at **01:18:38**:<br>
> 500 token prompt, 300 token output.
>
> If it's scripted and the results get written to a log that I can easily post I can do this periodically while this project is relevant. I did this by hand and it was the wrong way of doing it. And I'm not sure what parameters would be most beneficial to change especially when new features are being developed / tested.
---
👤 **saood06** replied the **2025-02-23** at **01:36:47**:<br>
The fairydreaming benchmark includes a Python script that generates a graph displaying multiple configurations against each other; here are two examples of its output from fairydreaming ([1](https://preview.redd.it/o2uxzg63x3he1.png?width=989&format=png&auto=webp&s=dc2743353f3d5a86258aa51efc7e18853e3911a0) and [2](https://www.reddit.com/r/LocalLLaMA/comments/1igpwzl/paradigm_shift/mawmoq0/))
We could tell you what configs to run and then you just pass all the jsonl output from each config into the script and it outputs a graph.
Edit: Fixed image link to show PP instead of TG graph
> 👤 **bitbottrap** replied the **2025-02-23** at **02:49:14**:<br>
> I'm primarily motivated by DeepSeek R1/V3 improvements right now. Since the model is so large, and the most value would probably be in pushing the limits of context, tests take a while. I use this system during the day, so I definitely can't afford to create such detailed graphs regularly. But if there were a smaller number of runs, say up to 30ish, that's reasonable to run overnight by request.
>
> 👤 **saood06** replied the **2025-02-23** at **04:59:50**:<br>
> >Since the model is so large, and the most value would probably be in pushing the limits of context, tests take a while.
>
> I understand my system is far weaker than yours (the highest PP I've seen is 11), and I've done overnight benchmarks so I do appreciate you doing this. I just created #225 for an easy to use but thorough benchmark, that will output nice graphs.
>
> >But if there were a smaller number of runs, say up to 30ish that's reasonable to run overnight by request.
>
> @ikawrakow Can you pick any runs you would like to see?
---
👤 **ikawrakow** replied the **2025-02-23** at **05:57:41**:<br>
Thank you for this!
What is the hardware configuration? (EPYC model, single or dual socket, how many RAM sticks and what type)
How many threads do you use when running the benchmarks?
I think the most pressing issue is to understand why TG performance with FA enabled is so low. Is it possible to run one FA configuration with a varying number of threads (e.g., `llama-bench -m $model -p 0 -n 64 -t 2,4,8,16,...,max_threads`)?
The MLA failures are also concerning, but solving them would require debugging.
CUDA does not support FA with different K and V head sizes, as is the case in the DeepSeekV3/R1 models, so no need to run those. I guess I should add a check for that.
Run-time repacking seems to be adding 2-3 minutes to the load time. This is better than I expected, but I guess it could be very annoying if used regularly. I should try to optimize it, or perhaps create a tool to repack an existing model.
---
👤 **bitbottrap** replied the **2025-02-23** at **15:30:00**:<br>
Epyc 7773X (64 cores, 128 threads), one socket, 8x128GB RAM
For the above I used 63 threads as a balance between prefill and generation.
Is the run time repacking equivalent of using Q4_K_S versus quantizing a model with Q4_K_R4? Also, there is no repacking for Q4_K_M? If so, some of the comparisons are off as the models being compared are in fact different.
I don't think repacking time is important for such a large model. Can't imagine loading it on demand in many environments.
Here is a table of the benchmarks you asked for above.
threads | std | flash | mla
-- | -- | -- | --
2 | 0.99 | 0.92 | 0.99
4 | 1.89 | 1.7 | 1.86
8 | 3.25 | 2.89 | 3.26
16 | 4.6 | 4.04 | 4.64
24 | 4.81 | 4.03 | 4.82
32 | 4.81 | 4.17 | 4.8
48 | 4.75 | 4.08 | 4.75
64 | 4.69 | 4.14 | 4.73
96 | 4.56 | 4.05 | 4.64
128 | 4.49 | 4.11 | 4.59
---
👤 **ikawrakow** replied the **2025-02-23** at **16:08:15**:<br>
Thanks!
So, what is the difference between the above and the original table? Here we see FA having lower performance than std/MLA, but only 10-20% lower and not 2.5x lower as in the original table. FA having slightly lower TG performance is in line with the expectation. Its main benefit is prefill performance, so depending on context (number of tokens generated vs prompt length), it will often win against std or MLA in terms of total processing time. But not when TG performance is 2.5X lower...
> For the above I used 63 threads as a balance between prefill and generation.
63 or 64? 63 is really bad as suddenly the number of rows in the tensors is no longer a multiple of the number of threads, so threads process different portions, and one likely even ends up with false sharing (threads writing into the same cache line, triggering cache syncs with potentially disastrous effects on performance). You see a little bit of that in the FA column above at 24, 48 and 96 threads, but these are still relatively "nice" thread numbers compared to 63.
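A toy illustration of the uneven split (hypothetical numbers, not ik_llama.cpp code): with 4096 rows, 64 threads get exactly 64 rows each, while with 63 threads 62 of them get 66 rows each and the last one gets only 4:

```cpp
// Toy illustration of a ggml-style equal split of n_rows over n_threads:
// each thread takes a contiguous chunk of ceil(n_rows/n_threads) rows.
#include <algorithm>
#include <cstdio>

static void show_split(int n_rows, int n_threads) {
    int per_thread = (n_rows + n_threads - 1) / n_threads;
    printf("%d rows over %d threads (%d rows per thread):\n", n_rows, n_threads, per_thread);
    for (int ith = 0; ith < n_threads; ++ith) {
        int first = std::min(ith * per_thread, n_rows);
        int last  = std::min(first + per_thread, n_rows);
        if (last - first != per_thread) {   // only report the "odd one out"
            printf("  thread %d gets only %d rows\n", ith, last - first);
        }
    }
}

int main() {
    show_split(4096, 64);   // every thread gets exactly 64 rows, nothing to report
    show_split(4096, 63);   // threads 0..61 get 66 rows each, thread 62 gets just 4
    return 0;
}
```

Besides the load imbalance, uneven chunks make it more likely that two threads' output regions meet in the middle of a cache line, which is where the false sharing mentioned above can come from.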
> Is the run time repacking equivalent of using Q4_K_S versus quantizing a model with Q4_K_R4?
Run-time-repacking (rtr) does not change the mix of quantization types. `Q4_K_M` is a mix of `Q4_K` and `Q5_K`, so after rtr we will have a corresponding mix of `Q4_K_R4` and `Q5_K_R4`. If you select `Q4_K_R4` as the quantization type during quantization, then yes, you basically end up with the same as `Q4_K_S` after rtr.
> Epyc 7773X (64 cores, 128 threads), one socket, 8x128GB RAM
OK, so this is Zen3, so it is using the vanilla AVX2 implementation. If the information I find on the Internet is correct, it should have ~200 GB/s memory bandwidth. We have 37B active parameters at about 4.8 bpw for `Q4_K_M`, so about 22 GB of model weights are active, and we should be getting in the range of 8-9 t/s for TG. I wonder where the bottleneck is. I'm able to 100% saturate the memory bandwidth on a Ryzen-7950X (Zen4 core), a Ryzen-5975WX (Zen3 core) and an M2-Max with the models I can run.
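A back-of-the-envelope version of that estimate (assuming ~200 GB/s of bandwidth and that every active parameter is read once per generated token):
$$
37\,\mathrm{B} \times \frac{4.8}{8}\ \mathrm{bytes} \approx 22\ \mathrm{GB\ per\ token}, \qquad \frac{200\ \mathrm{GB/s}}{22\ \mathrm{GB}} \approx 9\ \mathrm{tokens/s}
$$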
> 👤 **bitbottrap** replied the **2025-02-24** at **01:12:31**:<br>
> Good eye and thank you for challenging my assumptions. I had benchmarked mla and found that 63 threads was just fine. No large drop like flash attention. Here are the per-thread-count results for flash attention. Yes, there's a huge drop for 63:
>
> | Thread Count | Prompt Eval Time (tokens/s) | Eval Time (tokens/s) |
> |-------------|-----------------------------|----------------------|
> | 2 | 2.39 | 0.98 |
> | 4 | 4.71 | 1.57 |
> | 8 | 9.30 | 2.65 |
> | 16 | 18.14 | 3.57 |
> | 24 | 26.52 | 3.18 |
> | 32 | 33.74 | 3.41 |
> | 48 | 42.53 | 3.42 |
> | 49 | 39.05 | 1.88 |
> | 50 | 43.38 | 2.36 |
> | 51 | 39.63 | 1.89 |
> | 52 | 44.61 | 2.68 |
> | 53 | 42.42 | 1.89 |
> | 54 | 44.63 | 2.28 |
> | 55 | 42.70 | 2.18 |
> | 56 | 45.70 | 3.20 |
> | 57 | 43.20 | 1.96 |
> | 58 | 45.45 | 2.40 |
> | 59 | 44.28 | 1.88 |
> | 60 | 44.52 | 2.63 |
> | 61 | 44.46 | 1.89 |
> | 62 | 43.56 | 2.32 |
> | 63 | 45.11 | 1.91 |
> | 64 | 48.52 | 3.59 |
> | 65 | 36.08 | 2.05 |
> | 96 | 37.80 | 3.75 |
> | 128 | 43.49 | 3.67 |
>
> There's also a bit of a difference in that these numbers and the original chart were derived from running llama-cli versus llama-bench. Full command line:
>
> llama-cli -fa -b 1024 -ub 1024 -m DeepSeek-R1-256x21B-Q4_K-00001-of-00030.gguf -c 8192 -t 64 --mlock -n 300 -f prompt-prefill-benchmark.txt
>
> Yes, none of this comes close to the theoretical maximum 200GB/sec memory bandwidth.
---
👤 **ikawrakow** replied the **2025-02-24** at **14:35:34**:<br>
Really curious to see what happens with PR #232.
> 👤 **bitbottrap** replied the **2025-02-26** at **01:30:24**:<br>
> Well I see the PR is in main. If you've got a command line that works with 1 or 2 24GB GPUs I'll start it up. I'd like to fit maximum possible context in there.
>
> I see that mla with rtr is working together. Did a hand run and it sped things up. I also generated Q4_K_R4 and Q8_0_R8 quants and they also appear to speed things up. All working together too.
>
> One thing bothers me and that's the official llama.cpp doesn't like the standard quants that are generated. I used the evshiron convert_hf_to_gguf.py and llama.cpp complains about "wrong number of tensors; expected 1147, got 1025"
>
> A lot of interesting features have gone in here and started working recently. Sounds like it's time for a fairly thorough benchmarking.
>
> Here's some size info regarding KV and compute with 163840 context using mla:
> llama_kv_cache_init: CPU KV buffer size = 20740.00 MiB
> llama_new_context_with_model: KV self size = 20740.00 MiB, c^KV (f16): 10980.00 MiB, kv^T (f16): 9760.00 MiB
> ggml_cuda_host_malloc: failed to allocate 0.49 MiB of pinned memory: no CUDA-capable device is detected
> llama_new_context_with_model: CPU output buffer size = 0.49 MiB
> ggml_cuda_host_malloc: failed to allocate 41644.01 MiB of pinned memory: no CUDA-capable device is detected
> llama_new_context_with_model: CUDA_Host compute buffer size = 41644.01 MiB
---
👤 **ikawrakow** replied the **2025-02-26** at **13:08:06**:<br>
> If you've got a command line that works with 1 or 2 24GB GPUs I'll start it up
Basically whatever command you use for your standard testing, but add `-ngl 999 -ot "\.ffn_.*_exps\.=CPU"`. My concept is that the non-expert tensors of DeepSeekV3/R1 (~17B) fit on a single 24GB GPU when quantized. I don't think `llama.cpp` (and by inheritance `ik_llama.cpp`) benefits from multiple GPUs performance-wise, so the only benefit from using both GPUs would be the ability to process larger contexts (assuming one can meaningfully split the layers, but I have never played with that as I don't have access to a multi-GPU system).
> One thing bothers me and that's the official llama.cpp doesn't like the standard quants that are generated. I used the evshiron convert_hf_to_gguf.py and llama.cpp complains about "wrong number of tensors; expected 1147, got 1025"
This bothers me too, but that's how it got implemented in this unmerged [llama.cpp PR](https://github.com/ggml-org/llama.cpp/pull/11446) where the MLA implementation here originally came from (but there have been quite a few improvements compared to the PR in `llama.cpp`). Basically, the tensors `wkv_b` get split into `wk_b` and `wv_b` by the `convert_hf_to_gguf.py` script, so there are more tensors in the GGUF produced by `ik_llama.cpp` compared to mainline. I have thought about removing this change from `convert_hf_to_gguf.py` and performing the split on-the-fly while loading the model. But then we run into issues with the imatrix stuff because `wk_b` and `wv_b` will not have entries in the imatrix file (so, no low-bit quantization is possible). It is also not possible to take an existing imatrix and split its `wkv_b` entries because `wv_b` is transposed. From my perspective `llama.cpp` goes too far in turning situations that, although unexpected, can be gracefully handled, into fatal errors. In this particular case, all tensors that `llama.cpp` needs to run the model are present, so the presence of the additional `wk_b` and `wv_b` tensors shouldn't result in an error. But I guess that's what happens in a project with many users and few regular contributors who have the big picture.
On KV cache size: To match KTransformers, `ik_llama.cpp` must be able to handle a context of 8K tokens. Based on the figures you provide for a context of 163k tokens, 8K tokens will require ~1 GiB if left as `f16`, or 765 MiB if the K cache is quantized with `Q8_0`. Let's assume the non-experts are quantized with 6.5 bpw on average (for DeepSeekV3/R1 it is useful to use more bits for the attention tensors and shared experts). 17B * 6.5 bpw = 13.5 GiB. So, there would be ~10 GiB left for KV cache and compute buffers. I don't know how much compute buffer is required for DeepSeekV3/R1, but it seems you will be able to go to 32K or perhaps 65K tokens with MLA. Going beyond that will require splitting the model between the two GPUs.
Of note: MLA is ~20% slower than standard attention for less than a few hundred tokens in the cache. It becomes competitive performance-wise only beyond 16k tokens. With MLA there are two matrix multiplications that are extremely slow on CUDA. I'm trying to improve that, but no luck so far.
> 👤 **ikawrakow** replied the **2025-02-26** at **17:29:07**:<br>
> PR #234 does speed MLA, but only with a single GPU involved.
>
> 👤 **ikawrakow** replied the **2025-02-26** at **17:33:19**:<br>
> Oh, and adding `-fmoe` (or `-fmoe 1` with `llama-bench`) is useful too. This fuses the MoE matrix multiplications. Speedup is not dramatic, but we do get a few percent speedup for prefill and 1-2% for TG.
---
👤 **bitbottrap** replied the **2025-03-14** at **14:54:37**:<br>
So I was going to try and get a bunch of benchmarks with recent code and I encountered a problem using any GPU offloading. This was a feature that was working, but poorly, last time I did some hand testing.
The model is DeepSeek R1 Q8_0
| Configuration | Prompt Eval Time (tokens/s) | Eval Time (tokens/s) | Notes |
|-------------------------------|----------------------------|---------------------|---------------------------------|
| -mla 1 | 37.00 | 3.52 | |
| -mla 1 -fa | N/A | N/A | Segmentation fault (core dumped)|
| -mla 1 -fmoe | 37.55 | 3.53 | |
| -mla 1 -rtr | 43.58 | 3.50 | |
| -mla 1 -rtr -fmoe | 44.37 | 3.51 | |
| -mla 2 | 38.52 | 3.49 | |
| -mla 2 -fa | N/A | N/A | NO TEXT GENERATED |
| -mla 2 -fa -fmoe | N/A | N/A | NO TEXT GENERATED |
| -mla 2 -rtr | 45.41 | 3.47 | |
| -mla 2 -rtr -fmoe | N/A | N/A | Killed/crashed |
| -mla 2 -fmoe | 38.79 | 3.49 | |
Command lines like these with GPU offloading failed:
CUDA_VISIBLE_DEVICES=0 ~/llmla/ik_llama.cpp/build/bin/llama-cli -mla 2 -ngl 0 -b 1024 -ub 1024 -m DeepSeek-R1-Q8_0.gguf -c 8192 -t 64 --mlock -n 300 -f /mnt/data/prompt-prefill-benchmark.txt
CUDA error: out of memory
CUDA_VISIBLE_DEVICES=0 ~/llmla/ik_llama.cpp/build/bin/llama-cli -mla 1 -rtr -b 1024 -ub 1024 -m DeepSeek-R1-Q8_0.gguf -c 8192 -t 64 --mlock -n 300 -f /mnt/data/prompt-prefill-benchmark.txt -ngl 999 -ot "\.ffn_.*_exps\.=CPU"
died

View File

@@ -0,0 +1,241 @@
### 🗣️ [#25](https://github.com/ikawrakow/ik_llama.cpp/discussions/25) - CPU prompt processing speed for large contexts
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-08-22 |
| **Updated** | 2025-01-15 |
---
#### Description
Back in the day when open source / open weight LLMs had a very limited context window, one of the most desired features among LLM enthusiasts was a larger context window. People came up with all sorts of modifications to the RoPE operation, used (LoRA) fine tuning, etc., to increase the context window beyond the maximum context used during model training. Today we have open source / open weight models that can handle much longer contexts. E.g., LLaMA-3.1 goes up to 128k tokens, which is probably more than what one can handle with consumer grade hardware for "Inference at the Edge" (and I find it kind of funny to see the many issues opened in the `llama.cpp` repository because users did not limit the maximum context length when running `llama.cpp`, and correspondingly the model would not load because the KV-cache required for 128k tokens does not fit into their <= 24 GB VRAM).
But how well is the large context length being handled?
On the GPU `llama.cpp` has an implementation of Flash Attention (FA), which improves prompt processing speeds for long contexts quite a bit (see the graph below). But, as mentioned, one cannot take advantage of the full context offered by LLaMA-3.1 - me for instance, with the paltry 16 GB VRAM on the RTX-4080 that I have at my disposal, cannot go beyond 32k tokens even for 8B LLaMA-3.1. `llama.cpp` has a FA implementation for the CPU as well, so let's see how well this works:
```
./bin/llama-bench -p 2048 -n 0 -t 16 -fa [0|1]
```
which gives these results on my Ryzen-7950X CPU:
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | ---------------: |
| llama 8B Q4_K - Small | 4.38 GiB | 8.03 B | CPU | 16 | 0 | pp2048 | 93.13 ± 0.34 |
| llama 8B Q4_K - Small | 4.38 GiB | 8.03 B | CPU | 16 | 1 | pp2048 | 87.28 ± 0.30 |
Oops. FA is **slower** than no-FA. This is mainline `llama.cpp`. What about the version in this repository where we have much improved CPU prompt processing speed? We get this:
| model | size | params | backend | threads | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | ------------: | ---------------: |
| llama 8B Q4_K - Small | 4.38 GiB | 8.03 B | CPU | 16 | 0 | pp2048 | 174.09 ± 1.35 |
| llama 8B Q4_K - Small | 4.38 GiB | 8.03 B | CPU | 16 | 1 | pp2048 | 137.87 ± 1.55 |
Oops. Even worse - FA is 26% slower. Why? Because when FA is turned on, the `KQ = K * Q` and `KQV = V * KQ` matrix multiplications are handled internally within the FA kernel, so they no longer take advantage of the optimized version provided by `iqk_mul_mat`, and performance suffers even more.
So, the short answer is: no luck with the current `llama.cpp` version using long contexts on the CPU (unless of course one is very patient).
Anyhow, how well does the CPU do compared to the GPU? The following graph shows the ratio of tokens/second on the CPU to tokens/second on the GPU as a function of prompt length. The CPU is Ryzen-7950X, the GPU is RTX-4080. The black symbols/line is the ratio without GPU Flash Attention, the red circles/line is with FA turned on on the GPU (but not on the CPU).
![pp_cpu_vs_gpu](https://github.com/user-attachments/assets/9ffb6471-356a-430a-b625-03f4cd1431f0)
The behavior of the curves is interesting for relatively short prompts (say, up to 32 tokens, which is the range of interest for speculative sampling or batch processing), but here we are interested in the portion beyond 500 tokens. Without FA on the GPU, the CPU does improve relative to the GPU with increasing context length, becoming only 16X slower at 32k tokens ("only" considering that we are comparing a $500 previous generation Ryzen to the second fastest consumer grade GPU currently on the market). But when FA is turned on, the performance gap keeps increasing with increasing context length, reaching about 53X slower than the GPU at 32k tokens (and hence the GPU with FA is 3.1X faster compared to no-FA at 32k tokens).
Clearly it would be useful if we could make the CPU go faster for large contexts.
Here is a quick summary of how the computation time is spent on the CPU when processing a prompt of 32k tokens (using LLaMA-3.1-8B quantized to `Q4_K_S`). For comparison, I have added in the 4th column the fraction of time spent for the various operations in the more "normal" case of processing 512 tokens.
| operation | time (us) | fraction of total time | fraction for PP-512 |
| ---------: | ---: | ---: | ---: |
| MUL_MAT | 3.78863e+08 | 0.8022 | 0.9334 |
| SOFT_MAX | 8.4128e+07 | 0.1781 | 0.0084 |
| quantize | 2.32309e+06 | 0.0049 | 0.0159 |
| MUL | 2.117e+06 | 0.0045 | 0.0133 |
| RMS_NORM | 1.13661e+06 | 0.0024 | 0.0070 |
| ADD | 968962 | 0.0021 | 0.0058 |
| SILU | 914848 | 0.0019 | 0.0060 |
| ROPE | 878818 | 0.0019 | 0.0038 |
| CONT | 632398 | 0.0013 | 0.0040 |
| CPY | 306549 | 0.0006 | 0.0021 |
| GET_ROWS | 12628 | 0.0000 | 0.0002 |
So, basically the entire time is spent doing matrix multiplications and `SOFT_MAX` on the `K*Q` product in the self-attention part (but according to the measured wall time the operation took 495 seconds, while the total of all operations works out to 472 seconds, so possibly ~5% is spent on thread synchronization). `SOFT_MAX`, which takes less than 1% of the processing time for 512 tokens, increases to 17.8% for a context of 32k. But why is `SOFT_MAX` taking so long? Didn't Justine Tunney just recently contribute a vectorized `expf` implementation to `llama.cpp`, which should make `SOFT_MAX` go faster? Well, the vectorized `expf` is being used here, but we also need to load from/store back to RAM 2080 GiB while computing `SOFT_MAX`. Given the 84.1 seconds taken by `SOFT_MAX`, this works out to about 25 GiB/s, which is pretty close to the 30 GiB/s the Ryzen-7950X CPU can do in the best case scenario when copying data from here to there.
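As a quick sanity check on the effective bandwidth quoted above (2080 GiB of `SOFT_MAX` traffic over 84.1 seconds):
$$
\frac{2080\ \mathrm{GiB}}{84.1\ \mathrm{s}} \approx 24.7\ \mathrm{GiB/s}
$$
which is indeed close to the ~30 GiB/s copy-bandwidth ceiling mentioned above.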
What about the matrix multiplications? The next table shows the total time in us and the fraction of the total matrix multiplication time taken by the various matrix multiplications (note: this is the sum over all layers):
| Result tensor | Time (us) | Fraction of total time |
| ---: | ---: | ---: |
| kq | 1.29016e+08 | 0.3405 |
| kqv | 9.59329e+07 | 0.2532 |
| ffn_out | 4.31925e+07 | 0.1141 |
| ffn_up | 4.16408e+07 | 0.1099 |
| ffn_gate | 3.91751e+07 | 0.1034 |
| Qcur | 1.1825e+07 | 0.0312 |
| kqv_out | 1.1343e+07 | 0.0299 |
| Vcur | 3.32323e+06 | 0.0088 |
| Kcur | 3.29824e+06 | 0.0087 |
| result_output | 115747 | 0.0003 |
So, close to 60% of the matrix multiplication time is spent for `kq = K*Q` and `kqv = V * softmax(K*Q)`. Combining 60% of 80% with 17.8% for `SOFT_MAX`, we have close to 2/3 of the total time being spent on `K*Q`, `softmax(K*Q)` and `V*softmax(K*Q)`. Interestingly enough, the `kq` and `kqv` matrix multiplications require the exact same amount of floating point operations - 142.94 TFLOP for the 32k context we are looking at. And yet, `kqv` is computed about 35% faster - why? Again, it is a matter of storing data to RAM: `kq` is 2080 GiB (no, we don't keep it all, processing is done in batches), so this works out to 16.1 GiB/s written to memory while computing `kq`. On the other hand `kqv` is "just" 16 GiB, so the matrix multiplication function is storing results at a rate of 0.17 GiB/s - so it is far from being throttled by memory bandwidth. We also see from the data that we get about 1.5 TFLOP/s when computing `kqv`, and about 1.1 TFLOP/s for `kq`. I happen to know that in a synthetic benchmark with just matrix multiplications and result fitting into L2 cache, we get about 2 TFLOP/s with the `iqk_mul_mat` implementation for `fp32`.
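Spelling out the arithmetic behind those rates (times taken from the table above, converted from us to seconds):
$$
\frac{2080\ \mathrm{GiB}}{129.0\ \mathrm{s}} \approx 16.1\ \mathrm{GiB/s}, \qquad \frac{16\ \mathrm{GiB}}{95.9\ \mathrm{s}} \approx 0.17\ \mathrm{GiB/s}
$$
$$
\frac{142.94\ \mathrm{TFLOP}}{129.0\ \mathrm{s}} \approx 1.1\ \mathrm{TFLOP/s}, \qquad \frac{142.94\ \mathrm{TFLOP}}{95.9\ \mathrm{s}} \approx 1.5\ \mathrm{TFLOP/s}
$$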
Based on this, here are some angles of attack for improving the CPU performance for large prompts:
1. Investigate if it is possible to get the `kqv` speed closer to the 2 TFLOP/s we know is achievable
2. Investigate if we can improve `kq` performance by better interleaving computation with memory writes. We are at ~16 GiB/s and 30 GiB/s is the limit on this CPU
3. Fuse `kq` and `softmax(kq)` into a single operation. As I don't want to go implement this new operation on all back-ends, the fusing should be done on-the-fly while evaluating the computation graph on the CPU. This will eliminate writing `kq` to RAM, so has the potential of shaving off at least 15% of the time
4. Fuse `K*Q`, `softmax(K*Q)` and `V*softmax(K*Q)` into a single operation. I.e., re-discover Flash Attention :-) As the experience with the `llama.cpp` CPU implementation shows, it is not just a matter of not storing intermediate results into RAM. One still needs to go as fast as possible with the matrix multiplications to actually get performance improvement from this. A rough sketch of the fused computation is given after this list.
5. Look into quantized KV cache. Quantized matrix multiplications are faster than `fp32` - we get in the range of 2.5 to 3 TFLOP/s with the implementation in `iqk_mul_mat`, but I need to look in more detail into the associated accuracy loss. In addition, if `V` is quantized, `softmax(K*Q)` must be quantized as well, which may be too costly unless fused into the `softmax(K*Q)` operation.
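A rough sketch of the fused computation from points 3 and 4, for a single query vector and one head (plain scalar C++ for clarity, no batching or SIMD; as noted in point 4, a real implementation still needs fast matrix multiplications to be competitive):

```cpp
// Rough sketch (not ik_llama.cpp code) of fusing K*Q, softmax(K*Q) and V*softmax(K*Q)
// for a single query vector q and one head, using the usual online-softmax rescaling
// so that the full K*Q row is never materialized in memory.
#include <algorithm>
#include <cmath>
#include <vector>

// K and V are n_kv x head_dim, row-major; out receives head_dim values.
static void attend_one_query(const float * K, const float * V, const float * q,
                             float * out, int n_kv, int head_dim, float scale) {
    std::vector<float> acc(head_dim, 0.0f);  // running sum of exp(s_j - m) * V_j
    float m   = -INFINITY;                   // running maximum of the scores
    float sum = 0.0f;                        // running sum of exp(s_j - m)
    for (int j = 0; j < n_kv; ++j) {
        float s = 0.0f;                      // s_j = scale * dot(K_j, q)
        for (int d = 0; d < head_dim; ++d) s += K[j*head_dim + d]*q[d];
        s *= scale;
        float m_new = std::max(m, s);
        float c = std::exp(m - m_new);       // rescale what we accumulated so far
        float p = std::exp(s - m_new);       // weight of the current key/value
        for (int d = 0; d < head_dim; ++d) acc[d] = acc[d]*c + p*V[j*head_dim + d];
        sum = sum*c + p;
        m   = m_new;
    }
    for (int d = 0; d < head_dim; ++d) out[d] = acc[d]/sum;
}
```

The point of the running maximum and rescale is that the full `K*Q` row is never written out, which is exactly the memory traffic identified above as the bottleneck.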
---
#### 🗣️ Discussion
👤 **jart** replied the **2024-08-22** at **15:26:07**:<br>
> ~5% spent on thread synchronization
Have you tried these measurements with the latest llamafile sources? There's a variety of improvements to thread synchronization. For example, here's a better memory barrier that's more on par with what GNU OpenMP does.
```c
void ggml_barrier(const struct ggml_compute_params * params) {
if (params->shared->n_threads == 1)
return;
int n = params->shared->n_threads;
atomic_int * count = &params->shared->n_barrier;
atomic_uint * phase = &params->shared->n_barrier_passed[params->ith].i;
unsigned i = atomic_load_explicit(phase, memory_order_relaxed);
if (atomic_fetch_add_explicit(count, 1, memory_order_acq_rel) == n - 1) {
atomic_store_explicit(count, 0, memory_order_relaxed);
for (int j = 0; j < n; ++j)
atomic_store_explicit(&params->shared->n_barrier_passed[j].i,
i + 1, memory_order_relaxed);
atomic_thread_fence(memory_order_release);
} else {
while (atomic_load_explicit(phase, memory_order_relaxed) == i)
pthread_pause_np();
atomic_thread_fence(memory_order_acquire);
}
}
```
In `ggml_graph_compute_thread()` it helps a lot to say:
```c
for (int node_n = 0; node_n < cgraph->n_nodes; node_n++) {
struct ggml_tensor * node = cgraph->nodes[node_n];
if (ggml_is_noop(node->op)) // [jart]
continue;
// ...
```
Assuming you have this defined:
```c
static bool ggml_is_noop(enum ggml_op op) { // [jart]
switch (op) {
case GGML_OP_NONE:
case GGML_OP_PERMUTE:
case GGML_OP_RESHAPE:
case GGML_OP_TRANSPOSE:
case GGML_OP_VIEW:
return true;
default:
return false;
}
}
```
llama.cpp also likes to spawn a thread for every token when predicting. You can make threads spawn/join 10x faster with this:
- https://github.com/Mozilla-Ocho/llamafile/blob/main/llamafile/pool.cpp
Is this all something that'd interest you? I can easily send a PR adding it to your repo if you don't care about things like MSVC.
---
👤 **ikawrakow** replied the **2024-08-22** at **16:16:08**:<br>
Hey @jart, thanks for the comments!
> Have you tried these measurements with the latest llamafile sources? There's a variety of improvements to thread synchronization. For example, here's a better memory barrier that's more on par with what GNU OpenMP does.
No, I'm working with my `llama.cpp` clone and using OpenMP on Linux. On my M2-Max OpenMP is somehow really bad, so I'm using a slightly modified version of `ggml_barrier`, see [here](https://github.com/ikawrakow/ik_llama.cpp/blob/bd99ed7d0afd2b12c0f5ff5c17b58486396dfe7e/ggml/src/ggml.c#L3371). But I'll definitely look into using threads differently. It hasn't been an issue with my setup until I started looking into these long contexts. When you do long contexts the computation takes quite some time, so the OS will definitely preempt one or more threads at some point, and then we end up waiting for them to finish with the `ggml` approach of splitting the work into `n_thread` chunks. I think for the long contexts it will be better to do work stealing from a pool of tasks that is a few times larger than the number of threads. I'm planning to also look into that.
> In ggml_graph_compute_thread() it helps a lot to say:
Ha, you had already done that! I didn't check `llamafile` and discovered this on my own, see [this PR](https://github.com/ikawrakow/ik_llama.cpp/pull/19)
> Is this all something that'd interest you? I can easily send a PR adding it to your repo if you don't care about things like MSVC.
I don't care about MSVC, so sure. There is the MIT vs Apache-2.0 issue, but we can sort that out.
> 👤 **jart** replied the **2024-08-22** at **18:02:15**:<br>
> Apple doesn't have OpenMP. So that's where my thread synchronization changes have the most impact. Right now in llama.cpp if I build it on my Apple M2 and run with `-ngl 0` for CPU mode it gets 134 tok/sec tops. But llamafile with `-ngl 0` on MacOS M2 generates text at anywhere from 150 tok/sec to 210 tok/sec depending on how much Netflix is interfering and how much I win the XNU scheduler lottery (I imagine things are consistently 200+ if Asahi Linux is used instead of XNU). On the other hand, if I use Metal GPU then it consistently generates text at 200 tok/sec.
>
> Yes, that's correct. I'm claiming that the changes you and I both made on llamafile have made M2 Ultra CPU go faster than its GPU sometimes when generating text with TinyLlama-1.1B-Chat-v1.0.Q4_K_M.gguf. However if I use a larger model like Mistral 7b where the matmuls start to dominate a lot more than the sync barriers, then I can only generate 42 tok/sec and GPU does 72 tok/sec. So this is all a bit orthogonal to the goal here of huge context windows. I just wanted you to know that we did something most people would likely assume is not possible. I certainly wouldn't have, because when I started focusing on this in January I set out with the goal of making CPU at least only 10x slower than GPU.
>
> 👤 **jart** replied the **2024-08-22** at **18:13:48**:<br>
> As for MIT vs. Apache 2.0 there's a lot of leeway from Mozilla to make my work available to other local AI projects under the MIT license if that's what you're using here. I'll roll up a pull request for you sometime in the next few days, that'll work smoothly on POSIX platforms.
>
> 👤 **ikawrakow** replied the **2024-08-22** at **19:08:09**:<br>
> > Apple doesn't have OpenMP
>
> I thought the currently recommended approach in `llama.cpp` is to `brew install libomp`, which then by default enables OpenMP? That's what I tried anyway after observing a horrible performance with the `ggml_barrier` implementation on my M2-Max laptop, but that didn't help much either, so I did end up putting in the inline assembly that fixed performance for me.
>
> But yes, for small models such as TinyLlama thread synchronization becomes really important, so I should try your barrier version.
>
> 👤 **jart** replied the **2024-08-22** at **22:12:59**:<br>
> I don't even know why OpenMP is there. It's a GPL-licensed library. We might as well be using Torch if we're going to link that. Goes against the very spirit of the project which is figuring these things out for ourselves.
>
> 👤 **jart** replied the **2024-08-22** at **22:16:45**:<br>
> Also if by libomp you mean LLVM libomp, sadly it's kind of a newer alternative and it's got none of the alpha of GNU's OpenMP runtime. Based on my own evaluation, LLVM libomp is about as fast as llama.cpp's old synchronization code, when it's applied for GGML speedups.
---
👤 **ikawrakow** replied the **2024-08-27** at **06:31:49**:<br>
I did try a few things on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/kq_fused_softmax), but nothing is really working. The branch is just exploratory, absolutely not production ready, and `AVX512`-only. Given the unsatisfactory outcome, it will not get merged.
* I can get the CPU flash attention to run faster than the original (quite a bit faster for very large prompts), but it is still slower than no flash attention
* I can get a ~3% speedup for large prompts by optimizing for no-alibi and causal attention mask. But given the marginal improvement, increased complexity, and reduced generality, it does not seem worth adding.
On the bright side, PR #27 merges "soft-capping" with soft-max. For large prompts, this leads to a significant performance boost for Gemma-2 models. At 32k tokens and Gemma-2-2b, the performance gap between GPU with flash attention and the Ryzen-7950X CPU is now "only" a factor of 45 (instead of the 53X in the above graph).
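For context, soft-capping squashes the raw attention scores with a tanh before the softmax (the cap $c$ is a model hyper-parameter in Gemma-2), so the two steps are naturally computed in one pass over each `K*Q` row:
$$
\tilde s_{ij} = c \, \tanh\!\left(\frac{s_{ij}}{c}\right), \qquad p_{ij} = \frac{e^{\tilde s_{ij} - m_i}}{\sum_k e^{\tilde s_{ik} - m_i}}, \quad m_i = \max_k \tilde s_{ik}
$$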
---
👤 **ikawrakow** replied the **2024-08-30** at **15:25:30**:<br>
OK, I have progress on [this branch](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/kq_fused_softmax). Extremely hacky and `AVX512`-only (or, more precisely, Zen4-only), totally not production ready. But I'm finally able to outperform no flash attention on my Ryzen-7950X CPU - by about 20% for context of 16k, 23% for 32k, with LLaMA-3.1-8B.
This graph shows the current status. y-axis is tokens per second on my Ryzen-7950X CPU, x-axis is context size (logarithmic scale). Black symbols show the performance in this repository, green is mainline `llama.cpp`, both without FA. The red symbols is what we get if we turn on FA as inherited from `llama.cpp`, so complete disaster. Blue symbols are mainline `llama.cpp` with FA. Yes, it is slower than no-FA (and the fact that it is slower on most platforms except newer GPU's with CUDA appears to be not well known). The magenta symbols show the results for the new FA implementation on the [ik/kq_fused_softmax](https://github.com/ikawrakow/ik_llama.cpp/tree/ik/kq_fused_softmax) branch. There are many attempts there, so this is the result of [this function](https://github.com/ikawrakow/ik_llama.cpp/blob/77b7baaff79cdc94fc13bd67698e85a40a55bb00/ggml/src/iqk/iqk_mul_mat.cpp#L6786)
![fa](https://github.com/user-attachments/assets/4f5b7e7a-0648-4972-ba93-cd14da3ab1e6)
My guess is that there is still a bottleneck at 32k tokens. Based on the FA to no-FA relative performance increase up to 16k tokens, I would expect a performance gain above 30% at 32k tokens instead of the 23% we currently get.
---
👤 **ikawrakow** replied the **2024-08-30** at **15:37:24**:<br>
And here is how the relative CPU vs GPU performance graph changes with the new CPU flash attention implementation. The FA curve is basically flat now beyond 1000 tokens, except at 32k, where I suspect a bottleneck that I have not found.
![pp_cpu_vs_gpu](https://github.com/user-attachments/assets/96c27976-f22b-4fa9-a0b5-021f0992a83c)
---
👤 **ikawrakow** replied the **2025-01-15** at **17:50:21**:<br>
There has been progress since I last wrote here, with PR #172 being the latest contribution to improving CPU prompt processing speed. The following graph is for LLaMA-3.1-8B-Instruct quantized to `IQ4_XS` (which seems a fairly popular quantization type). Tested on a Ryzen-7950X CPU. The mandatory current mainline `llama.cpp` results are for `build: 1d850433 (4488)`. The results for `ik_llama.cpp` are obtained using run-time-repacking to the corresponding 4-row interleaved variant.
![pp512_vs_ctx](https://github.com/user-attachments/assets/81a09390-b0da-4d5c-9815-300b4b86705c)
* In mainline `llama.cpp` FA continues to be underwhelming, being handsomely outperformed by not using FA
* `ik_llama.cpp` now finally exceeds 100 t/s for a prompt of 32k tokens. I get 122 t/s (`BF16` KV-cache) and 113 t/s (`Q8_0` KV-cache). The best I could do with mainline is 37 t/s (`Q8_0` K-cache, no FA).
* I'm quite pleased that `Q8_0` KV-cache is now almost on par with `BF16`
* `ik_llama.cpp` is almost 4 times faster than mainline at 256 tokens, and still 3.3 times faster at 32k tokens. For such large contexts the computation time is heavily dominated by the `K*Q` and `V*softmax(K*Q)` matrix multiplications, with these matrices by far exceeding L3 cache size, and hence the operation becoming memory bound. In fact, part of the improvement in PR #172 is due to reducing the number of memory loads from the `V`-cache in the FA computation.
* If processing very long context is a significant use case, utilizing `Q8_K_R8` brings additional gains. We get 373 t/s for 512 tokens, 312 t/s at 4k, 268 t/s at 8k, 203 t/s at 16k, and 136 t/s at 32k tokens.
It is also interesting to look at the performance relative to a GPU. I'm using an RTX-4080 GPU with the same model and FA enabled. Compared to earlier plots in this thread, I have changed the plot to show the ratio of GPU to CPU prompt processing speed and have restricted the prompt length to $\ge 100$ tokens to reduce the range of the y-axis. The Ryzen-7950X now saturates at about 27.5X lower performance compared to the RTX-4080, which is not bad at all.
![pp_gpu_vs_cpu](https://github.com/user-attachments/assets/ef674c0e-7556-4bbe-96cb-658a530aabc6)

View File

@@ -0,0 +1,64 @@
### 🗣️ [#256](https://github.com/ikawrakow/ik_llama.cpp/discussions/256) - Diverging from llama.cpp
| **Author** | `arnfaldur` |
| :--- | :--- |
| **Created** | 2025-03-14 |
| **Updated** | 2025-03-14 |
---
#### Description
I just discovered this fork yesterday and would like to understand the situation better. This message is addressed to @ikawrakow
I was very excited to discover that you were still innovating on quantizations, but I'm confused as to why it's happening on a fork with little desire (https://github.com/ikawrakow/ik_llama.cpp/issues/133) to upstream the developments. I researched the history of this fork and many of the discussions that led to its creation (like the curiosity about Justine's tinyBLAS doubts), but have still not found a satisfactory answer.
## Underutilization
The **very impressive** developments occurring on this fork seem to me to be underutilized. The `llama.cpp` community is huge and all those people could be enjoying the new `IQn_K` quants. But as it stands, most people don't know about them. Bartowski and his peers aren't uploading `IQn_K` quants to hugging face, and even if someone were to go through the effort of making them themselves, using them is considerably harder as there are no build instructions here, and the build process has changed upstream.
There is of course the possibility that you don't care about mass adoption of your quants, in which case the last paragraph isn't relevant. I completely respect that disposition, if that is the case.
I would be surprised if that was the case however. Why share the work on this fork if not for others to use? A potential answer would be that you prefer a smaller, more technical community that is less concerned about mass adoption and compatibility. That is certainly valid but there are some downsides, e.g. no Bartowski quants, slower support for new models, and no development of secondary tools like the server. You might not care about those things either. I do, but I can also solve them myself with mild effort.
## The quants of `llama.cpp`
A defining feature of llama.cpp is its popular model format and its supported quantizations. I know that many people always wait for Bartowski's speedy quantizations for new models and pick their preferred quants from there, just like I do. As I understand it, you contributed every one of these quantization schemes, many of which were SOTA or near SOTA at the time of publishing. In light of that, your efforts were instrumental in making `llama.cpp` into what it is today. Especially considering that quantization quality is probably the most important aspect of running models in RAM-constrained environments, which is the point of `llama.cpp`.
As is likely evident, I think it is a big loss to the commons that these new quants and optimizations aren't available upstream.
I still want to emphasize that I believe that there is a valid reason for the fork's creation and I would be very interested in hearing that reason.
## Resolution
In light of the importance of the past contributions to `llama.cpp`, I also want to know if you would ever consider upstreaming them, and importantly, under what conditions you would be willing to do that. The maintainers of `llama.cpp` should see the value in the work on this fork and want to get it upstreamed, and I hope that they would be willing to accommodate you and do whatever it takes to make you happy to contribute.
I'm sorry if this is a bit much, but I think it's very important and I was honestly shocked to discover this and that nobody is talking about this. Maybe I care more about quants than most `llama.cpp` users 🤷
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-03-14** at **06:06:08**:<br>
Hello @arnfaldur,
I'm hacking here to keep my brain utilized and to have some fun. Definitely not looking for fame and/or mass adoption of this repository. A few people have found it useful, this is good enough for me (and, if it did become popular, I'm not sure I want to spend my time supporting non-technical users). I will not be upstreaming stuff to `llama.cpp`, but obviously with this repo being MIT licensed, upstream is free to take from here whatever they find useful. In addition to the `IQX_K` quants, there are a lot of things here that are better than upstream. In no particular order
* CPU Flash Attention implementation that, unlike upstream, actually improves performance. By quite a margin for very long contexts. Oh, it also works for models where the K head size is different from the V head size (DeepSeek models)
* GPU Flash Attention for different K and V head sizes
* MLA in 2 variants, very relevant for DeepSeekV3/R1 CPU and GPU inference
* What I believe are the fastest quantized matrix multiplications on the planet
* Row interleaving for (almost) all quantization types, which leads to much better CPU performance. Upstream has some of that, but just for `Q4_0`, `Q8_0`, and `IQ4_NL`, and even for those the performance here is quite a bit better, even on `ARM` CPUs.
* Selective tensor offloading to the GPU. Very useful when the model does not fit in VRAM, and one can offload specific tensors to the GPU(s). This replicates what KTransformers have done
* Support for Bitnet models with much better performance than `llama.cpp` and even the 12k stars Bitnet repository from Microsoft
* Much more comprehensive `bf16` support. CUDA support for `bf16` was added not too long ago in upstream, but mine beats it by a factor of 2 for prompt processing
* Various fused operations. This includes fusing of experts (relevant for MoE models). Gemma2 performance is quite a bit better than upstream because of that on CPU, GPU, Metal (but I guess this is no longer relevant with Gemma3 now released)
* Support for custom quantization schemes
---
👤 **bitbottrap** replied the **2025-03-14** at **14:40:37**:<br>
I completely agree that some of this stuff needs to get into llama.cpp. And I completely understand why ikawrakow does not want to be personally responsible for it.
I'm not sure what the focus is over there in llama.cpp land but it's very active. I just don't see a lot of the core stuff being improved on like it is here.
File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,523 @@
### 🗣️ [#288](https://github.com/ikawrakow/ik_llama.cpp/discussions/288) - On @compilade's PR 12557 and @jukofyork's quantization ideas
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2025-03-25 |
| **Updated** | 2025-04-11 |
---
#### Description
@compilade has submitted an [interesting PR](https://github.com/ggml-org/llama.cpp/pull/12557) in the mainline `llama.cpp` repository. As is often the case, @jukofyork has improvement ideas. As both pinged me, and as I no longer hang around in the `llama.cpp` project, I'll address the pings here.
### @compilade's PR
First of all, this is a nice piece of work, so congratulations!
I did try the PR on a few models. I focused on `Q3_K` and `IQ4_NL` as I don't see the utility of using quantization types meant for ternary models (`TQ1_0`, `TQ2_0`) also for non-ternary models, and am also not particularly interested in the legacy quantization types (`Q4_0`, `Q5_0`, too low quality relative to the bits spent). I could have also looked at `IQ4_XS`, but it is very similar to `IQ4_NL`, so here we go with my observations:
* Without imatrix, the existing quantization methods are strictly better than your PR as measured by perplexity<sup>1</sup>
* With imatrix and pure quantization, your `Q3_K` is significantly better than the existing quantization method (but see below). `IQ4_NL` is hit-or-miss - sometimes slightly better, sometimes slightly worse, but overall not much of a difference apart from the 5X increase in quantization time.
* When I added the imatrix to `llama.cpp` it wasn't clear that it will take off the way it did. Hence, the quantization methods I contributed are the way they are. Perhaps they are suboptimal when there is a (meaningful) imatrix, but a major driving force was to make them as robust as possible for quantization without imatrix.
* I have run into this on a number of occasions when I was still actively working on quantization: in many models some tensors have a disproportionally high impact on the observed quantization quality. So, when using `--pure`, it may appear that one gets an improvement because the new method being tested happens to do better on exactly these tensors, but worse on many others. One gets excited about having improved things, but then in practice, with the high-impact tensors quantized with more bits in the quantization mix, suddenly the observed quality is lower than what one had before. Case in point, `Q3_K_M` with your PR often has a higher PPL than the existing quantization, despite being clearly better with `--pure`
* More on `--pure`: in some models token embedding quantization has a disproportional impact on observed quality, and some quantization types do not quantize `token_embd.weight` very well. You do use `Q8_0` for the output tensor, I think it would be better to also use `Q8_0` for token embeddings when using `--pure`.
* It is not that I didn't know how to implement exact minimization of RMSE (or maximization of cosine similarity, if that's what you prefer). The existing methods are the way they are because of the observation that the exact solution of the optimization problem often leads to disastrous results for observed quantization quality. RMSE (or cosine similarity) are just surrogates, so finding a better solution does not automatically lead to better quantization quality. I have seen people describe some of the k- and i-quant quantization methods as "brute force". They are not (brute force will look completely different and would take much longer. Also, the moment we decided to use brute force, that would be the moment where we would plug in an exact solution method that runs many times faster than brute force). They use carefully tuned heuristics to avoid the quants getting lost in the fields. When the imatrix came along I was excited to use exact solution methods instead of heuristics. Unfortunately, even with an imatrix, one can (and often does) end up with a worse outcome with quantized weights that are more similar to the original model weights (as measured by the surrogate).
* `IQ4_K` and `IQ5_K` here are miles ahead of any 4- or 5-bpw quantization type in mainline `llama.cpp`. Hence, I'm skeptical that they can be improved with your PR (but you are more than welcome to submit a PR here if you are able to demonstrate improvement). `IQ2_K` and `IQ3_K` are on par or slightly better than i-quants with similar size, so before improving these you have to find a way to apply the methods of your PR to `IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S` (one of your TODO items).
* On `TQ2_0` being faster than `IQ1_S`: in theory, sure. In practice, the table below shows what I observe with the PR branch for `TQ2_0`, and with `ik_llama.cpp` for `IQ1_S` (using the row-interleaved variant `IQ1_S_R4`):
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 8B TQ2_0 - 2.06 bpw ternary | 2.72 GiB | 8.03 B | CPU | 16 | pp512 | 153.05 ± 0.29 |
| llama 8B TQ2_0 - 2.06 bpw ternary | 2.72 GiB | 8.03 B | CPU | 16 | tg128 | 23.79 ± 0.00 |
| llama 8B IQ1_S_R4 - 1.5 bpw | 2.39 GiB | 8.03 B | CPU | 16 | pp512 | 184.46 ± 1.36 |
| llama 8B IQ1_S_R4 - 1.5 bpw | 2.39 GiB | 8.03 B | CPU | 16 | tg128 | 26.86 ± 0.00 |
### @jukofyork's ideas
If you start with a fully symmetric probability distribution (not always the case, but for simplicity let's assume it is fully symmetric), and you draw a **finite** number of random samples from it (the weights in one quantization block), and you then scale the sampled values such that the maximum magnitude value **always takes the same scaled value**, you end up with a non-symmetric probability distribution for the **scaled samples**. The smaller the sample size, the larger the asymmetry. With the sample size approaching infinity, the observed probability distribution will become symmetric. You can ask WolframAlpha about it, or you can write a simple script that samples 32 values from a Gaussian distribution, scales, and scores the resulting scaled pdf.
Anyway, this is why the `IQ4_NL` (and `IQ4_XS`, as well as the `IQ2_K, IQ3_K` quants from this repository) quant lookup tables are asymmetric (and not because I'm a moron who didn't know how to make a symmetric function). But if you don't take this for granted (you most likely don't), just go and replace `kvalues_iq4nl` in `ggml-quants.c` with your symmetric variant, and watch the disaster that ensues. You need to do it in a few more places because for some reason this table is not in `ggml-common.h` as it should be.
___
<sup>1</sup> I know, I know. The Internet Gods have spoken: PPL doesn't tell us anything and is completely useless; KLD is the one and only true measure of quantization quality. But me, not being a religious person, and having quite a bit of research experience under my belt, I don't take the Gods' opinions for granted. I have written elsewhere about the equivalence of PPL and KLD for an infinitely large test corpus, and about the superiority of PPL for a test corpus of limited size, so I will not repeat myself here.
---
#### 🗣️ Discussion
👤 **jukofyork** replied the **2025-03-25** at **12:48:44**:<br>
> @compilade has submitted an [interesting PR](https://github.com/ggml-org/llama.cpp/pull/12557) in the mainline `llama.cpp` repository. As is often the case, @jukofyork has improvement ideas. As both pinged me, and as I no longer hang around in the `llama.cpp` project, I'll address the pings here.
> ### @jukofyork's ideas
>
> If you start with a fully symmetric probability distribution (not always the case, but for simplicity let's assume it is fully symmetric), and you draw a **finite** number of random samples from it (the weights in one quantization block), and you then scale the sampled values such that the maximum magnitude value **always takes the same scaled value**, you end up with a non-symmetric probability distribution for the **scaled samples**. The smaller the sample size, the larger the asymmetry. With the sample size approaching infinity, the observed probability distribution will become symmetric. You can ask WolframAlpha about it, or you can write a simple script that samples 32 values from a Gaussian distribution, scales, and scores the resulting scaled pdf.
>
> Anyway, this is why the `IQ4_NL` (and `IQ4_XS`, as well as the `IQ2_K, IQ3_K` quants from this repository) quant lookup tables are asymmetric (and not because I'm a moron who didn't know how to make a symmetric function). But if you don't take this for granted (you most likely don't), just go and replace `kvalues_iq4nl` in `ggml-quants.c` with your symmetric variant, and watch the disaster that ensues. You need to do it in a few more places because for some reason this table is not in `ggml-common.h` as it should be.
Just to be clear: I wasn't implying you had done anything wrong and was merely showing something that I had noticed and spent a couple of hours playing with last year (which I never mentioned before as it wasn't clear it was of any use nor related to anything useful).
I'm sorry if I've come across badly as this isn't my intention - I've nothing to gain from any of this, but just find it interesting :) If you search my nick you can find similar posts by me on the now dead 2+2 forums (everything is on discord now sadly) on similar topics from 25+ years ago!
---
👤 **ikawrakow** replied the **2025-03-25** at **14:28:09**:<br>
@jukofyork Sorry if I have come across a bit harsh. But it is interesting stuff indeed, so we all can get passionate about it.
Anyway, attached is a very simple C++ program that illustrates the asymmetry of the scaled distribution. Here is what it does:
* It picks $N$ random points, either uniformly in $[-1,1]$ or from a Gaussian distribution with $\sigma = 1$ (command line argument)
* It finds the minimum and maximum values in the sample $x_{\rm min}$ and $x_{\rm max}$
* It determines a scale such that the value with the larger absolute value is at -1. I.e., if $|x_{\rm min}| > |x_{\rm max}|$, then $s = -1/x_{\rm min}$, else $s = -1/x_{\rm max}$. It then takes the other extremum (the one with the lower absolute value), and computes $x_s = s x_{\rm other}$.
* It repeats the above $M$ times and computes the average of the observed $x_s$
Here is a plot of the computed average as a function of sample size $N$. For a sample of just 2 points, the average is effectively zero. If the distribution of scaled values was symmetric, the average should be 1 (or very close to 1). We see that this is not the case. For a Gaussian distribution we are quite far away from the symmetric value of 1 that we expect for $N \to \infty$ even for $N = 32$ (the typical block size used in many k- and i-quants). I have used
```
g++ -O3 distr1.cpp
./a.out 1000 -32 >test1.out
./a.out 1000 -32 1 > test2.out
```
to generate the data in the graph (a negative sample size will cause the program to loop between 2 and the absolute value of the argument given).
![distr](https://github.com/user-attachments/assets/81286fac-86ec-4f20-873e-24d6eb18f36c)
[distr1.cpp.gz](https://github.com/user-attachments/files/19449673/distr1.cpp.gz)
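A minimal sketch of the same steps, in case someone wants to play with it without downloading the attachment (this is not the attached `distr1.cpp` and uses a simpler command line, so use the attachment to reproduce the exact graph):
```cpp
// Sketch: draw N samples, scale so the max-|x| value maps to -1, record the
// scaled value of the other extremum, and average over M trials.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <random>

int main(int argc, char** argv) {
    const int  M        = argc > 1 ? std::atoi(argv[1]) : 10000; // number of trials
    const int  N        = argc > 2 ? std::atoi(argv[2]) : 32;    // sample (block) size
    const bool gaussian = argc > 3;                              // default: uniform in [-1,1]
    std::mt19937 rng(1234);
    std::uniform_real_distribution<float> u(-1.f, 1.f);
    std::normal_distribution<float> g(0.f, 1.f);
    double sum = 0;
    for (int m = 0; m < M; ++m) {
        float xmin = 0.f, xmax = 0.f;
        for (int j = 0; j < N; ++j) {
            const float x = gaussian ? g(rng) : u(rng);
            if (j == 0) xmin = xmax = x;
            else { xmin = std::min(xmin, x); xmax = std::max(xmax, x); }
        }
        // scale so that the extremum with the larger magnitude lands at -1,
        // then record the scaled value of the other extremum
        if (std::fabs(xmin) > std::fabs(xmax)) sum += (-1.f/xmin)*xmax;
        else                                   sum += (-1.f/xmax)*xmin;
    }
    std::printf("N = %d  <x_s> = %g\n", N, sum/M);
    return 0;
}
```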
---
👤 **ikawrakow** replied the **2025-03-25** at **15:01:41**:<br>
Here is another very simple C++ program:
* Pick $N$ random values
* Sort them in increasing order. Let the sorted values be $x_i$
* If $|x_0| > |x_{N-1}|$, then $s = -1/x_0,\quad\tilde{x}_i = s x_i$
* Else $s = -1/x_{N-1}$ and $\tilde{x}_i = s x_{N-1-i}$ (don't know why it doesn't show the equation correctly)
* Compute the average of the scaled $\tilde{x}_i$ over a given number of samples.
With this, we get this graph. It looks very similar to what one gets by doing an actual block-wise quantization with non-uniform values.
![distr2](https://github.com/user-attachments/assets/92a9e89c-297b-4a1c-be36-675499e094c5)
[distr2.cpp.gz](https://github.com/user-attachments/files/19450493/distr2.cpp.gz)
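The same kind of minimal sketch for this second experiment (again not the attached `distr2.cpp`); it prints the per-position averages, i.e., the "effective grid" of the scaled samples:
```cpp
// Sketch: sort each sample, scale so the max-|x| value maps to -1 (flipping
// the order when needed), and average the scaled values per position.
#include <algorithm>
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <random>
#include <vector>

int main(int argc, char** argv) {
    const int M = argc > 1 ? std::atoi(argv[1]) : 100000; // number of trials
    const int N = argc > 2 ? std::atoi(argv[2]) : 16;     // sample size
    std::mt19937 rng(1234);
    std::normal_distribution<float> g(0.f, 1.f);
    std::vector<double> avg(N, 0.0);
    std::vector<float>  x(N);
    for (int m = 0; m < M; ++m) {
        for (int j = 0; j < N; ++j) x[j] = g(rng);
        std::sort(x.begin(), x.end());
        if (std::fabs(x[0]) > std::fabs(x[N-1])) {
            const float s = -1.f/x[0];
            for (int i = 0; i < N; ++i) avg[i] += s*x[i];
        } else {
            const float s = -1.f/x[N-1];
            for (int i = 0; i < N; ++i) avg[i] += s*x[N-1-i];
        }
    }
    for (int i = 0; i < N; ++i) std::printf("%d  %g\n", i, avg[i]/M);
    return 0;
}
```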
---
👤 **compilade** replied the **2025-03-25** at **16:25:49**:<br>
@ikawrakow
> First of all, this is a nice piece of work, so congratulations!
Thank you. Your existing work on `imatrix` definitely made it easier to try this kind of weighted rounding algorithms on actual models. At first the idea only applied to ternarization with no ability to weigh the error: <https://github.com/microsoft/BitNet/discussions/112>.
> Without imatrix, the existing quantization methods are strictly better than your PR as measured by perplexity
Right. I will consider reverting back to the existing quantization methods when `imatrix` is not used (although for `Q3_K`, I still think `make_q3_quants` has some problems when the sign of the absmax value is positive (according to the equirectangular projections, in that case it looks almost exactly like what `Q3_0` would look like (in the upper left part)), which could be fixed).
I was hoping the more exhaustive algorithms would always be better (since they *are* better at minimizing the weighted squared error), but when they optimize the wrong thing (when no `imatrix` is given) they can be worse, except apparently for some models like `Qwen2.5-Coder-3B-Instruct`.
But I also suspect the default weights for the weighted rounding without `imatrix` could be improved (but at that point I guess I should only change what rounding algorithm is used *if* I find those better default weights (which I thought I did from the results of `Qwen2.5-Coder-3B-Instruct`, but apparently not in general)).
Aside: *is there* a generally better solution for the default importance weights (without `imatrix`)? (It seems the heuristics between quant types disagree: some use `x[i] * x[i]`, others `fabsf(x[i])`, and others `sqrtf(sum_x2/N) + fabsf(x[i])` (Note that I did read <https://github.com/ikawrakow/ik_llama.cpp/discussions/140>, I'm not questioning that these were better in practice in their respective cases))
I think this depends on the weighted rounding algorithm with which the weights are used (since the behaviors can be different).
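To spell the three variants out for a single block (just an illustrative paraphrase of the heuristics mentioned above, not a copy of any particular `quantize_row_*` function):
```cpp
#include <cmath>

// Candidate default importance weights for a block of n values when no imatrix
// is available. xb: block values, w: resulting weights (illustrative only).
static void default_weights(const float * xb, float * w, int n, int variant) {
    float sum_x2 = 0.f;
    for (int j = 0; j < n; ++j) sum_x2 += xb[j]*xb[j];
    const float sigma = std::sqrt(sum_x2/n);
    for (int j = 0; j < n; ++j) {
        if      (variant == 0) w[j] = xb[j]*xb[j];               // x^2
        else if (variant == 1) w[j] = std::fabs(xb[j]);          // |x|
        else                   w[j] = sigma + std::fabs(xb[j]);  // sigma + |x|
    }
}
```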
> `IQ4_NL` is hit-or-miss - sometimes slightly better, sometimes slightly worse, but overall not much of a difference apart from the 5X increase in quantization time
Strange, the increase in quantization time for `IQ4_NL` with `imatrix` is only slightly more than 2× for me, and close to none (1×) when no `imatrix` is provided. There is room for improvement in the performance of `make_qkxh_nl_quants` because I did not yet extensively profile it with `perf` except for a previously slower `qsort`-based version (which *really was* 5× slower).
And there are still some adjustments I did not try yet and which could improve both the time (by a noticeable factor) and perplexity (hopefully), which is to add the same "clamping protection" as my linear weighted rounding algorithms (e.g. in `make_qkxh_quants`, the inverse scales which would clamp the `x[i]` with the biggest `w[i] * fabsf(x[i])` are not tried (since this *did* improve the PPL and KLD with `imatrix` for linear quants like `Q3_K`, `Q4_0` and `Q5_0`)). But it might also not help in which case I'm considering reverting to the existing `IQ4_NL` quantization algorithm, even though it makes less satisfying equirectangular projections.
I value your feedback, which is why I'll try to improve on this point (or exclude the changes to `IQ4_NL`).
> You do use `Q8_0` for the output tensor, I think it would be better to also use `Q8_0` for token embeddings when using `--pure`.
I do use `Q8_0` for the token embeddings too in my tests. The example command I've included in the PR description **does** specify `--token-embedding-type q8_0`
```console
$ ./bin/llama-quantize --imatrix <some-file.imatrix> --token-embedding-type q8_0 --output-tensor-type q8_0 --pure <source.gguf> <quant.gguf> <quant-type>
```
> RMSE (or cosine similarity) are just surrogates, so finding a better solution does not automatically lead to better quantization quality.
Yeah, I did notice that. The search algorithms I've made can be adapted to other metrics (although that can also be said of the existing algorithms for k-quants, since they also use weighted squared error), as long as they can be calculated cumulatively.
I'd like to find better surrogates, and more exhaustive search algorithms which are not brute-force (yet still yield optimal-looking results) can help with that, even though for now minimizing weighted squared error on the model tensors doesn't quite match the actual thing we want to minimize (PPL and KLD), which makes your carefully tuned heuristics superior for now.
> Case in point, Q3_K_M with your PR often has a higher PPL than the existing quantization, despite being clearly better with `--pure`
On which model(s) did you observe this? I'd like to reproduce this observation.
> I have written elsewhere about the equivalence of PPL and KLD for an infinitely large test corpus, and about the superiority of PPL for a test corpus of limited size, so I will not repeat myself here.
Right, but the test corpus is not infinite, and for a small test corpus I actually find KLD faster for meaningful comparisons (because the ± error goes down faster than for `ln(PPL(Q)/PPL(base))`, and so sometimes when I'm not using a GPU I don't have to leave it running that long to know if a change is meaningful when tweaking some things).
But I agree PPL is more convenient for quickly comparing versions of quants of a lot of different models (because the logits files get big really fast), at least when using a GPU.
> But it is interesting stuff indeed, so we all can get passionate about it.
Yes, totally agree! And technically I already got what I wanted out of these algorithms (even if they are not merged or not better), which is the very nice plots they can make to hopefully help me understand a bit more the representable vector space of both linear and non-linear quants, especially when viewed appropriately in a 360 degree panorama viewer: <https://blobs.compilade.net/pannellum.htm#panorama=equirectangular-iq4nl-qkxs-2048.png>.
---
👤 **ikawrakow** replied the **2025-03-25** at **16:53:43**:<br>
> Aside: is there a generally better solution for the default importance weights (without imatrix)? (It seems the heuristics between quant types disagree: some use x[i] * x[i], others fabsf(x[i]), and others sqrtf(sum_x2/N) + fabsf(x[i])
It is a heuristic. Trial and error. IIRC, higher bpw quants do better with a stronger large magnitude weighting (e.g., $x^2$), with lower bpw $|x|$ or similar is generally better.
> On which model(s) did you observe this? I'd like to reproduce this observation.
Go back to the basics. Start with LLaMA-v1-7B. I know, nobody uses that today. But then again, almost all of k-quants development was based on the experience with the LLaMA-v1 models, and k-quants have done surprisingly well in the almost two years since they were released on the thousands of models they have been tried on. Even today when I want to try a new quantization idea, I always check performance with LLaMA-v1, LLaMA-v2, and Mistral-7B. Your `IQ4_NL` doesn't do very well on LLaMA-v1-7B - without an imatrix it arrives at a PPL higher than `Q4_0`.
> Strange, the increase in quantization time for IQ4_NL with imatrix is only slightly more than 2× for me,
Oh, I used `ik_llama.cpp` to compare. It is possible that has become much faster than mainline (I haven't used mainline for quite some time). I started testing with DeepSeek-Lite, and almost gave up (your `IQ4_NL` quantization took 302.5 seconds with imatrix). `ik_llama.cpp` does it in 54.5 seconds.
> 👤 **bartowski1182** replied the **2025-03-26** at **17:42:29**:<br>
> Re: quantization speed
>
> Do you have any loose thoughts on where your crazy speedup may be coming from? Not asking you to do a thorough investigation, but curious if you have an initial place to point me
>
> 👤 **ikawrakow** replied the **2025-03-26** at **18:16:32**:<br>
> IIRC:
> At some point I was annoyed by the slow quantization speed of quantization types with non-linear grids (`IQ4_XS, IQ4_NL` in mainline, here also `IQ2_KS, IQ2_K, IQ3_K, IQ4_K, IQ5_K, IQ6_K`). The major bottleneck turned out to be finding the bin in which a value falls after scaling. E.g., [this function](https://github.com/ggml-org/llama.cpp/blob/2447ad8a981253a2b8e9f4b31cc8e7fdff83423e/ggml/src/ggml-quants.c#L4562) in mainline, which does a binary search to find the bin. So, I replaced that with functions such as [this one](https://github.com/ikawrakow/ik_llama.cpp/blob/a22250df93fd833a6cb7f310b159ad1b54e4d582/ggml/src/ggml-quants.c#L14528). I think that was the major part. I don't remember if I did additional optimizations and what they were, if any. I would have to go through the old PRs to find out.
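>
> For illustration, the general trick is something like the following (a sketch of the idea only, not the actual `best_index_iq4nl` code): build, once, a table that maps every possible integer value of the scaled input to the index of the nearest grid entry, so that finding the bin becomes a clamp plus a table lookup instead of a binary search.
>
> ```cpp
> #include <cmath>
> #include <cstdint>
> #include <cstdlib>
>
> // `values` is the 16-entry non-linear grid (e.g. kvalues_iq4nl);
> // map[v+128] holds the index of the grid entry closest to the integer v.
> static int8_t map[256];
> static void build_map(const int8_t * values) {
>     for (int v = -128; v < 128; ++v) {
>         int best = 0, bd = std::abs(v - values[0]);
>         for (int j = 1; j < 16; ++j) {
>             const int d = std::abs(v - values[j]);
>             if (d < bd) { bd = d; best = j; }
>         }
>         map[v + 128] = (int8_t)best;
>     }
> }
> static inline int best_index(float x) { // x is the already-scaled value
>     const int v = x < -128.f ? -128 : x > 127.f ? 127 : (int)std::nearbyint(x);
>     return map[v + 128];
> }
> ```
>
> Near the midpoint between two grid values this rounded lookup can be off by one bin relative to an exact search, so a careful implementation also checks the neighbour; the point here is only that the per-value binary search is gone.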
>
> 👤 **compilade** replied the **2025-03-26** at **18:24:02**:<br>
> @bartowski1182
>
> (EDIT: sorry, I did not see ikawrakow's answer before commenting)
>
> My guess would be that `best_index_iq4nl` is faster than `best_index_int8`:
>
> <https://github.com/ikawrakow/ik_llama.cpp/blob/a22250df93fd833a6cb7f310b159ad1b54e4d582/ggml/src/ggml-quants.c#L14518-L14533>
>
> And `best_index_int8` does lots of comparisons instead of using a lookup table more directly (doesn't seem to render inline since it's from a different repo (mainline `llama.cpp`)):
>
> <https://github.com/ggml-org/llama.cpp/blob/2447ad8a981253a2b8e9f4b31cc8e7fdff83423e/ggml/src/ggml-quants.c#L4562-L4571>
>
> I will check if (and how) `best_index_iq4nl` affects the equirectangular projection of `IQ4_NL`, since that seems relevant.
> (EDIT: it doesn't seem to change anything at a cursory glance. So it is pretty much equivalent.)
>
> 👤 **ikawrakow** replied the **2025-03-26** at **18:40:39**:<br>
> Here is some napkin math: @compilade said that their approach is only 2X slower than the master branch in mainline. If I use the DeepSeek-Lite values, it means mainline will quantize it in 150 seconds instead of 300 seconds. If you add this optimization, it will become 50 seconds (using round values to make it easier to follow). You then add 150 seconds for the heap search, and it becomes 200 seconds. So, 4X slower than `ik_llama.cpp`, but only ~30% slower than the current state of mainline.
>
> 👤 **compilade** replied the **2025-03-26** at **19:26:28**:<br>
> @ikawrakow My implementation (with the cumulative search) unfortunately cannot use this optimization, because it doesn't use `best_index_int8` anyway. The reason my implementation is slow is because it's too exhaustive. It calculates `sumqx` and `sumq2` for *all* scales which would result in a distinct quantization, and it tests both signs. That is `(32*(7+8))+1 = 481` distinct scales compared per block of 32, compared to the `(2*7+1)+1 = 16` scales compared by the implementations which use either `best_index_int8` or `best_index_iq4nl`.
>
> It's nice that it's not `481/16 = 30` times slower, though 6× does seem too slow, I agree.
>
> The only ways to make the cumulative search faster is to reduce how many scales it searches (which for linear quants is easier because more of them are equivalent and can be skipped), or to make the cumulative step faster.
>
> (It might be possible to mix both approaches to search for more than 16 scales at 1× speed (or faster))
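>
> For reference, the kind of search the 16-scale implementations do is roughly the following Σ w·q·x / Σ w·q² search over a handful of candidate scales (a generic sketch for a symmetric grid, not a copy of `make_qx_quants`):
>
> ```cpp
> #include <algorithm>
> #include <cmath>
> #include <cstdint>
>
> // x: block values, w: importance weights (e.g. imatrix-derived),
> // L: output quants in [-nmax, nmax]. For each candidate inverse scale we
> // round, accumulate sumqx = sum(w*q*x) and sumq2 = sum(w*q*q); the optimal
> // scale for that rounding is sumqx/sumq2 and the figure of merit is
> // sumqx^2/sumq2, which we maximize.
> static float best_scale(int n, const float * x, const float * w, int nmax, int8_t * L) {
>     float amax = 0.f;
>     for (int i = 0; i < n; ++i) amax = std::max(amax, std::fabs(x[i]));
>     if (amax == 0.f) { for (int i = 0; i < n; ++i) L[i] = 0; return 0.f; }
>     float best = 0.f, scale = 0.f;
>     for (int k = -8; k <= 7; ++k) {              // ~16 candidate inverse scales
>         const float is = (nmax + 0.1f*k)/amax;
>         float sumqx = 0.f, sumq2 = 0.f;
>         for (int i = 0; i < n; ++i) {
>             int q = (int)std::nearbyint(is*x[i]);
>             q = q < -nmax ? -nmax : q > nmax ? nmax : q;
>             sumqx += w[i]*q*x[i];
>             sumq2 += w[i]*q*q;
>         }
>         if (sumq2 > 0.f && sumqx*sumqx > best*sumq2) { best = sumqx*sumqx/sumq2; scale = sumqx/sumq2; }
>     }
>     for (int i = 0; i < n; ++i) {                // final quants with the winning scale
>         const int q = scale != 0.f ? (int)std::nearbyint(x[i]/scale) : 0;
>         L[i] = (int8_t)(q < -nmax ? -nmax : q > nmax ? nmax : q);
>     }
>     return scale;
> }
> ```
>
> The cumulative search described above extends the candidate set to every inverse scale that produces a distinct rounding, which is where the 481-vs-16 comparison comes from.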
>
> 👤 **bartowski1182** replied the **2025-03-26** at **19:35:38**:<br>
> Appreciate the insights, thanks!
---
👤 **ikawrakow** replied the **2025-03-28** at **09:36:09**:<br>
@compilade @bartowski1182
You may be interested in PR #295
---
👤 **ubergarm** replied the **2025-03-29** at **17:57:59**:<br>
While not directly related to the quants specific to #295 , I did just release what may be one of the best quants (for generation quality) in its size class for `V3-0324` on huggingface [ubergarm/DeepSeek-V3-0324-GGUF](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF) cooking with `ik_llama.cpp`. It also still fits 32k context in under 24GB VRAM and can hit over 4 tok/sec tg mmap'ing on my 9950x 96GB + 3090TI 24GB VRAM rig using `-ser 6,1` sacrificing minimal perplexity.
It only works with `ik_llama.cpp` as even with experimental mainline PRs [fairydreaming:deepseek2-mla-exp](https://github.com/ggml-org/llama.cpp/pull/11446) and [sl/custom-tensor-offload](https://github.com/ggml-org/llama.cpp/pull/11397) you still need support for `IQ3_K_R4`/`IQ2_K_R4` which is only available here.
I haven't done full perplexity and benchmarking comparisons across the major quant cookers' versions, but have a rough table showing the differences between ubergarm, @bartowski1182, @danielhanchen (unsloth), and eventually mradermacher's recipes. I'll add it under the fold here for convenience.
Big thanks to y'all doing so much inspirational work and making this stuff more and more accessible!
:point_down:
<details>
<summary>:point_left: V3-0324 quant recipe comparison table</summary>
| | [ubergarm/DeepSeek-V3-0324-IQ2_K_R4](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) | [bartowski/DeepSeek-V3-0324-Q2_K_L](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF?show_file_info=deepseek-ai_DeepSeek-V3-0324-Q2_K_L%2Fdeepseek-ai_DeepSeek-V3-0324-Q2_K_L-00001-of-00007.gguf) | [unsloth/DeepSeek-V3-0324-UD-Q2_K_XL](https://huggingface.co/unsloth/DeepSeek-V3-0324-GGUF?show_file_info=UD-Q2_K_XL%2FDeepSeek-V3-0324-UD-Q2_K_XL-00001-of-00006.gguf) | [mradermacher/DeepSeek-V3-0324-i1-GGUF-Q2_K](https://huggingface.co/mradermacher/DeepSeek-V3-0324-i1-GGUF) |
| --- | --- | --- | --- | --- |
| **Overview** | | | | |
| `tensor_count` | 267 | 190 | 253 | |
| `kv_count` | 53 | 53 | 49 | |
| `split.tensors.count` | 1147 | 1025 | 1025 | |
| `token_embd.weight` | `Q8_0` | `Q8_0` | `Q4_K` | |
| File Size (GiB) | 227 | 228 | 231 | |
| **Multi-Head Latent Attention** | | | | |
| `blk.*.attn_kv_b.weight` | `Q8_0` | n/a | n/a | n/a |
| `blk.*.attn_k_b.weight` | `Q8_0` | n/a | n/a | n/a |
| `blk.*.attn_v_b.weight` | `Q8_0` | n/a | n/a | n/a |
| **Dense Layers** | | | | |
| `blk.[0-2].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[0-2].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[0-2].attn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].ffn_down.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
| `blk.[0-2].ffn_gate.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].ffn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[0-2].ffn_up.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[0-2].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
| **Shared & Routed MoE Layers** | | | | |
| `blk.[3-60].attn_kv_a_mqa.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[3-60].attn_kv_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_kv_b.weight` | `Q8_0` | `Q2_K` | `Q6_K` | |
| `blk.[3-60].attn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_q_a.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].attn_q_a_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].attn_q_b.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].exp_probs_b.bias` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_down_exps.weight` | `IQ3_K_R4` | `Q3_K` | `Q3_K` | |
| `blk.[3-60].ffn_down_shexp.weight` | `Q8_0` | `Q3_K` | `Q6_K` | |
| `blk.[3-60].ffn_gate_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
| `blk.[3-60].ffn_gate_inp.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_gate_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].ffn_norm.weight` | `F32` | `F32` | `F32` | |
| `blk.[3-60].ffn_up_exps.weight` | `IQ2_K_R4` | `Q2_K` | `Q2_K` | |
| `blk.[3-60].ffn_up_shexp.weight` | `Q8_0` | `Q2_K` | `Q4_K` | |
| `blk.[3-60].attn_output.weight` | `Q8_0` | `Q3_K` | `Q4_K` | |
| **Important Matrix & Perplexity** | | | | |
| `imatrix.dataset` | `calibration_data_v5_rc.txt`| `calibration_datav3.txt` | n/a | ? |
| Final PPL (wiki.test.raw) | 3.5614 +/- 0.02001 | ? | ? | ? |
</details>
:point_up:
> 👤 **ikawrakow** replied the **2025-03-29** at **18:18:55**:<br>
> I would be really curious to see the PPL values of the other quant cookers.
>
> 👤 **bartowski1182** replied the **2025-03-29** at **18:42:51**:<br>
> How many chunks of wiki test raw are you using for PPL? If you give your exact command I can get you the PPL for my own quant
>
> It's very intriguing. I know that most likely the unsloth one will be better than my own since he went out of his way to optimize the tensor types for that model which is just not something I have the throughput to handle 😅
>
> Also don't really want to make the same ones as him and release them since it would just be ripping off his work 🤷‍♂️
>
> Interesting stuff overall though
>
> 👤 **ubergarm** replied the **2025-03-29** at **19:06:34**:<br>
> Yeah I'm curious too! Bartowski, you do use imatrix though, which I don't think unsloth does. So not sure how that would make up for the smaller tensor types.
>
> I just ran the `Q8_0` for baseline comparison and got this result:
>
> >Final estimate: PPL = 3.2454 +/- 0.01773
>
> Here is the methodology including exact wiki.text.raw and commands:
>
> <details>
>
> <summary>:point_right: Details and Methodology :point_left: </summary>
>
> ```bash
> $ cd ik_llama.cpp
> $ git rev-parse --short HEAD
> 4819257c
>
> $ wget https://github.com/user-attachments/files/19090237/wiki.test.raw.gz
> $ gunzip wiki.test.raw.gz
> $ sha256sum wiki.test.raw
> 173c87a53759e0201f33e0ccf978e510c2042d7f2cb78229d9a50d79b9e7dd08 wiki.test.raw
>
> # CPU+GPU Perplexity Run
> $ CUDA_VISIBLE_DEVICES="0," \
> ./build/bin/llama-perplexity \
> --model /mnt/raid/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4.gguf \
> -ctk q8_0 \
> -mla 2 -fa \
> -amb 512 \
> -fmoe \
> --ctx-size 512 \
> --ubatch-size 512 \
> -f wiki.test.raw \
> --seed 1337 \
> --n-gpu-layers 63 \
> --override-tensor exps=CPU \
> --threads 24
>
> # CPU only Perplexity Run (for big `Q8_0`)
> $ numactl -N 1 -m 1 \
> ./build/bin/llama-perplexity \
> --model /mnt/ai/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-Q8_0.gguf \
> -ctk q8_0 \
> -mla 3 -fa \
> -amb 512 \
> -fmoe \
> --ctx-size 512 \
> --ubatch-size 512 \
> -f wiki.test.raw \
> --seed 1337 \
> --numa numactl \
> --threads 128
>
> llama_print_timings: load time = 3493.83 ms
> llama_print_timings: sample time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
> llama_print_timings: prompt eval time = 4081619.28 ms / 287232 tokens ( 14.21 ms per token, 70.37 tokens per second)
> llama_print_timings: eval time = 0.00 ms / 1 runs ( 0.00 ms per token, inf tokens per second)
> llama_print_timings: total time = 4132068.91 ms / 287233 tokens
>
> Final estimate: PPL = 3.2454 +/- 0.01773
> ```
>
> </details>
>
> One other nice thing about `ik_llama.cpp` is you can customize the layers using a script without maintaining a llama.cpp code fork. I included the [script I used on the model card](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF#quantize-script).
>
> Finally, I'm not sure what imatrix text mradermacher uses to make imatrix, but I did a [quick comparison](https://gist.github.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c?permalink_comment_id=5519433#gistcomment-5519433) of two otherwise identical quantizations using bartowski's imatrix and a slightly updated input text. They give similar perplexity against wiki.text.raw, for whatever that is worth hah...
>
> Anyway, yeah thanks for all your effort! I dunno how y'all keep up with the torrent of near weekly big model releases lately! Cheers!
>
> 👤 **ikawrakow** replied the **2025-03-29** at **19:06:35**:<br>
> I think @ubergarm can do the full PPL in less than an hour with their Xeon server. I don't know what kind of hardware you have.
>
> > ... since he went out of his way to optimize the tensor types for that model
> > Also don't really want to make the same ones as him and release them since it would just be ripping off his work
>
> I'm sure you are aware that quantization mixes have been in `llama.cpp` since the release of k-quants. All of those use more bits for the first few `ffn_down` layers. Also all of them use more bits for the attention tensors in MoE models. If you look at Unsloth's so-called "dynamic" quants, it is easy to see that with a small change of the function that determines the quantization type to handle the different names of the DeepSeek tensors (and the presence of shared experts), you will get basically what they used. Did they mention that? Of course not. So now the entire industry knows that Unsloth invented "dynamic" quants.
>
> 👤 **bartowski1182** replied the **2025-03-29** at **20:14:48**:<br>
> Yeah I did browse through his repo to check the changes he made, I do understand the overall nature of the quantization mixes and his adjustments made, and I know I could either pull his fork or make similar changes of my own to get the same results but just out of principle don't want to rehost if I'm not actually adding anything to the process
>
> I've got myself an EPYC server so things run pretty okay on my end as well, I'm just lacking on the GPU front for some things :)
>
> Unsloth also did a weird thing by releasing truly (I think) "dynamic" BnB quants at the same time as "dynamic" DeepSeek GGUF quants, so the naming feels a bit off, but there clearly is some value to be gained by manually altering the decision making for tensor types to favour some over others with DeepSeek; the generic existing one is leaving performance on the table
>
> Of course I'd like to know if the efforts in this branch more than make up for that, it wouldn't surprise me at all..
>
> > All of those use more bits for the first few ffn_down layers. Also all of them use more bits for the attention tensors in MoE models
>
> This part however I was not explicitly aware of, but still in terms of raw bits per weight, unsloth's mix seems superior (at least in the tests he has ran, PPL, KLD, and additional tests would be good to see if it's genuinely big improvements or if it's actually similar overall)
>
> 👤 **saood06** replied the **2025-03-30** at **01:51:10**:<br>
> Since mradermacher doesn't use gguf split you may have to use [gguf-py/scripts/gguf_dump.py](https://github.com/ikawrakow/ik_llama.cpp/blob/main/gguf-py/scripts/gguf_dump.py) to get the metadata.
>
> > 👇
> > 👈 V3-0324 quant recipe comparison table
> > ☝️
>
> You can probably remove tensor_count from the table; it doesn't matter, as it changes based on split size. And kv_count also doesn't really mean much; it's just the number of metadata entries.
>
> 👤 **ikawrakow** replied the **2025-03-30** at **05:44:14**:<br>
> > This part however I was not explicitly aware of, but still in terms of raw bits per weight, unsloth's mix seems superior
>
> Superior compared to what? To unmaintained `llama.cpp`? Where @compilade's PR 12557 is the first noteworthy thing related to quantization that has happened since I left the project more than a year ago?
>
> Let's take a look at a few examples.
>
> [This line](https://github.com/ggml-org/llama.cpp/blob/af6ae1efb27a9a7c3f7f7f84639d2243f7303ac1/src/llama-quant.cpp#L250) and the following checks if this is an attention tensor, and if we are dealing with a MoE model. It worked for Mixtral8x7B, which was the only serious MoE model at the time. But in DeepSeek the most important attention tensor is `attn_kv_b`, and we do not have exactly 8 experts, so we don't get the intended behavior.
>
> [This line](https://github.com/ggml-org/llama.cpp/blob/af6ae1efb27a9a7c3f7f7f84639d2243f7303ac1/src/llama-quant.cpp#L316) sets more bits for the attention output tensor. Again, it fails because DeepSeek doesn't have exactly 8 experts, and none of the 1000+ `llama.cpp` contributors knew how to adapt it to the MoE models that came out after Mixtral8x7B.
>
> When the quantization mix strategies for MoE were written, experts were in separate tensors named `blk.X.ffn_up/gate/down.Y.weight` (where `X` was the layer index and `Y` the expert index). Then somebody decided to combine the experts into a single tensor named `blk.X.ffn_up/down/gate_exps.weight`, but did not change the code that decides on the quantization mix. Voila, you have the `QX_K_M` "dynamic" quants not working as intended.
>
> Take a look at the code block that follows `} else if (name.find("ffn_down") != std::string::npos) {`. Several of the quantization type modifications use more bits for the first `1/8` of the layers. Which is 7 for DeepSeek-V3/R1. In how many layers do Unsloth use more bits for `ffn_down` in their "carefully tuned dynamic" quants?
>
> 👤 **bartowski1182** replied the **2025-03-30** at **15:33:58**:<br>
> > Superior compared to what? To unmaintained llama.cpp? Where @compilade's PR 12557 is the first noteworthy thing related to quantization that has happened since I left the project more than a year ago?
>
> I mean yeah I did mention that I wouldn't be surprised if this branch has superior performance over even what he did 🤷‍♂️ I do recognize the stale state llama.cpp has been left in with regards to SOTA quantization performance
>
> I'm also not attempting to advocate his work or claim it's a God send, I recognize what it is and what it's being compared to
>
> Against llama.cpp's IQ2_XXS, it seems to perform closer to the original weights in terms of at least behaviour
>
> That's not to say it's anywhere near SOTA or even necessarily close to what you've achieved here, just a factual observation to be used as evidence that in llama.cpp there's clearly performance being left on the table
>
> That's a very interesting observation about the MoE code though containing a quite glaring bug, I wonder how much fixing that alone gets us back.. presumably a lot since as you mentioned most of the changes in the branch were about those early layers.
>
> I also recognize the fact that since you left quantization itself has definitely gone to the backburner, I'm very thankful to compilade for his efforts but yeah, not quite the same since
>
> I'm also surprised no one has come around and attempted to upstream some of your changes, several seem like just free performance gains, others are understandably more complex but there's certainly a few low hanging fruit that are just being ignored (and yes I recognize the irony of not doing it myself while complaining others aren't doing it)
>
> 👤 **ikawrakow** replied the **2025-03-30** at **17:03:32**:<br>
> The only reason I started this discussion was that you wrote above "... it would just be ripping off his work". And the point I was trying to make was that it would be perfectly fine to rip off their work as this is exactly what they did.
>
> 👤 **bartowski1182** replied the **2025-03-30** at **17:26:34**:<br>
> Oh I mean, fair haha. I guess I meant I don't want to strictly 1:1 copy his repo and release identical quants
>
> But you're definitely right that his work is basically just a bandage solution that happens to be the proper way to handle MoE models in general
>
> I do highly appreciate the insight though for the record, I don't mean to come off as argumentative or dismissive! I'll be looking into what you suggested for sure
>
> 👤 **bartowski1182** replied the **2025-03-30** at **19:24:25**:<br>
> @ikawrakow would you mind if I took inspiration from your changes to https://github.com/ikawrakow/ik_llama.cpp/blob/main/src/llama.cpp for some upstream work on llama_tensor_get_type? "inspiration" in this case would likely mean just straight up copying any changes that, to my untrained eye, seem strictly better and without risk of negatives (since I wouldn't discount the possibility some may be negative without other appropriate changes throughout the system)
>
> 👤 **ikawrakow** replied the **2025-03-31** at **06:01:25**:<br>
> Sure, go ahead. I see I haven't actually changed all occurrences of `n_expert == 8` to `n_expert >= 8`, so you may want to do a find/replace on all occurrences when making the change.
>
> Here people now use custom rules for making quants, so you may want to explore this as well. If you stick to quants available in mainline `llama.cpp`, you can "cook" the quants you publish with `ik_llama.cpp`.
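>
> To make the kind of change concrete, the MoE checks boil down to something like this (a paraphrased illustration, not the verbatim `llama_tensor_get_type` source):
>
> ```cpp
> // The heuristics were written with Mixtral-8x7B in mind, so the original
> // condition fired only for exactly 8 experts; DeepSeek-style models with
> // far more than 8 routed experts therefore never got the extra bits.
> static bool moe_gets_more_bits(int n_expert) {
>     // before: return n_expert == 8;
>     return n_expert >= 8;
> }
> ```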
>
> 👤 **bartowski1182** replied the **2025-04-01** at **23:20:00**:<br>
> @ubergarm I finished PPL of my original Q2_K upload and a new one I've added with changes from here and also just copying a bit of other work in the area
>
> llama.cpp main: 3.9012
>
> my fork: 3.6868
>
> considering the size only increased by 1%, i'm pretty stoked with that PPL improvement, and while yours is clearly still better, llama.cpp main is missing lots of ikawrakow's magic so it's not bad!
>
> 👤 **saood06** replied the **2025-04-02** at **00:19:01**:<br>
> > I finished PPL of my original Q2_K upload and a new one I've added with changes from here and also just copying a bit of other work in the area
> >
> > llama.cpp main: 3.9012
> >
> > my fork: 3.6868
> >
> > considering the size only increased by 1%, i'm pretty stoked with that PPL improvement, and while yours is clearly still better, llama.cpp main is missing lots of ikawrakow's magic so it's not bad!
>
> I'm not ubergarm, but thank you for this, I'm always curious to see PPL numbers and this is interesting.
>
> 👤 **ubergarm** replied the **2025-04-02** at **19:26:29**:<br>
> > @ubergarm I finished PPL of my original Q2_K upload and a new one I've added with changes from here and also just copying a bit of other work in the area
> >
> > llama.cpp main: 3.9012
> >
> > my fork: 3.6868
> >
> > considering the size only increased by 1%, i'm pretty stoked with that PPL improvement, and while yours is clearly still better, llama.cpp main is missing lots of ikawrakow's magic so it's not bad!
>
> Hey that is a nice drop in PPL for 1% size increase! Ohh sweet I see your [new Q2_K_L-V2](https://huggingface.co/bartowski/deepseek-ai_DeepSeek-V3-0324-GGUF#v2-uploads) variant! I wouldn't say mine is "better" given removing some weight in the GPU tensors possibly allows yours to run 64k context in under 24GB VRAM ([which mine only fits 32k](https://www.reddit.com/r/LocalLLaMA/comments/1joyl9t/comment/ml1lgob/)).
>
> Also interesting that [suddenly today mainline llama.cpp merged in `-ot` support!](https://github.com/ggml-org/llama.cpp/pull/11397). Curious what they will do with [MLA support](https://github.com/ggml-org/llama.cpp/pull/11446).
>
> Cheers!
>
> 👤 **bartowski1182** replied the **2025-04-03** at **03:10:18**:<br>
> Opened the PR here:
>
> https://github.com/ggml-org/llama.cpp/pull/12727
>
> that Q2_K_L-V2 will be replaced with a SLIIIIGHTLY better one probably tomorrow, but it's basically the same overall, just a few small bumps for another couple hundred mb
>
> 👤 **danielhanchen** replied the **2025-04-03** at **03:41:53**:<br>
> Oh hi! I didn't expect to be tagged - @bartowski1182 you're more than welcome to use the llama.cpp fork I have :)
>
@ikawrakow My apologies if people are misrepresenting that I "invented" dynamic quants, which is far from the truth. Appreciate the work you do, and keep it up - and ignore all the haters - your code is great!
>
> @ubergarm Great work on the quant as well! I was planning to do imatrix for all quants from now on, but I'm still trying to get the calibration dataset done specifically for instruct models - reasoning models are also a bit more complex.
>
> 👤 **danielhanchen** replied the **2025-04-03** at **03:45:49**:<br>
> It was actually pure coincidence on making the dynamic quants for DeepSeek R1, V3, since unfortunately as @ikawrakow mentioned, `llama.cpp` also quantizes the shared experts and dense layers the same as the rest of the model - my changes are at https://github.com/unslothai/llama.cpp/
>
> But the main motivation for "dynamic quants" was due to bitsandbytes and vLLM for finetuning, not actually llama.cpp as @bartowski1182 mentioned. For eg in Gemma 3, I did both activation and weight error analysis to see which parts to quantize / not quantize:
> ![image](https://github.com/user-attachments/assets/1586b89f-b985-47cb-88f1-26bb5b974087)
---
👤 **saood06** replied the **2025-04-11** at **03:06:19**:<br>
@danielhanchen
For Maverick you reported hitting this over-protectiveness issue in llama.cpp
![image](https://github.com/user-attachments/assets/46f8f974-0e6d-41fd-942b-3e9cbce4475c)
>We tried adding more uncommon languages to our calibration dataset, and tried using more tokens (1 million) vs Scout's 250K tokens for calibration
That issue has been addressed here in #202 but you may need to adjust it to allow 10% missing to get the blk.1 tensors as well (but block 45 is below 50% which seems very odd).

View File

@@ -0,0 +1,204 @@
### 🗣️ [#316](https://github.com/ikawrakow/ik_llama.cpp/discussions/316) - Mainline is now copying stuff from ik_llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2025-04-06 |
| **Updated** | 2025-04-29 |
---
#### Description
We have [this merged PR](https://github.com/ggml-org/ggml/pull/1174) and [this pending PR](https://github.com/ggml-org/ggml/pull/1179) in the [ggml repository](https://github.com/ggml-org/ggml) copying code from `ik_llama.cpp`. It is an interesting choice of venue. [ggml](https://github.com/ggml-org/ggml) is well known, but much lower profile than [llama.cpp](https://github.com/ggml-org/llama.cpp). We know that changes added to `ggml` quietly make their way into `llama.cpp` with "sync: ggml" PRs such as [this one](https://github.com/ggml-org/llama.cpp/pull/12670).
The merged PR went into `ggml` without attribution (other than the source being mentioned in the PR). The pending PR attributes the change to `<48489457+ikawrakow@users.noreply.github.com>`, so me, but me as one of the (currently) 335 [ggml authors](https://github.com/ggml-org/ggml/blob/master/AUTHORS). But I definitely did not write the code with the intent of contributing it to `ggml`, `llama.cpp`, or any of ggerganov's projects. Does that mean that since I once contributed to `llama.cpp`, the copyright on everything I produce from there on is jointly owned by the 335 `ggml` authors, or perhaps even by the (currently) 1106 [llama.cpp authors](https://github.com/ggml-org/llama.cpp/blob/master/AUTHORS)?
`ik_llama.cpp` is open source, and it uses the same MIT license as `ggml/llama.cpp`. The MIT license says
```
The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.
```
Hmm. The PRs are definitely not a copy of `ik_llama.cpp`, but are they a "substantial portion" of it? How is "substantial" being measured? By LOCs? By utility? By some other measure?
Let's take the [merged PR](https://github.com/ggml-org/ggml/pull/1174). It is just 50 LOC of trivial code. And yet, it does improve prompt processing of `bf16` models by a factor of 2 compared to [this PR](https://github.com/ggml-org/llama.cpp/pull/11093), which added CUDA `bf16` support to `llama.cpp`. The [pending PR](https://github.com/ggml-org/ggml/pull/1179) is just a 69 LOC change of only slightly less trivial code. And yet, it improves PP performance of MoE models with many experts such as DeepSeek-V3/R1/Lite by more than [this 2000+ LOC](https://github.com/ggml-org/llama.cpp/pull/11583) rework of the CUDA matrix multiplication kernels and flash attention implementation. Let's take a look at [this ik_llama.cpp PR](https://github.com/ikawrakow/ik_llama.cpp/pull/307) that has not been discovered yet. The relevant change that improves MoE PP performance is the rewrite of [this kernel](https://github.com/ikawrakow/ik_llama.cpp/blob/ec84855c6ae5a08686f3e5d8010e38064269deb3/ggml/src/ggml-metal.metal#L8541). It is just 60 LOC or so, but the performance gain is many times more than the grand total of all modifications made to the `ggml/llama.cpp` Metal backend since I left these projects in March of 2024.
So, again, is it utility or number of LOCs that define the copied code as "substantial portion" of the software it was copied from?
But, hey, IANAL, so it is maybe better to focus on the moral side of things. When I left the `llama.cpp` project, I expressed the wish that all of my contributions be removed. They didn't need to do it legally, but wouldn't it have been nice if they still did? ggerganov cited too much impact on downstream projects. Not on `llama.cpp` itself, but on downstream projects. Because, you know, downstream projects are too inept to add back k-quants, i-quants, and imatrix after their removal from upstream. In any case, it is known what happened, so it should be obvious to anyone that I don't want my work to be copied into ggerganov's projects. If they were nice, they would have re-implemented these changes - it is not rocket science. And if they were really nice, they would have acknowledged `ik_llama.cpp` for the inspiration. Or, if they didn't feel like re-implementing it, they would add my copyright notice, legally required or not, so we don't need to ponder at what point what they copied became a "substantial portion" of the work they are copying.
---
#### 🗣️ Discussion
👤 **CISC** replied the **2025-04-06** at **13:12:04**:<br>
Uh, I was not aware of any wish for your work to be removed, in fact, I made the PRs solely based on your comment here: https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828
I chose to submit these to `ggml` not for some nefarious reason, but simply because they were restricted to `ggml` code only.
---
👤 **CISC** replied the **2025-04-06** at **13:33:21**:<br>
> Hmm. The PRs are definitely not a copy of `ik_llama.cpp`, but are they a "substantial portion" of it? How is "substantial" being measured? By LOCs? By utility? By some other measure?
TBH I overlooked that you added yourself to the copyright notice, I looked at diffs only. It's simple to fix though, I can add it to any file that has your code merged into it.
> If they were nice, they would have re-implemented these changes - it is not rocket science. And if they were really nice, they would have acknowledged `ik_llama.cpp` for the inspiration. Or, if they didn't feel like re-implementing it, they would add my copyright notice, legally required or not, so we don't need to ponder at what point what they copied became a "substantial portion" of the work they are copying.
Please don't blame anyone else than me, I do not represent `ggml` nor `llama.cpp`, and I acted in good faith.
---
👤 **ikawrakow** replied the **2025-04-06** at **13:50:50**:<br>
@CISC
I'm sorry if this came across as a critique/attack on you. That was not the intent, and it has nothing to do with you. It is between ggerganov and me. Given the history, and there is 15 years of it even before `llama.cpp` came to be, I would have expected a different reaction from ggerganov to your PRs.
> 👤 **JohannesGaessler** replied the **2025-04-06** at **14:06:02**:<br>
> In the end I am the one who is responsible for reviewing and merging the PR in question. I had interpreted [this post](https://github.com/ikawrakow/ik_llama.cpp/discussions/256#discussioncomment-12496828) as permission to do so without preconditions. I'm sorry for acting against your wishes.
---
👤 **CISC** replied the **2025-04-06** at **14:08:38**:<br>
This puts me in a bind though, my intention was to upstream what I could (with the hardware I have available to test) as it seemed you were suggesting that this should be done (but not willing to do yourself).
You have made a great number of awesome contributions here, and I still wish for them to be merged into mainline, as it would improve it greatly, and it might make it simpler for you to rebase and get newer features from mainline as well. This should be a win-win.
---
👤 **ikawrakow** replied the **2025-04-06** at **14:37:07**:<br>
@CISC @JohannesGaessler As you both refer to what I wrote in #256, here it is:
> upstream is free to take from here whatever they find useful
Meaning there is nothing I can do to prevent that from happening as I'm publishing under a MIT license. I don't think I said that I do not expect upstream to abide by the terms of the license.
> 👤 **CISC** replied the **2025-04-06** at **14:38:40**:<br>
> > @CISC @JohannesGaessler As you both refer to what I wrote in #256, here it is:
> >
> > > upstream is free to take from here whatever they find useful
> >
> > Meaning there is nothing I can do to prevent that from happening as I'm publishing under a MIT license. I don't think I said that I do not expect upstream to abide by the terms of the license.
>
> I'm fixing my mistake right now, sorry about that.
---
👤 **ikawrakow** replied the **2025-04-07** at **06:30:56**:<br>
So, this is becoming interesting. Here is what @ggerganov has to say about my copyright notice being included in the file(s) where stuff was copied from my work:
> Including copyright notices is optional since the Berne convention - this was discussed last year: https://github.com/ggml-org/llama.cpp/discussions/6394.
>
> And again - we do provide the notices in the AUTHORS files. There is no need to sprinkle them inside the code.
The [discussion 6394](https://github.com/ggml-org/llama.cpp/discussions/6394) was about Intel engineers copy-pasting CUDA kernels that I wrote into the SYCL implementation and slapping their copyright notice on it (and, to add insult to injury, they were copy-pasting the code into wrong places, and refusing to accept PRs fixing it, which was the actual reason to start the discussion in the first place). The very knowledgeable conflict resolution expert with no legal education who came to resolve the conflict said that was OK, because according to the [Berne Convention](https://en.wikipedia.org/wiki/Berne_Convention) they couldn't take away the copyright from me by doing that (I wonder if software was covered in the original Berne Convention agreement of 1886? Just kidding). The copyright is collectively owned by the authors of the project, and their copyright is established by the AUTHORS file, so copyright notices do not need to be present in every file (but apparently it is OK for Intel to have their copyright notice in the file, without further copyright notices).
@ggerganov The work from which it is being copied is not work contributed to your project by me and therefore covered by my name being in the AUTHORS file of your work. Can you please point me to the text in the Berne Convention where it is established that if you copied my work into your work, it would be OK to ignore the terms of the license under which I published my work, and not include my copyright notice in your work as requested by the MIT license? If you don't like copyright notices "sprinkled inside the code", you have the option to reject the PRs or add my copyright notice to the copyright notice of your project. Oh, another option (if you trust your legal expertise) would be to accept the PRs as is, and then make your own PRs removing the copyright notices. In that way it would be you not being nice to a fellow open source developer with whom you want to "freely exchange ideas" (and possibly violating the terms of their license), not your contributor. I think asking a contributor to do that is going too far. But at the end of the day it is your project, so yes, you can ask your contributors to play by your rules.
---
👤 **JohannesGaessler** replied the **2025-04-07** at **07:59:15**:<br>
For the record: Do you find it acceptable for people to read your code and to then submit a PR to llama.cpp/ggml with the same functionality?
> 👤 **ikawrakow** replied the **2025-04-07** at **09:10:21**:<br>
> > For the record: Do you find it acceptable for people to read your code and to then submit a PR to llama.cpp/ggml with the same functionality?
>
> I addressed that above. But here it is again my perhaps wrong concept of how it should be:
> * If you copy my code, you need to add a copyright notice as requested by the MIT license.
> * If you reimplement what I have done here in your own way, you don't need to mention me or this repository. But if you were nice, you would still mention the original source/idea. Just like in many places in the ggml/llama.cpp code there are references to papers and/or other repositories.
>
> Now, also for the record, it isn't so that there aren't copyright notices in `ggml` "sprinkled around the code" as @ggerganov puts it. See for instance [this](https://github.com/ggml-org/ggml/blob/ab9ed73d40965d7e4b25a4adf2230b9a19bffbf9/src/ggml-cpu/ops.cpp#L4996) (and same notices in all other backends). I have this line in my fork as well in a completely [different place](https://github.com/ikawrakow/ik_llama.cpp/blob/a051f08b8f059fa10dd089d231b975291c122e9d/ggml/src/ggml.c#L16726), so it has been preserved over multiple code reorganizations (so, maintaining copyright notices in the source code as things are moved around is not quite as painful as claimed). You don't wonder why a Kawrakow copyright notice is so different from a Jeffrey Quesnelle and Bowen Peng copyright notice?
>
> 👤 **JohannesGaessler** replied the **2025-04-07** at **10:41:05**:<br>
> Thank you for your input. My perspective is that I don't have the ability to resolve a conflict between you and Georgi especially because I'm ignorant of your prior history. My previous policy was that I would simply not look at any of your code and that is what I will go back to.
>
> 👤 **bartowski1182** replied the **2025-04-13** at **15:47:29**:<br>
> As another outsider without a horse in this race (besides wanting everyone to benefit as much as possible from all the best work), I don't think a simple code comment referencing either the original PR from this repo, or, lacking the ability to find one simply, a quick mention of this repo, would detract much if anything from the overall code experience
>
> In fact, recently when making changes, I've seen code with a comment referencing a PR from other repos, or from llamacpp itself, and these help immensely for tracking down motivations and any potential discussions that went on at the time
>
> And yes you can git blame, but that becomes cumbersome if there's ever a single refactor
>
> My unrequested and uneducated 2c
---
👤 **ikawrakow** replied the **2025-04-07** at **11:07:50**:<br>
> My previous policy was that I would simply not look at any of your code and that is what I will go back to.
Yes, of course, as predicted.
---
👤 **jano403** replied the **2025-04-07** at **11:16:19**:<br>
A based thing to do would be to license your repository under AGPL3.0, solves all problems.
> 👤 **ikawrakow** replied the **2025-04-07** at **11:23:15**:<br>
> > A based thing to do would be to license your repository under AGPL3.0, solves all problems.
>
> Yes, I agree, it would have been better. But I didn't feel like juggling two different licenses, so just went with the original MIT license.
>
> On the other hand, the final outcome would not have been any different. Mainline will independently discover and implement the improvement I have made here without looking at my changes, not even once. I think this was made very clear by @JohannesGaessler's last comment.
>
> 👤 **jano403** replied the **2025-04-07** at **11:29:07**:<br>
> Never too late to change it if You ever feel like it.
> Btw, appreciate all the hard work You're doing for quants and speed improvements!
>
> 👤 **ikawrakow** replied the **2025-04-07** at **11:40:33**:<br>
> I would need to read up on what is the correct way of mixing MIT licensed code with (A)GPL licensed code. Or can you point me to a simple to follow set of instructions?
>
> 👤 **CISC** replied the **2025-04-07** at **12:00:19**:<br>
> I'm not sure what "problems" that is supposed to fix though? Was the license really the problem?
>
> 👤 **ikawrakow** replied the **2025-04-07** at **12:06:07**:<br>
> It would have avoided ggerganov talking about the Berne Convention and implying that no copyright notices are required, or putting contributors such as yourself into the difficult position of having to choose between doing the right thing or following his rules.
>
> 👤 **CISC** replied the **2025-04-07** at **12:15:28**:<br>
> It would have avoided me even considering upstreaming, that's all, the rest is unrelated fallout.
>
> 👤 **jano403** replied the **2025-04-07** at **12:34:09**:<br>
> > I would need to read up on what is the correct way of mixing MIT licensed code with (A)GPL licensed code. Or can you point me to a simple to follow set of instructions?
>
> I believe the MIT license is compatible with GPL/AGPL, take a look at https://github.com/LostRuins/koboldcpp for example. The original code would still be MIT licensed but the project as a whole, including Your modifications would be GPL/AGPL licensed.
> ![image](https://github.com/user-attachments/assets/58b0011f-6f53-4cfe-a57f-89101946b1b7)
>
> 👤 **jano403** replied the **2025-04-07** at **12:35:47**:<br>
> https://www.gnu.org/licenses/license-list.en.html#GPLCompatibleLicenses
> ![image](https://github.com/user-attachments/assets/8d7b887c-fd6d-48e6-a5b8-325110cf1ef5)
> ![image](https://github.com/user-attachments/assets/6ebd73b4-e7f6-4dbe-a75b-d29dc2d05d68)
>
> edit: As for copyright notices, You could simply add
> ```
> // Modifications made after <DATE> licensed under GPLv3/AGPLv3
> // AGPL/GPL license
> // SPDX-License-Identifier: AGPL/GPL
> //
> ```
> or similar when You make new changes.
>
> 👤 **ikawrakow** replied the **2025-04-07** at **12:48:51**:<br>
> > It would have avoided me even considering upstreaming, that's all, the rest is unrelated fallout.
>
> Well, also that. Which would have resulted in you having a much less interesting weekend 😄
---
👤 **ikawrakow** replied the **2025-04-07** at **11:24:52**:<br>
@CISC
I'm sorry you ended up in the middle of this. I hope this has not damaged your relation with, and your ability to contribute to, the `ggml` and `llama.cpp` projects.
> 👤 **CISC** replied the **2025-04-07** at **11:58:00**:<br>
> > I'm sorry you ended up in the middle of this. I hope this has not damaged your relation with, and your ability to contribute to, the `ggml` and `llama.cpp` projects.
>
> Let's just say this weekend was more interesting than I would have liked. :(

View File

@@ -0,0 +1,60 @@
### 🗣️ [#319](https://github.com/ikawrakow/ik_llama.cpp/discussions/319) - KTransformers copying ik_llama.cpp
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2025-04-08 |
| **Updated** | 2025-04-13 |
---
#### Description
[This PR](https://github.com/kvcache-ai/ktransformers/pull/754) is a direct copy from [this file](https://github.com/ikawrakow/ik_llama.cpp/blob/main/ggml/src/iqk/iqk_mul_mat.cpp) in `ik_llama.cpp`. It never acknowledges the source of the changes, and the KTransformers maintainers did not respond to [my comment](https://github.com/kvcache-ai/ktransformers/pull/754#issuecomment-2781515478) I left in the PR.
The PR is being sold as an `IQ1_S` implementation, but it copies not just the `IQ1_S` GEMM, but also ~1800 LOCs of additional stuff, including the `IQ2_XXS` implementation, the new implementation of any float type x any other float type GEMM, and a bunch of other optimizations I have done since my contributions to [llamafile](https://github.com/Mozilla-Ocho/llamafile) ([394](https://github.com/Mozilla-Ocho/llamafile/pull/394), [405](https://github.com/Mozilla-Ocho/llamafile/pull/405), [428](https://github.com/Mozilla-Ocho/llamafile/pull/428), [435](https://github.com/Mozilla-Ocho/llamafile/pull/435), [453](https://github.com/Mozilla-Ocho/llamafile/pull/453), and [464](https://github.com/Mozilla-Ocho/llamafile/pull/464)).
For those who don't know, KTransformers uses the quantized GEMM/GEMV implementation that I contributed to [llamafile](https://github.com/Mozilla-Ocho/llamafile). `llamafile` uses the Apache-2.0 license, so I contributed the code under that license. KTransformers have kept the [copyright notice](https://github.com/kvcache-ai/ktransformers/blob/f4ae7c85edd66d6acf3ef253eeaf0143eb3358ab/third_party/llamafile/iqk_mul_mat.inc#L3) in the file, but did not update it after merging PR 754, which contains a copy of MIT-licensed code.
KTransformers PR 754 is interesting anyway. Github user @godrosev entered issue #209 on February 19 asking for `IQ1_S` support in `llamafile`. There was already an implementation of the row-interleaved variant `IQ1_S_R4` in `ik_llama.cpp`, so I wasn't planning to also add support for `IQ1_S`, and suggested they use that instead. But after some back-and-forth, I decided to add `IQ1_S`, which I did in PR #212 on Feb 20. The KTransformers PR 754 is from March 3 and comes from Github user @moonshadow-25. There are 5 commits in the PR, and the first 2 come from @godrosev. @godrosev and @moonshadow-25 both have no Github activity other than the PR (and Issue #209).
So now the question is, what do I do about that. Opinions?
---
#### 🗣️ Discussion
👤 **moonshadow-25** replied the **2025-04-08** at **08:50:43**:<br>
Hi ikawrakow, I am not an official developer of KT; @godrosev is my colleague, and I am very sorry about this matter. After he gave me the code, I started the porting work without asking about its source, but I noticed that the author named in the file is also the author of that module in Llamafile, which is you. Afterwards, I completed all the porting work but did not modify any author information, because from the beginning KT kept mentioning that they used llamafile as the core optimization, and I only filled in the remaining functionality.
I have always felt that the CPU optimization in Llamafile is the best-done part. If I really wanted others not to know that you did it, I could have completely changed the variable or function names. However, I ported it fully, only modifying the necessary interface parts, because I still believe that the iqk part of Llamafile is your contribution!
---
👤 **ikawrakow** replied the **2025-04-08** at **09:29:53**:<br>
> and I am very sorry about this matter
Are you planning to correct it? The 1800 lines added in your PR are not a "port", but a direct copy of portions of the code here. It would be very nice if the actual origin was acknowledged by you and by the KT developers.
---
👤 **moonshadow-25** replied the **2025-04-08** at **10:06:25**:<br>
Yes, I have always believed that both the early content and the “ported” parts of Llamafile originated from your work. What I did was mostly porting and testing, so I never intended to modify your work (except for necessary interface adjustments). I think this is your contribution.
I hope we can have more communication in the future.
> 👤 **ikawrakow** replied the **2025-04-08** at **11:19:06**:<br>
> Sorry, @moonshadow-25, but there are no "ported” parts of Llamafile in your PR. There are 1800 lines of code copied from here. They do not exist in Llamafile to be "ported" (i.e., copied) from there.
>
> You have created a bit of a mess with your PR. KTransformers and Llamafile are both Apache-2.0 licensed. But the code here is published under an MIT License. Now, Apache-2.0 and MIT are both very permissive licenses, so it is easy to bundle code published under these licenses together, as explained for instance [here](https://infra.apache.org/licensing-howto.html). You could have even asked me if I would be willing to relicense the portions you copied to Apache-2.0 so it makes things easier for KTransformers (after all, I did change the MIT License of the code I contributed to Llamafile to Apache-2.0 to make it easier for them). But as permissive as these licenses are, it does not mean you can just ignore what they ask you to do.
>
> 👤 **moonshadow-25** replied the **2025-04-08** at **11:41:27**:<br>
> Indeed, I am very sorry that I only realized the difference now. They look very similar, and both have you as the author, so I subjectively assumed it was the same license.
> I must remedy this as soon as possible, and I hope to hear your advice.
---
👤 **ikawrakow** replied the **2025-04-13** at **15:56:21**:<br>
The KTransformers devs have now merged [this PR](https://github.com/kvcache-ai/ktransformers/pull/1116), which addresses the concern raised in this discussion => closing.

View File

@@ -0,0 +1,284 @@
### 🗣️ [#323](https://github.com/ikawrakow/ik_llama.cpp/discussions/323) - Is there an easy way to repack an existing GGUF so it could be used without --run-time-repack (thus enabling mmap)
| **Author** | `Lissanro` |
| :--- | :--- |
| **Created** | 2025-04-10 |
| **Updated** | 2025-05-21 |
---
#### Description
DeepSeek-V3-0324-GGUF-UD-Q4_K_XL works great for me when I load it using --run-time-repack; I get more than 7 tokens/s with an EPYC 7763, 1TB of 3200MHz RAM and 4x3090 GPUs. But this unfortunately disables mmap and requires a lot of compute on each reload - and if I need to switch models often for some tasks (for example, a separate model to process input images and describe them, then continue with DeepSeek V3), it slows things down.
So, what I am looking for: is it possible to repack DeepSeek-V3-0324-GGUF-UD-Q4_K_XL offline into a new GGUF which would work well with ik_llama.cpp and which I could load without --run-time-repack?
I know there are some existing quants made specifically for ik_llama.cpp, like https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF, but I noticed that DeepSeek-V3-0324-GGUF-IQ4_K_R4, for example, gives me 4-5 tokens/s at most, my guess because it is quantized very differently, even though it has about the same size. This also suggests that creating my own quant from scratch may be very difficult - not only would I have to download the full-size models for V3 and R1 (which would take weeks over the 4G connection I have), but I may also end up with a quant that does not perform as well as the original Unsloth quant, since I do not have any experience with creating GGUF quants. This is why I would prefer to find a way to repack an existing quant, rather than trying to create one from scratch, if that is possible.
In case it matters, here is the command I use to run the model (I specify only -ctk q8_0 because my understanding is that -ctv has no effect, since with the enabled optimizations the V cache is not actually used):
```
taskset -c 0-63 ~/pkgs/ik_llama.cpp/build/bin/llama-server \
--model ~/models/DeepSeek-V3-0324-GGUF-UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00009.gguf \
--ctx-size 81920 --n-gpu-layers 62 --tensor-split 25,25,25,25 \
-mla 2 -fa -ctk q8_0 -amb 2048 -fmoe -rtr \
--override-tensor "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
--threads 64 --host 0.0.0.0 --port 5000
```
This command utilizes about 20GB of VRAM on each 24GB GPU. The main issue is that I have yet to figure out a way to repack this GGUF so I could run without the -rtr option. I would appreciate any help with resolving this.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-04-10** at **15:31:47**:<br>
You can use
```
./bin/llama-quantize --repack --repack-pattern exps ~/models/DeepSeek-V3-0324-GGUF-UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00009.gguf repacked_model_file_name q4_k_r4
```
The command will not overwrite the existing model, so you need to have enough free disk space for both models.
In your command that starts the server, you can simplify to
```
--override-tensor exps=CPU
```
It is a regular expression, so it is equivalent to explicitly listing
```
--override-tensor "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU"
```
More generally, you can use `--repack-pattern` in the `llama-quantize` command by simply copying the regular expressions from the `--override-tensor` argument and removing the `=CPU` from it. So,
```
./bin/llama-quantize --repack --repack-pattern "ffn_down_exps,ffn_up_exps,gate_exps" etc.
```
is equivalent.
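After repacking, the server can then be started without `-rtr`, so mmap stays enabled. A sketch based on your original command (the model path is whatever output file name you gave `llama-quantize`):
```
taskset -c 0-63 ~/pkgs/ik_llama.cpp/build/bin/llama-server \
    --model repacked_model_file_name \
    --ctx-size 81920 --n-gpu-layers 62 --tensor-split 25,25,25,25 \
    -mla 2 -fa -ctk q8_0 -amb 2048 -fmoe \
    --override-tensor exps=CPU \
    --threads 64 --host 0.0.0.0 --port 5000
```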
> 👤 **ikawrakow** replied the **2025-04-10** at **15:36:25**:<br>
> I have never repacked (or quantized) a multi-part GGUF, so I don't know if `llama-quantize` does the right thing to load all parts. In case it does not, you may need to concatenate the parts into a single file
> ```
> cat file1 file2 ... fileN >>combined_file
> ```
>
> 👤 **saood06** replied the **2025-04-10** at **23:00:39**:<br>
> >In case it does not, you may need to concatenate the parts into a single file
> >
> > ```
> > cat file1 file2 ... fileN >>combined_file
> > ```
>
> Files split in the gguf-split way need to be merged via gguf-split.
---
👤 **ubergarm** replied the **2025-04-10** at **22:05:30**:<br>
> I noticed that DeepSeek-V3-0324-GGUF-IQ4_K_R4 for example gives me 4-5 tokens/s at most, my guess because it quantized very differently, even though it has about the same size.
A few thoughts here:
1. My quant was designed to be a bit heavy in the non-routed experts to give better quality output. You can trade-off some quality for extra speed by adding `-ser 6,1` as detailed in [PR#239](https://github.com/ikawrakow/ik_llama.cpp/pull/239).
2. My quant is designed to offload just over 17GiB of weights to VRAM plus context cache. However, it looks like you have 96 GB VRAM (4x GPUs?). Using `-ot exps=CPU` shouldn't fill up 20GB of VRAM on each of 4x cards (~80GB total)? Designing a quant specific to multi-GPU setups like yours is trickier, as you want to offload some of the routed `exps` layers, which need to be quantized in a way suited for GPU inferencing.
So yeah, like ik mentions, you will want to use `./bin/llama-quantize --repack --repack-pattern "ffn_down_exps,ffn_up_exps,gate_exps" etc.` and figure out ahead of time the size of the tensors/layers you want to offload onto GPU (and don't repack those), and only repack the remaining routed experts `exps` layers going into RAM for CPU inferencing. In other words the repacked `q4_k_r4` is for running on CPU RAM. Don't repack the tensors/layers you're running on GPU.
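If it helps to see what is actually in the file before deciding, something along these lines lists the routed experts (a rough sketch; it assumes the `gguf-dump` script that ships with llama.cpp's `gguf-py` package, and for a multi-part model it only shows the tensors stored in the part you point it at):
```
gguf-dump DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00009.gguf | grep -E "ffn_(up|gate|down)_exps"
```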
Haha hope I didn't confuse too much. This is indeed a more straight-forward way than rolling your own quant, which would have the same steps but more.
Cheers!
---
👤 **Lissanro** replied the **2025-04-11** at **10:49:26**:<br>
@ikawrakow
Thank you, I was able to convert it based on the suggested command, but the issue is that the performance of the converted quant is very low, so I cannot really use it yet. I would appreciate any help to figure out how to convert it the same way the -rtr option does, but to a file permanently, so I can use mmap and load without the -rtr option.
With the original Unsloth quant and the -rtr option, I get more than 7 tokens/s, while with the converted quant without the -rtr option, I get 4-5 tokens/s. Maybe it converted some tensors to more compute-intensive equivalents? Perhaps there are other options I am missing?
The command I used was:
```
~/pkgs/ik_llama.cpp/build/bin/llama-quantize --repack --repack-pattern exps ~/models/DeepSeek-V3-0324-GGUF-UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00009.gguf /tmp/DeepSeek-V3-0324-GGUF-UD-Q4_K_R4.gguf q4_k_r4
main: build = 3630 (5f44f4b3)
main: built with cc (Ubuntu 13.3.0-6ubuntu2~24.04) 13.3.0 for x86_64-linux-gnu
main: quantizing '/home/lissanro/pkgs/text-generation-webui/models/DeepSeek-V3-0324-GGUF-UD-Q4_K_XL/DeepSeek-V3-0324-UD-Q4_K_XL-00001-of-00009.gguf' to '/mnt/secondary/tmp/DeepSeek-V3-0324-GGUF-UD-Q4_K_R4.gguf' as Q4_K_R4
llama_model_loader: additional 8 GGUFs metadata loaded.
...
```
Here is full conversion log which includes all the output during the conversion:
https://pastebin.com/P7QEQsKy
Three runs using the original Unsloth quant with -rtr option (timings line only for each run):
```
INFO [ print_timings] generation eval time = 31669.99 ms / 230 runs ( 137.70 ms per token, 7.26 tokens per second) | tid="128283826724864" timestamp=1744362775 id_slot=0 id_task=0 t_token_generation=31669.991 n_decoded=230 t_token=137.69561304347826 n_tokens_second=7.262395496102289
INFO [ print_timings] generation eval time = 37422.90 ms / 273 runs ( 137.08 ms per token, 7.29 tokens per second) | tid="128283826724864" timestamp=1744362939 id_slot=0 id_task=232 t_token_generation=37422.898 n_decoded=273 t_token=137.08021245421247 n_tokens_second=7.2949989068190275
INFO [ print_timings] generation eval time = 39311.07 ms / 297 runs ( 132.36 ms per token, 7.56 tokens per second) | tid="128283826724864" timestamp=1744364349 id_slot=0 id_task=507 t_token_generation=39311.072 n_decoded=297 t_token=132.36051178451177 n_tokens_second=7.555123401366415
```
Three runs using the same prompt with the converted quant (without the -rtr option):
```
INFO [ print_timings] generation eval time = 67077.44 ms / 287 runs ( 233.72 ms per token, 4.28 tokens per second) | tid="140159021387776" timestamp=1744366116 id_slot=0 id_task=0 t_token_generation=67077.444 n_decoded=287 t_token=233.71931707317074 n_tokens_second=4.278636496644088
INFO [ print_timings] generation eval time = 67416.24 ms / 342 runs ( 197.12 ms per token, 5.07 tokens per second) | tid="140159021387776" timestamp=1744366192 id_slot=0 id_task=289 t_token_generation=67416.242 n_decoded=342 t_token=197.12351461988303 n_tokens_second=5.072961497913218
INFO [ print_timings] generation eval time = 76603.74 ms / 303 runs ( 252.82 ms per token, 3.96 tokens per second) | tid="140159021387776" timestamp=1744366731 id_slot=0 id_task=633 t_token_generation=76603.741 n_decoded=303 t_token=252.81762706270626 n_tokens_second=3.955420401726856
```
---
👤 **Lissanro** replied the **2025-04-11** at **10:52:18**:<br>
@saood06
It seems my own quant converted from the Unsloth one also loses a lot of performance, so it may not be something specific to your quant. I am not sure what the issue is yet. It is worth mentioning that my EPYC 7763 64-core CPU is under full load during inference with either quant, so my guess is that something in the converted quants hits a CPU bottleneck which is not present when using the Unsloth quant with the -rtr option.
As for VRAM usage, I think it depends on context length. To be more precise, with 80K context I get around 19 gigabytes of VRAM utilization on each GPU, so around 76-80 GB of VRAM usage in total. If I try to increase the context size too much, I get CUDA OOM errors, confirming it is using VRAM for context.
Maybe I could put some additional ffn_down_exps, ffn_up_exps or ffn_gate_exps on each GPU, but I am not sure yet which of them is most beneficial to put in VRAM. I already experimented with blk.3.ffn_gate_exps=CUDA0, ... and so on, but since I cannot fit many of them with the little VRAM I have free, I did not notice a difference in performance. I did not try the non-gate ones yet.
With my workflow, which involves loading a 72B vision model in VRAM, processing images, then loading V3, not being able to get mmap working with good performance is the biggest bottleneck at the moment. I am still trying to figure out if there are options I could use to achieve the same kind of conversion the -rtr option does, i.e. create a new GGUF that would perform the same but would not require -rtr anymore.
---
👤 **ikawrakow** replied the **2025-04-11** at **10:58:58**:<br>
The offline repacking command should produce a result that is 100% equivalent to what happens with online repacking.
But the two runs will not be equivalent as memory will be allocated and assigned to tensors in a different way. I have seen performance differences between offline and online repacking on my hardware, but never as large as you are reporting.
Can you try dropping caches before using the offline repacked model?
```
echo 3 | sudo tee /proc/sys/vm/drop_caches
```
---
👤 **ikawrakow** replied the **2025-04-11** at **11:10:46**:<br>
> Maybe I could put some additional ffn_down_exps, ffn_up_exps or ffn_gate_exps on each GPU, but not sure which of them is more beneficial to put in VRAM yet. I already experimented with blk.3.ffn_gate_exps=CUDA0, ... and so on, but since I cannot put too many of them due to having not that much VRAM free, I did not notice difference in performance. I did not try with non-gate ones yet.
If you have spare VRAM, the best strategy is to put the `ffn_up_exps` and `ffn_gate_exps` of a given number of layers in VRAM (how many layers depends on how much VRAM you have left and how big the tensors are). This brings more benefit than putting just one of the experts tensors or all 3 of the experts tensors, especially when you are using `-fmoe`. I'm currently running some experiments with LlaMA-4-Scout on my low-end hardware (Ryzen-5975WX + RTX 4080), and I use
```
-ot "blk\.[0-9]\.ffn_up_exps=CUDA0,blk\.[0-9]\.ffn_gate_exps=CUDA0,blk\.1[0-9]\.ffn_up_exps=CUDA0,blk\.1[0-9]\.ffn_gate_exps=CUDA0,exps=CPU" -ngl 100
```
to have all attention and shared experts tensors plus the first 20 layers of `ffn_up_exps` and `ffn_gate_exps` on the GPU, with all remaining experts on the CPU.
---
👤 **Lissanro** replied the **2025-04-11** at **11:35:48**:<br>
First, I loaded the repacked model with the -rtr option - obviously this should be unnecessary, but I was curious whether it makes a difference, and to my surprise it did: I got good performance again (full log: https://pastebin.com/5d6R2GDG):
```
INFO [ print_timings] generation eval time = 46791.42 ms / 341 runs ( 137.22 ms per token, 7.29 tokens per second) | tid="127320811921408" timestamp=1744369176 id_slot=0 id_task=0 t_token_generation=46791.423 n_decoded=341 t_token=137.2182492668622 n_tokens_second=7.287660390238612
INFO [ print_timings] generation eval time = 36683.23 ms / 274 runs ( 133.88 ms per token, 7.47 tokens per second) | tid="127320811921408" timestamp=1744369220 id_slot=0 id_task=343 t_token_generation=36683.233 n_decoded=274 t_token=133.88041240875913 n_tokens_second=7.469352551341372
```
Then I ran `echo 3 | sudo tee /proc/sys/vm/drop_caches`, which left me with 704 GB of memory free of cache. I also have no swap file and my system has 1TB of RAM in total, so plenty of memory for a 378GB quant (the size of the converted quant). After it fully loaded, I still had 322GB of completely free memory. But the performance became quite bad (from almost 7.5 tokens/s down to less than 4 tokens/s; full log: https://pastebin.com/K4PYP52t):
```
INFO [ print_timings] generation eval time = 75071.14 ms / 270 runs ( 278.04 ms per token, 3.60 tokens per second) | tid="140708181868544" timestamp=1744369869 id_slot=0 id_task=0 t_token_generation=75071.144 n_decoded=270 t_token=278.04127407407407 n_tokens_second=3.5965883242701087
INFO [ print_timings] generation eval time = 73892.48 ms / 268 runs ( 275.72 ms per token, 3.63 tokens per second) | tid="140708181868544" timestamp=1744369983 id_slot=0 id_task=272 t_token_generation=73892.479 n_decoded=268 t_token=275.7182052238806 n_tokens_second=3.626891445880439
```
I tried adding --mlock, but the performance did not improve much (I was still getting at most 4-5 tokens/s no matter how many times I tried).
Since the -rtr option disables mmap, I decided to disable mmap explicitly with --no-mmap and run without -rtr, to see if it is mmap that ruins the performance:
```
INFO [ print_timings] generation eval time = 42764.35 ms / 314 runs ( 136.19 ms per token, 7.34 tokens per second) | tid="129645145907200" timestamp=1744370957 id_slot=0 id_task=0 t_token_generation=42764.346 n_decoded=314 t_token=136.19218471337578 n_tokens_second=7.342565229455397
```
...and with the repacked quant and the --no-mmap option, performance was back to normal. So it seems it is something about mmap that drastically reduces performance; nothing wrong with the quant file then. Very strange. In theory, I would expect the performance to be about the same, since either way the same memory is used and I have plenty of it free.
Please let me know if there are some kind of performance profiling or additional logging I could do on my side.
As for putting more ffn_up_exps and ffn_gate_exps on the GPUs, I will try that with as many layers as I can fit; thank you very much for the suggestion.
> 👤 **ubergarm** replied the **2025-04-11** at **14:20:23**:<br>
> @Lissanro
>
> > --no-mmap option, performance was back to normal. So, it seems something about mmap that drastically reduces performance. Nothing wrong with the quant file then.
>
> If you are benchmarking while using mmap, you have to throw away the first full run results typically as the benchmarks start running before the model is loaded into page cache. You can check by watching your disk i/o and `cached` inside of `btop`. You will notice with mmap disabled, it takes longer to start up and finish allocating the entire model into RAM. When using mmap, it starts much quicker but runs slower in the beginning. This is normal expected behavior for all inference engines I've used.
>
> Also, depending on how your system is configured, when not using mmap() you may be taking advantage of transparent huge pages automatically under the hood. You can check that with `numastat -m -p $(pidof llama-server)` or llama-bench etc... How this affects performance seems to be system dependent.
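> For a quick look without `btop`, something like this works too (a sketch: `Cached` grows as the mmap'ed weights get pulled into page cache, while `AnonHugePages` shows transparent huge page usage):
> ```
> watch -n2 "grep -E '^(Cached|AnonHugePages)' /proc/meminfo"
> ```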
>
> Keep us posted once you come up with a multi-gpu command line to override `ffn_up_exps` and `ffn_gate_exps` tensors onto each GPU as ik mentions above. I wanted to document that somewhere to help others as many of the questions I see are how to use more VRAM correctly when using `-ot`.
>
> Thanks!
>
> 👤 **ubergarm** replied the **2025-04-11** at **19:08:55**:<br>
> @Lissanro
>
> Also, using the above examples I'm slowly learning how to better use `-ot` myself. I have a few examples now on [discussion #258](https://github.com/ikawrakow/ik_llama.cpp/discussions/258#discussioncomment-12807746) which you could use to target `CUDA0` `CUDA1` etc to craft the best command for your rig.
---
👤 **Lissanro** replied the **2025-04-13** at **03:57:01**:<br>
I was able to achieve similar speed with mmap after resetting my BIOS and changing only the absolutely necessary settings. Before that, no matter what I did, it ran at 30%-50% reduced speed. I am not sure exactly which setting was messing up the results; maybe the performance tuning settings for memory throughput.
But all is good now. This is my current performance with mmap enabled using the repacked quant (with around 2.5K tokens filled in the context window):
```
INFO [ print_timings] generation eval time = 1400.35 ms / 11 runs ( 127.30 ms per token, 7.86 tokens per second) | tid="124902137237504" timestamp=1744499973 id_slot=0 id_task=835 t_token_generation=1400.348 n_decoded=11 t_token=127.30436363636363 n_tokens_second=7.85519028127294
```
With 32K of context filled, I get lower but still good performance:
```
INFO [ print_timings] generation eval time = 76081.15 ms / 387 runs ( 196.59 ms per token, 5.09 tokens per second) | tid="132320194224128" timestamp=1744494220 id_slot=0 id_task=2362 t_token_generation=76081.154 n_decoded=387 t_token=196.5921291989664 n_tokens_second=5.086673632736959
```
I did not save exact stats for a 64K+ context fill, but it was slightly above 3 tokens/s for output. Input processing was generally within the 50-80 tokens/s range. Reloading the model with mmap enabled takes about 45 seconds, which is great.
My final command to repack R1 and V3 was like this:
```
~/pkgs/ik_llama.cpp/build/bin/llama-quantize --repack \
--repack-pattern "(^blk\.[7-9]|\d\d).ffn_(up|gate)_exps|ffn_down_exps" \
/mnt/secondary/neuro/DeepSeek-R1-GGUF_Q4_K_M-163840seq/DeepSeek-R1-Q4_K_M-00001-of-00011.gguf \
/home/lissanro/neuro/DeepSeek-R1-GGUF_Q4_K_M-163840seq/DeepSeek-R1-GGUF_Q4_K_M_R4.gguf \
q4_k_r4
```
The pattern for llama-quantize is crafted in a way that avoids repacking the tensors I intend to use on the GPUs. This is the command I use to run it:
```
taskset -c 0-63 ~/pkgs/ik_llama.cpp/build/bin/llama-server \
--model /home/lissanro/neuro/DeepSeek-R1-GGUF_Q4_K_M-163840seq/DeepSeek-R1-GGUF_Q4_K_M_R4.gguf \
--ctx-size 73728 --n-gpu-layers 62 --tensor-split 25,25,25,25 -mla 2 -fa -ctk q8_0 -amb 1024 -fmoe \
-ot "blk\.3\.ffn_up_exps=CUDA0, blk\.3\.ffn_gate_exps=CUDA0" \
-ot "blk\.4\.ffn_up_exps=CUDA1, blk\.4\.ffn_gate_exps=CUDA1" \
-ot "blk\.5\.ffn_up_exps=CUDA2, blk\.5\.ffn_gate_exps=CUDA2" \
-ot "blk\.6\.ffn_up_exps=CUDA3, blk\.6\.ffn_gate_exps=CUDA3" \
-ot "ffn_down_exps=CPU, ffn_up_exps=CPU, gate_exps=CPU" \
--threads 64 --host 0.0.0.0 --port 5000
```
I also noticed that I need to specify the CPU overrides last rather than first for the CUDA overrides to have an effect. I used multiple -ot arguments since a single one does not accept a multi-line format, and with many -ot I can use multiple lines in my script for better readability. Putting ffn_up_exps and ffn_gate_exps from blocks 3-6 on my GPUs (one pair per GPU) is all that I could fit; I even had to reduce the context length to 72K (73728).
Thank you so very much, @ikawrakow and @ubergarm , for helping me to figure this out!
---
👤 **Ph0rk0z** replied the **2025-05-17** at **18:57:32**:<br>
So to repack, I do the inverse of my CUDA regex? Can the quant type also be converted, or does it just become the same type with _R4? mmap or not, the entire model gets cached on my system, at least at Qwen 235B sizes.
---
👤 **Lissanro** replied the **2025-05-21** at **05:27:22**:<br>
@Ph0rk0z
You need to craft the regex so that R4 repacking covers all tensors you plan to keep on the CPU but does not touch the tensors you plan to run on the GPU (GPU tensors need to stay non-R4). You can refer to the regexes in my previous message to see how the repack regex differs.
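For example, if blocks 0-4 were the ones kept on the GPUs, a pattern along these lines would repack only the rest (a rough sketch with made-up block numbers and a Q4_K source, as in the commands above; adjust both to your own setup):
```
~/pkgs/ik_llama.cpp/build/bin/llama-quantize --repack \
    --repack-pattern "blk\.([5-9]|[1-9][0-9])\.ffn_(up|gate|down)_exps" \
    input.gguf output.gguf q4_k_r4
```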
> 👤 **Ph0rk0z** replied the **2025-05-21** at **11:25:07**:<br>
> Yeah, I assume it's just a matter of seeing which layers are on the GPU and then excluding them. So if you pick 1,2,3,4, you make a not-1,2,3,4 regex. Funny enough, we have AI for this. But I have IQ4_XS, so what does that become? IQ4_XS_R4? Or can it repack to something else?
>
> 👤 **ikawrakow** replied the **2025-05-21** at **11:29:29**:<br>
> > Or can it repack to something else?
>
> No. The repacking is only to the corresponding row-interleaved type. Repacking to something else would result in quality loss.

View File

@@ -0,0 +1,74 @@
### 🗣️ [#350](https://github.com/ikawrakow/ik_llama.cpp/discussions/350) - Maverick slow prompt with gpu
| **Author** | `justinjja` |
| :--- | :--- |
| **Created** | 2025-04-27 |
| **Updated** | 2025-04-27 |
---
#### Description
Any idea what the deal is with prompt speeds on Maverick?
One 3090 and a 56-core DDR4 EPYC - Q4.5 quant - ~3500 token prompt:
Prompt 6.24 T/s
Generation 31.7 T/s
Same but with the GPU disabled:
Prompt 95 T/s
Generation 5.6 T/s
Is it possible to leave prompt processing on the CPU and still use the GPU for generation?
---
#### 🗣️ Discussion
👤 **saood06** replied the **2025-04-27** at **04:22:52**:<br>
Do you mind providing the exact commands used to get those numbers (and any details about the quant used)?
---
👤 **ikawrakow** replied the **2025-04-27** at **06:45:38**:<br>
Please tell us your command line parameters.
I cannot run Maverick, but here is how I run Scout on a 32-core Ryzen-5975WX with a 16 GB RTX-4080:
```
./bin/llama-sweep-bench -m $model -t 32 -ngl 100 -ot "blk\.[0-9]\.ffn_up=CUDA0,blk\.[0-9]\.ffn_gate=CUDA0,exps=CPU" -rtr -fa -fmoe -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048
```
where `$model` roughly corresponds in size to Unsloth's UD-Q2_K-XL (~40 GiB). And here is what I get in terms of performance as measured by `llama-sweep-bench`
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 5.798 | 353.23 | 24.503 | 20.90 |
| 2048 | 512 | 2048 | 5.779 | 354.36 | 25.474 | 20.10 |
| 2048 | 512 | 4096 | 5.868 | 349.04 | 26.436 | 19.37 |
| 2048 | 512 | 6144 | 5.958 | 343.76 | 27.480 | 18.63 |
| 2048 | 512 | 8192 | 6.041 | 339.04 | 28.457 | 17.99 |
| 2048 | 512 | 10240 | 6.121 | 334.60 | 29.508 | 17.35 |
| 2048 | 512 | 12288 | 6.206 | 329.99 | 30.540 | 16.76 |
| 2048 | 512 | 14336 | 6.297 | 325.25 | 31.513 | 16.25 |
The above command puts all attention tensors, shared experts, and the first 10 layers of the `ffn_up_exps` and `ffn_gate_exps` tensors on the GPU; all remaining experts stay on the CPU. With 16k context, this requires about 14 GiB of VRAM. You can use something similar, adapting to the 24 GiB of VRAM you have and the different size of the Maverick model.
---
👤 **justinjja** replied the **2025-04-27** at **16:51:53**:<br>
Nice, thank you!
My command must have been bad.
Your command 5x'ed my prompt speed.
And upgrading my pcie from Gen3x4 to Gen4x16 got me another 4x on top of that.
I'm running Unsloth's 4.5-bit dynamic GGUF.
On my original test I'm now able to get:
128 prompt and 34 gen
New command:
CUDA_VISIBLE_DEVICES=0 ./llama-server -m mav.gguf -t 32 --n-gpu-layers 100 -ot "blk\.[0-1]\.ffn_up=CUDA0,blk\.[0-1]\.ffn_gate=CUDA0,exps=CPU" -fa -ctk q8_0 -ctv q8_0 -c 16384 -ub 2048 --host 0.0.0.0 --port 8000

View File

@@ -0,0 +1,349 @@
### 🗣️ [#354](https://github.com/ikawrakow/ik_llama.cpp/discussions/354) - Not all MLAs are born equal
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2025-04-29 |
| **Updated** | 2025-05-13 |
---
#### Description
## Intro
After several attempts, they have added MLA for DeepSeek models in mainline `llama.cpp` via [this PR](https://github.com/ggml-org/llama.cpp/pull/12801), and I was curious to see how it performs. They have of course made it maximally painful - one needs to re-download and re-convert the model to be able to take advantage of the MLA feature. Fortunately for me, on my hardware I can only run DeepSeek-Lite, i.e., a 32 GB download, so not too bad (but in comparison, `ik_llama.cpp` allows usage of MLA with an original DeepSeek GGUF as the tensors necessary for MLA get created on-the-fly). Anyway, I'm on a 300 Mb/s connection, so 15 minutes later I'm up and running.
What is the TL;DR? As the title already said - not all MLAs are born equal.
## Setup
I'll be using a `Q4_0` quantized DeepSeek-Lite model for all comparisons. `Q4_0` is the fastest quantization type in mainline due to the extraordinary amount of attention it receives. GPU performance measurements are done on an RTX-4080 GPU. CPU performance is measured on a Ryzen-7950X CPU (and the RTX-4080 is in the Ryzen-7950X rig).
## CUDA performance
I was most curious about CUDA performance. Why? Because [in this PR](https://github.com/ggml-org/llama.cpp/pull/13014) @JohannesGaessler has completely independently, without [ever looking at ik_llama.cpp](https://github.com/ikawrakow/ik_llama.cpp/pull/283/files/7f6980fa5166d029ad04cef395d2993ddc8da307#r2029830357), discovered [this optimization](https://github.com/ikawrakow/ik_llama.cpp/pull/248) in `ik_llama.cpp`, so I wanted to know how the two implementations compare. Mainline does not support Flash Attention (FA) for DeepSeek on CUDA (due to K- and V-head sizes being different). `ik_llama.cpp` uses FlashMLA-2.
This graph shows CUDA TG performance as a function of `N_KV`, the number of tokens in the KV cache. For `N_KV = 0`, mainline is now about 15% faster than `ik_llama.cpp`. This can be due to the fact that @JohannesGaessler is a much better GPU programmer than I am and has therefore achieved a more optimized implementation. However, looking at the comments and performance measurements in [the PR](https://github.com/ggml-org/llama.cpp/pull/13014), a more likely explanation is the enabling of CUDA graphs for TG with MoE models in [this PR](https://github.com/ggml-org/llama.cpp/pull/12970) (CUDA graphs are disabled in `ik_llama.cpp` for MoE models). But as soon as there are some tokens in the KV cache (the normal use case scenario), `ik_llama.cpp` becomes faster. The performance gap grows with increasing KV cache size and reaches 1.8X at 32k tokens.
![dsl2_cuda_tg](https://github.com/user-attachments/assets/49af1fbc-4cad-4929-9147-5faf18aa65ce)
The next graph compares CUDA PP performance as a function of `N_KV` for a `u_batch` size of 1024 tokens. The performance optimizations in `ik_llama.cpp` have not been independently discovered yet, so here the performance gap is 1.85X for small `N_KV`, increasing to 2.5X at 32k tokens.
![dsl2_cuda_pp](https://github.com/user-attachments/assets/5ceffcaa-c2dc-4e9a-8833-9405d5c34a00)
<details>
<summary> llama.cpp CUDA performance data</summary>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 0.316 | 3243.40 | 1.216 | 210.47 |
| 1024 | 256 | 1024 | 0.270 | 3798.75 | 1.651 | 155.05 |
| 1024 | 256 | 2048 | 0.296 | 3464.06 | 1.843 | 138.94 |
| 1024 | 256 | 3072 | 0.325 | 3150.91 | 2.050 | 124.88 |
| 1024 | 256 | 4096 | 0.356 | 2877.39 | 2.231 | 114.76 |
| 1024 | 256 | 5120 | 0.389 | 2630.72 | 2.444 | 104.75 |
| 1024 | 256 | 6144 | 0.417 | 2457.48 | 2.641 | 96.93 |
| 1024 | 256 | 7168 | 0.449 | 2278.58 | 2.850 | 89.84 |
| 1024 | 256 | 8192 | 0.489 | 2096.06 | 3.063 | 83.59 |
| 1024 | 256 | 9216 | 0.531 | 1927.90 | 3.272 | 78.23 |
| 1024 | 256 | 10240 | 0.553 | 1852.72 | 3.498 | 73.18 |
| 1024 | 256 | 11264 | 0.593 | 1725.85 | 3.703 | 69.13 |
| 1024 | 256 | 12288 | 0.614 | 1667.04 | 3.930 | 65.14 |
| 1024 | 256 | 13312 | 0.635 | 1611.74 | 4.145 | 61.76 |
| 1024 | 256 | 14336 | 0.678 | 1509.69 | 4.372 | 58.55 |
| 1024 | 256 | 15360 | 0.696 | 1470.41 | 4.586 | 55.83 |
| 1024 | 256 | 16384 | 0.740 | 1382.99 | 4.807 | 53.26 |
| 1024 | 256 | 17408 | 0.762 | 1343.59 | 5.029 | 50.91 |
| 1024 | 256 | 18432 | 0.787 | 1301.07 | 5.242 | 48.83 |
| 1024 | 256 | 19456 | 0.823 | 1244.17 | 5.463 | 46.86 |
| 1024 | 256 | 20480 | 0.846 | 1210.20 | 5.669 | 45.16 |
| 1024 | 256 | 21504 | 0.892 | 1148.57 | 5.911 | 43.31 |
| 1024 | 256 | 22528 | 0.915 | 1119.55 | 6.113 | 41.88 |
| 1024 | 256 | 23552 | 0.955 | 1071.99 | 6.345 | 40.35 |
| 1024 | 256 | 24576 | 0.979 | 1045.94 | 6.538 | 39.15 |
| 1024 | 256 | 25600 | 1.002 | 1021.85 | 6.779 | 37.76 |
| 1024 | 256 | 26624 | 1.045 | 980.14 | 6.967 | 36.74 |
| 1024 | 256 | 27648 | 1.065 | 961.08 | 7.211 | 35.50 |
| 1024 | 256 | 28672 | 1.105 | 926.56 | 7.398 | 34.60 |
| 1024 | 256 | 29696 | 1.132 | 904.44 | 7.654 | 33.45 |
| 1024 | 256 | 30720 | 1.167 | 877.39 | 7.846 | 32.63 |
| 1024 | 256 | 31744 | 1.185 | 864.19 | 8.107 | 31.58 |
</details>
<details>
<summary> ik_llama.cpp CUDA performance data</summary>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 0.152 | 6756.76 | 1.411 | 181.44 |
| 1024 | 256 | 1024 | 0.146 | 7030.26 | 1.500 | 170.61 |
| 1024 | 256 | 2048 | 0.153 | 6676.49 | 1.600 | 160.02 |
| 1024 | 256 | 3072 | 0.166 | 6175.71 | 1.666 | 153.67 |
| 1024 | 256 | 4096 | 0.178 | 5762.29 | 1.776 | 144.18 |
| 1024 | 256 | 5120 | 0.188 | 5444.81 | 1.873 | 136.67 |
| 1024 | 256 | 6144 | 0.197 | 5202.70 | 1.959 | 130.66 |
| 1024 | 256 | 7168 | 0.206 | 4962.35 | 2.063 | 124.09 |
| 1024 | 256 | 8192 | 0.218 | 4696.99 | 2.136 | 119.83 |
| 1024 | 256 | 9216 | 0.229 | 4468.32 | 2.251 | 113.72 |
| 1024 | 256 | 10240 | 0.241 | 4240.46 | 2.344 | 109.20 |
| 1024 | 256 | 11264 | 0.254 | 4036.79 | 2.426 | 105.54 |
| 1024 | 256 | 12288 | 0.265 | 3861.63 | 2.518 | 101.68 |
| 1024 | 256 | 13312 | 0.276 | 3704.23 | 2.610 | 98.09 |
| 1024 | 256 | 14336 | 0.289 | 3547.76 | 2.718 | 94.19 |
| 1024 | 256 | 15360 | 0.299 | 3419.88 | 2.796 | 91.55 |
| 1024 | 256 | 16384 | 0.310 | 3305.62 | 2.897 | 88.38 |
| 1024 | 256 | 17408 | 0.321 | 3189.96 | 2.976 | 86.02 |
| 1024 | 256 | 18432 | 0.332 | 3084.30 | 3.075 | 83.24 |
| 1024 | 256 | 19456 | 0.342 | 2993.22 | 3.179 | 80.53 |
| 1024 | 256 | 20480 | 0.352 | 2908.33 | 3.273 | 78.22 |
| 1024 | 256 | 21504 | 0.363 | 2823.02 | 3.360 | 76.19 |
| 1024 | 256 | 22528 | 0.373 | 2744.26 | 3.455 | 74.09 |
| 1024 | 256 | 23552 | 0.384 | 2665.50 | 3.543 | 72.26 |
| 1024 | 256 | 24576 | 0.395 | 2590.50 | 3.664 | 69.88 |
| 1024 | 256 | 25600 | 0.408 | 2506.74 | 3.768 | 67.94 |
| 1024 | 256 | 26624 | 0.419 | 2446.47 | 3.884 | 65.90 |
| 1024 | 256 | 27648 | 0.429 | 2384.76 | 4.016 | 63.74 |
| 1024 | 256 | 28672 | 0.439 | 2331.18 | 4.171 | 61.38 |
| 1024 | 256 | 29696 | 0.452 | 2264.41 | 4.282 | 59.78 |
| 1024 | 256 | 30720 | 0.462 | 2214.40 | 4.441 | 57.65 |
| 1024 | 256 | 31744 | 0.472 | 2168.74 | 4.562 | 56.11 |
</details>
Perhaps also of interest is the extra VRAM required. For DeepSeek-Lite at 32k tokens, the mainline KV cache size is 1836 MiB, along with a CUDA compute buffer size of 2280 MiB, for a total of 4116 MiB. In comparison, `ik_llama.cpp` uses 972 MiB of K-cache (there is no V-cache required as it gets computed from the K-cache at the expense of some performance reduction) plus 936 MiB of CUDA compute buffer, for a total of 1908 MiB, i.e. 2.15X less.
## CPU performance
Mainline does support FA on the CPU, but performance is quite bad, so I'm including mainline results with and without FA enabled. When FA is enabled, the KV cache is quantized with `Q8_0`. `ik_llama.cpp` calculations are with FlashMLA-3, which is the best option for CPU inference.
The following graph shows CPU TG performance as a function of `N_KV`. Here mainline FA is faster by about 3% when the KV cache is empty. This is an artifact of the way FA is implemented: the minimum size of the u-batch created is 256 tokens. When there is no actual context in the KV cache almost all tokens are masked away. Mainline's FA implementation checks for that and skips the `K*Q` dot product for such tokens. I have not bothered adding this optimization to `ik_llama.cpp` as it never is useful in actual usage (when the KV cache is not empty). With any context `ik_llama.cpp` is faster. The performance gap increases with increasing number of tokens in the KV cache and reaches 39% (no FA) or 70% (FA) at 16k tokens.
![dsl2_cpu_tg](https://github.com/user-attachments/assets/eb8a1793-d8ba-4157-a327-283c4b7629cf)
The next graph shows PP performance as a function of `N_KV`. Here the performance gap to mainline without FA is 2.87X for zero context, increasing to 4.5X at 16k tokens. When FA is enabled in mainline, it is 10X slower at 16k tokens.
![dsl2_cpu_pp](https://github.com/user-attachments/assets/d68ba66b-c3bf-4fae-adc8-e8dd8cb59b04)
<details>
<summary> llama.cpp CPU performance data (FA disabled)</summary>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.938 | 264.21 | 3.802 | 33.67 |
| 512 | 128 | 512 | 2.207 | 231.96 | 3.936 | 32.52 |
| 512 | 128 | 1024 | 2.523 | 202.97 | 4.091 | 31.29 |
| 512 | 128 | 1536 | 2.883 | 177.61 | 4.273 | 29.96 |
| 512 | 128 | 2048 | 3.175 | 161.26 | 4.405 | 29.06 |
| 512 | 128 | 2560 | 3.502 | 146.20 | 4.466 | 28.66 |
| 512 | 128 | 3072 | 3.818 | 134.09 | 4.634 | 27.62 |
| 512 | 128 | 3584 | 4.134 | 123.84 | 4.685 | 27.32 |
| 512 | 128 | 4096 | 4.460 | 114.79 | 4.838 | 26.46 |
| 512 | 128 | 4608 | 4.783 | 107.04 | 4.967 | 25.77 |
| 512 | 128 | 5120 | 5.102 | 100.36 | 5.105 | 25.07 |
| 512 | 128 | 5632 | 5.398 | 94.84 | 5.246 | 24.40 |
| 512 | 128 | 6144 | 5.737 | 89.25 | 5.396 | 23.72 |
| 512 | 128 | 6656 | 6.067 | 84.40 | 5.529 | 23.15 |
| 512 | 128 | 7168 | 6.372 | 80.35 | 5.663 | 22.60 |
| 512 | 128 | 7680 | 6.682 | 76.63 | 5.781 | 22.14 |
| 512 | 128 | 8192 | 7.010 | 73.03 | 5.909 | 21.66 |
| 512 | 128 | 8704 | 7.335 | 69.81 | 6.020 | 21.26 |
| 512 | 128 | 9216 | 7.643 | 66.99 | 6.125 | 20.90 |
| 512 | 128 | 9728 | 7.928 | 64.58 | 6.233 | 20.53 |
| 512 | 128 | 10240 | 8.282 | 61.82 | 6.358 | 20.13 |
| 512 | 128 | 10752 | 8.601 | 59.53 | 6.487 | 19.73 |
| 512 | 128 | 11264 | 8.912 | 57.45 | 6.625 | 19.32 |
| 512 | 128 | 11776 | 9.194 | 55.69 | 6.760 | 18.94 |
| 512 | 128 | 12288 | 9.549 | 53.62 | 6.898 | 18.56 |
| 512 | 128 | 12800 | 9.872 | 51.86 | 7.028 | 18.21 |
| 512 | 128 | 13312 | 10.186 | 50.27 | 7.161 | 17.87 |
| 512 | 128 | 13824 | 10.465 | 48.92 | 7.281 | 17.58 |
| 512 | 128 | 14336 | 10.824 | 47.30 | 7.398 | 17.30 |
| 512 | 128 | 14848 | 11.142 | 45.95 | 7.508 | 17.05 |
| 512 | 128 | 15360 | 11.462 | 44.67 | 7.620 | 16.80 |
| 512 | 128 | 15872 | 11.733 | 43.64 | 7.721 | 16.58 |
</details>
<details>
<summary> llama.cpp CPU performance data (FA enabled)</summary>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 1.912 | 267.73 | 3.695 | 34.64 |
| 512 | 128 | 512 | 2.618 | 195.55 | 3.846 | 33.28 |
| 512 | 128 | 1024 | 3.394 | 150.85 | 4.028 | 31.78 |
| 512 | 128 | 1536 | 4.184 | 122.38 | 4.211 | 30.40 |
| 512 | 128 | 2048 | 4.958 | 103.27 | 4.416 | 28.98 |
| 512 | 128 | 2560 | 5.711 | 89.65 | 4.582 | 27.94 |
| 512 | 128 | 3072 | 6.545 | 78.22 | 4.767 | 26.85 |
| 512 | 128 | 3584 | 7.257 | 70.55 | 4.958 | 25.81 |
| 512 | 128 | 4096 | 8.079 | 63.37 | 5.143 | 24.89 |
| 512 | 128 | 4608 | 8.981 | 57.01 | 5.336 | 23.99 |
| 512 | 128 | 5120 | 9.600 | 53.33 | 5.468 | 23.41 |
| 512 | 128 | 5632 | 10.373 | 49.36 | 5.660 | 22.62 |
| 512 | 128 | 6144 | 11.271 | 45.43 | 5.850 | 21.88 |
| 512 | 128 | 6656 | 11.922 | 42.95 | 6.058 | 21.13 |
| 512 | 128 | 7168 | 12.692 | 40.34 | 6.247 | 20.49 |
| 512 | 128 | 7680 | 13.498 | 37.93 | 6.435 | 19.89 |
| 512 | 128 | 8192 | 14.237 | 35.96 | 6.563 | 19.50 |
| 512 | 128 | 8704 | 15.004 | 34.12 | 6.755 | 18.95 |
| 512 | 128 | 9216 | 15.794 | 32.42 | 6.942 | 18.44 |
| 512 | 128 | 9728 | 16.552 | 30.93 | 7.131 | 17.95 |
| 512 | 128 | 10240 | 17.326 | 29.55 | 7.321 | 17.48 |
| 512 | 128 | 10752 | 18.126 | 28.25 | 7.520 | 17.02 |
| 512 | 128 | 11264 | 18.846 | 27.17 | 7.713 | 16.60 |
| 512 | 128 | 11776 | 19.618 | 26.10 | 7.902 | 16.20 |
| 512 | 128 | 12288 | 20.404 | 25.09 | 8.096 | 15.81 |
| 512 | 128 | 12800 | 21.219 | 24.13 | 8.286 | 15.45 |
| 512 | 128 | 13312 | 21.950 | 23.33 | 8.543 | 14.98 |
| 512 | 128 | 13824 | 22.765 | 22.49 | 8.735 | 14.65 |
| 512 | 128 | 14336 | 23.532 | 21.76 | 8.933 | 14.33 |
| 512 | 128 | 14848 | 24.284 | 21.08 | 9.119 | 14.04 |
| 512 | 128 | 15360 | 25.070 | 20.42 | 9.316 | 13.74 |
| 512 | 128 | 15872 | 25.856 | 19.80 | 9.510 | 13.46 |
</details>
<details>
<summary>ik_llama.cpp CPU performance data</summary>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 0.739 | 693.23 | 3.836 | 33.37 |
| 512 | 128 | 512 | 0.769 | 665.76 | 3.931 | 32.56 |
| 512 | 128 | 1024 | 0.817 | 626.90 | 3.958 | 32.34 |
| 512 | 128 | 1536 | 0.869 | 589.09 | 3.991 | 32.07 |
| 512 | 128 | 2048 | 0.912 | 561.30 | 4.037 | 31.71 |
| 512 | 128 | 2560 | 0.967 | 529.68 | 4.087 | 31.32 |
| 512 | 128 | 3072 | 1.020 | 502.07 | 4.146 | 30.87 |
| 512 | 128 | 3584 | 1.087 | 470.96 | 4.182 | 30.61 |
| 512 | 128 | 4096 | 1.132 | 452.35 | 4.235 | 30.22 |
| 512 | 128 | 4608 | 1.189 | 430.73 | 4.290 | 29.84 |
| 512 | 128 | 5120 | 1.247 | 410.52 | 4.351 | 29.42 |
| 512 | 128 | 5632 | 1.304 | 392.59 | 4.426 | 28.92 |
| 512 | 128 | 6144 | 1.363 | 375.64 | 4.508 | 28.39 |
| 512 | 128 | 6656 | 1.420 | 360.52 | 4.584 | 27.92 |
| 512 | 128 | 7168 | 1.485 | 344.78 | 4.665 | 27.44 |
| 512 | 128 | 7680 | 1.542 | 332.04 | 4.751 | 26.94 |
| 512 | 128 | 8192 | 1.605 | 318.99 | 4.821 | 26.55 |
| 512 | 128 | 8704 | 1.669 | 306.76 | 4.736 | 27.02 |
| 512 | 128 | 9216 | 1.736 | 294.93 | 4.773 | 26.82 |
| 512 | 128 | 9728 | 1.802 | 284.05 | 4.832 | 26.49 |
| 512 | 128 | 10240 | 1.865 | 274.57 | 4.889 | 26.18 |
| 512 | 128 | 10752 | 1.927 | 265.65 | 4.949 | 25.87 |
| 512 | 128 | 11264 | 1.994 | 256.77 | 5.015 | 25.53 |
| 512 | 128 | 11776 | 2.063 | 248.24 | 5.074 | 25.23 |
| 512 | 128 | 12288 | 2.127 | 240.67 | 5.139 | 24.91 |
| 512 | 128 | 12800 | 2.194 | 233.39 | 5.207 | 24.58 |
| 512 | 128 | 13312 | 2.262 | 226.33 | 5.272 | 24.28 |
| 512 | 128 | 13824 | 2.326 | 220.10 | 5.342 | 23.96 |
| 512 | 128 | 14336 | 2.389 | 214.35 | 5.399 | 23.71 |
| 512 | 128 | 14848 | 2.456 | 208.43 | 5.461 | 23.44 |
| 512 | 128 | 15360 | 2.522 | 203.02 | 5.511 | 23.23 |
| 512 | 128 | 15872 | 2.590 | 197.72 | 5.573 | 22.97 |
</details>
---
#### 🗣️ Discussion
👤 **JohannesGaessler** replied the **2025-04-29** at **07:29:26**:<br>
Since you are tagging me: I did look at the more general implementation for mapping MoE to regular matrix multiplications in the PR where I commented but I did not look at any MoE-specific CUDA code for matrix vector multiplication, nor was I aware that this repository had such an optimization. It's just the natural way of writing a fused kernel.
> 👤 **ikawrakow** replied the **2025-04-29** at **14:39:31**:<br>
> > It's just the natural way of writing a fused kernel.
>
> Sure, a kernel that did not get written for a very long time, despite the well known fact that `llama.cpp` CUDA performance for MoE models is really bad. Which indicates that the understanding of how badly the fused kernel was needed was missing. It is not very often that one has a PR that [improves performance by up to 4X](https://github.com/ggml-org/llama.cpp/pull/13014#issuecomment-2816637977).
>
> But if it is so as you say, then sorry.
>
> 👤 **JohannesGaessler** replied the **2025-04-29** at **15:33:40**:<br>
> Apology accepted. My top priority was and still is good performance for dense GEMM/GEMV because that is the most fundamental operation. MoE optimizations have now simply reached the front of the priority queue.
---
👤 **cmoncure** replied the **2025-05-06** at **15:50:00**:<br>
I read this and the warning on the README.md about incompatible GGUFs is quite unfortunate. I don't mind spending the time to create my own quants for this fork in the pursuit of maximum performance. I am a total noob to creating quants, however.
I am building an EPYC box with 768 GB RAM and 96 GB VRAM (2x48). Will I be able to use scripts to conveniently convert such releases as DeepSeek V3/R1 or the curious tngtech/DeepSeek-R1T-Chimera model from safetensors?
Do you plan to support the incompatible mainline GGUF files? Can I assume that GGUFs created before mid-April or so will be compatible? (Downloading these larger models represents a considerable cost.)
Thank you for creating this work and making it available. You are a true wizard.
> 👤 **ikawrakow** replied the **2025-05-06** at **16:16:34**:<br>
> > Can I assume that GGUFs created before mid-April or so will be compatible? (Downloading these larger models represents a considerable cost.)
>
> I think so. But to make sure, if you are downloading from HF, you can check the content of the GGUF. To be compatible, it needs to have tensors ` blk.X.attn_kv_b.weight` (where `X` is the layer index, so 0,1,...). If it does, it will work with this fork. If instead it has separate tensors `blk.X.attn_k_b.weight` and `blk.X.attn_v_b.weight`, it is most likely not compatible.
>
> > Do you plan to support the incompatible mainline GGUF files?
>
> No, not really. There are implications beyond compatibility. The change impacts quantization of the attention tensors, and I think there are now some reports from users about reduced model quality after the change was made and the quantized models compatible with that change started coming out.
>
> 👤 **saood06** replied the **2025-05-06** at **20:24:09**:<br>
> > I think so. But to make sure, if you are downloading from HF, you can check the content of the GGUF. To be compatible, it needs to have tensors ` blk.X.attn_kv_b.weight` (where `X` is the layer index, so 0,1,...). If it does, it will work with this fork. If instead it has separate tensors `blk.X.attn_k_b.weight` and `blk.X.attn_v_b.weight`, it is most likely not compatible.
>
> Just to be more clear after looking at one converted with the compatible version of MLA that works [here](https://huggingface.co/ubergarm/DeepSeek-V3-0324-GGUF/tree/main/DeepSeek-V3-0324-IQ2_K_R4?show_file_info=DeepSeek-V3-0324-IQ2_K_R4%2FDeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf) , it has `attn_k_b.weight`, `attn_v_b.weight` and `attn_kv_b.weight`.
>
> Looking at one converted with the incompatible version of MLA that does not work [here](https://huggingface.co/bullerwins/DeepSeek-R1T-Chimera-GGUF/tree/main/DeepSeek-R1T-Chimera-Q4_K_M?show_file_info=DeepSeek-R1T-Chimera-Q4_K_M%2FDeepSeek-R1T-Chimera-Q4_K_M-00001-of-00010.gguf) it is missing `attn_kv_b.weight` but has `attn_k_b.weight` and `attn_v_b.weight`.
>
> Looking at one converted from before MLA support which will work here by generating the MLA tensors on the fly [here](https://huggingface.co/unsloth/DeepSeek-V3-GGUF/tree/main/DeepSeek-V3-Q2_K_L?show_file_info=DeepSeek-V3-Q2_K_L%2FDeepSeek-V3-Q2_K_L-00001-of-00005.gguf) it has `attn_kv_b.weight` but not `attn_k_b.weight`, `attn_v_b.weight`.
>
> So in conclusion if the model has all three `attn_k_b.weight`, `attn_v_b.weight` and `attn_kv_b.weight` or just `attn_kv_b.weight` it will work here, but if it has `attn_k_b.weight` and `attn_v_b.weight` but no `attn_kv_b.weight` it will not work here.
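> A quick way to run that check locally (a sketch, assuming the `gguf-dump` script from llama.cpp's `gguf-py` package; for a multi-part model the first part should be enough, since it holds the early layers):
> ```
> gguf-dump DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf | grep -E "blk\.0\.attn_(kv|k|v)_b\.weight"
> ```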
>
> Edit: The above is outdated, see #394 and #409
>
> 👤 **ubergarm** replied the **2025-05-12** at **15:39:39**:<br>
> Sorry for late reply @cmoncure , I have a rough outline of the process of going from fp8 to GGUF for ik's fork [buried in a fold in my quickstart guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) under the "Custom Quants" section.
>
> It's a bit dated already, but the basic procedures are described there. I'd suggest making your own imatrix and taking [this new PR411 into consideration](https://github.com/ikawrakow/ik_llama.cpp/pull/411) for that step as well.
>
> 👤 **saood06** replied the **2025-05-13** at **00:23:49**:<br>
> > Sorry for late reply @cmoncure , I have a rough outline of the process of going from fp8 to GGUF for ik's fork [buried in a fold in my quickstart guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258) under the "Custom Quants" section.
> >
> > Its a bit dated already, but the basic procedures are described there. I'd suggest making your own imatrix and take [this new PR411 into consideration ](https://github.com/ikawrakow/ik_llama.cpp/pull/411) for that step as well.
>
> The dequant method in your guide (that I had recommended) may need more precise instructions to work now. For more info see [this](https://github.com/ikawrakow/ik_llama.cpp/issues/383#issuecomment-2865306085) and the following comments.
>
> 👤 **ubergarm** replied the **2025-05-13** at **20:13:04**:<br>
> Thanks @saood06 , I managed to `git apply saood06.patch` copy/pasting your comment and that fixes up building `triton-cpu`. I tested with `uv venv ./venv --python 3.12 --python-preference=only-managed` for my venv and updated a couple lines of the quick start guide.
>
> Hopefully that's enough breadcrumbs for our future selves to figure it out.
>
> 👤 **saood06** replied the **2025-05-13** at **21:09:54**:<br>
> > Thanks @saood06 , I managed to `git apply saood06.patch` copy/pasting your comment and that fixes up building `triton-cpu`.
>
> Mind telling me the exact version/commit hash of `triton-cpu` you built?
>
> I noticed mine is 3.2.0 and they seem to be on 3.3.0 (and thus I hoped the bug would be fixed upstream)
>
> 👤 **ubergarm** replied the **2025-05-13** at **21:21:58**:<br>
> > > Thanks @saood06 , I managed to `git apply saood06.patch` copy/pasting your comment and that fixes up building `triton-cpu`.
> >
> > Mind telling me the exact version/commit hash of `triton-cpu` you built?
> >
> > I noticed mine is 3.2.0 and they seem to be on 3.3.0 (and thus I hoped the bug would be fixed upstream)
>
> I added your patch to `main@0625715c` `Artlesbol` `[MathToVecLib] Add support for setting bit-widths for AVX512...` `Apr 26 12:24:21 2025 +0800`
>
> I originally tried to use the same git sha I used the first time, but it doesn't exist anymore, so I guess they force pushed main or something somewhere along the way between now and March 13, 2025 maybe?
>
> 👤 **saood06** replied the **2025-05-13** at **21:45:22**:<br>
> > I originally tried to use the same git sha I used the first time, but it doesn't exist anymore, so I guess they force pushed main or something somewhere along the way between now and March 13, 2025 maybe?
>
> I noticed similar things when trying to look into the history of the repo. Whatever they are doing it makes tracing down the source of changes in their repo very tedious and annoying.
>
> Thanks for confirming the issue still exists in their latest commit, I don't currently plan on creating a better fix for them so I made an issue https://github.com/triton-lang/triton-cpu/issues/237 and hopefully they fix it.
>
> 👤 **saood06** replied the **2025-05-13** at **22:33:34**:<br>
> @ubergarm if you still have the build errors that my patch solves do you mind sharing them in the issue I made. I don't have them, and they are requesting them in the issue I opened.
>
> 👤 **ubergarm** replied the **2025-05-13** at **23:10:18**:<br>
> > @ubergarm if you still have the build errors that my patch solves do you mind sharing them in the issue I made. I don't have them, and they are requesting them in the issue I opened.
>
> It's a goofy browser SSH client for this specific rig; I tried to scroll my tmux back but it's gone...
>
> I see the issue and will just delete my `venv` and try to repro and paste it in there: https://github.com/triton-lang/triton-cpu/issues/237

File diff suppressed because it is too large

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,20 @@
### 🗣️ [#372](https://github.com/ikawrakow/ik_llama.cpp/discussions/372) - multy gpu
| **Author** | `airnsk` |
| :--- | :--- |
| **Created** | 2025-05-03 |
| **Updated** | 2025-05-06 |
---
#### Description
I have 2 CMP90 10 GB GPUs in a computer with 512 GB RAM. Is it possible to run Qwen3-235B?
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-06** at **16:39:53**:<br>
I think so. But what kind of performance you will get also depends on the CPU you have as a large part of the calculations will be done on the CPU.

View File

@@ -0,0 +1,175 @@
### 🗣️ [#384](https://github.com/ikawrakow/ik_llama.cpp/discussions/384) - ik_llama.cpp issues on an old workstation
| **Author** | `matt23654` |
| :--- | :--- |
| **Created** | 2025-05-06 |
| **Updated** | 2025-05-06 |
---
#### Description
Hi! So I have managed to get ubergarm's 235B quant to work on a 6-year-old workstation with 2×2080 Ti's, 64GB RAM and a pretty fast (new) SSD.
I have encountered some weird issues with trying to use multiple GPUs though:
- Just using one device and offloading all experts to CPU works.
- The problems start when I try to keep some MoE experts on GPUs...
- Trying to use 2 devices with -sm layer and putting the first few layers entirely on GPU results in a crash on load where for some reason CUDA tries to allocate 170GB of VRAM:
```
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 2048
llama_new_context_with_model: n_ubatch = 512
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 768.00 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 736.00 MiB
llama_new_context_with_model: KV self size = 1504.00 MiB, K (f16): 752.00 MiB, V (f16): 752.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
ggml_backend_cuda_buffer_type_alloc_buffer: allocating 167771.94 MiB on device 0: cudaMalloc failed: out of memory
ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 175921630208
llama_new_context_with_model: failed to allocate compute buffers
```
- Trying to use -sm row results either in an illegal memory access if I specifically pin some expert weights to CUDA1, or the ``GGML_ASSERT(!ggml_backend_buffer_is_cuda_split(src0_1->buffer) && "mul_mat_id does not support split buffers")`` error if I do not. Incidentally, I think the last one is because split buffers and 3D tensors are not supported by llama.cpp.
Command used (some variation of):
```
build/bin/llama-server -m ~/.cache/huggingface/hub/models--ubergarm--Qwen3-235B-A22B-GGUF/snapshots/073738969f80d41f288cbfd6a29523769336bee8/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf -ngl 99 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 8192 -ot "^blk\.[0-2]\.=CUDA1" -ot "^blk\.[3-9]\.ffn_.*_exps\.=CPU" -ot "[1-9][0-9]\.ffn_.*_exps\.=CPU" --host 127.0.0.1 --port 4000 -fa -fmoe -sm row -mg 0 -v
```
Am I just doing something wrong or is there some genuine bug here?
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-06** at **11:31:27**:<br>
Split mode "row" does not work for MoE models (and I'm not sure if it works for dense models as I don't have access to a multi-GPU system, so have not tested since forking). I'm pretty sure split mode "row" does not work for MoE models in mainline `llama.cpp` either.
With two or more GPUs you may need a more complicated tensor override recipe to get the best possible performance out of the system. For two identical GPUs I think you could start by using
```
-ngl 99 -ot exps=CPU -ts 50,50
```
note how much VRAM this has used on each GPU, and then change to e.g.
```
-ngl 99 -ts 50,50 -ot "blk\.[0-1]\.ffn=CUDA0,blk\.[2-3]\.ffn=CUDA1,exps=CPU"
```
(I'm just guessing, as I don't have access to a multi-GPU system).
Note that the tensor overrides are processed in the order they were defined on the command line. So, in the above example, we don't need to be specific about experts tensor layers going to the CPU because the ones that we want to stay on the GPU (layers 0,1 on CUDA0, layers 2,3 on CUDA1) were already handled, so all remaining experts go to the CPU.
If the GPUs are different, then it may be better to just manually define with `-ot` which tensors go where.
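For instance, a hedged sketch of a fully manual assignment (the layer indices below are placeholders, not tuned values):
```
-ngl 99 \
-ot "blk\.[0-3]\.ffn=CUDA0" \
-ot "blk\.[4-6]\.ffn=CUDA1" \
-ot "exps=CPU"
```
As with the example above, the final `exps=CPU` only catches the expert tensors not already claimed by the earlier patterns.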
> 👤 **matt23654** replied the **2025-05-06** at **13:54:09**:<br>
> Hi @ikawrakow !
>
> No matter what I do ``-sm layer`` just doesn't seem to work with 2 devices. A variation of your first command segfaults:
>
> ``build/bin/llama-server -m ~/.cache/huggingface/hub/models--ubergarm--Qwen3-235B-A22B-GGUF/snapshots/073738969f80d41f288cbfd6a29523769336bee8/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf -ngl 99 --temp 0.6 --top-k 20 --top-p 0.95 --min-p 0 --presence-penalty 1.5 -c 8192 --host 127.0.0.1 --port 4000 -fa -fmoe -sm layer -v -ts 50,50 -ot "exps=CPU"``
>
> ...
>
> ```
> llama_new_context_with_model: mla_attn = 0
> llama_new_context_with_model: attn_max_b = 0
> llama_new_context_with_model: fused_moe = 1
> llama_new_context_with_model: ser = -1, 0
> llama_new_context_with_model: freq_base = 1000000.0
> llama_new_context_with_model: freq_scale = 1
> llama_kv_cache_init: CUDA0 KV buffer size = 768.00 MiB
> llama_kv_cache_init: CUDA1 KV buffer size = 736.00 MiB
> llama_new_context_with_model: KV self size = 1504.00 MiB, K (f16): 752.00 MiB, V (f16): 752.00 MiB
> llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
> llama_new_context_with_model: pipeline parallelism enabled (n_copies=4)
> ggml_backend_cuda_buffer_type_alloc_buffer: allocating 173219.94 MiB on device 0: cudaMalloc failed: out of memory
> ggml_gallocr_reserve_n: failed to allocate CUDA0 buffer of size 181634272256
> llama_new_context_with_model: failed to allocate compute buffers
> llama_init_from_gpt_params: error: failed to create context with model '~/.cache/huggingface/hub/models--ubergarm--Qwen3-235B-A22B-GGUF/snapshots/073738969f80d41f288cbfd6a29523769336bee8/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf'
> ERR [ load_model] unable to load model | tid="127462866935808" timestamp=1746539401 model="~/.cache/huggingface/hub/models--ubergarm--Qwen3-235B-A22B-GGUF/snapshots/073738969f80d41f288cbfd6a29523769336bee8/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf"
> Segmentation fault (core dumped)
> ```
>
> I don't know why it wants to allocate such a huge amount of memory. It doesn't do that with one device or with ``-sm row`` (as mentioned row doesn't work if I try to put any MoE expert tensors on the GPUs).
>
> 👤 **ubergarm** replied the **2025-05-06** at **13:57:01**:<br>
> @matt23654
>
> First, I'm not sure where this came from, but a lot of folks keep using `-ot "^blk\.[3-9]\.ffn_.*_exps\.=CPU"`, which misses some other ffn tensors that don't have the `exps` suffix, as the naming convention on Qwen3 is a bit different from DeepSeek's, for example.
>
>
> One other tip for multi-gpu is to recompile with `-DGGML_SCHED_MAX_COPIES=1`
>
> Look here for more discussions and examples: https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF/discussions/1#681642d4a383b2fb9aa3bd8c
>
> Keep us posted how you get along, as some others have reported success with multi-gpu once they get the arguments just right for their specific systems!
>
> 👤 **matt23654** replied the **2025-05-06** at **15:19:56**:<br>
> Thanks @ubergarm ! For some reason ``-DGGML_SCHED_MAX_COPIES=1`` works and it no longer tries to allocate 170GB of VRAM. I'm getting ~15 tok/s PP and ~6 tok/s generation. Not too bad really for a very old computer offloading from SSD! Specs: i9-9940X, 64GB quad-channel RAM, 2×2080 Ti. I also offloaded all the ffn tensors as suggested.
>
> I'm guessing that I can't really expect to get a lot of PP speed with SSD offloading and an old CPU (i9-9940X)?
>
> 👤 **ikawrakow** replied the **2025-05-06** at **16:32:43**:<br>
> @matt23654 I'm curious what happens if you add `-rtr` to your command line. Model loading will take longer, but possibly this will improve your PP performance (PP being only 2.5 times faster than TG does not sound right).
>
> 👤 **matt23654** replied the **2025-05-06** at **19:59:06**:<br>
> @ikawrakow So there definitely looks to be something a bit weird going on, maybe because of the SSD, but ``-rtr`` didn't really change PP speed. I've also tried compiling with OpenBLAS, but that somehow seems to have made it slower (yay!).
>
> The CPU is less active during PP than during regular inference, so I can only assume that somehow the SSD is bottlenecking it. The SSD bandwidth on its own should only allow about 0.5 tok/s peak; I think the reason generation is so fast is that Qwen isn't choosing experts uniformly, so the kernel's page caching brings it far closer to the quad-channel RAM speed instead. That's my theory, anyway.
>
> 👤 **ubergarm** replied the **2025-05-06** at **20:44:40**:<br>
> You might be able to get some more out of it. Not sure what your final command was, but give this a try:
> ```
> # do *not* use BLAS and set -DGGML_SCHED_MAX_COPIES=1
> cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
> cmake --build build --config Release -j $(nproc)
>
> # 1. -sm layer seems default, so i removed it
> # 2. you didn't specify threads? set that to number of physical cores or experiment, i'll assume -t 16
> # 3. try the more simple to understand version regex of listing each ffn layer to each CUDA, increase if u have VRAM
> # 4. explicitly put all other ffn to CPU just so you see it print out on startup
> # 5. use quantized kv cache e.g. q8_0 or q4_0
>
> $ build/bin/llama-server \
> -m ~/.cache/huggingface/hub/models--ubergarm--Qwen3-235B-A22B-GGUF/snapshots/073738969f80d41f288cbfd6a29523769336bee8/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
> -c 8192 \
> -ctk q8_0 -ctv q8_0 \
> -fa \
> -fmoe \
> -ngl 99 \
> -ts 50,50 \
> -ot "blk\.(0|1)\.ffn.*=CUDA0" \
> -ot "blk\.(2|3)\.ffn.*=CUDA1" \
> -ot "ffn.*=CPU" \
> -t 16 \
> --temp 0.6 \
> --top-k 20 \
> --top-p 0.95 \
> --min-p 0 \
> --presence-penalty 1.5 \
> -v \
> --host 127.0.0.1 \
> --port 4000
> ```
>
> If you have more VRAM (assuming like 11GB per GPU?), then try to add one more layer each until you OOM, or use the extra e.g.
> ```
> -ot "blk\.(0|1|2)\.ffn.*=CUDA0" \
> -ot "blk\.(3|4|5)\.ffn.*=CUDA1" \
> ```
>
> Or you can use the extra VRAM for more context etc...
> Curious if you get anything more out of that; share your updated command whenever. Cheers!
>
> *EDIT*: I removed `-rtr` because you don't have enough RAM to use it, as it disables mmap. You can look into doing an offline tensor repack of the weights not offloaded to GPU, so you get the benefits of the repacked `_R4` types while still letting mmap() make it run despite only 64GB RAM (a sketch follows at the end of this thread).
>
> So your system is a bit more complex of a setup to get max speed.
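A hedged sketch of the offline repack step mentioned in the edit above, following the `llama-quantize --repack` usage shown in discussion #385; the `iq3_k_r4` target type and the output file name are assumptions, not something verified in this thread:
```
# assumption: repack the CPU-resident tensors to their _R4 variants offline,
# so the resulting file can still be mmap()ed at load time (unlike -rtr)
./build/bin/llama-quantize --repack \
    Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf \
    Qwen3-235B-A22B-mix-IQ3_K_R4.gguf \
    iq3_k_r4
```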

View File

@@ -0,0 +1,349 @@
### 🗣️ [#385](https://github.com/ikawrakow/ik_llama.cpp/discussions/385) - Qwen3 235B performance on Intel Xeon Scalable processor
| **Author** | `Gaolingx` |
| :--- | :--- |
| **Created** | 2025-05-06 |
| **Updated** | 2025-05-27 |
---
#### Description
## Introduction
The Qwen3 models were officially released on April 29, 2025. Qwen3-235B-A22B is a mixture-of-experts (MoE) model with 235B total and 22B activated parameters; its main features are the following.
- Type: Causal Language Models
- Training Stage: Pretraining & Post-training
- Number of Parameters: 235B in total and 22B activated
- Number of Parameters (Non-Embedding): 234B
- Number of Layers: 94
- Number of Attention Heads (GQA): 64 for Q and 4 for KV
- Number of Experts: 128
- Number of Activated Experts: 8
- Context Length: 32,768 natively and 131,072 tokens with YaRN.
Support for qwen3moe was added in PR #355. I tried to run the biggest model [Qwen3-235B-A22B-128K-GGUF](https://hf-mirror.com/unsloth/Qwen3-235B-A22B-128K-GGUF) with ik_llama.cpp on my workstation. I wanted better generation quality, and my system has sufficient memory (512 GB RAM in total), so I chose the relatively high-quality quantization `Q8_0`.
## System Info
Here is my system info (hardware and software):
- Hardware
- CPU: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz(20c, 40t) x2
- RAM: RDIMM DDR4 2666 2Rx4 32G x16(12 Channels total)
- Motherboard: Supermicro X11DPi-N
- SSD: ZHITAI TiPlus7100 1TB
- Software
- OS: Microsoft Windows 10 Pro
- BIOS: Hyper-Threading-Enable, SNC-Disable
- Model: Qwen3-235B-A22B-128K-Q8_0(unsloth/Qwen3-235B-A22B-128K-GGUF)
- ik_llama.cpp:
```text
INFO [ main] build info | tid="61372" timestamp=1746525421 build=3667 commit="e3fec173"
INFO [ main] system info | tid="61372" timestamp=1746525421 n_threads=16 n_threads_batch=-1 total_threads=40 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
```
## Memory Performance
![cachemem2](https://github.com/user-attachments/assets/264caeef-bc57-4d42-9d8a-21b835fc9219)
## CPU-backend performance
The command line for `ik_llama.cpp` is:
llama-sweep-bench:
```text
./llama-sweep-bench -m "%MODEL_PATH%" -c 16384 -t 20 -ngl 0 -fa
```
### ik_llama.cpp CPU-only performance data (Qwen3-235B-A22B-128K-Q8_0)
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 0, n_threads = 20, n_threads_batch = 20
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 67.198 | 7.62 | 53.220 | 2.41 |
| 512 | 128 | 512 | 65.739 | 7.79 | 51.455 | 2.49 |
| 512 | 128 | 1024 | 67.660 | 7.57 | 51.890 | 2.47 |
| 512 | 128 | 1536 | 68.719 | 7.45 | 52.238 | 2.45 |
| 512 | 128 | 2048 | 70.073 | 7.31 | 53.222 | 2.41 |
| 512 | 128 | 2560 | 71.726 | 7.14 | 53.961 | 2.37 |
| 512 | 128 | 3072 | 73.097 | 7.00 | 54.397 | 2.35 |
| 512 | 128 | 3584 | 74.688 | 6.86 | 54.247 | 2.36 |
| 512 | 128 | 4096 | 76.166 | 6.72 | 56.074 | 2.28 |
| 512 | 128 | 4608 | 78.441 | 6.53 | 55.985 | 2.29 |
| 512 | 128 | 5120 | 85.400 | 6.00 | 56.714 | 2.26 |
| 512 | 128 | 5632 | 80.910 | 6.33 | 58.679 | 2.18 |
| 512 | 128 | 6144 | 82.747 | 6.19 | 56.730 | 2.26 |
| 512 | 128 | 6656 | 83.653 | 6.12 | 57.644 | 2.22 |
| 512 | 128 | 7168 | 85.044 | 6.02 | 57.860 | 2.21 |
| 512 | 128 | 7680 | 86.687 | 5.91 | 59.510 | 2.15 |
| 512 | 128 | 8192 | 88.306 | 5.80 | 59.983 | 2.13 |
| 512 | 128 | 8704 | 95.135 | 5.38 | 58.736 | 2.18 |
| 512 | 128 | 9216 | 91.348 | 5.60 | 60.733 | 2.11 |
| 512 | 128 | 9728 | 97.391 | 5.26 | 60.376 | 2.12 |
| 512 | 128 | 10240 | 95.785 | 5.35 | 64.163 | 1.99 |
| 512 | 128 | 10752 | 98.549 | 5.20 | 63.393 | 2.02 |
| 512 | 128 | 11264 | 98.616 | 5.19 | 61.447 | 2.08 |
| 512 | 128 | 11776 | 105.775 | 4.84 | 65.116 | 1.97 |
| 512 | 128 | 12288 | 102.959 | 4.97 | 67.291 | 1.90 |
| 512 | 128 | 12800 | 105.210 | 4.87 | 65.661 | 1.95 |
| 512 | 128 | 13312 | 107.702 | 4.75 | 66.114 | 1.94 |
| 512 | 128 | 13824 | 109.233 | 4.69 | 64.225 | 1.99 |
| 512 | 128 | 14336 | 111.032 | 4.61 | 67.671 | 1.89 |
| 512 | 128 | 14848 | 114.479 | 4.47 | 66.681 | 1.92 |
| 512 | 128 | 15360 | 117.857 | 4.34 | 73.044 | 1.75 |
| 512 | 128 | 15872 | 120.052 | 4.26 | 71.046 | 1.80 |
---
![02](https://github.com/user-attachments/assets/9bbdc4f2-0222-4e68-bfa8-145cabe97691)
### ik_llama.cpp CPU-only performance data (Qwen3-30B-A3B-128K-GGUF)
I also experimented with `Qwen3-30B-A3B-128K-Q8_0` (unsloth/Qwen3-30B-A3B-128K-GGUF). Here are the results; the performance is much better than I thought.
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 0, n_threads = 20, n_threads_batch = 20
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 8.519 | 60.10 | 9.924 | 12.90 |
| 512 | 128 | 512 | 8.950 | 57.21 | 10.045 | 12.74 |
| 512 | 128 | 1024 | 9.279 | 55.18 | 10.204 | 12.54 |
| 512 | 128 | 1536 | 9.648 | 53.07 | 10.613 | 12.06 |
| 512 | 128 | 2048 | 10.097 | 50.71 | 10.722 | 11.94 |
| 512 | 128 | 2560 | 10.486 | 48.83 | 11.015 | 11.62 |
| 512 | 128 | 3072 | 10.999 | 46.55 | 11.164 | 11.47 |
| 512 | 128 | 3584 | 11.336 | 45.17 | 11.139 | 11.49 |
| 512 | 128 | 4096 | 12.480 | 41.03 | 11.718 | 10.92 |
| 512 | 128 | 4608 | 12.244 | 41.82 | 11.725 | 10.92 |
| 512 | 128 | 5120 | 12.551 | 40.79 | 12.213 | 10.48 |
| 512 | 128 | 5632 | 13.537 | 37.82 | 12.453 | 10.28 |
| 512 | 128 | 6144 | 13.356 | 38.34 | 12.584 | 10.17 |
| 512 | 128 | 6656 | 13.847 | 36.98 | 12.603 | 10.16 |
| 512 | 128 | 7168 | 14.128 | 36.24 | 12.656 | 10.11 |
| 512 | 128 | 7680 | 14.631 | 34.99 | 13.198 | 9.70 |
| 512 | 128 | 8192 | 15.002 | 34.13 | 13.520 | 9.47 |
| 512 | 128 | 8704 | 15.356 | 33.34 | 13.095 | 9.77 |
| 512 | 128 | 9216 | 16.050 | 31.90 | 13.614 | 9.40 |
| 512 | 128 | 9728 | 16.395 | 31.23 | 13.093 | 9.78 |
| 512 | 128 | 10240 | 16.790 | 30.49 | 14.537 | 8.80 |
| 512 | 128 | 10752 | 17.052 | 30.03 | 14.793 | 8.65 |
| 512 | 128 | 11264 | 17.668 | 28.98 | 13.957 | 9.17 |
| 512 | 128 | 11776 | 18.276 | 28.02 | 15.028 | 8.52 |
| 512 | 128 | 12288 | 18.335 | 27.92 | 15.267 | 8.38 |
| 512 | 128 | 12800 | 19.061 | 26.86 | 15.272 | 8.38 |
| 512 | 128 | 13312 | 19.379 | 26.42 | 15.310 | 8.36 |
| 512 | 128 | 13824 | 19.764 | 25.91 | 15.000 | 8.53 |
| 512 | 128 | 14336 | 20.432 | 25.06 | 15.612 | 8.20 |
| 512 | 128 | 14848 | 21.632 | 23.67 | 15.587 | 8.21 |
| 512 | 128 | 15360 | 22.311 | 22.95 | 17.303 | 7.40 |
| 512 | 128 | 15872 | 21.767 | 23.52 | 16.894 | 7.58 |
---
![03](https://github.com/user-attachments/assets/3f4f1148-85dc-471d-85ee-0a4afa13db07)
## Profiler Data
I also used `Intel VTune Profiler 2025.0.1` to capture some interesting data when running llama-server with `Qwen3-30B-A3B-128K-Q8_0`; I will show them as well.
![2025-05-04T15_17_00](https://github.com/user-attachments/assets/8ed1d864-4cb5-483b-9df9-a72bbbfc426b)
![2025-05-04T15_51_53](https://github.com/user-attachments/assets/152044c8-9a54-4992-8afb-501a791260c6)
![2025-05-04T15_52_19](https://github.com/user-attachments/assets/5af8f7da-8b6d-4686-a4c9-68c7ffeb2925)
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-06** at **13:11:51**:<br>
Thank you for these results. Quite amazing that it works reasonably well on an almost 8-year-old CPU!
I'm curious if you might get better performance by repacking the model (unlikely for TG, very likely for PP). You can repack either on the fly by adding `-rtr` to the command line, or offline like this
```
./bin/llama-quantize --repack $model $repacked_model q8_0_r8
```
This shouldn't take very long, even for the 235B model.
Another note: at least on the CPUs that I have available, one gets better performance using `q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the command line). Not so much for short contexts, but quite noticeable for long contexts.
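Putting both suggestions together, a hedged sketch of the benchmark invocation above with on-the-fly repacking and a quantized KV cache (flags only, untested here):
```text
./llama-sweep-bench -m "%MODEL_PATH%" -c 16384 -t 20 -ngl 0 -fa -rtr -ctk q8_0 -ctv q8_0
```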
> 👤 **saood06** replied the **2025-05-06** at **20:29:54**:<br>
> > Another note: at least on the CPUs that I have available, one gets better performance using `q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the command line). Not so much for short contexts, but quite noticeable for long contexts.
>
> I have seen this https://www.reddit.com/r/LocalLLaMA/comments/1kewkno/qwen_30b_a3b_performance_degradation_with_kv/ where they report that using the `q8_0` KV cache makes the model unable to solve a problem, with a comment saying:
> ```
> KV cache q8_0: 0/5
> KV cache f16: 2/2
> ```
>
> 👤 **Gaolingx** replied the **2025-05-07** at **07:16:13**:<br>
> OK, thanks for the info. I found that the memory bandwidth was not saturated when I analyzed the memory access with the VTune profiler. Maybe the NUMA setup works better on Linux; I will try using `numactl` to change the memory policy ([https://github.com/ggml-org/llama.cpp/issues/1437](https://github.com/ggml-org/llama.cpp/issues/1437)), and repack the model with `q8_0_r8`. I will see if I can do better yet, however.
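One common way to try this on Linux is to interleave allocations across both NUMA nodes (a hedged sketch, untested in this thread; the model path is a placeholder and the thread count would need tuning):
```text
numactl --interleave=all ./llama-sweep-bench -m $MODEL_PATH -c 16384 -t 40 -fa
```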
---
👤 **Gaolingx** replied the **2025-05-07** at **18:42:39**:<br>
Note: when I run llama-server with the `-fa` and `-rtr` parameters, the speed is a little faster than with only `-fa`; both prefill and decode speeds increased. That is a good beginning!
`-c 8192 -t 16 -fa`:
```text
INFO [ print_timings] prompt eval time = 197624.81 ms / 1266 tokens ( 156.10 ms per token, 6.41 tokens per second) | tid="46204" timestamp=1746371113 id_slot=0 id_task=4917 t_prompt_processing=197624.812 n_prompt_tokens_processed=1266 t_token=156.10174723538705 n_tokens_second=6.406078200342577
INFO [ print_timings] generation eval time = 372468.51 ms / 861 runs ( 432.60 ms per token, 2.31 tokens per second) | tid="46204" timestamp=1746371113 id_slot=0 id_task=4917 t_token_generation=372468.513 n_decoded=861 t_token=432.5998989547038 n_tokens_second=2.3116047932889296
INFO [ print_timings] total time = 570093.32 ms | tid="46204" timestamp=1746371113 id_slot=0 id_task=4917 t_prompt_processing=197624.812 t_token_generation=372468.513 t_total=570093.325
```
`-c 8192 -t 16 -fa -rtr`:
```text
INFO [ print_timings] prompt eval time = 9707.99 ms / 168 tokens ( 57.79 ms per token, 17.31 tokens per second) | tid="46820" timestamp=1746855833 id_slot=0 id_task=9260 t_prompt_processing=9707.992 n_prompt_tokens_processed=168 t_token=57.78566666666667 n_tokens_second=17.30532946463079
INFO [ print_timings] generation eval time = 26156.20 ms / 76 runs ( 344.16 ms per token, 2.91 tokens per second) | tid="46820" timestamp=1746855833 id_slot=0 id_task=9260 t_token_generation=26156.196 n_decoded=76 t_token=344.1604736842105 n_tokens_second=2.905621291414088
INFO [ print_timings] total time = 35864.19 ms | tid="46820" timestamp=1746855833 id_slot=0 id_task=9260 t_prompt_processing=9707.992 t_token_generation=26156.196 t_total=35864.188
```
---
👤 **ikawrakow** replied the **2025-05-08** at **12:59:17**:<br>
@saood06
> I have seen this https://www.reddit.com/r/LocalLLaMA/comments/1kewkno/qwen_30b_a3b_performance_degradation_with_kv/ where they report that using the q8_0 KV cache makes the model unable to solve a problem, with a comment saying:
This grabbed my attention, as I have never seen any significant difference between `f16` and `q8_0` KV cache (if anything, I would be more suspicious of `f16` because it can overflow, and I think there have been reports about that). So, being someone who does not take things for granted, I tried it myself.
### Attempt 1
* I saw the Redditor is using a `Q4_K_M` model, so I tried a stock `Q4_K_M` quantization
* `f16` and `Q8_0` KV cache both fail in all 3 attempts
* `f16` and `q8_0` both at some point arrive at the correct conclusion that two characters in the encoded text correspond to a single letter, but both abandon the idea after some unsuccessful attempts
* `f16` and `q8_0` both enter into seemingly infinite loop of trying the same ideas again and again. Sometimes they stop and give an incorrect answer, sometimes they keep going until they run out of tokens (I gave a limit of 20k tokens)
### Attempt 2
* Quantize to stock `IQ4_K`
* 3 attempts with `f16` and 3 attempts with `q8_0`. Each attempt uses the same seed for `q8_0` and for `f16`, but there are 3 different seeds for the 3 attempts
* `f16`: 2 out of 3 correct. The failed attempt runs out of tokens. Correct, Correct, Incorrect
* `q8_0`: 2 out of 3 correct. The failed attempt comes back with an incorrect result after about 12k tokens. Correct, Incorrect, Correct
* Each run consumes a different amount of thinking tokens
Hence, I think that the outcome is largely determined by the quality of the quantized model and by some luck. We know that in a random process (as we have here) slight differences in the computed token probabilities can make the model go on a very different path, even if the same seed was used.
> 👤 **saood06** replied the **2025-05-08** at **22:40:13**:<br>
> >So, being someone who does not take things for granted, I tried it myself.
>
> Thank you. Do you mind saying what sampler settings you used?
>
> > Hence, I think that the outcome is largely determined by the quality of the quantized model and by some luck. We know that in a random process (as we have here) slight differences in the computed token probabilities can make the model go on a very different path, even if the same seed was used.
>
> The "luck" factor can be at least somewhat lessened based on how you sample (and why I like manually sampling and exploring many branches, and often injecting in tokens that would others never be sampled [since min_p would have removed it as it would be too low]). In my experience there are places where the "luck" of a single token selected by sane sampler settings does have an outsized impact on the internal world state, but often it doesn't with the model using different words or changing trivial things but otherwise staying on the same track. Either way for entire responses yes, there are often large variations between seeds and sampling parameters.
>
> There are other ways being researched to try and improve outcomes, such as using majority voting, incorporating scoring models or reward models, and other highly compute-intensive ways of trying to eke out more performance and consistency from models, but for me manual sampling works well (and I also find it interesting and enjoyable trying to create a mental model of the AI's mental model).
>
> >This grabbed my attention as I have never seen any significant difference between f16 and q8_0 KV cache (if anything, I would be more suspect towards f16 because it can overflow, and I think there have been reports about that).
>
> For me, with Deepseek based models I tend to use f16 as I don't see the need to save the space and speed is very close between them, but with other models I do quantize the KV cache, so I was also really surprised by the thread I linked. One last thing I saw in there that I forgot to mention was him stating "I know but as a side test I tried also Roo Code that I could not get to use all the tools with KV cache Q8 and worked fine with F16." so I'm not sure why his experience shows such stark differences that I also have never really experienced.
---
👤 **Gaolingx** replied the **2025-05-13** at **00:52:27**:<br>
Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation (2.7 tok/s -> 3.2 tok/s) by reducing the number of experts used (from top-8 to top-6), without a significant drop in quality.
parameter:
`.\llama-server --model "%MODEL%" --host %HOST% --port %PORT% --threads 16 --n-gpu-layers 0 --ctx-size 8192 --flash-attn --run-time-repack --override-kv qwen3moe.expert_used_count=int:6`
```text
INFO [ print_timings] prompt eval time = 10360.09 ms / 153 tokens ( 67.71 ms per token, 14.77 tokens per second) | tid="71476" timestamp=1747096864 id_slot=0 id_task=9696 t_prompt_processing=10360.092 n_prompt_tokens_processed=153 t_token=67.71301960784314 n_tokens_second=14.768208622085595
INFO [ print_timings] generation eval time = 15317.10 ms / 50 runs ( 306.34 ms per token, 3.26 tokens per second) | tid="71476" timestamp=1747096864 id_slot=0 id_task=9696 t_token_generation=15317.103 n_decoded=50 t_token=306.34206 n_tokens_second=3.2643248530743705
INFO [ print_timings] total time = 25677.19 ms | tid="71476" timestamp=1747096864 id_slot=0 id_task=9696 t_prompt_processing=10360.092 t_token_generation=15317.103 t_total=25677.195
```
> 👤 **saood06** replied the **2025-05-13** at **01:03:32**:<br>
> > Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation (2.7 tok/s -> 3.2 tok/s) by reducing the number of experts used (from top-8 to top-6), without a significant drop in quality.
>
> There is this feature: https://github.com/ikawrakow/ik_llama.cpp/pull/239 I personally haven't had much success using it (for Deepseek V3/R1) , but it may work for you on Qwen.
>
> 👤 **Gaolingx** replied the **2025-05-13** at **01:45:22**:<br>
> > > Note: qwen3moe uses 8 experts by default. I found that we can speed up token generation (2.7 tok/s -> 3.2 tok/s) by reducing the number of experts used (from top-8 to top-6), without a significant drop in quality.
> >
> > There is this feature: #239 I personally haven't had much success using it (for Deepseek V3/R1) , but it may work for you on Qwen.
>
> All right, it seems that `--smart-expert-reduction` does not work well on qwen3moe; a lot of garbled characters appear and the output repeats continuously.
>
> `--flash-attn --run-time-repack --smart-expert-reduction 6,1`
> ![批注 2025-05-13 093200](https://github.com/user-attachments/assets/3320649a-ae4f-466e-a2f6-dcc949ca4919)
>
> `--flash-attn --run-time-repack --smart-expert-reduction 7,1`
> ![批注 2025-05-13 094242](https://github.com/user-attachments/assets/370fb493-a9c7-42c7-a380-90935df8f23e)
>
> 👤 **ikawrakow** replied the **2025-05-13** at **12:35:23**:<br>
> Can you both try PR #415 and let me know if it now works? Thanks!
>
> 👤 **Gaolingx** replied the **2025-05-14** at **01:42:24**:<br>
> > Can you both try PR #415 and let me know if it now works? Thanks!
>
> Yes, I pulled PR #415. The smart expert reduction works very well on the CPU backend, thank you for fixing it.
> ![批注 2025-05-14 093324](https://github.com/user-attachments/assets/88e0af59-555c-4375-b5f8-78e0fd7789e7)
>
> `--flash-attn --run-time-repack --smart-expert-reduction 6,1`
>
> ```text
> INFO [ print_timings] prompt eval time = 8951.82 ms / 165 tokens ( 54.25 ms per token, 18.43 tokens per second) | tid="52244" timestamp=1747186657 id_slot=0 id_task=491 t_prompt_processing=8951.82 n_prompt_tokens_processed=165 t_token=54.253454545454545 n_tokens_second=18.432006005482684
> INFO [ print_timings] generation eval time = 24997.27 ms / 86 runs ( 290.67 ms per token, 3.44 tokens per second) | tid="52244" timestamp=1747186657 id_slot=0 id_task=491 t_token_generation=24997.269 n_decoded=86 t_token=290.66591860465115 n_tokens_second=3.4403758266553037
> INFO [ print_timings] total time = 33949.09 ms | tid="52244" timestamp=1747186657 id_slot=0 id_task=491 t_prompt_processing=8951.82 t_token_generation=24997.269 t_total=33949.089
> ```
---
👤 **VinnyG9** replied the **2025-05-19** at **15:30:30**:<br>
You forgot to set -nkvo?
What snoop mode are you using for NUMA?
Are you using one node?
Here are some numbers on the Xeon v4 @ Q2KL:
| model | size | params | backend | ngl | threads | fa | amb | ser | rtr | fmoe | test | t/s |
| ----------------------------------- | ----------: | ---------: | --------- | ----: | --------: | ---: | ----: | ----: | ----: | -----: | ------: | --------------: |
| ============ Repacked 659 tensors | | | | | | | | | | | | |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp32 | 34.41 ± 2.53 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp64 | 44.84 ± 1.45 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp128 | 54.11 ± 0.49 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp256 | 55.99 ± 2.86 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg32 | 6.73 ± 0.14 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg64 | 7.28 ± 0.38 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg128 | 8.29 ± 0.25 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg256 | 8.65 ± 0.20 |
---
👤 **ikawrakow** replied the **2025-05-19** at **15:38:58**:<br>
You cannot compare `Q2_K` to `Q8_0` for TG; there is going to be a difference in the range of 3X. Her PP is for a short prompt, and we don't know if it was a single prompt of 165 tokens or 10 prompts with 16 tokens each.
> 👤 **VinnyG9** replied the **2025-05-19** at **15:48:34**:<br>
> > You cannot compare `Q2_K` to `Q8_0` for TG, there is going to be a factor in the range of 3X difference. Her PP is for a short prompt, and we don't know if it was a single prompt of 165 tokens or 10 prompts with 16 tokens each.
>
> Or 2.5x going by model size :)
> I didn't mean to compare apples to apples, just to see more CPU benchmarks on the big MoEs, and to point out that the OP is on a multi-node system with HT on but limiting it to 25% of the total threads (the MoEs will scale with all threads),
> with no --numa flag and no info on snoop mode, which makes the biggest difference I've seen in my tests.
>
> Multi-socket is way more complicated but can be worth it.
---
👤 **Gaolingx** replied the **2025-05-27** at **13:06:54**:<br>
Well, I use the `-ser 4,1` parameter to improve token generation (TG) performance; now we can get ~4.1 tok/s TG (< 4k context size), and the quality has not declined too much. All right, I admit this is just my opinion; others can offer their own opinions on this point... We don't know what will happen in complex tasks...
`.\llama-server --model "%MODEL%" --host %HOST% --port %PORT% --threads 16 --n-gpu-layers 0 --ctx-size 8192 --flash-attn --run-time-repack --fused-moe --smart-expert-reduction 4,1`
```text
INFO [ print_timings] prompt eval time = 3343.34 ms / 66 tokens ( 50.66 ms per token, 19.74 tokens per second) | tid="12196" timestamp=1748316424 id_slot=0 id_task=5716 t_prompt_processing=3343.336 n_prompt_tokens_processed=66 t_token=50.65660606060606 n_tokens_second=19.740761921625587
INFO [ print_timings] generation eval time = 177876.86 ms / 731 runs ( 243.33 ms per token, 4.11 tokens per second) | tid="12196" timestamp=1748316424 id_slot=0 id_task=5716 t_token_generation=177876.858 n_decoded=731 t_token=243.3335950752394 n_tokens_second=4.109584620614335
INFO [ print_timings] total time = 181220.19 ms | tid="12196" timestamp=1748316424 id_slot=0 id_task=5716 t_prompt_processing=3343.336 t_token_generation=177876.858 t_total=181220.19400000002
```
---
![image](https://github.com/user-attachments/assets/7ba9179c-a661-466d-bba8-518ea755d082)

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,81 @@
### 🗣️ [#395](https://github.com/ikawrakow/ik_llama.cpp/discussions/395) - Why does imatrix not tokenize special tokens?
| **Author** | `bartowski1182` |
| :--- | :--- |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-09 |
---
#### Description
Recently there's been some discussion (and I've also experimented slightly) around adding chat tokens to the imatrix dataset and tokenizing them, a change from the default behaviour, so I was curious why the original implementation avoided tokenizing them.
Was it just an arbitrary decision or was there a reason at the time?
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-08** at **05:21:04**:<br>
When the `imatrix` tool was written, handling of chat, special tokens, etc., was extremely immature/non-existent in `llama.cpp`. If you look at the `llama_tokenize` function in `common` that is being used by the `imatrix` tool to tokenize the calibration data, you will see that the `parse_special` argument was added well after the `imatrix` tool was merged. It was added with a default value of `false`, so that defined the `imatrix` tool behavior with special tokens, as this argument is missing in the `imatrix` call to `::llama_tokenize`. By the time `llama_tokenize` got the ability to parse special tokens I had left the `llama.cpp` project, so somebody else needed to notice, investigate, and possibly change it.
Back then I had the concept that the calibration data for chat/instruction tuned models need to contain actual instruction tuning datasets. And, instead of blindly dividing the calibration data into chunks of `n_ctx` tokens, the chunks needed to be individual request-response pieces (or series of related request-response chunks in a conversation). But then everybody became an expert on `imatrix` calibration data, people started using the `imatrix` tool the way it is for chat models and it seemed to work OK, so I never followed up.
In any case, it would be interesting to see if including special tokens, using non equal-size chunks, etc., in the `imatrix` calibration data would improve the quality of quantized models.
---
👤 **ikawrakow** replied the **2025-05-09** at **08:46:05**:<br>
@bartowski1182 I see you submitted [this PR](https://github.com/ggml-org/llama.cpp/pull/13389) in mainline.
You are welcome.
> 👤 **bartowski1182** replied the **2025-05-09** at **12:33:00**:<br>
> Ah did I not send that reply here first? Sorry, I had one typed up
>
> That makes perfect sense though! Do you think you'd want the same thing here? I was planning to open one up in each repo, assuming it made sense; it seems like a nice idea for A/B testing anyway, but I figured I'd double-check with the original architect that there wasn't something glaringly obvious I was missing.
>
> Thanks again for the input!
>
> 👤 **bartowski1182** replied the **2025-05-09** at **12:42:35**:<br>
> Truly did not mean to just grab knowledge and run, that's a terrible look, hence I meant to ask if I could contribute the same here so that it wouldn't just be a one-sided deal (not that it's a complex change from me, but just the principle of it, it's not in good taste to open a discussion, get your insight, and run to mainline without saying anything, that isn't my style but it's exactly what I did in this case)
>
> 👤 **ikawrakow** replied the **2025-05-09** at **12:42:53**:<br>
> > Do you think you'd want the same thing here?
>
> Most people are using mainline `llama.cpp` to compute imatrix data, so it is not critical to have this here.
>
> I'm waiting to see if the mainline developers will independently discover what's wrong with the imatrix calculation after their change to support MLA. After they have independently discovered it, or when enough time has passed, I'll make the change here, and at that point I can also put in the ability to use special tokens. Do you hear complains from users about reduced model quality after the MLA change?
>
> 👤 **bartowski1182** replied the **2025-05-09** at **12:47:29**:<br>
> > Do you hear complains from users about reduced model quality after the MLA change
>
> No, I didn't hear anything about that yet, but MLA has its own can of worms with speed, so I had personally been avoiding remaking those models that have MLA since then, hoping for a resolution...
>
> Now I almost want to go on a hunt for it, but know it's gonna go right over my head as with other imatrix code :')
>
> Without looking directly at your commit history I doubt anyone in mainline will figure it out, but who knows
>
> I do know that I like your algorithm for some semi incomplete experts, seems reasonable to have some wiggle room there, especially if after 200k tokens of imatrix it's still not being activated quite enough
>
> 👤 **ikawrakow** replied the **2025-05-09** at **12:48:22**:<br>
> > Truly did not mean to just grab knowledge and run, that's a terrible look, hence I meant to ask if I could contribute the same here so that it wouldn't just be a one-sided deal (not that it's a complex change from me, but just the principle of it, it's not in good taste to open a discussion, get your insight, and run to mainline without saying anything, that isn't my style but it's exactly what I did in this case)
>
> No worries. I know you are not free to mention my name in the mainline repository, else your PR will have the same fate as [that one](https://github.com/ggml-org/llama.cpp/pull/12727)
>
> 👤 **bartowski1182** replied the **2025-05-09** at **12:55:14**:<br>
> > else your PR will have the same fate as that one
>
> I'd *like* to think that's not the reason, but rather the annoying complexity level of that function in general and excitement for a new feature (though the feature does miss out on an important part, counting discrete layers ahead of time and applying variable quantization automatically..)
>
> But who knows, it's not my drama to unpack, so much as I wish we could all get along in a nice Kumbaya circle and contribute to the open world together, I know I'm naive ;)
>
> 👤 **ikawrakow** replied the **2025-05-09** at **13:03:17**:<br>
> It has never been the style of the `llama.cpp` project to wait for the perfect solution before merging a useful change.
>
> Your PR is immensely helpful to anyone using mainline `llama.cpp` and making their own quantized MoE models.
>
> Sadly, there is only one possible conclusion from these two observations.

View File

@@ -0,0 +1,47 @@
### 🗣️ [#396](https://github.com/ikawrakow/ik_llama.cpp/discussions/396) - Best settings for Maverick - Dual CPU Xeon 8480+ - RTX 3090
| **Author** | `justinjja` |
| :--- | :--- |
| **Created** | 2025-05-07 |
| **Updated** | 2025-05-08 |
---
#### Description
With a single 8480+ and a 3090 I get excellent speeds, ~40 T/s on Maverick.
After installing a second CPU and another 8 sticks of RAM I can't get good speeds.
numa distribute gives ~27 T/s
numa isolate (and -t 56) is even slower at ~10 T/s
(With cache cleared between tests)
This is with Sub-NUMA Clustering disabled, so only 2 numa nodes total.
Any recommendations for settings that will get over 40 T/s?
Do I not understand what numa isolate does? I thought that would be the same as a single CPU.
llama-server -m Maverick-UD-IQ4_XS.gguf -c 32000 -fa -fmoe -amb 512 -rtr -ctk q8_0 -ctv q8_0 -ngl 99 -ot ".*ffn_.*_exps.*=CPU" --numa isolate -t 56
---
#### 🗣️ Discussion
👤 **justinjja** replied the **2025-05-08** at **01:11:10**:<br>
Small update,
I replaced --numa isolate with --numa numactl
and added: numactl --physcpubind=0-55,112-167 --membind=0 before my command
This does what I thought isolate would do.
I'm back at 40 T/s
Still no luck finding settings that actually use both CPUs.
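Putting the two changes together, a sketch of the single-socket invocation described above (same flags as the original command, with the numactl prefix and `--numa numactl` swapped in):
```
numactl --physcpubind=0-55,112-167 --membind=0 \
  llama-server -m Maverick-UD-IQ4_XS.gguf -c 32000 -fa -fmoe -amb 512 -rtr \
  -ctk q8_0 -ctv q8_0 -ngl 99 -ot ".*ffn_.*_exps.*=CPU" --numa numactl -t 56
```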
---
👤 **ikawrakow** replied the **2025-05-08** at **08:26:39**:<br>
There have been a lot of discussions around the Internet about `llama.cpp` performance on dual-socket systems, and the conclusion appears to be that the best one can do is to just use one physical CPU.
I don't have access to a dual socket system, so have done nothing related to NUMA in `ik_llama.cpp`. Hence, being a fork of `llama.cpp`, I expect it to behave the same.

View File

@@ -0,0 +1,159 @@
### 🗣️ [#397](https://github.com/ikawrakow/ik_llama.cpp/discussions/397) - KV split while using `-sm row`
| **Author** | `pt13762104` |
| :--- | :--- |
| **Created** | 2025-05-08 |
| **Updated** | 2025-05-08 |
---
#### Description
I have found that ik_llama.cpp does NOT support KV split while using `-sm row`, which is a limitation compared to llama.cpp. Is there any way to do this, or is it just not implemented yet?
Example output:
```
INFO [ main] build info | tid="137884088823808" timestamp=1746690385 build=3673 commit="4084ca73"
INFO [ main] system info | tid="137884088823808" timestamp=1746690385 n_threads=2 n_threads_batch=-1 total_threads=4 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 32 key-value pairs and 707 tensors from /root/Qwen3-32B-UD-Q5_K_XL.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = qwen3
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Qwen3-32B
llama_model_loader: - kv 3: general.basename str = Qwen3-32B
llama_model_loader: - kv 4: general.quantized_by str = Unsloth
llama_model_loader: - kv 5: general.size_label str = 32B
llama_model_loader: - kv 6: general.repo_url str = https://huggingface.co/unsloth
llama_model_loader: - kv 7: qwen3.block_count u32 = 64
llama_model_loader: - kv 8: qwen3.context_length u32 = 40960
llama_model_loader: - kv 9: qwen3.embedding_length u32 = 5120
llama_model_loader: - kv 10: qwen3.feed_forward_length u32 = 25600
llama_model_loader: - kv 11: qwen3.attention.head_count u32 = 64
llama_model_loader: - kv 12: qwen3.attention.head_count_kv u32 = 8
llama_model_loader: - kv 13: qwen3.rope.freq_base f32 = 1000000.000000
llama_model_loader: - kv 14: qwen3.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 15: qwen3.attention.key_length u32 = 128
llama_model_loader: - kv 16: qwen3.attention.value_length u32 = 128
llama_model_loader: - kv 17: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 18: tokenizer.ggml.pre str = qwen2
llama_model_loader: - kv 19: tokenizer.ggml.tokens arr[str,151936] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 20: tokenizer.ggml.token_type arr[i32,151936] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 21: tokenizer.ggml.merges arr[str,151387] = ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "Ġ t",...
llama_model_loader: - kv 22: tokenizer.ggml.eos_token_id u32 = 151645
llama_model_loader: - kv 23: tokenizer.ggml.padding_token_id u32 = 151654
llama_model_loader: - kv 24: tokenizer.ggml.add_bos_token bool = false
llama_model_loader: - kv 25: tokenizer.chat_template str = {%- if tools %}\n {{- '<|im_start|>...
llama_model_loader: - kv 26: general.quantization_version u32 = 2
llama_model_loader: - kv 27: general.file_type u32 = 17
llama_model_loader: - kv 28: quantize.imatrix.file str = Qwen3-32B-GGUF/imatrix_unsloth.dat
llama_model_loader: - kv 29: quantize.imatrix.dataset str = unsloth_calibration_Qwen3-32B.txt
llama_model_loader: - kv 30: quantize.imatrix.entries_count i32 = 448
llama_model_loader: - kv 31: quantize.imatrix.chunks_count i32 = 32
llama_model_loader: - type f32: 257 tensors
llama_model_loader: - type q4_K: 28 tensors
llama_model_loader: - type q5_K: 300 tensors
llama_model_loader: - type q6_K: 122 tensors
llm_load_vocab: special tokens cache size = 26
llm_load_vocab: token to piece cache size = 0.9311 MB
llm_load_print_meta: format = GGUF V3 (latest)
llm_load_print_meta: arch = qwen3
llm_load_print_meta: vocab type = BPE
llm_load_print_meta: n_vocab = 151936
llm_load_print_meta: n_merges = 151387
llm_load_print_meta: vocab_only = 0
llm_load_print_meta: n_ctx_train = 40960
llm_load_print_meta: n_embd = 5120
llm_load_print_meta: n_layer = 64
llm_load_print_meta: n_head = 64
llm_load_print_meta: n_head_kv = 8
llm_load_print_meta: n_rot = 128
llm_load_print_meta: n_swa = 0
llm_load_print_meta: n_swa_pattern = 1
llm_load_print_meta: n_embd_head_k = 128
llm_load_print_meta: n_embd_head_v = 128
llm_load_print_meta: n_gqa = 8
llm_load_print_meta: n_embd_k_gqa = 1024
llm_load_print_meta: n_embd_v_gqa = 1024
llm_load_print_meta: f_norm_eps = 0.0e+00
llm_load_print_meta: f_norm_rms_eps = 1.0e-06
llm_load_print_meta: f_clamp_kqv = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale = 0.0e+00
llm_load_print_meta: n_ff = 25600
llm_load_print_meta: n_expert = 0
llm_load_print_meta: n_expert_used = 0
llm_load_print_meta: causal attn = 1
llm_load_print_meta: pooling type = 0
llm_load_print_meta: rope type = 2
llm_load_print_meta: rope scaling = linear
llm_load_print_meta: freq_base_train = 1000000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn = 40960
llm_load_print_meta: rope_finetuned = unknown
llm_load_print_meta: ssm_d_conv = 0
llm_load_print_meta: ssm_d_inner = 0
llm_load_print_meta: ssm_d_state = 0
llm_load_print_meta: ssm_dt_rank = 0
llm_load_print_meta: model type = ?B
llm_load_print_meta: model ftype = Q5_K - Medium
llm_load_print_meta: model params = 32.762 B
llm_load_print_meta: model size = 21.603 GiB (5.664 BPW)
llm_load_print_meta: repeating layers = 20.510 GiB (5.646 BPW, 31.206 B parameters)
llm_load_print_meta: general.name = Qwen3-32B
llm_load_print_meta: BOS token = 11 ','
llm_load_print_meta: EOS token = 151645 '<|im_end|>'
llm_load_print_meta: PAD token = 151654 '<|vision_pad|>'
llm_load_print_meta: LF token = 148848 'ÄĬ'
llm_load_print_meta: EOT token = 151645 '<|im_end|>'
llm_load_print_meta: max token length = 256
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla T4, compute capability 7.5, VMM: yes
Device 1: Tesla T4, compute capability 7.5, VMM: yes
llm_load_tensors: ggml ctx size = 0.95 MiB
llm_load_tensors: offloading 64 repeating layers to GPU
llm_load_tensors: offloading non-repeating layers to GPU
llm_load_tensors: offloaded 65/65 layers to GPU
llm_load_tensors: CUDA_Split buffer size = 21608.65 MiB
llm_load_tensors: CPU buffer size = 510.04 MiB
llm_load_tensors: CUDA0 buffer size = 2.58 MiB
..................................................................................................
llama_new_context_with_model: n_ctx = 8192
llama_new_context_with_model: n_batch = 4096
llama_new_context_with_model: n_ubatch = 1024
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 0
llama_new_context_with_model: attn_max_b = 0
llama_new_context_with_model: fused_moe = 0
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 1000000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init: CUDA0 KV buffer size = 2048.00 MiB # where is CUDA1?
llama_new_context_with_model: KV self size = 2048.00 MiB, K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 1.16 MiB
llama_new_context_with_model: CUDA0 compute buffer size = 633.50 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 52.01 MiB
llama_new_context_with_model: graph nodes = 1734
llama_new_context_with_model: graph splits = 2
INFO [ init] initializing slots | tid="137884088823808" timestamp=1746690394 n_slots=1
INFO [ init] new slot | tid="137884088823808" timestamp=1746690394 id_slot=0 n_ctx_slot=8192
INFO [ main] model loaded | tid="137884088823808" timestamp=1746690394
INFO [ main] chat template | tid="137884088823808" timestamp=1746690394 chat_example="<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\nHi there<|im_end|>\n<|im_start|>user\nHow are you?<|im_end|>\n<|im_start|>assistant\n" built_in=true
INFO [ main] HTTP server listening | tid="137884088823808" timestamp=1746690394 n_threads_http="3" port="8080" hostname="127.0.0.1"
INFO [ update_slots] all slots are idle | tid="137884088823808" timestamp=1746690394
^C
INFO [ update_slots] all slots are idle | tid="137884088823808" timestamp=1746690402
```
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-08** at **08:08:16**:<br>
I have never looked into splitting the KV cache when using `-sm row`, so the behavior is whatever the behavior of `llama.cpp` was when I forked last year.
Out of curiosity: does `-sm row` give you better performance compared to `-sm layer`?
> 👤 **pt13762104** replied the **2025-05-08** at **08:36:42**:<br>
> Yes. About 1.5x better

View File

@@ -0,0 +1,109 @@
### 🗣️ [#399](https://github.com/ikawrakow/ik_llama.cpp/discussions/399) - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine
| **Author** | `fizzAI` |
| :--- | :--- |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-14 |
---
#### Description
Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use anyways) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly mainline LCPP gets pretty hosed.
Specs:
- **CPU**: Ryzen 5 3500, 6 cores/~3.6GHz iirc
- **RAM**: 16GB DDR4 at a max of 2667MHz (Yes, my motherboard sucks. Yes, I know.)
- **GPU**: Nvidia GTX 1650 Super
- **VRAM**: 4GB(!) of GDDR6
Here are the cherry-picked results that show each framework at its best -- both are running with `-ot exps=CPU` (with the LCPP table slightly modified because they output different formats).
| framework | model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| - | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | pp512 | 14.29 ± 0.05 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | tg128 | 2.75 ± 0.27 |
<details>
<summary>
And here's the full log including the commands used and other random attempts
</summary>
```
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.89 ± 0.24 |
build: 4084ca73 (3673)
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.74 ± 0.22 |
build: 4084ca73 (3673)
fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | threads | fa | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | pp512 | 14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | tg128 | 2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | pp512 | 11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | tg128 | 2.75 ± 0.36 |
build: 15e03282 (5318)
```
</details>
Some other interesting notes:
- Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA -- however, I did test with it enabled, and LCPP's CPU FA is so slow it's not even funny
- There's a bit of an uptick in performance without FA when `amb` is higher, but it's faster for `amb` to be lower with FA. ???
- I tried both `exps=CPU` (which I later found only offloads parts of the FFN to the CPU) and `ffn=CPU` (which offloads all of the FFN to the CPU as I was originally intending)... but it's slower to use the one which offloads the norms and stuff too! For some reason!
- I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).
I still need to try dense models, CPU without offload, etc etc for this to be a fair comparison, but I hope this is still interesting data :)
---
#### 🗣️ Discussion
👤 **VinnyG9** replied the **2025-05-14** at **12:05:43**:<br>
> * I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).
if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?
> 👤 **ikawrakow** replied the **2025-05-14** at **12:29:26**:<br>
> > if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.
>
> No, it does not. This is `ik_llama.cpp` not `llama.cpp`. I wrote the matrix multiplication implementation for almost all quants in `llamafile` and for all quants here, so I know that what I have here is faster than llamafile.
View File

@@ -0,0 +1,271 @@
### 🗣️ [#401](https://github.com/ikawrakow/ik_llama.cpp/discussions/401) - install bitnet (or other cpu models) on a fresh termux aarch64
| **Author** | `Benjamin-Wegener` |
| :--- | :--- |
| **Created** | 2025-05-09 |
| **Updated** | 2025-06-21 |
---
#### Description
Just for convenience, here are all the commands in sequence to install bitnet (or other CPU models) on a fresh termux aarch64:
```bash
apt update && apt install wget cmake git -y
git clone https://github.com/ikawrakow/ik_llama.cpp
cd ik_llama.cpp
cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16" -DGGML_IQK_FLASH_ATTENTION=OFF
cmake --build ./build --config Release -j $(nproc)
wget https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf?download=true -O ./models/ggml-model-i2_s.gguf
./build/bin/llama-quantize --allow-requantize ./models/ggml-model-i2_s.gguf ./models/bitnet.gguf iq2_bn_r4
./build/bin/llama-server -mla 3 --model ./models/bitnet.gguf
```
the template for the model in chat prompt under 127.0.0.1:8080 should be
```
<|begin_of_text|>{{prompt}}<|eot_id|>
{{history}}
{{char}}:
```
thanks for the help @ikawrakow @RobertAgee @saood06
edit: sometimes it's producing nonsense output, so I reverted to the old prompt template
---
#### 🗣️ Discussion
👤 **VinnyG9** replied the **2025-05-14** at **12:07:00**:<br>
what is a termux?
> 👤 **saood06** replied the **2025-05-14** at **12:25:00**:<br>
> > what is a termux?
>
> Android terminal emulator: https://termux.dev/en/
---
👤 **Benjamin-Wegener** replied the **2025-05-15** at **14:23:33**:<br>
using the standard built-in llama-server UI and pasting this into the prompt template field to get the correct chat format:
<|begin_of_text|>{{prompt}}<|eot_id|>
{{history}}
{{char}}:
> 👤 **saood06** replied the **2025-05-16** at **06:01:00**:<br>
> Just to be clear the proper template is:
>
> <|begin_of_text|>System: {system_message}<|eot_id|>
> User: {user_message_1}<|eot_id|>
> Assistant: {assistant_message_1}<|eot_id|>
> User: {user_message_2}<|eot_id|>
> Assistant: {assistant_message_2}<|eot_id|>
>
> It's been a while since I've used the server's template field but my testing using an alternative front-end following this was successful.
>
> 👤 **saood06** replied the **2025-05-18** at **12:42:54**:<br>
> @Benjamin-Wegener
>
> The template above is grabbed from the paper. It isn't what is meant to actually go into the template field under the server's built in front-end.
>
> That uses the following variables: {{prompt}}, {{history}}, {{char}}, {{name}}, {{message}} and has sections for the System Prompt, Prompt template, and Chat history template, along with names for the user and the AI.
>
> Even when I used the bundled front-end I still basically never used the "Chat" section where those fields existed. I used the completions section where I would manually conform to a template, but I can see why on a mobile device the Chat endpoint would be far more convenient.
>
> Also I have uploaded already converted models [here](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF) which might be useful if space is limited (the actual time to convert is minor for this model so unlike other models that benefit doesn't exist for it).
>
> 👤 **RobertAgee** replied the **2025-05-18** at **12:59:53**:<br>
> FWIW, once i got the server running, I was able to confirm it was working with this curl request. Alternatively, you could send this like a regular JSON webhook of course:
>
> ```
> curl http://127.0.0.1:8080/completion -X POST \
> -H "Content-Type: application/json" \
> -d '{
> "prompt": "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n",
> "temperature": 0.7,
> "n_predict": 128,
> "stop": ["<|im_end|>"]
> }'
> ```
>
> Also, I was able to connect [ChatterUI's](https://github.com/Vali-98/ChatterUI) (free and oss) mobile app to my termux server with a config file and now I have a superfast, local, AI with TTS, chat interface, and convo history.
>
> Setting up the connection took me awhile to figure out, so if anyone's interested, I'll share the config file and settings. But yeah, all things said Bitnet is rough but shows promise. Would love to try out an abliterated version and Falcon 3 to see if either of those would help it have a little more conversational flow.
>
> 👤 **Benjamin-Wegener** replied the **2025-05-18** at **13:44:35**:<br>
> so we revert that back to what i posted earlier for the server? what do you think?
>
> ```
> <|begin_of_text|>{{prompt}}<|eot_id|>
>
> {{history}}
> {{char}}:
> ```
> @saood06
---
👤 **RobertAgee** replied the **2025-05-16** at **05:26:44**:<br>
Didn't work for me in my case. Stayed hung up at compilation forever
![1000035416](https://github.com/user-attachments/assets/0b55130a-1964-44fb-8f44-da2bd2557b84)
> 👤 **ikawrakow** replied the **2025-05-16** at **05:30:51**:<br>
> You have to be patient. The file is 18k LOC of heavily templated C++ code. It takes a while to compile even on a fast desktop CPU. I know it needs to get refactored into multiple files (#183), but I haven't come around to do it.
>
> 👤 **ikawrakow** replied the **2025-05-16** at **06:21:47**:<br>
> Just measured: it takes 2 minutes on my M2-Max CPU to compile this file. Based on this, my guess is that it is in the 5-10 minutes range on a phone.
>
> 👤 **saood06** replied the **2025-05-16** at **06:26:21**:<br>
> > Just measured: it takes 2 minutes on my M2-Max CPU to compile this file. Based on this, my guess is that it is in the 5-10 minutes range on a phone.
>
> I feel like it took longer when I tested it, and the person reporting the clashing .so files reported around half an hour, but yes the solution is to just be patient.
>
> 👤 **RobertAgee** replied the **2025-05-16** at **06:27:06**:<br>
> I waited more than 10 minutes, without competing processes open. In htop, no read/write activity was happening, so something is causing it to hang, idk.
>
> 👤 **saood06** replied the **2025-05-16** at **06:29:17**:<br>
> > I waited more than 10 minutes, without competing processes open. in htop, no rw was happening so there's something causing it to hang idk
>
> But was there still CPU usage? Also if you don't mind sharing what device it was on it would help estimate how long it would take. ( I may be able to time a compile on the device I use to test Android on but that may be a while as I have to borrow that device).
>
> 👤 **RobertAgee** replied the **2025-05-17** at **14:17:34**:<br>
> Hi @saood06 I appreciate your patience and willingness to help. I have a Samsung a71 5g
>
> ```
> PLATFORM
> OS Android 10, upgradable to Android 13, One UI 5
> Chipset Exynos 980 (8 nm)
> CPU Octa-core (2x2.2 GHz Cortex-A77 & 6x1.8 GHz Cortex A55)
> GPU Mali-G76 MP5
> ```
>
> I did get it to compile and successfully run with the new FA kernels OFF flag at the compilation step.
>
> 👤 **saood06** replied the **2025-05-18** at **02:49:19**:<br>
> >Hi @saood06 I appreciate your patience and willingness to help
> >I did get it to compile and successfully run with the new FA kernels OFF flag at the compilation step.
>
> I'm glad you were able to get it working. I don't think the new flag is necessary but it definitely would speed things up, which could matter a lot (especially as a lot of users won't have the patience and understanding to just wait).
---
👤 **ikawrakow** replied the **2025-05-17** at **08:24:16**:<br>
You can now disable building the templated flash attention (FA) kernels. Disabling FA should massively improve build times.
See PR #429
> 👤 **RobertAgee** replied the **2025-05-17** at **10:00:36**:<br>
> Thanks @ikawrakow for the fast PR! I was able to successfully get it running and make a call to get a response! :)
>
> For anyone in my situation, it did have a few what looked like errors in the console during the build process, but it was successful, as I said, so no worries. Here's the list of commands with the speed up (disabling flash attention kernels):
>
> ```
> apt update && apt install wget cmake git -y
>
> git clone https://github.com/ikawrakow/ik_llama.cpp
>
> cd ik_llama.cpp
>
> cmake -B ./build -DGGML_CUDA=OFF -DGGML_BLAS=OFF -DGGML_ARCH_FLAGS="-march=armv8.2-a+dotprod+fp16" -DGGML_IQK_FLASH_ATTENTION=OFF
>
> cmake --build ./build --config Release -j $(nproc)
>
> wget https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/resolve/main/ggml-model-i2_s.gguf?download=true -O ./models/ggml-model-i2_s.gguf
>
> ./build/bin/llama-quantize --allow-requantize ./models/ggml-model-i2_s.gguf ./models/bitnet.gguf iq2_bn_r4
>
> ./build/bin/llama-server -mla 3 --model ./models/bitnet.gguf
> ```
>
> Sample call I made from my API tester app to the server to test it.
>
> ```
> curl http://127.0.0.1:8080/completion -X POST \
> -H "Content-Type: application/json" \
> -d '{
> "prompt": "<|im_start|>system\nYou are a helpful assistant<|im_end|>\n<|im_start|>user\nHello, who are you?<|im_end|>\n<|im_start|>assistant\n",
> "temperature": 0.7,
> "n_predict": 128,
> "stop": ["<|im_end|>"]
> }'
> ```
---
👤 **ikawrakow** replied the **2025-05-20** at **09:48:56**:<br>
There is now PR #435 that significantly reduces build time. I cannot test on Android myself, so would appreciate if someone did and reported
* New vs old build time (with CPU model)
* Does it still work correctly?
* Is the inference performance affected?
> 👤 **aezendc** replied the **2025-06-02** at **15:30:06**:<br>
> > There is now PR #435 that significantly reduces build time. I cannot test on Android myself, so would appreciate if someone did and reported
> >
> > * New vs old build time (with CPU model)
> > * Does it still work correctly?
> > * Is the inference performance affected?
>
> HI ikawrakow do we have a step by step running microsoft/bitnet-b1.58-2B-4T-gguf in windows?
>
> 👤 **ikawrakow** replied the **2025-06-02** at **15:36:51**:<br>
> There are no prebuilt packages, so you need to follow the [above instructions](https://github.com/ikawrakow/ik_llama.cpp/discussions/401#discussioncomment-13178115) and build yourself. Do they not work (with small adjustments)?
>
> 👤 **aezendc** replied the **2025-06-02** at **15:45:42**:<br>
> > There are no prebuilt packages, so you need to follow the [above instructions](https://github.com/ikawrakow/ik_llama.cpp/discussions/401#discussioncomment-13178115) and build yourself. Do they not work (with small adjustments)?
>
> I made it work. I used [saood06](https://github.com/saood06)'s converted model https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF. I will put together a basic set of commands.
>
> 👤 **saood06** replied the **2025-06-03** at **00:51:30**:<br>
> > do we have a step by step running microsoft/bitnet-b1.58-2B-4T-gguf in windows?
>
> There are build instructions with a lot more details for Windows [here](https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/build.md). Once it is built you can just grab the model either pre-converted one like [this](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF) or convert one yourself and just launch server. Which is covered in the above instructions.
>
> It seems like you have already figured it out, but just wanted to link the Windows build instructions in case anyone else finds this and wants to follow along.
>
> 👤 **aezendc** replied the **2025-06-03** at **03:34:32**:<br>
> > > do we have a step by step running microsoft/bitnet-b1.58-2B-4T-gguf in windows?
> >
> > There are build instructions with a lot more details for Windows [here](https://github.com/ikawrakow/ik_llama.cpp/blob/main/docs/build.md). Once it is built you can just grab the model either pre-converted one like [this](https://huggingface.co/tdh111/bitnet-b1.58-2B-4T-GGUF) or convert one yourself and just launch server. Which is covered in the above instructions.
> >
> > It seems like you have already figured it out, but just wanted to link the Windows build instructions in case anyone else finds this and wants to follow along.
>
> Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
>
> 👤 **saood06** replied the **2025-06-03** at **07:11:46**:<br>
> > Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
>
> Just to be sure, are you making sure to access the server using the port passed in when launching (or 8080 if not set as that is the default), and are you setting the host address (if needed) since it defaults to 127.0.0.1 (AKA localhost) which is only accessible on that machine.
>
> 👤 **aezendc** replied the **2025-06-03** at **12:28:17**:<br>
> > > Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
> >
> > Just to be sure, are you making sure to access the server using the port passed in when launching (or 8080 if not set as that is the default), and are you setting the host address (if needed) since it defaults to 127.0.0.1 (AKA localhost) which is only accessible on that machine.
>
> i am using the default http://127.0.0.1:8080/ but somehow it works now. Thanks for the info
>
> 👤 **aezendc** replied the **2025-06-04** at **14:40:21**:<br>
> > > Thanks for this @saood06 very helpful and a very detailed one. One thing I have a problem accessing the llama-server ui and its just keep loading.
> >
> > Just to be sure, are you making sure to access the server using the port passed in when launching (or 8080 if not set as that is the default), and are you setting the host address (if needed) since it defaults to 127.0.0.1 (AKA localhost) which is only accessible on that machine.
>
> How do you make the model respond longer?
>
> 👤 **saood06** replied the **2025-06-21** at **16:33:44**:<br>
> >How do you make the model respond longer?
>
> I don't have much specific advice for using this model. Beyond benchmarking and minor curiosity of the ability of a model this small, I haven't used it much.
>
> I'd be curious to hear what your experience with it has been? Is it useful (even if the responses are a bit short for your liking)?
>
> I've never actually found a great model and prompt context agnostic way to increase the length of a response without reducing the quality of the response, but my strategies are (in order of least effort to highest effort), are:
>
> * add context specific details or changes to the prompt given
> * break the task apart and only allow it to respond to a fraction at a time
> * manually steer the model to avoid skipping or missing out on details (often is easier with a thinking model as you often only have to steer during thinking tokens).
>
> 👤 **aezendc** replied the **2025-06-21** at **16:46:12**:<br>
> I fixed it now. The only remaining problem is libomp.so: I don't have that file from the build. I set OpenMP off because libggml.so needs libomp.so, and when I build llama-server on Windows and transfer the binaries to my Android phone, the model hallucinates.
View File

@@ -0,0 +1,56 @@
### 🗣️ [#403](https://github.com/ikawrakow/ik_llama.cpp/discussions/403) - Tool Calling and Structured Response (Json Mode) support
| **Author** | `mtcl` |
| :--- | :--- |
| **Created** | 2025-05-10 |
| **Updated** | 2025-05-30 |
---
#### Description
Hey Team,
Amazing work here. Compared to llama.cpp, the biggest feature that I see missing is support for tool calling. Do you have any plans to include it in the future roadmap? Or am I missing something and it already exists?
I am forced to use other frameworks, even though I like the inference speeds from ik_llama.cpp, just because I can't live without these features and want to swap it in natively behind the OpenAI Python client in my project.
I know that I can prompt the model in a particular way to force it to produce a JSON response. I am not looking for that.
Thank you in advance!
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-10** at **08:30:16**:<br>
Hey @mtcl,
we are a very small team, so cannot do everything that `llama.cpp` does. Hence, the strategy is to focus on few things, but do these things really well.
Please enter a feature request in the Issues. I'll label it with "help wanted" and we will see what happens.
> 👤 **mtcl** replied the **2025-05-10** at **08:33:02**:<br>
> No worries my friend. I have a workaround here that I've written.
>
> https://github.com/Teachings/FastAgentAPI
>
> It acts as a wrapper and get me by. Thank you for your hard work!
>
> 👤 **cmoncure** replied the **2025-05-30** at **19:58:13**:<br>
> Before I try and get this running, can you educate me on the mechanics of tool calling within the LLM response? I understand that the LLM may request a call as part of its TG phase, and then the call runner injects the result into the LLM response. Is this correct?
>
> I have some questions about this. Suppose I want to ask the LLM a question about a long document.
>
> What's the difference in outcome between:
> 1) Including the question and document in the prompt, and enduring the long PP time
> 2) Including the question in the prompt, and having the LLM retrieve the document instantly via tool call during TG, then going on to complete the response?
>
> Do all injected tokens need to undergo a form of 'PP during TG'? That would make the most sense, actually...
---
👤 **KCS-Mack** replied the **2025-05-18** at **22:28:59**:<br>
This is great, will give it a try!
View File

@@ -0,0 +1,351 @@
### 🗣️ [#434](https://github.com/ikawrakow/ik_llama.cpp/discussions/434) - Quant Cookers Basic Guide
| **Author** | `ubergarm` |
| :--- | :--- |
| **Created** | 2025-05-18 |
| **Updated** | 2025-05-21 |
---
#### Description
Quant Cooking Basic Guide
===
Example workflow for cooking custom quants with ik_llama.cpp that I used to generate [ubergarm/Qwen3-14B-GGUF](https://huggingface.co/ubergarm/Qwen3-14B-GGUF).
## Goal
The goal is to provide a specific example of methodology that can be adapted for future LLMs and quant types in general.
In this guide we will download and quantize the dense model [Qwen/Qwen3-14B](https://huggingface.co/Qwen/Qwen3-14B) on a gaming rig with a single 3090TI FE 24GB VRAM GPU.
We will use the latest [ik_llama.cpp quants](https://github.com/ikawrakow/ik_llama.cpp/pull/422) to target running this 14B model in GGUF format fully offloaded on <=16GB VRAM systems with 32k context.
This guide does *not* get into more complex things like MLA methodology e.g. converting fp8 to bf16 on older GPU hardware.
## Dependencies
This is all run on a Linux rig, but feel free to use WSL for a similar experience if you're limited to a windows based OS.
Install any build essentials, git, etc. We will use `uv` for python virtual environment management to keep everything clean.
```bash
# Setup folder to do your work and hold the models etc
mkdir /mnt/llms
cd /mnt/llms
# Install uv and python packages
# https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv ./venv --python 3.12 --python-preference=only-managed
source ./venv/bin/activate
uv pip install huggingface_hub[hf-xet]
# Start downloading the bf16 safetensors from huggingface
mkdir -p Qwen/Qwen3-14B
cd Qwen/Qwen3-14B
huggingface-cli download --local-dir ./ Qwen/Qwen3-14B
# Make a target directory to hold your finished quants for uploading to huggingface
mkdir -p ubergarm/Qwen3-14B-GGUF # use your name obviously
# Install mainline or evshiron llama.cpp forks just for the python scripts.
cd /mnt/llms
git clone git@github.com:ggml-org/llama.cpp.git
cd llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# Install and build ik_llama.cpp for the heavy lifting and SOTA GGUF quants.
cd /mnt/llms
git clone git@github.com:ikawrakow/ik_llama.cpp.git
cd ik_llama.cpp
cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1
cmake --build build --config Release -j $(nproc)
# Download your imatrix corpus and wiki.test.raw test corpus.
wget https://gist.githubusercontent.com/tristandruyen/9e207a95c7d75ddf37525d353e00659c/raw/571fda718462de863e5a0171078c175420c7649a/calibration_data_v5_rc.txt
wget https://huggingface.co/datasets/ikawrakow/validation-datasets-for-llama.cpp/resolve/main/wiki.test.raw.gz
gunzip wiki.test.raw.gz
# Okay, now your folders should look something like this, and you are ready to begin cooking!
cd /mnt/llms
tree
.
├── venv
├── ik_llama.cpp
├── llama.cpp
├── Qwen
│ └── Qwen3-14B
└── ubergarm
└── Qwen3-14B-GGUF
```
## Convert bf16 safetensors to bf16 gguf
I generally use mainline llama.cpp or evshiron's fork for doing the conversion with the python script.
```bash
# This took less than 12GiB RAM and about 30 seconds
cd /mnt/llms
uv pip install -r llama.cpp/requirements/requirements-convert_hf_to_gguf.txt --prerelease=allow --index-strategy unsafe-best-match
python \
llama.cpp/convert_hf_to_gguf.py \
--outtype bf16 \
--split-max-size 50G \
--outfile ./ubergarm/Qwen3-14B-GGUF/ \
./Qwen/Qwen3-14B/
du -hc ./ubergarm/Qwen3-14B-GGUF/*.gguf
28G ./ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
```
## Generate imatrix
Notes:
1. This took just over 5 minutes on my high end gaming rig.
2. If you can't run the bf16 you could make a q8_0 without an imatrix and then use that as the "baseline" instead (see the sketch right after these notes).
3. I could offload 32 layers naively with `-ngl 32`, but do whatever you need to run inferencing, e.g. `-ngl 99 -ot ...` etc.
4. I don't bother with fancy calibration corpus nor extra context length as it isn't clearly proven to always improve results afaict.
5. Assuming you're offloading some to CPU, adjust threads as needed or set to exactly 1 if you are fully offloading to VRAM.
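As mentioned in note 2, here is a minimal sketch of that fallback (paths follow the layout above): quantize the bf16 GGUF straight to Q8_0, which does not need an imatrix, and then use that file in place of the bf16 below.

```bash
# Fallback if the bf16 doesn't fit: make a plain Q8_0 (no --imatrix needed at 8 bits)
# and point llama-imatrix / llama-perplexity at this file instead of the bf16.
./build/bin/llama-quantize \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
    /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf \
    Q8_0 \
    16
```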
```bash
cd ik_llama.cpp
./build/bin/llama-imatrix \
--verbosity 1 \
-m /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
-f calibration_data_v5_rc.txt \
-o ./Qwen3-14B-BF16-imatrix.dat \
-ngl 32 \
--layer-similarity \
--ctx-size 512 \
--threads 16
mv ./Qwen3-14B-BF16-imatrix.dat ../ubergarm/Qwen3-14B-GGUF/
```
## Create Quant Recipe
I personally like to make a bash script for each quant recipe. You can explore different mixes using layer-similarity or [other imatrix statistics tools](https://github.com/ggml-org/llama.cpp/pull/12718). Keep log files around with `./blah 2>&1 | tee -a logs/version-blah.log`.
I often like to start off with a pure q8_0 for benchmarking and then tweak as desired for target VRAM breakpoints.
```bash
#!/usr/bin/env bash
# token_embd.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}
#
# blk.28.ffn_down.weight, torch.bfloat16 --> BF16, shape = {17408, 5120}
# blk.28.ffn_gate.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
# blk.28.ffn_up.weight, torch.bfloat16 --> BF16, shape = {5120, 17408}
#
# blk.28.attn_output.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_q.weight, torch.bfloat16 --> BF16, shape = {5120, 5120}
# blk.28.attn_k.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
# blk.28.attn_v.weight, torch.bfloat16 --> BF16, shape = {5120, 1024}
#
# blk.28.attn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.ffn_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# blk.28.attn_k_norm.weight, torch.bfloat16 --> F32, shape = {128}
# blk.28.attn_q_norm.weight, torch.bfloat16 --> F32, shape = {128}
#
# output_norm.weight, torch.bfloat16 --> F32, shape = {5120}
# output.weight, torch.bfloat16 --> BF16, shape = {5120, 151936}
custom="
# Attention
blk\.[0-9]\.attn_.*\.weight=iq5_ks
blk\.[1-3][0-9]\.attn_.*\.weight=iq5_ks
# FFN
blk\.[0-9]\.ffn_down\.weight=iq5_ks
blk\.[1-3][0-9]\.ffn_down\.weight=iq5_ks
blk\.[0-9]\.ffn_(gate|up)\.weight=iq4_ks
blk\.[1-3][0-9]\.ffn_(gate|up)\.weight=iq4_ks
# Token embedding/output
token_embd\.weight=iq6_k
output\.weight=iq6_k
"
custom=$(
echo "$custom" | grep -v '^#' | \
sed -Ez 's:\n+:,:g;s:,$::;s:^,::'
)
./build/bin/llama-quantize \
--imatrix /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-imatrix.dat \
--custom-q "$custom" \
/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf \
/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf \
IQ4_KS \
16
```
## Perplexity
Run some benchmarks to compare your various quant recipes.
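As a reminder, the number being compared below is perplexity, the exponentiated average negative log-likelihood of the test corpus under the model, so a smaller increase over the BF16 value means less quantization loss:

$$
\mathrm{PPL} = \exp\Big(-\frac{1}{N}\sum_{i=1}^{N} \log p(x_i \mid x_{<i})\Big)
$$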
```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-Q8_0.gguf
./build/bin/llama-perplexity \
-m "$model" \
--ctx-size 512 \
--ubatch-size 512 \
-f wiki.test.raw \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
```
* BF16
- `Final estimate: PPL = 9.0128 +/- 0.07114`
* Q8_0
- `Final estimate: PPL = 9.0281 +/- 0.07136`
* [ubergarm/IQ4_KS](https://huggingface.co/ubergarm/Qwen3-14B-GGUF#qwen3-14b-iq4_ks)
- `Final estimate: PPL = 9.0505 +/- 0.07133`
* [unsloth/UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-14B-GGUF?show_file_info=Qwen3-14B-UD-Q4_K_XL.gguf)
- `Final estimate: PPL = 9.1034 +/- 0.07189`
* [bartowski/Q4_K_L](https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_K_L.gguf)
- `Final estimate: PPL = 9.1395 +/- 0.07236`
## KL-Divergence
You can run KLD if you want to measure how much smaller quants diverge from the unquantized model's outputs.
I have a custom ~1.6MiB `ubergarm-kld-test-corpus.txt` made from whisper-large-v3 transcriptions in plain text format from some recent episodes of [Buddha at the Gas Pump BATGAP YT Channel](https://www.youtube.com/c/batgap/videos).
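Roughly speaking, the tool compares the full next-token distribution of the quantized model $Q$ against the baseline $P$ at every position of the corpus and reports statistics (mean, median, percentiles) of the KL divergence:

$$
D_{\mathrm{KL}}(P \,\|\, Q) = \sum_i P_i \log\frac{P_i}{Q_i}
$$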
#### Pass 1 Generate KLD Baseline File
The output KLD base file can be quite large; in this case it is ~55GiB. If you can't run BF16, you could use Q8_0 as your baseline if necessary.
```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 32 \
--seed 1337 \
--threads 16
```
#### Pass 2 Measure KLD
This uses the above kld base file as input baseline.
```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-perplexity \
-m "$model" \
--kl-divergence-base /mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-BF16-ubergarm-kld-test-corpus-base.dat \
--kl-divergence \
-f ubergarm-kld-test-corpus.txt \
-fa \
-ngl 99 \
--seed 1337 \
--threads 1
```
This will report Perplexity on this corpus as well as various other statistics.
* BF16
- `Final estimate: PPL = 14.8587 +/- 0.09987`
* Q8_0
- `Mean PPL(Q) : 14.846724 ± 0.099745`
- `Median KLD: 0.000834`
- `99.0% KLD: 0.004789`
- `RMS Δp: 0.920 ± 0.006 %`
- `99.0% Δp: 2.761%`
* [ubergarm/IQ4_KS](https://huggingface.co/ubergarm/Qwen3-14B-GGUF#qwen3-14b-iq4_ks)
- `Mean PPL(Q) : 14.881428 ± 0.099779`
- `Median KLD: 0.004756`
- `99.0% KLD: 0.041509`
- `RMS Δp: 2.267 ± 0.013 %`
- `99.0% Δp: 6.493%`
* [unsloth/UD-Q4_K_XL](https://huggingface.co/unsloth/Qwen3-14B-GGUF?show_file_info=Qwen3-14B-UD-Q4_K_XL.gguf)
- `Mean PPL(Q) : 14.934694 ± 0.100320`
- `Median KLD: 0.006275`
- `99.0% KLD: 0.060005`
- `RMS Δp: 2.545 ± 0.015 %`
- `99.0% Δp: 7.203%`
* [bartowski/Q4_K_L](https://huggingface.co/bartowski/Qwen_Qwen3-14B-GGUF?show_file_info=Qwen_Qwen3-14B-Q4_K_L.gguf)
- `Mean PPL(Q) : 14.922353 ± 0.100054`
- `Median KLD: 0.006195`
- `99.0% KLD: 0.063428`
- `RMS Δp: 2.581 ± 0.015 %`
- `99.0% Δp: 7.155%`
## Speed Benchmarks
Run some `llama-sweep-bench` to see how fast your quants are over various context lengths.
```bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-c 32768 \
-ngl 99 \
--warmup-batch \
--threads 1
```
![sweep-bench-qwen3-14b-gguf-more-q4](https://github.com/user-attachments/assets/2ba1f817-c1b9-4648-9cab-5b759f56e4a2)
## Vibe Check
Always remember to actually *run* your model to confirm it is working properly and generating valid responses.
```bash
#!/usr/bin/env bash
model=/mnt/llms/ubergarm/Qwen3-14B-GGUF/Qwen3-14B-IQ4_KS.gguf
./build/bin/llama-server \
--model "$model" \
--alias ubergarm/Qwen3-14B-IQ4_KS \
-fa \
-ctk f16 -ctv f16 \
-c 32768 \
-ngl 99 \
--threads 1 \
--host 127.0.0.1 \
--port 8080
```
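For a quick smoke test you can also hit the server's `/completion` endpoint directly (the ChatML prompt below is just an example; use whatever template your model expects):

```bash
curl http://127.0.0.1:8080/completion -X POST \
  -H "Content-Type: application/json" \
  -d '{
    "prompt": "<|im_start|>user\nSay hello in one short sentence.<|im_end|>\n<|im_start|>assistant\n",
    "n_predict": 64,
    "temperature": 0.7
  }'
```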
## References
* [ik_llama.cpp old getting started guide](https://github.com/ikawrakow/ik_llama.cpp/discussions/258)
* [gist with some benchmarking gist methodology](https://gist.github.com/ubergarm/0f9663fd56fc181a00ec9f634635eb38#methodology)
* [ubergarm/Qwen3-14B-GGUF](https://huggingface.co/ubergarm/Qwen3-14B-GGUF)
---
#### 🗣️ Discussion
👤 **VinnyG9** replied the **2025-05-19** at **14:48:32**:<br>
thanks for this, can you point me to where I can read a description of:
-DGGML_RPC=OFF
--seed 1337
> 👤 **ubergarm** replied the **2025-05-19** at **15:07:31**:<br>
> > -DGGML_RPC=OFF
> > --seed 1337
>
> I had turned off the RPC backend build at some point because in the past I had enabled it to test some things; you can probably ignore it for the purposes of this guide. If you're interested, the RPC "remote procedure call" backend allows you to run [a client and server(s) distributed across multiple machines or processes](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) for distributed inferencing. However, it is very basic and lacking a variety of features, which makes it less than useful for most of my testing and purposes.
>
> > --seed 1337
>
> I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speek for [leet](https://www.urbandictionary.com/define.php?term=leet).
>
> 👤 **VinnyG9** replied the **2025-05-21** at **03:42:57**:<br>
> > > -DGGML_RPC=OFF
> > > --seed 1337
> >
> > The had turned off the RPC backend building at some point becuase in the past I had enabled it to test some things, you can probably ignore it for the purposes of this guide. If you're interested the RPC "remote procedure call" allows you to run [a client and server(s) distributed across multiple machines or processes](https://github.com/ggml-org/llama.cpp/tree/master/tools/rpc) for distributing inferencing. However, it is very basic and lacking a variety of features which make it less than useful in most of my testing and purposes.
> >
> > > --seed 1337
> >
> > I set the same random seed, just for fun, across all of my measurements in a hopeful attempt to reduce differences due to entropy. Not sure if it really matters. [1337](https://www.urbandictionary.com/define.php?term=1337) is leet speek for [leet](https://www.urbandictionary.com/define.php?term=leet).
>
> you nerds speak like i know what you're talking about xD
> what is it "seeding"?
> i thought it was a reference to the universe's "fine-structure constant"
View File

@@ -0,0 +1,174 @@
### 🗣️ [#451](https://github.com/ikawrakow/ik_llama.cpp/discussions/451) - Context reuse / context shift for long prompts
| **Author** | `SamuelOliveirads` |
| :--- | :--- |
| **Created** | 2025-05-23 |
| **Updated** | 2025-06-10 |
---
#### Description
Hi! — I'm coming from koboldcpp, and I've been testing this fork due to its optimizations.
One feature I found very useful in koboldcpp was the context shift functionality, which helps when working with very long context windows.
I noticed that `llama.cpp` implemented something similar in [PR #9866](https://github.com/ggml-org/llama.cpp/pull/9866), which allows for reusing the prompt cache more efficiently instead of regenerating the entire prompt every time the context overflows.
I searched through this repo but couldn't find an equivalent implementation.
Here's the issue I'm currently facing:
- I'm using a 62k context in Qwen 3.
- When the context overflows, the cache keeps my system prompt, but discards the conversation history.
- That leads to reprocessing ~58k tokens from scratch each time, which at ~40 tokens/sec takes several minutes per new message.
- With proper cache reuse (like in llama.cpp), this would take just seconds.
My question is:
- Is there already something similar to context reuse implemented here?
- If not, would this be something feasible to implement, perhaps inspired by how llama.cpp did it?
Thanks!
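For reference, mainline exposes this as a server flag after that PR; something along these lines (flag name and chunk size are from memory, so treat this as a sketch rather than exact syntax):

```bash
# mainline llama.cpp (not ik_llama.cpp): try to reuse cached KV chunks of at least
# 256 tokens via KV shifting instead of reprocessing the whole prompt on a partial match
./llama-server -m model.gguf -c 65536 --cache-reuse 256
```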
---
#### 🗣️ Discussion
👤 **mtcl** replied the **2025-05-30** at **16:47:09**:<br>
This is a very useful use case and the reason I have been switching back and forth between ik_llama.cpp and llama.cpp. This works seamlessly with llama.cpp, I have noticed. I always thought I was doing something wrong here and it was my user error, but apparently it is not! Thank you for mentioning it here.
---
👤 **cmoncure** replied the **2025-05-30** at **19:51:44**:<br>
This would be a massive win for me. Currently PP is the millstone around the neck (for which you have had to endure many of my ignorant comments in support of a solution).
KV Cache reuse and tool calling would open up whole new worlds.
> 👤 **mtcl** replied the **2025-06-05** at **02:26:48**:<br>
> I agree 100% with you. Given that I built my own tool calling solution for ik_llama.cpp, at this point of time kv cache reuse would mean an instant switch for me to this!
---
👤 **SamuelOliveirads** replied the **2025-06-03** at **21:52:10**:<br>
Glad to see that others are also interested in this feature! I was about to open an issue myself, but I noticed that @saood06 is already looking into something similar [here](https://github.com/ikawrakow/ik_llama.cpp/issues/455#issuecomment-2917718499) — so now it's just a matter of waiting.
By the way, @saood06, if you need any help with testing, I'd be happy to assist.
> 👤 **saood06** replied the **2025-06-06** at **09:16:14**:<br>
> Since there does seem to be demand, and people waiting, I'll provide an update which explains what my plan is (and the benefits, but also the limitations), and the current status.
>
> The goal is to create a new mechanism where if enabled a [trie](https://en.wikipedia.org/wiki/Trie) of all processed tokens is kept that can be saved and restored to a file. This should allow you to keep every explored branch of a session (or multiple if you share a large initial prompt between sessions) with the least amount of space and no quality loss.
>
> This may only be viable on MLA models as they are extremely light for KV cache, and this method does not degrade quality like chunking or shifting, but for that reason this does not handle the common case of shifting the cache when you want to remove the thought tokens without having to reprocess as there is no way to do that without losing (at least some) quality.
>
> I was stalled because of #436 but now that saving and loading works I am now unblocked, but this still seems like a large undertaking and may take some time.
>
> I may end up porting the chunk/shift method (or @cmoncure is welcome to do it) anyway (even before I finish), since as I said they have different tradeoffs, but integrating the two fully as nice as it sounds (which would let you be able to chunk and shift from the trie) seems way too difficult.
>
> 👤 **cmoncure** replied the **2025-06-06** at **15:16:33**:<br>
> Do you have any insight into the nature or mechanism behind the quality loss with chunking?
>
> 👤 **ikawrakow** replied the **2025-06-06** at **15:29:13**:<br>
> Are we talking about the `llama.cpp` feature (taken from kobold.cpp) where if I have
> ```
> aaaaccccbbbb
> ```
> in the KV cache, and the new context is
> ```
> aaaabbbb
> ```
> I can reuse the full `aaaabbbb` (mainline `llama.cpp`) instead of just reusing `aaaa` as it happens here?
>
> If so, here is an example:
>
> **KV cache:** Yesterday I saw a movie. I absolutely enjoyed it. The main actor was ...
> **New context:** Yesterday I saw a movie. The main actor was
>
> Suppose **New context** is in the context of the worst movie you have ever seen, so you expect "a disaster" or some such.
> The existing KV cache, despite context shifting and all that, will be heavily biased towards "brilliant", "amazing" and such.
>
> Do you see the problem? You cannot undo the impact of the skipped tokens by just changing the position encoding via RoPE.
>
> 👤 **saood06** replied the **2025-06-06** at **15:41:47**:<br>
> > Are we talking about the `llama.cpp` feature (taken from kobold.cpp) where if I have
>
> Yes that is what we are talking about. Thank you for the very clear example (so much better than what I was typing out).
>
> I'm not sure this is from kobold.cpp. I know they offer a much better context shift where they effectively keep the context full at all times once you hit the limit unlike llama.cpp and here where the context shift unnecessarily removes far more tokens than is needed (I think half) and thus shifts are less frequent. Kobold.cpp on the other hand shifts every token which keeps the maximum information allowed at all times.
>
> 👤 **cmoncure** replied the **2025-06-06** at **19:40:13**:<br>
> >You cannot undo the impact of the skipped tokens by just changing the position encoding via RoPE.
>
> So...
>
> 1. KV Cache is a Key-Value cache
> 2. KV Cache as a "memoization" technique stores the results of the expensive PP computation for reuse.
> 3. But the PP computation is cumulative in such a way that the presence and order of tokens matters.
> 4. Once a token has acted on the KV cache, its effect poisons the KV cache indelibly.
>
> Questions:
>
> 1. Is the effect of tokens on the KV cache _additive_ or _multiplicative_ (or something else)? If additive, can the effect of tokens removed from the prompt be recalculated and their effect subtracted?
> 2. If the presence of token PP computation in the KV cache poisons it forever, then doesn't that imply that tokens outside the context window can continue to affect generation? That would contradict my mental model of how all this is supposed to work. Edit: I suppose that's why the whole thing must be scrapped each time when the context window fills up. It makes sense.
>
> 👤 **saood06** replied the **2025-06-07** at **06:17:39**:<br>
> > 4. Once a token has acted on the KV cache, its effect poisons the KV cache indelibly.
> >
> >
> > Questions:
> >
> > 2. If the presence of token PP computation in the KV cache poisons it forever, then doesn't that imply that tokens outside the context window can continue to affect generation? That would contradict my mental model of how all this is supposed to work. Edit: I suppose that's why the whole thing must be scrapped each time when the context window fills up. It makes sense.
>
> No. If that were the case then you could not have multiple slots which serve independent users that share the KV cache, but that is a well supported use case.
>
> The tokens do not "poison" the cache, it is just that a token holds the information of all prior tokens from that sequence when it was calculated. If you get rid of tokens and then shift tokens that had come after the now deleted tokens in order to re-use them the shifted tokens will still contain the information from the deleted tokens.
>
> To add to the the example given above with the movie, even though you removed the tokens "I absolutely enjoyed it.", their influence is not gone if you keep the tokens after and shift them.
>
> If you shift "The main actor was" then you will see the influence of the removed tokens (but it will be much faster as you are not recomputing those tokens).
>
> If you do recompute the tokens "The main actor was" and do not shift then it will be slower (as you have to actually compute the tokens again) but you will not experience the lingering impact of "I absolutely enjoyed it."
>
> 👤 **cmoncure** replied the **2025-06-10** at **02:35:21**:<br>
> >If you do recompute the tokens "The main actor was" and do not shift then it will be slower (as you have to actually compute the tokens again) but you will not experience the lingering impact of "I absolutely enjoyed it."
>
> Forgive me if I've misunderstood. Suppose we have the following prompt:
>
> `AAAABBBBCCCC`
>
> Then we can understand the state of the fully processed KV cache to be something like the following, where some function `f(X) :-> x` gives the "effect" of the token on subsequent tokens:
>
> `A A A A Ba Ba Ba Ba Cab Cab Cab Cab`
>
> I'm stretching the truth a bit here for the purposes of a convenient representation. But the above illustrates that each part of the prompt carries with it information about the previous parts.
>
> Suppose that our context grows and our `A` tokens must be pushed off the top of the context window. Then we have some intermediate state
>
> `Ba Ba Ba Ba Cab Cab Cab Cab D D D D`
>
> In order to create a properly functioning KV cache, we have to effectuate the following:
>
> 1. The effect of `A` tokens must be removed from `B` and `C`
> 2. D tokens must take into account `B` and `C`
>
> So that finally, we have
>
> `B B B B Cb Cb Cb Cb Dbc Dbc Dbc Dbc`
>
> The way this is currently achieved is (if I am not mistaken) by dropping and re-processing the entire cache pertaining to the prompt, which is expensive, suggesting an algorithmic complexity of O(n^2). Can we not instead of re-processing the entire prompt, simply calculate f(A) and subtract it from the following tokens (or the inverse f'(A) and add it):
>
> `Ba Ba Ba Ba Cab Cab Cab Cab` - f(A) => `B B B B Cb Cb Cb Cb`
>
> Finally computing the rest of the prompt only against D:
>
> `D D D D` + f(B) + F(C) => `Dbc Dbc Dbc Dbc`
>
> Then concatenate the two to get the desired state? I'm still reading through llama.cpp... it's a lot.
---
👤 **cmoncure** replied the **2025-06-05** at **18:35:28**:<br>
Might have to do it myself.
View File

@@ -0,0 +1,414 @@
### 🗣️ [#459](https://github.com/ikawrakow/ik_llama.cpp/discussions/459) - qwen3 metrics on ancient hardware (2x xeon Vs 2x P100)
| **Author** | `VinnyG9` |
| :--- | :--- |
| **Created** | 2025-05-15 |
| **Updated** | 2025-05-28 |
---
#### Description
So I set a snoop mode in the BIOS called Home Dir w/ OSB+ (which does a kind of opportunistic/speculative snoop broadcast), and it gives a big boost with NUMA enabled.
all tests with HT off
# p100 numa off, numa balancing=0
CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 ~/Projects/ik_llama.cpp/build/bin/llama-bench -t 16 -p 64,128,256 -n 32,64,128 -m /media/gguf/moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "([3][2-9]|[4-9][0-9])\.ffn_.*_exps\.=CPU" -ot "([4][7-9]|[5-9][0-9])\.(attn|ffn)_.*(q|k|v|norm|inp|output)\.=CUDA1","([11|12|13|14|15])\.ffn_.*_exps\.=CUDA1" -fa 1 -fmoe 1 -rtr 1 -sm layer --numa isolate -amb 512
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 2 CUDA devices:
Device 0: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
Device 1: Tesla P100-PCIE-16GB, compute capability 6.0, VMM: yes
| model | size | params | backend | ngl | threads | fa | amb | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ----: | --: | ---: | ------------: | ---------------: |
============ Repacked 187 tensors
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 1 | 1 | pp64 | 27.35 ± 0.53 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 1 | 1 | pp128 | 33.71 ± 0.10 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 1 | 1 | pp256 | 38.88 ± 0.12 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 1 | 1 | tg32 | 7.26 ± 0.05 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 1 | 1 | tg64 | 7.18 ± 0.00 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 1 | 1 | tg128 | 7.17 ± 0.01 |
### 4 experts
| model | size | params | backend | ngl | threads | fa | amb | ser | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ----: | ---------: | --: | ---: | ------------: | ---------------: |
============ Repacked 187 tensors
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 4,1 | 1 | 1 | pp64 | 41.04 ± 1.05 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 4,1 | 1 | 1 | pp128 | 52.35 ± 0.30 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 4,1 | 1 | 1 | pp256 | 61.34 ± 0.48 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 4,1 | 1 | 1 | tg32 | 10.48 ± 0.01 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 4,1 | 1 | 1 | tg64 | 10.27 ± 0.20 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 16 | 1 | 512 | 4,1 | 1 | 1 | tg128 | 10.10 ± 0.00 |
### --numa distribute, GPUs on node0, numa_balancing=1
CUDA_VISIBLE_DEVICES=0,1 ~/Projects/ik_llama.cpp/build/bin/llama-bench -t 31 -p 64,128,256 -n 32,64,128 -m /media/gguf/moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "([3][2-9]|[4-9][0-9])\.ffn_.*_exps\.=CPU" -ot "([4][7-9]|[5-9][0-9])\.(attn|ffn)_.*(q|k|v|norm|inp|output)\.=CUDA1","([11|12|13|14|15])\.ffn_.*_exps\.=CUDA1" -fa 1 -fmoe 1 -rtr 1 -sm layer --numa distribute -amb 512 -ser 4,1
| model | size | params | backend | ngl | threads | fa | amb | ser | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ----: | ---------: | --: | ---: | ------------: | ---------------: |
============ Repacked 187 tensors
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp64 | 45.25 ± 0.57 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp128 | 59.36 ± 1.82 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp256 | 72.79 ± 1.03 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg32 | 9.71 ± 0.27 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg64 | 9.93 ± 0.08 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg128 | 9.92 ± 0.12 |
### ubergarm's quant
| model | size | params | backend | ngl | threads | fa | amb | ser | ts | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ----: | ---------: | ------------ | --: | ---: | ------------: | ---------------: |
============ Repacked 220 tensors
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1.00 | 1 | 1 | pp64 | 41.39 ± 1.64 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1.00 | 1 | 1 | pp128 | 52.51 ± 0.57 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1.00 | 1 | 1 | pp256 | 60.54 ± 0.79 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1.00 | 1 | 1 | tg32 | 7.22 ± 0.07 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1.00 | 1 | 1 | tg64 | 6.96 ± 0.13 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 512 | 4,1 | 1.00 | 1 | 1 | tg128 | 6.81 ± 0.10 |
build: b3036a87 (3701)
and for the giggles:
# CPU Only xeon 2697A v4 x2, numa_balancing=1, 4 experts
CUDA_VISIBLE_DEVICES= ~/Projects/ik_llama.cpp/build/bin/llama-bench -t 31 -p 32,64,128 -n 32,64,128,256 -m /media/gguf/moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 0 -nkvo 0 -fa 1 -fmoe 1 -rtr 1 -sm layer --numa distribute -amb 512 -ser 4,1
ggml_cuda_init: failed to initialize CUDA: no CUDA-capable device is detected
WARNING: /proc/sys/kernel/numa_balancing is enabled, this has been observed to impair performance
| model | size | params | backend | ngl | threads | fa | amb | ser | rtr | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | -: | ----: | ---------: | --: | ---: | ------------: | ---------------: |
============ Repacked 659 tensors
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp32 | 34.41 ± 2.53 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp64 | 44.84 ± 1.45 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp128 | 54.11 ± 0.49 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | pp256 | 55.99 ± 2.86 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg32 | 6.73 ± 0.14 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg64 | 7.28 ± 0.38 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg128 | 8.29 ± 0.25 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 0 | 31 | 1 | 512 | 4,1 | 1 | 1 | tg256 | 8.65 ± 0.20 |
### ~~What happened?~~

~~when i try to load the 235B IQ3k/Q4 on 32GB vram +128GB it throws this error~~
~~![Image](https://github.com/user-attachments/assets/35f4f79c-44a0-4c89-b901-d591d6d00c77)~~

~~i tried many regex combinations redirecting tensors to CUDA1 etc but it always tries to allocate 100GB+ on CUDA0 as buffer~~

~~![Image](https://github.com/user-attachments/assets/94857d2d-7fe3-4a78-8e54-888df09e19d2)~~

~~Edit: fixed by disabling cublas~~
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-05-15** at **04:26:42**:<br>
Your regex is incorrect, so everything goes to the GPU. Try `-ot exps=CPU` instead. When that works and you see how much VRAM you have left on each GPU, you can offload some of the experts to the GPU using additional regular expressions that precede the `exps=CPU` expression.
---
👤 **VinnyG9** replied the **2025-05-15** at **14:08:28**:<br>
> You regex is incorrect, so everything goes to the GPU. Try `-ot exps=CPU` instead. When that works and you see how much VRAM you have left on each GPU, you can offload some of the experts to the GPU using additional regular expressions for that that precede the `exps=CPU` expression.
the regex works, i can see the override being applied, but thanks for the hint at shortening it
since both mainline and ik_llama were ignoring the --tensor-split i set, i got around it by explicitly overriding every tensor, distributing them equally between the 2x 16GB GPUs
this let me fill both cards, but performance in both repos was pretty bad, like 3 pp / 5 tg; this didn't change with -nkvo so not sure what's going on. tried both ubergarm/unsloth quants, -fmoe/-fa on/off
offload split was
10 exp layers each gpu
47 remaining layers tensors each gpu
i found this enlightening
https://nvidia.github.io/TensorRT-LLM/advanced/expert-parallelism.html
---
👤 **ikawrakow** replied the **2025-05-15** at **14:13:55**:<br>
The attention tensors are on the GPU, so you don't really want to use `-nkvo` (unless extremely desperate to save more VRAM).
What is the quantization type you are using? Full log, including command line are always very useful. If the log output is too long, you can put it in a gzipped text file and attach it to the issue.
---
👤 **VinnyG9** replied the **2025-05-15** at **17:31:23**:<br>
when i do "exps\.=CPU" only 6GB total are offloaded to the GPUs is that normal?
in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
`ggml_backend_cuda_buffer_type_alloc_buffer: allocating 324566.07 MiB on device 0: cudaMalloc failed: out of memory
`
>What is the quantization type you are using?
@ubergarm @IQ3
ram is 4x2400 ddr4
build flags
`cmake -B build -DGGML_CUDA=ON -DCMAKE_CUDA_ARCHITECTURES="60" -DGGML_NATIVE=1
`
command
` CUDA_VISIBLE_DEVICES=0,1 numactl --cpunodebind=0 ik_llama.cpp/build/bin/llama-bench -t 16 -p 64 -n 32 -m gguf/moe/Qwen3-235B-A22B-mix-IQ3_K-00001-of-00003.gguf -ngl 94 -ot "([1-4][0-9]|[6-9][0-9])\.ffn_.*_exps\.=CPU" -ot "([4][7-9]|[5-9][0-9])\.(attn|ffn)_.*(q|k|v|norm|inp|output)\.=CUDA1","([5][0-9])\.ffn_.*_exps\.=CUDA1" -ot "([4][0-6]|[0-3][0-9])\.(attn|ffn)_.*(q|k|v|norm|inp|output)\.=CUDA0","([0-9])\.ffn_.*_exps\.=CUDA0" -v -fa 1 -fmoe 1`
log> https://pastebin.com/1VEd7tuD
---
👤 **VinnyG9** replied the **2025-05-15** at **18:31:10**:<br>
this tensor override thing makes no sense: i'm testing the Q2K quant, it's using 40% of vram, and if i set only one more tensor/layer the cuda malloc explodes
---
👤 **Ph0rk0z** replied the **2025-05-15** at **21:23:16**:<br>
>in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
if you compile with pipeline parallel copies of 1, I think it's same as putting ngl 94. You can also try 93 and put some ffn*experts in order on the GPUs. (0,1,2,3,etc) The way it looks now is you randomly throw random layers all over the place. Those "blk.20.ffn_norm.weight" shits don't really do anything to improve speed when on GPU.
I had best luck with numa distribute. Maybe you should do a benchmark of your ram bandwidth with mlc and see what you get. Then you'd know if its "good" or not.
---
👤 **ubergarm** replied the **2025-05-16** at **21:30:59**:<br>
@Fuckingnameless
There is some more discussion on `-ot` and compile options in [this discussion for the quant](https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF/discussions/1#681642d4a383b2fb9aa3bd8c) (others chime in on that thread too with some of their examples). Sorry the info is so spread out and you have to dig through numerous threads on various platforms, but things move pretty fast and there are so many hardware configurations.
Also as @Ph0rk0z you might want to try compiling with `-DGGML_SCHED_MAX_COPIES=1` as multi-gpu folks have reported that makes it allocate how much they expect. I don't use multi-gpu regularly so haven't messed with it much.
Take your time and be systematic about your changes and regex and you'll get it dialed in.
If you're 128GB RAM is in two numa nodes, consider changing bios to try to get it into a single numa node. Otherwise if you are forced to use multiple NUMA nodes, like @Ph0rk0z mentions, you can try stuff like `echo 0 | sudo tee /proc/sys/kernel/numa_balancing` and `numactl --interleave=all llama-server ... --numa distribute` etc...
I like to use `llama-sweep-bench` to test the various configurations and decide which one suits my needs best.
have fun!
---
👤 **VinnyG9** replied the **2025-05-17** at **01:18:44**:<br>
> > in contrast if i offload 95 instead of 94 layers it triggers the 300GB alloc bug again:
>
> if you compile with pipeline parallel copies of 1, I think it's same as putting ngl 94. You can also try 93 and put some ffn*experts in order on the GPUs. (0,1,2,3,etc) The way it looks now is you randomly throw random layers all over the place. Those "blk.20.ffn_norm.weight" shits don't really do anything to improve speed when on GPU.
>
like i said i have to explicitly set these normal layers otherwise it's not offloading to gpu2
and the reason i split it "all over" is so that the exp/attn tensors for a given layer stay on the same gpu when said layer is offloaded, may not make a difference but this is all trial and error anyway
> I had best luck with numa distribute. Maybe you should do a benchmark of your ram bandwidth with mlc and see what you get. Then you'd know if its "good" or not.
yeah i need to do some benchmarks
i found the issue: I'd forgotten the -rtr flag. yesterday i tried the Q2K_L from unsloth and got 38 pp / 7 tg, today i got 5 tg, not sure why
with 4 active experts tg goes up 60%
numa is not working right for me i need to fiddle with snoop modes is my guess
---
👤 **VinnyG9** replied the **2025-05-17** at **01:25:58**:<br>
> [@Fuckingnameless](https://github.com/Fuckingnameless)
>
> There is some more discussion on `-ot` and compile options on [this discussion for the quant](https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF/discussions/1#681642d4a383b2fb9aa3bd8c) (others chime in on that thread too with some of their examples). Sorry the info is so spread out and you have to dig through numerous threads on various platforms, but things move pretty fast and there are so many hardware configurations.
>
> Also, as [@Ph0rk0z](https://github.com/Ph0rk0z) mentioned, you might want to try compiling with `-DGGML_SCHED_MAX_COPIES=1`, as multi-GPU folks have reported that makes it allocate the amount of VRAM they expect. I don't use multi-GPU regularly so I haven't messed with it much.
>
> Take your time and be systematic about your changes and regex and you'll get it dialed in.
>
> If your 128GB of RAM is in two NUMA nodes, consider changing the BIOS to get it into a single NUMA node. Otherwise, if you are forced to use multiple NUMA nodes, like [@Ph0rk0z](https://github.com/Ph0rk0z) mentions, you can try stuff like `echo 0 | sudo tee /proc/sys/kernel/numa_balancing` and `numactl --interleave=all llama-server ... --numa distribute` etc...
>
> I like to use `llama-sweep-bench` to test the various configurations and decide which one suits my needs best.
>
> have fun!
I'll check `--interleave=all`; I can confirm numa_balancing = 0 helps even when doing `--cpunodebind=0`.
My BIOS has an on/off option for NUMA and that's it, but there are plenty of interleaving options.
I was actually using 128GB with 4x32GB RAM sticks in a single node yesterday.
> -DGGML_SCHED_MAX_COPIES=1
I thought that was the default; I also read somewhere that doing 2 copies, aka data parallel, could be interesting on dual-socket systems?
---
👤 **ubergarm** replied the **2025-05-17** at **14:41:33**:<br>
@Fuckingnameless
> i was actually using 128GB with 4x32GB ram sticks single node yesterday
Yeah best performance today tends to be setting all RAM into a *single* NUMA node then don't bother with numactl etc. Keeps it a bit more simple that way too. So this might be your best BIOS config for now.
> i thought that was default, also read somewhere that doing 2 copies aka data parallel could be interesting on dual socket systems?
The default is `GGML_SCHED_MAX_COPIES=4`, which seems to cause confusion for multi-GPU folks when it allocates more VRAM than they expect, is my impression.
So "data parallel" is not implemented in any llama.cpp in the sense of loading the entire model weights into RAM multiple times, once for each NUMA node. It does exist somewhat in ktransformers when compiling that with `USE_NUMA=1`, where it can run on exactly 2x NUMA nodes. There are various experimental PRs for llama.cpp attempting to implement this using hugepages allocations etc., but in my experience it didn't speed things up much on a dual socket 6980P (Intel has no equivalent of NPS0 afaict).
Things like vllm and sglang do have "proper" tensor-parallel and data-parallel, but only for multi-GPU nodes, not CPU NUMA nodes afaict.
I have a [whole discussion on the NUMA stuff here](https://github.com/ggml-org/llama.cpp/discussions/12088) with a link to that experimental mirror branch with more discussions there.
---
👤 **Ph0rk0z** replied the **2025-05-17** at **15:03:48**:<br>
>Also as @Ph0rk0z you might want to try compiling with -DGGML_SCHED_MAX_COPIES=1
Exact same results as taking a single layer off. Technically you manually decide what's on GPU anyway so NGL becomes irrelevant.
>like i said i have to explicitly set these normal layers otherwise it's not offloading to gpu2
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12)\.ffn.*=CUDAx" \
or exp marked layers
-ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.exps.=CUDAx"
If you do it sequentially and just fill as many layers before OOM, you'll have a better time. Put the -ot CPU line last to catch whatever *isn't* on gpu. CUDA0, CUDA1, on and on. -ot line for each.
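Spelled out, the sequential fill pattern described above might look like this (the layer ranges and model name are illustrative; grow each GPU's range until just before OOM):
```bash
./build/bin/llama-server -m /path/to/model.gguf -ngl 99 -fa -fmoe \
  -ot "blk\.([0-9]|1[0-5])\.ffn_.*_exps\.=CUDA0" \
  -ot "blk\.(1[6-9]|2[0-9]|3[0-1])\.ffn_.*_exps\.=CUDA1" \
  -ot "\.ffn_.*_exps\.=CPU"   # catch-all kept last: whatever isn't matched above stays on CPU
```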
---
👤 **VinnyG9** replied the **2025-05-18** at **02:01:19**:<br>
> > Also as [@Ph0rk0z](https://github.com/Ph0rk0z) you might want to try compiling with -DGGML_SCHED_MAX_COPIES=1
>
> Exact same results as taking a single layer off. Technically you manually decide what's on GPU anyway so NGL becomes irrelevant.
>
> > like i said i have to explicitly set these normal layers otherwise it's not offloading to gpu2
>
> -ot "blk.(0|1|2|3|4|5|6|7|8|9|10|11|12).ffn.*=CUDAx" \
>
> or exp marked layers
>
> -ot "blk.(34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49|50).ffn.exps.=CUDAx"
>
> If you do it sequentially and just fill as many layers before OOM, you'll have a better time. Put the -ot CPU line last to catch whatever _isn't_ on gpu. CUDA0, CUDA1, on and on. -ot line for each.
For some reason it's not respecting what I set; I just checked again, and whatever exps are not redirected to CPU via `-ot` go into CUDA1.
I updated the OP with benchmarks
---
👤 **Ph0rk0z** replied the **2025-05-18** at **11:33:22**:<br>
Try some different regexes for CPU. In the benchmark command line above, it's missing the wildcard.
---
👤 **VinnyG9** replied the **2025-05-20** at **14:49:53**:<br>
```
$ CUDA_VISIBLE_DEVICES=0,1 bin/llama-bench -t 31 -p 64,128,256 -n 32,64,128 -m moe/Qwen3-235B-A22B-UD-Q2_K_XL-00001-of-00002.gguf -ngl 94 -ot "blk.([0-9]|[1][0-3]).ffn_.*=CUDA1","output.=CUDA1","blk.([0-3][0-9]|4[0-6]).ffn_norm.=CUDA1" -ot "blk.(4[7-9]|[5-9][0-9]).ffn_norm.=CUDA0" -ot "blk.([3][1-9]|[4-9][0-9]).ffn_.*=CPU" -fa 1 -fmoe 1 -rtr 1 --numa distribute
```
norm layers split 1/1, output layers on last gpu
### p100 2 node 2 cpu
| model | size | params | backend | ngl | threads | fa | rtr | fmoe | test | t/s |
| ----------------------------------- | ----------: | ---------: | --------- | ----: | --------: | ---: | ----: | -----: | ------: | --------------: |
| ============ Repacked 189 tensors | | | | | | | | | | |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 1 | 1 | pp64 | 31.47 ± 1.52 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 1 | 1 | pp128 | 42.14 ± 0.61 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 1 | 1 | pp256 | 50.67 ± 0.36 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 1 | 1 | tg32 | 8.83 ± 0.08 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 1 | 1 | tg64 | 8.73 ± 0.10 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 1 | 1 | tg128 | 9.15 ± 0.15 |
| build: 2ec2229f (3702) | | | | | | | | | | |
### 4 exps
| model | size | params | backend | ngl | threads | fa | ser | rtr | fmoe | test | t/s |
| ----------------------------------- | ----------: | ---------: | --------- | ----: | --------: | ---: | ----: | ----: | -----: | ------: | --------------: |
| ============ Repacked 189 tensors | | | | | | | | | | | |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | pp64 | 44.32 ± 1.60 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | pp128 | 59.13 ± 0.77 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | pp256 | 73.35 ± 1.55 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | tg32 | 11.29 ± 0.15 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | tg64 | 11.35 ± 0.10 |
| qwen3moe ?B Q2_K - Medium | 81.96 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | tg128 | 11.74 ± 0.22 |
| | | | | | | | | | | | |
### ubergarm's quant
| model | size | params | backend | ngl | threads | fa | ser | rtr | fmoe | test | t/s |
| ----------------------------------- | -----------: | ---------: | --------- | ----: | --------: | ---: | ----: | ----: | -----: | ------: | --------------: |
| ============ Repacked 213 tensors | | | | | | | | | | | |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | pp64 | 39.93 ± 2.54 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | pp128 | 53.61 ± 1.04 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | pp256 | 64.34 ± 0.73 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | tg32 | 8.17 ± 0.10 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | tg64 | 8.33 ± 0.08 |
| qwen3moe ?B IQ3_K - 3.4325 bpw | 106.83 GiB | 235.09 B | CUDA | 94 | 31 | 1 | 4,1 | 1 | 1 | tg128 | 8.78 ± 0.31 |
| build: 2ec2229f (3702) | | | | | | | | | | | |
---
👤 **saood06** replied the **2025-05-25** at **05:08:13**:<br>
> ~~Edit; fixed by disabling cublas~~
~Can this be closed then?~
Edit: a discussion makes a lot more sense. Thanks @ikawrakow
> 👤 **ikawrakow** replied the **2025-05-25** at **07:36:49**:<br>
> Yes, I thought this could be useful info for some people.
---
👤 **VinnyG9** replied the **2025-05-25** at **12:51:11**:<br>
Trying to figure out why I was seeing a performance drop with NUMA CPU inference on Debian, I tried the xanmod 6.12/6.14 kernels, upgraded to debian-testing, and tried CUDA 12-8/12-9, one change at a time; the best I could get was 32 t/s on Qwen3 30B.
Also, memory mapping doesn't work.
Booted back into vanilla Linux Mint:
41 t/s xD
I'm now a distrohopper.
> 👤 **Ph0rk0z** replied the **2025-05-25** at **18:14:19**:<br>
> I've been using xanmod-v3 with mint. Since my CPUs identify as skylake-x, I might try the V4 version and see if there is some difference.
>
> 👤 **VinnyG9** replied the **2025-05-26** at **15:27:17**:<br>
> > I've been using xanmod-v3 with mint. Since my CPUs identify as skylake-x, I might try the V4 version and see if there is some difference.
>
> On Mint I had no luck with xanmod v3 either; it was like 15% slower.
>
> 👤 **Ph0rk0z** replied the **2025-05-27** at **14:35:27**:<br>
> going to have to try and compare a regular kernel of the same version. V4 xanmod seems behind for ubuntu 22.04 based distros, there was no 6.12 even. V3 has been serving me well for more than a year so I'm curious if I get higher memory b/w or other difference that would change t/s.
>
> I'm having a crazy time with GGML_SCHED_MAX_COPIES. I'm not sure what's being offloaded when you set it to 1 and do all model layers. CUDA host compute buffer is smaller but whatever ends up on my other cards forces me to remove 3 gate layers. In theory TG is better but not PP. Maybe I can make up for it. Also means I have to test qwen again because this is deepseek. I'm going to keep juicing the turnip just like you.
>
> 👤 **VinnyG9** replied the **2025-05-28** at **20:13:36**:<br>
> > going to have to try and compare a regular kernel of the same version. V4 xanmod seems behind for ubuntu 22.04 based distros, there was no 6.12 even. V3 has been serving me well for more than a year so I'm curious if I get higher memory b/w or other difference that would change t/s.
> >
> > I'm having a crazy time with GGML_SCHED_MAX_COPIES. I'm not sure what's being offloaded when you set it to 1 and do all model layers. CUDA host compute buffer is smaller but whatever ends up on my other cards forces me to remove 3 gate layers. In theory TG is better but not PP. Maybe I can make up for it. Also means I have to test qwen again because this is deepseek. I'm going to keep juicing the turnip just like you.
>
> I don't even bother with obscure llama.cpp flags anymore; it's usually a waste of time. I just build for the CUDA arch I'm using, set GGML_NATIVE=1, and that's it.
---
👤 **VinnyG9** replied the **2025-05-25** at **13:22:48**:<br>
235B Q2 not so bad?
https://oobabooga.github.io/benchmark.html

View File

@@ -0,0 +1,30 @@
### 🗣️ [#466](https://github.com/ikawrakow/ik_llama.cpp/discussions/466) - A curiosity.
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2025-05-28 |
| **Updated** | 2025-06-08 |
---
#### Description
I made a little fork of Llama.cpp mainline, integrating some commits of IK_Llama, and am able to quantize (for now) in q6_0, IQ3_K, IQ4_K, IQ5_K and IQ6_K.
It's based on b5474 for now, and now I can use the wonderful q6_0 and IQ6_K for any model supported by mainline.
Here's the first alpha : https://github.com/Nexesenex/croco.cpp/releases/tag/v0.01
Edit : https://github.com/Nexesenex/croco.cpp/releases/tag/NXS_v0.04_b5525
Edit 2 : https://github.com/Nexesenex/croco.cpp/releases/tag/v1.93040_b5600_RMv1.11.8 (with NXS_Llama_v0.13_b5600), an attempt to make the R4 quants supported on CUDA work.
---
#### 🗣️ Discussion
👤 **VinnyG9** replied the **2025-05-28** at **20:14:51**:<br>
Any performance numbers?
> 👤 **Nexesenex** replied the **2025-05-29** at **07:05:33**:<br>
> None, it barely works for a part of its purpose, which is to quantize models with some IQ quants within the mainline framework.
> PPL tests work also, as well as CUDA inference for Gemma 3 in 0.04. And that's it for now. ^^

File diff suppressed because one or more lines are too long

View File

@@ -0,0 +1,478 @@
### 🗣️ [#491](https://github.com/ikawrakow/ik_llama.cpp/discussions/491) - -rtr actually hurts prompt t/s for large ubatch?
| **Author** | `Ph0rk0z` |
| :--- | :--- |
| **Created** | 2025-06-03 |
| **Updated** | 2025-06-11 |
---
#### Description
I had long assumed that -RTR was a universal speedup and just like repacking, it would help your performance always. Seems that is not the case.
<details>
<summary> Qwen 235b command line </summary>
```
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m Smoothie-Qwen3-235B-A22B.IQ4_XS.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 95 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-fmoe \
-amb 64 \
-b 4096 \
-ub 4096 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(14|15|16|17|18|19|20|21|22|23|24|25|26|27|28|29)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(30|31|32|33|34|35|36|37|38|39|40|41|42|43|44|45|)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(46|47|48|49|50|51|52|53|54|55|56|57|58|59)\.ffn_.*_exps.=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"
```
</details>
<details><summary>No RTR Buffers</summary>
```
llama_kv_cache_init: CUDA0 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 748.01 MiB
llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 1856.02 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1094.02 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 836.00 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 2502.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 576.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 183
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 95, n_threads = 48, n_threads_batch = 48
```
</details>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 14.283 | 286.78 | 65.942 | 15.53 |
| 4096 | 1024 | 4096 | 14.803 | 276.70 | 68.941 | 14.85 |
| 4096 | 1024 | 8192 | 15.461 | 264.92 | 73.586 | 13.92 |
| 4096 | 1024 | 12288 | 15.831 | 258.74 | 77.875 | 13.15 |
| 4096 | 1024 | 16384 | 16.185 | 253.08 | 81.513 | 12.56 |
| 4096 | 1024 | 20480 | 16.926 | 241.99 | 85.266 | 12.01 |
<details><summary>Buffers with RTR</summary>
```
llama_kv_cache_init: CUDA0 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA1 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA2 KV buffer size = 816.01 MiB
llama_kv_cache_init: CUDA3 KV buffer size = 748.01 MiB
llama_new_context_with_model: KV self size = 3196.00 MiB, K (q8_0): 1598.00 MiB, V (q8_0): 1598.00 MiB
llama_new_context_with_model: CUDA_Host output buffer size = 0.58 MiB
llama_new_context_with_model: pipeline parallelism enabled (n_copies=1)
llama_new_context_with_model: CUDA0 compute buffer size = 1664.02 MiB
llama_new_context_with_model: CUDA1 compute buffer size = 1094.02 MiB
llama_new_context_with_model: CUDA2 compute buffer size = 1024.02 MiB
llama_new_context_with_model: CUDA3 compute buffer size = 2502.00 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 1024.05 MiB
llama_new_context_with_model: graph nodes = 3672
llama_new_context_with_model: graph splits = 149
main: n_kv_max = 32768, n_batch = 4096, n_ubatch = 4096, flash_attn = 1, n_gpu_layers = 95, n_threads = 48, n_threads_batch = 48
```
</details>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 24.221 | 169.11 | 59.405 | 17.24 |
| 4096 | 1024 | 4096 | 24.852 | 164.82 | 62.359 | 16.42 |
| 4096 | 1024 | 8192 | 25.570 | 160.19 | 67.178 | 15.24 |
| 4096 | 1024 | 12288 | 26.293 | 155.78 | 71.996 | 14.22 |
| 4096 | 1024 | 16384 | 26.979 | 151.82 | 76.468 | 13.39 |
It's even worse on DeepSeek, where my prompt speeds were cut in half while only losing about 1.5 t/s of TG. Another thing of note is that no repacking causes many more large transfers to the GPU. I saw rates of up to 16 GB/s going between the cards and, I assume, the system.
Peculiar thing though, for smaller batches:
<details> <summary> 235b ub 1024 </summary>
```
CUDA_VISIBLE_DEVICES=0,1,2,3 numactl --interleave=all ./bin/llama-sweep-bench \
-m Smoothie-Qwen3-235B-A22B.IQ4_XS.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 95 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-rtr \
-fmoe \
-amb 512 \
-ub 1024 \
-ot "blk\.(0|1|2|3|4|5|6|7|8|9|10|11|12|13|14|15)\.ffn_.*_exps.=CUDA0" \
-ot "blk\.(16|17|18|19|20|21|22|23|24|25|26|27|28|29|30|31|32)\.ffn_.*_exps.=CUDA1" \
-ot "blk\.(33|34|35|36|37|38|39|40|41|42|43|44|45|46|47|48|49)\.ffn_.*_exps.=CUDA2" \
-ot "blk\.(50|51|52|53|54|55|56|57|58|59|60|61|62|63|64|65|66)\.ffn_.*_exps.=CUDA3" \
-ot "\.ffn_.*_exps.=CPU"
```
</details>
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 5.432 | 188.50 | 13.878 | 18.45 |
| 1024 | 256 | 1024 | 5.402 | 189.55 | 14.069 | 18.20 |
| 1024 | 256 | 2048 | 5.434 | 188.43 | 14.268 | 17.94 |
| 1024 | 256 | 3072 | 5.514 | 185.71 | 14.499 | 17.66 |
| 1024 | 256 | 4096 | 5.543 | 184.74 | 14.655 | 17.47 |
| 1024 | 256 | 5120 | 5.566 | 183.96 | 15.034 | 17.03 |
| 1024 | 256 | 6144 | 5.624 | 182.08 | 15.241 | 16.80 |
| 1024 | 256 | 7168 | 5.700 | 179.64 | 15.547 | 16.47 |
| 1024 | 256 | 8192 | 5.732 | 178.66 | 15.836 | 16.17 |
| 1024 | 256 | 9216 | 5.820 | 175.96 | 16.136 | 15.87 |
| 1024 | 256 | 10240 | 5.812 | 176.18 | 16.415 | 15.60 |
| 1024 | 256 | 11264 | 5.888 | 173.92 | 16.751 | 15.28 |
| 1024 | 256 | 12288 | 5.907 | 173.37 | 16.951 | 15.10 |
| 1024 | 256 | 13312 | 5.994 | 170.84 | 17.151 | 14.93 |
| 1024 | 256 | 14336 | 5.998 | 170.72 | 17.394 | 14.72 |
| 1024 | 256 | 15360 | 6.043 | 169.46 | 17.623 | 14.53 |
| 1024 | 256 | 16384 | 6.139 | 166.80 | 17.983 | 14.24 |
Without -rtr, this gets ~120 t/s prompt processing at most. Does anyone know why, or has anyone noticed something similar?
---
#### 🗣️ Discussion
👤 **Ph0rk0z** replied the **2025-06-04** at **15:59:57**:<br>
I played around with offline repacking next. Oh boy.
Offline repacking on 4096 batch.
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 24.349 | 168.22 | 69.065 | 14.83 |
| 4096 | 1024 | 4096 | 24.815 | 165.06 | 71.880 | 14.25 |
| 4096 | 1024 | 8192 | 25.604 | 159.97 | 76.457 | 13.39 |
| 4096 | 1024 | 12288 | 26.288 | 155.81 | 80.361 | 12.74 |
It seems like performance here is identical to using -rtr. Debuff to text generation likely from mmap.
Ok.. so let's try it in a configuration where repacking previously helped like the last one in the previous post. Only 6 layers are incorrectly packed and everything has gone into the toilet.
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 6.992 | 146.46 | 192.370 | 1.33 |
| 1024 | 256 | 1024 | 6.969 | 146.95 | 192.509 | 1.33 |
Then I indiscriminately repacked the whole model to see what would happen. It got just as bad. Lots of transfers. Could it be related to offload policy? I didn't even bother waiting for the first iteration, it took so long. The CPU was running at 10 cores, judging from the 1000% usage.
And finally I packed the model correctly AND used the configuration that produced a speed gain.
with mmap
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 6.306 | 162.40 | 15.561 | 16.45 |
| 1024 | 256 | 1024 | 5.993 | 170.87 | 15.743 | 16.26 |
| 1024 | 256 | 2048 | 6.004 | 170.54 | 15.897 | 16.10 |
| 1024 | 256 | 3072 | 5.882 | 174.10 | 16.071 | 15.93 |
| 1024 | 256 | 4096 | 6.295 | 162.67 | 16.253 | 15.75 |
| 1024 | 256 | 5120 | 6.144 | 166.67 | 16.608 | 15.41 |
| 1024 | 256 | 6144 | 6.143 | 166.70 | 16.833 | 15.21 |
| 1024 | 256 | 7168 | 6.280 | 163.07 | 17.086 | 14.98 |
| 1024 | 256 | 8192 | 6.298 | 162.58 | 17.373 | 14.74 |
no mmap
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 5.759 | 177.82 | 14.442 | 17.73 |
| 1024 | 256 | 1024 | 5.639 | 181.59 | 14.523 | 17.63 |
| 1024 | 256 | 2048 | 5.867 | 174.53 | 14.656 | 17.47 |
| 1024 | 256 | 3072 | 5.900 | 173.56 | 14.833 | 17.26 |
| 1024 | 256 | 4096 | 6.026 | 169.92 | 15.031 | 17.03 |
| 1024 | 256 | 5120 | 6.069 | 168.73 | 15.389 | 16.63 |
| 1024 | 256 | 6144 | 5.849 | 175.07 | 15.564 | 16.45 |
| 1024 | 256 | 7168 | 5.943 | 172.31 | 15.939 | 16.06 |
| 1024 | 256 | 8192 | 6.154 | 166.39 | 16.184 | 15.82 |
Does it help to cache the model first? Let's run with mmap again....
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 6.441 | 158.99 | 15.466 | 16.55 |
| 1024 | 256 | 1024 | 6.111 | 167.56 | 15.717 | 16.29 |
| 1024 | 256 | 2048 | 5.875 | 174.30 | 15.810 | 16.19 |
| 1024 | 256 | 3072 | 6.029 | 169.84 | 16.001 | 16.00 |
| 1024 | 256 | 4096 | 6.150 | 166.52 | 16.170 | 15.83 |
| 1024 | 256 | 5120 | 6.010 | 170.39 | 16.537 | 15.48 |
| 1024 | 256 | 6144 | 6.008 | 170.44 | 16.727 | 15.30 |
| 1024 | 256 | 7168 | 6.332 | 161.73 | 17.038 | 15.02 |
| 1024 | 256 | 8192 | 6.277 | 163.13 | 17.328 | 14.77 |
NOPE!
**So the point to the whole story, if anyone cares, is that even a few mis-packed layers will tank your speeds. Feels like there is no point to posting R4/R8 quants because the user will have to repack them anyway unless using the EXACT configuration of the author. What am I missing here?**
As a bonus.. let's find where RTR starts to help prompt processing...
First I'll take a new baseline, because it seems textgen is not working so well after packing/loading/etc. Could be I need to drop caches?
4096 no rtr/no-mmap Baseline
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 14.588 | 280.78 | 71.871 | 14.25 |
| 4096 | 1024 | 4096 | 14.877 | 275.33 | 74.257 | 13.79 |
| 4096 | 1024 | 8192 | 15.500 | 264.25 | 78.862 | 12.98 |
| 4096 | 1024 | 12288 | 15.919 | 257.30 | 83.039 | 12.33 |
| 4096 | 1024 | 16384 | 16.476 | 248.60 | 87.030 | 11.77 |
That's the highest we will get for now.
2048 without RTR with no-mmap
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 11.606 | 176.47 | 35.719 | 14.33 |
| 2048 | 512 | 2048 | 11.586 | 176.77 | 36.388 | 14.07 |
| 2048 | 512 | 4096 | 11.683 | 175.30 | 37.146 | 13.78 |
| 2048 | 512 | 6144 | 11.813 | 173.37 | 38.241 | 13.39 |
| 2048 | 512 | 8192 | 11.950 | 171.38 | 39.246 | 13.05 |
| 2048 | 512 | 10240 | 12.194 | 167.95 | 40.579 | 12.62 |
| 2048 | 512 | 12288 | 12.208 | 167.75 | 41.348 | 12.38 |
| 2048 | 512 | 14336 | 12.412 | 165.00 | 42.410 | 12.07 |
| 2048 | 512 | 16384 | 12.407 | 165.07 | 43.277 | 11.83 |
2048 with rtr
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 2048 | 512 | 0 | 13.308 | 153.89 | 32.755 | 15.63 |
| 2048 | 512 | 2048 | 13.167 | 155.54 | 33.466 | 15.30 |
| 2048 | 512 | 4096 | 13.308 | 153.89 | 34.117 | 15.01 |
| 2048 | 512 | 6144 | 13.351 | 153.40 | 35.396 | 14.47 |
| 2048 | 512 | 8192 | 13.539 | 151.27 | 36.420 | 14.06 |
| 2048 | 512 | 10240 | 14.000 | 146.28 | 37.873 | 13.52 |
| 2048 | 512 | 12288 | 14.011 | 146.17 | 38.719 | 13.22 |
| 2048 | 512 | 14336 | 14.113 | 145.11 | 39.612 | 12.93 |
| 2048 | 512 | 16384 | 14.596 | 140.32 | 40.743 | 12.57 |
So still a debuff to prompt processing and a mild gain to t/g
Let's try something else....
2048/1024 -rtr
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 6.837 | 149.78 | 16.543 | 15.47 |
| 1024 | 256 | 1024 | 6.830 | 149.93 | 16.713 | 15.32 |
| 1024 | 256 | 2048 | 6.885 | 148.73 | 16.821 | 15.22 |
| 1024 | 256 | 3072 | 7.085 | 144.54 | 17.057 | 15.01 |
| 1024 | 256 | 4096 | 6.899 | 148.42 | 17.248 | 14.84 |
| 1024 | 256 | 5120 | 7.106 | 144.10 | 17.608 | 14.54 |
| 1024 | 256 | 6144 | 6.760 | 151.47 | 17.794 | 14.39 |
| 1024 | 256 | 7168 | 7.181 | 142.60 | 18.080 | 14.16 |
| 1024 | 256 | 8192 | 7.154 | 143.13 | 18.325 | 13.97 |
2048/1024 -no rtr and no-mmap
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 9.905 | 103.38 | 17.792 | 14.39 |
| 1024 | 256 | 1024 | 9.711 | 105.45 | 17.938 | 14.27 |
| 1024 | 256 | 2048 | 9.793 | 104.56 | 18.090 | 14.15 |
| 1024 | 256 | 3072 | 9.786 | 104.64 | 18.292 | 14.00 |
| 1024 | 256 | 4096 | 9.824 | 104.24 | 18.465 | 13.86 |
| 1024 | 256 | 5120 | 9.854 | 103.92 | 18.844 | 13.59 |
| 1024 | 256 | 6144 | 9.874 | 103.71 | 19.033 | 13.45 |
| 1024 | 256 | 7168 | 9.930 | 103.12 | 19.309 | 13.26 |
| 1024 | 256 | 8192 | 10.060 | 101.79 | 19.568 | 13.08 |
Ok.. now prompt processing finally fell.. the original observed effect.
So then -rtr or repacking is only useful in the case of ub being half the batch size? It does allow you to generate text a little bit faster in every test at least.
---
👤 **ikawrakow** replied the **2025-06-04** at **16:48:34**:<br>
Perhaps to understand how repacked quants behave on the CPU and CUDA, it is easier to take a smaller model that would completely fit one GPU, quantize with with `--pure` to your favorite quant and corresponding repacked variant, and then
* Run fully offloaded to the GPU
* Run CPU-only
It is an easy exercise, does not require an imatrix as you are not after the best possible quantization quality, and if you pick a model that is not too large, it is very quick to do.
Without having understood what the repacking does or does not do for you, it becomes very hard to sort out the big models with partial offloads, offload policy, numa, what runs on the GPU or CPU when and why, etc.
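A concrete (illustrative) version of that exercise, assuming a small model that fits in VRAM and that the repacked counterpart of `Q4_0` is exposed to `llama-quantize` under a name like `Q4_0_R8`, as the repacking discussion below suggests:
```bash
# quantize everything to one type, and once more to its repacked counterpart
./build/bin/llama-quantize --pure model-f16.gguf model-q4_0.gguf    Q4_0
./build/bin/llama-quantize --pure model-f16.gguf model-q4_0_r8.gguf Q4_0_R8   # assumed type name

# compare fully offloaded to the GPU ...
./build/bin/llama-sweep-bench -m model-q4_0.gguf    -ngl 99 -fa
./build/bin/llama-sweep-bench -m model-q4_0_r8.gguf -ngl 99 -fa
# ... versus CPU-only
./build/bin/llama-sweep-bench -m model-q4_0.gguf    -ngl 0 -fa
./build/bin/llama-sweep-bench -m model-q4_0_r8.gguf -ngl 0 -fa
```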
> 👤 **Ph0rk0z** replied the **2025-06-04** at **17:17:17**:<br>
> Worth a try. I will have to. I'm repacking exactly what I don't put on the GPU and watching the layers in quantize, i.e. which become _R8. One other metric would be to do 4096/2048 and see if it really is correlated to half the batch size or bound to the 1024 size.
>
> Is there a way to print exactly which tensors are repacked by RTR? I could be missing some tiny layers that it did on its own, since I used the regex offline.
>
> Textgen is back to 18.x t/s after I dropped caches but prompt processing benchmarks hold universally through my tests.
>
> 👤 **Ph0rk0z** replied the **2025-06-05** at **11:48:40**:<br>
> So I got it to print the tensors. The one that gets repacked by RTR and not offline repacking is token_embd. I had issues moving that tensor to either CPU or GPU manually.
>
> Also, I notice that quantize will repack to R8; is there a difference between that and R4 as far as the various CUDA implementations you are adding?
>
> 👤 **ikawrakow** replied the **2025-06-05** at **11:56:57**:<br>
> `token_embd.weight` is never repacked and always stays on the CPU. It should not go to the GPU, and it should not get repacked. If you managed to make it repack, that's a bug, and you should tell me how you did it.
>
> For some quantization types one gets better CPU performance by interleaving 8 rows, so these are the `_R8` quants. `Q4_0`, `Q8_0` and `IQ4_XS` get repacked to `_R8`, all others are `_R4`. Some of those that are `_R4` would benefit from being `_R8`, but I haven't done it, and now that there are `_R4` quantized models floating around the Internet, I don't want to break backwards compatibility (and I don't want to carry `_R4` and `_R8` versions of the same quantization type), so it will stay like this.
>
> 👤 **Ph0rk0z** replied the **2025-06-05** at **12:49:05**:<br>
> I uncommented your line near where it says REPACKED XX Tensors which purportedly printed what was repacked. Everything else matches what I sent to CPU. Either the print is incorrect or it repacked it.
>
> It's strange too, because I had tried to find layers to throw on the CPU for just a few MB, since my command line was OOM at 22k. I finally settled on 10 ffn_gate_inp towards the end. When I put token_embd=CPU I'd get a crash on Qwen right away.
>
> I just realized that *all* of my quants are IQ something. Wonder if it's related. Also tried offload policy from -1 to 29, negligible speed differences all around. Got deepseek lite a while ago which fits on one GPU but it's also IQ4_XS. Perhaps I should download a Q4_K instead.
>
> Edit: I enabled a further debug printout that says what got repacked to what, and emb isn't there.
---
👤 **Ph0rk0z** replied the **2025-06-06** at **17:29:36**:<br>
Finally got around to testing a smaller model. Non IQ quant as well.
<details><summary>DeepSeek-V2-Lite-Chat.i1-Q4_K_M</summary>
CUDA_VISIBLE_DEVICES= numactl --interleave=all ./bin/llama-sweep-bench \
-m DeepSeek-V2-Lite-Chat.i1-Q4_K_M.gguf \
-t 48 \
-c 32768 \
--numa distribute \
-ngl 0 \
-ctk q8_0 \
-ctv q8_0 \
-fa \
-fmoe \
-rtr \
-b 4096 \
-ub 4096
</details>
No RTR 48c CPU distribute, cache on GPU
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 2.955 | 1386.18 | 36.494 | 28.06 |
| 4096 | 1024 | 4096 | 3.047 | 1344.07 | 60.110 | 17.04 |
| 4096 | 1024 | 8192 | 3.338 | 1227.20 | 82.831 | 12.36 |
| 4096 | 1024 | 12288 | 3.611 | 1134.32 | 103.469 | 9.90 |
| 4096 | 1024 | 16384 | 3.861 | 1060.81 | 125.330 | 8.17 |
RTR 48c CPU distribute, Cache on GPU (iqk_repack_tensor(output.weight): q6_K -> q6_k_r4. 102400 rows, 3200 chunks, 48 threads)
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 11.081 | 369.65 | 32.316 | 31.69 |
| 4096 | 1024 | 4096 | 13.410 | 305.44 | 53.593 | 19.11 |
| 4096 | 1024 | 8192 | 15.889 | 257.79 | 74.674 | 13.71 |
24 cores, numa isolate + RTR + no interleave
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 19.223 | 213.08 | 30.327 | 33.76 |
| 4096 | 1024 | 4096 | 23.378 | 175.21 | 64.052 | 15.99 |
| 4096 | 1024 | 8192 | 28.008 | 146.25 | 97.014 | 10.56 |
24 cores, no interleave + no rtr + numa isolate
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 3.352 | 1221.83 | 46.758 | 21.90 |
| 4096 | 1024 | 4096 | 3.448 | 1187.76 | 81.010 | 12.64 |
| 4096 | 1024 | 8192 | 3.730 | 1098.15 | 113.951 | 8.99 |
GPU Fully
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 0.730 | 5613.13 | 7.402 | 138.33 |
| 4096 | 1024 | 4096 | 0.863 | 4745.09 | 10.398 | 98.48 |
| 4096 | 1024 | 8192 | 1.115 | 3674.86 | 13.378 | 76.55 |
No GPU full cores no rtr
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 13.485 | 303.75 | 36.449 | 28.09 |
| 4096 | 1024 | 4096 | 15.527 | 263.81 | 58.686 | 17.45 |
| 4096 | 1024 | 8192 | 18.000 | 227.55 | 79.114 | 12.94 |
No GPU full cores RTR
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 10.863 | 377.07 | 33.246 | 30.80 |
| 4096 | 1024 | 4096 | 13.005 | 314.95 | 54.394 | 18.83 |
| 4096 | 1024 | 8192 | 15.463 | 264.88 | 75.656 | 13.53 |
It looks like on this system, RTR only helps when there is no GPU involved or the ubatch is 1024 (previous tests). In every other case, RTR lowers the prompt processing by a lot but improves TG.
> 👤 **ciprianveg** replied the **2025-06-10** at **16:08:25**:<br>
> I noticed it too, and IQ3_XXS_UD PP speed is affected by rtr much more than other quants: it drops from 250 t/s to 26 t/s, circa 10x slower. Q2_XL_UD drops only from 245 to 140 t/s. I am using no-mmap and have swap disabled.
>
> It is a pity because, while dropping PP speed by 90%, it increases the generation speed by 40%.
>
> I have a TR 3955 and 2x 3090.
> built with: cmake -B build -DGGML_CUDA=ON -DGGML_RPC=OFF -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1
>
> started with:
> -ctx-size 71680 \
> -ctk q8_0 \
> -mla 3 \
> -fa \
> -amb 512 \
> -fmoe \
> --temp 0.6 \
> --top_p 0.95 \
> --min_p 0.01 \
> --n-gpu-layers 63 \
> -ot "blk\.[0-3]\.ffn_up_exps=CUDA0,blk\.[0-3]\.ffn_gate_exps=CUDA0,blk\.[0-3]\.ffn_down_exps=CUDA0" \
> -ot "blk\.1[0-1]\.ffn_up_exps=CUDA1,blk\.1[0-1]\.ffn_gate_exps=CUDA1,blk\.1[0]\.ffn_down_exps=CUDA1" \
> --override-tensor exps=CPU \
> --parallel 1 \
> --threads 16 \
> --threads-batch 15 \
> --host 0.0.0.0 --port 5002 \
> --ubatch-size 7168 --batch-size 7168 --no-mmap
>
> BUT, if i build it with: cmake -B build -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS -DGGML_CUDA=ON -DGGML_SCHED_MAX_COPIES=1
>
> no PP decrease anymore, but also no TG speed increase...
>
> 👤 **Ph0rk0z** replied the **2025-06-11** at **11:40:47**:<br>
> Could it be using BLAS instead of cuda when built with it? While ubatch size 1024 isn't as good as 4096+, it gives me a happy medium to use the RTR's textgen speed increase.

View File

@@ -0,0 +1,47 @@
### 🗣️ [#519](https://github.com/ikawrakow/ik_llama.cpp/discussions/519) - Android Build
| **Author** | `aezendc` |
| :--- | :--- |
| **Created** | 2025-06-11 |
| **Updated** | 2025-06-21 |
---
#### Description
I just want to ask: how are you guys building for Android?
I want to build for Android so that I can create an Android app. Thank you.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-06-11** at **14:09:22**:<br>
See #401
> 👤 **aezendc** replied the **2025-06-21** at **05:39:58**:<br>
> I mean using Windows. I can't successfully build using the NDK.
---
👤 **jeffzhou2000** replied the **2025-06-21** at **08:48:21**:<br>
FYI:
refer to the project kantv (builds a standard Android APK with llama.cpp + whisper.cpp): https://github.com/kantv-ai/kantv
or my forked llama.cpp: https://github.com/zhouwg/ggml-hexagon
> 👤 **aezendc** replied the **2025-06-21** at **09:28:14**:<br>
> Will the bitnet-b1.58-2B-4T-GGUF model work here?
>
> 👤 **jeffzhou2000** replied the **2025-06-21** at **09:32:04**:<br>
> I haven't tried that model yet on Android.
>
> 👤 **aezendc** replied the **2025-06-21** at **09:51:14**:<br>
> Thanks for sharing, these are great repositories.
>
> 👤 **jeffzhou2000** replied the **2025-06-21** at **09:55:01**:<br>
> you are welcome and glad to see it's a little useful.

View File

@@ -0,0 +1,114 @@
### ✨ [#526](https://github.com/ikawrakow/ik_llama.cpp/discussions/526) - Partial requant feature to save compute and time during tests.
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2025-06-13 |
| **Updated** | 2025-07-13 |
---
#### Description
Hey,
Would it be possible to have a partial requant feature?
For (a generic) example, one quantizes an IQ2_KT .gguf, but with ffn_down in IQ2_S and the output in IQ5_KS_R4.
Then, one wants to requantize the same model with the same IQ2_KT broad quant strategy, but with ffn_down in IQ3_XXS and the output in IQ5_K.
Could a feature be implemented so the first quantized model is used as a secondary source alongside the original source, in order to import all the already-quantized IQ2_KT tensors from this secondary source, copy them into the destination .gguf, and only requantize from the original source those tensors whose type has been changed in the quantization command?
That could save a lot of time and compute during tests.
---
#### 🗣️ Discussion
👤 **saood06** replied the **2025-06-13** at **12:49:01**:<br>
People do similar things a lot by making scripts that leverage gguf-py. (Some notable examples were updating the Gemma QAT to use Q6_K instead of fp16 for the embeddings table, manually making the DeepSeek R1-T Chimera from a V3 and an R1 GGUF, etc.).
I've thought to add support to the C/C++ code to do this, but it seems unnecessary given how flexible gguf-py is.
There has been effort made to keep gguf-py current with all the quant types (see #458 and #298).
---
👤 **ikawrakow** replied the **2025-06-13** at **12:53:13**:<br>
It would be useful, right? When I'm actively experimenting with quantization mixes I wish I had this feature. But implementing it basically means re-implementing quantization, so I have not done it.
The alternative is to run a second quantization where only the tensors that you want to change are quantized (using `--custom-q`), and then, as @saood06 mentions, use gguf-py to stitch the two models together (although I don't think there is an easy out-of-the-box way of doing that, or is there?)
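For a sense of what the stitching involves, the tensor-selection half fits in a few lines of gguf-py. This is a read-only sketch: it only reports which tensors a merged file would take from which source; the actual write-out would follow the pattern of the gguf-py scripts linked in the replies below. The file names and the regex are illustrative:
```python
#!/usr/bin/env python3
"""Dry-run plan for stitching two quantizations of the same model."""
import re
from gguf import GGUFReader  # pip install gguf

BASE = "model-IQ2_KT.gguf"                   # first quantization (illustrative path)
OVERRIDE = "model-IQ2_KT-ffn-IQ3_XXS.gguf"   # second run with only the changed tensors (illustrative)
PATTERN = re.compile(r"ffn_down|^output\.weight$")  # same spirit as a --custom-q regex

base = {t.name: t for t in GGUFReader(BASE).tensors}
override = {t.name: t for t in GGUFReader(OVERRIDE).tensors}

for name, t in base.items():
    pick = override.get(name, t) if PATTERN.search(name) else t
    origin = "override" if pick is not t else "base"
    print(f"{name:48s} {pick.tensor_type.name:10s} {pick.n_bytes / 2**20:9.2f} MiB  <- {origin}")
```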
> 👤 **Nexesenex** replied the **2025-06-13** at **12:59:46**:<br>
> Well, I'm not well versed in gguf.py, so I'd trust Saood's word on that.
> It still seems to be quite the hassle, and a proper, straightforward implementation of such a feature would IMHO be critically important, because it would save time, which is irrecoverable, and compute/money/natural resources, which are not infinite for anyone.
>
> 👤 **saood06** replied the **2025-06-13** at **13:01:45**:<br>
> > (although I don't think there is an easy out-of-the-box way of doing that, or is there?)
>
> A script that does so would really not be that difficult to make especially if you reference the existing ones (that are designed for specific one-off situations).
>
> I do think it is trivial enough that it is very likely one of the smaller coding-oriented models could one-shot a working version (especially if given the references to the notable examples mentioned above).
>
> I do think a polished version would make sense in `gguf-py/scripts` if one gets made and wants to be shared. I haven't done that with any of the ones I have seen in the wild or made myself, as they are not generic and handle very specific needs.
>
> 👤 **saood06** replied the **2025-06-13** at **13:15:09**:<br>
> I have actually thought about this before, and thought the most polished version would be to add this functionality both as a standalone script (taking in some regex similar to `--custom-q`, `-ot`, `--repack-pattern`, etc.) and in the GGUF Editor GUI : https://github.com/ggml-org/llama.cpp/pull/12930 (which has yet to be ported here).
>
> I never did it as it was always so easy to make one-off scripts for my gguf-py needs, and I thought it wasn't something that many other people would care about or use, but I guess I was wrong.
>
> 👤 **Nexesenex** replied the **2025-06-13** at **14:20:40**:<br>
> Well, we are actually several folks testing new quants on different models, and so the idea might be quite popular, ideally in C or at least in Python. I'll try by myself if no one comes with an out of the box solution, but need to read all those references and understand more about what I'm getting into, because I'm far, far behind you guys about the know-how.
>
> 👤 **saood06** replied the **2025-06-13** at **14:48:47**:<br>
> > Well, we are actually several folks testing new quants on different models, and so the idea might be quite popular, ideally in C or at least in Python.
>
> Yeah. I floated this idea a long time ago to a certain quant maker (who pumps out a lot of quants) as it would (and still could) save them a lot of wasted compute, but this was before I knew about gguf-py.
>
> >I'll try by myself if no one comes with an out of the box solution
>
> Nice. If you don't end up getting something working by the time I finish polishing the frontend I use to be good enough for a public release, I'll do it.
>
> >but need to read all those references and understand more
>
> Here's two I mentioned, [QAT embed swap](https://huggingface.co/stduhpf/google-gemma-3-27b-it-qat-q4_0-gguf-small/blob/main/swap_embeds.py), [DIY chimera merge](https://gist.github.com/city96/a05cb7ec6664a5085efb007497f2049b). I know I've seen more but these are the first two that came to mind.
>
> 👤 **saood06** replied the **2025-06-13** at **15:03:21**:<br>
> Also I just remembered there was another hacky idea I had to do this which involved abusing the gguf-split system to isolate any tensors you want to experiment with which would allow you to swap them out (and test many combinations).
>
> The best implementation of this could in theory minimize both the amount of space taken (should be easy) and the amount of files written to (this seems like it would be much more difficult, quantizing only select tensors with gguf-py might not be too bad, but that would limit it to only the tensors it can quantize to, and doing it with `quantize.cpp` means adding that functionality to it which may be difficult).
>
> 👤 **Nexesenex** replied the **2025-06-13** at **16:10:32**:<br>
> > Also I just remembered there was another hacky idea I had to do this which involved abusing the gguf-split system to isolate any tensors you want to experiment with which would allow you to swap them out (and test many combinations).
> >
> > The best implementation of this could in theory minimize both the amount of space taken (should be easy) and the amount of files written to (this seems like it would be much more difficult, quantizing only select tensors with gguf-py might not be too bad, but that would limit it to only the tensors it can quantize to, and doing it with `quantize.cpp` means adding that functionality to it which may be difficult).
>
> Lol, I was just thinking about this an hour ago ("Why don't I simply split the gguf into as many tensors as there are..."), and then it becomes a matter of naming. I was contemplating that a long time ago already: tensor-series based gguf, gguf as a directory, and so on. But actually, it can already be tried as things are.
---
👤 **saood06** replied the **2025-07-12** at **21:45:17**:<br>
@Nexesenex
Have you seen this: https://github.com/Thireus/GGUF-Tool-Suite? I haven't fully gone through the code yet, but I think it seems to accomplish at least some of the goals you described here (taking the path of using the gguf-split system).
> 👤 **Nexesenex** replied the **2025-07-12** at **22:04:37**:<br>
> > @Nexesenex
> >
> > Have you seen this: https://github.com/Thireus/GGUF-Tool-Suite? I haven't fully gone through the code yet, but I think it seems to accomplish at least some of the goals you described here (taking the path of using the gguf-split system).
>
> You will laugh. I discovered his fork of IKL today, and didn't discover yet his tools suite. Thanks for the heads-up, I will dive into it asap! :)
>
> 👤 **saood06** replied the **2025-07-12** at **23:30:04**:<br>
> >Thanks for the heads-up, I will dive into it asap! :)
>
> Let me know your thoughts, e.g. if it does meet your goals, will you use it, will you change/fork it, etc.
>
> 👤 **Nexesenex** replied the **2025-07-13** at **02:32:53**:<br>
> > > Thanks for the heads-up, I will dive into it asap! :)
> >
> > Let me know your thoughts, e.g. if it does meet your goals, will you use it, will you change/fork it, etc.
>
> Sure thing.

View File

@@ -0,0 +1,290 @@
### 🗣️ [#532](https://github.com/ikawrakow/ik_llama.cpp/discussions/532) - Guidance on GPU Layer Offloading Strategy in ik_llama.cpp for Multi GPU Rig (2x5090 + 2x4090)
| **Author** | `mtcl` |
| :--- | :--- |
| **Created** | 2025-06-16 |
| **Updated** | 2025-06-24 |
---
#### Description
@ikawrakow or @ubergarm
I've recently expanded my GPU rig to include (2x RTX 5090 + 2x RTX 4090) and am seeking your expertise to develop a systematic approach for offloading layers across these GPUs.
While I have experience with hardware configurations, I'd like to avoid ad-hoc experimentation and instead follow best practices or documented methodologies specific to ik_llama.cpp's architecture. Could you please share recommendations regarding:
Which types of layers (e.g., attention, feed-forward) benefit most from GPU acceleration? How do I know which layers I need to offload? Currently I have been randomly offloading whatever I can.
Optimal strategies for distributing work across heterogeneous GPUs (5090 vs 4090)?
Are there built-in features/flags in ik_llama.cpp to control layer distribution?
I'm particularly interested in any rationale behind layer offloading decisions in GPU-accelerated LLMs.
this is one of the commands that I used:
For some reason my nvidia-smi shows GPUs 0 and 3 as NVIDIA 5090s, but in reality CUDA_VISIBLE_DEVICES sees GPUs 2 and 3 as the NVIDIA 5090s. So I have arranged it such that the first and last -ot parameters are for the NVIDIA 5090s, and the two -ot parameters in between are for the NVIDIA 4090s.
for Qwen3-235B
```bash
CUDA_VISIBLE_DEVICES="2,1,0,3" ./build/bin/llama-server \
--model /home/mukul/dev-ai/models/unsloth/Qwen3-235B-A22B-128K-GGUF/Q4_K_M/Qwen3-235B-A22B-128K-Q4_K_M-00001-of-00003.gguf \
--alias unsloth/Qwen3-235B-A22B-128K-Q4_K_M \
--ctx-size 65536 \
-ctk q8_0 -ctv q8_0 \
-fa \
-b 4096 -ub 4096 \
-fmoe \
--n-gpu-layers 100 \
-ot "blk\.([0-9]|1[0-5])\.ffn=CUDA0" \
-ot "blk\.(1[6-9]|2[0-7])\.ffn=CUDA1" \
-ot "blk\.(2[8-9]|3[0-9])\.ffn=CUDA2" \
-ot "blk\.(4[0-9]|5[0-5])\.ffn=CUDA3" \
--override-tensor exps=CPU \
--parallel 1 \
--threads 56 \
--host 0.0.0.0 \
--port 10002
```
and for a DeepSeek-R1
```bash
CUDA_VISIBLE_DEVICES="2,1,0,3" ./build/bin/llama-server \
--model /home/mukul/dev-ai/models/ubergarm/DeepSeek-R1-0528-GGUF/IQ4_KS_R4/DeepSeek-R1-0528-IQ4_KS_R4-00001-of-00009.gguf \
--alias ubergarm/DeepSeek-R1-0528-IQ4_KS_R4 \
--ctx-size 40960 \
-ctk q8_0 \
-mla 3 -fa \
-b 4096 -ub 4096 \
-amb 512 \
-fmoe \
-ngl 63 \
-ot "blk\.[3-4]\.ffn_.*=CUDA0" \
-ot "blk\.[5-6]\.ffn_.*=CUDA1" \
-ot "blk\.[7-8]\.ffn_.*=CUDA2" \
-ot "blk\.[9]\.ffn=CUDA3,blk\.1[0-1]\.ffn=CUDA3" \
-ot exps=CPU \
--parallel 1 \
--threads 56 \
--host 0.0.0.0 \
--port 10002
```
---
#### 🗣️ Discussion
👤 **ubergarm** replied the **2025-06-17** at **01:09:25**:<br>
> I'd like to avoid ad-hoc experimentation and instead follow best practices or documented methodologies specific to ik_llama.cpp's architecture.
I personally *do* advise using ad-hoc experimentation, like simple `llama-sweep-bench` a/b comparisons, to find out what works best for your specific hardware configuration. There are a number of discussions you could search for, such as [this discussion thread](https://github.com/ikawrakow/ik_llama.cpp/discussions/477#discussioncomment-13366794), or search for `CUDA2` etc. for other multi-GPU enjoyers like yourself, e.g. @Lissanro @Ph0rk0z @ciprianveg @Thireus @rodriMora @Panchovix. And it might depend on your PCIe allocation per card and stuff like that too.
If you'd like to document some best practices for multi-GPU offloading strategies for multiple LLMs, that would be a welcome contribution! However keep in mind, things change fast, so honestly spend some time looking through recently closed and newly opened PRs, as some quants are getting a big boost for PP etc.
> Which types of layers (e.g., attention, feed-forward) benefit most from GPU acceleration?
> How do i know which layer I need to offload?
Offload early, offload often! No really, if you can offload the whole thing, that is great. If not, put as much as possible on your fastest GPUs first. Try to keep the kv-cache near the attn tensors, or all on a single GPU, e.g. `--main-gpu 0` or whatever, maybe?
Usually the dense ffn layers get offloaded to CPU first as they are just larger. Hopefully your quant has optimized those for CPU/RAM usage e.g. `_r4` quant types or use `-rtr` etc.
I don't recommend separating `ffn_(gate|up)` as the `-fmoe` is fusing those together psure. Usually I just put all attn/shexp on GPU and as many other ffn that will fit for DeepSeek-R1. Qwen has no shared experts so same thing basically but be aware the names/regex are different and there is no `-ot exps=CPU` for Qwen btw (so remove that from your command). You can read more on my [ubergarm/Qwen3-235B-A22B-GGUF discussions](https://huggingface.co/ubergarm/Qwen3-235B-A22B-GGUF/discussions/1#6814ea55554bef6174d3bab1)
> Currently I have been randomly offloading whatever i can.
This works pretty well a lot of time!
> Optimal strategies for distributing work across heterogeneous GPUs (5090 vs 4090)?
See above, put more layers on the fast GPUs first.
> Are there built-in features/flags in ik_llama.cpp to control layer distribution?
I'd suggest some combination of this depending on which model you're running:
```
cmake -B ./build -DGGML_CUDA=ON -DGGML_BLAS=OFF -DGGML_SCHED_MAX_COPIES=1 -DGGML_CUDA_IQK_FORCE_BF16=1 -DGGML_CUDA_F16=ON
cmake --build ./build --config Release -j $(nproc)
```
Use `-DGGML_CUDA_IQK_FORCE_BF16=1` if you're running DeepSeek models. The `-DGGML_CUDA_F16=ON` is for the new experimental `_kt` quants at least, and maybe other stuff, I'm not sure. Personally I leave BLAS off and don't fuss with any of that stuff (even the experimental Intel compiler stuff I tried once seemed slower, so I don't mess with any of it, nor the AMX stuff).
> For some reason my nvidia-smi shows GPU 0 and 3 as NVIDIA 5090, but in reality CUDA_VISIBLE_DEVICES sees GPUs 2 and 3 as NVIDIA 5090.
I believe there are some device-ordering environment variables you can use to swap those around; I thought I saw some chatter near the linked discussion above, e.g. `CUDA_DEVICE_ORDER=PCI_BUS_ID`.
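For example (the environment variable and its `PCI_BUS_ID` value are standard CUDA; the rest of the command is illustrative):
```bash
# enumerate GPUs in PCI bus order so the CUDA indices match nvidia-smi,
# then pick the order you want with CUDA_VISIBLE_DEVICES
CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=2,1,0,3 \
  ./build/bin/llama-server --model /path/to/model.gguf -ngl 100 -fa -fmoe
```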
Cheers!
---
👤 **Ph0rk0z** replied the **2025-06-17** at **13:28:43**:<br>
It's a good idea to view all the layers and their sizes in the model file. That way you know what you can fit onto your GPUs; not all blocks have the same-sized layers. I have used llama.cpp mainline to print the sizes and adjusted accordingly. The more you cram, the more t/s you generally get. A smaller AMB can get you smaller buffers and perhaps fit another layer. Benchmarking is your friend. Once you cache the model in sysram, it's easy to re-load and try things. There is some variance in llama-sweep-bench, but you can still spot larger trends. After a lot of testing, it might be wise to dump your cache and re-test your best ones.
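If you don't want to parse the load log for sizes, gguf-py can print per-tensor sizes directly (a small sketch; the model path is a placeholder):
```python
from gguf import GGUFReader  # pip install gguf

reader = GGUFReader("/path/to/model.gguf")
# largest tensors first: these are the ones worth planning GPU placement around
for t in sorted(reader.tensors, key=lambda t: t.n_bytes, reverse=True):
    print(f"{t.name:48s} {t.tensor_type.name:10s} {t.n_bytes / 2**20:9.2f} MiB")
```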
> 👤 **Panchovix** replied the **2025-06-17** at **13:56:49**:<br>
> What is the effect by reducing or increasing the amb, besides buffer size?
>
> 👤 **Ph0rk0z** replied the **2025-06-18** at **01:11:46**:<br>
> From the original PR it runs the ops multiple times, if I understand correctly. Breaking them up into batches. On some systems that can be slower? In practice I found little difference.
>
> 👤 **Panchovix** replied the **2025-06-18** at **01:17:28**:<br>
> I'm not sure how it works exactly either, the PR is too technical for me haha. I guess in theory reducing amb reduces the buffer sizes?
>
> 👤 **ubergarm** replied the **2025-06-18** at **02:14:07**:<br>
> So without `-amb` it used to just fill up a bunch of VRAM and cause trouble. But using `-amb 512` for example will set aside a buffer of fixed size 512MiB VRAM and yes as @Ph0rk0z says it will use that fixed buffer size and loop over processing all the data that previously filled up a bunch of VRAM.
>
> So it is a trade-off, in that it puts a cap on the amount of VRAM used but there is a little overhead to setup and copy into it looping over all the data to process.
>
> If you make it too small, e.g. `-amb 64` things can get slower or stop working. So in general I leave it at like `-amb 512` or if I am squeezing something in and need a little more I'll drop to `-amb 256` but generally don't go lower than that.
>
> 👤 **Ph0rk0z** replied the **2025-06-18** at **12:31:23**:<br>
> 64 actually worked for me. Some fractions of t/s off at higher contexts and that's all. I think at some point the buffer doesn't get much smaller. Higher AMB also didn't produce higher performance. Had tested 1024 and 2048 but nada. Does it affect anything like PPL or quality tho? Ubatch supposedly has some of that for context.
>
> 👤 **ikawrakow** replied the **2025-06-18** at **13:00:04**:<br>
> Few clarifications for this thread
> * `-amb` only has effect if we have MLA (DeepSeek models), and/or if we are not using flash attention for models with GQA (almost all other models). I.e., when using FA, it has no effect whatsoever on any model other than DeepSeek.
> * It has no effect on accuracy
> * It splits the self attention calculation into chunks over attention heads in such a way that the intermediate buffers required for `K*Q` or, in the case of MLA, for the K-cache times `attn_k_b` tensor, does not exceed the specified amount of memory
> * Obviously this is only possible up to a point. When a chunk reaches a single attention head no further reduction is possible
> * It only controls the compute buffer sizes of buffers related to self-attention. There are many other operations in an LLM compute graph that also require temporary buffers for storing results. Those are not affected by `-amb`, so the actual compute buffer size is almost always larger than what is specified with `-amb`.
> * It is theoretically slower than no `-amb` because, instead of doing one big matrix matrix multiplication, we need to do several smaller matrix multiplications, and this is always slower. This is why it is an option rather than being on by default.
> * In practice people seem to observe no measurable effect when using `-amb` with DeepSeek-R1/V3.
> * I observe about a 3% performance penalty when running DeepSeek-Lite (a 16B parameter model with the same attention mechanism as DeepSeek-V3/R1) fully offloaded to the GPU, and also about that when running this model CPU only.
>
> 👤 **Ph0rk0z** replied the **2025-06-18** at **13:05:28**:<br>
> >-amb only has effect if we have MLA (DeepSeek models)
>
> Good to know because I was copying 512 for qwen models at first, since people put it in their command line. Monkey see, monkey do. Took it out only when testing multiple sizes and seeing no change.
---
👤 **mtcl** replied the **2025-06-23** at **03:15:56**:<br>
@ubergarm and @ikawrakow I have recently obtained 2x Blackwell Pro 6000 GPUs, so I have 192 GB of VRAM. I am able to offload your Qwen3-235B model completely onto the GPUs and I get over 1000 t/s prompt processing and 50 tk/sec generation speed. But for the DeepSeek model, I can't get beyond 12-13 tk/sec. Would you have any advice for me? Below is the video where I compare all the different options; I have chapters in the video so that you can see the right sections. Any help will be really appreciated. I am starting to give up on DeepSeek. If two Blackwells aren't enough, then what is :/
https://www.youtube.com/watch?v=cFddXR1nPLg
8:01 - Test 1: Qwen 3 235B (Fully GPU Offloaded)
10:55 - Qwen 3 235B Loaded - Insane Performance!
12:18 - Qwen 3 235B Benchmark: 58 tokens/sec
18:21 - Qwen 3 235B Pushing the Limit: 128k Context Test
21:14 - Test 2: DeepSeek MoE Model (Partial Offload)
26:43 - Experimenting with Layer Offloading
31:29 - DeepSeek Benchmark & Power Draw
35:27 - DeepSeek's Impressive Snake Game
41:35 - DeepSeek Performance Results (12 tokens/sec)
44:27 - Test 3: DeepSeek on iklama.cpp (IQ3 Quant)
59:36 - iklama.cpp Performance Results (15 tokens/sec)
1:08:31 - Test 4: Llama 4 Maverick MoE Model
1:20:22 - Maverick Performance Results (57 tokens/sec!)
> 👤 **Panchovix** replied the **2025-06-23** at **03:19:03**:<br>
> Not them but for 2bpw or more you may get benefits by offloading.
>
> If you want to load 2-4bpw fully on GPU without offloading, then you need 2 more 6000 Pros haha.
>
> It is not normal to get 12 t/s PP. How are you running the model? On that specific quant I get about 200-300 t/s PP on a 5090 (and other GPUs, offloading about 100GB to RAM).
>
> 👤 **mtcl** replied the **2025-06-23** at **03:28:12**:<br>
> > Not them, but for 2 bpw or more you may get benefits by offloading.
> >
> > If you want to load 2-4 bpw fully on GPU without offloading, then you need 2 more 6000 Pros haha.
> >
> > It is not normal to get 12 t/s PP. How are you running the model? On that specific quant I get about 200-300 t/s PP on a 5090 (and other GPUs, offloading about 100GB to RAM).
>
> Oh, 12 t/s is generation speed; PP is 150ish. I might as well get two more Blackwells lol
>
> 👤 **Panchovix** replied the **2025-06-23** at **03:30:27**:<br>
> Oh then those 12 t/s are probably kinda normal, I get like 7-8.5 t/s haha. Not sure if there's a way to improve more TG t/s besides running fully on GPU.
>
> 👤 **saood06** replied the **2025-06-23** at **04:47:59**:<br>
> >besides running fully on GPU.
>
> A mix of IQ1_S_R4 and IQ2_KT could fit in 192 Gigs of VRAM (I think pure IQ2_KT would be too big). Some measurements of quants and PPL. https://github.com/ikawrakow/ik_llama.cpp/pull/529#issuecomment-2978837501 and https://github.com/ikawrakow/ik_llama.cpp/pull/495#issuecomment-2988574743
>
> 👤 **Ph0rk0z** replied the **2025-06-23** at **12:29:14**:<br>
> TG limited by the CPU/RAM of the system you are using if it's not fully on GPU.
>
> 👤 **ubergarm** replied the **2025-06-23** at **14:54:25**:<br>
> Heya @mtcl you selling your used 5090s now already too? haha...
>
> Check [this l1t thread for a guy offloading IQ1_S onto 2x Blackwell 6000's](https://forum.level1techs.com/t/deepseek-deep-dive-r1-at-home/225826/153) he did some llama-sweep bench with various batch sizes.
>
> And yeah as the others have said you will be limited by however much active weights are left on CPU/RAM as TG will be bottlenecked by ram bandwidth.
>
> saood06 is correct that a pure IQ2_KT is a little too big, it is like 192GiB (can't find my notes for exact value). But you could make a custom quant that is IQ1_S and IQ2_KT etc to get it down a little bit. I've had some requests for a ~196GB RAM target quant and that IQ2_KT would be pretty good if you can offload it all.
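> 
> As a rough illustration only (not a tested recipe, and assuming the `--custom-q` per-tensor override option of ik_llama.cpp's `llama-quantize`; the exact syntax may differ, so check `llama-quantize --help`), such a mix could be sketched like this, with all file names as placeholders:
> 
> ```text
> # hypothetical ~192GB-class mix: IQ2_KT base with the large down-projection experts at IQ1_S
> ./build/bin/llama-quantize --imatrix imatrix.dat \
>     --custom-q "ffn_down_exps=iq1_s,ffn_up_exps=iq2_kt,ffn_gate_exps=iq2_kt" \
>     DeepSeek-R1-BF16.gguf DeepSeek-R1-IQ2_KT-MIX.gguf IQ2_KT
> ```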
>
> 👤 **mtcl** replied the **2025-06-23** at **15:55:20**:<br>
> I might add both 5090s back on the server to load a slightly bigger model 😂 but I'm in MI this week, I'll be back home in MN on Friday. Till then I only have remote kvm access to my machine.
---
👤 **kirnat** replied the **2025-06-24** at **17:08:34**:<br>
Hi Mukul. Thanks for your helpful videos.
I just wanted to add some data points since we share the same motherboard + cpu setup.
Hardware:
- Asus Pro W790 Sage
- Intel engineering sample QYFS (Xeon 8480 equivalent)
- 8x48GB @ 4800 (Sadly memory clock is locked on the CPU, even if you can flip the switch in BIOS)
- 1x Blackwell 6000 Pro RTX

Command line options:
```text
llama-sweep-bench \
    -m ./models/unsloth/R1/DeepSeek-R1-0528-UD-Q4_K_XL-00001-of-00008.gguf \
    -fa \
    -t 52 \
    -ngl 61 \
    -ot "blk\.[0-9]\.ffn_(gate)_exps.=CPU" \
    -ot "blk\.1[0-9]\.ffn_(gate)_exps.=CPU" \
    -ot ".ffn_(up|down)_exps.=CPU" \
    -mla 1 \
    -rtr \
    -fmoe \
    -ctk q8_0 \
    -ctv q8_0 \
    -b 1024 \
    -ub 1024 \
    -c 32768
```
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 6.705 | 152.71 | 16.386 | 15.62 |
| 1024 | 256 | 1024 | 6.743 | 151.85 | 16.558 | 15.46 |
| 1024 | 256 | 2048 | 6.811 | 150.35 | 16.605 | 15.42 |
| 1024 | 256 | 3072 | 6.886 | 148.70 | 16.656 | 15.37 |
| 1024 | 256 | 4096 | 6.962 | 147.09 | 16.696 | 15.33 |
| 1024 | 256 | 5120 | 7.042 | 145.40 | 16.756 | 15.28 |
| 1024 | 256 | 6144 | 7.078 | 144.68 | 17.024 | 15.04 |
| 1024 | 256 | 7168 | 7.164 | 142.93 | 17.034 | 15.03 |
| 1024 | 256 | 8192 | 7.241 | 141.42 | 17.057 | 15.01 |
| 1024 | 256 | 9216 | 7.309 | 140.10 | 17.089 | 14.98 |
| 1024 | 256 | 10240 | 7.386 | 138.64 | 17.108 | 14.96 |
| 1024 | 256 | 11264 | 7.462 | 137.24 | 17.141 | 14.94 |
| 1024 | 256 | 12288 | 7.535 | 135.90 | 17.423 | 14.69 |
| 1024 | 256 | 13312 | 7.607 | 134.61 | 17.482 | 14.64 |
| 1024 | 256 | 14336 | 7.679 | 133.34 | 17.495 | 14.63 |
| 1024 | 256 | 15360 | 7.750 | 132.13 | 17.519 | 14.61 |
| 1024 | 256 | 16384 | 7.833 | 130.73 | 17.545 | 14.59 |
| 1024 | 256 | 17408 | 7.907 | 129.51 | 17.589 | 14.55 |
| 1024 | 256 | 18432 | 7.982 | 128.29 | 17.746 | 14.43 |
| 1024 | 256 | 19456 | 8.057 | 127.09 | 17.772 | 14.40 |
| 1024 | 256 | 20480 | 8.133 | 125.91 | 17.777 | 14.40 |
| 1024 | 256 | 21504 | 8.218 | 124.60 | 17.795 | 14.39 |
| 1024 | 256 | 22528 | 8.292 | 123.49 | 17.827 | 14.36 |
| 1024 | 256 | 23552 | 8.376 | 122.25 | 17.840 | 14.35 |
| 1024 | 256 | 24576 | 8.464 | 120.99 | 18.187 | 14.08 |
| 1024 | 256 | 25600 | 8.535 | 119.98 | 18.205 | 14.06 |
| 1024 | 256 | 26624 | 8.608 | 118.96 | 18.235 | 14.04 |
| 1024 | 256 | 27648 | 8.686 | 117.89 | 18.235 | 14.04 |
| 1024 | 256 | 28672 | 8.753 | 116.99 | 18.253 | 14.03 |
| 1024 | 256 | 29696 | 8.833 | 115.92 | 18.286 | 14.00 |
| 1024 | 256 | 30720 | 8.902 | 115.03 | 18.444 | 13.88 |
| 1024 | 256 | 31744 | 8.979 | 114.04 | 18.457 | 13.87 |
Before ik_llama.cpp had GPU support for that quantization type, I used Ubergarm's high-quality DeepSeek V3 R4 quantized model with a 4090, with all experts except the shared one on the CPU, and got about 12 output tps at <10000 context. I will try again with the latest model from Ubergarm later.
@@ -0,0 +1,43 @@
### 🗣️ [#543](https://github.com/ikawrakow/ik_llama.cpp/discussions/543) - dots.llm1 support and thanks
| **Author** | `Iconology` |
| :--- | :--- |
| **Created** | 2025-06-20 |
| **Updated** | 2025-07-03 |
---
#### Description
Hey, friend,
Out of curiosity, do you have any plans to add dots.llm1 support? The model seems interesting enough. I tried it out on mainline, but the speeds were atrocious for its size, making it unusable, at least for me. That's why I jumped over to your fork (thanks to ubergarm) for both the insane MoE speedups and for being the godfather of, arguably, the absolute SOTA quants in my eyes.
Here's the pull request from mainline for dots:
https://github.com/ggml-org/llama.cpp/commit/9ae4143bc6ecb4c2f0f0301578f619f6c201b857
---
Regardless of whether it's on your roadmap or not, I just wanted to say thank you, ikawrakow, for all that you have done and continue to do. You are one of a kind.
---
#### 🗣️ Discussion
👤 **saood06** replied the **2025-06-20** at **03:21:14**:<br>
>The model seems interesting enough.
I agree, from a quick skim of the PR code, I don't see anything that would lead to a complicated port. I could do it if no one else gets to it first.
Especially due to this part in that PR:
>The model architecture is a combination of Qwen and Deepseek parts, as
seen here:
>
>https://github.com/huggingface/transformers/blob/ffe12627b4e84489d2ab91dd0ec00614855edc79/src/transformers/models/dots1/modular_dots1.py
> 👤 **firecoperana** replied the **2025-07-02** at **22:56:45**:<br>
> @saood06 Are you working on it? If not, I can give a try.
>
> 👤 **saood06** replied the **2025-07-03** at **02:23:35**:<br>
> #573 exists now. Testing is welcome.
@@ -0,0 +1,20 @@
### 🗣️ [#545](https://github.com/ikawrakow/ik_llama.cpp/discussions/545) - Vulkan support?
| **Author** | `luckydevil13` |
| :--- | :--- |
| **Created** | 2025-06-20 |
| **Updated** | 2025-07-06 |
---
#### Description
Is it possible?
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-06** at **13:53:12**:<br>
See #562, #590
@@ -0,0 +1,310 @@
### 🗣️ [#548](https://github.com/ikawrakow/ik_llama.cpp/discussions/548) - Poor performance with bf16 model on Qwen3 30B-A3B
| **Author** | `Gaolingx` |
| :--- | :--- |
| **Created** | 2025-06-22 |
| **Updated** | 2025-07-02 |
---
#### Description
## Introduction
I tried to run the model [Qwen3-30B-A3B-GGUF](https://hf-mirror.com/unsloth/Qwen3-30B-A3B-GGUF) with ik_llama.cpp. Since I have an NVIDIA GPU (RTX 4060 Ti) with 8G VRAM in my PC, I compiled ik_llama.cpp with the CUDA backend and ran with `-ot exps=CPU` to offload the experts (ffn_down_exps, ffn_up_exps, gate_exps) to the CPU.
Build options:
```text
cmake -B build -DGGML_NATIVE=OFF -DLLAMA_BUILD_SERVER=ON -DGGML_RPC=ON -DGGML_CUDA=ON -DGGML_AVX2=ON -DGGML_AVX512=OFF -DBUILD_SHARED_LIBS=ON
```
I tested the `q8_0` and `bf16` models. With the `q8_0` model, prompt processing (PP) and token generation (TG) are very fast: I got up to 165 token/s PP and 18 token/s TG, which was a good start. But when I ran the `bf16` model, PP was much slower, just 30-40 token/s PP and 11-12 token/s TG, which is not even as good as the CPU-only ggml backend (about 51 token/s PP, 11 token/s TG). This performance is obviously not normal for the bf16 model, and it left me confused. I also found that the GPU spends quite a bit of time on copying during every prompt processing phase, whereas quantized models (like q8_0) don't have this problem.
---
### cpu backend, `bf16` model(Qwen3-30B-A3B-BF16)
![ed1da34ea56ffe9a55fdc913fa17104f](https://github.com/user-attachments/assets/7df118ce-d21a-44ff-a4ee-e906dd9e9939)
---
### cuda backend, `bf16` model(Qwen3-30B-A3B-BF16)
![image](https://github.com/user-attachments/assets/34e3fc5c-ec54-45ea-a878-3af7d1a41793)
---
### cuda backend, `q8_0` model(Qwen3-30B-A3B-Q8_0)
![d1315e282e6c9ff022d8c85f8eb13c93](https://github.com/user-attachments/assets/08114f6f-8d8a-4030-9b51-617cd255dab2)
## System Info
Here are my SystemInfo(include hardware and software)
- Hardware
- CPU: Intel(R) Xeon(R) Gold 6138 CPU @ 2.00GHz(20c, 40t) x2
- GPU: NVIDIA GeForce RTX 4060Ti 8G
- RAM: RDIMM DDR4 2666 2Rx4 32G x16(12 Channels total)
- Motherboard: Supermicro X11DPi-N
- SSD: ZHITAI TiPlus7100 1TB
- Software
- OS: Microsoft Windows 10 Pro
- BIOS: Hyper-Threading-Enable, SNC-Disable
- Model: Qwen3-235B-A22B-128K-Q8_0(unsloth/Qwen3-235B-A22B-128K-GGUF)
- ik_llama.cpp:
```text
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce RTX 4060 Ti, compute capability 8.9, VMM: yes
INFO [ main] build info | tid="54808" timestamp=1750526676 build=3761 commit="144ee1c4"
INFO [ main] system info | tid="54808" timestamp=1750526676 n_threads=16 n_threads_batch=-1 total_threads=40 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
```
## Benchmark
Here are the results of my initial llama-sweep-bench testing for PP and TG speed. The `ik_llama.cpp` llama-sweep-bench command line is:
```text
./llama-sweep-bench -m "%MODEL_PATH%" -c 16384 -t 16 -ngl 48 -fa -rtr -fmoe -ser 6,1 -ot exps=CPU
```
### ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-Q8_0)
<details>
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 48, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 2.308 | 221.84 | 6.463 | 19.81 |
| 512 | 128 | 512 | 2.437 | 210.11 | 7.741 | 16.54 |
| 512 | 128 | 1024 | 2.295 | 223.07 | 7.040 | 18.18 |
| 512 | 128 | 1536 | 2.537 | 201.81 | 7.739 | 16.54 |
| 512 | 128 | 2048 | 2.327 | 220.05 | 7.006 | 18.27 |
| 512 | 128 | 2560 | 2.523 | 202.97 | 7.766 | 16.48 |
| 512 | 128 | 3072 | 2.571 | 199.15 | 7.901 | 16.20 |
| 512 | 128 | 3584 | 2.531 | 202.26 | 7.717 | 16.59 |
| 512 | 128 | 4096 | 2.600 | 196.89 | 8.016 | 15.97 |
| 512 | 128 | 4608 | 2.602 | 196.80 | 7.962 | 16.08 |
| 512 | 128 | 5120 | 2.623 | 195.21 | 7.880 | 16.24 |
| 512 | 128 | 5632 | 2.614 | 195.86 | 8.090 | 15.82 |
| 512 | 128 | 6144 | 2.647 | 193.44 | 8.055 | 15.89 |
| 512 | 128 | 6656 | 2.647 | 193.43 | 7.963 | 16.07 |
| 512 | 128 | 7168 | 2.686 | 190.62 | 7.975 | 16.05 |
| 512 | 128 | 7680 | 2.687 | 190.54 | 8.069 | 15.86 |
| 512 | 128 | 8192 | 2.691 | 190.28 | 7.990 | 16.02 |
| 512 | 128 | 8704 | 2.713 | 188.69 | 8.030 | 15.94 |
| 512 | 128 | 9216 | 2.690 | 190.33 | 8.081 | 15.84 |
| 512 | 128 | 9728 | 2.706 | 189.24 | 8.015 | 15.97 |
| 512 | 128 | 10240 | 2.712 | 188.80 | 8.034 | 15.93 |
| 512 | 128 | 10752 | 2.777 | 184.35 | 8.097 | 15.81 |
| 512 | 128 | 11264 | 2.728 | 187.69 | 8.142 | 15.72 |
| 512 | 128 | 11776 | 2.651 | 193.15 | 8.040 | 15.92 |
| 512 | 128 | 12288 | 2.715 | 188.57 | 8.032 | 15.94 |
| 512 | 128 | 12800 | 2.727 | 187.76 | 8.091 | 15.82 |
| 512 | 128 | 13312 | 2.693 | 190.12 | 8.145 | 15.72 |
| 512 | 128 | 13824 | 2.692 | 190.22 | 8.137 | 15.73 |
| 512 | 128 | 14336 | 2.579 | 198.54 | 7.770 | 16.47 |
| 512 | 128 | 14848 | 2.688 | 190.45 | 8.211 | 15.59 |
| 512 | 128 | 15360 | 2.592 | 197.57 | 8.075 | 15.85 |
| 512 | 128 | 15872 | 2.660 | 192.47 | 8.132 | 15.74 |
</details>
### ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-BF16)
<details>
main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 48, n_threads = 16, n_threads_batch = 16
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 512 | 128 | 0 | 18.004 | 28.44 | 10.550 | 12.13 |
| 512 | 128 | 512 | 17.938 | 28.54 | 10.384 | 12.33 |
| 512 | 128 | 1024 | 17.859 | 28.67 | 10.370 | 12.34 |
| 512 | 128 | 1536 | 17.924 | 28.57 | 10.399 | 12.31 |
| 512 | 128 | 2048 | 17.989 | 28.46 | 10.386 | 12.32 |
| 512 | 128 | 2560 | 17.935 | 28.55 | 10.435 | 12.27 |
| 512 | 128 | 3072 | 18.006 | 28.44 | 10.513 | 12.18 |
| 512 | 128 | 3584 | 18.030 | 28.40 | 10.495 | 12.20 |
| 512 | 128 | 4096 | 18.063 | 28.35 | 10.578 | 12.10 |
| 512 | 128 | 4608 | 17.570 | 29.14 | 10.613 | 12.06 |
| 512 | 128 | 5120 | 17.685 | 28.95 | 10.600 | 12.08 |
| 512 | 128 | 5632 | 17.744 | 28.86 | 10.682 | 11.98 |
| 512 | 128 | 6144 | 17.911 | 28.59 | 10.640 | 12.03 |
| 512 | 128 | 6656 | 17.727 | 28.88 | 10.719 | 11.94 |
| 512 | 128 | 7168 | 17.529 | 29.21 | 10.636 | 12.03 |
| 512 | 128 | 7680 | 17.547 | 29.18 | 10.660 | 12.01 |
| 512 | 128 | 8192 | 17.517 | 29.23 | 10.708 | 11.95 |
| 512 | 128 | 8704 | 17.572 | 29.14 | 10.814 | 11.84 |
| 512 | 128 | 9216 | 17.542 | 29.19 | 10.813 | 11.84 |
| 512 | 128 | 9728 | 17.615 | 29.07 | 10.815 | 11.84 |
| 512 | 128 | 10240 | 17.573 | 29.14 | 10.839 | 11.81 |
| 512 | 128 | 10752 | 17.616 | 29.06 | 10.858 | 11.79 |
| 512 | 128 | 11264 | 17.670 | 28.98 | 10.899 | 11.74 |
| 512 | 128 | 11776 | 17.764 | 28.82 | 11.194 | 11.44 |
| 512 | 128 | 12288 | 17.622 | 29.05 | 10.960 | 11.68 |
| 512 | 128 | 12800 | 17.658 | 28.99 | 11.039 | 11.60 |
| 512 | 128 | 13312 | 17.661 | 28.99 | 11.036 | 11.60 |
| 512 | 128 | 13824 | 17.624 | 29.05 | 11.093 | 11.54 |
| 512 | 128 | 14336 | 17.587 | 29.11 | 11.094 | 11.54 |
| 512 | 128 | 14848 | 17.650 | 29.01 | 11.174 | 11.45 |
| 512 | 128 | 15360 | 17.648 | 29.01 | 11.190 | 11.44 |
| 512 | 128 | 15872 | 17.645 | 29.02 | 11.204 | 11.42 |
</details>

---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-06-22** at **15:16:00**:<br>
Don't use `-rtr` for the `bf16` model.
> 👤 **Gaolingx** replied the **2025-06-22** at **15:31:07**:<br>
> Wow, thanks a lot for your suggestion, the speed is normal now: I got ~65 t/s PP and ~11.8 t/s TG. But the CPU+CUDA (`-ot exps=CPU`) speed doesn't seem to be much faster than pure CPU, even though it is a MoE model. Maybe I should do a more detailed benchmark.
>
> 👤 **ikawrakow** replied the **2025-06-22** at **15:35:13**:<br>
> You need larger u-batch size for better PP performance. The experts are in RAM and need to be offloaded to the GPU, which takes a while. If you run `llama-sweep-bench` with `-ub 2048` you will see much better PP performance.
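> 
> For example, a minimal sketch of the same benchmark with larger batches (same flags as the command above, with `-rtr` left out for the `bf16` model):
> 
> ```text
> ./llama-sweep-bench -m "%MODEL_PATH%" -c 16384 -t 16 -ngl 48 -fa -fmoe -ser 6,1 -ot exps=CPU -b 2048 -ub 2048
> ```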
>
> 👤 **Gaolingx** replied the **2025-07-02** at **10:54:41**:<br>
> Hi, we all know that runtime repacking (`-rtr`) is good to use with hybrid GPU + CPU. According to my research over the last few days, if we don't add the `-rtr` parameter, then for long prompts the CUDA device needs to spend a long time on copying (you can see in Task Manager that the usage of 'Copy1' is quite high, while `CPU` and `CUDA` usage stays low), and prompt processing speed is also significantly lower than with the `-rtr` parameter, or even worse than CPU only. What is the reason for this?
> ![09f30b1c-8174-43f0-8b7e-113ec8bbe4dd](https://github.com/user-attachments/assets/ac09c33d-f102-4e89-8c9f-b541d562a902)
>
> 👤 **ikawrakow** replied the **2025-07-02** at **12:14:25**:<br>
> I'm not sure I understand what could be the issue from the description. Can you tell us what is the model you are using and post your command line?
>
> 👤 **Gaolingx** replied the **2025-07-02** at **14:23:58**:<br>
> > I'm not sure I understand what could be the issue from the description. Can you tell us what is the model you are using and post your command line?
>
> OK. I ran llama-sweep-bench again and tested three Qwen3 30B-A3B configurations at 16k context length: the q8_0 model with `-rtr`, the q8_0 model without `-rtr`, and the bf16 model without `-rtr`. To control the variables, I added `--no-mmap` in the test groups without `-rtr`; the rest of the startup parameters remained the same. The llama-sweep-bench startup parameters and test results are as follows.
>
> My conclusion is that, whether it is the q8_0 or the bf16 model, on my platform the prompt processing performance during CPU+GPU hybrid inference is significantly affected if the `-rtr` parameter is not used. The larger the model, the more serious this is. The token generation speed, however, is normal and in line with expectations. What causes this? How does runtime tensor repacking (`-rtr`) improve prompt processing performance?
>
> ---
> ## ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-Q8_0 with `-rtr`)
>
> <details>
> <summary>./llama-sweep-bench -m "D:\Downloads\Qwen3-30B-A3B-Q8_0.gguf" -c 16384 -t 16 -ngl 49 -fa -rtr -fmoe -ser 6,1 -ot exps=CPU</summary>
>
> main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 49, n_threads = 16, n_threads_batch = 16
>
> | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
> |-------|--------|--------|----------|----------|----------|----------|
> | 512 | 128 | 0 | 2.247 | 227.87 | 5.941 | 21.54 |
> | 512 | 128 | 512 | 2.293 | 223.28 | 6.718 | 19.05 |
> | 512 | 128 | 1024 | 2.354 | 217.46 | 6.981 | 18.34 |
> | 512 | 128 | 1536 | 2.382 | 214.94 | 7.088 | 18.06 |
> | 512 | 128 | 2048 | 2.406 | 212.81 | 7.011 | 18.26 |
> | 512 | 128 | 2560 | 2.394 | 213.84 | 7.078 | 18.09 |
> | 512 | 128 | 3072 | 2.408 | 212.61 | 7.080 | 18.08 |
> | 512 | 128 | 3584 | 2.383 | 214.83 | 7.127 | 17.96 |
> | 512 | 128 | 4096 | 2.415 | 211.97 | 7.083 | 18.07 |
> | 512 | 128 | 4608 | 2.391 | 214.12 | 7.170 | 17.85 |
> | 512 | 128 | 5120 | 2.461 | 208.03 | 7.216 | 17.74 |
> | 512 | 128 | 5632 | 2.448 | 209.11 | 7.233 | 17.70 |
> | 512 | 128 | 6144 | 2.458 | 208.31 | 7.286 | 17.57 |
> | 512 | 128 | 6656 | 2.456 | 208.48 | 7.251 | 17.65 |
> | 512 | 128 | 7168 | 2.413 | 212.17 | 7.160 | 17.88 |
> | 512 | 128 | 7680 | 2.450 | 208.98 | 7.310 | 17.51 |
> | 512 | 128 | 8192 | 2.482 | 206.26 | 7.302 | 17.53 |
> | 512 | 128 | 8704 | 2.365 | 216.50 | 7.130 | 17.95 |
> | 512 | 128 | 9216 | 2.371 | 215.94 | 7.109 | 18.01 |
> | 512 | 128 | 9728 | 2.381 | 214.99 | 7.264 | 17.62 |
> | 512 | 128 | 10240 | 2.395 | 213.81 | 7.192 | 17.80 |
> | 512 | 128 | 10752 | 2.402 | 213.16 | 7.103 | 18.02 |
> | 512 | 128 | 11264 | 2.402 | 213.18 | 7.005 | 18.27 |
> | 512 | 128 | 11776 | 2.372 | 215.87 | 7.023 | 18.22 |
> | 512 | 128 | 12288 | 2.474 | 206.92 | 6.762 | 18.93 |
> | 512 | 128 | 12800 | 2.457 | 208.42 | 6.808 | 18.80 |
> | 512 | 128 | 13312 | 2.442 | 209.64 | 6.740 | 18.99 |
> | 512 | 128 | 13824 | 2.447 | 209.22 | 6.824 | 18.76 |
> | 512 | 128 | 14336 | 2.473 | 207.03 | 6.704 | 19.09 |
> | 512 | 128 | 14848 | 2.524 | 202.86 | 6.695 | 19.12 |
> | 512 | 128 | 15360 | 2.573 | 199.00 | 7.093 | 18.05 |
> | 512 | 128 | 15872 | 2.520 | 203.17 | 6.611 | 19.36 |
>
> </details>
>
> ---
> ## ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-Q8_0 without `-rtr`)
>
> <details>
> <summary>./llama-sweep-bench -m "D:\Downloads\Qwen3-30B-A3B-Q8_0.gguf" -c 16384 -t 16 -ngl 49 -fa --no-mmap -fmoe -ser 6,1 -ot exps=CPU</summary>
>
> main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 49, n_threads = 16, n_threads_batch = 16
>
> | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
> |-------|--------|--------|----------|----------|----------|----------|
> | 512 | 128 | 0 | 9.527 | 53.74 | 6.171 | 20.74 |
> | 512 | 128 | 512 | 9.556 | 53.58 | 6.117 | 20.93 |
> | 512 | 128 | 1024 | 9.554 | 53.59 | 6.184 | 20.70 |
> | 512 | 128 | 1536 | 9.551 | 53.61 | 6.149 | 20.82 |
> | 512 | 128 | 2048 | 9.590 | 53.39 | 6.255 | 20.46 |
> | 512 | 128 | 2560 | 9.523 | 53.76 | 6.230 | 20.55 |
> | 512 | 128 | 3072 | 9.509 | 53.84 | 6.257 | 20.46 |
> | 512 | 128 | 3584 | 9.555 | 53.58 | 6.274 | 20.40 |
> | 512 | 128 | 4096 | 9.640 | 53.11 | 6.705 | 19.09 |
> | 512 | 128 | 4608 | 9.638 | 53.12 | 6.409 | 19.97 |
> | 512 | 128 | 5120 | 9.615 | 53.25 | 6.388 | 20.04 |
> | 512 | 128 | 5632 | 9.652 | 53.04 | 6.360 | 20.12 |
> | 512 | 128 | 6144 | 9.662 | 52.99 | 6.430 | 19.91 |
> | 512 | 128 | 6656 | 9.702 | 52.77 | 6.480 | 19.75 |
> | 512 | 128 | 7168 | 9.609 | 53.28 | 6.494 | 19.71 |
> | 512 | 128 | 7680 | 9.606 | 53.30 | 6.485 | 19.74 |
> | 512 | 128 | 8192 | 9.622 | 53.21 | 6.521 | 19.63 |
> | 512 | 128 | 8704 | 9.620 | 53.22 | 6.546 | 19.55 |
> | 512 | 128 | 9216 | 9.559 | 53.56 | 6.602 | 19.39 |
> | 512 | 128 | 9728 | 9.538 | 53.68 | 6.542 | 19.57 |
> | 512 | 128 | 10240 | 9.563 | 53.54 | 6.626 | 19.32 |
> | 512 | 128 | 10752 | 9.610 | 53.28 | 6.561 | 19.51 |
> | 512 | 128 | 11264 | 9.689 | 52.85 | 6.618 | 19.34 |
> | 512 | 128 | 11776 | 9.619 | 53.23 | 6.628 | 19.31 |
> | 512 | 128 | 12288 | 9.654 | 53.03 | 6.452 | 19.84 |
> | 512 | 128 | 12800 | 9.800 | 52.24 | 6.578 | 19.46 |
> | 512 | 128 | 13312 | 9.641 | 53.11 | 6.613 | 19.35 |
> | 512 | 128 | 13824 | 9.638 | 53.12 | 6.513 | 19.65 |
> | 512 | 128 | 14336 | 9.686 | 52.86 | 6.555 | 19.53 |
> | 512 | 128 | 14848 | 9.729 | 52.62 | 6.609 | 19.37 |
> | 512 | 128 | 15360 | 9.702 | 52.77 | 6.624 | 19.32 |
> | 512 | 128 | 15872 | 9.697 | 52.80 | 6.636 | 19.29 |
>
> </details>
>
> ---
> ## ik_llama.cpp CUDA backend (Model: Qwen3-30B-A3B-BF16 without `-rtr`)
>
> <details>
> <summary>./llama-sweep-bench -m "D:\Downloads\unsloth\Qwen3-30B-A3B-GGUF\BF16\Qwen3-30B-A3B-BF16-00001-of-00002.gguf" -c 16384 -t 16 -ngl 49 -fa --no-mmap -fmoe -ser 6,1 -ot exps=CPU</summary>
>
> main: n_kv_max = 16384, n_batch = 2048, n_ubatch = 512, flash_attn = 1, n_gpu_layers = 49, n_threads = 16, n_threads_batch = 16
>
> | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
> |-------|--------|--------|----------|----------|----------|----------|
> | 512 | 128 | 0 | 17.771 | 28.81 | 9.791 | 13.07 |
> | 512 | 128 | 512 | 17.398 | 29.43 | 9.025 | 14.18 |
> | 512 | 128 | 1024 | 17.305 | 29.59 | 9.030 | 14.17 |
> | 512 | 128 | 1536 | 17.367 | 29.48 | 9.054 | 14.14 |
> | 512 | 128 | 2048 | 17.859 | 28.67 | 9.342 | 13.70 |
> | 512 | 128 | 2560 | 17.700 | 28.93 | 9.143 | 14.00 |
> | 512 | 128 | 3072 | 17.696 | 28.93 | 9.170 | 13.96 |
> | 512 | 128 | 3584 | 17.939 | 28.54 | 9.241 | 13.85 |
> | 512 | 128 | 4096 | 17.926 | 28.56 | 9.212 | 13.89 |
> | 512 | 128 | 4608 | 17.714 | 28.90 | 9.280 | 13.79 |
> | 512 | 128 | 5120 | 17.822 | 28.73 | 9.226 | 13.87 |
> | 512 | 128 | 5632 | 17.830 | 28.72 | 9.273 | 13.80 |
> | 512 | 128 | 6144 | 17.749 | 28.85 | 9.121 | 14.03 |
> | 512 | 128 | 6656 | 17.581 | 29.12 | 9.356 | 13.68 |
> | 512 | 128 | 7168 | 17.517 | 29.23 | 9.401 | 13.62 |
> | 512 | 128 | 7680 | 17.408 | 29.41 | 9.393 | 13.63 |
> | 512 | 128 | 8192 | 17.451 | 29.34 | 9.371 | 13.66 |
> | 512 | 128 | 8704 | 17.409 | 29.41 | 9.544 | 13.41 |
> | 512 | 128 | 9216 | 17.443 | 29.35 | 9.476 | 13.51 |
> | 512 | 128 | 9728 | 17.449 | 29.34 | 10.037 | 12.75 |
> | 512 | 128 | 10240 | 17.370 | 29.48 | 9.480 | 13.50 |
> | 512 | 128 | 10752 | 17.472 | 29.30 | 9.504 | 13.47 |
> | 512 | 128 | 11264 | 17.612 | 29.07 | 9.500 | 13.47 |
> | 512 | 128 | 11776 | 17.492 | 29.27 | 9.580 | 13.36 |
> | 512 | 128 | 12288 | 17.384 | 29.45 | 9.569 | 13.38 |
> | 512 | 128 | 12800 | 18.000 | 28.44 | 9.436 | 13.56 |
> | 512 | 128 | 13312 | 17.759 | 28.83 | 9.493 | 13.48 |
> | 512 | 128 | 13824 | 17.905 | 28.60 | 9.442 | 13.56 |
> | 512 | 128 | 14336 | 17.843 | 28.69 | 9.372 | 13.66 |
> | 512 | 128 | 14848 | 17.928 | 28.56 | 9.538 | 13.42 |
> | 512 | 128 | 15360 | 17.902 | 28.60 | 9.436 | 13.57 |
> | 512 | 128 | 15872 | 17.971 | 28.49 | 9.336 | 13.71 |
>
> </details>
>
> 👤 **ikawrakow** replied the **2025-07-02** at **14:40:43**:<br>
> When you use `-rtr`, the tensors not offloaded to the GPU get repacked to a row-interleaved version. `Q8_0` becomes `Q8_0_R8`, and `BF16` becomes `BF16_R16`. `Q8_0_R8` and `BF16_R16` are not supported by the CUDA backend, so matrix multiplications with these tensors are done on the CPU. When you do not use `-rtr`, there is no repacking, CUDA supports `Q8_0` and `BF16`, so the tensors stored in RAM get copied to the GPU to do matrix multiplications. If the model is large, and your PCI-E is not very fast, the copying to VRAM takes a long time, so your PP performance becomes low. You can improve the performance by using larger u-batches because more work is done per copy to the GPU (tensors are copied once, but multiply 2048 tokens with `-ub 2048`. To accomplish the same with the u-batch of 512 you are using, tensors need to get copied 4 times). If you don't want to repack, and don't want to use larger u-batches, you can prevent copying to the GPU using `-op 26,0,27,0,29,0`. In that case `bf16` performance will be slightly lower than with `-rtr`, but `Q8_0` performance will be somewhere in the middle between `-rtr` and no `-rtr`.
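> 
> To illustrate the three options above as commands (flags taken from earlier in this thread; `model.gguf` is a placeholder):
> 
> ```text
> # 1) repack CPU-resident tensors so they are multiplied on the CPU
> ./llama-sweep-bench -m model.gguf -ngl 49 -fa -fmoe -ot exps=CPU -rtr
> 
> # 2) no repacking, but larger u-batches so each copy to VRAM does more work
> ./llama-sweep-bench -m model.gguf -ngl 49 -fa -fmoe -ot exps=CPU -b 2048 -ub 2048
> 
> # 3) no repacking, and keep these matrix multiplications on the CPU instead of copying tensors to the GPU
> ./llama-sweep-bench -m model.gguf -ngl 49 -fa -fmoe -ot exps=CPU -op 26,0,27,0,29,0
> ```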
@@ -0,0 +1,58 @@
### 🗣️ [#556](https://github.com/ikawrakow/ik_llama.cpp/discussions/556) - ik_llama.cpp for Armv8.0
| **Author** | `NotAHero04` |
| :--- | :--- |
| **Created** | 2025-06-25 |
| **Updated** | 2025-06-26 |
---
#### Description
I managed to port ik_llama.cpp to my phone, which has a Snapdragon 680 CPU. Although it runs under heavy emulation, it's still much faster than mainline llama.cpp. All of the tests were done using the Qwen 3 0.6B model.
![Screenshot_2025_0625_135810](https://github.com/user-attachments/assets/39bd5d8e-d1eb-4dd4-9342-888733cc8fe2)
What works:
- Quants: legacy quants (tested Q4_0, Q8_0), i-quants (IQ4_XS), k-quants (Q4_K_M), iqk-quants (IQ4_KS, IQ5_K).
- Flash attention.
![Screenshot_2025_0625_141018](https://github.com/user-attachments/assets/e31a73c5-1bf9-4bc3-bdd6-303539748765)
What doesn't work:
- Trellis quants (tested IQ4_KT), though the issue might be specific to the model or to my quantization. I'll test it more tonight.
- Repacking (both online and quantized forms, tested Q4_0_R8 and Q8_0_R8).
![Screenshot_2025_0625_141636](https://github.com/user-attachments/assets/21da3aed-d8a8-406e-82f7-ac6cef6d8a76)
If anyone is interested, I'll publish a fork. It just adds emulation for some NEON dot product and float16 arithmetic intrinsics. (mainline also has some level of emulation for v8.0)
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-06-25** at **07:52:27**:<br>
Nice 😄
Do the repacked variants not work because the emulation for `vdotq_laneq_s32` is incorrect, or is there some other issue? But I guess it may not be worth putting too much effort into this, as one would need to use `vgetq_lane_X`, which will make the dot products quite slow, I think.
---
👤 **NotAHero04** replied the **2025-06-25** at **14:37:21**:<br>
I did a fresh recompile and repacking works now! Unfortunately IQ4_KT still doesn't work :(
![Screenshot_2025_0625_213454](https://github.com/user-attachments/assets/ecdfd5e3-c7c0-41ce-affa-c35f59d68dfa)
---
👤 **ikawrakow** replied the **2025-06-25** at **15:30:22**:<br>
The `*_KT` quants are very slow on my M2-Max CPU, so it may not be worth putting the effort to make them work on a v8.0 phone.
> 👤 **NotAHero04** replied the **2025-06-26** at **09:18:15**:<br>
> So the KT quants do work after all, I just have to get the model from my PC. And yes, it is unbearably slow. (Q4_0 is 3x faster in TG)
> ![Screenshot_20250626_155507](https://github.com/user-attachments/assets/e0a54dc0-4285-470a-b333-5aba063566b0)
---
👤 **ikawrakow** replied the **2025-06-26** at **16:57:03**:<br>
Yes, the `*_kt` quants' performance is very competitive on a GPU, nearly competitive on the two `x86_64` CPUs that I have available, 2X slower than a corresponding-size quant on the M2-Max CPU, and ridiculously slow on the M2-Max GPU.
But nice you have made all this work!
@@ -0,0 +1,514 @@
### 🗣️ [#562](https://github.com/ikawrakow/ik_llama.cpp/discussions/562) - AMD GPU Vulkan & ROCm/HIP Discussion
| **Author** | `ubergarm` |
| :--- | :--- |
| **Created** | 2025-06-28 |
| **Updated** | 2025-07-06 |
---
#### Description
## Background
I've been asked a few times now about AMD GPU support with ik's fork. I recently got access to an AMD RX 7900 XTX to try it out, and as discussed in [Issue 503](https://github.com/ikawrakow/ik_llama.cpp/issues/503#issuecomment-2953557243) the Vulkan and ROCm backends are *not* the focus of this fork, hence the limited support on AMD GPU hardware.
I'm starting this discussion to have a place to point folks who might be interested in the current state of AMD GPU backend support, and especially anyone who wants to attempt updates and work on it.
## Current State
ik_llama.cpp actually *does* compile with Vulkan and can do some limited inferencing. As the backend is unmaintained, it is slower than mainline at the moment. However, I couldn't get it to compile with ROCm/HIP support. I only tried AMD's official open-source AMDVLK driver, not the community open-source RADV driver.
There is a [good benchmarking discussion on mainline](https://github.com/ggml-org/llama.cpp/discussions/10879#discussioncomment-13606581) maintained by @netrunnereve which was very helpful for establishing baseline expectations and trying to understand the various AMD GPU driver development environments.
## Benchmarks
I did a comparison between mainline llama.cpp and ik_llama.cpp at the given sha's for what I could get working.
![sweep-bench-amd-gpu-mainline-vs-ik](https://github.com/user-attachments/assets/9a9c2fcc-24db-46bb-8131-9c47fce36084)
## Methodology
To keep things somewhat consistent with the established methodologies, I used [TheBloke's now-vintage Llama-2-7B at classic Q4_0 quantization](https://huggingface.co/TheBloke/Llama-2-7B-GGUF/blob/main/llama-2-7b.Q4_0.gguf). The following shows how compilation was done, as well as how `llama-sweep-bench` was run with and without flash attention (`-fa`):
### Compiling
```bash
# compile for Vulkan
cmake -B build -DGGML_HIP=OFF -DGGML_VULKAN=1 -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
# couldn't find a combination that worked below
# compile for ROCm/HIP
export HIPCXX="$(hipconfig -l)/clang"
export HIP_PATH="$(hipconfig -R)"
#cmake -B build -DGGML_VULKAN=0 -DGGML_HIP=ON -DGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake -B build -DGGML_VULKAN=0 -DGGML_HIPBLAS=ON -DAMDGPU_TARGETS=gfx1100 -DGGML_HIP_ROCWMMA_FATTN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build --config Release -j $(nproc)
In file included from /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/fattn.cu:15:
In file included from /home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/fattn-mma-f16.cuh:3:
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/mma_new.cuh:49:27: error: use of undeclared identifier '__shfl_sync'
49 | const int ret_low = (__shfl_sync(0xFFFFFFFF, x, src_laneid_low, WARP_SIZE) >> shift_low) & 0x0000FFFF;
| ^
/home/w/projects/ik_llama.cpp/ggml/src/ggml-cuda/mma_new.cuh:50:27: error: use of undeclared identifier '__shfl_sync'
50 | const int ret_high = (__shfl_sync(0xFFFFFFFF, x, src_laneid_high, WARP_SIZE) << shift_high) & 0xFFFF0000;
| ^
4 errors generated when compiling for gfx1100.
```
#### sweep-bench
```bash
export model=/models/TheBloke/Llama-2-7B-GGUF/llama-2-7b.Q4_0.gguf
# try with and without -fa
./build/bin/llama-sweep-bench \
--model "$model" \
-fa \
-c 18432 \
-ngl 99 \
--warmup-batch \
--threads 1
```
### Observations
1. Surprisingly, Vulkan without FA managed to complete the benchmark and even gave similar performance to mainline's no-FA token generation at longer context lengths.
2. However, Vulkan with FA enabled shows very poor performance and consistently crashes at `N_KV=7680`: `iqk_fa_templates.h:1146: GGML_ASSERT(fms.S[j] > 0) failed`
3. I did *not* test any other quantizations, especially the newer ik-exclusive quants.
4. I did do a quick vibe check and confirmed the model was at least producing valid tokens. However, the chat template seemed odd (or it could be my client settings for temperature, etc.): the responses seemed wrong and contained `<|im_start|>` and `<|im_end|>` type tokens, which don't usually come back from the chat endpoint.
## Conclusion
Well, sorry if you have AMD GPU hardware and were hoping to try out the latest greatest stuff on ik's fork. You can still make use of the CPU only optimizations fwiw. You can see the relative performance of native CUDA in the linked benchmark thread for one of my other tests, and ik's fork does run faster than mainline for CUDA.
Finally, I saw [an interesting NVIDIA slide deck from the Vulkanised 2025 Developer Conference](https://vulkan.org/user/pages/09.events/vulkanised-2025/T47-Jeff-Bolz-NVIDIA.pdf) which discusses llama.cpp on pages 14 and 15, even showing what looks like [some of ik's IQ4_NL code](https://github.com/ggml-org/llama.cpp/pull/5590) with implementation discussion. I was surprised that some models benchmark faster on NVIDIA GPUs using the Vulkan backend, beating out the native CUDA implementation, but perhaps that is for another day...
Thanks and curious if anyone else has tried this or is interested in improving support here. Cheers!
---
#### 🗣️ Discussion
👤 **OneOfOne** replied the **2025-06-29** at **01:50:14**:<br>
llama.cpp's vulkan backend is faster and uses less memory on my 7900xtx as well (I'm using latest rocm on Arch so it's not that).
> 👤 **ubergarm** replied the **2025-06-29** at **14:41:47**:<br>
> Yup, this is to be expected given ik's fork prioritizes a couple CPU types and CUDA implementations and does not focus on maintaining Vulkan nor ROCm/HIP backends.
---
👤 **firecoperana** replied the **2025-06-29** at **14:50:07**:<br>
I'm working on bringing ik_llama.cpp up to date with llama.cpp's vulkan backend. It is actually easier than I expected.
> 👤 **ubergarm** replied the **2025-06-29** at **14:58:06**:<br>
> @firecoperana very cool to hear :fire: !
>
> As suggested by @0cc4m and some discussion by the author of those Vulkanised Conference PDF slides linked above, @jeffbolznv ,over on the [mainline vulkan benchmark discussion](https://github.com/ggml-org/llama.cpp/discussions/10879#discussioncomment-13606581) I might try to `pacman -Sy extra/nvidia-utils` and build the vulkan backend for my NVIDIA RTX 3090TI FE GPU and compare performance there as well.
>
> Please update us here if you have a fork/branch/PR you'd like to test and if I still have access to the AMD RX 7900 XTX I can give it a go as I'd like to use ik's SOTA quants on that machine for a fun project...
>
> 👤 **ikawrakow** replied the **2025-06-29** at **16:25:52**:<br>
> @firecoperana Great that you want to port the mainline Vulkan back-end to `ik_llama.cpp`, but are you also willing to maintain it?
>
> 👤 **firecoperana** replied the **2025-06-29** at **19:30:13**:<br>
> PR is created. Welcome to test. I can maintain it if the vulkan code there hasn't been refactored too much. With this PR, the future update should be easier too. I don't use vulkan much so need someone to remind me if there is some major improvement in vulkan that is worth porting.
>
> 👤 **ubergarm** replied the **2025-06-29** at **19:41:02**:<br>
> I'll give it a try, I just updated my home rig to latest greatest drivers (which I loathe to do but sometimes u gotta pay the piper...).
>
> Interestingly on a `Qwen3-14B-Q4_0` the Vulkan FA=1 backend beats native CUDA implementation in token generation at sufficiently deep n_kv
>
> https://github.com/ggml-org/llama.cpp/discussions/10879#discussioncomment-13611122
>
> I'll take a look at the PR now, thanks! https://github.com/ikawrakow/ik_llama.cpp/pull/563
>
> 👤 **firecoperana** replied the **2025-06-29** at **19:53:41**:<br>
> https://github.com/ggml-org/llama.cpp/pull/14366
> Vulkan also needs this one, but I couldn't port it in easily. The issue is vulkan does not have FUSED_RMS_NORM and FUSED_MUL_UNARY support, and when using RPC, it needs this. My current workaround is skip ggml_fused_rms_norm and ggml_fused_mul_unary when using vulkan. @ikawrakow
---
👤 **ikawrakow** replied the **2025-07-01** at **13:50:50**:<br>
So, what is the "approved" way of installing the necessary dependencies for Vulkan development on Ubuntu? I ended up installing LunarG VulkanSDK, but the thing almost bricked my system because I hadn't run `sudo apt update && sudo apt upgrade` before importing their repository and attempting to install. Is there a better way, preferably with just Ubuntu packages and no 3rd party stuff?
Anyhow, at the end I got the mainline Vulkan build working, but performance is very far from CUDA on my RTX-4080
### Vulkan sweep-bench, LlaMA-3.1-8B
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 0.339 | 3024.00 | 2.808 | 91.16 |
| 1024 | 256 | 1024 | 0.337 | 3035.97 | 2.709 | 94.51 |
| 1024 | 256 | 2048 | 0.328 | 3121.27 | 2.657 | 96.36 |
| 1024 | 256 | 3072 | 0.336 | 3052.01 | 2.661 | 96.19 |
| 1024 | 256 | 4096 | 0.368 | 2781.06 | 2.704 | 94.67 |
| 1024 | 256 | 5120 | 0.405 | 2531.44 | 2.794 | 91.61 |
| 1024 | 256 | 6144 | 0.465 | 2202.62 | 2.917 | 87.75 |
| 1024 | 256 | 7168 | 0.542 | 1888.01 | 3.047 | 84.00 |
| 1024 | 256 | 8192 | 0.618 | 1656.82 | 3.196 | 80.10 |
| 1024 | 256 | 9216 | 0.657 | 1559.24 | 3.283 | 77.98 |
| 1024 | 256 | 10240 | 0.695 | 1473.46 | 3.365 | 76.08 |
| 1024 | 256 | 11264 | 0.720 | 1422.92 | 3.412 | 75.02 |
| 1024 | 256 | 12288 | 0.753 | 1359.30 | 3.464 | 73.89 |
| 1024 | 256 | 13312 | 0.792 | 1293.13 | 3.523 | 72.67 |
| 1024 | 256 | 14336 | 0.814 | 1257.77 | 3.588 | 71.35 |
| 1024 | 256 | 15360 | 0.858 | 1192.89 | 3.625 | 70.63 |
### CUDA sweep-bench, LlaMA-3.1-8B
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 1024 | 256 | 0 | 0.134 | 7649.04 | 2.018 | 126.88 |
| 1024 | 256 | 1024 | 0.129 | 7921.34 | 2.105 | 121.63 |
| 1024 | 256 | 2048 | 0.135 | 7561.83 | 2.170 | 117.99 |
| 1024 | 256 | 3072 | 0.144 | 7121.15 | 2.226 | 114.99 |
| 1024 | 256 | 4096 | 0.151 | 6784.15 | 2.292 | 111.71 |
| 1024 | 256 | 5120 | 0.159 | 6460.57 | 2.354 | 108.75 |
| 1024 | 256 | 6144 | 0.164 | 6225.61 | 2.423 | 105.66 |
| 1024 | 256 | 7168 | 0.172 | 5961.15 | 2.484 | 103.05 |
| 1024 | 256 | 8192 | 0.183 | 5606.81 | 2.545 | 100.61 |
| 1024 | 256 | 9216 | 0.194 | 5289.56 | 2.604 | 98.31 |
| 1024 | 256 | 10240 | 0.195 | 5239.75 | 2.662 | 96.15 |
| 1024 | 256 | 11264 | 0.206 | 4962.13 | 2.731 | 93.72 |
| 1024 | 256 | 12288 | 0.214 | 4777.95 | 2.787 | 91.85 |
| 1024 | 256 | 13312 | 0.217 | 4725.71 | 2.845 | 89.97 |
| 1024 | 256 | 14336 | 0.230 | 4454.44 | 2.919 | 87.71 |
| 1024 | 256 | 15360 | 0.238 | 4311.56 | 2.966 | 86.30 |
So, PP is 3X lower, TG is 20-25% lower.
Given this, does it make sense to spend time on Vulkan? When I forked `llama.cpp` last year the Vulkan stuff was mostly a gimmick, with performance not much better than just running on a moderately fast CPU. They have done a lot of Vulkan development and performance improvements in mainline since then, but it still seems way too far behind.
> 👤 **jeffbolznv** replied the **2025-07-01** at **14:08:19**:<br>
> Installing the Vulkan SDK is the "right" way to get the dependencies. The pp scores shouldn't be that low, it suggests cooperative matrix isn't getting used. What driver version are you using? Can you share the beginning of the log where ggml-vulkan prints device info?
>
> 👤 **ubergarm** replied the **2025-07-01** at **19:20:24**:<br>
> > Given this, does it make sense to spend time on Vulkan?
>
> Personally, the two things I see Vulkan back-end support providing are:
> 1. A path allowing AMD GPUs to be used e.g. RX 7900 XTX 24GB VRAM
> 2. Potentially faster NVIDIA path for some situations/models (this was news to me).
>
> This Qwen3-14B-Q4_0 dense sweep-bench I ran a couple days ago opened my eyes where the vulkan backend on mainline took the lead on TG after about ~8k depth. `NV_coopmat2` [is described in @jeffbolznv recent Vulkanised 2025 slides](https://vulkan.org/user/pages/09.events/vulkanised-2025/T47-Jeff-Bolz-NVIDIA.pdf).
>
> ![sweep-bench-llama-vs-ik-vulkan-qwen3-14b](https://github.com/user-attachments/assets/bc0d855e-5640-45df-bbb0-82e4d048c49c)
>
> Otherwise ik CUDA is generally the fastest. I haven't tested other models/configs but likely vulkan takes the lead in other situations reading the benchmarks in the slides.
>
> However, I also don't want to distract ik whatever optimizations and experiments are most interesting and intrinsically motivating. So nice to see a few folks from the community possibly providing some support. Big thanks @firecoperana for taking a stab at it on https://github.com/ikawrakow/ik_llama.cpp/pull/563
>
> Thanks!
---
👤 **ikawrakow** replied the **2025-07-01** at **14:11:22**:<br>
```code
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 4080 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
build: 5781 (ba3ef86c5) with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
llama_model_load_from_file_impl: using device Vulkan0 (NVIDIA GeForce RTX 4080) - 16376 MiB free
llama_model_loader: loaded meta data with 29 key-value pairs and 292 tensors from ../ncuda/junk.bin (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = llama
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv 3: general.finetune str = Instruct
llama_model_loader: - kv 4: general.basename str = Meta-Llama-3.1
llama_model_loader: - kv 5: general.size_label str = 8B
llama_model_loader: - kv 6: general.license str = llama3.1
llama_model_loader: - kv 7: general.tags arr[str,6] = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv 8: general.languages arr[str,8] = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv 9: llama.block_count u32 = 32
llama_model_loader: - kv 10: llama.context_length u32 = 131072
llama_model_loader: - kv 11: llama.embedding_length u32 = 4096
llama_model_loader: - kv 12: llama.feed_forward_length u32 = 14336
llama_model_loader: - kv 13: llama.attention.head_count u32 = 32
llama_model_loader: - kv 14: llama.attention.head_count_kv u32 = 8
llama_model_loader: - kv 15: llama.rope.freq_base f32 = 500000.000000
llama_model_loader: - kv 16: llama.attention.layer_norm_rms_epsilon f32 = 0.000010
llama_model_loader: - kv 17: llama.vocab_size u32 = 128256
llama_model_loader: - kv 18: llama.rope.dimension_count u32 = 128
llama_model_loader: - kv 19: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 20: tokenizer.ggml.pre str = llama-bpe
llama_model_loader: - kv 21: tokenizer.ggml.tokens arr[str,128256] = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv 22: tokenizer.ggml.token_type arr[i32,128256] = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 23: tokenizer.ggml.merges arr[str,280147] = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv 24: tokenizer.ggml.bos_token_id u32 = 128000
llama_model_loader: - kv 25: tokenizer.ggml.eos_token_id u32 = 128009
llama_model_loader: - kv 26: tokenizer.chat_template str = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv 27: general.quantization_version u32 = 2
llama_model_loader: - kv 28: general.file_type u32 = 2
llama_model_loader: - type f32: 66 tensors
llama_model_loader: - type q4_0: 225 tensors
llama_model_loader: - type q6_K: 1 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type = Q4_0
print_info: file size = 4.33 GiB (4.64 BPW)
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch = llama
print_info: vocab_only = 0
print_info: n_ctx_train = 131072
print_info: n_embd = 4096
print_info: n_layer = 32
print_info: n_head = 32
print_info: n_head_kv = 8
print_info: n_rot = 128
print_info: n_swa = 0
print_info: is_swa_any = 0
print_info: n_embd_head_k = 128
print_info: n_embd_head_v = 128
print_info: n_gqa = 4
print_info: n_embd_k_gqa = 1024
print_info: n_embd_v_gqa = 1024
print_info: f_norm_eps = 0.0e+00
print_info: f_norm_rms_eps = 1.0e-05
print_info: f_clamp_kqv = 0.0e+00
print_info: f_max_alibi_bias = 0.0e+00
print_info: f_logit_scale = 0.0e+00
print_info: f_attn_scale = 0.0e+00
print_info: n_ff = 14336
print_info: n_expert = 0
print_info: n_expert_used = 0
print_info: causal attn = 1
print_info: pooling type = 0
print_info: rope type = 0
print_info: rope scaling = linear
print_info: freq_base_train = 500000.0
print_info: freq_scale_train = 1
print_info: n_ctx_orig_yarn = 131072
print_info: rope_finetuned = unknown
print_info: ssm_d_conv = 0
print_info: ssm_d_inner = 0
print_info: ssm_d_state = 0
print_info: ssm_dt_rank = 0
print_info: ssm_dt_b_c_rms = 0
print_info: model type = 8B
print_info: model params = 8.03 B
print_info: general.name = Meta Llama 3.1 8B Instruct
print_info: vocab type = BPE
print_info: n_vocab = 128256
print_info: n_merges = 280147
print_info: BOS token = 128000 '<|begin_of_text|>'
print_info: EOS token = 128009 '<|eot_id|>'
print_info: EOT token = 128009 '<|eot_id|>'
print_info: EOM token = 128008 '<|eom_id|>'
print_info: LF token = 198 'Ċ'
print_info: EOG token = 128001 '<|end_of_text|>'
print_info: EOG token = 128008 '<|eom_id|>'
print_info: EOG token = 128009 '<|eot_id|>'
print_info: max token length = 256
load_tensors: loading model tensors, this can take a while... (mmap = true)
load_tensors: offloading 32 repeating layers to GPU
load_tensors: offloading output layer to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors: Vulkan0 model buffer size = 4155.99 MiB
load_tensors: CPU_Mapped model buffer size = 281.81 MiB
......................................................................................
llama_context: constructing llama_context
llama_context: n_seq_max = 1
llama_context: n_ctx = 16384
llama_context: n_ctx_per_seq = 16384
llama_context: n_batch = 2048
llama_context: n_ubatch = 1024
llama_context: causal_attn = 1
llama_context: flash_attn = 1
llama_context: freq_base = 500000.0
llama_context: freq_scale = 1
llama_context: n_ctx_per_seq (16384) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
llama_context: Vulkan_Host output buffer size = 0.49 MiB
llama_kv_cache_unified: Vulkan0 KV buffer size = 2048.00 MiB
llama_kv_cache_unified: size = 2048.00 MiB ( 16384 cells, 32 layers, 1 seqs), K (f16): 1024.00 MiB, V (f16): 1024.00 MiB
llama_context: Vulkan0 compute buffer size = 613.01 MiB
llama_context: Vulkan_Host compute buffer size = 80.01 MiB
llama_context: graph nodes = 999
llama_context: graph splits = 2
```
---
👤 **ikawrakow** replied the **2025-07-01** at **14:19:04**:<br>
@jeffbolznv Thank you for chiming in. Above is the log. Is there something additional I need to do to improve performance? I did
```
mkdir vulkan && cd vulkan
cmake .. -DGGML_VULKAN=ON -DGGML_CUDA=OFF
make -j
```
> 👤 **jeffbolznv** replied the **2025-07-01** at **14:43:13**:<br>
> Is it a release build? I can't tell.
>
> You'd probably get a boost from a newer driver (to enable coopmat2), but the pp numbers seem slow for coopmat1.
>
> 👤 **ikawrakow** replied the **2025-07-01** at **14:54:36**:<br>
> Yes, this is a release build. @ubergarm is getting in the range of 3000 t/s for LlaMA-7B on his RX 7900 XTX, so same ball park.
---
👤 **jeffbolznv** replied the **2025-07-01** at **14:53:29**:<br>
What's the llama-bench equivalent of the `N_KV` column in that table? Is it `-d`? I see a big difference between coopmat1 and coopmat2 with large depth.
> 👤 **ikawrakow** replied the **2025-07-01** at **15:00:56**:<br>
> I haven't looked into mainline `llama.cpp`, but the `sweep-bench` here adds `N_KV` tokens to the KV cache, then runs a batch of a given size (1024 tokens in the above example) and generates a given number of new tokens (256 in the example). Time is measured for both, and the resulting tokens/second is printed. The KV cache is increased gradually in a sweep, which corresponds to the typical experience of a user interacting with an LLM. I don't know what the `-d` option in mainline does (I think it is a relatively recent addition), which is why I have a port of `sweep-bench` to mainline `llama.cpp`, to be able to run direct (and more meaningful) comparisons than `-p 512` or `-n 128`.
>
> 👤 **jeffbolznv** replied the **2025-07-01** at **15:14:47**:<br>
> OK. I think these are basically the same parameter.
>
> I see much better (>2x) performance for large KV with coopmat2, and I think this is because it's doing more rows at a time (64 vs 16). It might be possible to improve this for the coopmat1 path, but it may start running into register limits, hard to say. For an NV GPU, you should just update to a recent driver (r575) and you'll get the improved performance automatically.
>
> 👤 **ikawrakow** replied the **2025-07-01** at **15:28:31**:<br>
> > you should just update to a recent driver (r575) and you'll get the improved performance automatically.
>
> You mean the Nvidia driver?
> I'm on `560.35.03` and reluctant to update as the machine I'm working on is remote.
>
> But IIRC, you have an RTX-4070. Can you post a comparison between CUDA and Vulkan on your GPU?
>
> 👤 **jeffbolznv** replied the **2025-07-01** at **16:12:50**:<br>
> I recently got a 5090, so the 4070 is no longer in my system. Here's what I'm seeing for coopmat2, coopmat1, and CUDA.
>
> ```
> Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa 1 -n 0 -p 1024 --prio 1 -r 1 -d 1024-15360+1024
> ggml_vulkan: Found 1 Vulkan devices:
> ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
> | model | size | params | backend | ngl | fa | test | t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d1024 | 10616.78 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d2048 | 9960.08 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d3072 | 9841.83 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d4096 | 9479.70 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d5120 | 9019.58 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d6144 | 8337.62 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d7168 | 8149.66 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d8192 | 7892.09 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d9216 | 7678.50 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d10240 | 7396.89 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d11264 | 7160.86 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d12288 | 6865.95 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d13312 | 6660.70 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d14336 | 6481.23 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d15360 | 6240.57 ± 0.00 |
>
> Z:\github\jeffbolznv\llama.cpp\build\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa 1 -n 0 -p 1024 --prio 1 -r 1 -d 1024-15360+1024
> ggml_vulkan: Found 1 Vulkan devices:
> ggml_vulkan: 0 = NVIDIA GeForce RTX 5090 (NVIDIA) | uma: 0 | fp16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: KHR_coopmat
> | model | size | params | backend | ngl | fa | test | t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d1024 | 6484.20 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d2048 | 5791.34 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d3072 | 5398.55 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d4096 | 4879.42 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d5120 | 4477.92 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d6144 | 4112.65 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d7168 | 3902.24 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d8192 | 3651.50 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d9216 | 3420.07 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d10240 | 3236.93 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d11264 | 3061.68 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d12288 | 2896.88 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d13312 | 2734.89 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d14336 | 2624.02 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | Vulkan | 99 | 1 | pp1024 @ d15360 | 2496.16 ± 0.00 |
>
> Z:\github\jeffbolznv\llama.cpp\buildcuda\bin\RelWithDebInfo>llama-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa 1 -n 0 -p 1024 --prio 1 -r 1 -d 1024-15360+1024
> ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
> ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
> ggml_cuda_init: found 1 CUDA devices:
> Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
> | model | size | params | backend | ngl | fa | test | t/s |
> | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d1024 | 12854.24 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d2048 | 12101.30 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d3072 | 11831.37 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d4096 | 11467.68 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d5120 | 11072.99 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d6144 | 10646.26 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d7168 | 10287.17 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d8192 | 9873.84 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d9216 | 9688.37 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d10240 | 9373.99 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d11264 | 9117.66 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d12288 | 8706.74 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d13312 | 8635.61 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d14336 | 8351.58 ± 0.00 |
> | llama 8B Q4_0 | 5.61 GiB | 8.03 B | CUDA | 99 | 1 | pp1024 @ d15360 | 8134.32 ± 0.00 |
> ```
>
> > You mean the Nvidia driver?
> > I'm on 560.35.03 and reluctant to update as the machine I'm working on is remote.
>
> Yeah, r575 has coopmat2 support.
>
> 👤 **ikawrakow** replied the **2025-07-01** at **16:50:49**:<br>
> OK, thanks, this looks much better.
>
> 👤 **ubergarm** replied the **2025-07-01** at **19:41:16**:<br>
> @jeffbolznv Thanks for the benchmarks. I'm curious how Vulkan coopmat2 is looking for TG. On the slightly larger Qwen3-14B-Q4_0 I mentioned above how it is actually faster than CUDA on my 3090TI FE for larger kv depths.
>
> If you are interested, here is one way to use llama-sweep-bench on mainline llama.cpp for comparisons. I just updated my fork/branch to llama.cpp tip of master@de5694414
>
> ```bash
> cd llama.cpp
> git remote add ubergarm git@github.com:ubergarm/llama.cpp.git
> git fetch ubergarm
> git checkout ug/port-sweep-bench
> # compile as usual for CUDA/Vulkan Release
> # it runs basically like llama-server with similar argument style
> # this might work on your windows box:
> llama-sweep-bench -m C:\models\meta-llama-3.1-8b-instruct-q4_0.gguf -fa -c 8192 -ngl 99 -t 1
> ```
>
> 👤 **jeffbolznv** replied the **2025-07-01** at **19:53:41**:<br>
> coopmat2 mostly isn't used for tg, but if there's grouped query attention then it may be used for the flash attention shader. It's nice/surprising to see vulkan pull ahead for larger KV. I suspect the Vulkan driver still has some small launch overhead relative to CUDA that hurts at smaller sizes, but I'm not really sure.
>
> 👤 **ikawrakow** replied the **2025-07-02** at **06:28:28**:<br>
> @jeffbolznv
>
> Once you are here, may I ask why flash attention for DeepSeek is not implemented in the `llama.cpp` Vulkan backend? Is it just that nobody has gotten around to it, or are there issues of principle? The most efficient FA implementation requires k-head = 192, v-head = 128 for prompt processing, and k-head = 576, v-head = 512 for token generation.
>
> 👤 **jeffbolznv** replied the **2025-07-02** at **12:51:50**:<br>
> Just nobody has done it yet. I don't think I've seen a version of the model that would even come close to fitting on my GPU. I suppose I could implement it just from the backend tests, but it would be nice to be able to perf test it.
>
> 👤 **ikawrakow** replied the **2025-07-02** at **13:04:19**:<br>
> Here is a 16B parameter MoE model that easily fits in your 5090 with VRAM to spare that uses the exact same attention mechanism as DeepSeek-V3/R1: https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite (except that it has 16 instead of 128 heads). I think this is what Johannes used for testing when he implemented the k-head-size != v-head-size FA in the `llama.cpp` CUDA backend. I did have a partial implementation here using this model quite a bit earlier than mainline (the part for k-head=192, v-head=128), but I was struggling to get a performant implementation for the k-head=576, v-head=512 case, so that's why I asked whether there are any issues of principle with the Vulkan implementation.
>
> 👤 **jeffbolznv** replied the **2025-07-02** at **13:10:56**:<br>
> I thought deepseek v2 was already accelerated and it was only deepseek R1 that uses the large/mixed head sizes?
>
> 👤 **ikawrakow** replied the **2025-07-02** at **13:13:09**:<br>
> Well, get the model and see what happens.
>
> 👤 **jeffbolznv** replied the **2025-07-02** at **13:52:10**:<br>
> OK, I do see FA falling back to CPU with it.
>
> 👤 **jeffbolznv** replied the **2025-07-02** at **20:14:58**:<br>
> I added support for these head sizes in https://github.com/ggml-org/llama.cpp/pull/14509. Performance is tolerable with the coopmat2 shader but very slow for coopmat1/scalar. I'm sure there's some room for tuning.
---
👤 **ikawrakow** replied the **2025-07-02** at **06:16:07**:<br>
> Personally, the two things I see Vulkan back-end support providing are:
>
> A path allowing AMD GPUs to be used e.g. RX 7900 XTX 24GB VRAM

But a port of the mainline Vulkan back-end to `ik_llama.cpp` without the additions that make `ik_llama.cpp` faster for CUDA and CPU inference has zero benefits. People can simply use `llama.cpp` with their AMD GPUs.
> 👤 **firecoperana** replied the **2025-07-02** at **14:32:39**:<br>
> Another benefit is for people who have both Nvidia and AMD or even Intel GPUs. They can use RPC to load different backends, or just use Vulkan on the non-CUDA GPU to offload more weights to VRAM.
>
> 👤 **ikawrakow** replied the **2025-07-02** at **14:43:52**:<br>
> > Another benefit is for people who have both Nvidia and AMD or even Intel GPUs. They can use RPC to load different backends, or just use Vulkan on the non-CUDA GPU to offload more weights to VRAM.
>
> They already have this with `llama.cpp`. What does `ik_llama.cpp` without the additions implemented for Vulkan give them that they don't already have with `llama.cpp`?
>
> 👤 **firecoperana** replied the **2025-07-02** at **15:38:13**:<br>
> One major thing I can think of is MLA support for old quants of DeepSeek V2.5 and V3 models. And if someone is already using ik_llama.cpp, adding an AMD GPU that was not usable before can offer a further speed boost.
---
👤 **ikawrakow** replied the **2025-07-06** at **13:41:44**:<br>
So, the Vulkan back-end is usable, and performance is better than `llama.cpp` (see, e.g., PR #584, which has a comparison for a MoE model). But compared to CUDA on the same GPU, performance is much lower, especially for MoE models (and most users appear to be using `ik_llama.cpp` exactly for one of the giant MoE models). I have mixed feelings about how to proceed:
* There is much more performance optimization potential in the Vulkan back-end compared to CUDA or CPU. So, from that point of view it seems worthwhile to put some effort into optimizing the Vulkan back-end
* I know nothing about Vulkan programming in general or the `llama.cpp` Vulkan back-end in particular, hence, at least initially, it will be an uphill battle. Without a significant interest from the user base, I don't feel particularly motivated to do this to myself.

So, if you feel that Vulkan performance improvement in `ik_llama.cpp` is important, go to discussion #590 and vote!

View File

@@ -0,0 +1,28 @@
### 🗣️ [#564](https://github.com/ikawrakow/ik_llama.cpp/discussions/564) - Maybe an interesting CUDA PR here.
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2025-06-29 |
| **Updated** | 2025-07-01 |
---
#### Description
Title : Overlap CUDA graph building and processing to minimize GPU idle time and improve tokens per seconds performance.
#11867
Link : https://github.com/ggml-org/llama.cpp/pull/11867
Author : @Aendk
Use : a few % boost on Cuda PP and TG?
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-01** at **13:56:23**:<br>
Yes, I saw this PR. But to quote Diego's statement in the PR discussion
> I still think that this change adds a significant amount of complexity, to code that is already too fragile and complex to reasonably maintain.
I fully agree with that. The back-end is really fragile, so performance gains must be way more than 2-3% to warrant a change such as that one.

View File

@@ -0,0 +1,83 @@
### 🗣️ [#586](https://github.com/ikawrakow/ik_llama.cpp/discussions/586) - Slow KV cache rm operation
| **Author** | `jneloexpirements` |
| :--- | :--- |
| **Created** | 2025-07-05 |
| **Updated** | 2025-07-05 |
---
#### Description
Is this related to #451 ?
I am running the DeepSeek-R1-V3-0324-IQ4_K_R4 (ubergarm's Q4) quant, and the token generation is decent (I have seen 12 tps at zero depth, dropping to around 66% of that as the context fills).
I use an Intel Xeon QYFS, 512GB of DDR5-4800 RAM, and an RTX PRO 6000.
I run the command below; for real use cases I change it from sweep-bench to server with host/port.
```
CUDA_VISIBLE_DEVICES="0," \
./build/bin/llama-sweep-bench \
--model /mnt/x/models/ubergarm/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ4_K_R4-00001-of-00010.gguf \
--alias ubergarm/DeepSeek-R1-V3-0324-IQ4_K_R4 \
--ctx-size 98304 \
-ctk q8_0 \
-mla 3 -fa \
-amb 8192 \
-fmoe \
--temp 0.3 \
--min-p 0.05 \
--n-gpu-layers 63 \
-ot "blk\.[3-9]\.ffn_.*=CUDA0" \
-ot exps=CPU \
-ub 8192 -b 8192 \
--parallel 1 \
--threads 57
```
The above command puts VRAM usage to 90376 out of 97887 MiB.
```
....................................................................................................
llama_new_context_with_model: n_ctx = 98304
llama_new_context_with_model: n_batch = 8192
llama_new_context_with_model: n_ubatch = 8192
llama_new_context_with_model: flash_attn = 1
llama_new_context_with_model: mla_attn = 3
llama_new_context_with_model: attn_max_b = 8192
llama_new_context_with_model: fused_moe = 1
llama_new_context_with_model: ser = -1, 0
llama_new_context_with_model: freq_base = 10000.0
llama_new_context_with_model: freq_scale = 0.025
llama_kv_cache_init: CUDA0 KV buffer size = 3499.90 MiB
llama_new_context_with_model: KV self size = 3499.88 MiB, c^KV (q8_0): 3499.88 MiB, kv^T: not used
llama_new_context_with_model: CUDA_Host output buffer size = 0.49 MiB
ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument
llama_new_context_with_model: CUDA0 compute buffer size = 20496.03 MiB
llama_new_context_with_model: CUDA_Host compute buffer size = 3296.09 MiB
llama_new_context_with_model: graph nodes = 4219
llama_new_context_with_model: graph splits = 104
```
The raw PP from sweep-bench seems proper and not unusually slow (in this example and also in past runs):
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 8192 | 2048 | 0 | 65.721 | 124.65 | 173.995 | 11.77 |
| 8192 | 2048 | 8192 | 69.385 | 118.07 | 190.416 | 10.76 |
| 8192 | 2048 | 16384 | 73.025 | 112.18 | 199.023 | 10.29 |
| 8192 | 2048 | 24576 | 76.688 | 106.82 | 204.607 | 10.01 |
| 8192 | 2048 | 32768 | 79.945 | 102.47 | 208.366 | 9.83 |
I can tolerate the TG but...
In real use cases, however, which are RAG-heavy (feeding it long documents, then chatting about them for a while, plus web search), and since I like to flip-flop between conversations, I have to wait 2-5 minutes for KV cache removal.
```
INFO [ update_slots] kv cache rm [p0, end) | tid="125357154684928" timestamp=1751624758 id_slot=0 id_task=12104 p0=8410
INFO [ print_timings] prompt eval time = 128443.90 ms / 10172 tokens ( 12.63 ms per token, 79.19 tokens per second) | timestamp=1751624830 id_slot=0 id_task=12104 t_prompt_processing=128443.905 n_prompt_tokens_processed=10172 t_token=12.627202615021627 n_tokens_second=79.19410422783393
INFO [ print_timings] generation eval time = 10688.65 ms / 122 runs ( 87.61 ms per token, 11.41 tokens per second) | timestamp=1751624830 id_slot=0 id_task=12104 t_token_generation=10688.646 n_decoded=122 t_token=87.6118524590164 n_tokens_second=11.413980779230597
```
The time it took for KV removal was around 3 minutes, which is IMO too slow. Even at 8192, and trying 4096, 2048, or any other value, KV removal is just too slow.
1. Does `ggml_cuda_host_malloc: failed to allocate 3296.09 MiB of pinned memory: invalid argument` have anything to do with that? How to fix this problem?
2. Is an S_PP of 60-120 t/s for a 4096/8192 batch expected for systems that offload the dense layers to the GPU and the experts to the CPU?
3. Is KV removal operation tied to PP or is it a separate thing?
Any help is appreciated so that I can mitigate before-generation slowdowns

View File

@@ -0,0 +1,431 @@
### 🗣️ [#590](https://github.com/ikawrakow/ik_llama.cpp/discussions/590) - How important is Vulkan back-end development?
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2025-07-06 |
| **Updated** | 2025-07-18 |
---
#### Description
The Vulkan back-end in `ik_llama.cpp` is now usable, and performance is better than `llama.cpp` (see, e.g., PR #584, which has a comparison for a MoE model). But compared to CUDA on the same GPU, performance is much lower, especially for MoE models (and most users appear to be using `ik_llama.cpp` exactly for one of the giant MoE models). I have mixed feelings about how to proceed:
* There is much more performance optimization potential in the Vulkan back-end compared to CUDA or CPU. So, from that point of view it seems worthwhile to put some effort into optimizing the Vulkan back-end
* I know nothing about Vulkan programming in general or the `llama.cpp` Vulkan back-end in particular, hence, at least initially, it will be an uphill battle. Without a significant interest from the user base, I don't feel particularly motivated to do this to myself.
---
#### 🗣️ Discussion
👤 **OneOfOne** replied the **2025-07-06** at **16:55:32**:<br>
On AMD, vulkan is faster and more memory efficient than rocm.
---
👤 **mcm007** replied the **2025-07-06** at **18:25:18**:<br>
Currently, owners of Nvidia GPUs have access to a wide range of inference engines (e.g., vllm, exllama, sglang, mlc, aphrodite-engine) that are optimized for CUDA. This allows them to fully utilize their hardware, which is great.
In contrast, Vulkan support could provide significant benefits to users of AMD and Intel GPUs, which currently have less mature tooling and support.
AMD appears not so friendly toward regular consumers, e.g. AMD ROCm barely supports their top GPUs.
The recent Vulkan improvements by jeffbolznv on mainline llama.cpp favor Nvidia GPUs, since he appears to come from an Nvidia background.
It's a pity that we don't see AMD people providing some support... just enough to be noticed.
As much I don't like Nvidia I swapped my new 7900XTX for a used 3090.
Also, with Vulkan support it would be possible to run the fast `ik_llama.cpp` on devices like an Intel iGPU or Ryzen 3400G APU, using `KS` quants, re-use the quantized files, etc.
I want to acknowledge the effort and quality of your work, therefore whatever you choose (improve speed, quants quality, Vulkan, features, ...) doesn't matter at the end: they will benefit us, users/community.
> 👤 **saood06** replied the **2025-07-06** at **23:34:55**:<br>
> >Currently, owners of Nvidia GPUs have access to a wide range of inference engines (e.g., vllm, exllama, sglang, mlc, aphrodite-engine) that are optimized for CUDA. This allows them to fully utilize their hardware, which is great.
>
> All of the ones you list do offer some form of AMD support: [vllm](https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html), exllama V2 with [builds](https://github.com/turboderp-org/exllamav2/releases/tag/v0.3.1) for rocm and plans for it in v3, [sglang](https://docs.sglang.ai/references/amd.html), [mlc](https://github.com/mlc-ai/mlc-llm) table shows both Vulkan and ROCm support, [aphrodite-engine](https://aphrodite.pygmalion.chat/installation/installation-rocm/).
>
> > As much I don't like Nvidia I swapped my new 7900XTX for a used 3090.
>
> To be transparent, I don't own a modern AMD card, and I do own a 3090, so I have no personal experience using ROCm. But at least it looks like there is support for AMD to some degree in everything you listed.
>
> >Ryzen 3400G APU, using KS quants, re-use the quantized files, etc.
>
> But I have owned and used 3400G (upgraded past it). I'm not sure if the iGPU would be better (or at least better enough to matter) than the AVX2 CPU backend, what I miss about the iGPU, is that it lets you run without discrete GPU (or fully passing it through to a VM).
>
> 👤 **mcm007** replied the **2025-07-07** at **05:45:28**:<br>
> > All of the ones you list do offer some form of AMD support: [vllm](https://docs.vllm.ai/en/v0.6.3/getting_started/amd-installation.html), exllama V2 with [builds](https://github.com/turboderp-org/exllamav2/releases/tag/v0.3.1) for rocm and plans for it in v3, [sglang](https://docs.sglang.ai/references/amd.html), [mlc](https://github.com/mlc-ai/mlc-llm) table shows both Vulkan and ROCm support, [aphrodite-engine](https://aphrodite.pygmalion.chat/installation/installation-rocm/).
>
> Usually, support is not complete and misses features or optimizations like FA, supporting all quants, and quantized cache. :disappointed:
>
> > But I have owned and used 3400G (upgraded past it). I'm not sure if the iGPU would be better (or at least better enough to matter) than the AVX2 CPU backend, what I miss about the iGPU, is that it lets you run without discrete GPU (or fully passing it through to a VM).
>
> Since IK created `-rtr`, or with the recent on-the-fly repacks #531, #533, #534, PP performance has skyrocketed, making the CPU viable for small models on simple tasks :smile:.
> Indeed, the extra performance added by iGPU part doesn't seem worth the effort, but for models small enough to fit in the default 2GB* allocated memory, the sweep-bench looks incredible on the Vulkan build:
> ![performance_comparison_pp](https://github.com/user-attachments/assets/4a10476e-b9cf-47a3-abbf-06a6bf92444d)
> ![performance_comparison_tg](https://github.com/user-attachments/assets/8a9746e6-7dcd-4dcf-a19e-54a1b14b2f10)
>
> * There is a way to increase the memory allocated to the iGPU [Smokeless_UMAF](https://github.com/DavidS95/Smokeless_UMAF) but it's a bit of a hassle - one needs to boot from the modified BIOS every time and make the modification.
>
> 👤 **saood06** replied the **2025-07-07** at **06:09:54**:<br>
> > Usually, support is not complete and misses features or optimizations like FA, supporting all quants, and quantized cache. 😞
>
> I did look into the state of flash attention support for ROCm and it did seem like they are working on it with things like paged attention not fully there yet.
>
> Like I said I don't have personal experience so I don't know what the experience is like, just thought it should be mentioned that they all do seem like they do support the hardware (to some level).
>
> > Since IK created `-rtr`, or with the recent on-the-fly repacks #531, #533, #534, PP performance has skyrocketed, making the CPU viable for small models on simple tasks 😄. Indeed, the extra performance added by iGPU part doesn't seem worth the effort.
>
> Yeah.
>
> >the sweep-bench looks incredible on the Vulkan build
>
> Thanks for the graphs. I'd be curious about peak batched performance comparisons (I never got around to adding a plot tool to `batched-bench`)
>
> >There is a way to increase the memory allocated to the iGPU [Smokeless_UMAF](https://github.com/DavidS95/Smokeless_UMAF) but it's a bit of a hassle - one needs to boot from the modified BIOS every time and make the modification.
>
> That is interesting to hear for if I ever use that CPU again (but if I do use it, I'm not sure if I'd want to allocate more or less VRAM assuming less is possible).
>
> 👤 **ikawrakow** replied the **2025-07-07** at **07:16:56**:<br>
> @mcm007 What is the CPU for these graphs? PP < 200 t/s seems quite low for a 0.6B model.
>
> Here is what I get for `Q6_K`-quantized Qwen3-0.6B on my Ryzen-7950X CPU:
>
> | PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
> |-------|--------|--------|----------|----------|----------|----------|
> | 512 | 128 | 0 | 0.201 | 2546.00 | 1.259 | 101.65 |
> | 512 | 128 | 512 | 0.209 | 2451.59 | 1.463 | 87.48 |
> | 512 | 128 | 1024 | 0.233 | 2197.58 | 1.646 | 77.78 |
> | 512 | 128 | 1536 | 0.258 | 1985.52 | 1.669 | 76.67 |
> | 512 | 128 | 2048 | 0.282 | 1814.45 | 1.715 | 74.63 |
> | 512 | 128 | 2560 | 0.307 | 1665.39 | 1.783 | 71.80 |
> | 512 | 128 | 3072 | 0.333 | 1537.27 | 1.856 | 68.95 |
> | 512 | 128 | 3584 | 0.361 | 1419.98 | 1.925 | 66.48 |
>
> 👤 **mcm007** replied the **2025-07-07** at **09:01:59**:<br>
> @saood06
>
> > I'd be curious about peak batched performance comparisons (I never got around to adding a plot tool to batched-bench)
>
>
>
> <details>
> <summary>Results here, click to expand</summary>
>
> ### Vulkan build
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -ngl 0,1 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 2.158 | 59.33 | 3.076 | 41.61 | 5.233 | 48.92 |
> | 128 | 128 | 2 | 512 | 1.814 | 141.12 | 4.738 | 54.03 | 6.552 | 78.14 |
> | 128 | 128 | 4 | 1024 | 1.870 | 273.78 | 7.437 | 68.84 | 9.308 | 110.02 |
> | 128 | 128 | 6 | 1536 | 3.498 | 219.57 | 10.354 | 74.17 | 13.852 | 110.89 |
> | 128 | 128 | 8 | 2048 | 3.621 | 282.79 | 14.736 | 69.49 | 18.357 | 111.56 |
> | 128 | 128 | 10 | 2560 | 5.542 | 230.95 | 19.563 | 65.43 | 25.106 | 101.97 |
> | 128 | 128 | 12 | 3072 | 5.408 | 284.02 | 24.153 | 63.59 | 29.561 | 103.92 |
> | 128 | 128 | 14 | 3584 | 7.023 | 255.17 | 29.784 | 60.17 | 36.807 | 97.37 |
> | 128 | 128 | 16 | 4096 | 7.103 | 288.33 | 35.599 | 57.53 | 42.702 | 95.92 |
>
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -ngl 0,1 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -fa --cache-type-k q8_0`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 2.136 | 59.93 | 2.950 | 43.39 | 5.086 | 50.34 |
> | 128 | 128 | 2 | 512 | 2.843 | 90.03 | 4.471 | 57.26 | 7.314 | 70.00 |
> | 128 | 128 | 4 | 1024 | 4.506 | 113.62 | 7.563 | 67.70 | 12.069 | 84.85 |
> | 128 | 128 | 6 | 1536 | 7.924 | 96.92 | 11.261 | 68.20 | 19.185 | 80.06 |
> | 128 | 128 | 8 | 2048 | 9.385 | 109.12 | 14.843 | 68.99 | 24.228 | 84.53 |
> | 128 | 128 | 10 | 2560 | 13.274 | 96.43 | 21.822 | 58.66 | 35.096 | 72.94 |
> | 128 | 128 | 12 | 3072 | 14.836 | 103.53 | 30.557 | 50.27 | 45.392 | 67.68 |
> | 128 | 128 | 14 | 3584 | 18.849 | 95.07 | 41.660 | 43.02 | 60.509 | 59.23 |
> | 128 | 128 | 16 | 4096 | 20.788 | 98.52 | 34.703 | 59.01 | 55.492 | 73.81 |
>
>
> ### CPU build
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -ngl 0,1 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 0.858 | 149.13 | 3.157 | 40.55 | 4.015 | 63.76 |
> | 128 | 128 | 2 | 512 | 1.683 | 152.13 | 4.879 | 52.47 | 6.562 | 78.02 |
> | 128 | 128 | 4 | 1024 | 3.570 | 143.42 | 7.726 | 66.27 | 11.296 | 90.65 |
> | 128 | 128 | 6 | 1536 | 5.465 | 140.53 | 10.482 | 73.27 | 15.947 | 96.32 |
> | 128 | 128 | 8 | 2048 | 7.761 | 131.94 | 15.193 | 67.40 | 22.954 | 89.22 |
> | 128 | 128 | 10 | 2560 | 9.970 | 128.38 | 19.755 | 64.79 | 29.726 | 86.12 |
> | 128 | 128 | 12 | 3072 | 12.513 | 122.75 | 24.533 | 62.61 | 37.046 | 82.92 |
> | 128 | 128 | 14 | 3584 | 15.011 | 119.38 | 30.032 | 59.67 | 45.043 | 79.57 |
> | 128 | 128 | 16 | 4096 | 17.933 | 114.20 | 35.927 | 57.01 | 53.860 | 76.05 |
>
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -ngl 0,1 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -fa --cache-type-k q8_0`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 1.061 | 120.60 | 3.088 | 41.46 | 4.149 | 61.70 |
> | 128 | 128 | 2 | 512 | 1.668 | 153.51 | 4.754 | 53.85 | 6.422 | 79.73 |
> | 128 | 128 | 4 | 1024 | 3.566 | 143.58 | 7.453 | 68.70 | 11.019 | 92.93 |
> | 128 | 128 | 6 | 1536 | 5.346 | 143.65 | 11.886 | 64.61 | 17.232 | 89.13 |
> | 128 | 128 | 8 | 2048 | 7.491 | 136.70 | 14.897 | 68.74 | 22.388 | 91.48 |
> | 128 | 128 | 10 | 2560 | 9.620 | 133.06 | 22.426 | 57.08 | 32.045 | 79.89 |
> | 128 | 128 | 12 | 3072 | 11.950 | 128.54 | 31.101 | 49.39 | 43.051 | 71.36 |
> | 128 | 128 | 14 | 3584 | 14.372 | 124.69 | 42.149 | 42.52 | 56.520 | 63.41 |
> | 128 | 128 | 16 | 4096 | 17.197 | 119.09 | 34.384 | 59.56 | 51.581 | 79.41 |
>
> </details>
>
>
> @ikawrakow
>
> > What is the CPU for these graphs?
>
> [AMD Ryzen 5 3400G](https://www.techpowerup.com/cpu-specs/ryzen-5-3400g.c2204), old and without AVX512 :smile:
>
> But it's always powered on thus `llama-server` Webui immediately accessible even from phone.
>
> ```
> Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxs
> r sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc rep_good nopl n
> onstop_tsc cpuid extd_apicid aperfmperf rapl pni pclmulqdq monitor ssse3 fma cx16 sse4_
> 1 sse4_2 movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_lega
> cy abm sse4a misalignsse 3dnowprefetch osvw skinit wdt tce topoext perfctr_core perfctr
> _nb bpext perfctr_llc mwaitx cpb hw_pstate ssbd ibpb vmmcall fsgsbase bmi1 avx2 smep bm
> i2 rdseed adx smap clflushopt sha_ni xsaveopt xsavec xgetbv1 clzero xsaveerptr arat npt
> lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pft
> hreshold avic v_vmsave_vmload vgif overflow_recov succor smca sev sev_es
> ```
>
> 👤 **saood06** replied the **2025-07-07** at **09:18:30**:<br>
> > > I'd be curious about peak batched performance comparisons (I never got around to adding a plot tool to batched-bench)
> >
> > Results here, click to expand
>
> Thanks, but not sure what to make of these given you use `-ngl 0,1` (which I think is being interpreted as 1), instead of 99/0 like you did for `sweep-bench`
>
> Edit:
>
> >[AMD Ryzen 5 3400G](https://www.techpowerup.com/cpu-specs/ryzen-5-3400g.c2204), old and without AVX512 😄
>
> My server CPU uses the first CPU architecture with AVX2.
>
> 👤 **mcm007** replied the **2025-07-07** at **10:43:29**:<br>
> Sorry, the `0,1` was meant for `-fa` I think. Because of the typo it used 0: `llm_load_tensors: offloaded 0/29 layers to GPU`.
>
> <details>
>
> <summary>CPU/Vulkan/ngl/FA Results</summary>
>
> ### Vulkan build
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -fa -ngl 0`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 2.027 | 63.14 | 3.037 | 42.15 | 5.064 | 50.55 |
> | 128 | 128 | 2 | 512 | 1.793 | 142.76 | 4.595 | 55.71 | 6.388 | 80.15 |
> | 128 | 128 | 4 | 1024 | 1.839 | 278.46 | 7.841 | 65.30 | 9.679 | 105.79 |
> | 128 | 128 | 6 | 1536 | 3.420 | 224.57 | 14.302 | 53.70 | 17.722 | 86.67 |
> | 128 | 128 | 8 | 2048 | 3.590 | 285.26 | 15.373 | 66.61 | 18.963 | 108.00 |
> | 128 | 128 | 10 | 2560 | 5.156 | 248.23 | 27.476 | 46.59 | 32.633 | 78.45 |
> | 128 | 128 | 12 | 3072 | 5.747 | 267.28 | 41.406 | 37.10 | 47.153 | 65.15 |
> | 128 | 128 | 14 | 3584 | 7.283 | 246.05 | 58.771 | 30.49 | 66.054 | 54.26 |
> | 128 | 128 | 16 | 4096 | 8.226 | 248.97 | 37.488 | 54.63 | 45.714 | 89.60 |
>
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -fa -ngl 99`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 0.326 | 392.75 | 2.841 | 45.05 | 3.167 | 80.82 |
> | 128 | 128 | 2 | 512 | 0.388 | 660.42 | 3.400 | 75.29 | 3.788 | 135.16 |
> | 128 | 128 | 4 | 1024 | 0.841 | 608.95 | 5.633 | 90.89 | 6.474 | 158.18 |
> | 128 | 128 | 6 | 1536 | 1.328 | 578.33 | 7.383 | 104.03 | 8.711 | 176.34 |
> | 128 | 128 | 8 | 2048 | 1.960 | 522.41 | 9.095 | 112.59 | 11.055 | 185.25 |
> | 128 | 128 | 10 | 2560 | 2.595 | 493.23 | 16.859 | 75.92 | 19.455 | 131.59 |
> | 128 | 128 | 12 | 3072 | 3.487 | 440.48 | 17.976 | 85.45 | 21.463 | 143.13 |
> | 128 | 128 | 14 | 3584 | 4.313 | 415.48 | 19.101 | 93.82 | 23.414 | 153.07 |
> | 128 | 128 | 16 | 4096 | 5.380 | 380.64 | 20.148 | 101.65 | 25.528 | 160.45 |
>
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -ngl 0`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 2.151 | 59.52 | 3.212 | 39.85 | 5.363 | 47.74 |
> | 128 | 128 | 2 | 512 | 1.815 | 141.07 | 4.438 | 57.69 | 6.252 | 81.89 |
> | 128 | 128 | 4 | 1024 | 1.870 | 273.79 | 7.488 | 68.37 | 9.358 | 109.42 |
> | 128 | 128 | 6 | 1536 | 3.499 | 219.48 | 10.361 | 74.13 | 13.860 | 110.82 |
> | 128 | 128 | 8 | 2048 | 3.622 | 282.70 | 14.533 | 70.46 | 18.155 | 112.81 |
> | 128 | 128 | 10 | 2560 | 5.552 | 230.56 | 19.646 | 65.15 | 25.198 | 101.60 |
> | 128 | 128 | 12 | 3072 | 5.427 | 283.01 | 24.115 | 63.69 | 29.543 | 103.98 |
> | 128 | 128 | 14 | 3584 | 6.983 | 256.63 | 29.911 | 59.91 | 36.894 | 97.14 |
> | 128 | 128 | 16 | 4096 | 7.082 | 289.20 | 36.246 | 56.50 | 43.327 | 94.54 |
>
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -ngl 99`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 0.303 | 422.98 | 2.686 | 47.65 | 2.989 | 85.65 |
> | 128 | 128 | 2 | 512 | 0.335 | 763.09 | 4.162 | 61.50 | 4.498 | 113.83 |
> | 128 | 128 | 4 | 1024 | 0.679 | 753.86 | 7.281 | 70.32 | 7.960 | 128.65 |
> | 128 | 128 | 6 | 1536 | 1.051 | 730.81 | 10.296 | 74.60 | 11.346 | 135.37 |
> | 128 | 128 | 8 | 2048 | 1.433 | 714.54 | 12.580 | 81.40 | 14.013 | 146.15 |
> | 128 | 128 | 10 | 2560 | 1.855 | 690.11 | 17.271 | 74.11 | 19.126 | 133.85 |
> | 128 | 128 | 12 | 3072 | 2.277 | 674.54 | 18.591 | 82.62 | 20.868 | 147.21 |
> | 128 | 128 | 14 | 3584 | 2.747 | 652.35 | 19.879 | 90.15 | 22.626 | 158.40 |
> | 128 | 128 | 16 | 4096 | 3.213 | 637.39 | 21.080 | 97.15 | 24.293 | 168.61 |
>
>
> ### CPU build
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16 -fa`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 1.079 | 118.58 | 3.090 | 41.43 | 4.169 | 61.40 |
> | 128 | 128 | 2 | 512 | 1.695 | 151.00 | 4.751 | 53.89 | 6.446 | 79.43 |
> | 128 | 128 | 4 | 1024 | 3.609 | 141.89 | 7.772 | 65.88 | 11.380 | 89.98 |
> | 128 | 128 | 6 | 1536 | 5.607 | 136.98 | 15.116 | 50.81 | 20.723 | 74.12 |
> | 128 | 128 | 8 | 2048 | 7.843 | 130.56 | 15.871 | 64.52 | 23.715 | 86.36 |
> | 128 | 128 | 10 | 2560 | 10.113 | 126.57 | 28.216 | 45.36 | 38.329 | 66.79 |
> | 128 | 128 | 12 | 3072 | 12.770 | 120.28 | 42.656 | 36.01 | 55.426 | 55.43 |
> | 128 | 128 | 14 | 3584 | 15.405 | 116.32 | 60.220 | 29.76 | 75.625 | 47.39 |
> | 128 | 128 | 16 | 4096 | 18.308 | 111.86 | 37.814 | 54.16 | 56.122 | 72.98 |
>
>
> `llama-batched-bench -m /models1/Qwen_Qwen3-0.6B-Q6_K.gguf -c 4096 -b 512 -ub 512 -npp 128 -ntg 128 -npl 1,2,4,6,8,10,12,14,16`
>
> | PP | TG | B | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s | T s | S t/s |
> |-------|--------|------|--------|----------|----------|----------|----------|----------|----------|
> | 128 | 128 | 1 | 256 | 0.891 | 143.70 | 3.195 | 40.07 | 4.085 | 62.66 |
> | 128 | 128 | 2 | 512 | 1.690 | 151.47 | 4.721 | 54.23 | 6.411 | 79.86 |
> | 128 | 128 | 4 | 1024 | 3.582 | 142.94 | 7.592 | 67.44 | 11.174 | 91.64 |
> | 128 | 128 | 6 | 1536 | 5.515 | 139.26 | 10.560 | 72.73 | 16.075 | 95.55 |
> | 128 | 128 | 8 | 2048 | 7.711 | 132.79 | 15.253 | 67.13 | 22.964 | 89.18 |
> | 128 | 128 | 10 | 2560 | 9.933 | 128.87 | 19.750 | 64.81 | 29.682 | 86.25 |
> | 128 | 128 | 12 | 3072 | 12.619 | 121.72 | 24.358 | 63.06 | 36.978 | 83.08 |
> | 128 | 128 | 14 | 3584 | 14.966 | 119.73 | 29.971 | 59.79 | 44.938 | 79.75 |
> | 128 | 128 | 16 | 4096 | 17.959 | 114.04 | 36.230 | 56.53 | 54.189 | 75.59 |
>
> </details>
---
👤 **firecoperana** replied the **2025-07-06** at **23:37:31**:<br>
You don't need to make the decision so soon. You can wait and see if this improvement in Vulkan draws more interest from Vulkan users or even developers. It's more important for AMD and Intel users, but they may not know about this yet.
---
👤 **Nexesenex** replied the **2025-07-12** at **01:15:49**:<br>
I personally voted against Vulkan, and only because the community's opinion was asked.
@Ikawrakow : My argument would basically go along with yours. If there's demand, and most importantly if there's motivation, and even better if there is help, then I'd love to see IKL support Vulkan, because this backend seems to have a future.
But as of now, your developments are so valuable in the areas you master that it might be more pertinent to focus on your art rather than learn a new technique. That technique could be contributed by skilled Vulkan devs working alongside you, rather than you having to do it all yourself. Skilled Vulkan devs who might eventually come to IKL and join you, firecoperana and the rest, because IKL is where the good stuff is, quants- and big-MoE-support-wise, and also in terms of being welcoming to all good-wills.
Just my opinion, I'll be happy whatever you choose.
Especially after the IQ2_KL surprise! :)
---
👤 **gapeleon** replied the **2025-07-17** at **09:04:15**:<br>
I voted 'no' but regret it / can't remove my vote. I'd rather abstain :)
For me personally, I use this app to get more performance out of my used Nvidia hardware + CPU with MoE's. The biggest win for me would be if someone could improve rpc server performance, as this would make it viable for us to link multiple rigs without cutting prompt processing in half.
But Vulkan would help both Intel and AMD users.
I noticed a lot of people buying multiple MI50's recently to run larger models, and prompt processing on these with Vulkan is incredibly slow.
Intel are releasing a 24GB GPU later this year. And while OpenVINO and SYCL are way faster, there's an issue with OpenVINO whereby you can't use KV cache with multiple GPUs. The 48GB dual-GPU card one of the board partners is releasing will effectively be 2x24GB GPUs, so people buying that card would benefit from faster Vulkan performance.
> I have mixed feelings how to proceed
ik_llama is a passion project right? So perhaps just do what would be most interesting?
> 👤 **ikawrakow** replied the **2025-07-17** at **14:03:38**:<br>
> > ik_llama is a passion project right? So perhaps just do what would be most interesting?
>
> "Passion" would be pushing it. But yes, it is a hobby project that I started to hack around for fun. It has never been about winning a popularity contest, and I never went out to beat the drum in HN, Reddit, X, etc. But with time quite a few people have found the project useful, and this is what creates the mixed feelings: it is obvious that a high quality Vulkan back-end will be useful for many, I don't need to be convinced of that. At the same time I'm not sure that I will be having fun adding all the `ik_llama.cpp` quants and the optimizations for MoE models to the Vulkan back-end.
>
> In any case, thank you for voting!
> But 14 votes in total does not provide a very strong motivation.
>
> 👤 **firecoperana** replied the **2025-07-17** at **15:20:39**:<br>
> It's not a big problem if ik_llama.cpp quants and other optimizations are not added to Vulkan, especially if you don't feel like doing it, because Vulkan users are accustomed to missing features compared to CUDA. Back then, there was no IQ quant support, and FA was barely supported in Vulkan in mainline until recently, but that did not stop people from using Vulkan. Until there is more interest from Vulkan users, it's fine the way it is now.
---
👤 **FullstackSensei** replied the **2025-07-18** at **00:12:31**:<br>
Found this discussion while searching for references to SYCL to see if building for SYCL is supported (having a lot of compilation errors).
I have two inference rigs powered by Nvidia and I'm re-purposing a dual Cascade Lake machine I have for MoE inference by adding A770s.
I voted for improving the Vulkan backend but here are my two cents:
- This project doesn't get that much attention on reddit, etc. compared to llama.cpp. So, the current userbase is a lot smaller. Having this question in the discussions, while appropriate, won't attract that much attention.
- Vulkan is the only backend that's not tied to a specific vendor. Any optimization you make there will be useful on all GPUs, discrete or otherwise. If you can bring Vulkan close to parity with CUDA, it will be a huge win for any device that supports Vulkan, including older GPUs from Nvidia and AMD.
- As firecoperana noted, not all quants need to be supported. A handful of the recent IQs used in recent MoEs like Qwen3-235B, DeepSeek-671B, and Kimi-K2 are more than enough. I'd even argue for supporting only power-of-two IQ quants initially to limit scope and effort.
- Intel's A770 is now arguably the cheapest 16GB GPU with decent compute and memory bandwidth, but it doesn't get much attention in the community. Vulkan support would benefit those of us running Arcs, and free us from having to fiddle with OneAPI.
---
👤 **ExeVirus** replied the **2025-07-18** at **02:45:00**:<br>
You are correct to ask this question. Your target users are those with a single powerful GPU and a decent DRAM/CPU combo.
Those users are power users and small businesses. Further, most serious ones are using 24GB machines or better. They have ROCm and CUDA, and if Intel ever comes out with a 24GB single card that is actually available, they'll support it properly as well.
Vulkan helps old hardware, and people that love hassle-free setups. I don't think you should be doing that hassle-free work yourself, given your users are all very capable of that work/setup, as much as we would like to have that ease of use.
If your goal is mass popularity like llama.cpp, then yeah, get started on Vulkan, and also get some help, because that's a tall order. Just my thoughts.
---
👤 **ACWeb23** replied the **2025-07-18** at **04:06:52**:<br>
I think improvements to Vulkan performance would be a positive. This would allow users greater flexibility when deciding on hardware. Also, Arc and AMD GPU users would benefit from these improvements.
---
👤 **lin72h** replied the **2025-07-18** at **04:24:40**:<br>
Vote for Vulkan. It's the API that all vendors are pushing hard to support. AMD's RADV driver is really solid, Intel's ANV is steadily improving, and Jeff Bolz from NVIDIA [has been contributing](https://github.com/ggml-org/llama.cpp/issues?q=is%3Apr+author%3Ajeffbolznv) to llama.cpp's Vulkan backend for several months now.
---
👤 **ikawrakow** replied the **2025-07-18** at **04:53:10**:<br>
Wow, I see 18 new votes since I last checked yesterday. For people who came here to vote for Vulkan but are not familiar with this project, the mainline `llama.cpp` Vulkan back-end has been ported to `ik_llama.cpp`(#608), so it should be on par with what you have in mainline. For models utilizing MLA attention (DeepSeek, Kimi-2), `ik_llama.cpp` outperforms `llama.cpp` by quite a margin as it is - see [here](https://github.com/ikawrakow/ik_llama.cpp/pull/608#issuecomment-3069950613).
> 👤 **FullstackSensei** replied the **2025-07-18** at **08:56:51**:<br>
> > Wow, I see 18 new votes since I last checked yesterday. For people who came here to vote for Vulkan but are not familiar with this project, the mainline `llama.cpp` Vulkan back-end has been ported to `ik_llama.cpp`(#608), so it should be on par with what you have in mainline. For models utilizing MLA attention (DeepSeek, Kimi-2), `ik_llama.cpp` outperforms `llama.cpp` by quite a margin as it is - see [here](https://github.com/ikawrakow/ik_llama.cpp/pull/608#issuecomment-3069950613).
>
> I took the liberty of posting about this discussion on LocalLLaMA and IntelArc subreddits. Hope you don't mind! Your work makes large models like DeepSeek and Kimi usable on hardware that doesn't cost a kidney, and Vulkan optimizations would only lower the cost to run such models at decent speeds.
>
> This project doesn't get the exposure it deserves, IMO.. So, I thought at worst more people will become familiar with it.
>
> 👤 **ikawrakow** replied the **2025-07-18** at **11:59:25**:<br>
> > I took the liberty of posting about this discussion on LocalLLaMA and IntelArc subreddits. Hope you don't mind!
>
> This project was the best kept secret on Github for a while, but it no longer is, so feel free to post about it.
>
> > This project doesn't get the exposure it deserves, IMO
>
> Thank you.
---
👤 **DealsBeam** replied the **2025-07-18** at **11:54:36**:<br>
Intel Arc GPUs would greatly benefit from Vulkan improvement, thanks for your hard work and dedicating your time on this great project.
> 👤 **ikawrakow** replied the **2025-07-18** at **12:00:32**:<br>
> > Intel Arc GPUs would greatly benefit from Vulkan improvement
>
> My understanding was that the `llama.cpp` SYCL backend was the better option for Intel GPUs. This is no longer the case?

View File

@@ -0,0 +1,110 @@
### 🗣️ [#591](https://github.com/ikawrakow/ik_llama.cpp/discussions/591) - I dont see any speed improvement in generation, so want to understand if i am missing something
| **Author** | `Greatz08` |
| :--- | :--- |
| **Created** | 2025-07-07 |
| **Updated** | 2025-07-08 |
---
#### Description
First of all, thank you very much for your contributions to quantization, which help GPU-poor people like us enjoy LLMs :-)). I recently compiled llama.cpp with these commands:
```
cmake -B build \
-DCMAKE_BUILD_TYPE=Release \
-DGGML_CUDA=ON \
-DCMAKE_CUDA_ARCHITECTURES="89" \
-DGGML_CUDA_F16=ON \
-DGGML_CUDA_FA_ALL_QUANTS=ON \
-DGGML_BLAS=ON \
-DGGML_BLAS_VENDOR=OpenBLAS \
-DLLAMA_LLGUIDANCE=ON \
```
`cmake --build build --config Release -j`
I have an RTX 4060 with 8GB VRAM, so I asked the latest Gemini 2.5 Pro to guide me. I fed it all the docs context with project gitingest and then asked it to generate the best build command, which I pasted above, so do let me know if I have to make more changes or not, because I used the same commands to build the fork version (this project).
I get the same speed in both the llama.cpp version and this fork. I used the following command to run the model.
```
GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 ./build/bin/llama-server --device CUDA0 \
-m ~/models/Qwen3-30B-A3B-128K-UD-Q2_K_XL.gguf \
-c 32000 \
-ngl 48 \
-t 4 \
-ot '.*\.ffn_down_exps\.weight=CPU' \
-ot '.*\.ffn_up_exps\.weight=CPU' \
-ub 256 -b 512 \
--host 0.0.0.0 \
--port 8009 \
--flash-attn \
--cache-type-k q8_0 \
--cache-type-v q8_0 \
```
I am getting 20-23 tokens/s, so I wanted to know if I can improve it further by recompiling, or if you can guide me to improve this command further. I am asking for much more improvement because I want to go for the IQ3_XXS quant, which people report works great, and that will be my end limit.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-07** at **16:24:50**:<br>
* Remove `-DGGML_BLAS=ON -DGGML_BLAS_VENDOR=OpenBLAS` from the build command
* I wouldn't know what `-DLLAMA_LLGUIDANCE=ON` does, so just remove from the build command
* You can reduce your build time by not using `-DGGML_CUDA_FA_ALL_QUANTS=ON`, which is only necessary if you want to use more exotic KV cache quantization types (not needed with the `Q8_0` that you have used)
* Does your RTX 4060 support unified memory? If not, remove the `GGML_CUDA_ENABLE_UNIFIED_MEMORY=1` from your server command
* What is your CPU? Does it only have 4 cores? All operations with tensors that were not offloaded to the GPU run on the CPU for token generation, so that's important
* If you are leaving 2 of the 3 FFN tensors on the CPU, I think it is better to have `ffn_up_exps` and `ffn_gate_exps` on the CPU
* Use `-ngl 100` or some such. IIRC Qwen3-30B-A3B has 48 repeating layers, so with `-ngl 48` you are not offloading the output tensor to the GPU. This slows down prompt processing and token generation. Or was that your intent?
* You definitely want to add `-fmoe` to your server command
* For better prompt processing speed, you should try to use larger `-b` and `-ub` (if VRAM permits). Given enough VRAM, best prompt processing speed for MoE models such as Qwen3-30B-A3B is obtained with `-b 4096 -ub 4096` (but this requires larger CUDA compute buffers)
Having said all that, token generation speed in the case of CPU-only or hybrid GPU/CPU inference is limited by CPU memory bandwidth, so performance gains compared to mainline `llama.cpp` tend to be smaller. The big advantage of `ik_llama.cpp` is in prompt processing speed. You may also see larger performance gains for token generation with a long context stored in the KV cache.
After you get going with Unsloth's quantized models, you may also want to look into some of the quantized models with `ik_llama.cpp` specific quants, but let's not throw too much information your way all at once.
> 👤 **Greatz08** replied the **2025-07-08** at **00:59:56**:<br>
> > Does your RTX 4060 support unified memory? If not, remove the GGML_CUDA_ENABLE_UNIFIED_MEMORY=1 from your server command
>
> I don't think so, I will remove it.
>
>
>
> > What is your CPU? Does it only have 4 cores? All operations with tensors that were not offloaded to the GPU run on the CPU for token generation, so that's important
>
> I forgot to mention any info about my CPU. My CPU is an AMD Ryzen 7840HS (8 cores, 16 threads). I btw tested both -t 4 and -t 8; I pasted the -t 4 version of the command in my previous message. I was just testing both values to observe the inference speed differences.
>
>
>
>
> > If you are leaving 2 of the 3 FFN tensors on the CPU, I think it is better to have ffn_up_exps and ffn_gate_exps on the CPU
>
> Ok, this was an interesting thing to know and I will try with these two tensor types. If possible do share your wisdom on this, like why you think these two will be better (just interested to learn and understand more :-) ).
>
> ![image](https://github.com/user-attachments/assets/8bfe6500-309a-496f-af06-9eafcd108597)
> blk.1.ffn_down_exps.weight - 0.66 % of model param
> blk.1.ffn_gate_exps.weight - 0.66 % of model param
> blk.1.ffn_gate_inp.weight - <0.01 % of model param
> blk.1.ffn_norm.weight - <0.01 % of model param
> blk.1.ffn_up_exps.weight - 0.66 % of model param
>
> On the basis of this I thought two tensor types would be sufficient to save enough VRAM to load all attention layers in GPU VRAM ( https://reddit.com/r/LocalLLaMA/comments/1ki7tg7/dont_offload_gguf_layers_offload_tensors_200_gen/ ). From this reddit post I got to know about this awesome override-tensor trick.
>
>
>
> > Use -ngl 100 or some such. IIRC Qwen3-30B-A3B has 48 repeating layers, so with -ngl 48 you are not offloading the output tensor to the GPU. This slows down prompt processing and token generation. Or was that your intent?
>
> ![image](https://github.com/user-attachments/assets/2d14c597-30d8-48d5-9e50-8d3474d30a19)
>
> Number of Layers: 48 - After seeing this I thought I should be loading all 48 layers in GPU VRAM (that is exactly why I saved VRAM by offloading specific tensors), and because of this I chose 48 layers. I don't know about 'repeating layers', so I think either I missed a key concept or you might be referring to another model's layers? (Do let me know about this.)
>
>
> > For better prompt processing speed, you should try to use larger -b and -ub (if VRAM permits). Given enough VRAM, best prompt processing speed for MoE models such as Qwen3-30B-A3B is obtained with -b 4096 -ub 4096 (but this requires larger CUDA compute buffers)
>
> I will see how much I can increase those numbers for both params, and will test with longer context. I will also follow the rest of your suggestions and will test things out.
>
>
> Thank you very much for your guidance on this matter @ikawrakow :-))

View File

@@ -0,0 +1,23 @@
### 🗣️ [#594](https://github.com/ikawrakow/ik_llama.cpp/discussions/594) - Is AVX2 a hard requirement on x64?
| **Author** | `SmallAndSoft` |
| :--- | :--- |
| **Created** | 2025-07-08 |
| **Updated** | 2025-07-09 |
---
#### Description
I am getting compilation errors on an older CPU with just AVX, even though I want to offload everything to the CUDA GPU.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-09** at **08:41:22**:<br>
Yes, `AVX2` or better is a hard requirement on `x86_64`. I think `llama.cpp` is a better option for older hardware.
> 👤 **SmallAndSoft** replied the **2025-07-09** at **08:45:07**:<br>
> Thank you for the reply. Yes, I just wanted to try your advanced quants on the GPU. It is sad that AVX2 is required even if the CPU will be doing next to nothing.

View File

@@ -0,0 +1,357 @@
### 🗣️ [#599](https://github.com/ikawrakow/ik_llama.cpp/discussions/599) - mla matrix absorbtion
| **Author** | `magikRUKKOLA` |
| :--- | :--- |
| **Created** | 2025-07-11 |
| **Updated** | 2025-07-15 |
---
#### Description
This concerns a prefill optimization for long contexts, as implemented in ktransformers. I found some cool docs and will leave them here.
https://github.com/madsys-dev/deepseekv2-profile/blob/main/workspace/blog/optimizing-mla.md
DeepSeek R1 **explanation**:
The **matrix absorption technique** in DeepSeek-V2's MLA (Multi-head Latent Attention) mechanism is a clever mathematical optimization that avoids explicitly decompressing the compressed KV cache, significantly reducing computation and memory overhead. Here's a step-by-step explanation:
### 1. **Core Problem**
Traditional MLA implementations:
- Store **compressed KV representations** (small memory footprint)
- But require **decompression** before attention calculation:
```math
k_t^C = W^{UK} \cdot c_t^{KV} \quad \text{(expensive operation)}
```
```math
v_t = W^{UV} \cdot c_t^{KV} \quad \text{(expensive operation)}
```
### 2. **Key Insight: Matrix Associativity**
Matrix multiplication is associative. Instead of decompressing KV, **absorb the decompression matrices** into adjacent operations:
- **K-absorption**: Fuse decompression into Q projection
- **V-absorption**: Fuse decompression into output projection
---
### 3. **K-Absorption (for Attention Scores)**
**Original computation** for non-RoPE attention scores:
```math
{q_t^C}^\top k_t^C = (W^{UQ} c_t^Q)^\top (W^{UK} c_t^{KV})
```
**Absorbed version** using associativity:
```math
{q_t^C}^\top k_t^C = \underbrace{(c_t^Q)^\top}_{\text{input}} \cdot \underbrace{(W^{UQ})^\top W^{UK}}_{\text{precomputed}} \cdot \underbrace{c_t^{KV}}_{\text{cached}}
```
**Why this helps**:
- Avoids explicit computation of full-dimensional `k_t^C`
- Replaces large matrix multiplication with smaller operations
- **FLOPs reduction**: From 33.64 MFLOP/token → 0.28 MFLOP/token
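
The absorbed form above is easy to sanity-check as a couple of einsums. Below is a minimal PyTorch sketch with made-up shapes (per-head non-RoPE dim 128, compressed KV dim 512, 16 heads); it only illustrates the algebra and is not the actual DeepSeek or `ik_llama.cpp` code:
```python
# Minimal sketch of K-absorption: fold W^{UK} into the query instead of
# decompressing every cached k_t^C. Shapes are illustrative only.
import torch

b, h, q_len, kv_len = 1, 16, 4, 1024
d_nope, d_c = 128, 512                               # per-head non-RoPE dim, compressed KV dim

q_nope        = torch.randn(b, h, q_len, d_nope)     # W^{UQ} c^Q, already split per head
W_UK          = torch.randn(h, d_nope, d_c)          # per-head K decompression matrix
compressed_kv = torch.randn(b, kv_len, d_c)          # cached c^{KV}

# Naive path: materialize k^C for every cached position, then take dot products.
k_nope       = torch.einsum('hdc,blc->bhld', W_UK, compressed_kv)    # [b, h, kv, 128]
scores_naive = torch.einsum('bhqd,bhld->bhql', q_nope, k_nope)

# Absorbed path: apply W^{UK} to the few query vectors once, then score
# directly against the compressed cache.
q_abs           = torch.einsum('bhqd,hdc->bhqc', q_nope, W_UK)       # [b, h, q, 512]
scores_absorbed = torch.einsum('bhqc,blc->bhql', q_abs, compressed_kv)

print((scores_naive - scores_absorbed).abs().max())  # tiny, float32 rounding only
```
The absorbed path applies `W^{UK}` once per query token instead of once per cached token, which is where the FLOP reduction quoted above comes from whenever the number of query tokens is much smaller than the cache length.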
---
### 4. **V-Absorption (for Attention Output)**
**Original computation**:
```math
o = a \cdot v_t = a \cdot (W^{UV} \cdot c_t^{KV})
```
```math
u = W^O \cdot o
```
**Absorbed version** using Einstein summation:
```python
# Instead of decompressing V explicitly:
attn_output = torch.einsum('bhql,blc->bhqc', attn_weights, compressed_kv) # Weighted sum in compressed space
attn_output = torch.einsum('bhqc,hdc->bhqd', attn_output, W_UV) # Late decompression
u = torch.einsum('hdD,bhqd->bqD', W_O, attn_output) # Output projection
```
**Why this helps**:
- Avoids materializing full `v_t` (128× larger than `c_t^{KV}`)
- Fuses decompression with weighted sum
- Minimizes intermediate memory
---
### 5. **Move Elision Optimization**
**Additional problem**: Original code concatenates RoPE/non-RoPE parts of Q/K, creating large temporary tensors.
**Solution**: Compute attention scores in two parts and sum:
```python
attn_weights = (
torch.matmul(q_pe, k_pe.transpose(2, 3)) # RoPE part
+ torch.einsum('bhqc,blc->bhql', q_nope, compressed_kv) # Non-RoPE (absorbed)
)
```
- **Eliminates** memory-hungry concatenation ops
- **Avoids** storing full Q/K tensors (192-dimensional)
---
### 6. **Why Not Pre-Absorb All Matrices?**
Theoretically, you could precompute:
```math
W_{\text{new}}^{UQ} = (W^{UQ})^\top W^{UK} \quad \text{and} \quad W_{\text{new}}^O = W^O W^{UV}
```
But this is **inefficient** because:
- `W_{\text{new}}^{UQ}` would be a large low-rank matrix (1536×512)
- `W_{\text{new}}^O` would be massive (5120×512 per head)
- **Actual implementation** (sequential small ops) is faster:
![](https://i.imgur.com/5XJzZ9l.png)
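
A rough multiply-add count per head and per query token makes the same point, using the 1536 and 512 dimensions quoted above plus a per-head non-RoPE dimension of 128 (an assumption; it is consistent with the 192-dimensional concatenated Q/K mentioned in section 5):
```math
\underbrace{128 \times 1536}_{q^C = W^{UQ} c^Q} \;+\; \underbrace{512 \times 128}_{(W^{UK})^\top q^C} \;\approx\; 2.6 \times 10^5
\qquad \text{vs.} \qquad
\underbrace{1536 \times 512}_{(c^Q)^\top W_{\text{new}}^{UQ}} \;\approx\; 7.9 \times 10^5
```
So the fused matrix costs roughly 3× more per token, and, being a dense 1536×512 product with rank at most 128, it also needs more storage than the two factors it replaces.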
---
### 7. **Performance Impact**
| Optimization | KV Cache Size | Compute (FLOPs/token) | Speedup (vs baseline) |
|----------------------------------|---------------|----------------------|----------------------|
| Baseline (CacheDecompressed) | 81.92 KB | 0.08 M | 1.0× |
| CacheCompressed (no absorption) | 1.15 KB | 33.64 M | 0.3× |
| **Absorption + Move Elision** | **1.15 KB** | **0.28 M** | **1.2×** |
**Key wins**:
- 71× smaller KV cache than decompressed version
- 120× less computation than non-absorbed compression
- Enables larger batch sizes (critical for GPU efficiency)
---
### Summary
Matrix absorption works by:
1. **Exploiting associativity** to "push" decompression matrices into adjacent operations
2. **Late decompression** of V during output projection
3. **Splitting attention scores** to avoid tensor concatenation
4. **Leveraging compressed representations** throughout computation
This transforms MLA from a memory-bound problem into a compute-bound one, better utilizing modern GPU tensor cores while maintaining 98.6% KV cache compression.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-11** at **12:16:34**:<br>
@magikRUKKOLA
You may want to check #246, #260, #273.
As far as I can tell, #246, which explains the basic idea of reducing the amount of multiply-adds when using MLA, precedes the linked doc by about a month, and is surprisingly similar to what they wrote.
#260 explains the `-amb` option, which limits the amount of intermediate compute buffer storage required.
#273 is the best MLA version in `ik_llama.cpp`. The MLA=2 variant (explained in #246) is used for prompt processing, the original MLA (MLA=1) is used for token generation. The main reason it took a while to arrive at #273 was the struggle to implement the MLA=1 case efficiently on CUDA (and the struggle was due to the much larger than usual attention head sizes of 576 and 512).
If you look at all merged PRs, you will see that it has been quite a journey to arrive at what we have today for doing fast DeepSeek inference.
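
As a side note for readers puzzled by these numbers: assuming the usual DeepSeek MLA hyper-parameters (non-RoPE head dim 128, RoPE head dim 64, value head dim 128, kv_lora_rank 512; stated here as an assumption, not quoted from the thread), the two attention paths work out to
```math
\text{prompt processing (MLA=2, decompressed K/V):}\quad d_K = 128 + 64 = 192, \qquad d_V = 128
```
```math
\text{token generation (MLA=1, compressed cache):}\quad d_K = 512 + 64 = 576, \qquad d_V = 512
```
which is why the token-generation path needs flash attention kernels for the unusually large 576/512 head sizes.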
---
👤 **ubergarm** replied the **2025-07-11** at **22:10:57**:<br>
A new model with MLA just dropped only 1000B-A32B https://huggingface.co/moonshotai/Kimi-K2-Instruct .... :sob: lol...
> 👤 **magikRUKKOLA** replied the **2025-07-11** at **23:35:48**:<br>
> @ubergarm
> > A new model with MLA just dropped only 1000B-A32B https://huggingface.co/moonshotai/Kimi-K2-Instruct .... 😭 lol...
>
> ```
> Paper Link (co**mm**ing soon)
> ```
>
> Yeah, I am so excited too! :D
> So the minimum requirements are 512GB RAM and 48GB VRAM to run some IQ2 quant lol. (?) I guess it's time to upgrade.
>
> quote:
> > Agentic Intelligence: Specifically designed for **tool use**, reasoning, and autonomous problem-solving.
>
> I suggest that the setup how the tool usage can be applied with ik_llama.cpp should be documented somewhere. Basically we need a MITM-tool to translate JSON<->TOOL_CALL_TOKENS. And that's about it.
>
> 👤 **ewhacc** replied the **2025-07-12** at **09:59:24**:<br>
> @ubergarm
> Are you going to cook quants? ^^; It uses the DeepSeek architecture, so I'm hoping it runs in ik_llama.cpp flawlessly.
>
> I have 512G RAM and would like to test IQ2. I thought 256G is the best because using 512G (with higher bits) is too slow. I was wrong. Kimi-K2 keeps the active experts the same but almost doubles the total weights. I guess TG speed will be about the same, but PP will be slower.
>
> I'm downloading original FP8 now. I don't know why I'm doing this... ^^
>
> 👤 **ubergarm** replied the **2025-07-12** at **15:43:55**:<br>
> @ewhacc
>
> I haven't looked to see if existing methods for going from fp8 safetensors to bf16 GGUFs would work on that model yet. I use the evshiron llama.cpp fork (from fairydreaming's original MLA fork) plus triton-cpu to convert DeepSeek 671B without a GPU on a big RAM box. That is the first challenge.
>
> Next you'll need over 1TB RAM to inference the Q8_0 to make an imatrix. I don't have access to the big RAM box right now, so I can't do this step at the moment. Plus it's a pain to free up like 4TB of disk space lol...
>
> Keep us posted, I'm sure people will want to run this monster eventually
>
> 👤 **ubergarm** replied the **2025-07-12** at **15:45:52**:<br>
> @magikRUKKOLA
>
> > I suggest that the setup how the tool usage can be applied with ik_llama.cpp should be documented somewhere. Basically we need a MITM-tool to translate JSON<->TOOL_CALL_TOKENS. And that's about it.
>
> One guy put together a function calling wrapper thing, not sure if it is applicable here: https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943
>
> I haven't tried it personally.
>
> 👤 **magikRUKKOLA** replied the **2025-07-12** at **20:58:08**:<br>
> > @magikRUKKOLA
> >
> > One guy put together a function calling wrapper thing, not sure if it is applicable here: [#407 (comment)](https://github.com/ikawrakow/ik_llama.cpp/issues/407#issuecomment-2953602943)
> >
>
> Yeah, I noticed. I suggest some docs should be created on how to provide a frontend for ik_llama.cpp to support tool calling. But first let me see which solution would be the most elegant.
>
> 👤 **magikRUKKOLA** replied the **2025-07-12** at **21:05:55**:<br>
> @ewhacc
>
> > I have 512G RAM and would like to test IQ2.
>
> I just noticed that IQ4_KS_R4 of Deepseek R1 is 368 GiB. So
>
> ```
> echo "scale=2;368*(1000/671)"|bc
> 548.32
> ```
>
> So Kimi-K2 with a similar quant might fit within 512 GB of RAM. Or an IQ3 quant should fit.
>
> But... but... something should be done with the attention mechanism (for the prefill) to reduce the VRAM usage. I am currently looking at flashinfer. That is the exact reason for the instability in ktransformers. It's a hurdle. :)
>
> > I thought 256G is the best because using 512G (with higher bits) is too slow. I was wrong.
>
> Yeah, I made the same mistake.
> Small tip/note -- if you choose to use DDR4, don't buy 3200 MT/s (unless it's for Lenovo machines). The Samsung 2666 MT/s ECC overclocks great at 1.35V with crazy timings. But you would have to install additional fans and heatsinks on top of the RAM. Also, the Gigabyte MC62-G40-00 sucks -- it doesn't allow overclocking.
>
> 👤 **magikRUKKOLA** replied the **2025-07-13** at **14:09:14**:<br>
> 621GB Q4_K quant dropped!
>
> https://huggingface.co/KVCache-ai/Kimi-K2-Instruct-GGUF
>
> Can't wait for the Q3 quant to try out on 512GB RAM. :) Also setting up water cooling for the RTX 3090s so that four of them can be connected without risers (to support as much context as possible).
---
👤 **ewhacc** replied the **2025-07-13** at **11:25:36**:<br>
@ubergarm
> I haven't looked to see if existing methods for going from fp8 safetensors to bf16 GGUFs would work on that model yet. I use the evshiron llama.cpp fork (from fairydreaming's original MLA fork) plus triton-cpu to convert deepseek 671B without a GPU on a big RAM box. That is the first challenge.
I just tried fp8_cast_bf16.py but got a VRAM OOM. I didn't think this would be a big challenge, but the first step is already getting tough. I will try with more VRAM, and perhaps will try the evshiron llama.cpp too. Thanks a lot for the help. I'm just giving your recipes a try.
> Next you'll need over 1TB RAM to inference the Q8_0 to make an imatrix.
Hmm, this one is what I was worried about and wanted to ask. Well, time to wake my xeon box (it's too loud). BTW, isn't it possible to make the imatrix directly from BF16? Is making Q8_0 a must? Ha ha, it's a long way to go: FP8 -> BF16 -> Q8_0 -> imatrix -> Q2
Edit: I'm trying evshiron llama.cpp, which seems to have a direct conversion from fp8 to q8_0.
Edit: Failed to get q8_0. I don't know if it needs 1T of RAM, but it doesn't seem to be a RAM problem (tried on 512G).
```
python ev_llama.cpp/convert_hf_to_gguf.py Kimi-K2-Instruct --outfile Kimi-K2-Instruct-q8 --outtype q8_0
ValueError: Pointer argument (at 0) cannot be accessed from Triton (cpu tensor?)
```
> 👤 **ubergarm** replied the **2025-07-13** at **16:29:49**:<br>
> @ewhacc
>
> > I just tried fp8_cast_bf16.py but got VRAM OOM.
>
> Right, for the `fp8_cast_bf16.py` script approach from deepseek the chain is quite long: `fp8 safetensors -> bf16 safetensors -> bf16 GGUF -> Q8_0 -> imatrix -> Q2`. I believe this is the method used for mainline MLA quants of deepseek. Not sure if this works for the slightly different arch of Kimi-K2 1000B-A32B or not.
>
> Regarding OOMing with this method, [I have some notes in a discussion with fairydreaming about using triton-cpu instead for using RAM without GPU](https://github.com/ggml-org/llama.cpp/discussions/11989#discussioncomment-13555486) that I just dug up. Also found a patch that might prevent VRAM OOM on 4090 series cards [here on huggingface](https://huggingface.co/deepseek-ai/DeepSeek-V3/discussions/17).
>
> > BTW, isn't it possible to make imatrix directly from BF16?
>
> Yes, if you can run inferencing with the 2TB VRAM+RAM bf16 GGUF, then you could use it directly for imatrix. I haven't tested the quality difference in terms of perplexity, but I believe the Q8_0 is sufficient given it is quite similar to the native fp8.
>
> > I'm trying evshiron llama.cpp, which seems to have a direct conversion from fp8 to q8_0.
>
> Yes this is my usual method. Not sure it would work with Kimi-K2 though without some modifications. I assume you got `triton-cpu` to build (this is one of the more difficult steps of the process). Notes on building triton-cpu [here where @saood06 helped fix a build bug for them](https://github.com/triton-lang/triton-cpu/issues/237#issuecomment-2878180022).
>
> My script to convert the fp8 safetensors directly to bf16 GGUF is then:
> ```bash
> # evshiron/llama.cpp@63b7e8aa
> source venv/bin/activate
> python \
> llama.cpp/convert_hf_to_gguf.py \
> --outtype bf16 \
> --split-max-size 50G \
> --outfile /models/ubergarm/DeepSeek-TNG-R1T2-Chimera-GGUF/ \
> /models/tngtech/DeepSeek-TNG-R1T2-Chimera/
> ```
>
> If you're still getting that error, you might have to poke around in `convert_hf_to_gguf.py` and search where it says `triton` in the `deepseek-v3` part. You might need to look at the recent Kimi-K2 PR https://github.com/ggml-org/llama.cpp/pull/14654 and add that to the evshiron fork or something.
>
> I don't have access to enough RAM at the moment. Maybe will in the next few weeks :crossed_fingers:
>
> Thanks for blazing the trail! And feel free to open a new discussion/issue specific to Kimi-K2 etc...
>
> 👤 **magikRUKKOLA** replied the **2025-07-13** at **18:17:56**:<br>
> > I don't have access to enough RAM at the moment. Maybe will in the next few weeks 🤞
>
> Hey bro, are you in EU? I can drop you some 1TB DDR5 RAM with a huge discount.
>
> 👤 **ubergarm** replied the **2025-07-13** at **18:38:54**:<br>
> @magikRUKKOLA
>
> Oh man, thanks for the offer, no I'm in east coast usa currently. wendell at level1techs.com is hooking me up with access to a new remote rig he's assembling that is a big dual socket 1.5TB beast that should be online sooner than I expected!
>
> 👤 **ewhacc** replied the **2025-07-14** at **00:16:44**:<br>
> @ubergarm
> You have gone through this whole tough process. Thanks so much for sharing your experience.
>
> > Yes, if you can run inferencing with the 2TB VRAM+RAM bf16 GGUF, then you could use it directly for imatrix. I haven't tested the quality difference in terms of perplexity, but I believe the Q8_0 is sufficient given it is quite similar to the native fp8.
>
> Oops, 2TB. Sounds like going through Q8_0 is a must.
>
> 👤 **ubergarm** replied the **2025-07-14** at **02:53:30**:<br>
> @ewhacc
>
> So Wendell just hooked me up with remote access to a big dual socket AMD CPU rig with 42TB of kioxia flash storage I put into two RAID0 arrays and with almost 1.5TB RAM (no GPUs). So I'm working through it now using the "mainline" method of casting the fp8 safetensors to bf16 safetensors first.
>
> If I can get that working, I'll try to see if it is possible to adapt the evshiron fork to do the same MLA treatment to Kimi-K2 as it does for deepseek models and do the direct fp8 safetensors -> bf16 GGUF
>
> A few folks working on it also here feel free to join with your findings: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1
>
> 👤 **ewhacc** replied the **2025-07-14** at **03:12:39**:<br>
> @ubergarm
> > A few folks working on it also here feel free to join with your findings: https://huggingface.co/gabriellarson/Kimi-K2-Instruct-GGUF/discussions/1
>
> Thanks for inviting. I see you already started there :)
---
👤 **ewhacc** replied the **2025-07-13** at **11:30:20**:<br>
@magikRUKKOLA
> Small tip/note -- if you choose to use DDR4, don't buy 3200 MT/s (unless it's for Lenovo machines). The Samsung 2666 MT/s ECC overclocks great at 1.35V with crazy timings. But you would have to install additional fans and heatsinks on top of the RAM. Also, the Gigabyte MC62-G40-00 sucks -- it doesn't allow overclocking.
Thank you for the tip. Yeah, I have been tempted to overclock DDR4, and even DDR5. But I have to check whether my board allows it. Yes, RAM also needs cooling; my DDR5 gets hot when I use R1.
---
👤 **magikRUKKOLA** replied the **2025-07-15** at **19:59:27**:<br>
ATTN! Below is not a joke. It's an actual recent commit for flashinfer. Please pay attention:
```diff
- return self.run_return_lse(q, paged_kv_cache, k_scale, v_scale)
+ return self.run_return_lse(q, paged_kv_cache, k_scale=k_scale, v_scale=v_scale)
```
Let's read the explanation:
```
fix: correctly pass k_scale and v_scale to run() in forward_return_lse
```
MORE!
```
Bug Fix: Corrected an issue in BatchPrefillWithPagedKVCacheWrapper.forward_return_lse where k_scale and v_scale were incorrectly passed as positional arguments instead of keyword arguments to run_return_lse(). This resolves a **silent misbehavior or potential runtime error** caused by functools.partialmethod expecting keyword-only arguments.
```
the comments from the **maintainer**!!
```
Great catch, left some comments for suggestions :)
```
I mean, this doesn't make sense. I am not really sure it's real.
View File
@@ -0,0 +1,81 @@
### 🗣️ [#613](https://github.com/ikawrakow/ik_llama.cpp/discussions/613) - Pathological Quant/CUDA combinations -- How to know what works?
| **Author** | `usrlocalben` |
| :--- | :--- |
| **Created** | 2025-07-15 |
| **Updated** | 2025-07-15 |
---
#### Description
Some quants/tensors seem to be incompatible with CUDA. My current example is a Q6_K (unsloth) quant of Kimi K2. If I leave all routed exp on CPU, I can get e.g. TG=~9tps. There's some VRAM remaining (RTX 8000, Turing, 48GB) so I can put a few e.g. up_exps on GPU. When doing this TG drops to 1tps or worse.
I've seen this phenomenon before when trying to offload routed experts with some other quant types (with DeepSeek R1/V3). My understanding (I think @ubergarm explained it somewhere) is that some quants are not supported on CUDA and therefore must be converted before use **per token**.
PP throughput (~80tps) is not noticeably affected, presumably because of batching. (b=ub=4096)
Good outcome, ~9tps TG
```
-mla 2 -fa -fmoe
-b 4096 -ub 4096
-ctk f16 -c 64000
--n-gpu-layers 99
-ot exps=CPU
-op 26,0,27,0,29,0
-m /path/to/Kimi-K2-Instruct-Q6_K-00001-of-00018.gguf
```
if I change to
```
-ot "blk\.(1|2|3|4|5|6)\.ffn_up.*=CUDA0"
-ot exps=CPU
```
TG drops to 1tps or worse.
Assuming the idea is correct, Q6_K is a pathological quant type (at least on Turing) -- how to know this? How can I know what my options are when building GGUFs that match my offload/cpu arrangement?
edit: I shouldn't say they are not _supported_, but they aren't integrated into a kernel for the required op.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-15** at **17:52:16**:<br>
`Q6_K` has been around forever and hence is a well supported quant on all platforms. So, it is not that.
Instead, you absolutely do not want to split up `ffn_up` and `ffn_gate` when using `-fmoe`. Try
```
-ot "blk\.(1|2|3)\.ffn_up_exps=CUDA0,blk\.(1|2|3)\.ffn_gate_exps=CUDA0"
```
instead.
If you split `ffn_up` and `ffn_gate` and there is a fused `ffn_up/ffn_gate` op where `ffn_up` is on the GPU but `ffn_gate` is on the CPU, whatever the back-end decides to do (run the op on the GPU or the CPU), the tensors need to be copied from the GPU to the CPU or vice versa. This totally kills TG performance.
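For concreteness, a full invocation along these lines might look like the sketch below; the `llama-server` front end, the layer indices and the model path are placeholders to adapt to whatever fits in VRAM, while the flags themselves are the ones already used above:
```bash
# The more specific -ot patterns must come before the catch-all "exps=CPU",
# and ffn_up_exps / ffn_gate_exps of a given layer stay on the same device,
# so that the fused up/gate op of -fmoe never straddles CPU and GPU.
./build/bin/llama-server \
  -m /path/to/Kimi-K2-Instruct-Q6_K-00001-of-00018.gguf \
  -mla 2 -fa -fmoe \
  -b 4096 -ub 4096 \
  -ctk f16 -c 64000 \
  --n-gpu-layers 99 \
  -ot "blk\.(1|2|3)\.ffn_up_exps=CUDA0,blk\.(1|2|3)\.ffn_gate_exps=CUDA0" \
  -ot exps=CPU
```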
> 👤 **usrlocalben** replied the **2025-07-15** at **18:37:07**:<br>
> > `Q6_K` has been around forever and hence is a well supported quant on all platforms.
>
> I did find it surprising. If instead it were IQxxx I probably wouldn't have been inspired to ask/write.
>
>
> > Instead, you absolutely do not want to split up `ffn_up` and `ffn_gate` when using `-fmoe`. Try
>
> Makes so much sense it should have been obvious 😥
>
> Thanks
>
> 👤 **usrlocalben** replied the **2025-07-15** at **18:42:37**:<br>
> Furthermore, now I see why I've observed that particular offload pattern mentioned in various places.
>
> I'll have to revisit some of my previous quant layouts and invocations. I had mixed gate/up offloads arbitrarily to optimally fill VRAM and didn't realize I was creating pathological arrangements.
---
👤 **ikawrakow** replied the **2025-07-15** at **18:07:47**:<br>
One more thing: if you have enough VRAM to use batch and u-batch of 4096, you should try removing `-op 26,0,27,0,29,0` to see how this affects your PP performance. Depending on GPU vs CPU speed, this may give you a non-negligible boost in PP performance for long prompts (longer than 4k tokens).
> 👤 **usrlocalben** replied the **2025-07-15** at **18:39:48**:<br>
> In the same testing prior to posting I did a fresh a/b test with & without this and it _still_ improves things, maybe 1.5x (I just tossed the measurements). I did notice the recent change to the heuristics wrt. offloading, but enforcing the -op policy is still an improvement for my hw combo.
View File
@@ -0,0 +1,55 @@
### 🗣️ [#619](https://github.com/ikawrakow/ik_llama.cpp/discussions/619) - gpu p2p utilization
| **Author** | `magikRUKKOLA` |
| :--- | :--- |
| **Created** | 2025-07-16 |
| **Updated** | 2025-07-17 |
---
#### Description
Is there any mode of LLM inference in ik_llama.cpp that utilizes the p2p functionality between the GPUs? That would include NVLink and, most importantly, the regular p2p master-slave functionality as enabled by the open-source nvidia drivers (see https://github.com/aikitoria/open-gpu-kernel-modules ).
[EDIT]:
with and without p2p functionality:
```bash
/usr/share/doc/nvidia-cuda-toolkit/examples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/p2pBandwidthLatencyTest
Bidirectional P2P=Disabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 839.83 14.54 16.64
1 14.53 839.83 16.67
2 16.72 16.67 840.26
Bidirectional P2P=Enabled Bandwidth Matrix (GB/s)
D\D 0 1 2
0 839.15 52.04 52.04
1 52.04 839.83 52.03
2 51.94 52.03 839.83
```
So there is about 35 GB/s of additional bandwidth available to nvidia GPU users with p2p enabled.
[EDIT]:
If I am reading the code correctly, the p2p functionality is used only in ggml_backend_sycl_graph_compute, and ggml_sycl_set_peer_access allows it only if n_tokens is less than 128? Can anyone provide more info?
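For reference, and independent of the SYCL code quoted below, peer access between NVIDIA GPUs can be queried and enabled from the host with the stock CUDA runtime API. A minimal stand-alone check (not code from ik_llama.cpp, just the generic runtime calls) might look like this:
```cpp
// p2p_check.cu - report which GPU pairs support peer access and try to enable it.
// Build with: nvcc -o p2p_check p2p_check.cu
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int n = 0;
    cudaGetDeviceCount(&n);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            if (i == j) continue;
            int can = 0;
            cudaDeviceCanAccessPeer(&can, i, j);
            printf("GPU %d -> GPU %d : peer access %s\n", i, j, can ? "possible" : "not possible");
            if (can) {
                cudaSetDevice(i);  // peer access is enabled per direction, per context
                cudaError_t err = cudaDeviceEnablePeerAccess(j, 0);
                if (err != cudaSuccess && err != cudaErrorPeerAccessAlreadyEnabled)
                    printf("  enable failed: %s\n", cudaGetErrorString(err));
            }
        }
    }
    return 0;
}
```
Whether the back-end then actually issues direct device-to-device copies is a separate question, which is what the code path mentioned above appears to gate on n_tokens.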
[EDIT2]:
Uh oh?
```
4415 //todo, it's known issueerror in device2device cross GPUs. reused when the issue is fixed. DON"T remove
4416 #if 0
4417 SYCL_CHECK(CHECK_TRY_ERROR((*stream).memcpy(
4418 (char *)dst->data, (const char *)src->data, size).wait()));
4419
4420 /*
4421 DPCT1009:201: SYCL uses exceptions to report errors and does not use the
4422 error codes. The original code was commented out and a warning string
4423 was inserted. You need to rewrite this code.
4424 */
4425 SYCL_CHECK(CHECK_TRY_ERROR(
4426 dpct::dev_mgr::instance().get_device(dst_ctx->device).queues_wait_and_throw()));
4427 #endif
```
File diff suppressed because one or more lines are too long
View File
@@ -0,0 +1,26 @@
### 🗣️ [#623](https://github.com/ikawrakow/ik_llama.cpp/discussions/623) - Quantizing panels/bundles instead of blocks?
| **Author** | `jubruckne` |
| :--- | :--- |
| **Created** | 2025-07-17 |
| **Updated** | 2025-07-17 |
---
#### Description
Hi there! I much admire your work in this project.
One thing I've been wondering… I believe weights are already repacked to make MatMul more efficient for the ffn... now I don't understand the code well enough… are we (or could we possibly) also interleaving the weights of w1,w2,w3 into panels? And then quantize based on these panel structures instead of individual blocked weight matrices?
Maybe this doesn't make much sense at all... but I've been thinking about it for a while now, and it seems to me this could also open other possibilities, like selecting a variable bitrate for each panel. Or sorting the panels by importance (derived from the imatrix), and only calculating the most important ones (like the top 50%).
I apologize if some of this seems stupid, it probably is 🙈…
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2025-07-17** at **12:19:22**:<br>
You mean, instead of having 256 weights from the same row in a block of 256, we could have used 32 x 8 from 8 different consecutive rows?
View File
@@ -0,0 +1,195 @@
### 🗣️ [#63](https://github.com/ikawrakow/ik_llama.cpp/discussions/63) - LLaMA-3.2 quantization evaluation
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-09-26 |
| **Updated** | 2024-09-26 |
---
#### Description
LLaMA-3.2 is out. `llama.cpp` does not yet support the vision models, so this post focuses on the 1B and 3B text models that could be very handy for local usage on low-end devices. The models are small enough to use even at full precision (`bf16`), but I think it is still interesting to look at quantization, as token generation is significantly faster with quantized models.
To reproduce the results reported here
1. Clone my validation dataset repository
```
git clone git@hf.co:datasets/ikawrakow/validation-datasets-for-llama.cpp
cd validation-datasets-for-llama.cpp
gunzip wiki.test.raw.gz
gunzip wiki.train.raw.gz
```
2. Get one or more LLaMA-3.2 models. E.g.
```
git clone git@hf.co:meta-llama/Llama-3.2-3B
```
3. Convert to GGUF. E.g.
```
python3 convert_hf_to_gguf.py --outtype bf16 Llama-3.2-3B/
```
4. Create imatrix data. E.g.
```
./bin/llama-imatrix -m Llama-3.2-3B/Llama-3.2-3B-BF16.gguf -f validation-datasets-for-llama.cpp/wiki.train.raw --chunks 1000 -o l32_imatrix_c512.out
```
5. Quantize. E.g.
```
./bin/llama-quantize --imatrix l32_imatrix_c512.out Llama-3.2-3B/Llama-3.2-3B-BF16.gguf iq4k.gguf iq4_k
```
6. Compute perplexity
```
./bin/llama-perplexity -m iq4k.gguf -f validation-datasets-for-llama.cpp/wiki.test.raw -t 1 -ngl 100
```
7. Compute HellaSwag
```
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/hellaswag-validation.bin --multiple-choice -t 1 -ngl 100 -c 2048
```
8. Compute MMLU
```
./bin/llama-perplexity -m iq4k.gguf -bf validation-datasets-for-llama.cpp/mmlu-test.bin --multiple-choice -t 1 -ngl 100 -c 2048
```
### Perplexity
Perplexity (`PPL` in what follows) is not the best measure to compare *different* models, but it is extremely useful when comparing a quantized version of a model to the *same* full precision model. In the graphs below I use the quantization error defined as
```
quantization error = PPL(Q)/PPL(bf16) - 1
```
where `PPL(Q)` is the perplexity of quantization `Q` and `PPL(bf16)` is the perplexity of the full model (the 3.2 models are released as `bf16`, so I use `bf16` throughout as `bf16` support has been added here in PR #39, #40, #41, #56).
The following graph shows quantization error of LLaMA-3.2-3B as a function of bits-per-weight (bpw) for (almost) all quantization types supported here. Note that this is the effective bpw that includes the `token_embedding.weight` tensor, which is quantized with more bits (typically `Q6_K`), and this has a significant impact on the overall bpw balance as this tensor represents a significant fraction of the overall model size. The y-axis is logarithmic, so differences can be quite large even if data points look relatively close. The cyan circles are for the new quants `IQ2_K, IQ3_K, IQ4_K, IQ5_K` and `IQ6_K` that are not available in mainline `llama.cpp`. The black symbols are for i-quants, the red for k-quants, and the blue symbols are legacy quants (`Q4_0, Q4_1, Q5_0, Q5_1`).
![l32_ppl_3B](https://github.com/user-attachments/assets/602e5623-6a90-4c74-82ef-26dca80c4a86)
The next graph shows results for LLaMA-3.2-3B-Instruct. The results are qualitatively very similar to the base model, with the quantization error being slightly lower compared to the base model.
![l32_it_ppl_3B](https://github.com/user-attachments/assets/91929ff8-f456-4d37-bce1-0105bfc79d7c)
My conclusion from these two graphs are
1. Going below 3 bpw with these models is not useful - the quantization error becomes too large. This is similar to the 3.1 LLaMA models.
2. The new iqk-quants `IQ4_K` and `IQ5_K` are significantly better than k- or legacy quants in this bpw range
3. Legacy quants are mostly useless, as is so often the case.
The next graph is for the base LLaMA-3.2-1B model
![l32_ppl_1B](https://github.com/user-attachments/assets/3918f73f-f7d4-4a66-80df-16c6dc9d5fcf)
Here the quantization error is significantly larger, going below 2% only for 5+ bpw. At about 4.95 bpw `IQ4_K` has a quantization error of 3%, `Q4_K_S` is at 4.3%, and `Q4_0` at 12.5% (!), nearly the same as `IQ3_K` at 3.68 bpw.
### HellaSwag
The HellaSwag 0-shot score of 74.34 for the 3B base model is surprisingly high for a model of this size. But here we are more interested in looking at the impact of quantization, so I'll focus on that. The following graph shows
```
HellaSwag(bf16) - HellaSwag(Q)
```
for LLaMA-3.2-3B.
![hella_3B](https://github.com/user-attachments/assets/06f69a2f-48e2-440a-876a-2cb5b960ae71)
As one could have expected from the perplexity results, sub-3-bpw quantization destroys the model utility. Hence, it is more useful to focus on the 3+ bpw range, which is the purpose of the next graph
![hella_3B_a](https://github.com/user-attachments/assets/b49e6b58-362e-4844-982b-89c211000df0)
We see that `IQ4_K, IQ5_K, IQ6_K` and `Q6_K` are basically indistinguishable from the `bf16` model for the HellaSwag metrics. But at less than 2 points below `bf16`, even `IQ3_K` and `IQ3_S` could be useful if HellaSwag is representative for the kind of tasks one intends to tackle.
### MMLU
Here I show only results for the 3+ bpw range for LLaMA-3.2-3B in the following graph
![mmlu_3B_a](https://github.com/user-attachments/assets/5562b55f-f2aa-4ee5-b32f-023e698fb22d)
All quantizations above `IQ3_K` (3.6 bpw) are (nearly) indistinguishable from the full `bf16` model according to this metric.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-09-26** at **16:11:00**:<br>
Here some performance numbers for the 1B model on a Ryzen-7950X CPU
| model | size | backend | threads | test | t/s |
| --------------- | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 1B BF16 | 2.79 GiB | CPU | 16 | pp512 | 1217.13 ± 18.31 |
| llama 1B BF16 | 2.79 GiB | CPU | 1 | tg128 | 15.31 ± 0.19 |
| llama 1B BF16 | 2.79 GiB | CPU | 2 | tg128 | 22.97 ± 0.04 |
| llama 1B BF16 | 2.79 GiB | CPU | 4 | tg128 | 23.86 ± 0.08 |
| llama 1B BF16 | 2.79 GiB | CPU | 8 | tg128 | 23.45 ± 0.32 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 16 | pp512 | 1109.36 ± 24.77 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 1 | tg128 | 38.57 ± 0.24 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 2 | tg128 | 46.86 ± 0.04 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 4 | tg128 | 46.42 ± 0.11 |
| llama 1B Q8_0 | 1.48 GiB | CPU | 8 | tg128 | 44.41 ± 0.07 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 16 | pp512 | 1211.41 ± 12.99 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 1 | tg128 | 30.81 ± 0.04 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 2 | tg128 | 57.37 ± 0.17 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 4 | tg128 | 76.93 ± 0.14 |
| llama 1B IQ4_K | 935.24 MiB | CPU | 8 | tg128 | 74.61 ± 0.09 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 16 | pp512 | 982.76 ± 16.70 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 1 | tg128 | 24.76 ± 0.04 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 2 | tg128 | 46.39 ± 0.06 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 4 | tg128 | 66.47 ± 0.23 |
| llama 1B IQ5_K | 1.02 GiB | CPU | 8 | tg128 | 64.73 ± 0.10 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 16 | pp512 | 1257.38 ± 13.08 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 1 | tg128 | 31.56 ± 0.55 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 2 | tg128 | 55.68 ± 0.28 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 4 | tg128 | 66.34 ± 0.27 |
| llama 1B Q5_K_S | 1.03 GiB | CPU | 8 | tg128 | 65.35 ± 0.23 |
| llama 1B Q6_K | 1.15 GiB | CPU | 16 | pp512 | 1271.25 ± 12.18 |
| llama 1B Q6_K | 1.15 GiB | CPU | 1 | tg128 | 31.43 ± 0.21 |
| llama 1B Q6_K | 1.15 GiB | CPU | 2 | tg128 | 51.40 ± 0.22 |
| llama 1B Q6_K | 1.15 GiB | CPU | 4 | tg128 | 58.25 ± 0.13 |
| llama 1B Q6_K | 1.15 GiB | CPU | 8 | tg128 | 57.64 ± 0.02 |
---
👤 **ikawrakow** replied the **2024-09-26** at **16:18:44**:<br>
Here some performance numbers for the 3B model on a Ryzen-7950X CPU
| model | size | backend | threads | test | t/s |
| --------------- | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama 3B BF16 | 6.72 GiB | CPU | 16 | pp512 | 482.81 ± 16.34 |
| llama 3B BF16 | 6.72 GiB | CPU | 1 | tg128 | 5.53 ± 0.05 |
| llama 3B BF16 | 6.72 GiB | CPU | 2 | tg128 | 8.65 ± 0.01 |
| llama 3B BF16 | 6.72 GiB | CPU | 4 | tg128 | 9.35 ± 0.02 |
| llama 3B BF16 | 6.72 GiB | CPU | 8 | tg128 | 9.14 ± 0.05 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 16 | pp512 | 383.82 ± 1.85 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 1 | tg128 | 14.93 ± 0.30 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 2 | tg128 | 18.66 ± 0.04 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 4 | tg128 | 18.03 ± 0.13 |
| llama 3B Q8_0 | 3.57 GiB | CPU | 8 | tg128 | 17.20 ± 0.03 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 16 | pp512 | 409.30 ± 3.79 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 1 | tg128 | 11.58 ± 0.01 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 2 | tg128 | 22.28 ± 0.02 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 4 | tg128 | 39.25 ± 0.18 |
| llama 3B IQ3_K | 1.55 GiB | CPU | 8 | tg128 | 37.45 ± 0.08 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 16 | pp512 | 418.06 ± 2.13 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 1 | tg128 | 12.23 ± 0.04 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 2 | tg128 | 23.16 ± 0.07 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 4 | tg128 | 30.55 ± 0.02 |
| llama 3B IQ4_K | 2.09 GiB | CPU | 8 | tg128 | 29.41 ± 0.16 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 16 | pp512 | 445.79 ± 15.41 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 1 | tg128 | 13.85 ± 0.03 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 2 | tg128 | 22.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 4 | tg128 | 30.74 ± 0.09 |
| llama 3B Q4_K_S | 2.09 GiB | CPU | 8 | tg128 | 29.77 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 16 | pp512 | 338.86 ± 7.69 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 1 | tg128 | 9.70 ± 0.12 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 2 | tg128 | 18.31 ± 0.02 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 4 | tg128 | 26.21 ± 0.03 |
| llama 3B IQ5_K | 2.41 GiB | CPU | 8 | tg128 | 25.18 ± 0.10 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 16 | pp512 | 432.96 ± 2.83 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 1 | tg128 | 12.89 ± 0.15 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 2 | tg128 | 22.54 ± 0.09 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 4 | tg128 | 26.37 ± 0.07 |
| llama 3B Q5_K_S | 2.41 GiB | CPU | 8 | tg128 | 25.55 ± 0.02 |
| llama 3B Q6_K | 2.76 GiB | CPU | 16 | pp512 | 439.73 ± 5.86 |
| llama 3B Q6_K | 2.76 GiB | CPU | 1 | tg128 | 12.90 ± 0.19 |
| llama 3B Q6_K | 2.76 GiB | CPU | 2 | tg128 | 21.05 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 4 | tg128 | 22.97 ± 0.01 |
| llama 3B Q6_K | 2.76 GiB | CPU | 8 | tg128 | 22.20 ± 0.01 |
View File
@@ -0,0 +1,127 @@
### 🗣️ [#8](https://github.com/ikawrakow/ik_llama.cpp/discussions/8) - New quantization types IQ2_K, IQ3_K, IQ4_K, IQ5_K
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-08-01 |
| **Updated** | 2025-07-04 |
---
#### Description
## Why?
I can hear what some are thinking: "Are you crazy? Even more quantization types? Doesn't `llama.cpp` already have enough?"
That was what I was thinking too. Until LLaMA-3 came along, that is.
Quantization errors for LLaMA-3 models are much higher than they have been for all previous models I have experimented with. This is best illustrated with the graph below. LLaMA-3.1 is all the rage these days, but I don't have the ability to run LLaMA-3.1-405B, so I have settled for LLaMA-3.1-70B to generate the graph. We will measure quantization error `QError` of a quantization `Q` using perplexity `PPL` as
```
QError = PPL(Q)/PPL(fp16) - 1
```
As we are not evaluating model performance in language tasks, but are only interested in the performance of a quantized model compared to **the same** full precision model, there is no benefit from looking at commonly used language modeling / reasoning benchmarks, which a) are typically less sensitive to quantization errors than PPL and b) take much longer to evaluate.
One could also use KL divergence, but KL divergence and `PPL` are closely related, and `PPL` is more convenient to calculate with `llama.cpp`, so `PPL` it is.
![l31_70B](https://github.com/user-attachments/assets/e1e8e2ba-1e61-4913-9e86-bc682b227e25)
Blue symbols represent legacy quants (`Q4_0, Q4_1, Q5_0, Q5_1`), red symbols show results for k-quants, i-quants are depicted in black. To show how much higher the quantization error of LLaMA-3.1-70B is, I have included results for LLaMA-v2-70B shown in brown (just for k-quants as I have somehow lost the i-quants runs and did not feel like re-running the quite lengthy calculations). We see that there is basically about 1 bit-per-weight (bpw) gap between LLaMA-v2-70B and LLaMA-3.1-70B. I.e., it looks like the additional tokens used for training LLaMA-3 have paid off, the model has "learned" more from the data, and the model parameters in LLaMA-3.1 contain about 1 bpw extra information. This then results in a higher quantization error for a given bpw quantization budget.
We can now discuss the new quants shown with cyan circles. Please note that the y-axis is logarithmic so that the differences between the data points are quite large, even if they look fairly close to each other. For instance, the blue point around 5.5 bpw (`Q5_0`), which looks quite close to the red point (`Q5_K_S`), has a quantization error of 2.9% vs 1.9%. The cyan point around 5.5 bpw is `IQ5_K`, with a quantization error of 1.4%, i.e., `IQ5_K` has a quantization error that is 2.1X lower compared to `Q5_0`, and 40% lower compared to `Q5_K_S`. The cyan point around 4.5 bpw (`IQ4_K`) has a 2.7X lower quantization error compared to `Q4_0`, and 40% lower compared to `Q4_K_S`. So, even though `IQ4_K` and `IQ5_K` don't come anywhere close to what we used to have for 4- and 5-bit quantization in the pre-LLaMA-3.1 days, they do give a nice improvement compared to the SOTA in the 4+ bpw range.
"But what about the cyan points around 3.5 and 2.4 bpw? They are basically the same as i-quants!" - I hear you asking. These two exist for two reasons:
* My curiosity
* Much better inference performance compared to i-quants on the CPU and old GPU's.
### Curiosity
i-quants are much better than k-quants in the sub-4-bpw range. i-quants in the sub-4-bpw range all use "codebooks" that encode groups of 8 or 4 model weights on the E8 or D4 lattice. The "codebook" idea comes originally from QuIP# and is also being used in, e.g., AQLM. I have been curious for some time to what extent the use of a "codebook" contributes to the better quantization quality of i-quants compared to k-quants. The "codebook" certainly acts as a kind of regularization to avoid/reduce overfitting: one only has a subset of all possible lattice points available in the "codebook" to represent a group of model weights, and hence the quantization algorithm cannot focus too much on individual quants, possibly missing more important model weights in the process. But is there more to it than just it being a regularization technique? I was curious and, as we can see in the above graph, it is indeed possible to match i-quants quantization accuracy with a non-linear quantization technique.
### Performance
The use of a "codebook" requires a lookup in a fairly large table to convert the "codebook" index (which is stored in the quantized model) to actual quantized model weights when performing matrix multiplications. The lookup is handled quite OK by modern GPU's, but leads to a massive performance penalty on CPU's (and, from what I gather from `llama.cpp` user comments, also on older GPU's). The new `IQK` quants use a non-linear mapping between the quantized value stored in the model data (`0...15` for 4-bit quantization, `0...7` for 3-bit, etc.) and the actual model weight, which also needs a lookup table. But these lookup tables are much smaller (4, 8, 16, 32 `INT8` values for 2-, 3-, 4-, 5-bit quantization), so they fit into 1 or 2 SIMD registers, and thus can be handled very efficiently with SIMD instructions (`_mm256_shuffle_epi8` on `AVX2`, `vqtbl1q_s8` on `ARM_NEON`), resulting in a performance that is (nearly) the same as corresponding linear mapping between quants and model weights.
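As a rough illustration of why such a small table is cheap, here is a simplified stand-alone sketch (not the actual `ik_llama.cpp` kernel; it uses a single hypothetical 16-entry table and a hypothetical nibble packing rather than the two-table scheme of the `IQX_K` quants) of mapping packed 4-bit indices to `int8` values with `_mm256_shuffle_epi8`:
```cpp
// Map 64 packed 4-bit quant indices (32 bytes: low nibble, high nibble) to
// signed 8-bit values through a 16-entry table held in a single AVX2 register.
#include <immintrin.h>
#include <cstdint>

static inline void dequant_nibbles_avx2(const uint8_t *packed, const int8_t values[16], int8_t *out) {
    // Broadcast the 16-byte table to both 128-bit lanes of a 256-bit register.
    const __m256i table = _mm256_broadcastsi128_si256(_mm_loadu_si128((const __m128i *)values));
    const __m256i m4    = _mm256_set1_epi8(0x0f);
    const __m256i q     = _mm256_loadu_si256((const __m256i *)packed);
    const __m256i lo    = _mm256_and_si256(q, m4);                       // low nibbles
    const __m256i hi    = _mm256_and_si256(_mm256_srli_epi16(q, 4), m4); // high nibbles
    // _mm256_shuffle_epi8 is a per-lane byte lookup; with the same table in both
    // lanes this is exactly the 4-bit index -> int8 value mapping.
    _mm256_storeu_si256((__m256i *)out,        _mm256_shuffle_epi8(table, lo));
    _mm256_storeu_si256((__m256i *)(out + 32), _mm256_shuffle_epi8(table, hi));
}
```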
Let's look how this translates into observed inference performance. We compare `IQ2_K` to the matching `IQ2_XS`, and `IQ3_K` to the matching `IQ3_S` quants (matching in the sense that they use basically the same bpw and have very similar quantization accuracy). The following table shows performance in tokens per second (t/s) for prompt processing (`pp512`, so a prompt of 512 tokens) and token generation (`tg128`, so generating 128 tokens one-by-one) between matching quants on `AVX2` (Ryzen-7950X) and `ARM_NEON` (M2-Max CPU). I have also added mainline `llama.cpp` results. The two values in the `Speedup` column are the `t/s` ratios between the new `IQK` quants and the corresponding i-quant in `llama.cpp` and in this repository. For instance, if we look at `IQ3_S` on the Ryzen-7950X, we see that `IQ3_K` will perform prompt processing 6.45 times faster than `llama.cpp`, and token generation speed will be 2.37X!
| Case | test | threads | t/s llama.cpp | t/s this repo | t/s iqk | Speedup |
| -------------- | ----- | ------: | ------------: | ------------: | ------------: | ----------: |
| 8B IQ2_XS AVX2 | pp512 | 16 | 46.45 ± 0.27 | 125.46 ± 0.43 | 194.64 ± 0.66 | 4.19 / 1.55 |
| | tg128 | 4 | 10.88 ± 0.09 | 12.07 ± 0.07 | 21.46 ± 0.03 | 1.97 / 1.78 |
| 8B IQ3_S AVX2 | pp512 | 16 | 28.04 ± 0.08 | 96.28 ± 0.45 | 180.77 ± 0.62 | 6.45 / 1.88 |
| | tg128 | 4 | 6.80 ± 0.01 | 7.62 ± 0.10 | 16.10 ± 0.16 | 2.37 / 2.11 |
| 7B IQ2_XS NEON | pp512 | 8 | 22.77 ± 0.21 | 51.15 ± 0.24 | 60.60 ± 0.97 | 2.66 / 1.18 |
| | tg128 | 8 | 18.19 ± 1.30 | 20.94 ± 0.19 | 28.24 ± 0.39 | 1.55 / 1.35 |
| 7B IQ3_S NEON | pp512 | 8 | 12.08 ± 0.30 | 49.72 ± 0.06 | 55.65 ± 0.82 | 4.61 / 1.12 |
| | tg128 | 8 | 10.32 ± 0.25 | 11.11 ± 0.37 | 20.33 ± 0.06 | 1.97 / 1.83 |
## What are non-linear quants anyway?
Will add later.
## IQ6_K?
Before LLaMA-3, `Q6_K` quantization always had a quantization error in the 0.1-0.15% range, i.e., it was basically as good as the full precision model. But for LLaMA-3.1-70B `Q6_K` quantization error is 0.65%! `Q8_0` does match the full precision model, but it uses 2 extra bpw. I have experimented with 6-bit non-linear quantization in the past, but `Q6_K` quantization error was so low that it was basically not possible to see a benefit from the non-linearity. Given the much higher `Q6_K` quantization error for LLaMA-3 models, it may be worthwhile to resurrect 6-bit non-linear quantization.
**Update** See PR #14
---
#### 🗣️ Discussion
👤 **afsara-ben** replied the **2025-06-13** at **17:55:20**:<br>
@ikawrakow just found out your fork, wanted to clear up my understanding - K quants are block based, and IQ quants in llama.cpp are also block based but with a codebook. The IQn_K quants here are the same as IQ quants but with a non-linear mapping between the quantized weight and the actual weight. Maybe it's somewhere in the code, but can you elaborate on what the non-linear function is? And even if the lookup table is small (a 4x4 grid instead of 256x256), the time to access it from L1 cache will still be the same because of memory bandwidth, right?
> 👤 **ikawrakow** replied the **2025-06-13** at **18:56:53**:<br>
> Sub 4-bit i-quants use codebooks. `IQ4_XS` and `IQ4_NL`, which were added along with the codebook i-quants `IQ2_XXS, IQ2_XS, IQ2_S, IQ3_XXS, IQ3_S`, do not use a codebook, but a non-linear mapping for individual quants. They are both 4-bit, so the lookup table has just 16 entries, and the lookup adds negligible overhead.
>
> The `IQX_K` quants also don't use a codebook. In fact, one of the main motivations to create them was to prove to myself that there is nothing special about codebooks. The main difference between `IQX_K` quants and `IQ4_XS/IQ4_NL` is in the use of an extra bit that selects between two lookup tables. `IQ4_KS`, which uses the exact same amount of bits per model weight as `IQ4_XS` (4.25), arrives at a lower quantization error than `IQ4_XS` that way. There are now the following `IQX_K` quants
> * `IQ2_KS` - blocks of 32 weights with a per tensor row scale. Lookup table is 2x4 entries, 2.1875 bpw
> * `IQ2_K` - blocks of 16 weights in super-blocks of 256. Lookup table is 2x4 entries, 2.375 bpw
> * `IQ3_K` - blocks of 16 weights in super-blocks of 256. Lookup table is 2x8 entries, 3.4375 bpw
> * `IQ4_KS` - blocks of 32 weights with a per tensor row scale. Lookup table is 2x16 entries, 4.25 bpw
> * `IQ4_K` - blocks of 16 weights in super-blocks of 256. Lookup table is 2x16 entries, 4.5 bpw
> * `IQ5_KS` - blocks of 32 weights with a per tensor row scale. Lookup table is 2x32 entries, 5.25 bpw
> * `IQ5_K` - blocks of 16 weights in super-blocks of 256. Lookup table is 2x32 entries, 5.5 bpw
> * `IQ6_K` - blocks of 16 weights in super-blocks of 256. Lookup table is 2x64 entries, 6.5 bpw
>
> The sub-4 bpw `IQX_K` quants are much faster on the CPU than the corresponding i-quants and about on par with k-quants. On CUDA performance is more influenced by the block size than it is by the additional lookup required. If we take `IQ4_KS` as an example, it is faster than `Q4_0` (the quant that receives the largest amount of attention and love in mainline `llama.cpp`) for token generation, and only 3-4% slower for prompt processing. On the other hand, the quants that use blocks of 16 tend to be 20-25% slower for prompt processing than quants with blocks of 32 (due to me re-using the GEMM kernel that came from Johannes, and the block of 16 kernel not being as good as the block of 32 kernel). Token generation is memory bound, so speed is entirely determined by bpw, and none of the packing details or lookup tables matters that much.
>
> Hope this answers your questions.
>
> 👤 **afsara-ben** replied the **2025-06-13** at **20:51:18**:<br>
> thanks for your reply. What is the non-linear function that results in the lookup grid being smaller? Since it fits into 1-2 SIMD registers, is the number of load requests lower than what would be required for a codebook? Additionally, will there be a Metal implementation of the `IQX_K` quants?
>
> 👤 **ikawrakow** replied the **2025-06-14** at **03:03:48**:<br>
> Codebooks are for a group of quants, so much larger. Depending on quantization type the codebooks are between 256 and 2048 entries.
>
> The non-linear function is a 3rd order polynomial. But since it acts on the quantized values it can only take a limited number of different values (4 for 2 bits, 8 for 3 bits, etc). These values can be rounded to the nearest 8-bit integer and put in a lookup table.
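>
> For illustration only, such a table could be generated along the lines of the sketch below; the cubic coefficients here are made-up placeholders (chosen just to give a plausible, monotonic int8 spread), not the ones actually used in `ik_llama.cpp`:
>
> ```cpp
> // Hypothetical sketch: evaluate a cubic on the 16 possible 4-bit quant values
> // (centered around the middle of the range) and round to the nearest int8.
> #include <cmath>
> #include <cstdio>
>
> int main() {
>     const int k = 16;                               // 4-bit quant -> 16 table entries
>     const double c3 = 0.1, c1 = 10.4, c0 = -7.0;    // placeholder coefficients
>     signed char table[16];
>     for (int q = 0; q < k; ++q) {
>         const double t = q - 7.5;                   // center the quant index
>         const double y = c3*t*t*t + c1*t + c0;      // 3rd order polynomial
>         table[q] = (signed char)std::lround(y);
>         printf("%2d -> %4d\n", q, (int)table[q]);
>     }
>     return 0;
> }
> ```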
>
> There is already a metal implementation for `IQX_K` quants. But since the Apple GPU is very low-end, performance is somewhat lower when I test on my M2-Max. The Metal back-end is not as well maintained as CPU and CUDA in `ik_llama.cpp`, so some of the advanced optimizations are not implemented there.
>
> 👤 **afsara-ben** replied the **2025-06-17** at **23:29:17**:<br>
> thanks for the reply. if it's not too much hassle, can you elaborate further on how the kgrid matrices in the original IQ quants (PR [#4773](https://github.com/ggml-org/llama.cpp/pull/4773)) were generated? I wanted to generate my own kgrid matrices, so I was wondering if there's a script that we can play with?
---
👤 **ikawrakow** replied the **2025-06-21** at **14:15:54**:<br>
@zhouwg
Nice to meet you too.
I don't think I want to get involved with your dispute with the `llama.cpp` maintainers or discuss my reasons for leaving the `llama.cpp` project.
Concerning a port of the `iqk` GEMM/GEMV implementation to Qualcomm Hexagon cDSP: you are obviously free to make a port, and I can try to help as time permits. But be warned: adding this port to your ongoing PR will reduce its chance of getting accepted to zero.
> 👤 **ikawrakow** replied the **2025-06-22** at **13:52:00**:<br>
> You are likely not building the project correctly. `ik_llama.cpp` is fast, but not 6 times faster than `llama.cpp` for `Q4_0`. What happens if you rebase on the latest main branch and run?
>
> 👤 **ikawrakow** replied the **2025-06-22** at **14:42:43**:<br>
> So, why is the output correct now, but was gibberish before?
>
> 👤 **ikawrakow** replied the **2025-06-22** at **14:52:22**:<br>
> But is correct with `-march=armv8.7-a+dotprod+fp16` ? And then PP-512 is 10 times faster than `llama.cpp`?
>
> 👤 **ikawrakow** replied the **2025-06-22** at **15:02:12**:<br>
> What does `main_gpu=4` mean in the `llama.cpp` run?
View File
@@ -0,0 +1,117 @@
### 🗣️ [#82](https://github.com/ikawrakow/ik_llama.cpp/discussions/82) - 4bpw GGML TYPE?
| **Author** | `Nexesenex` |
| :--- | :--- |
| **Created** | 2024-10-07 |
| **Updated** | 2024-10-17 |
---
#### Description
Hey IK,
It's been a while since you forked, and I wondered if you'd be willing to PR something close to a 4 bpw (3.8125-4.0625?, I don't know) ggml type on LlamaCPP, if you have one viable in store. The gap between IQ3_S and IQ4_XS is huge, and there are some reported problems with IQ3_S and IQ3_XXS, which can screw up hybrid IQ4_XS-based quants where attn_q and attn_output (or some layers of ffn gate and up) are passed in IQ3_S to fit in some VRAM configs.
Maybe with Johannes Gaessler's goodwill, it would make full offload of the 123b models viable on 64GB VRAM, and the 70b models viable on 36GB VRAM.
More broadly, your work is sorely missing on LCPP.
Cheers!
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-10-08** at **05:17:48**:<br>
Hey Nexes the Old, did you try `IQ3_K` and `IQ4_K`? I think a mix of these two will give you what you want, and it will be better than what you could do with i-quants in `llama.cpp`.
---
👤 **ikawrakow** replied the **2024-10-08** at **10:55:49**:<br>
@Nexesenex
Here is an example - LLaMA-3.1-8B-Instruct. We look at `PPL(Q)/PPL(fp16)-1` for a context of 2048 (but note that the `PPL(Q)/PPL(fp16)` ratio is almost independent of context length). First a graph with all quants, including the new `IQX_K` quants in cyan, using a logarithmic y-axis to get the big picture. The two magenta circles that sit around 4 bpw are mixes between `IQ3_K/IQ4_K/IQ5_K/IQ4_XS`. To me it looks like they are pretty much on the quantization error vs model size Pareto front that we can get from i-, k-, and iqk-quants (and i- and iqk-quants are pretty much as good as it gets without additional fine tuning).
![il31_8B](https://github.com/user-attachments/assets/4127966d-0d3d-4ee3-926c-c9eaa18461f1)
Then a zoomed-in graph in the bpw area of interest with a linear y-axis.
![il31_8B_nesenex](https://github.com/user-attachments/assets/cbdd834b-bd66-47e0-aa9e-6e17f82286d4)
The two magenta mixes are at 4.0 and 4.09 bpw. These are bpw that include token embedding and output tensor. The token embedding tensor is quantized with `IQ3_K`, the output tensor `output.weight` with `Q6_K`. In the case of LLaMA-3.1 with its 128k vocabulary `output.weight` is quite large, and hence increases the effective bpw by 0.167 bpw (compared to it being ignored, as quantization literature tends to do, or it being quantized with 4 bpw). Hence, for a larger model where the output tensor represents a much smaller fraction of the overall model size, these mixes will be sub-4 bpw. The smaller mix is composed as follows
* `output` - `Q6_K`
* `token_embd, attn_q, attn_k, ffn_gate` - `IQ3_K`
* `attn_v` - `IQ5_K`
* `attn_output` - `IQ4_K`
* `ffn_down, ffn_up` - half with `IQ3_K`, other half with `IQ4_K` (using function `use_more_bits(i_layer, n_layer)` to select `IQ4_K` vs `IQ3_K`)
The larger mix is as the above, but in addition uses
* `ffn_gate` - half with `IQ4_XS`, other half `IQ3_K`, again using `use_more_bits(i_layer, n_layer)`
I can add one of these. Let me know if you prefer the smaller or the larger one.
---
👤 **ikawrakow** replied the **2024-10-09** at **09:54:18**:<br>
See #83
---
👤 **Nexesenex** replied the **2024-10-09** at **14:58:25**:<br>
Hey IK,
I was about to answer you, but of course, you made some magic happen already.
Fantastic work, as always. A new SOTA 4.25BPW GGML_TYPE quant is a huge boost. Can it be integrated into the official LlamaCPP by moving the relevant sections of your ik files into the traditional equivalents in LCPP official?
As for quant mixes, on LCPP official, I passed attn_v in Q6_K and attn_k in Q5_K for my >IQ3_M and IQ4_XS mixes when the vocab is above 128000. The ppl usually drops by more than 0.01, and I suspect it might help other indicators even more; for 180MB on Llama 3 70b and later, that's a good trade.
I also generally beef up the first and last layers' attn_k, attn_q, and ffns to the higher quant in all cases, because they are either the closest to the embeddings (as you were already doing on several quant mixes), or the last ones before the final output.
I use an IQ3_XXL mix equivalent to your IQ3_KL. On top of a bumped ffn_down, I'll bump ffn_up more than ffn_gate to see if it brings a bonus compared to equalizing them. I used several variants of your more_bits function to achieve steps of 12.5% of layers quantized to the higher quant according to my needs.
What I was wondering about is an LCPP-official-mergeable IQ4_XXS / IQ4_K_"XXS" GGML type (tensor level quant), at 4-4.0625bpw, if such a thing is possible and viable compared to an IQ3/IQ4 mix, to get rid of the IQ3_S I'm using, because on some models it is worse than Q3_K (Miqu attn_q and attn_output, for example; I observed some discrepancy on Qwen2 72b as well).
I speak about LCPP official because I was.. unable to compile IK_Llama on MSVS, and I need official as the base for my fork of KoboldCPP, the inference software I modified and use with everything; rebasing it on your IK_Llama while I can't even compile it seems unviable to me. Moreover, I do not know your personal objectives or relations with the LCPP official project, but broad compatibility for your quants would allow people to.. use them, and not waste compute, energy, and time on non-SOTA quants for their models.
---
👤 **ikawrakow** replied the **2024-10-09** at **16:23:12**:<br>
> Can it be integrated in the official LlamaCPP...
The license is MIT, so obviously it can be integrated into mainline `llama.cpp`. Will I do it? Of course not.
> I speak about LCPP official, because I was.. unable to compile IK_Llama on MSVS, and I need official as the base for my fork of KoboldCPP, the inference software I modified and use with everything, rebasing it on your IK LLama while I can't even compile it seems unviable to me.
You could have opened an issue, no? With the output of the build process. I don't have access to a Windows box and Windows is certainly not my priority, but sometimes one can fix it just from the compiler error messages.
> Moreover, I do not know your personal objectives nor relations with the LCPP official project, but a broad compatibility for your quants would allow people to.. use them, and not waste compute, energy, and time on non-SOTA quants for their models.
My personal objective is to have fun :smiley:
Quants are kind of orphaned in mainline and have become a "commodity", with tons of low quality quantized models being distributed on HuggingFace as GGUFs. Hence, people interested in (high quality) quantization work are better off here than mainline. Or people running on the CPU. Or people using models that run much faster here than in mainline also on the GPU (e.g., Gemma), etc. I do sync with mainline from time to time, but I did not see anything worth merging since I last synced in August. Am I missing something from mainline that you find essential?
> I use an equivalent IQ3_XXL mix to your IQ3_KL. on the top of a bumped ffn_down, I'll bump ffn_up more than ffn_gate to see if it brings a bonus compared to equalizing them, I used several variants of your more_bits function to achieve steps of 12.5% layers quantized to the higher quant accordingly to my needs.
Sure, one can spend a lot of time experimenting. I see your PR 8917 in mainline has not been merged. As I believe that having a more flexible and convenient way to specify quantization mixes is definitely worth having, your PR is likely to be more successful here than there.
---
👤 **Nexesenex** replied the **2024-10-17** at **04:04:29**:<br>
I submitted my PR 8917 here, as invited to.
As for mainline, there's nothing essential for me since August, aside from maintaining some sort of compatibility with KCPP so I can attempt a rebase on your fork without breaking my head too hard, even if that might still be too hard. :D
A PR maybe worth testing is this one, with several percents boost in PP & TG on my side on Cuda : https://github.com/ggerganov/llama.cpp/pull/8366
For the compile problem, I could have opened an issue but I was a bit discouraged by the idea that I could not even use your quants for my use (KoboldCPP + ST, I look at Lollms with curiosity also). My bad, but a white knight came to fix that a day before a lovely IQ4_KSS appeared, so here I am, llama-server + ST it is for now.
As for the beef with mainline, well, I really regret that the quality and speed of inference dropped maybe a bit low on the priority list. It already seemed to be the case when Johannes Gaessler developed the first 8-bit KV quant in late 2023. Anyway, I'm glad you keep having fun by blowing up the charts. Your work is really phenomenal, and I wish that your quants became the new baseline of the GGUF side of Hugging Face.
But where would be the fun in that? :X
View File
@@ -0,0 +1,217 @@
### 🗣️ [#95](https://github.com/ikawrakow/ik_llama.cpp/discussions/95) - Bitnet
| **Author** | `ikawrakow` |
| :--- | :--- |
| **Created** | 2024-10-19 |
| **Updated** | 2025-04-22 |
---
#### Description
A Microsoft team has released [CPU inference code](https://github.com/microsoft/BitNet) for 1.58-bit Bitnets. The repo, based 100% on `llama.cpp`, and only adding Bitnet CPU kernels (`ARM_NEON, AVX2`) has 2.1k stars as of this writing. As per @Dampfinchen ["this is just insanity"](https://github.com/ggerganov/llama.cpp/discussions/9945).
Well, here we have had Bitnet inference for a while. For CPU and GPU. Faster than Microsoft's by quite some margin.
There is a screen recording in their repo demoing the 3.3B Bitnet model writing a 900 token essay and achieving 71 t/s on **M2 Ultra**. Here is a screen recording from my **M2-Max laptop** (~1/2 the computing power and memory bandwidth of M2 Ultra) getting 74 t/s on the same prompt.
https://github.com/user-attachments/assets/889090a2-4c09-4392-99d6-31a76cf54dc1
And here it is running on the M2-Max 30-core GPU
https://github.com/user-attachments/assets/4c08fa07-177a-4462-b4d8-9ce512733fb3
Finally, here running on RTX-4080
https://github.com/user-attachments/assets/e240fd80-9747-470f-8282-3f53bfacff4b
The prompt is very short (9 tokens), but it is still worth noting that Microsoft's implementation processes the prompt at a rate of 85 t/s, while here we get 157 t/s with half the computing power.
---
#### 🗣️ Discussion
👤 **ikawrakow** replied the **2024-10-19** at **08:44:58**:<br>
I was curious to see Microsoft's Bitnet performance on `X86_64`. So, cloned their repo and followed the setup instructions. The setup script downloaded the `fp32` Bitnet-1.58-3B version, so 13.2 GB instead of 6.6. It also demands `clang-18`, so I had to install that first (even though `llama.cpp` definitely does not require `clang`, and even less `clang-18` to be built, and at a quick glance neither do the added ternary kernels). Their "end-to-end" test script `e2e_benchmark.py` does not do much more than just run the familiar `llama-bench`. Here is what I get on my Ryzen-7950X CPU
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| bitnet 3B I2_S - 2 bpw ternary | 873.66 MiB | 3.32 B | CPU | 16 | pp512 | 28.19 ± 0.12 |
| bitnet 3B I2_S - 2 bpw ternary | 873.66 MiB | 3.32 B | CPU | 16 | tg128 | 20.84 ± 0.03 |
The script warns that this is a debug build, but going to the `build` folder and checking shows that, nope, it is a release build. 28 t/s for PP-512 on a 3B ternary model? Hahaha.
Here is what I get with this repo:
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| bitnet 3B IQ2_BN - 2.00 bpw Bitnet | 977.42 MiB | 3.43 B | CPU | 16 | pp512 | 620.63 ± 3.16 |
| bitnet 3B IQ2_BN - 2.00 bpw Bitnet | 977.42 MiB | 3.43 B | CPU | 4 | tg128 | 56.27 ± 0.27 |
22X (!!!) difference in prompt processing speed. 2.8X difference in token generation (TG) speed. TG is memory bound, so let's check what we get with just 1 thread. First theirs (be patient if you try it):
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| bitnet 3B I2_S - 2 bpw ternary | 873.66 MiB | 3.32 B | CPU | 1 | tg128 | 2.01 ± 0.01 |
Then ours
| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| bitnet 3B IQ2_BN - 2.00 bpw Bitnet | 977.42 MiB | 3.43 B | CPU | 1 | tg128 | 25.72 ± 0.11 |
Aha. 12.8X.
Perhaps they did not turn on `AVX2/AVX512` while building? Let's try this
```
python run_inference.py -m models/bitnet_b1_58-3B/ggml-model-i2_s.gguf -p "I believe the meaning of life is" -t 16
...
system_info: n_threads = 16 (n_threads_batch = 16) / 32 | AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | AVX512_BF16 = 1 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |
sampler seed: 2909124194
sampler params:
repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.800
mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> top-k -> tail-free -> typical -> top-p -> min-p -> temp-ext -> softmax -> dist
generate: n_ctx = 2048, n_batch = 1, n_predict = 128, n_keep = 1
I believe the meaning of life is . really, ... ... ... ... "..., or. ... what a...... ... ... ... just a by we or close... ar is is it is (... m ... is o to _ more _ _ full _ k _ _ good
_ _ ( _ R _ ) P P _ and the a, the * P R
B F F ( F F F F B V V
Com Im Str
American T
,
ter “ ! M M B P IN IN S P P P O PA PA V ST IN AS B BE PA EHER B BTER B B PA
llama_perf_sampler_print: sampling time = 15.96 ms / 136 runs ( 0.12 ms per token, 8521.84 tokens per second)
llama_perf_context_print: load time = 390.49 ms
llama_perf_context_print: prompt eval time = 380.52 ms / 8 tokens ( 47.56 ms per token, 21.02 tokens per second)
llama_perf_context_print: eval time = 6114.10 ms / 127 runs ( 48.14 ms per token, 20.77 tokens per second)
llama_perf_context_print: total time = 6530.61 ms / 135 tokens
```
Oops. `AVX2` and `AVX512` are both on, and we get gibberish.
Perhaps `clang` is mis-compiling the code? Or maybe something went wrong with the `clang-18` installation? Let's try `GCC`.
```
mkdir build1 && cd build1
cmake ..
-- The C compiler identification is GNU 11.4.0
-- The CXX compiler identification is GNU 11.4.0
-- Detecting C compiler ABI info
-- Detecting C compiler ABI info - done
-- Check for working C compiler: /usr/bin/cc - skipped
-- Detecting C compile features
-- Detecting C compile features - done
-- Detecting CXX compiler ABI info
-- Detecting CXX compiler ABI info - done
-- Check for working CXX compiler: /usr/bin/c++ - skipped
-- Detecting CXX compile features
-- Detecting CXX compile features - done
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD
-- Performing Test CMAKE_HAVE_LIBC_PTHREAD - Success
-- Found Threads: TRUE
CMake Error at src/CMakeLists.txt:9 (message):
Clang is required for Bitnet.cpp compilation
-- Configuring incomplete, errors occurred!
```
Arghh. Comment out the `clang` check in `src/CMakeLists.txt` and retry. Now it builds successfully after
```
cmake ..
make -j
```
Running `llama-cli` gives much better performance - 52 t/s - but still gibberish output. PP-512 is also much better - 300 t/s. That's what I would expect from a run-of-the-mill `AVX2/AVX512` implementation. Still very far from being competitive.
---
👤 **ikawrakow** replied the **2024-10-19** at **15:19:26**:<br>
OK, here is apples-to-apples performance comparison on my M2-Max laptop between Microsoft's `I2_S` and `IQ2_BN` here. I used their `generate-dummy-bitnet-model.py` tool to generate fake Bitnet models of different sizes and ran `llama-bench`. Did not go beyond 30B because generating the 30B model almost exhausted my patience. Their code crashes with segmentation fault on PP-512 tests, so just TG-128.
| Model | t/s (MS I2_S) | t/s (IQ2_BN) | Speedup |
| ----- | ------------: | -------------: | ------: |
| 125M | 639.39 ± 10.74 | 947.67 ± 34.86 | 1.482 |
| 350M | 286.92 ± 1.35 | 426.03 ± 6.64 | 1.485 |
| 1B | 144.62 ± 3.96 | 225.76 ± 7.70 | 1.561 |
| 1.5B | 120.12 ± 1.31 | 170.55 ± 8.35 | 1.420 |
| 2.7B | 84.25 ± 0.43 | 115.52 ± 3.13 | 1.371 |
| 3.8B | 64.74 ± 0.22 | 86.58 ± 2.83 | 1.337 |
| 7B | 39.14 ± 0.67 | 51.37 ± 0.82 | 1.312 |
| 13B | 24.04 ± 0.03 | 30.21 ± 0.18 | 1.257 |
| 30B | 11.22 ± 0.05 | 13.57 ± 0.03 | 1.209 |
The difference in performance decreases with model size, but that's just a matter of memory bandwidth saturation for `IQ2_BN`. The 30B model is 7.45 GiB, so at 13.6 t/s this is 101 GiB/s to fetch the model weights from RAM, which is basically as good as it gets on the M2-Max CPU.
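For reference, the 101 GiB/s figure above is simply model size multiplied by token rate, since every generated token has to stream the full set of weights from RAM. A minimal sketch of that arithmetic (the 7.45 GiB size and the 13.57 t/s rate are taken from the comment and table above; nothing is measured here):
```python
# Back-of-the-envelope effective memory bandwidth for token generation (TG).
# Assumption: each generated token reads the full model weights from RAM once.
model_size_gib = 7.45     # 30B dummy Bitnet model quantized as IQ2_BN (see above)
tg_tokens_per_s = 13.57   # TG-128 result for IQ2_BN from the table above

effective_bw_gib_s = model_size_gib * tg_tokens_per_s
print(f"effective bandwidth ~ {effective_bw_gib_s:.0f} GiB/s")  # ~101 GiB/s
```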
> 👤 **saood06** replied the **2025-04-22** at **08:05:03**:<br>
> Interesting to see that the TG number here for 2.7B (115.52 t/s) is double the performance you got for bitnet2b_2501 (62.33 t/s), which has 2.741B parameters. Do you know what makes the different architecture twice as slow?
>
> 👤 **ikawrakow** replied the **2025-04-22** at **08:19:46**:<br>
> This is running on my M2-Max laptop. The M2 has 400 GB/s memory bandwidth. Unfortunately only about 100 GB/s are given to the CPU, the other 300 GB/s are reserved for the GPU (but there are model/quant combinations where I can get up to 110-115 GB/s running CPU-only). As a result the M2-Max has a much better TG performance than a consumer level `x86_64` CPU - nearly twice the TG performance of the Ryzen-7950X. Another interesting thing about the M2-Max is that the silicon spent on the GPU is basically a waste. If it had been spent to double the number of CPU cores, and all of the 400 GB/s had been given to the CPU, that hypothetical CPU would be wiping the floor with the Apple GPU (well, at least for TG, PP would be still 2X lower than the GPU).
>
> 👤 **saood06** replied the **2025-04-22** at **08:31:01**:<br>
> >This is running on my M2-Max laptop.
>
> Sorry, I skipped over that when looking back at this thread.
>
> 👤 **saood06** replied the **2025-04-22** at **08:42:18**:<br>
> > This is running on my M2-Max laptop. The M2 has 400 GB/s memory bandwidth. Unfortunately only about 100 GB/s are given to the CPU, the other 300 GB/s are reserved for the GPU (but there are model/quant combinations where I can get up to 110-115 GB/s running CPU-only). As a result the M2-Max has a much better TG performance than a consumer level `x86_64` CPU - nearly twice the TG performance of the Ryzen-7950X. Another interesting thing about the M2-Max is that the silicon spent on the GPU is basically a waste. If it had been spent to double the number of CPU cores, and all of the 400 GB/s had been given to the CPU, that hypothetical CPU would be wiping the floor with the Apple GPU (well, at least for TG, PP would be still 2X lower than the GPU).
>
> Hmm, I know this is for the M1-Max, but this https://www.anandtech.com/show/17024/apple-m1-max-performance-review/2 goes over the memory bandwidth situation in a lot of depth.
>
> I'm surprised you tap out at 115 GB/s given what is shown in the linked article.
>
> The silicon design of the Apple chips has always been interesting to me, I've been following it since the early designs from the iPhone.
>
> 👤 **ikawrakow** replied the **2025-04-22** at **09:24:20**:<br>
> The article is about the M1 chips? Yes, I have seen benchmarks such as this article. But we are not interested in shoving some data from here to there (which the benchmark does). We are interested in getting some data to the CPU and actually doing something with it. Here the M2-Max CPU maxes out at 110-115 GB/s, being around 100 GB/s most of the time. For PP I get about 2 TFLOPS out of the M2-Max CPU, so that's 250 GB/s of multiply-add processing power (fused multiply-add counting as 2 ops and needing 4 bytes of data per op), so processing power is not what limits us to ~100 GB/s in TG (a rough numeric check of this is sketched below, after this thread).
>
> 👤 **saood06** replied the **2025-04-22** at **09:38:31**:<br>
> >Here the M2-Max CPU maxes out at 110-115 GB/s, being around 100 GB/s most of the time.
>
> This shows something similar.
>
> ![a901d026-a1f1-4da4-a410-16c507517571_1256x585](https://github.com/user-attachments/assets/50765a5e-5b5d-4bcf-9aa8-60d4b25bbeff) from https://old.chipsandcheese.com/2023/10/31/a-brief-look-at-apples-m2-pro-igpu/
>
> This article shows the GPU capping out at around 200 GB/s, though, as the article is more focused on the GPU.
>
> ![cf2abde5-a4cc-4638-8380-f45cf13c2bc7_1005x497](https://github.com/user-attachments/assets/df0857d8-cbc0-4cc1-9564-9cf4e35eefbb)
>
> It is a rather impressive chip.
>
> 👤 **ikawrakow** replied the **2025-04-22** at **10:35:47**:<br>
> Yes, it is. I wish AMD/Intel would finally follow suit and give their consumer-level chips more memory bandwidth.
>
> 👤 **saood06** replied the **2025-04-22** at **10:53:44**:<br>
> The cores are also a lot wider: Intel/AMD were stuck on 4-wide for so long, while Apple is at 9-wide.
>
> ![image](https://github.com/user-attachments/assets/fa2b157a-365f-4cc7-9ab3-226f65f4c6fb)
>
> Golden Cove from Intel, shown below, is 6-wide.
>
> ![3036f76f-f8e9-476b-8bd7-f3be4aadbc88_768x622](https://github.com/user-attachments/assets/8a0583c8-4ced-4669-9ac2-73d777374b6c)
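A rough roofline-style check of the compute-vs.-bandwidth point made in this thread (a sketch, not a measurement: the ~2 TFLOPS and ~100 GiB/s figures are the numbers quoted in the comments above, and the per-token FLOP count assumes one multiply-add per model weight):
```python
# Why TG is memory-bound on the M2-Max CPU: compare per-token compute time with the
# time needed just to stream the weights, for the 30B IQ2_BN model from the table above.
params = 30e9                    # model parameters; one multiply-add each per generated token
flops_per_token = 2 * params     # an FMA counts as 2 FLOPs
compute_rate = 2e12              # ~2 TFLOPS, the PP figure quoted above (optimistic for TG)
mem_bandwidth_gib_s = 100        # ~100 GiB/s CPU-only bandwidth quoted above
model_size_gib = 7.45            # size of the IQ2_BN weights

compute_ms = flops_per_token / compute_rate * 1e3          # ~30 ms/token of arithmetic
memory_ms = model_size_gib / mem_bandwidth_gib_s * 1e3     # ~75 ms/token to stream weights
print(f"compute ~ {compute_ms:.0f} ms/token, memory ~ {memory_ms:.0f} ms/token")
# Memory traffic dominates, and ~75 ms/token matches the ~13.6 t/s observed for the 30B model.
```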
---
👤 **saood06** replied the **2025-04-15** at **14:27:18**:<br>
They updated the repo with the first official model (all previous models were just supported models and had far less training): https://huggingface.co/microsoft/bitnet-b1.58-2B-4T. It looks competitive at its size, as it was trained on 4T tokens.
> 👤 **ikawrakow** replied the **2025-04-15** at **15:22:22**:<br>
> Good to know. But has something changed since the preliminary models were published (i.e., do I need to make changes to the Bitnet implementation)?
>
> 👤 **saood06** replied the **2025-04-15** at **15:27:41**:<br>
> I don't think so. They published the i2_s GGUF [here](https://huggingface.co/microsoft/bitnet-b1.58-2B-4T-gguf/tree/main), which you already did the work of supporting (converting it to a type from this repo) in #169.
>
> 👤 **saood06** replied the **2025-04-20** at **14:24:15**:<br>
> I think I was wrong, [this](https://github.com/microsoft/BitNet/pull/167) adds the new architecture, seems simple enough to port though (might be interesting to test on Android).