🔀 #443 - Streamline a bit the quant strategies
| Author | Nexesenex |
|---|---|
| State | ❌ Closed |
| Created | 2025-05-22 |
| Updated | 2025-05-22 |
Description
Unlike last time...
No change to the existing patterns, except for a bump of attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA). The rest is applying the existing patterns to the new IQ_K quants. Also, a Q8_0 for attn_q had slipped into the 8-expert MoE rule; I removed it, because that tensor is much bigger than attn_k or attn_v.
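For readers who do not have the quant-strategy code in front of them, here is a minimal sketch of the kind of per-tensor rule this refers to; the function name, thresholds and chosen types are illustrative assumptions, not the actual `llama_tensor_get_type` logic in src/llama.cpp.

```cpp
// Minimal sketch, not the actual src/llama.cpp code: names, thresholds and
// types are illustrative only.
#include <cstdint>
#include <string>

enum class qtype { Q4_K, Q5_K, Q6_K, Q8_0 };

// Bump the small, GQA-shared attn_k / attn_v projections for MoE models with
// 4 to 8 experts, but leave attn_q at the base type: under GQA it is several
// times larger than attn_k / attn_v, so a Q8_0 bump there costs far more bits
// than it is worth.
static qtype pick_attn_type(const std::string & name, uint32_t n_expert, qtype base) {
    const bool moe_4_to_8 = n_expert >= 4 && n_expert <= 8;  // the "<= 8" cap discussed below
    if (moe_4_to_8 && (name.find("attn_k") != std::string::npos ||
                       name.find("attn_v") != std::string::npos)) {
        return qtype::Q6_K;  // illustrative bump for the K/V projections
    }
    return base;             // attn_q and everything else keep the base type
}
```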
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
💬 Conversation
👤 ikawrakow commented during a code review on 2025-05-22 at 06:46:59 on src/llama.cpp:
Why do we want to limit to <= 8 experts?
👤 ikawrakow commented during a code review on 2025-05-22 at 06:48:18 on src/llama.cpp:
Why limit to <= 8 experts?
👤 ikawrakow commented during a code review on 2025-05-22 at 06:54:53 on src/llama.cpp:
So, I see you added the condition for Q5_K_S just above, but I have forgotten why we want to have it. Can you remind me? I was wondering not too long ago why a model quantized with Q5_K_S ended up having less than 5.5 bpw (but didn't check). Why is the decision to reduce the number of bits dependent on the vocabulary size?
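Purely to picture the condition being asked about, a hypothetical shape of such a vocabulary-gated branch follows; the threshold and the reading of a large vocabulary as a rough proxy for Llama-3-family models are guesses based on the reply further down, not the actual code.

```cpp
#include <cstdint>

// Hypothetical only: under Q5_K_S, allow a lower-bit attn_q choice just for
// large-vocabulary models (a rough proxy for Llama-3-family models).
// The threshold is illustrative, not taken from the PR.
static bool reduce_attn_q_bits(uint32_t n_vocab, bool is_q5_k_s) {
    return is_q5_k_s && n_vocab >= 128000;
}
```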
👤 ikawrakow commented during a code review on 2025-05-22 at 06:55:55 on src/llama.cpp:
<= 8?
👤 ikawrakow submitted a review on 2025-05-22 at 06:58:25: 💬 COMMENTED
Looks OK apart from the <= 8 condition for MoE models. I don't think it is needed.
This may make it more convenient for some people, but I basically just use --custom-q these days.
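In code terms, dropping the questioned cap amounts to something like the following; the helper names are illustrative, not the actual condition in the PR.

```cpp
#include <cstdint>

// Illustrative only: the expert-count gate with and without the "<= 8" cap.
static bool bump_kv_with_cap(uint32_t n_expert) { return n_expert >= 4 && n_expert <= 8; }
static bool bump_kv_no_cap  (uint32_t n_expert) { return n_expert >= 4; }  // cap removed
```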
👤 Nexesenex commented during a code review on 2025-05-22 at 13:46:33 on src/llama.cpp:
Oh, I just did not want to step on bigger MoEs because I didn't test any. I left that to your discretion.
👤 Nexesenex submitted a review on 2025-05-22 at 13:46:34: 💬 COMMENTED
👤 Nexesenex submitted a review on 2025-05-22 at 13:48:21: 💬 COMMENTED
👤 Nexesenex commented during a code review on 2025-05-22 at 13:48:21 on src/llama.cpp:
I just did not want to step on bigger MoEs because I didn't test any. I left that to your discretion. But of course, if it's fine with you, we can remove that second condition.
👤 Nexesenex submitted a review on 2025-05-22 at 13:54:19: 💬 COMMENTED
👤 Nexesenex commented during a code review on 2025-05-22 at 13:54:19 on src/llama.cpp:
I added this back then because attn_q tolerates a smaller quant very well on Llama 3 models, with no perplexity bump, or even a drop of around 0.005 on L3 (and also on Mistral 123b models). I also observed this with IQ4_XS -> IQ3_S for attn_q. I take advantage of this to bump attn_v instead on L3, which is very sensitive to it. At the time, you agreed with the principle.
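A rough back-of-the-envelope check of why this trade can pay off on a Llama-3-8B-shaped layer (4096 embedding width, 32 query heads, 8 KV heads); the attn_v bump target (Q4_K -> Q6_K) and the nominal bits-per-weight figures are assumptions for illustration, not values taken from this PR.

```cpp
#include <cstdio>

int main() {
    // Llama-3-8B-shaped attention layer: n_embd = 4096, 32 query heads, 8 KV heads of size 128.
    const double q_params = 4096.0 * 4096.0;   // attn_q: n_embd x n_embd
    const double v_params = 4096.0 * 1024.0;   // attn_v: n_embd x (8 * 128)

    // Nominal bits per weight (approximate): IQ4_XS ~ 4.25, IQ3_S ~ 3.44, Q6_K ~ 6.56, Q4_K ~ 4.5.
    const double saved_on_q = q_params * (4.25 - 3.44) / 8.0;  // IQ4_XS -> IQ3_S on attn_q, in bytes
    const double spent_on_v = v_params * (6.56 - 4.50) / 8.0;  // Q4_K   -> Q6_K  on attn_v, in bytes

    printf("per layer: ~%.0f KiB saved on attn_q, ~%.0f KiB spent on attn_v\n",
           saved_on_q / 1024.0, spent_on_v / 1024.0);
    return 0;
}
```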
👤 Nexesenex submitted a review on 2025-05-22 at 13:54:45: 💬 COMMENTED
👤 Nexesenex commented during a code review on 2025-05-22 at 13:54:45 on src/llama.cpp:
Ok, I will remove this <= part!
👤 ikawrakow submitted a review on 2025-05-22 at 15:04:41: ✅ APPROVED