🔀 #443 - Streamline a bit the quant strategies
| Author | Nexesenex |
|---|---|
| State | ❌ Closed |
| Created | 2025-05-22 |
| Updated | 2025-05-22 |
Description
Unlike last time...
No change to the existing patterns, except for a bump of attn_k and attn_v for the models with 4 and 6 experts (several frankensteins seen on HF, which also use GQA). The rest is applying the existing patterns to the new IQ_K quants. Also, a Q8_0 for attn_q had slipped into the 8-expert MoE rule; I removed it, because that tensor is much bigger than attn_k or attn_v.
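For readers who do not have the quant-strategy code in front of them, here is a minimal sketch of the kind of per-tensor rule this refers to; the function name, thresholds and chosen types are illustrative assumptions, not the actual `llama_tensor_get_type` logic in src/llama.cpp.

```cpp
// Minimal sketch, not the actual src/llama.cpp code: names, thresholds and
// types are illustrative only.
#include <cstdint>
#include <string>

enum class qtype { Q4_K, Q5_K, Q6_K, Q8_0 };

// Bump the small, GQA-shared attn_k / attn_v projections for MoE models with
// 4 to 8 experts, but leave attn_q at the base type: under GQA it is several
// times larger than attn_k / attn_v, so a Q8_0 bump there costs far more bits
// than it is worth.
static qtype pick_attn_type(const std::string & name, uint32_t n_expert, qtype base) {
    const bool moe_4_to_8 = n_expert >= 4 && n_expert <= 8;  // the "<= 8" cap discussed below
    if (moe_4_to_8 && (name.find("attn_k") != std::string::npos ||
                       name.find("attn_v") != std::string::npos)) {
        return qtype::Q6_K;  // illustrative bump for the K/V projections
    }
    return base;             // attn_q and everything else keep the base type
}
```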
- I have read the contributing guidelines
- Self-reported review complexity:
  - Low
  - Medium
  - High
💬 Conversation
👤 ikawrakow commented during a code review on 2025-05-22 at 06:46:59 on src/llama.cpp:
Why do we want to limit to <= 8 experts?
👤 ikawrakow commented during a code review on 2025-05-22 at 06:48:18 on src/llama.cpp:
Why limit to <= 8 experts?
👤 ikawrakow commented during a code review on 2025-05-22 at 06:54:53 on src/llama.cpp:
So, I see you added the condition for Q5_K_S just above, but I have forgotten why we want to have it. Can you remind me? I was wondering not too long ago why a model quantized with Q5_K_S ended up having less than 5.5 bpw (but didn't check). Why is the decision to reduce the number of bits dependent on the vocabulary size?
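Purely to picture the condition being asked about, a hypothetical shape of such a vocabulary-gated branch follows; the threshold and the reading of a large vocabulary as a rough proxy for Llama-3-family models are guesses based on the reply further down, not the actual code.

```cpp
#include <cstdint>

// Hypothetical only: under Q5_K_S, allow a lower-bit attn_q choice just for
// large-vocabulary models (a rough proxy for Llama-3-family models).
// The threshold is illustrative, not taken from the PR.
static bool reduce_attn_q_bits(uint32_t n_vocab, bool is_q5_k_s) {
    return is_q5_k_s && n_vocab >= 128000;
}
```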
👤 ikawrakow commented during a code review on 2025-05-22 at 06:55:55 on src/llama.cpp:
<= 8?
👤 ikawrakow submitted a review on 2025-05-22 at 06:58:25: 💬 COMMENTED
Looks OK apart from the <= 8 condition for MoE models. I don't think it is needed.
This may make it more convenient for some people, but I basically just use --custom-q these days.
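In code terms, dropping the questioned cap amounts to something like the following; the helper names are illustrative, not the actual condition in the PR.

```cpp
#include <cstdint>

// Illustrative only: the expert-count gate with and without the "<= 8" cap.
static bool bump_kv_with_cap(uint32_t n_expert) { return n_expert >= 4 && n_expert <= 8; }
static bool bump_kv_no_cap  (uint32_t n_expert) { return n_expert >= 4; }  // cap removed
```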
👤 Nexesenex commented during a code review on 2025-05-22 at 13:46:33 on src/llama.cpp:
Oh, I just did not want to step on bigger MoEs because I didn't test any. I left that to your discretion.
👤 Nexesenex submitted a review on 2025-05-22 at 13:46:34: 💬 COMMENTED
👤 Nexesenex submitted a review on 2025-05-22 at 13:48:21: 💬 COMMENTED
👤 Nexesenex commented during a code review on 2025-05-22 at 13:48:21 on src/llama.cpp:
I just did not want to step on bigger MoEs because I didn't test any. I left that to your discretion. But of course, if it's fine with you, we can remove that second condition.
👤 Nexesenex submitted a review on 2025-05-22 at 13:54:19: 💬 COMMENTED
👤 Nexesenex commented during a code review on 2025-05-22 at 13:54:19 on src/llama.cpp:
I added this back then because attn_q tolerates a smaller quant very well on Llama 3 models, with no perplexity bump, or even a drop of around 0.005 on L3 (and also on Mistral 123b models). I also observed this with IQ4_XS -> IQ3_S for attn_q. I take advantage of this to bump attn_v instead on L3, which is very sensitive to it. At the time, you agreed with the principle.
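A rough back-of-the-envelope check of why this trade can pay off on a Llama-3-8B-shaped layer (4096 embedding width, 32 query heads, 8 KV heads); the attn_v bump target (Q4_K -> Q6_K) and the nominal bits-per-weight figures are assumptions for illustration, not values taken from this PR.

```cpp
#include <cstdio>

int main() {
    // Llama-3-8B-shaped attention layer: n_embd = 4096, 32 query heads, 8 KV heads of size 128.
    const double q_params = 4096.0 * 4096.0;   // attn_q: n_embd x n_embd
    const double v_params = 4096.0 * 1024.0;   // attn_v: n_embd x (8 * 128)

    // Nominal bits per weight (approximate): IQ4_XS ~ 4.25, IQ3_S ~ 3.44, Q6_K ~ 6.56, Q4_K ~ 4.5.
    const double saved_on_q = q_params * (4.25 - 3.44) / 8.0;  // IQ4_XS -> IQ3_S on attn_q, in bytes
    const double spent_on_v = v_params * (6.56 - 4.50) / 8.0;  // Q4_K   -> Q6_K  on attn_v, in bytes

    printf("per layer: ~%.0f KiB saved on attn_q, ~%.0f KiB spent on attn_v\n",
           saved_on_q / 1024.0, spent_on_v / 1024.0);
    return 0;
}
```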
👤 Nexesenex submitted a review on 2025-05-22 at 13:54:45: 💬 COMMENTED
👤 Nexesenex commented during a code review on 2025-05-22 at 13:54:45 on src/llama.cpp:
Ok, I will remove this <= part!
👤 ikawrakow submitted a review on 2025-05-22 at 15:04:41: ✅ APPROVED