mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-03-03 10:30:27 +00:00
5584 lines
478 KiB
Markdown
### 🔀 [#239](https://github.com/ikawrakow/ik_llama.cpp/pull/239) - SER - Smart Expert Reduction

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-03-01 |
| **Updated** | 2025-03-18 |

---

#### Description
The idea behind this PR is very simple: we define new parameters (specified via the command line) $K_{\rm min}$ and $t$. During inference experts are normally selected by sorting their computed probabilities $p_i$ in descending order and picking the top $K$ experts. We modify this expert selection algorithm by always selecting the top $K_{\rm min}$ experts ($K_{\rm min} < K$), and using experts between $K_{\rm min}$ and $K$ only if $p_i > t\cdot p_0$ (i.e., only if their probability $p_i$ relative to the top expert probability $p_0$ is greater than the specified threshold $t$). If we set $t = 0$, this expert selection modification is never invoked, so we have the behavior of the original model. If we set $t = 1$, we use a fixed number of experts $K_{\rm min}$ (the same can be achieved by using `--override-kv deepseek2.expert_used_count=int:Kmin` on the command line, but using `-ser Kmin,1` is clearly much easier to type and remember).
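In code, the selection rule reads roughly as follows. This is a minimal Python sketch of the rule described above; the function and variable names are illustrative, not the actual ik_llama.cpp implementation:

```python
# Sketch of the SER expert-selection rule (illustrative, not the repo code).
def select_experts(probs, k, k_min, t):
    """probs: per-expert routing probabilities for one token."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    top = order[:k]                # the standard top-K candidates
    p0 = probs[top[0]]             # probability of the best expert
    selected = top[:k_min]         # always keep the top K_min experts
    for i in top[k_min:]:          # experts ranked K_min..K-1 must pass
        if probs[i] > t * p0:      # the relative-probability threshold
            selected.append(i)
    return selected

probs = [0.30, 0.25, 0.15, 0.12, 0.10, 0.05, 0.02, 0.01]
print(select_experts(probs, k=6, k_min=4, t=0.4))  # → [0, 1, 2, 3]
```

Note that `t = 0` lets every top-K expert pass the threshold (plain top-K routing), while `t = 1` rejects all experts beyond the top `K_min`, exactly as described above.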

What is the purpose of this? We are hoping to gain performance without a significant loss of precision. Let's take a look at some data. The model is DeepSeek-Lite quantized with `IQ4_NL`. We measure accuracy loss (or error) via `PPL(SER)/PPL(full)-1`. I know some people don't like using perplexity. To each their own. In my book, perplexity is a perfectly fine way (not to say the best way) to measure accuracy loss due to some model approximation (quantization or, as here, selectively using fewer experts), as we are comparing to the base model and not to some other model. The following graph shows the error (as defined above) as a function of the threshold $t$ for $K_{\rm min}=$ 3, 4, and 5 (DeepSeek-Lite has 6 active experts specified).



We observe the expected sigmoid-like change of the error between the base at $t = 0$ (0.8% due to quantization) and the upper limit defined by always using exactly $K_{\rm min}$ experts. For $K_{\rm min} = 5$ there is barely any increase in the precision loss (1.36% at $t = 1$). For $K_{\rm min} = 3$ and 4 we see that we can keep the error in a more acceptable range if we use $t \lesssim 0.4$.

The best way to examine performance gains is to look at performance relative to base as a function of precision loss. The following graph shows the results for CUDA (RTX-4080). Black symbols are for processing a prompt of 2048 tokens (`pp2048`), red symbols are for token generation (`tg128`).


What are the magenta symbols? These are for a model quantized with `--pure` (i.e., all tensors are `IQ4_NL` except for the output tensor and the token embeddings). Without this option `llama-quantize` will use a mix of 5-, 6-, and even 8-bit quants for the attention tensors and shared experts of MoE models such as DeepSeek-Lite/V3/R1. In [this discussion](https://github.com/ikawrakow/ik_llama.cpp/pull/235#issuecomment-2689086533) @saood06 wrote that doing that is not a good idea as it leads to a significant performance penalty. This is of course true; using more bits always comes with a price in TG performance due to TG being memory bound. But typically one wants to pick the best balance between precision loss and performance. Based on the above plot, at least on CUDA, it is much better to use fewer experts than to be stingy with bits for the attention tensors. At the 1.6% quantization error of 4-bit attention tensors one can get a 12% TG performance boost with $K_{\rm min} = 4, t = 0.4$ using the default `IQ4_NL` quantization scheme, vs the 2.3% one gets with `--pure`.

But this is CUDA specific, so let's look at the same plot running on the CPU (Ryzen-7950X).


Here the magenta TG performance is more competitive with this PR, but it still cannot compete with just using 5 instead of 6 experts.

In summary: based on these results, using $K_{\rm min} = 4, t = 0.2$ or $K_{\rm min} = 5, t = 0.4$ looks like a very viable option to me. We get a noticeable TG performance gain of 5-7% without much reduction in model quality. It would be great if somebody could study the behavior of DeepSeek-V3/R1 with this PR. There we have slightly more room for expert reduction, from 8 to 5, 6, or 7.

I wonder if this (or something similar) is what they call "selectively using 6 experts" in the KTransformers repository. Does somebody know?
Almost forgot: to use this option, add

```
-ser Kmin,t or --smart-expert-reduction Kmin,t
```

to the command line.
**Caveat:** not implemented on Metal. The Metal back end has started to seriously fall behind, so at some point I need to take the time to add this and all other missing features.
---
#### 💬 Conversation
👤 **ikawrakow** commented the **2025-03-01** at **15:49:06**:<br>

Here is a graph of error versus performance gain for hybrid CPU/GPU inference (Ryzen-7950X/RTX-4080) for DeepSeek-Lite. Operations with MoE tensors are computed on the CPU, all others on the GPU.



Here the performance gains are much more significant. As the attention and shared-expert computation done on the GPU is much faster than the MoE calculation done on the CPU, we gain more by selectively reducing experts. If we just use 5 experts instead of 6, TG performance increases by nearly 20%, while the associated error is significantly less than that of using 4 bits for the attention layers.

---
👤 **davidsyoung** commented the **2025-03-01** at **16:25:50**:<br>

This looks very interesting - what would you recommend as the best way to test this with full CUDA offload for R1? If you have some harness to test PPL, that would be great.

---
👤 **ikawrakow** commented the **2025-03-01** at **17:11:55**:<br>
I typically use Wikitext2 `PPL`. There are many people out there who believe that this is not good, but I have also compared to C4 `PPL` (English and French) and, once you look at the ratio `PPL(approximate model)/PPL(full model)-1`, things do not depend that much on the specific test corpus. The same is also true for context length: even though PPL can change a lot with the context window used for evaluation, the ratio `PPL(approximate model)/PPL(full model)` is nearly independent of context length. One can also compute KL divergence (and many people think this is better than `PPL`), but that is much less convenient (one must first run a calculation with the full model and generate a huge data file, then run with the approximate model to get the KL divergence values), only to find out that the mean KL divergence correlates almost 100% with `log(PPL(approximate)/PPL(full))`. The same is true for HellaSwag, the other benchmark one can run with `llama.cpp`: the correlation coefficient between `HellaSwag(full) - HellaSwag(approximate)` and `PPL(approximate)/PPL(full)-1` tends to be over 90%, so it does not give much additional information (but takes way longer to compute than PPL). So, at the end, if you have settled on a model you want to use, comparing `PPL` with SER to `PPL` without will give a good indication of the performance degradation.
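For concreteness, here is the error metric from this discussion computed on made-up PPL values (a Python sketch; the numbers are illustrative, not measurements from the thread):

```python
import math

# Hypothetical perplexities for the full and the approximate model.
ppl_full, ppl_approx = 6.50, 6.59

rel_error = ppl_approx / ppl_full - 1        # PPL(approx)/PPL(full) - 1
log_ratio = math.log(ppl_approx / ppl_full)  # quantity that tracks mean KL divergence
print(f"relative error: {100 * rel_error:.2f}%")  # → relative error: 1.38%
print(f"log PPL ratio:  {log_ratio:.4f}")
```

For small errors the two quantities are nearly identical (`log(1+x) ≈ x`), which is one way to see why the mean KL divergence and the relative PPL error carry essentially the same information.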
It is of course also important to just use it and see if you think the quality of the responses is degraded. This is very subjective, but it will be you using it, so you must like it.
But with the 150-200 t/s you are getting for R1 it will not be easy to get a detailed evaluation. Each point in the graphs above takes less than 2 minutes to compute, so with a simple script it was all done in less than 1 hour. In your case, a full PPL calculation on Wikitext2 at an optimistic 200 t/s will take close to 30 minutes. I have seen people looking at just the first 10 or 20 batches. That is by far not enough, as results tend to change quite a bit after that. So I think it is important to carefully select the few full runs you want to do. I would first check 6 and 7 experts using `-ser 6,1` / `-ser 7,1`, see how much performance one gains and how much quality degrades, and then decide how to proceed.
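The "close to 30 minutes" figure follows from simple arithmetic, assuming the Wikitext2 test set is roughly 330K tokens (that size is my assumption, not stated in the thread):

```python
# Rough runtime estimate for a full Wikitext2 PPL run (assumed numbers).
wikitext2_tokens = 330_000   # assumed evaluation-set size
tokens_per_s = 200           # optimistic throughput from the comment
minutes = wikitext2_tokens / tokens_per_s / 60
print(f"~{minutes:.0f} minutes")
```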
---
👤 **davidsyoung** commented the **2025-03-01** at **17:25:56**:<br>
Okay, cool! I am going to first create my own quant somewhere around `i1-IQ3_XXS`, `i1-IQ3_XS`, or `i1-IQ3_S`. I'm downloading the full BF16 model right now, and then when I have the best fit of quants, I'll figure out how to run a PPL test... :) Thank you.
---
👤 **davidsyoung** commented the **2025-03-03** at **21:35:39**:<br>
@ikawrakow a little bit off topic but didn't know where better to ask.

I have downloaded the BF16 version, converted it to GGUF, and am now quantizing to `IQ3_S` with an imatrix from https://huggingface.co/mradermacher/DeepSeek-R1-GGUF using the following command:

```
./llama-quantize --imatrix /models/deepseek-config/imatrix.dat /storage/unsloth_DeepSeek-R1-BF16/unsloth_DeepSeek-R1-BF16-256x21B-F16-00001-of-00059.gguf /models/DeepSeek-R1-GGUF-IQ3_S.gguf IQ3_S
```
All seems to be going well, until I hit:
```
ggml_validate_row_data: found inf value at block 3405774848
llama_model_quantize: failed to quantize: tensor 'blk.40.ffn_down_exps.weight' has invalid data
main: failed to quantize model from '/storage/unsloth_DeepSeek-R1-BF16/unsloth_DeepSeek-R1-BF16-256x21B-F16-00001-of-00059.gguf'
```
Now I don't know if this is because of the imatrix, the changes for MLA with the quantize process, or a corrupted BF16 model file. I am currently re-checking the hash of the `BF16` model files to see if I downloaded a corrupt part.
Likely a corrupt part. But just wondering, is there anything I'm doing wrong here? I wasn't 100% sure if that's a correct quantize command, or something I'm missing.
TYVM
---
👤 **ikawrakow** commented the **2025-03-04** at **11:21:38**:<br>
Let me know if it works after you re-download the corrupt file. If it doesn't, then I would need to make the quantization more robust against missing imatrix data. DeepSeek-V3/R1 is tricky because only 8 out of 256 experts are activated per token, so for an imatrix calculation with a given amount of calibration data there will be 32X less data collected for the experts compared to a dense model. This may lead to missing/insufficient imatrix data, which may not be handled gracefully by the quantization functions.
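The "32X less data" figure follows directly from the routing ratio (a quick sanity check, not code from the repo):

```python
# Average imatrix data collected per routed expert vs a dense FFN tensor.
n_experts = 256   # routed experts in DeepSeek-V3/R1
n_active = 8      # experts activated per token
# Each routed expert sees, on average, n_active/n_experts of the tokens,
# so it collects this factor less calibration data than a dense tensor:
factor = n_experts / n_active
print(factor)  # → 32.0
```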
---
👤 **davidsyoung** commented the **2025-03-04** at **11:48:46**:<br>

I will! Reconverting to GGUF from BF16 takes a decent amount of time on HDDs compared to NVMe. Should be done around 6pm tonight, and I'll quantize soon after that! Thank you for all of the help and your work on improving inference with DS V3/R1 - it's excellent!

---
👤 **davidsyoung** commented the **2025-03-04** at **20:16:54**:<br>
@ikawrakow
Seemed to quantize fine, but got this on model load:
```
INFO [ main] build info | tid="23133942390784" timestamp=1741119264 build=0 commit="unknown"
INFO [ main] system info | tid="23133942390784" timestamp=1741119264 n_threads=64 n_threads_batch=-1 total_threads=128 system_info="AVX = 1 | AVX_VNNI = 0 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 | "
llama_model_loader: loaded meta data with 53 key-value pairs and 1147 tensors from /models/DeepSeek-R1-GGUF-IQ3_S.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv 0: general.architecture str = deepseek2
llama_model_loader: - kv 1: general.type str = model
llama_model_loader: - kv 2: general.name str = unsloth_DeepSeek R1 BF16
llama_model_loader: - kv 3: general.size_label str = 256x21B
llama_model_loader: - kv 4: general.license str = mit
llama_model_loader: - kv 5: general.base_model.count u32 = 1
llama_model_loader: - kv 6: general.base_model.0.name str = DeepSeek R1
llama_model_loader: - kv 7: general.base_model.0.organization str = Deepseek Ai
llama_model_loader: - kv 8: general.base_model.0.repo_url str = https://huggingface.co/deepseek-ai/De...
llama_model_loader: - kv 9: general.tags arr[str,3] = ["deepseek", "unsloth", "transformers"]
llama_model_loader: - kv 10: general.languages arr[str,1] = ["en"]
llama_model_loader: - kv 11: deepseek2.block_count u32 = 61
llama_model_loader: - kv 12: deepseek2.context_length u32 = 163840
llama_model_loader: - kv 13: deepseek2.embedding_length u32 = 7168
llama_model_loader: - kv 14: deepseek2.feed_forward_length u32 = 18432
llama_model_loader: - kv 15: deepseek2.attention.head_count u32 = 128
llama_model_loader: - kv 16: deepseek2.attention.head_count_kv u32 = 128
llama_model_loader: - kv 17: deepseek2.rope.freq_base f32 = 10000.000000
llama_model_loader: - kv 18: deepseek2.attention.layer_norm_rms_epsilon f32 = 0.000001
llama_model_loader: - kv 19: deepseek2.expert_used_count u32 = 8
llama_model_loader: - kv 20: general.file_type u32 = 26
llama_model_loader: - kv 21: deepseek2.leading_dense_block_count u32 = 3
llama_model_loader: - kv 22: deepseek2.vocab_size u32 = 129280
llama_model_loader: - kv 23: deepseek2.attention.q_lora_rank u32 = 1536
llama_model_loader: - kv 24: deepseek2.attention.kv_lora_rank u32 = 512
llama_model_loader: - kv 25: deepseek2.attention.key_length u32 = 192
llama_model_loader: - kv 26: deepseek2.attention.value_length u32 = 128
llama_model_loader: - kv 27: deepseek2.expert_feed_forward_length u32 = 2048
llama_model_loader: - kv 28: deepseek2.expert_count u32 = 256
llama_model_loader: - kv 29: deepseek2.expert_shared_count u32 = 1
llama_model_loader: - kv 30: deepseek2.expert_weights_scale f32 = 2.500000
llama_model_loader: - kv 31: deepseek2.expert_weights_norm bool = true
llama_model_loader: - kv 32: deepseek2.expert_gating_func u32 = 2
llama_model_loader: - kv 33: deepseek2.rope.dimension_count u32 = 64
llama_model_loader: - kv 34: deepseek2.rope.scaling.type str = yarn
llama_model_loader: - kv 35: deepseek2.rope.scaling.factor f32 = 40.000000
llama_model_loader: - kv 36: deepseek2.rope.scaling.original_context_length u32 = 4096
llama_model_loader: - kv 37: deepseek2.rope.scaling.yarn_log_multiplier f32 = 0.100000
llama_model_loader: - kv 38: tokenizer.ggml.model str = gpt2
llama_model_loader: - kv 39: tokenizer.ggml.pre str = deepseek-v3
llama_model_loader: - kv 40: tokenizer.ggml.tokens arr[str,129280] = ["<|begin▁of▁sentence|>", "<<3C>...
llama_model_loader: - kv 41: tokenizer.ggml.token_type arr[i32,129280] = [3, 3, 3, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv 42: tokenizer.ggml.bos_token_id u32 = 0
llama_model_loader: - kv 43: tokenizer.ggml.eos_token_id u32 = 1
llama_model_loader: - kv 44: tokenizer.ggml.padding_token_id u32 = 128815
llama_model_loader: - kv 45: tokenizer.ggml.add_bos_token bool = true
llama_model_loader: - kv 46: tokenizer.ggml.add_eos_token bool = false
llama_model_loader: - kv 47: tokenizer.chat_template str = {% if not add_generation_prompt is de...
llama_model_loader: - kv 48: general.quantization_version u32 = 2
llama_model_loader: - kv 49: quantize.imatrix.file str = /models/deepseek-config/imatrix.dat
llama_model_loader: - kv 50: quantize.imatrix.dataset str = imatrix-training-full-3
llama_model_loader: - kv 51: quantize.imatrix.entries_count i32 = 720
llama_model_loader: - kv 52: quantize.imatrix.chunks_count i32 = 315
llama_model_loader: - type f32: 361 tensors
llama_model_loader: - type q8_0: 305 tensors
llama_model_loader: - type q5_K: 61 tensors
llama_model_loader: - type q6_K: 1 tensors
llama_model_loader: - type iq3_s: 419 tensors
llama_model_load: error loading model: error loading model vocabulary: cannot find tokenizer merges in model file

llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/models/DeepSeek-R1-GGUF-IQ3_S.gguf'
ERR [ load_model] unable to load model | tid="23133942390784" timestamp=1741119264 model="/models/DeepSeek-R1-GGUF-IQ3_S.gguf"
/app/.devops/tools_new.sh: line 47: 13 Segmentation fault ./llama-server "$@"
```
---
👤 **davidsyoung** commented the **2025-03-05** at **12:36:10**:<br>

Preliminary results with `-ser 6,1` and `-ser 7,1` show no major difference in TG performance - it's ±1 t/s. Likely with 16x3090 it's not compute limited, as the GPUs are only running at 5-10% during inference.

---
👤 **ikawrakow** commented the **2025-03-05** at **12:54:10**:<br>
> Likely that with 16x3090 it's not compute limited, as GPU's are only running at 5-10% during inference.
You observe 5-10% GPU utilization because each GPU is only processing 1/16th of the layers, so it is busy only 1/16th of the time (the rest of the time it is just waiting for the next piece of data). You said you are getting ~17 t/s, so each token is taking about 60 ms, and each GPU is busy for about 4 ms out of the 60 ms. But while it is busy, the calculation is limited by something (else it would finish in zero time). If the computation is dominated by the MoE part of the model (it is on my RTX-4080), then using fewer experts will make it run faster, no matter if it is memory or compute bound. With 6 instead of 8 experts it should spend 3 ms instead of 4 ms in each GPU, so you should see up to 20% speedup. It is less than that in practice because the MoE part is not 100% of the computation, latencies, etc. Say it is 10%. That's only 1.7 t/s faster. With the massive fluctuations in processing speed that I see in the logs you have posted before, it is probably hard to measure a 10% speedup. You would need `llama-bench`, but you said that `llama-bench` is not doing the layer split correctly. Perhaps you could see it in prompt processing speed if you process a longer prompt. I think @saood06 mentioned somewhere that one needs to "warm up" the model for quite some time before performance becomes stable; perhaps this is also true for your system.
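The arithmetic in that argument can be checked quickly (numbers taken from the comment; this is a rough pipeline-parallel model, not a measurement):

```python
# Pipeline-parallel timing sketch for 16 GPUs at ~17 t/s (assumed figures).
n_gpus = 16
tokens_per_s = 17
ms_per_token = 1000 / tokens_per_s        # ≈ 59 ms per generated token
ms_busy_per_gpu = ms_per_token / n_gpus   # ≈ 3.7 ms of work per GPU per token
utilization = 1 / n_gpus                  # each GPU busy 1/16th of the time
print(f"{ms_busy_per_gpu:.1f} ms busy per GPU, {100 * utilization:.2f}% utilization")
# If the busy time is dominated by MoE work, 6 of 8 experts scales it by ~6/8:
ms_busy_reduced = ms_busy_per_gpu * 6 / 8
print(f"{ms_busy_reduced:.1f} ms busy per GPU with 6 experts")
```

The computed ~6% utilization is consistent with the 5-10% observed in practice.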
---
👤 **davidsyoung** commented the **2025-03-05** at **13:43:59**:<br>
This makes sense, thank you for taking the time to type it out!

Do you have any commands you'd like me to run to test SER / PPL for you? llama-bench wasn't splitting over GPUs, unfortunately.

I'm also quanting an IQ4_KSS, which I feel will be a great sweet spot, so thank you!

---
👤 **davidsyoung** commented the **2025-03-05** at **14:02:55**:<br>

Super stuff. When done with the quant I'll do that!

Also, just in terms of FA: when I tried to run FA earlier, it tried to allocate 150GB to the first GPU, so I just went back to MLA. Not sure if I was doing something wrong on my side; I just swapped MLA for FA and ran with the same params otherwise.
---
👤 **ikawrakow** commented the **2025-03-05** at **16:26:50**:<br>
> Also, just in terms of FA, when I tried to run FA earlier it tried to allocate 150GB to first GPU.
That happened after PR #241 was merged and you updated to the latest? I guess you are trying to run with a context of 163k tokens. For the `perplexity` calculation with the above command (context of 2048 tokens) the KV cache will be 1.2 GiB and the compute buffer should not be more than 1-2 GiB. If you go to `Q8_0` KV cache (add `-ctk q8_0 -ctv q8_0` to the above command), then the KV cache will be only 600 MiB.
---
👤 **davidsyoung** commented the **2025-03-05** at **21:21:02**:<br>
Ok got some PPL runs!

All perplexity evals were run with:

`./llama-perplexity -m /models/DeepSeek-R1-GGUF/DeepSeek-R1-GGUF-IQ3_M.gguf -f /models/wiki.test.raw -fmoe -fa -c 2048 -ub 2048 |