Add GitHub data: filename sanitization (#640)

Commit eaa2510a28 (parent 3600d82e98) by Thomas, committed via GitHub on 2025-07-23 13:31:53 +02:00.
626 changed files with 0 additions and 0 deletions.

### 🔀 [#589](https://github.com/ikawrakow/ik_llama.cpp/pull/589) - CUDA: small PP performance improvement for MoE models
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-07-06 |
| **Updated** | 2025-07-07 |
---
#### Description
This PR brings a small (2-3%) prompt-processing performance improvement on CUDA for quantized MoE models (when `-fmoe` is used).
Instead of first copying activations to contiguous memory and then quantizing them, quantization is done directly using the row-mapping IDs, which saves the associated kernel-launch overhead.
Here is a performance comparison for `Q4_0`-quantized DeepSeek-Lite on an RTX 4080 using `-mla 3 -fa -fmoe -b 4096 -ub 4096`:
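The fusion described above can be sketched in host-side C++ (a minimal illustration, not the PR's actual CUDA kernels; the quantization scheme and all function names here are hypothetical stand-ins). The point is that the two-pass path materializes a contiguous copy of the selected rows before quantizing, while the fused path quantizes straight through the row-mapping IDs, eliminating the intermediate buffer and, on the GPU, one kernel launch:

```cpp
// Sketch of gather-then-quantize vs. quantize-via-row-IDs (hypothetical
// toy per-row 8-bit quantization, not the real Q8 format used by the PR).
#include <cstdint>
#include <cstring>
#include <cmath>
#include <vector>
#include <algorithm>

// Toy per-row quantization: scale = max|x| / 127, q[i] = round(x[i]/scale).
static void quantize_row(const float* x, int n, int8_t* q, float& scale) {
    float amax = 0.f;
    for (int i = 0; i < n; ++i) amax = std::fmax(amax, std::fabs(x[i]));
    scale = amax / 127.f;
    const float inv = scale > 0.f ? 1.f / scale : 0.f;
    for (int i = 0; i < n; ++i) q[i] = (int8_t)std::lround(x[i] * inv);
}

// Two-pass path: copy the selected rows into a contiguous buffer first,
// then quantize that buffer (on the GPU this is two kernel launches).
void gather_then_quantize(const float* src, int n_cols,
                          const std::vector<int>& row_ids,
                          int8_t* q, float* scales) {
    std::vector<float> tmp(row_ids.size() * (size_t)n_cols);
    for (size_t r = 0; r < row_ids.size(); ++r)
        std::copy(src + (size_t)row_ids[r] * n_cols,
                  src + (size_t)(row_ids[r] + 1) * n_cols,
                  tmp.begin() + r * n_cols);
    for (size_t r = 0; r < row_ids.size(); ++r)
        quantize_row(tmp.data() + r * n_cols, n_cols,
                     q + r * n_cols, scales[r]);
}

// Fused path: quantize directly through the row-mapping IDs, skipping the
// contiguous copy entirely (one kernel launch, no intermediate buffer).
void quantize_via_ids(const float* src, int n_cols,
                      const std::vector<int>& row_ids,
                      int8_t* q, float* scales) {
    for (size_t r = 0; r < row_ids.size(); ++r)
        quantize_row(src + (size_t)row_ids[r] * n_cols, n_cols,
                     q + r * n_cols, scales[r]);
}
```

Both paths produce bit-identical output; the fused version simply does less work per token batch, which is consistent with the small PP-throughput gain reported below.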
### Main branch
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 0.480 | 8532.52 | 5.640 | 181.55 |
| 4096 | 1024 | 4096 | 0.566 | 7240.62 | 5.904 | 173.43 |
| 4096 | 1024 | 8192 | 0.674 | 6073.99 | 6.143 | 166.68 |
| 4096 | 1024 | 12288 | 0.789 | 5189.61 | 6.421 | 159.47 |
### PR
| PP | TG | N_KV | T_PP s | S_PP t/s | T_TG s | S_TG t/s |
|-------|--------|--------|----------|----------|----------|----------|
| 4096 | 1024 | 0 | 0.469 | 8738.41 | 5.638 | 181.61 |
| 4096 | 1024 | 4096 | 0.554 | 7388.85 | 5.909 | 173.29 |
| 4096 | 1024 | 8192 | 0.670 | 6117.30 | 6.148 | 166.57 |
| 4096 | 1024 | 12288 | 0.779 | 5256.86 | 6.435 | 159.14 |