### 🐛 [#416](https://github.com/ikawrakow/ik_llama.cpp/pull/416) - Fix SER (CUDA)
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-13 |
| **Updated** | 2025-05-14 |
---
#### Description
Follow-up to #415. This should fix the SER issues on CUDA.
---
#### 💬 Conversation
👤 **ubergarm** commented the **2025-05-13** at **15:30:55**:<br>
Interestingly, I recompiled main with CUDA (after you merged #415 into main) and can no longer reproduce the error.
FWIW, this command works both with and without this PR:
```
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
--model /mnt/raid/hf/DeepSeek-V3-0324-GGUF/DeepSeek-V3-0324-IQ2_K_R4/DeepSeek-V3-0324-IQ2_K_R4-00001-of-00005.gguf \
--alias ubergarm/DeepSeek-R1-IQ2_K_R4 \
--ctx-size 131072 \
-ctk f16 \
-mla 3 -fa \
-amb 512 \
-fmoe \
-ser 6,1 \
--n-gpu-layers 63 \
--override-tensor exps=CPU \
--parallel 1 \
--threads 24 \
--host 127.0.0.1 \
--port 8080
```
I don't have enough VRAM to fully offload any R1/V3 models, so I'm not sure how best to test this other than fully offloading V2-Lite, which you have probably already done.
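A minimal sketch of what such a fully offloaded V2-Lite test could look like, simply mirroring the flags above — the model path, context size, and `--n-gpu-layers` value are placeholders, not taken from this thread:
```
# Hypothetical full-offload test with DeepSeek-V2-Lite; path and sizes are illustrative.
CUDA_VISIBLE_DEVICES="0" \
./build/bin/llama-server \
--model /path/to/DeepSeek-V2-Lite.gguf \
--ctx-size 8192 \
-mla 3 -fa \
-fmoe \
-ser 6,1 \
--n-gpu-layers 99 \
--host 127.0.0.1 \
--port 8080
```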
---
👤 **ikawrakow** commented the **2025-05-13** at **15:43:01**:<br>
On CUDA it is more difficult to trigger the bug. I used Qwen3-30B-A3B quantized with `IQ5_K`. I only have a 16 GB GPU, so I had to leave the last 19 layers of experts on the CPU. I used `llama-cli` like this:
```
./bin/llama-cli -m ../ncuda/junk.bin -t 16 -ngl 100 -c 20000 -cnv -p " " -rtr -fa -s 1234 -ot "blk\.29\.ffn=CPU,blk\.[3-4][0-9]\.ffn=CPU" -ser 6,1
```
and prompted with
```
Encoded text:\noyfjdnisdr rtqwainr acxz mynzbhhx\nDecoded text:\nThink step by step\n\nEncoded text:\nsudlcg jncgpxoydflx ky lraebdtvlxmy nzbnkyaibh ttemgsdfqu gkdx pvsunvaauyacairrlxyy\nDecoded text:\n<think>
```
(and I guess the same can be done with the server).
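A minimal sketch of how the same prompt could be sent to a running llama-server — this assumes the standard `/completion` endpoint and its usual `prompt`/`n_predict` fields, which are not shown anywhere in this thread:
```
# Hypothetical server-side repro: start llama-server with the same -ser/-ot flags,
# then send the decoding prompt; the endpoint and field names are assumptions.
curl -s http://127.0.0.1:8080/completion \
  -H "Content-Type: application/json" \
  -d '{
        "prompt": "Encoded text:\noyfjdnisdr rtqwainr acxz mynzbhhx\nDecoded text:\nThink step by step\n\nEncoded text:\nsudlcg jncgpxoydflx ky lraebdtvlxmy nzbnkyaibh ttemgsdfqu gkdx pvsunvaauyacairrlxyy\nDecoded text:\n<think>",
        "n_predict": 2048
      }'
```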
The thinking goes well for a while, but eventually it starts spitting out `GGGGG`.
The PR fixes that.
Interestingly enough, after the fix it does solve the puzzle with `-ser 6,1`, but fails with `-ser 7,1`.
I don't think partial offload is required; the bug will likely trigger sooner if all layers are on the GPU. I found it easier to debug with a "thinking" model because little interaction is required to get the model to generate many tokens one by one.
---
👤 **ikawrakow** commented the **2025-05-13** at **15:57:54**:<br>
Oops, it is still failing with DeepSeek-Lite. Converting to draft.