### 🗣️ [#399](https://github.com/ikawrakow/ik_llama.cpp/discussions/399) - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine

| **Author** | `fizzAI` |
| :--- | :--- |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-14 |

---

#### Description

Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use anyways) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly mainline LCPP gets pretty hosed.

Specs:
- **CPU**: Ryzen 5 3500, 6 cores/~3.6 GHz iirc
- **RAM**: 16 GB DDR4 at a max of 2667 MHz (Yes, my motherboard sucks. Yes, I know.)
- **GPU**: Nvidia GTX 1650 Super
- **VRAM**: 4 GB(!) of GDDR6

Here are the cherry-picked results that show each framework at its best. Both were run with `-ot exps=CPU` (the LCPP rows are slightly modified because the two tools output different table formats):

| framework | model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| - | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | pp512 | 14.29 ± 0.05 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | tg128 | 2.75 ± 0.27 |

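
To put the table above in relative terms, here is a minimal Python sketch that computes the percentage difference between the two frameworks and checks whether the reported error bars overlap. The numbers are copied from the table; treating the `±` value as one standard deviation is an assumption about how `llama-bench` reports its results, and the helper names are made up for illustration.

```python
# Tokens per second as (mean, spread), copied from the summary table.
# Assumption: the value after "±" is one standard deviation.
results = {
    "pp512": {"ik_llama.cpp": (15.82, 1.91), "llama.cpp": (14.29, 0.05)},
    "tg128": {"ik_llama.cpp": (3.05, 0.30), "llama.cpp": (2.75, 0.27)},
}


def relative_speedup(ik_mean, lcpp_mean):
    """How much faster ik_llama.cpp is, in percent of the llama.cpp rate."""
    return (ik_mean / lcpp_mean - 1.0) * 100.0


def intervals_overlap(a, b):
    """True if the ranges mean ± spread of a and b overlap."""
    (mean_a, spread_a), (mean_b, spread_b) = a, b
    return abs(mean_a - mean_b) <= spread_a + spread_b


for test, row in results.items():
    ik, lcpp = row["ik_llama.cpp"], row["llama.cpp"]
    print(
        f"{test}: {relative_speedup(ik[0], lcpp[0]):+.1f}% "
        f"(error bars overlap: {intervals_overlap(ik, lcpp)})"
    )
```

Under that assumption the gap is roughly 11% for both tests, and the error bars overlap in both cases, so longer or repeated runs would help pin down the difference.
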
<details>
<summary>
And here's the full log including the commands used and other random attempts
</summary>

```
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.89 ± 0.24 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.74 ± 0.22 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | threads | fa | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | pp512 | 14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | tg128 | 2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | pp512 | 11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | tg128 | 2.75 ± 0.36 |

build: 15e03282 (5318)
```

</details>

Some other interesting notes:
- Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA -- however, I did test with it enabled, and LCPP's CPU FA is so slow it's not even funny (pp512 drops from 14.29 to 11.80 t/s)
- There's a bit of an uptick in performance without FA when `amb` is higher, but it's faster for `amb` to be lower with FA. ???
- I tried both `exps=CPU` (which I later found only offloads parts of the FFN to the CPU) and `ffn=CPU` (which offloads all of the FFN to the CPU as I was originally intending)... but it's slower to use the one which offloads the norms and stuff too! For some reason!
- I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).

I still need to try dense models, CPU without offload, etc. for this to be a fair comparison, but I hope this is still interesting data :)

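
The difference between `-ot exps=CPU` and `-ot ffn=CPU` noted above can be sketched in Python. This assumes the part before `=` is a regular expression searched against each tensor name. The tensor names below are a hypothetical single-layer sample in the usual GGUF naming style, not a dump from the actual model file, and `overridden` is an illustrative helper rather than anything from either codebase.

```python
import re

# Hypothetical sample of tensor names for one MoE layer (illustrative only).
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.attn_norm.weight",
    "blk.0.ffn_norm.weight",
    "blk.0.ffn_gate_inp.weight",
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def overridden(pattern, names):
    """Return the names the pattern matches anywhere (regex search semantics)."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


exps_cpu = overridden("exps", tensor_names)
ffn_cpu = overridden("ffn", tensor_names)

print("exps=CPU moves:", exps_cpu)
print("ffn=CPU moves: ", ffn_cpu)
print("extra with ffn:", sorted(set(ffn_cpu) - set(exps_cpu)))
```

With these sample names, `ffn` additionally catches the FFN norm and the router (`ffn_gate_inp`) tensors, which are small, so moving them to the CPU plausibly adds transfers without freeing much VRAM. That is one possible reading of the slowdown, not a measured explanation.
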
---

#### 🗣️ Discussion

👤 **VinnyG9** replied the **2025-05-14** at **12:05:43**:<br>

> * I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).

if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?

> 👤 **ikawrakow** replied the **2025-05-14** at **12:29:26**:<br>
> > if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.
>
> No, it does not. This is `ik_llama.cpp`, not `llama.cpp`. I wrote the matrix multiplication implementation for almost all quants in `llamafile` and for all quants here, so I know that what I have here is faster than llamafile.