### 🗣️ [#399](https://github.com/ikawrakow/ik_llama.cpp/discussions/399) - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine
| **Author** | `fizzAI` |
| :--- | :--- |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-14 |
---
#### Description
Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use anyways) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly, mainline LCPP gets pretty hosed.
Specs:
- **CPU**: Ryzen 5 3500, 6 cores/~3.6 GHz iirc
- **RAM**: 16 GB DDR4 at a max of 2667 MHz (Yes, my motherboard sucks. Yes, I know.)
- **GPU**: Nvidia GTX 1650 Super
- **VRAM**: 4 GB(!) of GDDR6
Here are the cherry-picked results that show each framework at its best -- both are running with `-ot exps=CPU` (with the LCPP table slightly modified because the two tools output different formats):
| framework | model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| - | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | pp512 | 14.29 ± 0.05 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | tg128 | 2.75 ± 0.27 |
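To put a number on the gap, here's a rough sketch that takes the four rows from the table above and works out the relative difference, plus whether the reported ± ranges overlap. The helper names (`speedup_percent`, `ranges_overlap`) are made up for this example, and treating the ± values as simple error bars is an assumption, not a proper significance test.

```python
# Best-case rows copied from the table above: (mean t/s, reported +/- t/s).
results = {
    "pp512": {"ik_llama.cpp": (15.82, 1.91), "llama.cpp": (14.29, 0.05)},
    "tg128": {"ik_llama.cpp": (3.05, 0.30), "llama.cpp": (2.75, 0.27)},
}


def speedup_percent(ik_mean, lcpp_mean):
    """Relative throughput gain of ik_llama.cpp over llama.cpp, in percent."""
    return (ik_mean / lcpp_mean - 1.0) * 100.0


def ranges_overlap(a, b):
    """True if the mean +/- spread ranges of two measurements overlap."""
    (mean_a, spread_a), (mean_b, spread_b) = a, b
    return abs(mean_a - mean_b) <= spread_a + spread_b


for test, rows in results.items():
    ik, lcpp = rows["ik_llama.cpp"], rows["llama.cpp"]
    gain = speedup_percent(ik[0], lcpp[0])
    print(f"{test}: {gain:+.1f}% for ik_llama.cpp, "
          f"ranges overlap: {ranges_overlap(ik, lcpp)}")
```

On these numbers the gain comes out at roughly 10 to 11% for both tests, with the ± ranges overlapping in both cases, so more runs would be needed to pin the difference down.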
And here's the full log, including the commands used and other random attempts:
```
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.89 ± 0.24 |
build: 4084ca73 (3673)
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.74 ± 0.22 |
build: 4084ca73 (3673)
fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | threads | fa | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | pp512 | 14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | tg128 | 2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | pp512 | 11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | tg128 | 2.75 ± 0.36 |
build: 15e03282 (5318)
```
Some other interesting notes:
- Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA -- however, I did test with it enabled, and LCPP's CPU FA is so slow it's not even funny
- There's a bit of an uptick in performance without FA when `amb` is higher, but it's faster for `amb` to be lower with FA (at least for pp512). ???
- I tried both `exps=CPU` (which I later found only offloads parts of the FFN to the CPU) and `ffn=CPU` (which offloads all of the FFN to the CPU as I was originally intending)... but it's slower to use the one which offloads the norms and stuff too! For some reason!
- I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).
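For the `exps=CPU` vs `ffn=CPU` difference, here's a small sketch of which tensors each pattern would catch. It assumes `-ot` treats the part before `=` as a regex searched against tensor names, and the tensor names below are illustrative ones modelled on the usual GGUF naming for MoE layers, not dumped from this exact model file.

```python
import re

# Illustrative tensor names for one MoE layer (assumed, not read from the GGUF).
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.ffn_norm.weight",       # pre-FFN norm
    "blk.0.ffn_gate_inp.weight",   # expert router
    "blk.0.ffn_gate_exps.weight",  # expert weights
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def overridden(pattern, names):
    """Names an override like `-ot <pattern>=CPU` would match,
    assuming regex-search semantics on the tensor name."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


for pattern in ("exps", "ffn"):
    print(f"{pattern}=CPU ->")
    for name in overridden(pattern, tensor_names):
        print(f"    {name}")
```

Under those assumptions, `exps` only picks up the three expert weight tensors, while `ffn` also drags the norm and the router onto the CPU, which lines up with the "norms and stuff" part of the note above.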
I still need to try dense models, CPU without offload, etc etc for this to be a fair comparison, but I hope this is still interesting data :)
---
#### 🗣️ Discussion
👤 **VinnyG9** replied the **2025-05-14** at **12:05:43**:
> * I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).
if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?
> 👤 **ikawrakow** replied the **2025-05-14** at **12:29:26**:
> > if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.
>
> No, it does not. This is `ik_llama.cpp` not `llama.cpp`. I wrote the matrix multiplication implementation for almost all quants in `llamafile` and for all quants here, so I know that what I have here is faster than llamafile.