
### 🗣️ [#399](https://github.com/ikawrakow/ik_llama.cpp/discussions/399) - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine

| **Author** | `fizzAI` |
| :--- | :--- |
| **Created** | 2025-05-09 |
| **Updated** | 2025-05-14 |

---

#### Description

Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use, anyway) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly mainline LCPP gets pretty hosed.

Specs:
- **CPU**: Ryzen 5 3500, 6 cores/~3.6 GHz iirc
- **RAM**: 16 GB DDR4 at a max of 2667 MHz (Yes, my motherboard sucks. Yes, I know.)
- **GPU**: Nvidia GTX 1650 Super
- **VRAM**: 4 GB(!) of GDDR6

Here are the cherry-picked results that show each framework at its best. Both runs use `-ot exps=CPU`; the LCPP rows are slightly modified because the two tools print different table formats.

| framework | model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| - | ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      3.05 ± 0.30 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |  99 |  0 |   N/A |  N/A |         pp512 |     14.29 ± 0.05 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |  99 |  0 |   N/A |  N/A |         tg128 |      2.75 ± 0.27 |
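

To put a number on the gap, here is a minimal sketch that computes the ratio between the two frameworks from the table above. The figures are copied from the table; the `compare` helper and the interval-overlap check are my own illustration, and the check is only a rough sanity test on the printed `±` values, not a proper statistical comparison.

```python
# Headline numbers copied from the table above: (mean t/s, printed ± value).
results = {
    "pp512": {"ik_llama.cpp": (15.82, 1.91), "llama.cpp": (14.29, 0.05)},
    "tg128": {"ik_llama.cpp": (3.05, 0.30), "llama.cpp": (2.75, 0.27)},
}


def compare(test):
    """Return (speedup of ik_llama.cpp over llama.cpp, whether the ± ranges overlap)."""
    ik_mean, ik_err = results[test]["ik_llama.cpp"]
    lc_mean, lc_err = results[test]["llama.cpp"]
    speedup = ik_mean / lc_mean
    # Crude check: does the low end of the faster run reach the high end of the slower one?
    overlap = (ik_mean - ik_err) <= (lc_mean + lc_err)
    return speedup, overlap


for test in results:
    speedup, overlap = compare(test)
    print(f"{test}: {speedup:.2f}x, ranges overlap: {overlap}")
```

Both tests come out at roughly 1.11x in favour of ik_llama.cpp, with overlapping `±` ranges, so on this machine the gap is best read as indicative rather than exact.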

<details>
<summary>
And here's the full log including the commands used and other random attempts
</summary>

```
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.89 ± 0.24 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.74 ± 0.22 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | threads | fa | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  0 | exps=CPU              |           pp512 |         14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  0 | exps=CPU              |           tg128 |          2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  1 | exps=CPU              |           pp512 |         11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  1 | exps=CPU              |           tg128 |          2.75 ± 0.36 |

build: 15e03282 (5318)
```

</details>
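

For anyone who wants to pick the best rows out of a log like this without eyeballing it, here is a rough sketch that parses a `llama-bench` markdown table and keeps the fastest row per test. The helper names `parse_bench_table` and `best_per_test` are hypothetical, and the sample is just six rows copied from the first ik_llama.cpp run above.

```python
SAMPLE = """
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| --- | ---: | ---: | --- | --: | -: | ----: | ---: | ---: | ---: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.78 ± 0.18 |
"""


def parse_bench_table(text):
    """Turn a markdown table into a list of dicts keyed by the header cells."""
    rows, header = [], None
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip log lines that are not table rows
        cells = [c.strip() for c in line.strip("|").split("|")]
        if header is None:
            header = cells
        elif set(cells[0]) <= set("-: "):
            continue  # separator row such as | --- | ---: |
        else:
            rows.append(dict(zip(header, cells)))
    return rows


def best_per_test(rows):
    """Map each test name to (mean t/s, row) for the row with the highest mean."""
    best = {}
    for row in rows:
        mean = float(row["t/s"].split("±")[0])
        if row["test"] not in best or mean > best[row["test"]][0]:
            best[row["test"]] = (mean, row)
    return best


best = best_per_test(parse_bench_table(SAMPLE))
for test, (mean, row) in best.items():
    print(test, mean, "fa=" + row["fa"], "amb=" + row["amb"])
```

Note that taking the maximum per test can select different `fa`/`amb` settings for `pp512` and `tg128`, which is not the same as choosing a single configuration for both.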

Some other interesting notes:
- Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA. I did test with it enabled, though, and LCPP's CPU FA is so slow it's not even funny (pp512 drops from 14.29 to 11.80 t/s)
- There's a bit of an uptick in performance without FA when `amb` is higher, but it's faster for `amb` to be lower with FA. ???
- I tried both `exps=CPU` (which I later found only offloads part of the FFN, the expert tensors, to the CPU) and `ffn=CPU` (which offloads all of the FFN to the CPU, as I was originally intending)... but it's slower to use the one that offloads the norms and stuff too! For some reason!
- I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).
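

The difference between `exps=CPU` and `ffn=CPU` is easier to see with a small sketch. It assumes the `-ot` pattern is applied as a regex search against each tensor name; the tensor names below are illustrative ones in the usual GGUF style for MoE models, not a dump of this particular file, and `overridden` is a hypothetical helper.

```python
import re

# Illustrative tensor names (assumed, not read from the actual GGUF).
tensor_names = [
    "blk.0.attn_q.weight",
    "blk.0.attn_norm.weight",
    "blk.0.ffn_norm.weight",
    "blk.0.ffn_gate_inp.weight",   # expert router
    "blk.0.ffn_gate_exps.weight",
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def overridden(pattern, names):
    """Return the tensor names that a given -ot pattern would match."""
    rx = re.compile(pattern)
    return [name for name in names if rx.search(name)]


for pattern in ("exps", "ffn"):
    print(pattern, "->", overridden(pattern, tensor_names))
```

Under these assumptions, `exps` catches only the three expert weight tensors, while `ffn` also catches the FFN norm and the router, which lines up with the "norms and stuff" mentioned above.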

I still need to try dense models, CPU without offload, etc etc for this to be a fair comparison, but I hope this is still interesting data :)

---

#### 🗣️ Discussion

👤 **VinnyG9** replied the **2025-05-14** at **12:05:43**:<br>

> * I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with [Blis](https://github.com/flame/blis).

if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?

> 👤 **ikawrakow** replied the **2025-05-14** at **12:29:26**:<br>
> > if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.
>
> No, it does not. This is `ik_llama.cpp` not `llama.cpp`. I wrote the matrix multiplication implementation for almost all quants in `llamafile` and for all quants here, so I know that what I have here is faster than llamafile.