
🗣️ #399 - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine

Author fizzAI
Created 2025-05-09
Updated 2025-05-14

Description

Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use, anyway) can run Qwen3 30B-A3B.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly, mainline LCPP gets pretty hosed.

Specs:

  • CPU: Ryzen 5 3500, 6 cores / ~3.6 GHz IIRC
  • RAM: 16 GB DDR4 at a max of 2667 MHz (Yes, my motherboard sucks. Yes, I know.)
  • GPU: Nvidia GTX 1650 Super
  • VRAM: 4 GB(!) of GDDR6

Here are the cherry-picked results that show each framework at its best -- both runs use -ot exps=CPU (the LCPP table is slightly modified because the two tools output different formats):

| framework    | model                              |       size |  params | backend   | ngl | fa | amb | fmoe |  test |          t/s |
| ------------ | ---------------------------------- | ---------: | ------: | --------- | --: | -: | --: | ---: | ----: | -----------: |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw   |  15.32 GiB | 30.53 B | CUDA      |  99 |  0 | 512 |    1 | pp512 | 15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw   |  15.32 GiB | 30.53 B | CUDA      |  99 |  0 | 512 |    1 | tg128 |  3.05 ± 0.30 |
| llama.cpp    | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB | 30.53 B | CUDA,BLAS |  99 |  0 | N/A |  N/A | pp512 | 14.29 ± 0.05 |
| llama.cpp    | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB | 30.53 B | CUDA,BLAS |  99 |  0 | N/A |  N/A | tg128 |  2.75 ± 0.27 |
And here's the full log, including the commands used and other random attempts:
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.89 ± 0.24 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.74 ± 0.22 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | threads | fa | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  0 | exps=CPU              |           pp512 |         14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  0 | exps=CPU              |           tg128 |          2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  1 | exps=CPU              |           pp512 |         11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  1 | exps=CPU              |           tg128 |          2.75 ± 0.36 |

build: 15e03282 (5318)

Some other interesting notes:

  • Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA -- however, I did test with it enabled, and LCPP's CPU FA is so slow it's not even funny
  • Without FA there's a bit of an uptick in performance when -amb is higher, but with FA it's faster when -amb is lower. ???
  • I tried both exps=CPU (which I later found only offloads parts of the FFN to the CPU) and ffn=CPU (which offloads all of the FFN to the CPU, as I was originally intending)... but the one that also offloads the norms and such is slower! For some reason! (See the regex sketch after this list.)
  • I'm not sure whether it's best to build with or without a separate BLAS backend. The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with BLIS (see the build sketch below).
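
A minimal sketch of why the two patterns behave differently, assuming typical Qwen3-MoE GGUF tensor names (the exact names can vary by conversion): -ot matches its regex against each tensor's name, so exps only catches the expert weights, while ffn also catches the FFN norms.

```
# Hypothetical tensor names, typical of a Qwen3-MoE GGUF:
#   blk.0.ffn_gate_exps.weight   <- matches "exps" and "ffn"
#   blk.0.ffn_up_exps.weight     <- matches "exps" and "ffn"
#   blk.0.ffn_down_exps.weight   <- matches "exps" and "ffn"
#   blk.0.ffn_norm.weight        <- matches "ffn" only

# Keep only the large expert weights in system RAM:
llama-bench -ot exps=CPU ...
# Also pull the small, frequently-hit norm tensors off the GPU:
llama-bench -ot ffn=CPU ...
```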

I still need to try dense models, CPU-only without offload, etc. for this to be a fair comparison, but I hope this is still interesting data :)
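
For reference, a sketch of the two build configurations being compared. The CMake flags below are the ones documented for mainline llama.cpp (GGML_BLAS_VENDOR=FLAME selects BLIS); whether they carry over unchanged to ik_llama.cpp is an assumption.

```
# ik_llama.cpp: CUDA, no separate BLAS backend ("noblas")
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j

# llama.cpp: CUDA plus BLIS as the BLAS backend
cmake -B build -DGGML_CUDA=ON -DGGML_BLAS=ON -DGGML_BLAS_VENDOR=FLAME
cmake --build build --config Release -j
```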


🗣️ Discussion

👤 VinnyG9 replied on 2025-05-14 at 12:05:43:

> I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with Blis.

If you don't specify a BLAS backend it defaults to llamafile, I think, which is faster on CPU; but that's not relevant unless you're using -nkvo?
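
(For context, -nkvo / --no-kv-offload keeps the KV cache in system RAM, which is the case where the CPU GEMM path would actually be exercised. A hypothetical invocation, assuming llama-bench's documented flag:)

```
# With the KV cache kept in host memory, attention matmuls run on the CPU,
# so the choice of CPU backend (llamafile vs. a BLAS library) can matter.
llama.cpp/build/bin/llama-bench -nkvo 1 -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
```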

👤 ikawrakow replied on 2025-05-14 at 12:29:26:

> if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.

No, it does not. This is ik_llama.cpp, not llama.cpp. I wrote the matrix multiplication implementation for almost all quants in llamafile and for all quants here, so I know that what I have here is faster than llamafile.