
🗣️ #399 - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine

Author	`fizzAI`
Created	2025-05-09
Updated	2025-05-14

Description

Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use anyways) can run.
Honestly, pretty fast! But the main thing here is the comparison between LCPP (mainline llama.cpp) and IK_LCPP (ik_llama.cpp), and (un)surprisingly mainline LCPP gets pretty hosed.

Specs:

CPU: Ryzen 5 3500, 6 cores/~3.6 GHz iirc
RAM: 16 GB DDR4 at a max of 2667 MHz (Yes, my motherboard sucks. Yes, I know.)
GPU: Nvidia GTX 1650 Super
VRAM: 4 GB(!) of GDDR6

Here are the cherry-picked results that show each framework at its best. Both were run with -ot exps=CPU (the LCPP rows are slightly modified because the two tools output different table formats).

| framework | model | size | params | backend | ngl | amb | fmoe | test | t/s |
| --- | --- | ---: | ---: | --- | --: | --: | --: | --: | --: |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | N/A | N/A | pp512 | 14.29 ± 0.05 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | N/A | N/A | tg128 | 2.75 ± 0.27 |
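
To put the summary numbers in perspective, here is a rough sketch that computes the speedup ratio and checks whether the reported ranges overlap. The figures are copied from the summary table; treating the ± value as a standard deviation across repetitions is an assumption, and the helper names are made up for illustration.

```python
# (mean, spread) in tokens/s, copied from the summary table.
# Assumption: the value after "±" is a standard deviation over repetitions.
results = {
    "pp512": {"ik_llama.cpp": (15.82, 1.91), "llama.cpp": (14.29, 0.05)},
    "tg128": {"ik_llama.cpp": (3.05, 0.30), "llama.cpp": (2.75, 0.27)},
}


def speedup(test):
    """Ratio of the ik_llama.cpp mean to the llama.cpp mean for one test."""
    ik_mean, _ = results[test]["ik_llama.cpp"]
    lc_mean, _ = results[test]["llama.cpp"]
    return ik_mean / lc_mean


def intervals_overlap(test):
    """True if the mean ± spread ranges of the two frameworks overlap."""
    (a, sa) = results[test]["ik_llama.cpp"]
    (b, sb) = results[test]["llama.cpp"]
    return (a - sa) <= (b + sb) and (b - sb) <= (a + sa)


for test in results:
    print(f"{test}: {speedup(test):.2f}x, ranges overlap: {intervals_overlap(test)}")
```

On these numbers the gap is roughly 10% for both tests, and the ranges overlap in both cases, so more repetitions would probably be needed to pin the difference down on this machine.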

And here's the full log including the commands used and other random attempts

fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.89 ± 0.24 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.74 ± 0.22 |

build: 4084ca73 (3673)

fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ:    no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model                          |       size |     params | backend    | threads | fa | ot                    |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  0 | exps=CPU              |           pp512 |         14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  0 | exps=CPU              |           tg128 |          2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  1 | exps=CPU              |           pp512 |         11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA,BLAS  |       6 |  1 | exps=CPU              |           tg128 |          2.75 ± 0.36 |

build: 15e03282 (5318)
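
For anyone who wants to pick the best rows out of logs like these without eyeballing them, here is a minimal sketch of a parser for the pipe-table output. The sample rows are copied from the first ik_llama.cpp run above; `parse_bench_table` and `best_per_test` are hypothetical helper names, not part of either project.

```python
# Rows copied from the first ik_llama.cpp run in the log above.
LOG = """
| model                          |       size |     params | backend    | ngl | fa |   amb | fmoe |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         pp512 |     15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   128 |    1 |         tg128 |      2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         pp512 |     15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  0 |   512 |    1 |         tg128 |      3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         pp512 |     16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   128 |    1 |         tg128 |      2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         pp512 |     15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw |  15.32 GiB |    30.53 B | CUDA       |  99 |  1 |   512 |    1 |         tg128 |      2.89 ± 0.24 |
"""


def parse_bench_table(text):
    """Turn a llama-bench style pipe table into a list of dicts, one per data row."""
    rows, header = [], None
    for line in text.splitlines():
        if not line.startswith("|"):
            continue  # skip blank lines, CUDA init messages, build lines
        cells = [c.strip() for c in line.strip().strip("|").split("|")]
        if header is None:
            header = cells  # the first table line holds the column names
            continue
        if set(cells[0]) <= set("-: "):
            continue  # the separator row under the header
        row = dict(zip(header, cells))
        mean, std = row["t/s"].split("±")
        row["mean"], row["std"] = float(mean), float(std)
        rows.append(row)
    return rows


def best_per_test(rows):
    """Keep the row with the highest mean t/s for each test name."""
    best = {}
    for row in rows:
        current = best.get(row["test"])
        if current is None or row["mean"] > current["mean"]:
            best[row["test"]] = row
    return best


rows = parse_bench_table(LOG)
best = best_per_test(rows)
for test, row in best.items():
    print(test, "fa =", row["fa"], "amb =", row["amb"], "t/s =", row["t/s"])
```

The sketch only handles a single table per input; for a log containing several runs with different column sets, the header would need to be reset whenever a new table starts.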

Some other interesting notes:

Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA. However, I did test with it enabled, and LCPP's CPU FA is so slow it's not even funny (pp512 drops from 14.29 to 11.80 t/s).
There's a bit of an uptick in performance without FA when amb is higher, but with FA it's faster (for pp512, at least) for amb to be lower. ???
I tried both exps=CPU (which I later found only offloads parts of the FFN to the CPU) and ffn=CPU (which offloads all of the FFN to the CPU as I was originally intending)... but ffn=CPU, the one which offloads the norms and stuff too, is slower! For some reason!
I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with Blis.
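
The difference between exps=CPU and ffn=CPU noted above can be sketched as a pattern match over tensor names. This is only an illustration: the tensor names below are assumed typical per-layer names for a Qwen3 MoE GGUF rather than read from the actual file, and the matching is assumed to be an unanchored regex search on the tensor name.

```python
import re

# Illustrative tensor names for one layer (assumed, not read from the GGUF).
TENSORS = [
    "blk.0.attn_q.weight",
    "blk.0.attn_norm.weight",
    "blk.0.ffn_norm.weight",
    "blk.0.ffn_gate_inp.weight",   # expert router
    "blk.0.ffn_gate_exps.weight",  # expert weights
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def overridden(pattern, names):
    """Names that a `-ot <pattern>=CPU` override would catch,
    assuming the pattern is searched anywhere in the tensor name."""
    return [n for n in names if re.search(pattern, n)]


print("exps:", overridden("exps", TENSORS))
print("ffn: ", overridden("ffn", TENSORS))
```

Under these assumptions, exps only catches the three expert weight tensors, while ffn also catches the FFN norm and the router, which would leave those small tensors on the CPU as well.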

I still need to try dense models, CPU without offload, etc etc for this to be a fair comparison, but I hope this is still interesting data :)

🗣️ Discussion

👤 VinnyG9 replied the 2025-05-14 at 12:05:43:

> I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with Blis.

if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?

👤 ikawrakow replied the 2025-05-14 at 12:29:26:

> if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.

No, it does not. This is ik_llama.cpp not llama.cpp. I wrote the matrix multiplication implementation for almost all quants in llamafile and for all quants here, so I know that what I have here is faster than llamafile.
