🗣️ #399 - Qwen 30b.A3b IK/LCPP comparisons on lowspec machine
| Author | fizzAI |
|---|---|
| Created | 2025-05-09 |
| Updated | 2025-05-14 |
Description
Hi! Recently (as in, I finished 5 minutes ago) I got curious as to how fast my shitbox (for AI use anyways) can run this model.
Honestly, pretty fast! But the main thing here is the comparison between LCPP and IK_LCPP, and (un)surprisingly mainline LCPP gets pretty hosed.
Specs:
- CPU: Ryzen 5 3500, 6 cores/~3.6 GHz iirc
- RAM: 16 GB DDR4 at a max of 2667 MHz (Yes, my motherboard sucks. Yes, I know.)
- GPU: Nvidia GTX 1650 Super
- VRAM: 4 GB(!) of GDDR6
Here are the cherry-picked results that show each framework at its best. Both are running with -ot exps=CPU (the LCPP table is slightly modified because the two tools output different formats):
| framework | model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
|---|---|---|---|---|---|---|---|---|---|---|
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| ik_llama.cpp | qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | pp512 | 14.29 ± 0.05 |
| llama.cpp | qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 99 | 0 | N/A | N/A | tg128 | 2.75 ± 0.27 |
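
To put a number on the gap, here is a minimal Python sketch that computes the relative difference between the two frameworks from the four rows above. The values are copied from the table; the helper names (`relative_gain`, `within_reported_spread`) are made up for illustration, and the spread check is only a rough sanity test, not a proper significance test.

```python
# Best-case throughput in tokens/s as (mean, stddev), copied from the table above.
results = {
    "pp512": {"ik_llama.cpp": (15.82, 1.91), "llama.cpp": (14.29, 0.05)},
    "tg128": {"ik_llama.cpp": (3.05, 0.30), "llama.cpp": (2.75, 0.27)},
}


def relative_gain(test):
    """Percent by which ik_llama.cpp's mean exceeds mainline's mean."""
    ik_mean, _ = results[test]["ik_llama.cpp"]
    lc_mean, _ = results[test]["llama.cpp"]
    return (ik_mean / lc_mean - 1.0) * 100.0


def within_reported_spread(test):
    """True if the gap between means is no larger than the two stddevs added together."""
    ik_mean, ik_std = results[test]["ik_llama.cpp"]
    lc_mean, lc_std = results[test]["llama.cpp"]
    return abs(ik_mean - lc_mean) <= ik_std + lc_std


for test in results:
    print(
        f"{test}: {relative_gain(test):+.1f}% for ik_llama.cpp, "
        f"gap within reported spread: {within_reported_spread(test)}"
    )
```

Both tests come out at roughly a 10% advantage for ik_llama.cpp, though the ± column is wide enough that more repetitions would help firm this up.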
And here's the full log including the commands used and other random attempts
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot exps=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.72 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.86 ± 0.34 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 15.82 ± 1.91 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 3.05 ± 0.30 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.38 ± 1.32 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.78 ± 0.18 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.78 ± 1.96 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.89 ± 0.24 |
build: 4084ca73 (3673)
fizz@MAMMON:~$ ik_llama.cpp/build/bin/llama-bench -fa 0,1 -amb 128,512 -fmoe 1 -ot ffn=CPU -ngl 99 -m ~/ggufs/REPACK-Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | ngl | fa | amb | fmoe | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | ----: | ---: | ------------: | ---------------: |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | pp512 | 15.66 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 128 | 1 | tg128 | 2.55 ± 0.19 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | pp512 | 16.07 ± 1.94 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 0 | 512 | 1 | tg128 | 2.86 ± 0.27 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | pp512 | 16.00 ± 1.77 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 128 | 1 | tg128 | 2.63 ± 0.16 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | pp512 | 15.87 ± 2.01 |
| qwen3moe ?B IQ4_XS_R8 - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA | 99 | 1 | 512 | 1 | tg128 | 2.74 ± 0.22 |
build: 4084ca73 (3673)
fizz@MAMMON:~$ llama.cpp/build/bin/llama-bench -fa 0,1 -ot exps=CPU -ngl 99 -m ~/ggufs/Qwen_Qwen3-30B-A3B-IQ4_XS.gguf
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: no
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 1 CUDA devices:
Device 0: NVIDIA GeForce GTX 1650 SUPER, compute capability 7.5, VMM: yes
| model | size | params | backend | threads | fa | ot | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -: | --------------------- | --------------: | -------------------: |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | pp512 | 14.29 ± 0.05 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 0 | exps=CPU | tg128 | 2.75 ± 0.27 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | pp512 | 11.80 ± 0.04 |
| qwen3moe 30B.A3B IQ4_XS - 4.25 bpw | 15.32 GiB | 30.53 B | CUDA,BLAS | 6 | 1 | exps=CPU | tg128 | 2.75 ± 0.36 |
build: 15e03282 (5318)
Some other interesting notes:
- Memory wasn't the bottleneck here (at least not GPU memory), so I didn't really see any tangible benefits from FA -- however, I did test with it enabled, and LCPP's CPU FA is so slow it's not even funny
- There's a bit of an uptick in performance without FA when `amb` is higher, but it's faster for `amb` to be lower with FA. ???
- I tried both `exps=CPU` (which I later found only offloads parts of the FFN to the CPU) and `ffn=CPU` (which offloads all of the FFN to the CPU as I was originally intending)... but it's slower to use the one which offloads the norms and stuff too! For some reason!
- I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with Blis.
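
For anyone puzzled by the `exps=CPU` vs `ffn=CPU` difference: the `-ot` argument is a `pattern=buffer` pair, and the pattern is matched against tensor names. The sketch below mimics that with Python's `re.search`. The tensor names are what I'd expect for one layer of a Qwen3 MoE GGUF and are listed here as an assumption, not dumped from the actual file, and the helper `overridden` is hypothetical.

```python
import re

# Assumed tensor names for a single layer of a Qwen3 MoE GGUF (illustrative only).
TENSORS = [
    "blk.0.attn_norm.weight",
    "blk.0.attn_q.weight",
    "blk.0.attn_k.weight",
    "blk.0.attn_v.weight",
    "blk.0.attn_output.weight",
    "blk.0.ffn_norm.weight",       # pre-FFN norm
    "blk.0.ffn_gate_inp.weight",   # MoE router
    "blk.0.ffn_gate_exps.weight",  # expert weights
    "blk.0.ffn_up_exps.weight",
    "blk.0.ffn_down_exps.weight",
]


def overridden(pattern):
    """Return the tensor names a given -ot pattern would match (substring regex search)."""
    rx = re.compile(pattern)
    return [name for name in TENSORS if rx.search(name)]


for pattern in ("exps", "ffn"):
    print(f"-ot {pattern}=CPU would move:")
    for name in overridden(pattern):
        print(f"  {name}")
```

Under those assumptions, `exps` only catches the three expert tensors per layer, while `ffn` also drags the small norm and router tensors over to the CPU, which fits the slowdown seen with `ffn=CPU`.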
I still need to try dense models, CPU without offload, etc etc for this to be a fair comparison, but I hope this is still interesting data :)
🗣️ Discussion
👤 VinnyG9 replied the 2025-05-14 at 12:05:43:
> - I'm not sure whether it's best to build with or without a separate BLAS backend? The docs here and the docs in LCPP don't really clarify, so I went with what people seemed to be using most here for IK (noblas) and compiled LCPP with Blis.
if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu, but not relevant unless you're using -nkvo ?
👤 ikawrakow replied the 2025-05-14 at 12:29:26:
> if you don't specify a blas backend it defaults to llamafile i think which is faster in cpu.

No, it does not. This is `ik_llama.cpp` not `llama.cpp`. I wrote the matrix multiplication implementation for almost all quants in `llamafile` and for all quants here, so I know that what I have here is faster than llamafile.