🗣️ #18 - CPU beating GPU in token generation speed

Author ikawrakow
Created 2024-08-13
Updated 2025-04-03

Description

The TriLM ternary models are available in various sizes, so I was curious to look into prompt processing (PP) and token generation (TG) speed when the model is small enough to fit in the CPU cache. I have a Ryzen-7950X CPU with 64 MiB of L3 cache, and the 99M parameter TriLM model is 46 MiB when quantized with IQ2_TN. So, without further ado, let's look at a comparison between the Ryzen-7950X and an RTX-4080 in this case:

| backend | threads | test | t/s |
| --- | --- | --- | --- |
| Ryzen-7950X | 16 | pp1500 | 8268.11 ± 48.34 |
| Ryzen-7950X | 4 | tg500 | 1016.65 ± 22.17 |
| Ryzen-7950X | 8 | tg500 | 1224.83 ± 32.28 |
| Ryzen-7950X | 16 | tg500 | 1240.54 ± 25.74 |
| RTX-4080 | - | pp1500 | 110388 ± 250 |
| RTX-4080 | - | tg500 | 1136.64 ± 4.99 |

The GPU is still much faster than the CPU for prompt processing (although the difference, which is typically a factor of ~30 between this specific GPU and CPU, has shrunk to just a factor of 13), but now the CPU beats the GPU in TG speed!
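Why a CPU can win here: single-stream TG reads every weight once per token, so throughput is capped by effective memory bandwidth divided by model size, and a model that fits in L3 is no longer tied to DRAM speed. Here is a back-of-the-envelope sketch of that roofline; the bandwidth figures are assumed, illustrative values, not measurements of this machine:

```python
# Roofline sketch for single-stream token generation (TG):
# each token reads every weight once, so
#   t/s  <=  effective_memory_bandwidth / model_size_in_bytes

MiB = 1024**2
model_bytes = 45.89 * MiB   # IQ2_TN-quantized 99M TriLM model
l3_bytes = 64 * MiB         # Ryzen-7950X L3 cache

print("fits in L3:", model_bytes < l3_bytes)  # True: 45.89 MiB < 64 MiB

# Assumed, illustrative bandwidths -- not measured on this machine:
bandwidths = {"DRAM (dual-channel DDR5)": 60e9, "L3 cache": 1e12}
for name, bw in bandwidths.items():
    print(f"{name}: TG ceiling ~ {bw / model_bytes:,.0f} t/s")
```

Under the assumed DRAM bandwidth the ceiling would sit right around the measured ~1240 t/s; with the weights resident in L3 the ceiling is an order of magnitude higher, suggesting TG here is limited by compute rather than by the memory wall that throttles CPU TG on larger models.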

I also have an M2-Max laptop (the version with a 30-core GPU). Here is what we get:

| backend | threads | test | t/s |
| --- | --- | --- | --- |
| M2-Max CPU | 8 | pp1500 | 5209.27 ± 21.48 |
| M2-Max CPU | 2 | tg500 | 692.87 ± 1.74 |
| M2-Max CPU | 4 | tg500 | 841.48 ± 5.96 |
| M2-Max CPU | 8 | tg500 | 894.73 ± 10.03 |
| M2-Max GPU | 4 | pp1500 | 25824 ± 562 |
| M2-Max GPU | 4 | tg500 | 464.86 ± 3.85 |

Here too the GPU is faster for PP (but only 5X faster), while the CPU wipes the floor with the GPU for TG, beating it by nearly 2X using all 8 threads, and by 1.5X with just 2 threads!


🗣️ Discussion

👤 ikawrakow replied on 2024-09-02 at 13:20:54:

Now that we have an efficient Flash Attention (FA) implementation on the CPU via PR #32, we can again compare CPU and GPU performance for this tiny 99M parameter model. We get

| model | size | params | backend | ngl | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | pp1500 | 156827.38 ± 727 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CUDA | 100 | 1 | 1 | tg500 | 1496.37 ± 36.79 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | pp1500 | 12133.80 ± 51.45 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 0 | 16 | 1 | tg500 | 1509.52 ± 9.65 |

TG speed is now about the same, which is still quite remarkable.

FA has improved CPU prompt processing speed by almost 50%, TG by 22%.
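For readers who haven't seen it: the core of FA is an online softmax computed over tiles of K/V, so the full attention score matrix is never materialized and the working set stays cache-sized. A toy NumPy sketch of that idea for a single query vector (this illustrates the concept only, not the PR #32 code):

```python
import numpy as np

def flash_attention(q, K, V, tile=64):
    """Streaming (flash-style) attention for one query vector.

    Processes K/V in tiles while keeping a running max and softmax
    denominator (the online-softmax trick), so the full score vector
    over all n_kv positions never has to be exponentiated at once.
    """
    d = q.shape[-1]
    scale = 1.0 / np.sqrt(d)
    m = -np.inf            # running max of scores
    l = 0.0                # running softmax denominator
    acc = np.zeros(d)      # running weighted sum of values

    for start in range(0, K.shape[0], tile):
        k, v = K[start:start + tile], V[start:start + tile]
        s = (k @ q) * scale                # scores for this tile
        m_new = max(m, float(s.max()))
        corr = np.exp(m - m_new)           # rescale old partials to new max
        p = np.exp(s - m_new)
        l = l * corr + p.sum()
        acc = acc * corr + p @ v
        m = m_new

    return acc / l

# Sanity check against naive attention (8 = sqrt(d) with d = 64)
rng = np.random.default_rng(0)
q = rng.standard_normal(64)
K, V = rng.standard_normal((500, 64)), rng.standard_normal((500, 64))
w = np.exp((K @ q) / 8 - ((K @ q) / 8).max())
assert np.allclose(flash_attention(q, K, V), (w / w.sum()) @ V)
```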

👤 saood06 replied on 2025-04-02 at 10:36:44:

Is there a chance SpargeAttn could be implemented here? Code here, paper here.

If it could would it benefit speed on CPU?

👤 ikawrakow replied on 2025-04-02 at 13:44:09:

Other than the paper, is there any evidence that this works as advertised? If I did nothing but implement breakthroughs announced on arXiv, the day still wouldn't have enough hours.

👤 saood06 replied on 2025-04-03 at 00:24:39:

> Other than the paper, is there any evidence that this works as advertised?

Not really (there are multiple ComfyUI custom nodes that port support, but not much on people actually using it). The paper looked interesting to me and the idea makes sense, but their implementation looks premature. The same group put out SageAttention/SageAttention2, which has been widely adopted (mostly for image/video models) and whose performance matched the paper, whereas SpargeAttn has drawn interest but not much adoption because of the state of the implementation.

> If I did nothing but implement breakthroughs announced on arXiv, the day still wouldn't have enough hours.

Sorry.


👤 ikawrakow replied on 2024-09-08 at 07:16:59:

With PR #42 we get this:

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 12906.95 ± 61.04 |
| IQ2_BN - 2.06 bpw TriLM | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg512 | 1563.62 ± 12.55 |

I.e., 56% improvement for PP and 26% improvement for TG since the original post from Aug 13!

I see that PR-8151, which provides dedicated quantization types for the TriLM ternary models in mainline llama.cpp, has been merged. Here is what we get for TQ2_0, which corresponds to our IQ2_TN:

| model | size | params | backend | threads | fa | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | pp1500 | 5187.34 ± 11.69 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | pp1500 | 5281.54 ± 53.33 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 1 | tg500 | 1156.25 ± 18.14 |
| TQ2_0 - 2.06 bpw ternary | 45.89 MiB | 99.76 M | CPU | 16 | 0 | tg500 | 1041.27 ± 21.30 |

Our version is 2.44X faster for PP and 35% faster for TG.
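For context on the 2.06 bpw figure: a ternary weight needs only 2 bits, and with (for example) 256-weight blocks each carrying one fp16 scale, the total comes to (256·2 + 16)/256 = 2.0625 bits per weight. Below is a minimal packing sketch under exactly those assumptions; the real IQ2_TN and TQ2_0 layouts differ in their details:

```python
import numpy as np

BLOCK = 256  # assumed block size; (256*2 + 16) / 256 = 2.0625 bpw

def pack_ternary_block(w):
    """Pack one block of scaled ternary weights into 2-bit codes
    (0 -> -1, 1 -> 0, 2 -> +1) plus one fp16 scale.
    Illustrative layout only, not the actual IQ2_TN/TQ2_0 format."""
    assert w.size == BLOCK
    scale = np.abs(w).max() or 1.0
    codes = np.round(w / scale).astype(np.int8) + 1   # {-1,0,1} -> {0,1,2}
    packed = np.zeros(BLOCK // 4, dtype=np.uint8)
    for i in range(4):                                # 4 codes per byte
        packed |= (codes[i::4].astype(np.uint8) << (2 * i)).astype(np.uint8)
    return np.float16(scale), packed

def unpack_ternary_block(scale, packed):
    codes = np.empty(BLOCK, dtype=np.int8)
    for i in range(4):
        codes[i::4] = (packed >> (2 * i)) & 3
    return (codes - 1).astype(np.float32) * np.float32(scale)

w = np.random.default_rng(0).choice([-1.0, 0.0, 1.0], BLOCK) * 0.05
scale, packed = pack_ternary_block(w)
assert np.allclose(unpack_ternary_block(scale, packed), w, atol=1e-3)
print(f"{(packed.nbytes * 8 + 16) / BLOCK:.4f} bits per weight")  # 2.0625
```

With only 2-bit codes to decode and one multiply per block for the scale, dequantization is cheap enough that the matrix multiplications stay bandwidth-friendly, which is part of why these ternary types run so fast on CPU.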