
🔀 #1 - Offload Bitnet token embeddings to the GPU

Author: ikawrakow
State: Closed
Created: 2024-07-26
Updated: 2024-07-26

Description

This PR puts the token_embedding tensor on the GPU for the Bitnet-1.58b model. This results in significantly improved performance on CUDA and Metal, as can be seen in the table below. CUDA results are for an RTX-4080, Metal results are for a 30-core M2-Max GPU, and the host CPU for the CUDA runs is a Ryzen-7950X.

| model | backend | test | t/s (PR) | t/s (main) | Speedup |
|--------|---------|-------|----------------|----------------|---------|
| IQ2_BN | CUDA | tg128 | 322.10 ± 0.07 | 241.34 ± 0.27 | 1.325 |
| IQ1_BN | CUDA | tg128 | 301.44 ± 0.12 | 229.21 ± 0.89 | 1.315 |
| IQ2_BN | CUDA | pp512 | 10780 ± 164 | 9811 ± 25 | 1.099 |
| IQ1_BN | CUDA | pp512 | 10661 ± 172 | 9655 ± 21 | 1.104 |
| IQ2_BN | Metal | pp512 | 723.19 ± 0.53 | 722.66 ± 0.47 | 1.001 |
| IQ1_BN | Metal | pp512 | 698.25 ± 1.91 | 697.59 ± 2.12 | 1.000 |
| IQ2_BN | Metal | tg128 | 110.39 ± 0.13 | 95.22 ± 0.55 | 1.159 |
| IQ1_BN | Metal | tg128 | 76.70 ± 0.05 | 69.33 ± 0.07 | 1.106 |

Bitnet uses the same tensor for token embeddings and for output. When the token embedding tensor is specified to be on the CPU, as done in mainline llama.cpp and here before this PR, the final matrix multiplication with the output tensor is performed on the CPU even when using a GPU backend, which results in a significant drop in performance (the larger the performance gap between the GPU and the host CPU, the larger the effect). As this might affect other models as well (e.g., Gemma), it would be useful to find a more general solution, but I find the back-end code in llama.cpp opaque and hard to understand, so for now this is solved in a hacky way just for Bitnet.
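
To illustrate the idea, here is a minimal sketch of what such a change amounts to in `llm_load_tensors`-style loading code: instead of always creating `token_embd.weight` in the CPU-backed input context, Bitnet places it in the output context, which lives on the GPU when the output layer is offloaded. The context names (`ctx_input`, `ctx_output`) and the `LLM_ARCH_BITNET` check follow mainline llama.cpp conventions of the time and are assumptions, not the literal diff.

```cpp
// Sketch only -- not the literal PR diff. Assumes the llm_load_tensors()
// structure of mid-2024 llama.cpp, where ctx_input maps to a CPU buffer
// and ctx_output maps to the GPU buffer when the output layer is offloaded.
struct ggml_context * ctx_embd = ctx_input;
if (model.arch == LLM_ARCH_BITNET) {
    // Bitnet shares token_embd.weight with the output tensor, so keeping
    // it on the CPU forces the final matmul onto the CPU. Creating it in
    // the output context keeps that multiplication on the GPU.
    ctx_embd = ctx_output;
}
model.tok_embd = ml.create_tensor(ctx_embd, tn(LLM_TENSOR_TOKEN_EMBD, "weight"), {n_embd, n_vocab});
```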