
🔀 #168 - Falcon3 changes

Author ikawrakow
State Closed
Created 2025-01-10
Updated 2025-01-10

Description

Two changes:

  • Add pre-tokenizer for Falcon3 (same as llama3)
  • Use integer arithmetic to perform the summation of a row of activations for Q8_K16

The second change is required for the IQ2_BN_R4 4-row interleaved variant. The existing implementation simply sums the f32 values. This is fine for the original BitNet models and also for the TriLM ternary models, but for the Falcon3 ternary models I observe too large a difference between the GPU and CPU perplexity results. With this change the difference is greatly reduced, and IQ2_BN_R4 actually arrives at a slightly lower PPL than Microsoft's BitNet implementation (which is claimed to be "lossless").
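To make the second point concrete, here is a minimal sketch of the idea with a hypothetical block layout (illustrative only, not the actual Q8_K16 code in ik_llama.cpp): the int8 quants are summed exactly in int32 and the block scale is applied once at the end, instead of accumulating already-dequantized f32 values, where every addition can contribute rounding error.

```cpp
// Minimal sketch (hypothetical block layout, not the actual Q8_K16 code):
// summing a row of int8-quantized activations.
#include <cstdint>
#include <cstdio>

constexpr int kBlockSize = 256;

// Hypothetical block: kBlockSize int8 quants sharing one f32 scale.
struct BlockQ8 {
    float  d;                 // scale
    int8_t qs[kBlockSize];    // quantized values
};

// Old behaviour: dequantize each value to f32 and accumulate in f32.
float row_sum_f32(const BlockQ8 * x, int nblock) {
    float sum = 0.f;
    for (int i = 0; i < nblock; ++i)
        for (int j = 0; j < kBlockSize; ++j)
            sum += x[i].d * x[i].qs[j];   // rounding error can accumulate here
    return sum;
}

// New behaviour: accumulate the quants exactly in int32, apply the scale
// once per block, so the quant sum itself carries no rounding error.
float row_sum_int(const BlockQ8 * x, int nblock) {
    float sum = 0.f;
    for (int i = 0; i < nblock; ++i) {
        int32_t isum = 0;
        for (int j = 0; j < kBlockSize; ++j) isum += x[i].qs[j];
        sum += x[i].d * (float)isum;
    }
    return sum;
}

int main() {
    BlockQ8 blk;
    blk.d = 0.0137f;
    for (int j = 0; j < kBlockSize; ++j) blk.qs[j] = int8_t(j % 3 - 1);  // ternary -1/0/+1
    printf("f32 accumulation: %.8f\n", row_sum_f32(&blk, 1));
    printf("int accumulation: %.8f\n", row_sum_int(&blk, 1));
}
```

The integer path makes the per-row sum exact up to the final scale multiplication, which is presumably why the CPU/GPU perplexity gap shrinks for the Falcon3 ternary models.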


💬 Conversation

👤 ikawrakow commented on 2025-01-10 at 12:56:49:

Oh, here are some performance figures for IQ2_BN and Microsoft's BitNet I2_S quants, which are claimed to be the fastest CPU implementation of ternary transformer models. Tests were run on a Ryzen-7950X CPU.

After following the BitNet setup instructions:

```
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s
```

I'm finding that their e2e_benchmark.py Python script is not really working. Or, more precisely, it works, but gives dismal performance. With

```
python3 utils/e2e_benchmark.py -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -n 0 -p 512 -t 16
```

I get this:

| model | size | params | backend | threads | n_batch | test | t/s |
|---|---|---|---|---|---|---|---|
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 16 | 1 | pp512 | 22.15 ± 0.07 |

Hahaha. 22 t/s for PP-512? Fortunately for us, BitNet is just a thin wrapper around llama.cpp, so we can directly run the llama-bench tool that e2e_benchmark.py uses under the hood:

```
./build/bin/llama-bench -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -p 512 -n 128 -t 16
```

and we get

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 16 | pp512 | 187.90 ± 0.99 |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 8 | tg128 | 23.39 ± 0.05 |

In comparison, here is what we get with IQ2_BN (using -rtr 1 to interleave 4 rows when loading the model):

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 16 | pp512 | 465.85 ± 1.91 |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 8 | tg128 | 28.03 ± 0.04 |

So, 2.5X faster for PP-512 and ~20% better for TG-128 (both reach their maximum performance at 8 threads). TG-128 is of course memory bound, and the BitNet authors make claims about energy efficiency, so let's look at TG with fewer threads:

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 1 | tg128 | 9.64 ± 0.05 |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 2 | tg128 | 15.45 ± 0.04 |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 4 | tg128 | 22.21 ± 0.20 |

vs

| model | size | params | backend | threads | test | t/s |
|---|---|---|---|---|---|---|
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 1 | tg128 | 12.83 ± 0.24 |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 2 | tg128 | 22.46 ± 0.03 |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 4 | tg128 | 27.62 ± 0.05 |

OK. Now I can claim that IQ2_BN is almost 4X more energy efficient than BitNet, as we get (almost) the same performance at 2 threads as their maximum performance at 8 threads.
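For concreteness, the back-of-the-envelope arithmetic behind that claim, taking the number of busy threads at equal throughput as a rough proxy for energy use: IQ2_BN delivers 22.46 t/s on 2 threads, while I2_S needs 8 threads to reach 23.39 t/s, i.e. 8 / 2 = 4X fewer cores for 22.46 / 23.39 ≈ 96% of the throughput.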