### 🔀 [#168](https://github.com/ikawrakow/ik_llama.cpp/pull/168) - Falcon3 changes

| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-01-10 |
| **Updated** | 2025-01-10 |

---

#### Description

Two changes:

* Add pre-tokenizer for `Falcon3` (same as `llama3`; see the sketch after this list)
* Use integer arithmetic to perform the summation of a row of activations for `Q8_K16`
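
For illustration, the first change amounts to routing the `tokenizer.ggml.pre` GGUF string of Falcon3 models onto the existing `llama3` pre-tokenizer type, so the same BPE pre-tokenization regexes are reused. A minimal sketch of that idea; the enum and function names below are hypothetical stand-ins, not the actual identifiers in the PR:

```
#include <string>

// Hypothetical stand-ins for the real pre-tokenizer type enum and the
// dispatch on the string read from the tokenizer.ggml.pre GGUF key.
enum pre_type { PRE_TYPE_DEFAULT, PRE_TYPE_LLAMA3 };

// Falcon3 uses the same BPE pre-tokenization as Llama-3, so instead of
// adding a new type, "falcon3" is mapped to the existing llama3 one.
static pre_type pre_type_from_name(const std::string & name) {
    if (name == "llama3" || name == "llama-v3" || name == "llama-bpe" ||
        name == "falcon3") {  // the new case for Falcon3
        return PRE_TYPE_LLAMA3;
    }
    return PRE_TYPE_DEFAULT;
}
```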

The second change is required for the `IQ2_BN_R4` 4-row interleaved variant. The existing implementation simply sums up the `f32` values. This is fine with the original BitNet models and also with the TriLM ternary models, but with the Falcon3 ternary models I observe too large a difference between the GPU and the CPU perplexity results. With this change the difference is greatly reduced, and `IQ2_BN_R4` actually arrives at a slightly lower PPL than Microsoft's BitNet implementation (which is claimed to be "lossless").
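
To make the second change concrete, here is a minimal sketch, assuming a simplified block layout (the struct below is a hypothetical stand-in, not the real `Q8_K16` definition): the quants are accumulated exactly in integer arithmetic, and the `f32` scale is applied once per block, instead of summing dequantized `f32` values one by one.

```
#include <cstdint>
#include <cstddef>

// Hypothetical quantized-activation block; the real Q8_K16 layout differs.
struct block_q8 {
    float  d;        // block scale
    int8_t qs[256];  // quantized activation values
};

// Sum a row of activations using integer arithmetic: accumulate the
// int8 quants exactly in an int32, then apply the f32 scale once per
// block. A pure f32 running sum of dequantized values accumulates
// rounding error, which is the kind of difference that showed up as
// the CPU/GPU perplexity gap on the ternary Falcon3 models.
static float row_sum(const block_q8 * blocks, size_t nblock) {
    float sum = 0.0f;
    for (size_t ib = 0; ib < nblock; ++ib) {
        int32_t isum = 0;
        for (int j = 0; j < 256; ++j) {
            isum += blocks[ib].qs[j];
        }
        sum += blocks[ib].d * (float)isum;  // single f32 multiply per block
    }
    return sum;
}
```
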
---

#### 💬 Conversation

👤 **ikawrakow** commented on **2025-01-10** at **12:56:49**:<br>

Oh, here are some performance figures for `IQ2_BN` and Microsoft's [BitNet](https://github.com/microsoft/BitNet) `I2_S` quants, which are claimed to be the fastest CPU implementation of ternary transformer models. Tests were run on a Ryzen-7950X CPU.

After following the BitNet instructions:

```
git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet
conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
python setup_env.py --hf-repo tiiuae/Falcon3-7B-Instruct-1.58bit -q i2_s
```

I'm finding that their `e2e_benchmark.py` Python script is not really working. Or, more precisely, it runs, but gives dismal performance. With

```
python3 utils/e2e_benchmark.py -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -n 0 -p 512 -t 16
```

I get this:

| model | size | params | backend | threads | n_batch | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------: | ------------: | -------------------: |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 16 | 1 | pp512 | 22.15 ± 0.07 |

Hahaha. 22 t/s for PP-512? Fortunately for us, BitNet is just a thin wrapper around `llama.cpp`, so we can run the `llama-bench` tool, which `e2e_benchmark.py` uses under the hood, directly:

```
./build/bin/llama-bench -m models/Falcon3-7B-Instruct-1.58bit/ggml-model-i2_s.gguf -p 512 -n 128 -t 16
```

and we get

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 16 | pp512 | 187.90 ± 0.99 |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 8 | tg128 | 23.39 ± 0.05 |

In comparison, here is what we get with `IQ2_BN` (using `-rtr 1` to interleave 4 rows when loading the model):

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 16 | pp512 | 465.85 ± 1.91 |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 8 | tg128 | 28.03 ± 0.04 |

So, 2.5X for PP-512, and ~20% better for TG-128 (both achieve maximum performance at 8 threads). TG-128 is of course memory bound, and the BitNet authors make claims about energy efficiency, so let's look at TG with fewer threads:

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | -------------------: |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 1 | tg128 | 9.64 ± 0.05 |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 2 | tg128 | 15.45 ± 0.04 |
| llama 3B I2_S - 2 bpw ternary | 3.05 GiB | 7.46 B | CPU | 4 | tg128 | 22.21 ± 0.20 |

vs

| model | size | params | backend | threads | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 1 | tg128 | 12.83 ± 0.24 |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 2 | tg128 | 22.46 ± 0.03 |
| llama ?B IQ2_BN - 2.00 bpw Bitnet | 2.07 GiB | 7.46 B | CPU | 4 | tg128 | 27.62 ± 0.05 |

OK. Now I can claim that `IQ2_BN` is almost 4X more energy efficient than BitNet, as we get (almost) the same performance at 2 threads as their maximum performance at 8 threads.