### 🔀 [#79](https://github.com/ikawrakow/ik_llama.cpp/pull/79) - Do not quantize activations if not necessary
| **Author** | `ikawrakow` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2024-10-04 |
| **Updated** | 2024-10-04 |
---
#### Description
It has always bugged me that `ggml` unnecessarily repeats the "quantization" of activations when the corresponding matrix multiplication cannot be done directly. E.g., `Q`, `K` and `V` all multiply the input to the self-attention layer. Similarly, `ffn_up` and `ffn_gate` multiply the same activations for parallel FFNs. "Quantization" is in quotes because it also applies to `fp16` and `bf16` tensors when the matrix multiplication function used does not work directly with `fp32` activations. There are typically 7 tensors per layer in a transformer model, so basically 3 out of 7 "quantizations" are unnecessary (2 of the 3 for `Q`, `K`, `V`, and 1 of the 2 for `ffn_up`, `ffn_gate`).
This PR remedies this unfortunate situation by storing "quantized" activations in a dedicated part of the work buffer (so the data cannot be trashed by other ops that also need a work buffer), and by remembering the name of the last tensor that was quantized. I was hoping that by avoiding the unnecessary quantization we could also skip the thread synchronization barrier that we have in `ggml_compute_forward_mul_mat` after quantization. I guess I'm missing something, because skipping the barrier may hang the inference pipeline, so for now the barrier is still there.
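To make the idea concrete, here is a minimal sketch of the "remember the last quantized tensor" check. It is not the code from this PR and does not use any `ggml` API: `quant_cache`, `convert_if_needed`, `count_layer_conversions`, and the activation names in `inputs` are all hypothetical, chosen only to illustrate the counting. The actual conversion into the dedicated buffer is left as a comment.

```c
#include <assert.h>
#include <string.h>

#define NAME_LEN 64

/* Hypothetical cache state: which activation tensor was converted last. */
typedef struct {
    char last_name[NAME_LEN]; /* name of the last converted source tensor */
    int  last_type;           /* target type of that conversion */
    int  n_conversions;       /* conversions actually performed so far */
} quant_cache;

static void cache_init(quant_cache *c) {
    c->last_name[0]  = '\0';
    c->last_type     = -1;
    c->n_conversions = 0;
}

/* Returns 1 if a conversion was performed, 0 if the cached result was reused. */
static int convert_if_needed(quant_cache *c, const char *src_name,
                             int type, int use_cache) {
    assert(c != NULL && src_name != NULL);
    if (use_cache && c->last_type == type &&
        strcmp(c->last_name, src_name) == 0) {
        return 0; /* same activations, same target type: reuse the buffer */
    }
    /* A real implementation would write the converted rows into the
       dedicated part of the work buffer here. */
    strncpy(c->last_name, src_name, NAME_LEN - 1);
    c->last_name[NAME_LEN - 1] = '\0';
    c->last_type = type;
    c->n_conversions++;
    return 1;
}

/* Walks the 7 matrix multiplications of one layer and counts conversions.
   The names below are assumed labels for the activations each weight
   tensor multiplies, in the order Q, K, V, attn_output, ffn_up, ffn_gate,
   ffn_down. */
static int count_layer_conversions(int use_cache) {
    static const char *inputs[7] = {
        "attn_norm", "attn_norm", "attn_norm", /* Q, K, V        */
        "kqv_out",                             /* attn_output    */
        "ffn_norm", "ffn_norm",                /* ffn_up, ffn_gate */
        "ffn_par"                              /* ffn_down       */
    };
    quant_cache c;
    cache_init(&c);
    for (int i = 0; i < 7; ++i) {
        convert_if_needed(&c, inputs[i], /*type=*/0, use_cache);
    }
    return c.n_conversions;
}
```

Under these assumptions, `count_layer_conversions(0)` performs 7 conversions and `count_layer_conversions(1)` performs 4, which matches the "3 out of 7" estimate in the description. Comparing by name only works if nothing else overwrites the stored data between uses, which is why the description keeps it in a dedicated part of the work buffer.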
Quantization takes a relatively small fraction of the overall graph evaluation time, so performance gains are typically in the ~1% range. But for a `bf16` model with a long context I'm finding a non-trivial performance improvement when running on a CPU with native `bf16` support (Ryzen-7950X). Here is a comparison for LLaMA-3.1-8B with a context of 8192 tokens:

| model | size | params | backend | threads | type_k | type_v | fa | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | ------: | -----: | -----: | -: | ------------: | ---------------: |
| llama 8B BF16 (main) | 14.96 GiB | 8.03 B | CPU | 16 | bf16 | bf16 | 1 | pp8192 | 178.64 ± 0.69 |
| llama 8B BF16 (PR) | 14.96 GiB | 8.03 B | CPU | 16 | bf16 | bf16 | 1 | pp8192 | 188.28 ± 0.49 |

A 5.4% gain in performance is nothing to sneeze at, especially considering how minor the necessary code change is.