### 🔀 [#441](https://github.com/ikawrakow/ik_llama.cpp/pull/441) - Trellis quants with CPU inference

| **Author** | `andrewkchan` |
| :--- | :--- |
| **State** | ❌ **Closed** |
| **Created** | 2025-05-20 |
| **Updated** | 2025-05-23 |

---

#### Description

As requested a while ago, this takes https://github.com/ikawrakow/ik_llama.cpp/pull/113 and adds CPU implementations of the quantized matmuls (via `iqk_mul_mat`) for inference. AVX2 and F16C support are required.

As predicted, the CPU ops are very slow. For Llama-3.1-8B-Instruct, I get ~~0.3~~ 4.83 t/s with `IQ2_KT` compared to ~~>1.0~~ 4.59 t/s with F16 on an AMD EPYC 7R32 (32 cores). Note that I am not a SIMD expert and have spent only moderate time on optimizations (e.g. basic use of AVX2/F16C, flattening of the trellis iterations), so it may be possible to speed things up. I also have not added implementations for `HAVE_FANCY_SIMD`. Additionally, there are only matmuls for F32 activations, as that is what the 3INST algorithm returns (as pointed out in the original PR description).

I am not sure of the PR practices here - if you'd like me to merge into https://github.com/ikawrakow/ik_llama.cpp/pull/113 rather than the main branch, I'm happy to change that. I also tried to clean up some of the comments / dead code in the WIP branch, but can revert those changes as well.

- [x] I have read the [contributing guidelines](https://github.com/ggerganov/llama.cpp/blob/master/CONTRIBUTING.md)
- Self-reported review complexity:
  - [ ] Low
  - [x] Medium
  - [ ] High

---

#### 💬 Conversation

👤 **ikawrakow** commented the **2025-05-21** at **07:13:48**:
> For Llama-3.1-8B-Instruct, I get 0.3 t/s with IQ2_KT compared to >1.0 t/s with F16 on AMD EPYC 7R32 (32 cores)

Is this in debug mode? I'm getting 10.4 t/s for `IQ2_KT` on my 16-core Ryzen-7950X CPU. Which (as expected) is slow for a 2-bit quantized 8B model, but still in the acceptable range.

---

👤 **andrewkchan** commented the **2025-05-21** at **07:17:47**:
I'm compiling with `cmake --build ./build --config Release -j $(nproc)`. I might need to tweak the number of threads; I've found this greatly impacts performance on my test machine in the past for llama.cpp.

Here's how I'm testing:

```
alias ik-build='cmake --build ./build --config Release -j $(nproc)'
ik-build && ./build/bin/llama-cli -m ../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-IQ2_KT-2.gguf -cnv -p "You are a helpful assistant" -ngl 0 -c 4096
```
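Since thread count was suspected to matter here, one way to find a good value is a sweep with `llama-bench`, which accepts comma-separated values for `-t`. This is only a sketch: the binary path and model path are assumptions carried over from the build command above, and the thread counts to try are illustrative for a 32-core machine.

```shell
# Hypothetical thread-count sweep; paths are assumptions from the command above.
MODEL=../Llama-3.1-8B-Instruct/Llama-3.1-8B-Instruct-IQ2_KT-2.gguf
BENCH=./build/bin/llama-bench

if [ -x "$BENCH" ]; then
    # llama-bench takes comma-separated -t values and reports pp/tg t/s per count
    "$BENCH" -m "$MODEL" -t 8,16,24,32
else
    echo "llama-bench not found; build it first with: cmake --build ./build --config Release -j $(nproc)"
fi
```

Comparing the prompt-processing and token-generation rows across thread counts should show whether the run is compute-bound or oversubscribed.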