llama.cpp clone with better CPU performance

License: MIT


Table of Contents
  1. Description
  2. Usage
  3. Contributing
  4. Coding guidelines
  5. Docs

TL;DR

This repository is a clone of llama.cpp with the following improvements:

  • Better implementation of CPU matrix multiplications (AVX2 and ARM_NEON) for fp16/fp32 and all k-, i-, and legacy llama.cpp quants, which leads to a significant improvement in prompt processing (PP) speed. Token generation (TG) also benefits, but to a lesser extent because TG is memory-bound (a simplified sketch of the block-tiling idea is given after this list)
  • Implementation of the Bitnet b1.58 model for the CPU (AVX2 and ARM_NEON) and GPU (CUDA and Metal)
  • Faster CPU inference for MoE models
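
The gist of the speed improvement is classic register/cache block-tiling applied to the quantized matrix multiplications. Below is a much-simplified, hypothetical sketch of the idea for plain fp32; the tile sizes and function name are made up for illustration, and the actual code in this repository works directly on quantized blocks using AVX2/ARM_NEON intrinsics.

```cpp
#include <algorithm>
#include <cstddef>

// Illustration only: a block-tiled fp32 GEMM. The real kernels in this repo
// operate on k-/i-/legacy quantized blocks with SIMD intrinsics; TILE_M/TILE_N
// and matmul_tiled are hypothetical names chosen for this sketch.
constexpr std::size_t TILE_M = 4;  // rows of C accumulated per tile
constexpr std::size_t TILE_N = 4;  // columns of C accumulated per tile

void matmul_tiled(const float* A, const float* B, float* C,
                  std::size_t M, std::size_t N, std::size_t K) {
    // C[M x N] = A[M x K] * B[K x N], all row-major.
    std::fill(C, C + M * N, 0.0f);
    for (std::size_t i0 = 0; i0 < M; i0 += TILE_M) {
        for (std::size_t j0 = 0; j0 < N; j0 += TILE_N) {
            const std::size_t i_end = std::min(i0 + TILE_M, M);
            const std::size_t j_end = std::min(j0 + TILE_N, N);
            // Accumulate a TILE_M x TILE_N block of C while streaming once
            // through K. Each loaded element of A and B is reused across the
            // whole tile, which is what buys the PP speedup over a plain
            // row-by-row dot-product loop.
            for (std::size_t k = 0; k < K; ++k) {
                for (std::size_t i = i0; i < i_end; ++i) {
                    const float a = A[i * K + k];
                    for (std::size_t j = j0; j < j_end; ++j) {
                        C[i * N + j] += a * B[k * N + j];
                    }
                }
            }
        }
    }
}
```

For TG the activation matrix has a single row, so the tile degenerates to a handful of dot products and the computation is limited by how fast the weights can be streamed from memory, which is why TG gains much less than PP.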

If you are not already familiar with llama.cpp, it is better to start there. For those who are, everything works the same as in mainline llama.cpp (or at least the way it worked when I last synced on June 21).

Note that I have published some, but not all, of the code in this repository in a series of llamafile PRs (394, 405, 428, 435, 453, and 464).

Why

Mostly out of curiosity:

  • Justine Tunney's tinyBLAS, which she contributed to llama.cpp in PR 6414, only works for Q4_0, Q8_0, and fp16/bf16 models. In the surrounding discussion about possibly extending tinyBLAS to k- and i-quants, she felt that k-quants are not amenable to block-tiling, which is required to improve performance. This statement piqued my curiosity, so here we are.
  • Bitnet-1.58b has been one of the most discussed topics in the llama.cpp project, so eventually I decided to see how efficiently one can implement a ternary model.

Curiosity aside, improved CPU performance may be (or may become) important in practice. According to The Register, 70% of AI inference is done on the CPU, at least in the Android world. With the ever-increasing number of LLM parameters, and with Meta's 400B model release imminent, the CPU may become the only option for people not willing (or not able) to rent/buy uber-expensive GPU instances capable of running such models. Granted, one would need a pretty beefy computer to run a 400B model, and inference speed will be sluggish, but at least one will not need to spend the equivalent of a luxury apartment in the downtown of the city where I live.

Bitnet-1.58B

Two implementations are provided:

  • IQ1_BN - uses 1.625 bits-per-weight (bpw)
  • IQ2_BN - uses 2.0 bpw

IQ2_BN is faster for PP. IQ1_BN can reach higher TG performance on the CPU (given enough threads), but it is always slower on the GPU.
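
To put the two bpw figures in context: a ternary weight carries at most log2(3) ≈ 1.585 bits of information, so 1.625 bpw sits close to the theoretical limit, while 2.0 bpw trades a little extra storage for much simpler unpacking (exactly four weights per byte). The snippet below only illustrates the 2-bits-per-weight arithmetic; it is not the actual IQ2_BN storage layout, and the function names are invented for this sketch.

```cpp
#include <cassert>
#include <cstdint>

// Illustration only: four ternary weights {-1, 0, +1} packed into one byte at
// 2 bits each (stored as w + 1, i.e. 0, 1, 2). Not the real IQ2_BN format.
inline uint8_t pack4_ternary(const int8_t w[4]) {
    uint8_t packed = 0;
    for (int i = 0; i < 4; ++i) {
        assert(w[i] >= -1 && w[i] <= 1);
        packed |= static_cast<uint8_t>(w[i] + 1) << (2 * i);
    }
    return packed;
}

inline void unpack4_ternary(uint8_t packed, int8_t w[4]) {
    for (int i = 0; i < 4; ++i) {
        w[i] = static_cast<int8_t>((packed >> (2 * i)) & 0x3) - 1;
    }
}
```

Getting below 2.0 bpw requires a denser encoding (for instance, 3^5 = 243 ≤ 256, so five ternary values fit in one byte), which costs extra arithmetic to decode; that extra unpacking work is presumably why IQ1_BN loses to IQ2_BN for PP despite the smaller memory footprint.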

There is an unmerged PR 8151 in llama.cpp that implements Bitnet-1.58B for the CPU (AVX and ARM_NEON). The following table compares the performance of this repo and PR 8151 in llama.cpp.

Performance comparison to llama.cpp

The results in the following table are obtained with the following parameters:

  • Model is LLaMA-v3-8B
  • The AVX2 CPU is a 16-core Ryzen-7950X
  • The ARM_NEON CPU is M2-Max
  • tinyBLAS is enabled in llama.cpp
  • llama.cpp results are for build 081fe431 (3441), which was the master branch as of July 23, 2024.

MoE models

There is PR 6840 from Justine Tunney in llama.cpp, but it has been open since April 23 without being merged, so I'll compare performance against the master branch for Mixtral-8x7B.

To tile or not to tile