llama.cpp clone with better CPU performance
Table of Contents
- Description
- Usage
- Get the Code
- Build
- BLAS Build
- Prepare and Quantize
- Run the quantized model
- Memory/Disk Requirements
- Quantization
- Interactive mode
- Constrained output with grammars
- Obtaining and using the Facebook LLaMA 2 model
- Seminal papers and background on the models
- Perplexity (measuring model quality)
- Android
- Docker
- Contributing
- Coding guidelines
- Docs
TL;DR
This repository is a clone of llama.cpp with the following improvements:
- Better implementation of CPU matrix multiplications (`AVX2` and `ARM_NEON`) for `fp16`/`fp32` and all k-, i-, and legacy `llama.cpp` quants, which leads to a significant improvement in prompt processing (PP) speed. Token generation (TG) also benefits, but to a lesser extent because TG is memory bound (see the illustrative sketch after this list)
- Implementation of the Bitnet b1.58 model for the CPU (`AVX2` and `ARM_NEON`) and GPU (`CUDA` and `Metal`)
- Faster CPU inference for MoE models
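To give a flavor of what the improved CPU matrix multiplications are about, here is a minimal, hypothetical sketch of a block-tiled `fp32` GEMM micro-kernel. It is not the code in this repository (the actual kernels are written with `AVX2`/`ARM_NEON` intrinsics and operate directly on the quantized blocks); it only illustrates the register-tiling idea that such kernels rely on.

```cpp
#include <cstdio>
#include <vector>

// Toy block-tiled GEMM: C[MxN] += A[MxK] * B[KxN], all row-major fp32.
// The point of the MR x NR tile is data reuse: the tile of C stays in
// registers while we stream through K, so every loaded A/B value takes
// part in several multiply-adds before being evicted.
constexpr int MR = 4;   // rows of C computed per tile
constexpr int NR = 4;   // columns of C computed per tile

void gemm_tiled(int M, int N, int K, const float* A, const float* B, float* C) {
    for (int i0 = 0; i0 < M; i0 += MR) {
        for (int j0 = 0; j0 < N; j0 += NR) {
            float acc[MR][NR] = {};                        // the "register" tile
            for (int k = 0; k < K; ++k) {
                for (int i = 0; i < MR && i0 + i < M; ++i) {
                    const float a = A[(i0 + i)*K + k];     // loaded once, reused NR times
                    for (int j = 0; j < NR && j0 + j < N; ++j) {
                        acc[i][j] += a * B[k*N + j0 + j];
                    }
                }
            }
            for (int i = 0; i < MR && i0 + i < M; ++i)
                for (int j = 0; j < NR && j0 + j < N; ++j)
                    C[(i0 + i)*N + j0 + j] += acc[i][j];
        }
    }
}

int main() {
    const int M = 8, N = 8, K = 16;
    std::vector<float> A(M*K, 1.0f), B(K*N, 2.0f), C(M*N, 0.0f);
    gemm_tiled(M, N, K, A.data(), B.data(), C.data());
    std::printf("C[0] = %g (expected %g)\n", C[0], 2.0f*K);
}
```

This kind of tiling mainly helps the compute-bound GEMM of prompt processing; token generation is essentially a GEMV over the whole model per token, so it remains limited by memory bandwidth no matter how good the kernel is.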
If you are not already familiar with llama.cpp, it is better to start there. For those who are, everything here works the same as in llama.cpp (or at least the way llama.cpp worked when I last synced on June 21).
Note that I have published some, but not all, of the code in this repository in a series of llamafile PRs (394, 405, 428, 435, 453, and 464).
Why
Mostly out of curiosity:
- Justine Tunney's `tinyBLAS`, which she contributed to `llama.cpp` in PR 6414, only works for `Q4_0`, `Q8_0` and `fp16`/`bf16` models. In the surrounding discussion about possibly extending `tinyBLAS` to k- and i-quants, she felt that k-quants are not amenable to block-tiling, which is required to improve performance. This statement piqued my curiosity, so here we are.
- Bitnet-1.58b has been one of the most discussed topics in the `llama.cpp` project, so eventually I decided to see how efficiently one can implement a ternary model.
Curiosity aside, improved CPU performance may be (or may become) important in practice. According to The Register, 70% of AI inference is done on the CPU, at least in the Android world. With the ever increasing number of LLM parameters, and with Meta's 400B model release imminent, the CPU may become the only option for people not willing (or not able) to rent/buy uber-expensive GPU instances capable of running such models. Granted, one would need a pretty beefy computer to run a 400B model, and inference speed will be sluggish, but at least one will not need to spend the equivalent of a luxury apartment in the downtown of the city where I live.
Bitnet-1.58B
Two implementations are provided:
- `IQ1_BN` - uses 1.625 bits-per-weight (bpw)
- `IQ2_BN` - uses 2.0 bpw
`IQ2_BN` is faster for PP. `IQ1_BN` can reach higher TG performance on the CPU (given enough threads), but is always slower on the GPU.
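The bpw numbers come from how the ternary weights (-1, 0, +1) are packed. The sketch below is a generic 2-bit packing, not the actual `IQ2_BN` data layout, but it shows where the 2.0 bpw figure comes from: one 2-bit code per weight, four weights per byte. (`IQ1_BN`'s 1.625 bpw is arithmetically 13 bits per 8 weights, i.e. a denser encoding that is not shown here.)

```cpp
#include <cstdint>
#include <cstdio>
#include <vector>

// Toy 2-bit ternary packing (NOT the actual IQ2_BN layout): map each weight
// in {-1, 0, +1} to the 2-bit code {0, 1, 2} and store four codes per byte,
// i.e. exactly 2.0 bits per weight.
std::vector<uint8_t> pack_ternary(const std::vector<int8_t>& w) {
    std::vector<uint8_t> packed((w.size() + 3) / 4, 0);
    for (size_t i = 0; i < w.size(); ++i) {
        const uint8_t code = uint8_t(w[i] + 1);            // -1,0,+1 -> 0,1,2
        packed[i / 4] |= code << (2 * (i % 4));
    }
    return packed;
}

std::vector<int8_t> unpack_ternary(const std::vector<uint8_t>& packed, size_t n) {
    std::vector<int8_t> w(n);
    for (size_t i = 0; i < n; ++i) {
        const uint8_t code = (packed[i / 4] >> (2 * (i % 4))) & 3;
        w[i] = int8_t(code) - 1;                           // 0,1,2 -> -1,0,+1
    }
    return w;
}

int main() {
    const std::vector<int8_t> w = {-1, 0, 1, 1, 0, -1, -1, 1};
    const auto packed = pack_ternary(w);                   // 8 weights -> 2 bytes
    const auto back   = unpack_ternary(packed, w.size());
    std::printf("%zu weights in %zu bytes (%.3f bpw)\n",
                w.size(), packed.size(), 8.0 * packed.size() / w.size());
    for (const auto v : back) std::printf("%d ", v);       // round-trips to the original
    std::printf("\n");
}
```

A real kernel would of course not unpack one weight at a time but would process whole blocks with SIMD, combining the unpacking with the dot product against the quantized activations; the sketch is only meant to make the bpw arithmetic concrete.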
There is the unmerged PR 8151 in `llama.cpp` that implements Bitnet-1.58B for the CPU (`AVX` and `ARM_NEON`). The following table compares performance between this repo and PR-8151 in `llama.cpp`.
Performance comparison to llama.cpp
The results in the following table were obtained with these parameters:
- Model is LLaMA-v3-8B
- The `AVX2` CPU is a 16-core Ryzen-7950X
- The `ARM_NEON` CPU is M2-Max
- `tinyBLAS` is enabled in `llama.cpp`
- `llama.cpp` results are for `build: 081fe431 (3441)`, which is the master branch as of July 23 2024
MoE models
There is PR-6840 from Justine Tunney in `llama.cpp`, but it has been open since April 23 without being merged, so I compare performance against the master branch for Mixtral-8x7B.