We have 4 groups of 16 in a block of 64 quants.
For each group of 16 we have 3 groups of 5, each using 8 bits.
The remaining 16'th quants of the 4 groups of 16 are encoded
with 8 bits using the same encoding as the groups of 5.
The only kernel where we have complications is the CUDA dequantize
kernel (because we are dequantizing 8 quants there, and we have
different encoding for the 1st and 2nd group of 8 in a group of 16).
Ths achieves better performance on all tested platforms than
any previous 1.625 bpw attempt. We have:
| model | size | params | backend | threads | test | t/s |
| ---------------- | ---------: | ---------: | ---------- | ------: | ------------: | ---------------: |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | pp512 | 9613.02 ± 24.54 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | CUDA | 8 | tg128 | 229.85 ± 0.33 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | pp512 | 322.59 ± 1.00 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 16 | tg128 | 59.79 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 8 | tg128 | 57.62 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 4 | tg128 | 33.66 ± 0.29 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | AVX2 | 2 | tg128 | 18.30 ± 0.01 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | pp512 | 698.13 ± 0.21 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | Metal | 8 | tg128 | 68.88 ± 0.24 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | pp512 | 196.80 ± 0.50 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 8 | tg128 | 51.58 ± 0.41 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 4 | tg128 | 30.80 ± 0.03 |
| 1.625 bpw Bitnet | 729.64 MiB | 3.32 B | NEON | 2 | tg128 | 16.89 ± 0.01 |
It is still slower than 2 bpw Bitnet, but the difference now is not as
dramatic.
The AVX2 implementation was the only one left using it, so
I decided to see if we can get a performant implementation
using the 0,1,2 lookup table. Turns out we can, and it is
even slightly faster than the sign based table. We now
get PP-512 = 275 t/s and TG-128 = 57.7 t/s with 16 threads
on the Ryzen-7950X.
With only one lookup table left for iq1_bn, I renamed it to
iq1bn_grid_u16.
Faster on CUDA. The scalar version is faster too.
The issue with CUDA is that now I see wild performance
fluctuations. Running llama-bench I can get 220 t/s
for TG-128 one time, and 190 t/s another time, with
uncertaintiers of 1-2 t/s. Same for PP, results are
jumping back-and-fort between ~9500 t/s and ~8900 t/s.
So, basically no reliable measurement at this point,
but for sure faster than the previous version, which was
at around 170-180 t/s.
Use 3 bits for the exponent and 5 bits for the mantissa.
This makes PPL to be the same as fp16 (but the previous
version with 4 bits for the exponent and mantissa was
good enough for any practical purposes).
We get PP-512 = 9600 t/s, TG-128 = 234 t/s
(but we need to use 8 CPU threads, else results are lower,
so clearly there is something being computed on the CPU).
PP-512 is very close to PP-512(fp16) = 9800 t/s
* DRAFT: Introduction of CUDA Graphs to LLama.cpp
* FIx issues raised in comments
* Tidied to now only use CUDA runtime (not mixed with driver calls)
* disable for multi-gpu and batch size > 1
* Disable CUDA graphs for old GPU arch and with env var
* added missing CUDA_CHECKs
* Addressed comments
* further addressed comments
* limit to GGML_ALLOW_CUDA_GRAPHS defined in llama.cpp cmake
* Added more comprehensive graph node checking
* With mechanism to fall back if graph capture fails
* Revert "With mechanism to fall back if graph capture fails"
This reverts commit eb9f15fb6fcb81384f732c4601a5b25c016a5143.
* Fall back if graph capture fails and address other comments
* - renamed GGML_ALLOW_CUDA_GRAPHS to GGML_CUDA_USE_GRAPHS
- rename env variable to disable CUDA graphs to GGML_CUDA_DISABLE_GRAPHS
- updated Makefile build to enable CUDA graphs
- removed graph capture failure checking in ggml_cuda_error
using a global variable to track this is not thread safe, but I am also not safistied with checking an error by string
if this is necessary to workaround some issues with graph capture with eg. cuBLAS, we can pass the ggml_backend_cuda_context to the error checking macro and store the result in the context
- fixed several resource leaks
- fixed issue with zero node graphs
- changed fixed size arrays to vectors
- removed the count of number of evaluations before start capturing, and instead changed the capture mode to relaxed
- removed the check for multiple devices so that it is still possible to use a single device, instead checks for split buffers to disable cuda graphs with -sm row
- changed the op for checking batch size to GGML_OP_ADD, should be more reliable than GGML_OP_SOFT_MAX
- code style fixes
- things to look into
- VRAM usage of the cudaGraphExec_t, if it is significant we may need to make it optional
- possibility of using cudaStreamBeginCaptureToGraph to keep track of which ggml graph nodes correspond to which cuda graph nodes
* fix build without cuda graphs
* remove outdated comment
* replace minimum cc value with a constant
---------
Co-authored-by: slaren <slarengh@gmail.com>
* ggml : group all experts in a single ggml_mul_mat_id
cuda : improve mmid row copy
* cuda : fix bin bcast with non-cont src0
* test-backend-ops : only run all mul mat tests for base types
* llama : disable moe offloading with SYCL
---------
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* Add Command R Plus GGUF
* Add Command R Plus GGUF
* Loading works up to LayerNorm2D
* Export new tensors in 1D so they are not quantized.
* Fix embedding layer based on Noeda's example
* Whitespace
* Add line
* Fix unexpected tokens on MPS. Re-add F16 fix. ((Noeda)
* dranger003: Fix block index overflow in CUDA dequantizing.
* Reverted blocked multiplication code as it still has issues and could affect other Llama arches
* export norms as f32
* fix overflow issues during quant and other cleanup
* Type convention
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* dranger003: Fix more int overflow during quant.
---------
Co-authored-by: S <seast@Ss-Mac-Studio.local>
Co-authored-by: S <s@example.com>
Co-authored-by: slaren <slarengh@gmail.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
* iq1_m: basics
* iq1_m: basics-2
* iq1_m: CUDA dequantize works
Very 1st shot I get PPL = 9.76 for LLaMA-v2-7B.
* iq1_m: separate shifts for each group of 8 in a block
We get
PPL(LLaMA-v2-7B ) = 9.2810
PPL(LLaMA-v2-13B) = 6.8105
Not bad, but slightly higher than
sqrt(PPL(IQ1_S) * PPL(IQ2_XXS))
which is the expected outcome given that IQ1_M is
halfway between IQ1_S and IQ2_XXS in terms of bpw.
From this, we would expect
PPL = 9.14 for LLaMA-v2-7B
PPL = 6.63 for LLaMA-v2-13B
* iq1_m: go to 3-bit scales
There is slight increase in PPL, but the 0.0625 bpw reduction
in size is totally worth it.
We now have
PPL(LLaMA-v2-7B ) = 9.4469 at 1.96 bpw
PPL(LLaMA-v2-13B) = 6.8717 at 1.93 bpw
PPL(LLaMA-v2-70B) = 4.8568 at 1.85 bpw
* iq1_m: scalar dot product
* iq1_m: AVX2 dot product
* iq1_m: very slightly faster AVX2 dot product
* iq1_m: ARM_NEON dot product
Works, but very slow (10.5 t/s)
* iq1_m: Metal - dequantize works, dot product does not
* iq1_m: Metal now works
About the same performance as iq1_s.
* iq1_m: minor
* iq1_m: checking pure iq1_m quantization
It is pretty bad: PPL(LLaMA-v2-7B) = 34 if we quantize output.weight
with Q4_K.
* iiq1_m: slightly faster ARM_NEON dot product
10.5 t/s -> 11.65 t/s
* iq1_m: faster ARM_NEON dot product
11.65 t/s -> 14.9 t/s
* iq1_m: another minor ARM_NEON dot product improvement
14.9 -> 15.0 t/s
* iq1_m: small PPL improvement via super-block scale adjustment
After quantizing block scales redo the super-block scale fit.
PPL(LLaMA-v2-7B ) = 9.3346
PPL(LLaMA-v2-13B) = 6.8419
PPL(LLaMA-v2-70B) = 4.8294
PPL(Mistral-7B ) = 8.1624
* iq1_m: adapt to CUDA refactoring
* iq1_m: remove unused variable
We have progressed to warnings being errors.
* iq1_m: add to backend-ops tests
* iq1_m: fix Windows ARM
* iq1_m: use common definition of iq1m_scale_t
* cuda: assert -> NO_DEVICE_CODE
* iq1_M: PR comments
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>