For the last layer we need to use the input for the output.weight
tensor. Last layer(s) tend(s) to be important, so it is useful to also
have its influence metric.
* Metal: WIP to update Metal FA implementation
Dk=192, Dv=128 works, but not Dk = 576, Dv = 512
* Metal FA: go to float
* WIP
* Metal FA: MLA options now all work
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* MoE improvements on Metal
This version beats mainline, there are things I don't understand:
* Mianline has effectively gone to GEMV for MUL_MAT_ID. We can do the
same, but we are 30% slower. Why?
* Using actual GEMM, we beat mainline with ubtach size of 128. But then
performance degrades. Why?
* Some cleanup
* Much better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq3_k: slightly better quantization
Not much of a difference for most models, but this change
avoids what it looks like a catastrophic failure for DeepSeek-Lite
(PPL is now 7.041 vs 7.314 on main).
* Small improvement for type-1 quants
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make sure no interleaved quants are being used for token embeddings
also with `--pure` and/or `--custom-q`.
* Simplify
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better make_qx_quants
Tested with q4_0 and q3_K (pure, imatrix), and the improvement is
quite significant.
* Sae for iq4_nl, iq4_xs
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* q8_0 without bells and wistles works
* It works for q8_0
* Use bf16 instead of f16,int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* Add check if selected type is possible with --pure
I often want to quantize with --pure to see quantization performance
without quantization mixes. But for models where there qre tensors
with row sizes that are not multiple of 256, this results in a crash
for k- and i-quants. Hence, lets add a check if the quant selected
via --pure is applicable, and change it if not.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP - not working
* q8_0 without bells and wistles works
* It works for q8_0
* Use bf16 instead of f16,int16
* q4_0_r8
* q5_0_r4
* q6_0_r4
* Also q4_1 and q5_1
* q8_0_r8 on avx2
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama-bench: enable having different number of threads for tg and pp
* Add -tgb to usage
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Make fused MoE reproducible
As a bonus, peak performance at pp2048 with u_batch = 2048 is
~8% better.
* Slightly better
* Also do it for non-fused mul_mat_id
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Improve DeepSeek batched processing speed
* Revert the commented out section in iqk_mul_mat.cpp
It does have some benefit at long contexts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding ability to use THP on Linux
* Use the actual page size4 used for mmap also in munmap
* Add -thp to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP Gemma3: not working
* gemma3: build_gemma3 seems to be working now
* Revert changes to convert_hf_to_gguf.py
It wasn't working, so I guess, it is better to leave the
conversion up tp upstream.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool
* WIP
* Fixed various issues
As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
* FlashMLA-3: the best of both worlds - CPU only
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack a model with the quantize tool
* WIP
* Fixed various issues
As we don't have a way to tell if a repacked quant has been modified,
I had to remove the modification at the expense of a slight decrease
in performance. This affects q8_0_r8, q8_KV_r8, q8_k_r8 on Zen4, and
q4_0_r8 on ARM.
* Create wk_b and wv_b as Q8_0_R8 if the wkv_b type is interleaved
* Fix GCC 13.3 compilation error
* Another one
* Add missing include
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
I broke it with PR #265. I was testing with a model where
the wk_b and wk_v tensors were present, so didn't need to be computed,
so didn't notice that the change I made to ggml_compute_forward_dup_q
breaks that computation.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller computer buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.
* FlashMLA-2: avoid conversions to f32 also on CUDA
* Be able to compute for more than 65535 tokens
On CUDA just a quick hack that allows us to cancatenate tensors
with more than 65535 rows along zroth dimension as needed by
FlashMLA-2. Also needed some care in the perplexity tool to
avoid int overflows when evaluating the computed logits.
* Reduce memory usage for FlashMLA-2
Oh, also fix int overflow in the CUDA concat implementation.
It is funny how the llama.cpp 64-bit police has gone (almost) everywhere
and replaced 32-bit ints with 64-bit ints, needed or not,
but hasn't done it where it is actually needed.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* FlashMLA-2: eliminate intermediate f32 tensors
This works on the CPU. PP performance is ~13% better for 16k tokens
and compute buffer is quite a bit smaller.
* FlashMLA-2: enable fast path only on the CPU for now
I did implement the necessary ops on CUDA, but something is
still wrong there, so for now we only use it when running
CPU-only.
* FlashMLA-2: slightly smaller computer buffer size
* Prepare wk_b when loading DeepSeek models (if wk_b is missing)
* Add some comments
* Fix case where wkv_b is quantized with k- or i-quants.
* Fix CUDA
There is an issue with quantized GEMV on CUDA when the left operand
(the matrix) is not contiguous. So, for now, we also create wv_b
during model loading and use that instead of the 3D view of wkv_b.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>