* fused mul+multi_add: CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
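A minimal scalar sketch of what the fused op computes (the helper name is invented for illustration, not the actual ggml kernel): the elementwise multiply and the chain of adds are done in one pass instead of one mul kernel followed by several add kernels, saving intermediate tensor writes.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: r[i] = a[i]*b[i] + sum_k extras[k][i], fused into one loop.
std::vector<float> fused_mul_multi_add(const std::vector<float>& a,
                                       const std::vector<float>& b,
                                       const std::vector<std::vector<float>>& extras) {
    std::vector<float> r(a.size());
    for (size_t i = 0; i < a.size(); ++i) {
        float acc = a[i] * b[i];                   // the mul part
        for (const auto& e : extras) acc += e[i];  // the multi_add part
        r[i] = acc;
    }
    return r;
}
```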
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was per expert
group, not per row, so only the first n_per_group biases were used
for all experts.
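A hedged sketch of the indexing bug (function and variable names are invented for illustration, not the actual ik_llama.cpp code): with experts split into groups of n_per_group, the per-expert selection bias must be indexed by the global expert id; wrapping the index makes every group reuse the first group's biases.

```cpp
#include <cstddef>
#include <vector>

// Buggy form: wraps the index, so bias[0..n_per_group) is reused for every group.
std::vector<float> biased_scores_buggy(const std::vector<float>& scores,
                                       const std::vector<float>& bias,
                                       size_t n_per_group) {
    std::vector<float> out(scores.size());
    for (size_t e = 0; e < scores.size(); ++e)
        out[e] = scores[e] + bias[e % n_per_group];  // wrong
    return out;
}

// Fixed form: one bias per expert, indexed by the global expert id.
std::vector<float> biased_scores_fixed(const std::vector<float>& scores,
                                       const std::vector<float>& bias) {
    std::vector<float> out(scores.size());
    for (size_t e = 0; e < scores.size(); ++e)
        out[e] = scores[e] + bias[e];
    return out;
}
```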
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm into a single line
so one can more easily check what kind of arguments are being passed
by simply using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, which then is typically normalized via
fused_rms_norm.
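A scalar sketch of the pattern being fused (hypothetical helper, not the actual ggml kernel): the residual add and the RMS normalization share one pass, so the intermediate sum tensor is never written out separately.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fused residual add + RMS norm: sum = residual + x, then normalize the sum.
std::vector<float> fused_add_rms_norm(const std::vector<float>& residual,
                                      const std::vector<float>& x,
                                      const std::vector<float>& weight,
                                      float eps = 1e-6f) {
    const size_t n = x.size();
    std::vector<float> sum(n);
    double ss = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum[i] = residual[i] + x[i];          // the add
        ss += (double)sum[i] * sum[i];        // accumulate for the norm in the same pass
    }
    const float scale = 1.0f / std::sqrt((float)(ss / n) + eps);
    for (size_t i = 0; i < n; ++i)
        sum[i] = sum[i] * scale * weight[i];  // the rms_norm
    return sum;
}
```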
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the OpenAI top-k fusion for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
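For reference, a scalar sketch of the op chain these commits fuse for expert routing (names are illustrative; the real code operates on ggml tensors): sigmoid the gating logits, add a per-expert selection bias, take the top-k, then normalize the selected probabilities so they sum to 1 (the sum_rows + div part).

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Returns the k selected expert ids with their normalized routing weights.
std::vector<std::pair<int,float>> route_experts(const std::vector<float>& logits,
                                                const std::vector<float>& bias,
                                                int k) {
    const int n = (int)logits.size();
    std::vector<float> p(n), sel(n);
    for (int i = 0; i < n; ++i) p[i] = 1.0f / (1.0f + std::exp(-logits[i]));  // sigmoid
    for (int i = 0; i < n; ++i) sel[i] = p[i] + bias[i];                      // add (bias is used for selection only)
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b) { return sel[a] > sel[b]; });         // topk
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) sum += p[idx[i]];                             // sum_rows
    std::vector<std::pair<int,float>> out;
    for (int i = 0; i < k; ++i) out.push_back({idx[i], p[idx[i]] / sum});     // div
    return out;
}
```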
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
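A sketch of grouped top-k routing as I understand it (illustrative only; the group scoring rule here, max over the group, is an assumption): experts are partitioned into groups, the best-scoring groups are kept, and the final top-k is taken only from experts inside the kept groups.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Returns the k selected expert ids, restricted to the top groups_used groups.
std::vector<int> grouped_topk(const std::vector<float>& scores,
                              int n_groups, int groups_used, int k) {
    const int per_group = (int)scores.size() / n_groups;
    // score each group by its best expert (assumed rule)
    std::vector<int> gidx(n_groups);
    std::iota(gidx.begin(), gidx.end(), 0);
    std::vector<float> gscore(n_groups);
    for (int g = 0; g < n_groups; ++g)
        gscore[g] = *std::max_element(scores.begin() + g*per_group,
                                      scores.begin() + (g+1)*per_group);
    std::partial_sort(gidx.begin(), gidx.begin() + groups_used, gidx.end(),
                      [&](int a, int b){ return gscore[a] > gscore[b]; });
    // gather candidates from the kept groups, then plain top-k over them
    std::vector<int> cand;
    for (int i = 0; i < groups_used; ++i)
        for (int j = 0; j < per_group; ++j) cand.push_back(gidx[i]*per_group + j);
    std::partial_sort(cand.begin(), cand.begin() + k, cand.end(),
                      [&](int a, int b){ return scores[a] > scores[b]; });
    cand.resize(k);
    return cand;
}
```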
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in the terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing change committed by mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Offload only activated experts
* This seems to do the trick for -fmoe
* Do not recalculate activated experts for fused up/gate
* Log out of bounds access details
* Add a command line argument
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Bounds for flash attention
* Add n_swa to FA parameters
* Fix it
* This seems very slightly better
* Using vec kernel when we have SWA
* Need also this
* f32 vec kernel
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
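A scalar sketch of what the fused path computes for a standard gated FFN, assuming SiLU as the unary op (the helper name is made up; the GEMMs producing `gate` and `up` are elided): instead of two GEMM results going through a separate unary op and multiply, the epilogue produces act(gate) * up directly.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Fused epilogue: out[i] = silu(gate[i]) * up[i] in one pass.
std::vector<float> fused_gate_up_silu(const std::vector<float>& gate,
                                      const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float g = gate[i];
        out[i] = (g / (1.0f + std::exp(-g))) * up[i];  // silu(g) = g * sigmoid(g)
    }
    return out;
}
```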
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Skip the row id computation for the ffn_down op
Sadly, almost negligible performance gain.
* Also this doesn't do much
* Also this barely moves the needle
* This is slightly better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
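For a float tensor the validation is a plain NaN scan, as sketched below; for quantized types one would instead check the per-block fp16 scales, since one NaN scale poisons every weight in its block (the quantized variant is omitted here because the block layouts are type-specific).

```cpp
#include <cmath>
#include <vector>

// Returns true if any element of the tensor data is NaN.
bool tensor_has_nan(const std::vector<float>& data) {
    for (float v : data)
        if (std::isnan(v)) return true;
    return false;
}
```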
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This fixes confusion around Q8_0 on AVX2
* This does it for iq4_nl, including FA
* This does it for iq4_nl on Zen4, but FA does not work
* Slightly more clear
* Adding forgotten q8_0_r8 to num_rows()
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use bperm trick for iq2_ks gemm -> 7% gain
* Use bperm trick for iq2_k gemm -> ~5% gain
* Use bperm trick for iq2_k_r4 gemm -> ~3% gain
* Use bperm trick for iq2_ks gemv -> ~7% gain
* Use bperm trick for iq2_k gemv -> ~3% gain
* Use bperm trick for iq2_k_r4 gemv -> ~7% gain
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q8_k_r16: basics
* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+
PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.
* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+
* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+
* Fix AVX2
* Just always set num_rows to 16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use bperm trick for iq3_ks -> 5% PP performance gain
* Use bperm trick for iq3_k -> 5% PP performance gain
* Use bperm trick for iq3_k -> 8% PP performance gain
* Use bperm trick for iq3_k_r4 gemv -> ~5% faster
* Use bperm trick for iq3_k gemv -> ~3% faster
* Use bperm trick for iq3_k gemv -> 4.5% gain
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Use __byte_perm in get_int_from_table_16
* Use get_int_from_table_16 everywhere for 4-bit quants
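For readers unfamiliar with the trick, here is a CPU emulation of CUDA's `__byte_perm(a, b, s)` (a sketch: it handles only the plain byte-select mode, ignoring the MSB-replication selectors 8-15 of the underlying PRMT instruction). The 8 bytes of (a, b) form an in-register table, and each selector nibble of s picks one byte of the result; `get_int_from_table_16` uses this to look up four 4-bit quant indices per instruction without touching memory.

```cpp
#include <cstdint>

// Emulates __byte_perm(a, b, s) in default mode: b supplies bytes 4..7,
// a supplies bytes 0..3; selector nibble i of s picks result byte i.
uint32_t byte_perm(uint32_t a, uint32_t b, uint32_t s) {
    const uint64_t v = (uint64_t)b << 32 | a;
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t sel = (s >> (4*i)) & 0x7;              // one of 8 source bytes
        r |= ((uint32_t)((v >> (8*sel)) & 0xff)) << (8*i);
    }
    return r;
}
```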
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.
For now a simple fix: just take the slow path, do not repack.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for the 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so those still need to be addressed.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but I observe slightly better PP and slightly
worse TG performance with it.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.
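For reference, a scalar form of the swiglu_oai activation being SIMD-ified (the constants alpha = 1.702 and limit = 7.0 and the clamping order follow my reading of the gpt-oss release; treat them as assumptions): the gate branch is clamped from above and goes through a scaled sigmoid, the linear branch is clamped both ways and gets a +1 offset.

```cpp
#include <algorithm>
#include <cmath>

// Scalar reference for swiglu_oai: x is the gate branch, g the linear branch.
float swiglu_oai(float x, float g, float alpha = 1.702f, float limit = 7.0f) {
    x = std::min(x, limit);                             // clamp gate from above only
    g = std::clamp(g, -limit, limit);                   // clamp linear branch both ways
    const float glu = x / (1.0f + std::exp(-x * alpha)); // x * sigmoid(alpha * x)
    return glu * (1.0f + g);
}
```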
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>