* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for al experts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gmp-oss: common
* gpt-oss: attnetion sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much to high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA thta was missing in the previous commit.
There are issues with EOG tokens, so this still needs to be added.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the write place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal trottling imore often
because of making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fusing MoE up * unary(gate)
* Fusing MoE up * unary(gate): CUDA
We get ~13% speedup for PP-512 and ~2% for TG-128
for DeepSeek-Lite
* On CUDA also fuse MoE down * (up * unary(gate))
in case the MUL_MAT_ID op for the down experts is the next
op in the graph.
* Command line option to enable fused MoE up*unary(gate)
* Add fmoe option to llama-bench
* Adding forgotten gelu, relu, silu on ARM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* multi_sdd: WIP
* multi_sdd: CPU works
* multi_add: CUDA
* multi_add: simplify
* multi_add: Metal
* Metal: speed up mul_mat_id
For the Granite-1B MoE model PP-512 goes from
156 t/s to 890 t/s, so nearly a 6X speedup!
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused y*unary(x) op
* Fused y*unary(x) op: CUDA
* Fused y*unary(x) op: dedicated CPU implementation for silu and gelu
* Fused y*unary(x) op: Metal
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding GGML_UNARY_OP_SWIGLU
This commit implements the ggml op and CPU compute
forward. I see ~3-4% speedup of PP-512 for Phi-3.5-mini.
* GGML_UNARY_OP_SWIGLU: CUDA implementation
I observe ~12% speedup for PP-512(Phi-3.5-mini).
* GGML_UNARY_OP_SWIGLU: Metal implementation
We get ~2% speedup for PP-512(Phi-3.5-mini).
* GGML_UNARY_OP_SWIGLU: minor improvement on Metal
* GGML_UNARY_OP_SWIGLU: cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merging mainline - WIP
* Merging mainline - WIP
AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower as it is so often
the case with llama.cpp/ggml after some "improvements" have been made.
* Merging mainline - fix Metal
* Remove check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>