* Args for MMVQ functions
* WIP
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (no bias)
We see nearly 2% TG speedup for Ling-mini-2.0 and
about 1% for DeepSeek-Lite.
* Fused ffn_up*unary_op(ffn_gate) for MMVQ (with bias)
* Fusing also for iqk/trellis/repacked quants
* Fusing mmvq also in non-MoE up+gate
* Fuse mul_mat_id and add_id into a single kernel for mmvq
* Also iqk quants
* Split mmvq.cu and iqk_mmvq.cu into separate template instances
* Put iqk mmvq implementations into template instances
* Somehow I forgot to change the ggml_type in the legacy template calls
* Add diagnostics
* Disable assert
* Fix TG fused up*unary(gate) when down cannot be fused
The wrong memory buffer got used in that case
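A minimal sketch of the fused ffn_up*unary_op(ffn_gate) idea for a matrix-vector product, assuming dense float weights and SiLU as the unary op (the actual MMVQ kernels operate on quantized blocks on the GPU). `fused_up_gate_matvec` and `silu` are illustrative names, not functions from this code base.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Assumed unary op for this sketch: SiLU(v) = v * sigmoid(v).
static float silu(float v) { return v / (1.0f + std::exp(-v)); }

// Computes y[r] = (up[r,:] . x) * silu(gate[r,:] . x) in a single pass over x,
// instead of two separate mat-vec products followed by an element-wise op.
// Both matrices are row-major with n_rows rows and x.size() columns.
std::vector<float> fused_up_gate_matvec(const std::vector<float>& up,
                                        const std::vector<float>& gate,
                                        const std::vector<float>& x,
                                        std::size_t n_rows) {
    const std::size_t n_cols = x.size();
    std::vector<float> y(n_rows);
    for (std::size_t r = 0; r < n_rows; ++r) {
        float sum_up = 0.0f;
        float sum_gate = 0.0f;
        for (std::size_t c = 0; c < n_cols; ++c) {
            sum_up   += up[r * n_cols + c] * x[c];
            sum_gate += gate[r * n_cols + c] * x[c];
        }
        y[r] = sum_up * silu(sum_gate);
    }
    return y;
}
```

The point of fusing is that x is read once and the intermediate up and gate results never go to memory.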
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Change fmoe to be on by default
* Change default fmoe also in llama-bench
* Change flash attention to be on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: command line argument to disable it
* Faster tensor name formatting
We gain ~1% for Ling-mini-2.0 when running on CUDA.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for all experts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm to a single line
so one can more easily check what kind of arguments are being passed
by simply using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, which then is typically normalized via
fused_rms_norm.
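A minimal CPU sketch of the add + fused_rms_norm combination described above, assuming one row of floats and a per-element norm weight. `add_rms_norm` is an illustrative name, not the actual op in this code base.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Adds layer_out into residual in place (this is the next layer's input)
// and returns the RMS-normalized, weighted version of that sum.
// All three vectors are assumed to have the same length.
std::vector<float> add_rms_norm(std::vector<float>& residual,
                                const std::vector<float>& layer_out,
                                const std::vector<float>& weight,
                                float eps) {
    const std::size_t n = residual.size();
    float sum_sq = 0.0f;
    for (std::size_t i = 0; i < n; ++i) {
        residual[i] += layer_out[i];          // the add
        sum_sq += residual[i] * residual[i];  // accumulated in the same pass
    }
    const float scale = 1.0f / std::sqrt(sum_sq / static_cast<float>(n) + eps);
    std::vector<float> normed(n);
    for (std::size_t i = 0; i < n; ++i) {
        normed[i] = residual[i] * scale * weight[i];
    }
    return normed;
}
```

Doing the add and the sum of squares in one loop saves a full pass over the row compared to running the two ops separately.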
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Do not allocate KV cache for unused layers
* Do not apply experts weight scale if it is 1
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Conditionally write moe_shared_expert_intermediate_size
Ling-1T config.json does *not* have `moe_shared_expert_intermediate_size`.
Ling-flash-2.0a *does* have it.
This small patch just makes the gguf_writer conditionally detect as
needed.
* Fix Ling-1T missing moe_shared_expert_intermediate_size
Thanks CISC for the proper patch to include the needed values!
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Parallelize mask
We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.
* With FA on, create mask as f16 directly
* WIP
* Reduce KQ mask padding to 16
Why was it 64 in the first place?
I don't observe any issues, while TG performance
for long contexts improves by 2-4%.
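For reference, a sketch of what the padding does to the mask size, assuming the token dimension is rounded up to a multiple of a power-of-two padding. `pad_to` is a hypothetical helper written for this note.

```cpp
#include <cstddef>

// Rounds x up to the next multiple of n. n must be a power of two.
constexpr std::size_t pad_to(std::size_t x, std::size_t n) {
    return (x + n - 1) & ~(n - 1);
}

// With a single token per batch (TG), the mask has pad_to(1, 64) == 64 rows
// with the old padding and pad_to(1, 16) == 16 rows with the new one,
// so there is 4x less mask data to build for every generated token.
static_assert(pad_to(1, 64) == 64, "old padding");
static_assert(pad_to(1, 16) == 16, "new padding");
```

Each mask row has one entry per KV cache position, which is presumably why the saving becomes visible at long contexts.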
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mtmd: the beginning
* Add mtmd: mtmd.cpp compiles
* Add mtmd: clip initialization compiles
* Add mtmd: clip.cpp compiles
* Add mtmd: builds successfully
* Add CPU implementation for GGML_OP_GLU
* Add CUDA implementation for GGML_OP_GLU
* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW
* Add mtmd: refresh CPU rope
* Add mtmd: refresh CUDA rope
* Add mtmd: add Qwen2-VL
* Add mtmd: Qwen2.5-VL text seems to work with this change
* Add mtmd: fix swiglu
* Add mtmd: use LOG_TEE so generated tokens show up in terminal
* Add mtmd: do not attempt to load a GPU backend if none are available
* GLU, not GPU
* Fix typo
* Fix new/free mismatch
* LOG stuff
* Add mtmd: this fixes gibberish on second image
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
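A rough sketch of why such chunks can be skipped, assuming a single query, scalar values, and a mask that is either 0 or -infinity. A chunk whose mask is all -infinity contributes exp(-inf) = 0 to the softmax, so it can be dropped without changing the result. `masked_attention_chunked` is an illustrative name, not the actual FA kernel.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// scores[j] = q . k_j, mask[j] is 0 or -infinity, values[j] is a scalar v_j.
// Returns sum_j softmax(scores + mask)[j] * values[j] using an online softmax,
// and reports how many chunks actually had to be processed.
float masked_attention_chunked(const std::vector<float>& scores,
                               const std::vector<float>& mask,
                               const std::vector<float>& values,
                               std::size_t chunk,
                               std::size_t* chunks_computed) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    float running_max = neg_inf, denom = 0.0f, acc = 0.0f;
    *chunks_computed = 0;
    for (std::size_t start = 0; start < scores.size(); start += chunk) {
        const std::size_t end = std::min(start + chunk, scores.size());
        bool all_masked = true;
        for (std::size_t j = start; j < end; ++j) {
            if (mask[j] != neg_inf) { all_masked = false; break; }
        }
        if (all_masked) continue;  // whole chunk would add exactly zero
        ++*chunks_computed;
        for (std::size_t j = start; j < end; ++j) {
            const float s = scores[j] + mask[j];
            if (s == neg_inf) continue;
            if (s > running_max) {
                // Rescale previous partial sums to the new maximum.
                const float rescale = std::exp(running_max - s);
                denom *= rescale;
                acc *= rescale;
                running_max = s;
            }
            const float p = std::exp(s - running_max);
            denom += p;
            acc += p * values[j];
        }
    }
    return denom > 0.0f ? acc / denom : 0.0f;
}
```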
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus
an optimization to do a partial sort when sorting the logits.
That mainline PR and friends were partially cherry-picked by #723, but
weren't really in a working state yet.
A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
still a TODO in mainline [1].
Note that there is a TG performance penalty to return the logprobs, as
we need to sort the logits. By doing partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:
This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)
If we cherry-pick the full sort from the mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)
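A minimal sketch of the partial sort idea, assuming only the top n entries are needed for the logprobs response. `top_n_logits` is an illustrative name, not the function used in the server code.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Returns the n largest logits as (token id, logit) pairs in descending order.
// std::partial_sort only orders the first n elements, which is much cheaper
// than a full sort when n is tiny compared to the vocabulary size.
std::vector<std::pair<int, float>> top_n_logits(const std::vector<float>& logits,
                                                std::size_t n) {
    std::vector<std::pair<int, float>> items;
    items.reserve(logits.size());
    for (std::size_t i = 0; i < logits.size(); ++i) {
        items.emplace_back(static_cast<int>(i), logits[i]);
    }
    n = std::min(n, items.size());
    std::partial_sort(items.begin(), items.begin() + n, items.end(),
                      [](const std::pair<int, float>& a,
                         const std::pair<int, float>& b) {
                          return a.second > b.second;
                      });
    items.resize(n);
    return items;
}
```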
[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Quick attempt to fuse the Q, K, V GEMMs
Doesn't do much on the CPU
* Doesn't do much on the GPU either
* Use llm_build_mul_mat_qkv
* This is not needed
* Revert timing committed by mistake
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* handle reasoning content in webui
server : include usage statistics only when user request them (#16052)
server : only attempt to enable thinking if using jinja (#15967)
* config reasoning_content in webui and change default to auto
---------
Co-authored-by: firecoperana <firecoperana>