* WIP: absorb adding input into std_attn and std_ffn
* WIP: NCCL infra
* WIP: add reduce and fake_cpy ops
* WIP
* WIP: graph appears to work, layer is broken
* WIP: Qwen3-MoE works with graph, layer still broken
* WIP: GLM-4.5 graph works
* WIP: fix sm layer (dense)
* WIP: fix sm layer (MoE)
* WIP: fast PP with bespoke 4-GPU NCCL
I guess I'm not using NCCL the right way, as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1; 2,3; 0,2; 1,3) and use those, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and the associated logic for every possible number of GPUs).
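For the record, a minimal single-process sketch of the pairwise hack (hypothetical helper names, not the actual code):

    #include <nccl.h>

    struct pair_comm {
        ncclComm_t comm[2]; // rank 0 lives on dev[0], rank 1 on dev[1]
        int        dev[2];
    };

    // build a 2-rank communicator spanning two devices in a single process
    static pair_comm make_pair_comm(int a, int b) {
        pair_comm p = { {}, { a, b } };
        ncclCommInitAll(p.comm, 2, p.dev);
        return p;
    }

    int main() {
        // reduce inside (0,1) and (2,3) in parallel, then combine the
        // partial results via (0,2) and (1,3)
        pair_comm c01 = make_pair_comm(0, 1);
        pair_comm c23 = make_pair_comm(2, 3);
        pair_comm c02 = make_pair_comm(0, 2);
        pair_comm c13 = make_pair_comm(1, 3);
        // ... ncclAllReduce() on c01/c23 first, then on c02/c13 ...
        pair_comm * all[4] = { &c01, &c23, &c02, &c13 };
        for (int i = 0; i < 4; ++i)
            for (int r = 0; r < 2; ++r) ncclCommDestroy(all[i]->comm[r]);
        return 0;
    }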
* WIP: Cohere2
* Explicitly set device
* Bespoke 3-GPU case
* WIP
* Do not repeat get_rows multiple times
* Fix 3 GPUs
* OK, let's leave it in
* Implement the reduce op without NCCL available
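One way to do this, as a hypothetical sketch (a tree reduce over peer copies plus an elementwise-add kernel; assumes the producing streams were already synchronized):

    #include <cstddef>
    #include <cuda_runtime.h>

    // hypothetical elementwise-add kernel launcher: dst += src
    void launch_vec_add(float * dst, const float * src, size_t n, cudaStream_t s);

    // data[g]: partial result on GPU g; tmp[g]: scratch on GPU g
    void reduce_to_gpu0(float * data[4], float * tmp[4], size_t n, cudaStream_t st[4]) {
        // step 1: copies 1 -> 0 and 3 -> 2 proceed concurrently
        cudaMemcpyPeerAsync(tmp[0], 0, data[1], 1, n*sizeof(float), st[0]);
        cudaMemcpyPeerAsync(tmp[2], 2, data[3], 3, n*sizeof(float), st[2]);
        cudaSetDevice(0); launch_vec_add(data[0], tmp[0], n, st[0]);
        cudaSetDevice(2); launch_vec_add(data[2], tmp[2], n, st[2]);
        // step 2: copy 2 -> 0, then the final add on GPU 0
        cudaStreamSynchronize(st[2]);
        cudaMemcpyPeerAsync(tmp[0], 0, data[2], 2, n*sizeof(float), st[0]);
        cudaSetDevice(0); launch_vec_add(data[0], tmp[0], n, st[0]);
    }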
* Be able to build without NCCL
cmake -DGGML_NCCL=OFF disables it
* Make --max-gpu work again
* Slightly better for 4 GPUs without NCCL
* Cleanup
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides are used.
Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
* This works and TG is decent, but PP is low
* Better
* Apply f_logit_scale before mul mat with output tensor
* This is better for PP: 600 t/s -> 700 t/s
* To not lose this again
* WIP
* Equal split
* WIP
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Rearrange graph nodes
so that graph portions that are the same on 2 or more
GPUs can be computed at the same time.
* Separate graph compute implementation for split mode graph
* This is better
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Required adding the "temperature scaling" to the standard attention
implementation.
But in this way split mode "graph" is automatically supported.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Remove most of split mode row
* WIP
* WIP: also allocate the KV cache using tensor split
* WIP: it runs with wrong results
But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed with the attn evaluation, GPU 1 must wait for GPU 0 to finish
its entire attn calculation
* Same with the FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
wait for GPU 0 to finish its entire FFN calculation before it can
start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling
* WIP
* This works, but it is slow
* This is slightly better
But the graph is still not being computed in parallel.
Why? Because the scheduler creates graph splits where the
result of the computation on one GPU becomes an input for the
other split. Hence, to trigger the computation on the second GPU,
one needs to wait for the computation on the first GPU to finish,
even though the two can be done in parallel up to the synchronization
point. So, all that is left to do is to trick the scheduler into creating
two splits that can be done in parallel, and then have a graph split
where the results get combined.
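Not the scheduler code itself, but the execution pattern we are after, sketched with raw CUDA streams and events (run_split/run_combine are hypothetical stand-ins for enqueuing a graph split):

    #include <cuda_runtime.h>

    void run_split(int dev, cudaStream_t s);    // hypothetical: enqueue one split
    void run_combine(int dev, cudaStream_t s);  // hypothetical: enqueue the combining split

    void two_splits_then_combine() {
        cudaStream_t s0, s1;
        cudaEvent_t  done1;
        cudaSetDevice(0); cudaStreamCreate(&s0);
        cudaSetDevice(1); cudaStreamCreate(&s1);
        cudaEventCreateWithFlags(&done1, cudaEventDisableTiming);

        cudaSetDevice(0); run_split(0, s0);  // split A on GPU 0
        cudaSetDevice(1); run_split(1, s1);  // split B on GPU 1, overlaps with A
        cudaEventRecord(done1, s1);

        // the combining split on GPU 0 starts only after split B has finished
        cudaSetDevice(0);
        cudaStreamWaitEvent(s0, done1, 0);
        run_combine(0, s0);
    }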
* Playing games with the scheduler
This change tricks it into doing the right thing™.
Still quite a bit slower than split mode layer for the 8B LLaMA model.
But for the 70B LLaMA it now beats split mode layer for TG:
28 t/s vs 24.4 t/s. PP is 627 t/s vs 744 t/s.
In comparison, split mode "row" in mainline gets
484 t/s PP and 19.3 t/s TG.
* Fix attn split
Granularity for Wq, Wo is not just the head size, but
head size * gqa_ratio.
Otherwise the Wk, Wv splits end up not being a multiple of the
head size when we divide the split determined by Wo by
the gqa_ratio.
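A worked example with LLaMA-3-70B numbers (n_head = 64, n_head_kv = 8, head size = 128):

    const int head_size   = 128;
    const int gqa_ratio   = 64/8;                  // n_head / n_head_kv
    // Splitting Wq/Wo at a mere multiple of head_size, e.g. 4224 = 33*128,
    // gives 4224/gqa_ratio = 528 columns for Wk/Wv, which is not a multiple
    // of 128 and therefore cuts through a KV head.
    const int granularity = head_size*gqa_ratio;   // 1024
    // Any Wo split that is a multiple of 1024 divides by gqa_ratio into a
    // whole number of KV heads.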
* Show memory used per device
* Make it work with partial offload
but no tensor overrides yet, just ngl < num_layers.
* Allow for f16 source in fused_rms_norm
* This results in faster PP.
Now PP is faster than split mode layer for L3-70B.
* Rename split mode "row" to split mode "graph"
* Leave FFN partial results as f16
* WIP GLM4.5 - runs with wrong results
* WIP GLM4.5 - this works
PP is already better than split mode layer, but TG for zero context
is kind of low - 60 vs 92 t/s. TG becomes better than split mode layer
at around 20k tokens. PP at 26k tokens is 1.55X that of sm layer.
* Work around compiler bug
It issues a warning that there is an extra semicolon outside of a function,
but there isn't. If I remove the anonymous namespace and make the
functions inside static, the warning disappears, so this is clearly
a compiler bug.
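Schematically (illustrative helpers, before/after shown side by side, not meant to coexist in one file):

    // before: helpers in an anonymous namespace trigger the bogus warning
    namespace {
    void helper() { /* ... */ }
    } // namespace

    // after: same internal linkage via static, no warning
    static void helper() { /* ... */ }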
* Make graph reuse work with split mode graph
* Remove more split mode row remnants
* WIP tensor overrides
Runs with wrong results; I don't see where the issue could be.
* This works but is slow
Still does not work for row-interleaved quants
* Slightly better
* Slightly better
* Row-interleaved quants work
* Better
* Minor
* Guard against using split mode "graph" for unsupported models
* Guard against using merge_qkv with split mode "graph"
* WIP split mode attn
Works for LLaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.
* Split mode graph for qwen3moe
* Try to better distribute the splits
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fixing Gigachat support
* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)
* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Add mainline compatible FA command line option
* Graph reuse: add command line argument to turn it on
* WIP
* This seems to work
* This is perhaps cleaner
* Change the command line option to -gr
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse concat and copy into K cache
* Avoid ggml_cont() when n_token = 1
Combined effect: about +2% in TG performance with full GPU offload
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Merge Q and K into a single tensor
* Make V mul mat follow QK mul mat
so they can be fused, which gives slightly better TG performance.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* server: add support for vision model
webui: add support for vision model
* server : remove hack for extra parallel slot #10187
* llama : fix KV shift for qwen2vl #13870
* add no-context-shift parameter
---------
Co-authored-by: firecoperana <firecoperana>
* Introducing rope cache
When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model-dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse them for the Q and K RoPE in each layer?
This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.
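A scalar sketch of the POC (hypothetical layout: cos/sin interleaved per rotation pair, neox-style frequencies):

    #include <cmath>
    #include <vector>

    // cache[i][d] holds {cos, sin} for rotation pair d at position pos[i];
    // computed once per batch and reused by every layer's Q and K RoPE
    std::vector<float> build_rope_cache(const int * pos, int n_pos, int head_dim, float theta_base) {
        std::vector<float> cache(2*size_t(n_pos)*(head_dim/2));
        for (int i = 0; i < n_pos; ++i) {
            for (int d = 0; d < head_dim/2; ++d) {
                const float freq  = std::pow(theta_base, -2.0f*d/head_dim);
                const float angle = pos[i]*freq;
                cache[2*(size_t(i)*(head_dim/2) + d) + 0] = std::cos(angle);
                cache[2*(size_t(i)*(head_dim/2) + d) + 1] = std::sin(angle);
            }
        }
        return cache;
    }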
* cuda: neox works
* WIP
* rope_cache: norm works
* Fused rope+rope
* Fused rope+rope (norm)
* Fused rms+rms+rope+rope (neox) - not working
* WIP
* Also qwen3
* Add command line arg to disable rope cache
* Disable RoPE cache if rope type is not neox or norm
* Add missing break after merge with main
* Fused fused_rms+fused_rms+rope+rope (with -mqkv)
* Fused fused_rms+fused_rms+rope+rope (without -mqkv)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Biased mmvq: minor optimization
* Fusing Q and K rms_norm for TG on CUDA
* Remove commented out code
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* POC: merge Q, K, V into a single, contiguous tensor
Done just for Qwen3-MoE, where I see a 4% uplift in TG.
The PP performance gain is sub-percent, if any.
Still, it seems to make sense to do this in general, given
the TG performance gain.
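The pattern, as a ggml-style sketch (hypothetical helper; in ggml the weight's second dimension is the output dimension, so the merged wqkv has ne1 = n_embd_q + n_embd_k + n_embd_v):

    #include "ggml.h"

    static void build_qkv_merged(
            struct ggml_context * ctx, struct ggml_tensor * wqkv, struct ggml_tensor * cur,
            int64_t n_embd_q, int64_t n_embd_k, int64_t n_embd_v, int64_t n_tokens,
            struct ggml_tensor ** Q, struct ggml_tensor ** K, struct ggml_tensor ** V) {
        // one GEMM for all three projections; result rows are [q | k | v]
        struct ggml_tensor * qkv = ggml_mul_mat(ctx, wqkv, cur);
        const size_t es = ggml_element_size(qkv);
        *Q = ggml_view_2d(ctx, qkv, n_embd_q, n_tokens, qkv->nb[1], 0);
        *K = ggml_view_2d(ctx, qkv, n_embd_k, n_tokens, qkv->nb[1], es* n_embd_q);
        *V = ggml_view_2d(ctx, qkv, n_embd_v, n_tokens, qkv->nb[1], es*(n_embd_q + n_embd_k));
    }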
* WIP
* merge_qkv: it works for gpt-oss
...but we see a smaller TG gain (~1.5%)
* WIP
* Don't ignore the return value of create_tensors()
Otherwise, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend is trying to use mmap,
but that no longer works.
* merge_qkv: bias can be absent, optional, or mandatory
* merge_qkv: glm4.5moe
* merge_qkv: add command line argument to enable
* merge_qkv: fix tensor dimensions
* merge_qkv: llama-4
* merge_qkv: qwen3 (dense)
* merge_qkv: simplify build_qwen3moe
* cohere2 - simplify graph building
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse Q, K, V gemv+add
* More gemv+add fusing
* Faster copy when tensors are contiguous
Relevant for storing data into the KV cache. I see ~1% speedup
for fast models (Ling-mini-2.0, gpt-oss-20b, etc.)
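The fast path, roughly (placement and names hypothetical):

    #include <string.h>
    #include "ggml.h"

    static bool try_fast_copy(const struct ggml_tensor * src, struct ggml_tensor * dst) {
        // when both tensors are plain contiguous buffers of the same type,
        // the row-by-row copy collapses into a single memcpy
        // (cudaMemcpyAsync on the CUDA backend)
        if (ggml_is_contiguous(src) && ggml_is_contiguous(dst) &&
            src->type == dst->type && ggml_nelements(src) == ggml_nelements(dst)) {
            memcpy(dst->data, src->data, ggml_nbytes(src));
            return true;
        }
        return false; // caller falls back to the generic strided copy
    }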
* Cleanup
* Make sure the bias really is 1 row to use fusion
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: command line argument to disable it
* Faster tensor name formatting
We gain ~1% for Ling-mini-2.0 when running on CUDA.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse add+add+fused_rms
* Try this
* Macro to easily enable/disable fusion
* Various:
* Check that all tensors involved are on the same device before applying fusion
* Fuse sigmoid+scale+sum_rows+div
* Fix the fused bailingmoe2 experts selection
The issue there was that the bias was not per row, but per
expert group, so only the first n_per_group biases were used
for all experts.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Combine all calls to llm_build_norm into a single line
so one can more easily check what kind of arguments are being passed
by simply using grep.
* Combine add + fused_rms_norm
For many models this happens at each layer: the result of the
layer is added to the layer input, which then becomes the input
to the next layer, where it is typically normalized via
fused_rms_norm.
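A scalar f32 sketch of the fused op (hypothetical helper): the residual sum is written back so it can feed the next layer, and the normalized value feeds the current one:

    #include <cmath>

    static void fused_add_rms_norm(int n, float * a, const float * b,
                                   const float * w, float * y, float eps) {
        double sum2 = 0.0;
        for (int i = 0; i < n; ++i) {
            a[i] += b[i];                   // residual add, kept for the next layer
            sum2 += (double)a[i]*a[i];
        }
        const float scale = 1.0f/std::sqrt(sum2/n + eps);
        for (int i = 0; i < n; ++i) y[i] = a[i]*scale*w[i];
    }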
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Do not allocate KV cache for unused layers
* Do not apply experts weight scale if it is 1
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much, if anything, so it is not a big loss.
* Also fuse sum_rows and div
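For reference, what the full fused chain computes per token, as an unfused scalar sketch (hypothetical names; whether the bias also enters the final weights varies by model):

    #include <algorithm>
    #include <cmath>
    #include <numeric>
    #include <vector>

    static void route_experts(const float * logits, const float * bias, int n_expert,
                              int n_top, std::vector<int> & idx, std::vector<float> & w) {
        std::vector<float> score(n_expert);
        for (int e = 0; e < n_expert; ++e)  // sigmoid + add (expert bias)
            score[e] = 1.0f/(1.0f + std::exp(-logits[e])) + bias[e];
        idx.resize(n_expert);
        std::iota(idx.begin(), idx.end(), 0);
        std::partial_sort(idx.begin(), idx.begin() + n_top, idx.end(),  // top-k
                          [&](int a, int b) { return score[a] > score[b]; });
        idx.resize(n_top);
        float sum = 0.0f;                   // sum_rows
        for (int e : idx) sum += score[e];
        w.resize(n_top);
        for (int i = 0; i < n_top; ++i) w[i] = score[idx[i]]/sum;  // div
    }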
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
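For the record, a scalar sketch of what grouped routing does (hypothetical; some models score a group by its best expert, others by the sum of its top-2):

    #include <algorithm>
    #include <numeric>
    #include <vector>

    static std::vector<int> grouped_topk(const std::vector<float> & score,
                                         int n_group, int top_groups, int n_top) {
        const int per_group = (int)score.size()/n_group;
        // score each group by its best expert
        std::vector<float> gs(n_group);
        for (int g = 0; g < n_group; ++g)
            gs[g] = *std::max_element(score.begin() + g*per_group,
                                      score.begin() + (g + 1)*per_group);
        std::vector<int> groups(n_group);
        std::iota(groups.begin(), groups.end(), 0);
        std::partial_sort(groups.begin(), groups.begin() + top_groups, groups.end(),
                          [&](int a, int b) { return gs[a] > gs[b]; });
        // plain top-k, but only among experts of the surviving groups
        std::vector<int> cand;
        for (int i = 0; i < top_groups; ++i)
            for (int e = groups[i]*per_group; e < (groups[i] + 1)*per_group; ++e)
                cand.push_back(e);
        std::partial_sort(cand.begin(), cand.begin() + n_top, cand.end(),
                          [&](int a, int b) { return score[a] > score[b]; });
        cand.resize(n_top);
        return cand;
    }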
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Parallelize mask
We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.
* With FA on, create the mask as f16 directly
* WIP
* Reduce KQ mask padding to 16
Why was it 64 in the first place?
I don't observe any issues, while TG performance
for long contexts improves by 2-4%.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* llama_model and llama_hparams
* llama_build_context
Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)
* LLM_TN
llama.cpp compilation: 50 s -> 33 s
* llama_quantize
* arch names
* All graph building is now in llm-build-context.cpp
* hparams loading
llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.
* We are now at 6 seconds to build the src folder
* load -> create
We are not actually loading the tensors, but just creating them.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>