Commit Graph

17 Commits

Author SHA1 Message Date
Kawrakow
0ff3a43289 Bring back #1333 and #1335 (#1340)
* Bring back fused delta net 3

* Remove autoregressive and chunking
2026-02-28 14:31:42 +01:00
Kawrakow
1922449b2c Revert delta net 3 (#1339)
* Revert "Simplify delta-net (#1335)"

This reverts commit e5fc30244c.

* Revert "Fused delta net 3 (#1333)"

This reverts commit 7b68353e09.
2026-02-28 13:12:08 +01:00
Kawrakow
e5fc30244c Simplify delta-net (#1335)
* Simplify delta-net

* Minor

* Minor
2026-02-28 11:12:19 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Samuel Oliveira Alves
09a88c9ae5 Add MTP decoding support for GLM-4.x MoE (#1270)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic
2026-02-22 18:14:39 +01:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a big somewhere

* This finally works
2026-01-18 08:02:49 +02:00
Kawrakow
519405dc97 Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitely set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so, 3 backends) not using async is slightly faster

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels

* Be more careful with having set the device before using a stream

* Command line option to turn on async. Set to false by defualt for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00
Kawrakow
5e64235d4c Be able to set reduce op data type for split mode "graph" (#1087)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-24 14:01:29 +01:00
Nexes the Elder
d1dd45b4b9 add split-mode-graph-scheduling parameter (#1068)
Use -smgs or --split-mode-graph-scheduling in CLI to bypass the disabling of split mode graph scheduling when tensor overrides is used.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2025-12-17 07:58:19 +01:00
Kawrakow
18fdd80eaf Hadamard transforms for K-cache - CPU only (#1033)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-04 06:51:11 +01:00
Kawrakow
6b9d1bf4b4 Graph reuse (#947)
* Add mainline compatible FA command line option

* Graph reuse: add command line argument to turn it on

* WIP

* This seems to work

* This is perhaps cleaner

* Change the command line option to -gr

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:19 +02:00
Kawrakow
532a05e466 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
Kawrakow
fb0d5a995c RoPE cache (#887)
* Introducing rope cache

When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?

This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 18:42:20 +02:00
firecoperana
904e994bfb Support --device and --device-draft parameter (#866)
* add --device and --device-draft parameter

* don't print debug message in release mode

* fix

* bug fix to throw exception when no device specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
Kawrakow
0549be76e5 Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
cde642e591 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attemt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
Kawrakow
4daff01b39 Refactor file llama.cpp (#823)
* llama_model and llama_hparams

* llama_build_context

Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)

* LLM_TN

llama.cpp compilation: 50 s -> 33 s

* llama_quantize

* arch names

* All graph building is now in llm-build-context.cpp

* hparams loading

llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.

* We are now at 6 seconds to build the src folder

* load -> create

We are not actually loading the tensors, but just creating them.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 11:35:20 +03:00