Commit Graph

278 Commits

Author SHA1 Message Date
Kawrakow
1e6d36b1b4 Graph parallel for dense Qwen-3.5 models (#1331)
* Graph parallel for idense Qwen-3.5 models

* Cleanup
2026-02-27 07:03:25 +01:00
Kawrakow
0aa6f7e7cd iAdding support for dense Qwen-3.5 models (#1326) 2026-02-26 08:51:01 +01:00
Kawrakow
2616efa296 Fused delta net 2 (#1320)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split in 2000 parts, and those who do,
actually want to see the biffer sizes.
2026-02-26 06:53:43 +01:00
firecoperana
3fac78c48b server: enable checkpoint for recurrent models (#1310)
* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-26 06:51:18 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Nexes the Elder
0bf7043a7b Display the size of the tensors overriden during the tensor loading (#1318)
* Display the size of the tensors overriden during the tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And pass in debug the later displayed size of the unnamed buffer overrides.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display is cluttering the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-02-25 07:36:27 +01:00
Nexes the Elder
170467e835 Llama-quantize: Partial requant feature (#1313)
* Partial Requant feature for llama-quantize

- Inspired by the recently portcopied --dry-run feature.
- Allows to partially requantize a split quantized .gguf by requantizing only the missing splits in the destination directory.
- Works both for GGUF which are split tensors by tensors, or by group of several tensors (though this one is not very much tested beyond 2 tensors by split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup
2026-02-25 07:25:15 +01:00
dungquixote42
aaa545c3dc adaptive p: collect probability before logit bias (#1314) 2026-02-24 15:39:17 +01:00
Kawrakow
7065488135 Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Kawrakow
cfb6747776 llama-quantize: --dry-run option (#1309) 2026-02-24 15:21:52 +01:00
Kawrakow
5dacb5355a Graph parallel for Qwen3-Next (#1292)
* WIP

* This works, but is slower than split mode layer
2026-02-23 07:58:00 +01:00
Kawrakow
89b1e2b518 Better estimate for max. nuber of compute nodes (#1296)
* Better estimate for max. nuber of compute nodes

* Just in case
2026-02-22 18:16:49 +01:00
Samuel Oliveira Alves
09a88c9ae5 Add MTP decoding support for GLM-4.x MoE (#1270)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic
2026-02-22 18:14:39 +01:00
firecoperana
66323b92f7 Qwen3.5-MoE: fix regenerating message error (#1295)
Co-authored-by: firecoperana <firecoperana>
2026-02-21 18:24:12 +01:00
Kawrakow
13c3d83ce7 Qwen3.5-MoE support (#1288)
* WIP: loads and runs, but not correct

Very high PPL, empty TG.

* This appears to work
2026-02-21 08:33:06 +01:00
dungquixote42
0f411b02e2 Fix adaptive p sampler bug with string ban (#1287)
* adaptive p: upadte internal state only if not rewinding

* adaptive p: conditional update for speculative decoding

* adaptive p: refactor to rewind instead of update

* adaptive p fix: better comments

* fix rewind check

* add record to handle multi-token rewind

* better comment
2026-02-20 07:11:36 +01:00
Kawrakow
e30198a553 WIP: Qwen3Next (#1266)
* qwen3next: add architecture support and recurrent-state fixes

* qwen3next: optimize broadcast sub and single-seq ssm conv

* cuda: build MoE row mapping on device in mul_mat_id

* cuda: add guarded multi-seq fast path for ssm_conv

* docs: update qwen3next perf report for cuda MoE/SSM tuning

* cuda: reduce qwen3next moe/ssm sync overhead and refresh eval

* qwen3next: split cpu/cuda eval builds and tune PP scheduling

* qwen3next: harden seq-state flow and support optional dense FFN layers

* qwen3next: trim delta-net graph overhead in chunking path

* qwen3next: remove redundant v_conv cont in delta path

* qwen3next: avoid extra cont on linear attention output

* qwen3next: drop redundant cont before recurrent state flatten

* qwen3next: keep recurrent state in 4d layout through delta path

* qwen3next: add fused delta-net op and wire model path

* tests: add backend-op coverage for ggml_delta_net

* qwen3next: add runtime switch for fused delta-net path

* docs: refresh qwen3next perf review and benchmark matrix

* qwen3next: default fused delta-net off and document quality checks

* qwen3next: add decode-only fused delta mode

* qwen3next: make fused delta safe by default and fix fused tensor layout

* qwen3next: warn when forcing fused decode mode

* qwen3next: add fused-delta regression runner script

* qwen3next: integrate fused regression into eval harness

* qwen3next: clean up chunked delta-net shape handling

* qwen3next: add absolute sanity guards to fused regression

* qwen3next: add unified regression runner script

* qwen3next: disable flash-attn for cpu-only contexts

* docs: reconcile qwen3next status and remaining upstream gaps

* common: add qwen3next fused-delta runtime flag

* cuda: add qwen3next delta-net kernel dispatch override

* docs: update qwen3next quality and serving baseline findings

* qwen3next: keep fused delta on safe path and remove PR artifacts

* qwen3next: align autoregressive delta-net decode layout

* Revert "qwen3next: align autoregressive delta-net decode layout"

This reverts commit 9241164a5e.

* cuda: port solve-tri fast-paths for qwen3next delta-net

* qwen3next: add fused-delta runtime flag and drop env toggle

* qwen3next: make fused delta single-flag and default on

* Account for GPU arch differences

* Revert "cuda: build MoE row mapping on device in mul_mat_id"

This reverts commit 89e9ecfa84.

* qwen3next: drop non-essential MoE scheduling and split heuristics

* qwen3next: avoid generic ggml_sub broadcast changes

* llama: restore only_active_experts log message

* Remove unnecessary hacks, disable fusion for now.

* qwen3next: port hybrid recurrent state memory semantics

* qwen3next: clean up recurrent state slot plumbing

* qwen3next: fix hybrid V-cache layout plumbing

* qwen3next: guard recurrent state slots against kv capacity

* qwen3next: persist recurrent state in session data

- serialize/restore qwen3next cache.s_l in state/session paths\n- bump session and sequence-state file versions for format change\n- fallback to single-token chunking for mixed repeated seq_id batches

* qwen3next: drop unused fused-delta builder path

- remove dead build_delta_net_fused lambda\n- remove unused llm_build_context::fused_delta member

* qwen3next: remove unused fused-delta CLI/context plumbing

- drop -fd/-no-fd options and related YAML dump field\n- remove fused_delta fields from public/internal context params\n- remove fused_delta assignment and logging in context init

* ggml: remove unused DELTA_NET operator stack

* Missing include

* Reorder ops/unary ops

So we don't change again the enum values of the mul mat ops

* Minor

* Discard unnecessary changes in llama-build-context.cpp

* Minor

* Revert "Discard unnecessary changes in llama-build-context.cpp"

This reverts commit edadb80ed6.

* Increase GGML_SCHED_MAX_SPLITS - required for larger u-batches

* Fix CPU concat in the TG case: 7.25 -> 10.5 t/s for Qwen3Next

* Fix CPU sum_rows: 10.5 -> 13.6 t/s for Qwen3Next

It was single-threaded and was taking ~25% of the computation time
during TG. It is now down to 2%.

Strangely enough, I measure 13.6 t/s with llama-bench, but if I
let the model give me an actual response with llama-cli, I get close
to 17 t/s.

* Fix CPU scale: 13.6 -> 16.7 t/s for Qwen3Next

For Qwen3Next there is a scale op on a largish tensor (548k elements)
that has a single row for TG, so was done in a single thread.
We now simply use blocks of 1024 elements.

* Optimize CPU mul: 16.7 -> 17.6 t/s for Qwen3Next

* CPU: fuse transpose -> cont -> sum_rows -> transpos: 17.6 -> 23.1 t/s for Qwen3Next

* Optimize CPU repeat: 176 -> 200 t/s for Qwen3Next PP-512

* Multithreading for OP_SUB

* Don't commit with timing trace on

* Multithread neg and sigmoid

* Be able to turn on/off fusion more easily (CPU)

* Name the mul_mat ops so we know where the time goes

* WIP

* Much better PP on CUDA

* CUDA: fuse transpose -> cont -> sum_rows -> transpose

Needs non-coontiguous variant of sum_rows.
On the CPU this gave 30+% improvement in TG performance,
on CUDA ist is disapointing 6-7%. I guess, this is because
Georgi's cont CPU implementation was so bad that skipping
it made such a big difference.

* CUDA: faster mul for special case relevant for Qwen3Next

Worth 1% in TG

* Fix CPU OP_CONT

---------

Co-authored-by: yurko <yurko@local>
Co-authored-by: Yurko <yurko@example.com>
Co-authored-by: yurko <yurko@pop-os.tail5a1a6b.ts.net>
Co-authored-by: Yurko Hoshko <YurkoHoshko@users.noreply.github.com>
2026-02-16 06:50:28 +01:00
Kawrakow
528cadb07b GLM-5 support (#1268) 2026-02-15 07:49:44 +01:00
firecoperana
1cb7e1bf39 spec : add self speculative decoding, ngram and refactor (#1261)
* spec : add self speculative decoding and ngram-mod and refactor

common : use common_ prefix for common library function

llama : use LLAMA_TOKEN_NULL

spec : add self speculative decoding (no draft model required) + refactor

spec : add ngram-mod

spec : various improvements ton ngram-map + docs

spec : fix the check-rate logic of ngram-simple

common : add common_speculative_is_compat()

spec : simplify time measurement using common_time_meas

refactor common_sampler_init

refactor common_token_to_piece

refactor and fix cur_p bug

clean up

* spec : remove check rate

* spec: show warnings instead of abort

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Sascha Rogmann <59577610+srogmann@users.noreply.github.com>
2026-02-13 19:04:55 +01:00
Kawrakow
c5d74f66e2 Fix graph parallel when ngl < n_layers (#1241)
* Fix graph parallel when ngl < n_layers

* Fix using ffn_norm

When using graph parallel with ngl < n_layers, the ffn_norm tensor
may have ended up being split, while the ffn tensors are on the CPU.
In that case we will get a crash because we attempt to use the not-split
buffer of ffn_norm, which is invalid. Thi commit fixes that.

* Cleanup
2026-02-06 11:48:24 +02:00
Kawrakow
81ea911f0d Graph parallel for Step-3.5-Flash (#1236)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)

* WIP: graph parallel for Step-3.5

* WIP

* This should be it

* Cleanup

* Fix merge
2026-02-06 06:56:51 +02:00
Kawrakow
9c1c74acda Step-3.5-Flash support (#1231)
* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)
2026-02-05 08:13:22 +02:00
Kawrakow
b41b8cf813 Graph parallel for SEED-OSS (#1222)
* Graph parallel for SEED-OSS

* Cleanup
2026-02-04 16:07:43 +02:00
firecoperana
7e8d444033 llama : add token matching support to llama-grammar (#1220)
* llama : add token matching support to llama-grammar

llama : add token matching support to llama-grammar (#17816)

common/grammar : replace problematic backtracking regex `[\s\S]*` (#18342)

* disable tests and fix warnings

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-03 07:57:17 +02:00
saood06
8ba7e2b40c Add support for Seed-OSS (#1218)
* it compiles

* Fix constants.py
2026-02-03 07:39:45 +02:00
dungquixote42
b86d8024a5 Adaptive p: history update fix + temp as flag (#1213)
* adaptive_p: fix history update + use current probability for high temp

* adaptive_p: fix history update bug, update with current probability if temp is high

* replace temp-as-signal with server argument

* adaptive_p: rename ema_w_cur_p to updt_w_cur

* delete test code
2026-02-03 07:36:12 +02:00
Kawrakow
4d13ae03b5 Also these other two places 2026-01-30 15:36:29 +00:00
Kawrakow
098b1a2e04 Fix MiniMax-M2 KV-cache loading/saving 2026-01-30 13:38:07 +00:00
Kawrakow
68ed62447c Split mode graph for Minimax-M2 (#1195)
* Split mode graph for Minimax-M2

* Cleanup

* Forgotten ffn_exp_probs_b
2026-01-29 07:27:06 +02:00
Kawrakow
2a7cc09149 Remove llamafile remnants (#1179) 2026-01-22 13:20:23 +02:00
Kawrakow
851fda3509 Split mode graph: use CUDA graphs (#1177)
* Use GUDA graphs also when theretensor overrides

* Change graph key

* This seems to work
2026-01-22 12:38:36 +02:00
Kawrakow
996e77047a Avoid ggml_get_rows if not necessary (#1160)
* Copy reduce result to other GPUs if necessary

* Avoid ggml_get_rows for TG

* For the output ops use the result of the split that ran on the main GPU

* More models
2026-01-20 15:38:21 +02:00
Kawrakow
98b30e5e81 Faster adaptive_p sampling (#1165)
* A hopefully more efficient adaptive_p sampling

* Once at it, lets fix the formatting too

* More formatting

* Hopefully better

* This should be better

* Correctly accumulate adaptive_p sampling time

* AVX2
2026-01-19 16:03:09 +02:00
Kawrakow
fa58c20c42 A hopefully more efficient adaptive_p sampling (#1161)
* A hopefully more efficient adaptive_p sampling

* Once at it, lets fix the formatting too

* More formatting

* Correctly accumulate sampling time for adaptive_p
2026-01-19 15:01:55 +02:00
dungquixote42
6dfbef27ec Adaptive p: bugfix + optimization + refactor (#1155)
* adaptive-p sampler: fix zeroed orig_probs bug and refactor

- Fix bug where original probabilities were captured as zero by calculating
  them from logits in llama_prep_adaptive_p (new).
- Replace vector with unordered_map to track candidate probabilities,
  filtering for relevance via logit delta (16.6f).
- Standardize API naming: llama_<action/verb>_<focus/name/topic>_<extra/info>
- Update function signatures to follow most other samplers.

* resolve merge bug

* adaptive-p: revert reordering function definitions
2026-01-18 08:26:06 +02:00
Kawrakow
7024fdbc72 Additional graph reduce types for split mode graph (#1154)
* WIP: add Q8_0 and BF16 as possible reduce types

Does not work - there is a big somewhere

* This finally works
2026-01-18 08:02:49 +02:00
firecoperana
1a461525d5 server: stop processing the prompt when client disconnects (#1134)
implement generator-based API for task results

Update httplib.h to 0.27.0

Fix embedding error

Stop prompt processing when disconnected

Co-authored-by: firecoperana <firecoperana>
2026-01-13 07:56:59 +02:00
Kawrakow
c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)
* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, extremely stupid. The only way I could correctly repack the
up/gate experts is to copy up and gate into host buffers, repack
into another host buffer, copy back into the ffn_up_gate_exps tensor.
This is going to be very slow for giant 500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors was unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup
2026-01-12 18:30:53 +02:00
dungquixote42
52ad1c6421 Implement Adaptive-P Sampler (#1100)
* initial implementation of adaptive-p sampler

* explicitly mark candidates unsorted + cleanup qualifiers

* cosmetic update

* reorg prototypes

* lockstep with mainline

* add _impl for _init + reorg

* add LLAMA_API to prototypes

* update sharpness to 10

* lockstep: rng seed

* delete llama_sampling member in llama_sampler_adaptive_p

* fix LLAMA_API return type

* lockstep: rng seed cont

* actually correct implementation

* lockstep: sorting behavior

* const -> constexpr for known constants

* add missing space

* fix softmax usage in adaptive p sampler

* cosmetic changes

* implement do-not-sort version of softmax

* simpify rng seed, add static to constexpr

* refactor: remove iface + use shared rng + use actually original probabilities

* adaptive-p: add dedicated rng back in

* fix initial max_logit + add float vector to adaptive p sampler context + stochastic sampling

* adaptive-p: fuse first softmax with transformation

* adaptive-p: implement binary search selection

* adaptive-p: update comment
2026-01-10 07:58:53 +02:00
Kawrakow
eaf2e1c15a Split mode "graph" for Ernie-4.5-MoE (#1121)
* Ernie-4.5-MoE split mode graph

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 16:46:41 +02:00
Kawrakow
5ef98f8b0f Split mode "graph" for GPT-OSS (#1118)
* Split mode "graph" for GPT-OSS

* Force split_mode_f16 to false

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-08 09:14:15 +02:00
Kawrakow
99fbd84971 Split mode "graph" for Hunyuan-MoE (#1116)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 13:38:08 +02:00
Kawrakow
ab1616767b Enable up to 4 GPUs for Mimo2-Flash (#1115)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-07 09:40:29 +02:00
Kawrakow
3c99284b67 Split mode 'graph' fpr Qwen3-VL (#1107)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 17:32:00 +02:00
Kawrakow
218dcc5727 Split mode graph for Qwen3 (#1106)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 14:31:36 +02:00
Kawrakow
419a397ce0 Graph parallel for Mimo-V2-Flash (#1105)
* WIP

* Cleanup

* Set max_gpu to 2 for Mimo2

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 09:58:54 +02:00
Kawrakow
ab50c6cdcb Mimo-V2-Flash support (#1096)
* Mimo-2 support

* Fix bug for head sizes not being the same

It still does not solve the Mimo-2 quantized cache issue.

* Fix quantized cache

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2026-01-05 08:00:01 +02:00
firecoperana
56dceefd6b Fix windows build with CUDA (#1101)
Co-authored-by: firecoperana <firecoperana>
2026-01-05 07:59:23 +02:00
Kawrakow
f878adbe90 Turn on graph reuse by default (#1094)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:27:16 +01:00
Kawrakow
519405dc97 Async compute graph evaluation (2 or more GPUs) (#1089)
* WIP: absorb adding input into std_attn and std_ffn

* WIP: NCCL infra

* WIP: add reduce and fake_cpy ops

* WIP

* WIP: graph appears to work, layer is broken

* WIP: Qwen3-MoE works with graph, layer still broken

* WIP: GLM-4.5 graph works

* WIP: fix sm layer (dense)

* WIP: fix sm layer (MoE)

* WIP: fast PP with bespoke 4-GPU NCCL

I guess, I'm not using NCCL the right way as PP is very
low with a single communicator group for 3 or more GPUs.
But if I create 4 communicator groups for pairs of GPUs
(0,1, 2,3, 0,2, 1,3) and use that, PP is fast: I'm hitting
1500 t/s for L3-70B on the 4x3090 system, which is
~20% better than the previous sm graph without NCCL.
But that cannot be the solution (I cannot be creating pairwise
communicators and associated logic for every possible number of GPUs).

* WIP: Cohere2

* Explicitely set device

* Bespoke 3-GPU case

* WIP

* Do not repeat get_rows multiple times

* Fix 3 GPUs

* OK, let's leave it in

* Simple async

* This sync seems enough

* Only do async for 4 or more backends

With 2 GPUs (so, 3 backends) not using async is slightly faster

* Scheduler changes

* Use OpenMP if available

Surprisingly (at least to me), this is quite a bit faster than
std::thread and std::barrier. GLM-4.5-AIR with 4 GPUs is now
at 105 t/s at zero context!

* Do not use OpenMP if there are tensor overrides

* Set omp max active levels

* Be more careful with having set the device before using a stream

* Command line option to turn on async. Set to false by defualt for now

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-12-27 08:18:06 +01:00