Commit Graph

4247 Commits

Author SHA1 Message Date
Kawrakow
a568e12c8f Minor delta-net tweak (#1337) 2026-03-01 17:45:02 +01:00
Kawrakow
04c140fe54 Make vision work with Qwen-3.5 models (#1345) 2026-03-01 17:44:37 +01:00
Kawrakow
0ff3a43289 Bring back #1333 and #1335 (#1340)
* Bring back fused delta net 3

* Remove autoregressive and chunking
2026-02-28 14:31:42 +01:00
Kawrakow
1922449b2c Revert delta net 3 (#1339)
* Revert "Simplify delta-net (#1335)"

This reverts commit e5fc30244c.

* Revert "Fused delta net 3 (#1333)"

This reverts commit 7b68353e09.
2026-02-28 13:12:08 +01:00
Kawrakow
e5fc30244c Simplify delta-net (#1335)
* Simplify delta-net

* Minor

* Minor
2026-02-28 11:12:19 +01:00
Kawrakow
702e0765b8 Update README with clarification on '_XL' models
Clarified warning about Unsloth '_XL' models in README.
2026-02-27 16:22:10 +01:00
Kawrakow
7b68353e09 Fused delta net 3 (#1333)
* This is better than chunked

* Keep the state in registers

* Cleanup

* Remove unused stuff

* Minor

* Make fused delta-net the default

* Fix race
2026-02-27 15:02:56 +01:00
Kawrakow
1e6d36b1b4 Graph parallel for dense Qwen-3.5 models (#1331)
* Graph parallel for dense Qwen-3.5 models

* Cleanup
2026-02-27 07:03:25 +01:00
Kawrakow
facc8fdc44 Very slightly better fused delta-net (#1330) 2026-02-27 07:03:09 +01:00
Kawrakow
62a7dcac5a Move the Qwen-3.5 models to the standard attention mechanism (#1329) 2026-02-26 15:50:51 +01:00
Kawrakow
757bee6238 Add special FA handling for dense Qwen3.5 (#1328) 2026-02-26 11:27:41 +01:00
Kawrakow
0aa6f7e7cd Adding support for dense Qwen-3.5 models (#1326) 2026-02-26 08:51:01 +01:00
Kawrakow
2616efa296 Fused delta net 2 (#1320)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name

* Don't re-apply L2 norm - it has already been done

* This seems quite a bit better

* More tweaks

* Restore per context buffer size log

Not everybody uses models split into 2000 parts, and those who do
actually want to see the buffer sizes.
2026-02-26 06:53:43 +01:00
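The "change meaning of fdn from bool flag to threshold value" step above can be sketched as follows; the function name, parameter names, and the direction of the comparison are assumptions for illustration, not the actual implementation:

```cpp
// Hypothetical illustration: instead of a boolean on/off flag, the fused
// delta-net path is taken only when the batch is small enough relative to
// a user-supplied threshold. Names and comparison direction are assumptions.
static bool use_fused_delta_net(int n_tokens, int fdn_threshold) {
    if (fdn_threshold <= 0) {
        return false; // a non-positive threshold disables the fused path
    }
    return n_tokens <= fdn_threshold;
}
```

A threshold subsumes the old boolean: 0 behaves like "off", a very large value like "always on", and anything in between lets the fused path win where it is fastest (token generation) while the chunked path handles large prompt batches.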
Kawrakow
87b35dac0c Faster quantization for MoE models with many experts (#1322) 2026-02-26 06:52:28 +01:00
firecoperana
3fac78c48b server: enable checkpoint for recurrent models (#1310)
* server: enable checkpoint for recurrent models

create checkpoint after cancel

fix ban string and rm context during rewind

add checkpoint interval

only save recurrent cache

* save checkpoint during pp

---------

Co-authored-by: firecoperana <firecoperana>
2026-02-26 06:51:18 +01:00
Kawrakow
216f44363f Fix KT quantization yet again (#1321)
* Fix KT quantization yet again

* Add same 1e-16f check for all quants in iqk_quantize.cpp

* Fixes for k-quants

* Also this one
2026-02-25 18:07:12 +01:00
Kawrakow
c77ec4b8b8 Fused delta-net (#1315)
* Revive fused delta-net

* Add command line argument for fused delta net

* Simplify/improve CUDA delta-net

* Add -fdn to llama-bench

* More CUDA fused delta net optimizations

* CPU optimizations

* Much faster fused delta-net on the CPU

It seems it is faster than the chunked implementation!

* Change meaning of fdn from bool flag to threshold value

* Use eps = 1e-6

* Give some nodes a name
2026-02-25 14:12:48 +01:00
Nexes the Elder
0bf7043a7b Display the size of the tensors overridden during tensor loading (#1318)
* Display the size of the tensors overridden during tensor loading

Ex:

`Tensor blk.60.ffn_gate_exps.weight buffer type overriden to CPU
Tensor blk.60.ffn_up_exps.weight buffer type overriden to CPU`

become

`Tensor blk.60.ffn_up_exps.weight (size = 668467200 bytes) buffer type overriden to CPU
Tensor blk.60.ffn_gate_exps.weight (size = 668467200 bytes) buffer type overriden to CPU`

And demote to debug level the later-displayed size of the unnamed buffer overrides.

Ex : `llm_load_tensors:        CPU buffer size =   XXX.XX MiB`

That double display clutters the screen without being very informative.

* change bytes display to MiB.

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>

---------

Co-authored-by: Kawrakow <iwankawrakow@gmail.com>
2026-02-25 07:36:27 +01:00
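The "change bytes display to MiB" follow-up above amounts to a one-line conversion before logging; a minimal sketch, where the helper name is an assumption and not code from the PR:

```cpp
#include <cstdint>

// Hypothetical helper illustrating the "change bytes display to MiB" step
// mentioned above; the function name is an assumption, not code from the PR.
static double bytes_to_mib(uint64_t bytes) {
    return (double)bytes / (1024.0 * 1024.0);
}
```

With the 668467200-byte tensor quoted in the commit message, the logged size becomes 637.50 MiB, which is easier to scan than a raw byte count.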
Nexes the Elder
170467e835 Llama-quantize: Partial requant feature (#1313)
* Partial Requant feature for llama-quantize

- Inspired by the recently ported --dry-run feature.
- Allows partial requantization of a split quantized .gguf by requantizing only the splits missing from the destination directory.
- Works for GGUFs split tensor by tensor as well as by groups of several tensors (though the latter is not well tested beyond 2 tensors per split).
- Vibe coded.

* Create output directory if it doesn't exist in llama-quantize

* Create output directory if it doesn't exist in gguf-split

* Add exit when directory fails to be created on Windows

* Use std::filesystem

* cleanup
2026-02-25 07:25:15 +01:00
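The "create output directory if it doesn't exist" steps above, with the final move to std::filesystem, can be sketched like this; the function name and error handling are assumptions, not the actual llama-quantize code:

```cpp
#include <filesystem>
#include <system_error>

// Sketch of the "create output directory if it doesn't exist" step from the
// commit above, using std::filesystem as the later commits indicate. The
// function name and error handling are assumptions, not the actual code.
static bool ensure_output_dir(const std::filesystem::path &dir) {
    std::error_code ec;
    // create_directories creates all missing parents and is a no-op when the
    // directory already exists; the error_code overload never throws.
    std::filesystem::create_directories(dir, ec);
    return !ec && std::filesystem::is_directory(dir);
}
```

Using the error_code overload keeps the portable behavior the commits were after: the same call works on Windows and Linux, and the caller can exit cleanly when creation fails instead of catching an exception.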
Joshua Jolley
68431b049a server: propagate task index to response objects for batch requests (#1303)
When multiple prompts are sent in a single /v1/completions request,
each response needs to carry the correct index so the client can
match results to their corresponding prompts. The index field was
not being set on partial responses, final responses, or embedding
responses, causing batch results to all report index 0.

Set res->index = slot.task->index in send_partial_response,
send_final_response, and send_embedding.

Generated with [Devin](https://cli.devin.ai/docs)

Co-authored-by: Joshua Jolley <jjolley@clearwateranalytics.com>
Co-authored-by: Devin <noreply@cognition.ai>
2026-02-24 15:39:38 +01:00
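The index-propagation fix above reduces to copying the task's index onto every outgoing response object; a minimal sketch with hypothetical stand-in structs (the real change sets res->index = slot.task->index in send_partial_response, send_final_response, and send_embedding):

```cpp
// Hypothetical stand-ins for the server's task and response types.
struct server_task     { int index; };
struct server_response { int index = 0; };

// Without this copy, every response in a batch reports index 0 and the
// client cannot match results to their prompts.
static server_response make_response(const server_task &task) {
    server_response res;
    res.index = task.index; // the one-line fix, applied in all three senders
    return res;
}
```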
dungquixote42
aaa545c3dc adaptive p: collect probability before logit bias (#1314) 2026-02-24 15:39:17 +01:00
Kawrakow
38ca19d828 Minor delta-net tweak (#1308)
* Make sure we pick the reduced tensor from the right GPU

* Minor

* Minor delta-net tweak
2026-02-24 15:22:57 +01:00
Kawrakow
7065488135 Slightly better graph parallel for Qwen3-Next (#1307)
* Make sure we pick the reduced tensor from the right GPU

* Minor
2026-02-24 15:22:30 +01:00
Kawrakow
cfb6747776 llama-quantize: --dry-run option (#1309) 2026-02-24 15:21:52 +01:00
TheAIGuyFromAR
96b8298472 Fix typo in merge-up-gate-experts argument (#1311) 2026-02-24 15:13:22 +01:00
Kawrakow
68bd30d99c Fix max nodes (again) (#1306) 2026-02-23 11:17:37 +01:00
Kawrakow
2bb40f8c35 Fix llm_arch_is_hybrid (#1305) 2026-02-23 08:55:53 +01:00
Kawrakow
5dacb5355a Graph parallel for Qwen3-Next (#1292)
* WIP

* This works, but is slower than split mode layer
2026-02-23 07:58:00 +01:00
Yap Sok Ann
dcf50d8279 Fix tool call for Qwen3.5 (#1300)
* Fix tool call for Qwen3.5

Loosely based on mainline changes from:
* https://github.com/ggml-org/llama.cpp/pull/19635
* https://github.com/ggml-org/llama.cpp/pull/19765

Also need to change the grammar to allow the model to make multiple
tool calls in a row. This was likely broken for Qwen3 Coder prior to
this commit.

* Fix the grammar for the subsequent parameters after the first one
2026-02-23 07:54:56 +01:00
firecoperana
efc294cc39 server: fix crash from adaptive p (#1304)
Co-authored-by: firecoperana <firecoperana>
2026-02-23 07:25:52 +01:00
Kawrakow
89b1e2b518 Better estimate for max. number of compute nodes (#1296)
* Better estimate for max. number of compute nodes

* Just in case
2026-02-22 18:16:49 +01:00
Samuel Oliveira Alves
09a88c9ae5 Add MTP decoding support for GLM-4.x MoE (#1270)
* wip: port MTP architecture

Ports the Multi-Token Prediction (MTP) architecture to the older `llama.cpp` codebase used by `ikllama`.

Changes include:
- Updating `llama_batch` to support `mtp_params`.
- Modifying `llama_decode_internal` (and `encode`) to handle MTP operations (Warmup, Update, Draft).
- Adding public APIs for MTP state management (`llama_set_draft_input_hidden_state`).
- Adapting the embedding extraction logic to skip MTP update passes.

* Refactors `server_slot` to support generic speculative decoding (MTP or Draft Model).

* core: enable hybrid outputs (logits + embeddings) for MTP support

* fix(mtp): correct KV-cache slot finding for updates

* fix(mtp): persist hidden states to prevent context corruption during drafting

* refactor(mtp): clean unused code

* fix(mtp): update server to new functions name

* fix(mtp): fix graph and save hidden state

* mtp: refactor integration, context params and kv cache search

* mtp: fix hidden state extraction and speculative acceptance flow

* server: fix MTP warmup for long prompts and reset token buffer

* llama: refactor MTP operation state to context parameters

* server: fix n_past calculation in MTP acceptance

* llama: fix mtp enable flags

* speculative: refactor MTP to use common_speculative interface

* context: remove unused signatures

* clip: fix deprecated enum-enum conversion warning

* common: fix format string crash in help message

* context: fix mtp activation logic
2026-02-22 18:14:39 +01:00
Kawrakow
cbf7fc7e2f Update README with warning about '_XL' models from Unsloth
Added important note regarding quantized models from Unsloth.
2026-02-22 07:42:17 +01:00
Kawrakow
bd387a279a Add new authors to the AUTHORS file 2026-02-21 19:20:31 +01:00
firecoperana
66323b92f7 Qwen3.5-MoE: fix regenerating message error (#1295)
Co-authored-by: firecoperana <firecoperana>
2026-02-21 18:24:12 +01:00
Kawrakow
13c3d83ce7 Qwen3.5-MoE support (#1288)
* WIP: loads and runs, but not correct

Very high PPL, empty TG.

* This appears to work
2026-02-21 08:33:06 +01:00
mcm007
b2cb4512c5 Create parameters overview (#1269)
* raw parameters.md

* fix small typos in common.cpp

* Update build args in parameters.md

* Update parameters.md

- format as table
- sections

* Update README.md

- quickstart
- build and run

* Update parameters.md

other tools examples

* add PR links

* multiple updates to parameters.md

- description
- add jargon section
- add suggestions from feedbacks

* don't imply that only linux is supported in README.md

* add alias to parameters.md

* Update README.md with recent models and features

* Update parameters.md with latest features

* address suggestions

- no-ooae
- placeholder for common commands
- no-kv-offload
- llama-sweep-bench
- placeholder for unique parameters

* specify Linux distro in README.md
2026-02-20 07:20:56 +01:00
dungquixote42
0f411b02e2 Fix adaptive p sampler bug with string ban (#1287)
* adaptive p: update internal state only if not rewinding

* adaptive p: conditional update for speculative decoding

* adaptive p: refactor to rewind instead of update

* adaptive p fix: better comments

* fix rewind check

* add record to handle multi-token rewind

* better comment
2026-02-20 07:11:36 +01:00
rkozuch
b855bf92de Fix slot prompt updating. (#1285)
Co-authored-by: Rkozuch <you@example.com>
2026-02-19 08:15:49 +01:00
Kawrakow
d81cde5cea Fix very low bpw missing imatrix check (#1284) 2026-02-19 08:15:26 +01:00
Samuel Oliveira Alves
51df09be8a Feat - add Kimi 2.5 Vision (#1280)
* port Kimi 2.5 vision from upstream

* feat(clip): add support for Kimi K2.5 vision model
2026-02-19 08:15:03 +01:00
Kawrakow
04cf685e82 Factor out delta net (#1286)
* WIP: factor out delta net implementation

* WIP

* Use the standard FFN functions

* More standard attn for Qwen3-Next
2026-02-18 17:16:17 +01:00
Kawrakow
d2d65c0d64 Better CPU performance for Qwen3-Next (#1283)
* Better CPU silu - +4% PP

* Improve ggml_compute_forward_dup_bytes
2026-02-18 15:55:11 +01:00
Kawrakow
84831fc3ee Don't disable CUDA graphs for Qwen3-Next (#1278) 2026-02-18 08:47:45 +01:00
Kawrakow
cafeef484c More Qwen3-Next optimizations (#1277)
* Optimizing q3next TG

* Fused add -> softplus -> mul on CUDA

* Remove forgotten debug log

* Increase ggml context size

Required for Qwen3-Next with batch/u-batch size of 4096

* WIP

* Avoid some contiguous ops

* Avoid some repeats

* Avoid some more repeats
2026-02-17 16:03:51 +01:00
Samuel Oliveira Alves
88f98c891d server: add string ban in speculative path (#1274) 2026-02-17 12:33:28 +01:00
Kawrakow
16fe459a49 Faster CPU PP performance for Qwen3-Next - optimize concat (#1276) 2026-02-17 11:46:27 +01:00
Kawrakow
35c99f9f41 Faster Qwen3-Next PP on CUDA - optimize concat (#1275) 2026-02-16 11:46:39 +01:00
Kawrakow
97e7c091cd Update AUTHORS file with new contributors
Added new contributors to the AUTHORS file.
2026-02-16 07:13:25 +01:00
firecoperana
868ac2128e fix build error (#1272)
Co-authored-by: firecoperana <firecoperana>
2026-02-16 06:51:03 +01:00