Commit Graph

3912 Commits

Author SHA1 Message Date
Kawrakow
cde642e591 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attempt at grouped topk

* This seems to do the trick for grouped expert routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
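For context, grouped expert routing (as used by DeepSeek/Bailing-style MoE models) first ranks expert groups, keeps only the best few, and then takes the final top-k experts from the surviving groups. The sketch below is a minimal CPU illustration of that idea, not the actual ik_llama.cpp code; the function name, the top-2-per-group scoring, and the parameter names are all assumptions.

```cpp
#include <algorithm>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical sketch of grouped top-k routing: split n_expert gate scores
// into n_group groups, rank groups by the sum of their two best experts,
// keep n_group_used groups, then take the final top_k experts only from
// the surviving groups. Assumes group size >= 2 and top_k small.
static std::vector<int> grouped_topk(const std::vector<float> & scores,
                                     int n_group, int n_group_used, int top_k) {
    const int n_expert = (int) scores.size();
    const int group_sz = n_expert / n_group;

    // score each group by the sum of its two best experts
    std::vector<std::pair<float, int>> group_score(n_group);
    for (int g = 0; g < n_group; ++g) {
        std::vector<float> tmp(scores.begin() + g*group_sz, scores.begin() + (g+1)*group_sz);
        std::partial_sort(tmp.begin(), tmp.begin() + 2, tmp.end(), std::greater<float>());
        group_score[g] = { tmp[0] + tmp[1], g };
    }
    std::partial_sort(group_score.begin(), group_score.begin() + n_group_used,
                      group_score.end(), std::greater<std::pair<float,int>>());

    // gather the experts of the selected groups and take the global top_k among them
    std::vector<int> candidates;
    for (int i = 0; i < n_group_used; ++i) {
        const int g = group_score[i].second;
        for (int j = 0; j < group_sz; ++j) candidates.push_back(g*group_sz + j);
    }
    std::partial_sort(candidates.begin(), candidates.begin() + top_k, candidates.end(),
                      [&](int a, int b) { return scores[a] > scores[b]; });
    candidates.resize(top_k);
    return candidates;
}
```

Experts outside the selected groups can never be routed to, which is what the new command-line toggle turns on or off.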
Kawrakow
e66d307e13 Better argsort (CPU) (#835)
* Better argsort (CPU)

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 11:31:03 +03:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
ba9fefb73d gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Viktor Ivakin
2b71974af9 Fix incomplete utf-8 characters in streaming text completions (#810) 2025-10-13 16:25:29 +03:00
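Some context on the streaming fix above: a detokenized chunk can end in the middle of a multi-byte UTF-8 sequence, and flushing it as-is corrupts the stream. The usual remedy is to hold back the incomplete tail bytes until the next chunk arrives. The helper below is an illustrative sketch of that check, not the server's actual code.

```cpp
#include <string>

// Return the number of bytes at the end of `s` that form an incomplete
// UTF-8 sequence (0 if the string ends on a character boundary).
static size_t incomplete_utf8_tail(const std::string & s) {
    const size_t n = s.size();
    for (size_t back = 1; back <= 4 && back <= n; ++back) {
        const unsigned char c = s[n - back];
        if ((c & 0x80) == 0x00) return 0;                     // ASCII byte: complete
        if ((c & 0xC0) == 0xC0) {                             // found the lead byte
            const size_t need = (c & 0xE0) == 0xC0 ? 2 : (c & 0xF0) == 0xE0 ? 3 : 4;
            return back < need ? back : 0;                    // incomplete if too few bytes follow
        }
        // else: continuation byte (10xxxxxx), keep scanning backwards
    }
    return 0;
}

// Usage idea: keep the incomplete tail in a buffer and prepend it to the next chunk.
//   size_t cut = incomplete_utf8_tail(chunk);
//   pending = chunk.substr(chunk.size() - cut);
//   send(chunk.substr(0, chunk.size() - cut));
```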
Kawrakow
4e24d48e63 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* With FA on, create mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
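The gist of the mask changes above: the KQ mask only ever holds 0 or -infinity, so it can be written directly in f16, and filling it is embarrassingly parallel over token rows. The sketch below illustrates that shape of the work; the function name, layout, and threading scheme are assumptions, not the actual ggml implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <thread>
#include <vector>

// Hypothetical sketch: build a causal KQ mask of shape [n_kv, n_tokens]
// directly as f16, with rows split across threads, instead of filling it
// single-threaded in f32 and converting afterwards. Only two values ever
// appear in the mask, so their f16 bit patterns are hardcoded:
// +0.0f -> 0x0000, -inf -> 0xFC00.
static void build_causal_mask_f16(uint16_t * mask, int n_kv, int n_tokens,
                                  int pos0, int n_threads) {
    constexpr uint16_t F16_ZERO    = 0x0000;
    constexpr uint16_t F16_NEG_INF = 0xFC00;
    auto fill_rows = [&](int first, int last) {
        for (int i = first; i < last; ++i) {                 // token row
            const int pos = pos0 + i;                        // absolute position of this token
            for (int j = 0; j < n_kv; ++j) {                 // kv column
                mask[(int64_t) i*n_kv + j] = j <= pos ? F16_ZERO : F16_NEG_INF;
            }
        }
    };
    std::vector<std::thread> workers;
    const int chunk = (n_tokens + n_threads - 1)/n_threads;
    for (int t = 0; t < n_threads; ++t) {
        const int first = t*chunk, last = std::min(n_tokens, first + chunk);
        if (first < last) workers.emplace_back(fill_rows, first, last);
    }
    for (auto & w : workers) w.join();
}
```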
Kawrakow
21a0bfb1c0 Fix PATH_MAX not defined on Windows (#828)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 09:25:57 +03:00
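PATH_MAX is a POSIX constant that Windows headers do not reliably provide (MSVC exposes MAX_PATH instead), so code that relies on it needs a guard along these lines. This is the generic fallback pattern, not necessarily the exact fix in the commit.

```cpp
// Fallback when the platform headers do not provide PATH_MAX
// (e.g. MSVC, which defines MAX_PATH in <windows.h> instead).
#ifndef PATH_MAX
#  ifdef MAX_PATH
#    define PATH_MAX MAX_PATH
#  else
#    define PATH_MAX 4096
#  endif
#endif
```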
Kawrakow
78409c95ff Fix performance regression introduced in #823 (#826)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 08:09:55 +03:00
Kawrakow
764eefd1bc Enable and clean up compiler warnings in src (#824)
* WIP: enable and clean up warnings in src

* All warnings handled

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 16:01:13 +03:00
Kawrakow
4daff01b39 Refactor file llama.cpp (#823)
* llama_model and llama_hparams

* llama_build_context

Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)

* LLM_TN

llama.cpp compilation: 50 s -> 33 s

* llama_quantize

* arch names

* All graph building is now in llm-build-context.cpp

* hparams loading

llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.

* We are now at 6 seconds to build the src folder

* load -> create

We are not actually loading the tensors, but just creating them.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 11:35:20 +03:00
AesSedai
23275ac066 Remove duplicate 99% KLD output, add additional percentiles to match mainline (#817) 2025-10-05 07:13:32 +02:00
Downtown-Case
5a633bb0e9 Mark some multi-prediction tensors as not required. (#814) 2025-10-01 20:37:31 +02:00
Kawrakow
475223079c Attempt to fix AVX2 FA (#807)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-30 08:06:53 +02:00
Kawrakow
9932e6b102 Fix gemma3 vision (#803)
* Remove unnecessary assert in im2col

* Remove unnecessary assert in im2col (CPU)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 11:15:32 +02:00
Kawrakow
e2f21c8dc8 Move minja and nlohmann/json to vendor (#802)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 09:12:35 +02:00
Kawrakow
346f580267 Remove stb_image.h copy in common - it is now in vendor (#801)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:55:42 +02:00
Kawrakow
c1a0e15377 Port mtmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
firecoperana
7d8d232896 sync: vendor (#799)
Co-authored-by: firecoperana <firecoperana>
2025-09-26 18:22:47 +02:00
Kawrakow
bc34573356 CPU: faster FA (#797)
* Avoid computing FA chunks where the mask is -infinity

* Avoid computing FA chunks where the mask is -infinity also for f16/bf16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-26 09:00:25 +02:00
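The optimization above comes down to a cheap pre-check per KV chunk: if every mask entry in the chunk is -infinity (outside a sliding window, or past the causal limit), the dot products and softmax updates for that chunk are skipped. The scalar sketch below shows that control flow with an online softmax; the real kernels operate on tiled f16/bf16/quantized data, and all names here are illustrative.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>

// Simplified single-row attention over KV chunks with an online softmax.
// The key idea from the commit above: if the mask for a whole chunk is -inf,
// the chunk contributes nothing and the expensive part can be skipped.
static void attn_row(const float * q, const float * k, const float * v,
                     const float * mask, float * out,
                     int n_kv, int head_dim, int chunk_size, float scale) {
    for (int d = 0; d < head_dim; ++d) out[d] = 0.0f;
    float max_s = -INFINITY, sum = 0.0f;

    for (int c0 = 0; c0 < n_kv; c0 += chunk_size) {
        const int c1 = std::min(n_kv, c0 + chunk_size);

        bool all_masked = true;                              // cheap pre-check on the mask
        for (int j = c0; j < c1 && all_masked; ++j) all_masked = (mask[j] == -INFINITY);
        if (all_masked) continue;                            // skip the dot products entirely

        for (int j = c0; j < c1; ++j) {
            if (mask[j] == -INFINITY) continue;
            float s = 0.0f;
            for (int d = 0; d < head_dim; ++d) s += q[d]*k[(int64_t) j*head_dim + d];
            s = s*scale + mask[j];

            const float new_max = std::max(max_s, s);        // online softmax update
            const float corr    = std::exp(max_s - new_max);
            const float w       = std::exp(s - new_max);
            for (int d = 0; d < head_dim; ++d) out[d] = out[d]*corr + w*v[(int64_t) j*head_dim + d];
            sum   = sum*corr + w;
            max_s = new_max;
        }
    }
    if (sum > 0.0f) for (int d = 0; d < head_dim; ++d) out[d] /= sum;
}
```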
Yap Sok Ann
4f9b0ec4f0 Fix logprobs (#787)
This commit is mostly a cherry-pick of ggml-org/llama.cpp#10783, plus an
optimization to do a partial sort when sorting the logits.

That mainline PR and friends were partially cherry-picked by #723, but
weren't really in a working state yet.

A couple of additional changes:
* Include timing information in response, which was (unintentionally?)
  done in mainline since ggml-org/llama.cpp#10643.
* Also return the actual logprobs for accepted draft tokens. This is
  still a TODO in mainline [1].

Note that there is a TG performance penalty to return the logprobs, as
we need to sort the logits. By doing partial sort, the penalty is quite
small. Here are some numbers I got using the same prompt:

This PR with partial sort:
* no draft, no logprobs: 12.87 tok/s
* no draft, with logprobs: 12.61 tok/s (2.0% drop)
* with draft, no logprobs: 36.74 tok/s
* with draft, with logprobs: 36.12 tok/s (1.7% drop)

If cherry-picking the full sort from the mainline PR:
* no draft, no logprobs: 12.81 tok/s
* no draft, with logprobs: 12.02 tok/s (6.2% drop)
* with draft, no logprobs: 36.59 tok/s
* with draft, with logprobs: 29.08 tok/s (20.5% drop)

[1] https://github.com/ggml-org/llama.cpp/blob/b6548/tools/server/server.cpp#L4019

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-25 15:43:30 +02:00
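To see why the partial sort keeps the logprobs penalty small: reporting the top-N logprobs per token does not require fully sorting the vocabulary, only placing the N largest logits first, while the log-softmax normalizer still runs over all logits. A hedged sketch of that approach (function and variable names are illustrative, not the server code):

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Return the indices of the n_top largest logits plus their log-probabilities.
// std::partial_sort is O(n_vocab * log n_top) instead of O(n_vocab * log n_vocab)
// for a full sort, which is why the measured TG penalty above drops from ~6-20% to ~2%.
static std::vector<std::pair<int, float>> top_logprobs(const std::vector<float> & logits, int n_top) {
    std::vector<int> idx(logits.size());
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + n_top, idx.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });

    // log-softmax normalizer computed over all logits (not just the top ones)
    const float max_l = logits[idx[0]];
    double sum = 0.0;
    for (float l : logits) sum += std::exp(l - max_l);
    const float log_z = max_l + (float) std::log(sum);

    std::vector<std::pair<int, float>> out;
    for (int i = 0; i < n_top; ++i) out.push_back({ idx[i], logits[idx[i]] - log_z });
    return out;
}
```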
Kawrakow
f8b66238fa Fused matrix multiplications (CUDA and CPU) (#796)
* Quick attempt to fuse the Q, K, V GEMMs

Doesn't do much on the CPU

* Doesn't do much on the GPU either

* Use llm_build_mul_mat_qkv

* This is not needed

* Revert timing code committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 16:52:54 +02:00
Kawrakow
9c6988f61c Fix dequantization when requantizing (#795)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 12:44:30 +02:00
Kawrakow
f59b2909d4 cpu: fused softmax+topk (#794)
* cpu: fused softmax+topk

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 09:02:21 +02:00
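Fusing softmax and top-k for MoE routing means the softmax denominator and the k best experts are gathered in the same pass over the logits, instead of running a full softmax op followed by a separate top-k op. The single-pass sketch below shows the idea for a small k; it is an illustration, not the actual fused ggml op.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Single-pass fused softmax + top-k over expert logits for one token:
// accumulate the softmax denominator and maintain the k best experts in the
// same loop. Uses a simple insertion into a small sorted array since k is tiny.
static void fused_softmax_topk(const float * logits, int n_expert, int k,
                               int * top_idx, float * top_prob) {
    // softmax is shifted by the max for numerical stability (cheap first pass)
    float max_l = logits[0];
    for (int i = 1; i < n_expert; ++i) max_l = std::max(max_l, logits[i]);

    std::vector<float> best_val(k, -INFINITY);
    std::vector<int>   best_idx(k, -1);
    double denom = 0.0;
    for (int i = 0; i < n_expert; ++i) {
        denom += std::exp(logits[i] - max_l);
        if (logits[i] > best_val[k-1]) {                 // insertion into the top-k list
            int j = k - 1;
            while (j > 0 && logits[i] > best_val[j-1]) {
                best_val[j] = best_val[j-1]; best_idx[j] = best_idx[j-1]; --j;
            }
            best_val[j] = logits[i]; best_idx[j] = i;
        }
    }
    for (int i = 0; i < k; ++i) {
        top_idx[i]  = best_idx[i];
        top_prob[i] = (float)(std::exp(best_val[i] - max_l) / denom);
    }
}
```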
firecoperana
17f7f1ed18 Update webui to handle reasoning content and include usage stats in server only when requested (#791)
* handle reasoning content in webui
* server : include usage statistics only when the user requests them (#16052)
* server : only attempt to enable thinking if using jinja (#15967)

* config reasoning_content in webui and change default to auto

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-24 07:45:09 +02:00
Kawrakow
8b4208e789 Fix #772 (#790)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 16:43:02 +02:00
firecoperana
079231c291 model : add grok-2 support (#782)
Co-authored-by: firecoperana <firecoperana>
2025-09-23 16:31:01 +02:00
Kawrakow
4591e83825 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
Kawrakow
af5f2859c2 Fix compiler warnings (#788)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 10:30:15 +02:00
firecoperana
a6da22beb2 Deepseek V3.1 native tool calling support (OpenAI Style) (#771) 2025-09-13 07:51:40 +02:00
firecoperana
de97c33b40 fix convert error for ernie 4.5 (#774) 2025-09-11 07:59:24 +02:00
firecoperana
8403308d8e fix v1 completions streaming mode (#768) 2025-09-09 15:38:12 +02:00
Kawrakow
540a26514f This is very slightly better (#762)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 21:31:02 +02:00
Kawrakow
f74dd77143 Fix ggml_is_contiguously_allocated (#764)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 19:05:02 +02:00
firecoperana
426032c27a Add Ernie 4.5 MOE and 0.3B Support (#759)
* Add Ernie4_5MoeModel

* add ernie 4.5 0.3B model

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-05 11:54:35 +02:00
firecoperana
49979ba9e9 llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
Kawrakow
13c3b6412e Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
Kawrakow
144d456717 Better CPU SWA (#757)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 11:58:16 +02:00
Kawrakow
4a6a6f17ee Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
Kawrakow
727f7b7d9f Refactor CUDA flash attention (#745)
* Factor out mma

* Factor out wmma

* Factor out vec

* Remove unnecessary includes from fattn.cu

* Move mma launch to fattn-mma-f16.cuh

* Slightly better PP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 10:12:56 +02:00
Kawrakow
d29c21ecbc Set default value of GGML_SCHED_MAX_COPIES to 1 (#751)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 07:04:39 +02:00
Kawrakow
56e0f897ae Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
Kawrakow
cc73811ddc Remove double definition of LLAMA_LOG_DEBUG 2025-09-01 08:42:04 +03:00
firecoperana
d7882c3cf8 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00
Kawrakow
8de297b795 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
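For a SwiGLU-style FFN the fused op computes silu(W_gate·x) gated element-wise with (W_up·x) while walking the input activations once, instead of two separate GEMMs followed by a gating op. The scalar sketch below shows what is being fused; the real code goes through the quantized GEMM kernels, and all names here are illustrative.

```cpp
#include <cmath>
#include <cstdint>

// Fused FFN up+gate+activation for one token (scalar reference sketch).
// Unfused: act = silu(W_gate*x) .* (W_up*x), computed as two matvecs plus a
// gating pass. Fused: both dot products and the activation happen while x is
// read once per output row, avoiding the intermediate gate/up buffers.
static void ffn_up_gate_fused(const float * x, int n_embd, int n_ff,
                              const float * w_up,    // [n_ff, n_embd], row-major
                              const float * w_gate,  // [n_ff, n_embd], row-major
                              float * act) {         // [n_ff]
    for (int i = 0; i < n_ff; ++i) {
        float up = 0.0f, gate = 0.0f;
        const float * wu = w_up   + (int64_t) i*n_embd;
        const float * wg = w_gate + (int64_t) i*n_embd;
        for (int j = 0; j < n_embd; ++j) {             // both dot products share the load of x[j]
            up   += wu[j]*x[j];
            gate += wg[j]*x[j];
        }
        const float silu = gate / (1.0f + std::exp(-gate));   // SiLU(gate)
        act[i] = silu * up;                            // result then goes into ffn_down
    }
}
```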
Kawrakow
d55e98519f CUDA: prompt processing optimizations for MoE models (#739)
* Skip the row id computation for the ffn_down op

Sadly, almost negligible performance gain.

* Also this doesn't do much

* Also this barely moves the needle

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-30 12:09:41 +03:00
Kawrakow
f529c3a808 Sanitize imatrix (#735)
* sanitize importance matrix: WIP

* sanitize importance matrix: iq4_k

* sanitize importance matrix: iq5_k, iq6_k

* sanitize imatrix: iq4_ks

* sanitize imatrix: iq4_kss

* sanitize imatrix: iq2_ks and iq2_kl

* sanitize imatrix: iq5_ks

* sanitize imatrix: iq4_nl_r4

* sanitize imatrix: q4_0_r8

* sanitize imatrix: q6_0_r4

* sanitize imatrix: iq4_xs_r8

* sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template

* sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4

* sanitize imatrix: repacked i-quants

* Minor

* Add more checks for iq3_k, iq3_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-29 09:08:15 +03:00
Kawrakow
29be3e93c4 Make yarn_log_multiplier optional (#738)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-28 14:09:59 +03:00
Kawrakow
e760b4dc41 Check for NaNs while loading the model. (#727)
* Check for NaNs while loading the model.

* Also tell which experts have NaNs.

* Add command line option to validate quants

* Add checks for more quantization types

* Add checks for more quantization types

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 19:00:17 +03:00
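Conceptually, the validation above scans each tensor's (dequantized) data before the model is accepted and reports where the first bad value lives, which is how the per-expert reporting works. The sketch below shows that scan for plain f32 data only; the per-quant-type dequantization and the actual reporting format are left out, and the function name is an assumption.

```cpp
#include <cmath>
#include <cstdint>
#include <cstdio>

// Scan a row-major f32 tensor for NaNs and report where the first one lives.
// A real check also covers quantized types by dequantizing block by block.
static bool tensor_has_nan(const char * name, const float * data,
                           int64_t n_rows, int64_t n_cols) {
    for (int64_t r = 0; r < n_rows; ++r) {
        for (int64_t c = 0; c < n_cols; ++c) {
            if (std::isnan(data[r*n_cols + c])) {
                std::fprintf(stderr, "%s: NaN at row %lld, col %lld\n",
                             name, (long long) r, (long long) c);
                return true;
            }
        }
    }
    return false;
}
```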
Kawrakow
ca5b6ab9b1 Fix typo 2025-08-27 14:43:44 +03:00
Kawrakow
1dcc34f70a Heuristics for mmq_id -> original threshold (#734)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:17:41 +03:00