Commit Graph

225 Commits

Author SHA1 Message Date
Kawrakow
c51968b6d8 Split mode graph for qwen3moe 2025-12-01 11:56:05 +00:00
Kawrakow
63d0389e18 WIP split mode attn
Works for LlaMA models, but not for GLM-4.5.
Doesn't seem to improve performance, so I guess no point in trying to
fix it.
2025-12-01 09:34:14 +00:00
Kawrakow
ee0f02dcb0 Guarad against using split mode "graph" for unsupported models 2025-12-01 06:39:17 +00:00
Kawrakow
bf2a1dad98 Make graph reuse work with split mode graph 2025-11-30 18:05:13 +00:00
Kawrakow
ceaa71bcc8 Rename split mode "row" to split mode "graph" 2025-11-30 18:05:13 +00:00
Kawrakow
094b51d5ae Make it work with partial offload
but no tensor overrides yet, just ngl < num_layers.
2025-11-30 18:05:13 +00:00
Kawrakow
5305590f04 Show memory used per device 2025-11-30 18:05:13 +00:00
Kawrakow
97143330a1 This works, but it is slow 2025-11-30 18:05:13 +00:00
Kawrakow
5d68e4eb35 WIP: it runs with wrong result
But it also looks like the backend scheduler is not going to help:
* It copies mask and input positions to GPU 0
* => RoPE ops must run on GPU 0
* => To proceed attn evaluation, GPU 1 must wait for GPU 0 to finish its
     entire attn calculation
* Same with FFN. The rms_norm gets scheduled on GPU 0. Hence, GPU 1 must
  wait for GPU 0 to finish its entore FFN calculation before it can
  start (as it needs to copy the result of rms_norm from GPU 0)
* => Seems useless without writing a bespoke TP scheduling
2025-11-30 18:05:12 +00:00
Kawrakow
bc4be331ee WIP: also allocate the KV cache using tensor split 2025-11-30 18:05:12 +00:00
Kawrakow
32c6df015b WIP 2025-11-30 18:05:12 +00:00
firecoperana
15771072c7 RPC: support multiple devices including cpu (#1024)
* RPC support multiple devices

* rpc : update documentation (#16441)

Update the README file to match the newly added functionality of
exposing multiple devices from a single server.

Co-authored-by: Diego Devesa <slarengh@gmail.com>

# Conflicts:
#	examples/rpc/README.md

* Remove memory settings

* rpc : cache and reuse compute graphs (#15405)

Store the last computed graph and reuse it when possible.
Also do not return response from GRAPH_COMPUTE and assume it always
completes successfully. If this this is not the case, the server closes
the connection. This saves us a network round trip to the server.

* Add -cpu to include cpu backend

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Radoslav Gerganov <rgerganov@gmail.com>
2025-11-30 18:48:02 +01:00
firecoperana
869557c8fd Update mtmd to improve accuracy of M-RoPE (#993)
* model : Granite docling + Idefics3 preprocessing (SmolVLM) (#16206)

* feat: Add granite-docling conversion using trillion pretokenizer

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add granite-docling vocab pre enum

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use granite-docling pre

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add clip_is_idefics3

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Allow multi-token boundary sequences for image templating

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Add tiling support for idefices3 in clip.cpp

This should likely be moved into llava_uhd::get_slice_instructions, but for
now this avoids disrupting the logic there.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Partial support for full templating for idefics3 in mtmd

There are still errors encoding some of the image chunks, but the token
sequence now matches transformers _almost_ perfectly, except for the double
newline before the global image which shows up as two consecutive newline
tokens instead of a single double-newline token. I think this is happening
because the blocks are tokenized separately then concatenated.

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Fully working image preprocessing for idefics3 w/ resize and slicing

Branch: gabe-l-hart/GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* feat: Parse the preprocessor config's longest side and add it to the mmproj hparams

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Use the longest side instead of size * scale_factor

For Granite Docling, these come out to the same value, but that was just a
conicidence.

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* fix: Allow batch encoding and remove clip_is_idefics3

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Remove unnecessary conditionals for empty token vectors

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* refactor: Use image_manipulation util

Branch: GraniteDocling

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>

* add test model

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	convert_hf_to_gguf_update.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/gguf_writer.py
#	src/llama-vocab.cpp
#	src/llama-vocab.h

* mtmd : support home-cooked Mistral Small Omni (#14928)

* model : add LightOnOCR-1B model (#16764)

* model : add LightOnOCR-1B model

* add test
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py

* mtmd : fix idefics3 preprocessing (#16806)

* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite

* model: Add support for CogVLM model (#15002)

* Added GGUF mappings for CogVLM model

* Add tensor mapping for CogVLM visual encoder

* Add CogVLM to conversion script, no vision part yet

* Added CogVLM vision model to conversion script

* Add graph for CogVLM CLIP model

* Add graph for CogVLM

* Fixes for CogVLM. Now compiles.

* Model now runs

* Fixes for cogvlm graph

* Account for graph context change after rebase

* Changes for whitespace

* Changes in convert script according to comments

* Switch CogVLM LLM graph to merged QKV tensor

* Use rope_type variable instead of direct definition

* Change CogVLM CLIP encoder to use SWIGLU

* Switch CogVLM CLIP to use merged QKV

* Apply rebase edits and remove ggml_cont call that is now unnecessary

* clean up

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
# Conflicts:
#	convert_hf_to_gguf.py
#	examples/mtmd/clip.cpp
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py
#	src/llama-arch.cpp
#	src/llama-arch.h
#	src/llama-model.cpp
#	src/llama-model.h

* mtmd: refactor preprocessing + support max/min pixels (#16878)

* mtmd: refactor preprocessing + support max/min pixels

* fix mlp type

* implement mix/max pixels

* improve hparams

* better image preproc for qwen

* fix

* fix out of bound composite

* fix (2)

* fix token calculation

* get_merge_kernel_size()

* fix llama4 and lfm2

* gonna fix them all

* use simple resize for qwen

* qwen: increase min tokens

* no resize if dst size == src size

* restore to initial min/max tokens value for qwen
# Conflicts:
#	examples/mtmd/clip.cpp

* clip : use FA (#16837)

* clip : use FA

* cont : add warning about unsupported ops

* implement "auto" mode for clip flash attn

* clip : print more detailed op support info during warmup

* cont : remove obsolete comment [no ci]

* improve debugging message

* trailing space

* metal : remove stray return

---------

Co-authored-by: Xuan Son Nguyen <son@huggingface.co>

* model: add Janus Pro for image understanding (#16906)

* Add support for Janus Pro

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Update gguf-py/gguf/tensor_mapping.py

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Address reviewer suggestions

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* Add JANUS_PRO constant

* Update clip model handling

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>

* Refactor JANUS_PRO handling in clip.cpp

Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>

* Update tools/mtmd/clip.cpp

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>

* em whitespace

---------

Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <son@huggingface.co>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
# Conflicts:
#	convert_hf_to_gguf.py
#	gguf-py/gguf/constants.py
#	gguf-py/gguf/tensor_mapping.py

* mtmd: pad mask for qwen2.5vl (#16954)

* mtmd: pad mask for qwen2.5vl

* improve

* mtmd: add --image-min/max-tokens (#16921)

* mtmd: improve struct initialization (#16981)

* mtmd: allow QwenVL to process larger image by default (#17020)

* Disable flash attention

* mtmd : fix embedding size for image input (#17123)

* mtmd: fix patch_size initialized to random value in audio models (#17128)

* mtmd: fix patch_size initialized to random value in audio models

* add default hparams

* add llama_model_n_embd_inp

* Fix load qwen3 vl

Change batch size

* Add description

* Fix cli build error

---------

Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Gabe Goodhart <ghart@us.ibm.com>
Co-authored-by: Xuan Son Nguyen <son@huggingface.co>
Co-authored-by: Tianyue-Zhao <zhaotianyue@outlook.com>
Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>
Co-authored-by: Zhiyong Wang <85110830+ravenouse@users.noreply.github.com>
Co-authored-by: Sigbjørn Skjæret <sigbjorn.skjaeret@scala.com>
Co-authored-by: Xuan-Son Nguyen <thichthat@gmail.com>
Co-authored-by: firecoperana <firecoperana>
2025-11-29 07:27:15 +01:00
Kawrakow
45cd1a70f5 Fix llama-bench mla parameter (#1016)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-27 09:33:30 +01:00
firecoperana
8c39ff966d Change default RPC order and fix wrong RPC server order in --device arg (#1011)
* Change default RPC order and fix wrong RPC order in --device arg

* Update

---------

Co-authored-by: firecoperana <firecoperana>
2025-11-26 16:51:51 +01:00
Kawrakow
dffb45d44a Fix rtr when mqkv is enabled (#971)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-16 16:51:45 +02:00
firecoperana
b40d11b22d Fix kv cache save and load for GLM model (#965)
Co-authored-by: firecoperana <firecoperana>
2025-11-15 17:04:16 +02:00
Kawrakow
6b9d1bf4b4 Graph reuse (#947)
* Add mainline compatible FA command line option

* Graph reuse: add command line argument to turn it on

* WIP

* This seems to work

* This is perhaps cleaner

* Change the command line option to -gr

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-14 06:58:19 +02:00
Kawrakow
ddc88bac17 Set mla=3 by default (#943)
so more recent users that haven't followed the history of FlashMLA
evolution and hence don't know about the MLA options get the best setting
without having to add -mla 3 on the command line.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-12 11:00:58 +02:00
Kawrakow
263be6670b Add support for SmolLM3 (#934)
* Convert from HF

* Model loading and compute graph

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-10 15:40:12 +02:00
Kawrakow
532a05e466 CUDA: set compute parameters via command line arguments (#910)
* cuda: set compute parameters via command line arguments

* Also llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-07 07:11:23 +02:00
firecoperana
e15a215e6b model : Port Minimax M2 from mainline (#907)
Co-authored-by: firecoperana <firecoperana>
2025-11-06 18:09:24 +02:00
Kawrakow
e68f50be9a Allow quantization of ffn_gate_inp (#896)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-05 10:44:32 +02:00
Thireus ☠
86597623a5 Port of Qwen3-VL support from mainline (#883)
* Port of Qwen3-VL for latest ik_llama.cpp

- convert_hf_to_gguf.py - Not touched, use llama.cpp to convert model instead
- sysl and metal support for imrope not added
- Vulkan support for imrope not tested
- Code not tested

* Bugfix n_embd was declared multiple times

https://github.com/ikawrakow/ik_llama.cpp/pull/883#issuecomment-3471179655

* Fix n_embd issue with qwen3vl

* model.output tensor not required

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2480388389

* Improved logic for qkv combined tensors

59ceaf8fcb (r2480395800)
59ceaf8fcb (r2480398187)

* Fix n_embd for merge_qkv() + cleaner code

https://github.com/ikawrakow/ik_llama.cpp/pull/883#discussion_r2481227395

* Revert TENSOR_NOT_REQUIRED
2025-11-04 19:20:54 +02:00
Kawrakow
c23fda2103 Disable some fusion, RoPE cache off by default (#894)
* Disable some fusion and make rope cahe off by default

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-04 07:50:14 +02:00
Kawrakow
fb0d5a995c RoPE cache (#887)
* Introducing rope cache

When computing RoPE, the rotation angles in each layer
are exactly the same, and only depend on the token positions
(and other constant, model dependent parameters).
So, I wonder, why don't we compute the angles just once
and then reuse for the Q and K RoPE in each layer?

This commit does it as a POC on the CPU, and uses it in
the Qwen3-MoE compute graph.

* cuda: neox works

* WIP

* rope_cache: norm works

* Fused rope+rope

* Fused rope+rope (norm)

* Fused rms+rms+rope+rope (neox) - not working

* WIP

* Also qwen3

* Add command line arg to disable rope cache

* Disable RoPE cache if rope type is not neox or norm

* Add missing break after merge with main

* Fused fused_rms+fused_rms+rope+rope (with -mqkv)

* Fused fused_rms+fused_rms+rope+rope (without -mqkv)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-11-03 18:42:20 +02:00
firecoperana
a3bd0158f7 Disable pipeline parallel for tensor override or allocation failed (#879)
* disable pipeline parallelism when tensor override present

* disable pipeline parallel if allocation failed

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-31 14:20:48 +02:00
Kawrakow
56fc5454ff Merge Q, K, V (#878)
* POC: merge Q, K, V into a single, contiguous tensor

Done just for Qwen3-MoE, where I see a 4% uplift in TG.
PP performance gain is sub-percent, if any.
Still, it seems it makes sense to do it in general given
the TG performance gain.

* WIP

* merge_qkv: it works for gpt-oss

...but we see a smaller TG gain (~1.5%)

* WIP

* Don't ignore the return value of create_tensors()

else, when q, k, v get merged and we are running on the CPU,
we get a crash because the backend is trying to use mmap,
but that no longer works.

* merge_qkv: bias can be required, optional, or mandatory

* merge_qkv: glm4.5moe

* merge_qkv: add command loine argument to enable

* merge_qkv: fix tensor dimensions

* merge_qkv: llama-4

* merge_qkv: qwen3 (dense)

* merge_qkv: simplify build_qwen3moe

* cohere2 - simplify graph building

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-30 10:49:48 +02:00
Kawrakow
d0992d6e1f Fix device parsing bug 2025-10-29 08:28:57 +02:00
firecoperana
904e994bfb Support --device and --device-draft parameter (#866)
* add --device and --device-draft parameter

* don't print debug message in release mode

* fix

* bug fix to throw exception when no device specified

* add const

---------

Co-authored-by: firecoperana <firecoperana>
2025-10-27 18:13:28 +02:00
Kawrakow
41d6c42b96 Change flash attention and fmoe to be on by default (#863)
* Change fmoe to be on by default

* Change default fmoe also in llama-bench

* Change flash attention to be on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-25 09:37:28 +03:00
Kawrakow
0549be76e5 Fused mul + multi_add op (#858)
* Adding fused mul+multi_add + CPU implementation

* fused mul+multi_add: CUDA

* fused mul+multi_add: command line argument to disable it

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-24 07:40:35 +03:00
Kawrakow
1f072ab135 Do not allocate KV cache for unused layers (#843)
* Do not allocate KV cache for unused layers

* Do not apply experts weight scale if it is 1

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-20 10:09:39 +03:00
Kawrakow
cde642e591 Grouped expert routing (CPU only) (#836)
* Better argsort (CPU)

* Attemt at grouped topk

* This seems to do the trick for grouped experts routing

* Cleanup

* Trying to merge, something is not right

* Working merged grouped top_k (CPU)

* Add command line option to enable grouped expert routing

* Add grouped expert routing option to llama-bench

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 14:57:02 +03:00
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
4e24d48e63 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* Whith FA on, create mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
Kawrakow
764eefd1bc Enable and clean up compiler warnings in src (#824)
* WIP: enable and clean up warnings in src

* All warnings handled

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 16:01:13 +03:00
Kawrakow
4daff01b39 Refactor file llama.cpp (#823)
* llama_model and llama_hparams

* llama_build_context

Surprisingly small reduction in llama.cpp compile time given
the reduction in LOCs (22k -> 14k)

* LLM_TN

llama.cpp compilation: 50 s -> 33 s

* llama_quantize

* arch names

* All graph building is now in llm-build-context.cpp

* hparams loading

llama.cpp is now just 9300 LOC, but still takes 32 seconds to compile.

* We are now at 6 seconds to build the src folder

* load -> create

We are not actually loading the tensors, but just creating them.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-11 11:35:20 +03:00
Downtown-Case
5a633bb0e9 Mark some multi-prediction tensors as not required. (#814) 2025-10-01 20:37:31 +02:00
Kawrakow
c1a0e15377 Port mdmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Kawrakow
f8b66238fa Fused matrix multiplications (CUDA and CPU) (#796)
* Quick attempt to fuse the Q, K, V GEMMs

Doesn't do much on the CPU

* Doesn't do much on the GPU either

* Use llm_build_mul_mat_qkv

* This is not needed

* Revert timing on committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 16:52:54 +02:00
Kawrakow
9c6988f61c Fix dequantization when requantizing (#795)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 12:44:30 +02:00
firecoperana
079231c291 model : add grok-2 support (#782)
Co-authored-by: firecoperana <firecoperana>
2025-09-23 16:31:01 +02:00
Kawrakow
4591e83825 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
firecoperana
426032c27a Add Ernie 4.5 MOE and 0.3B Support (#759)
* Add Ernie4_5MoeModel

* add ernie 4.5 0.3B model

---------

Co-authored-by: firecoperana <firecoperana>
2025-09-05 11:54:35 +02:00
firecoperana
49979ba9e9 llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
Kawrakow
13c3b6412e Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated expers for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
Kawrakow
4a6a6f17ee Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
Kawrakow
56e0f897ae Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
firecoperana
d7882c3cf8 Tool calls support from mainline (#723)
* Tool calls support from mainline

* update cmake

* revert api for /completions

* Fix broken thinking process for gpt-oss

* add missing args and fix webui bugs

* add missing args and fix webui bugs2

* Fix reasoning format error

* add usage

* change default post_sampling_probs to true

* add back generated_text

* Remove server endpoints tests

* add log

* Chat fixes

* Remove logs

* webui: revert extra handling of thinking process

---------

Co-authored-by: firecoperana <firecoperana>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-01 08:38:49 +03:00