ik_llama.cpp/common at a50bd821ec055e36ae91d6a310fcc1df9c614937 - ik_llama.cpp - Public git mirror

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-13 07:20:15 +00:00

Kawrakow c03c2d7cc6 Merge ffn_up and ffn_gate experts tensors (#1137)

* WIP - not working

* WIP - not working

* WIP - GPT-OSS working

However, the approach is extremely stupid. The only way I could
correctly repack the up/gate experts was to copy up and gate into host
buffers, repack into another host buffer, and copy the result back into
the ffn_up_gate_exps tensor. This is going to be very slow for giant
500 GB models.

My attempts to do this via a compute graph on the backend holding
the tensors were unsuccessful.

For GPT-OSS-20B I see ~6-7% better PP when using the original
ik_llama.cpp fused_up_gate CUDA implementation, and ~10% when
using the small batch size implementation.

Other models are not working yet on CUDA as I need to fix the
fused mul-unary implementation.

* WIP

* WIP - Qwen3-MoE (and hopefully all others) working

But when I say here and in the previous commit "working",
I mean PP is working. TG is still broken.

* WIP: TG seems to be working

* Minor

* Add command line option to merge experts up/gate

* Add merge up/gate command line parameter to llama-bench

* Turn off merge_up_gate_exps if split mode graph

It is not yet implemented

* When no bias, allow merging up/gate with tensor overrides

* Arghh, we need to increase the context size again

* Cleanup

2026-01-12 18:30:53 +02:00
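
The host-side repacking described in the commit message above (copy up and gate into host buffers, repack into a third host buffer, copy back) can be pictured with a minimal sketch. This is not the code from the commit: the function name `merge_up_gate_host`, the plain `float` storage, and the per-expert "up rows followed by gate rows" ordering are all assumptions made for illustration. The real tensors are typically quantized and the actual merged layout may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not part of ik_llama.cpp).
// `up` and `gate` each hold n_expert matrices of rows x cols floats,
// stored contiguously expert after expert. The merged buffer stores,
// for every expert, the up matrix followed by the gate matrix.
// That ordering is an assumption chosen for this sketch.
std::vector<float> merge_up_gate_host(const std::vector<float>& up,
                                      const std::vector<float>& gate,
                                      std::size_t n_expert,
                                      std::size_t rows,
                                      std::size_t cols) {
    const std::size_t per_expert = rows * cols;
    assert(up.size() == n_expert * per_expert);
    assert(gate.size() == n_expert * per_expert);

    // Third host buffer that receives the repacked data.
    std::vector<float> merged(2 * n_expert * per_expert);
    for (std::size_t e = 0; e < n_expert; ++e) {
        const float* src_up   = up.data()   + e * per_expert;
        const float* src_gate = gate.data() + e * per_expert;
        float* dst = merged.data() + 2 * e * per_expert;
        std::copy(src_up,   src_up   + per_expert, dst);
        std::copy(src_gate, src_gate + per_expert, dst + per_expert);
    }
    return merged;
}
```

In this sketch every element passes through host memory twice (once when read into `up`/`gate`, once when `merged` is written back), which is consistent with the commit's remark that the approach will be very slow for very large models.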

..

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

base64.hpp

llava : expose as a shared library for downstream projects (#3613)

2023-11-07 00:36:23 +03:00

build-info.cpp.in

build : link against build info instead of compiling against it (#3879)

2023-11-02 08:50:16 +02:00

chat-parser-xml-toolcall.cpp

fix kimi-k2 tool call (#996)

2025-11-24 06:51:16 +01:00

chat-parser-xml-toolcall.h

common: Generalized XML-style tool-call parsing with streaming support (#958)

2025-11-18 15:29:58 +01:00

chat-parser.cpp

Add back the fix for Kimi-K2 tool-call parsing issues (#1070)

2025-12-16 14:44:47 +01:00

chat-parser.h

Refactor chat and server file (#1062)

2025-12-15 08:27:20 +01:00

chat.cpp

fix grammar for Kimi-K2 (#1103)

2026-01-05 07:57:25 +02:00

chat.h

Refactor chat and server file (#1062)

2025-12-15 08:27:20 +01:00

CMakeLists.txt

Refactor chat and server file (#1062)

2025-12-15 08:27:20 +01:00

common.cpp

Merge ffn_up and ffn_gate experts tensors (#1137)

2026-01-12 18:30:53 +02:00

common.h

Merge ffn_up and ffn_gate experts tensors (#1137)

2026-01-12 18:30:53 +02:00

console.cpp

check C++ code with -Wmissing-declarations (#3184)

2023-09-15 15:38:27 -04:00

console.h

gguf : new file format with flexible meta data (beta) (#2398)

2023-08-21 23:07:43 +03:00

grammar-parser.cpp

Update grammar (#1023)

2025-11-30 18:45:38 +01:00

grammar-parser.h

Tool calls support from mainline (#723)

2025-09-01 08:38:49 +03:00

json-partial.cpp

common: Generalized XML-style tool-call parsing with streaming support (#958)

2025-11-18 15:29:58 +01:00

json-partial.h

Move minja and nlohmann/json to vendor (#802)

2025-09-27 09:12:35 +02:00

json-schema-to-grammar.cpp

Update grammar (#1023)

2025-11-30 18:45:38 +01:00

json-schema-to-grammar.h

common: Generalized XML-style tool-call parsing with streaming support (#958)

2025-11-18 15:29:58 +01:00

llguidance.cpp

Tool calls support from mainline (#723)

2025-09-01 08:38:49 +03:00

log.cpp

Refactor chat and server file (#1062)

2025-12-15 08:27:20 +01:00

log.h

Fix log issue for llama-cli (#1071)

2025-12-16 18:12:16 +01:00

ngram-cache.cpp

Fixed lookup compilation issues on Windows (#6273)

2024-03-24 14:21:17 +01:00

ngram-cache.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

regex-partial.cpp

Tool calls support from mainline (#723)

2025-09-01 08:38:49 +03:00

regex-partial.h

Tool calls support from mainline (#723)

2025-09-01 08:38:49 +03:00

sampling.cpp

Implement Adaptive-P Sampler (#1100)

2026-01-10 07:58:53 +02:00

sampling.h

Implement Adaptive-P Sampler (#1100)

2026-01-10 07:58:53 +02:00

speculative.cpp

Support --device and --device-draft parameter (#866)

2025-10-27 18:13:28 +02:00

speculative.h

Port universal assisted decoding to llama-server (#699)

2025-08-18 09:22:23 +03:00

train.cpp

train : change default FA argument (#7528)

2024-05-25 15:22:35 +03:00

train.h

sync : ggml (backend v2) (#3912)

2023-11-13 14:16:23 +02:00