Commit Graph

401 Commits

Author SHA1 Message Date
Kawrakow
e66d307e13 Better argsort (CPU) (#835)
* Better argsort (CPU)

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-16 11:31:03 +03:00
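
A minimal sketch of the index-sort idea behind a CPU argsort, assuming a plain std::sort over an index array; the actual GGML_OP_ARGSORT kernel operates on tensor rows and may use a different algorithm:

```cpp
#include <algorithm>
#include <cstdint>
#include <numeric>
#include <vector>

// Hypothetical helper: return the permutation that sorts `values` ascending.
std::vector<int32_t> argsort(const std::vector<float>& values) {
    std::vector<int32_t> idx(values.size());
    std::iota(idx.begin(), idx.end(), 0);                 // 0, 1, 2, ...
    std::sort(idx.begin(), idx.end(),
              [&](int32_t a, int32_t b) { return values[a] < values[b]; });
    return idx;
}
```
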
Kawrakow
f7adde1043 Adding Ling/Ring (a.k.a., Bailing-MoE2) support (#833)
* Adding Ling/Ring (a.k.a., Bailing-MoE2)

* Add expert group selection (not working, so turned off)

* BailingMoE2 conversion

* WIP

* Bits and pieces

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-15 14:20:40 +03:00
Kawrakow
ba9fefb73d gpt-oss: duplicate experts biases when necessary (#829)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-14 14:38:40 +03:00
Kawrakow
4e24d48e63 Attention mask tweaks for better long context performance (#825)
* Parallelize mask

We see non-negligible PP gains for long contexts.
More importantly, the strange drop in performance
observed for GPT-OSS for context >= 32k tokens is gone.

* With FA on, create mask as f16 directly

* WIP

* Reduce KQ mask padding to 16

Why was it 64 in the first place?

I don't observe any issues, while TG performance
for long contexts improves by 2-4%.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-10-13 14:01:11 +03:00
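
For context, a rough sketch of how such a KQ mask can be built in one pass, with a sliding window applied on top of the causal constraint; names, layout, and the f32 element type are illustrative (with FA the real mask is created as f16):

```cpp
#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical mask builder: 0 for attendable positions, -inf otherwise.
void build_kq_mask(std::vector<float>& mask, int n_kv, int n_tokens,
                   int pos0, int n_swa /* 0 = no sliding window */) {
    const float neg_inf = -std::numeric_limits<float>::infinity();
    mask.assign((size_t)n_kv * n_tokens, neg_inf);
    for (int i = 0; i < n_tokens; ++i) {
        const int pos = pos0 + i;                                  // absolute position of query i
        const int lo  = n_swa > 0 ? std::max(0, pos - n_swa + 1) : 0;
        for (int j = lo; j <= pos && j < n_kv; ++j) {
            mask[(size_t)i * n_kv + j] = 0.0f;                     // attendable
        }
    }
}
```
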
Kawrakow
475223079c Attempt to fix AVX2 FA (#807)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-30 08:06:53 +02:00
Kawrakow
9932e6b102 Fix gemma3 vision (#803)
* Remove unnecessary assert in im2col

* Remove unnecessary assert in im2col (CPU)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 11:15:32 +02:00
Kawrakow
c1a0e15377 Port mtmd from mainline + Qwen2/2.5-VL support (#798)
* Add mtmd: the beginning

* Add mtmd: mtmd.cpp compiles

* Add mtmd: clip initialization compiles

* Add mtmd: clip.cpp compiles

* Add mtmd: builds successfully

* Add CPU implementation for GGML_OP_GLU

* Add CUDA implementation for GGML_OP_GLU

* Add CPU implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add CUDA implementation for GGML_OP_CONV_2D and GGML_OP_CONV_2D_DW

* Add mtmd: refresh CPU rope

* Add mtmd: refresh CUDA rope

* Add mtmd: add Qwen2-VL

* Add mtmd: Qwen2.5-VL text seems to work with this change

* Add mtmd: fix swiglu

* Add mtmd: use LOG_TEE so generated tokens show up in terminal

* Add mtmd: do not attempt to load a GPU backend if none are available

* GLU, not GPU

* Fix typo

* Fix new/free mismatch

* LOG stuff

* Add mtmd: this fixes gibberish on second image

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-27 08:45:29 +02:00
Kawrakow
bc34573356 CPU: faster FA (#797)
* Avoid computing FA chunks where the mask is -infinity

* Avoid computing FA chunks where the mask is -infinity also for f16/bf16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-26 09:00:25 +02:00
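
The optimization rests on a simple observation: a chunk whose mask entries are all -infinity contributes exactly zero after softmax, so its K·Q and V accumulation can be skipped entirely. A hypothetical helper illustrating the check:

```cpp
#include <cmath>
#include <cstddef>

// Returns true if every mask entry in the chunk is -infinity,
// i.e. the chunk can be skipped in the flash-attention loop.
bool chunk_fully_masked(const float* mask, size_t chunk_size) {
    for (size_t i = 0; i < chunk_size; ++i) {
        if (mask[i] != -INFINITY) return false;  // at least one attendable slot
    }
    return true;
}
```
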
Kawrakow
f8b66238fa Fused matrix multiplications (CUDA and CPU) (#796)
* Quick attempt to fuse the Q, K, V GEMMs

Doesn't do much on the CPU

* Doesn't do much on the GPU either

* Use llm_build_mul_mat_qkv

* This is not needed

* Revert timing changes committed by mistake

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 16:52:54 +02:00
Kawrakow
f59b2909d4 cpu: fused softmax+topk (#794)
* cpu: fused softmax+topk

* Cleanup

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-24 09:02:21 +02:00
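
A hedged sketch of what a fused softmax + top-k over MoE router logits can look like; whether probabilities are normalized over the selected experts only (as here) or over all experts depends on the model:

```cpp
#include <algorithm>
#include <cmath>
#include <numeric>
#include <vector>

// Select the k largest router logits and normalize only those probabilities.
void softmax_topk(const std::vector<float>& logits, int k,
                  std::vector<int>& ids, std::vector<float>& probs) {
    ids.resize(logits.size());
    std::iota(ids.begin(), ids.end(), 0);
    std::partial_sort(ids.begin(), ids.begin() + k, ids.end(),
                      [&](int a, int b) { return logits[a] > logits[b]; });
    ids.resize(k);
    probs.resize(k);
    const float max_l = logits[ids[0]];
    float sum = 0.0f;
    for (int i = 0; i < k; ++i) { probs[i] = std::exp(logits[ids[i]] - max_l); sum += probs[i]; }
    for (int i = 0; i < k; ++i) probs[i] /= sum;
}
```
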
Kawrakow
8b4208e789 Fix #772 (#790)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 16:43:02 +02:00
Kawrakow
4591e83825 cuda: fused top_k+softmax as used in most MoE models (#789)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-23 13:45:57 +02:00
Kawrakow
540a26514f This is very slightly better (#762)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 21:31:02 +02:00
Kawrakow
f74dd77143 Fix ggml_is_contiguously_allocated (#764)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-05 19:05:02 +02:00
firecoperana
49979ba9e9 llama: enable K-shift for quantized KV cache for cuda (#760)
cuda: add q8_0->f32 cpy operation (#9571)
It will fail on unsupported backends or quant types.

Co-authored-by: Ivan <nekotekina@gmail.com>
2025-09-05 11:54:18 +02:00
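
The missing piece for the K-shift is a dequantizing copy. A simplified sketch of Q8_0 → f32 (the real ggml block stores the scale as f16; a plain float is used here for brevity):

```cpp
#include <cstddef>
#include <cstdint>

struct block_q8_0_f32 {          // simplified, hypothetical layout
    float  d;                    // per-block scale
    int8_t qs[32];               // 32 quantized values
};

// Dequantize n_blocks Q8_0 blocks into a contiguous f32 buffer.
void dequantize_q8_0(const block_q8_0_f32* blocks, float* dst, size_t n_blocks) {
    for (size_t b = 0; b < n_blocks; ++b)
        for (int i = 0; i < 32; ++i)
            dst[b * 32 + i] = blocks[b].d * blocks[b].qs[i];
}
```
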
Kawrakow
13c3b6412e Offload only activated experts to the GPU (#698)
* Offload only activated experts

* This seems to do the trick for -fmoe

* Do not recalculate activated experts for fused up/gate

* Log out of bounds access details

* Add a command line argument

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 12:22:30 +02:00
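
A minimal sketch of the idea: collect the unique expert ids selected by the router for the current batch, so that only those experts' weights need to be transferred to the GPU. Names are illustrative:

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

// From the router's top-k selections for a batch, return the unique expert ids.
std::vector<int32_t> activated_experts(const std::vector<int32_t>& topk_ids) {
    std::vector<int32_t> ids(topk_ids);
    std::sort(ids.begin(), ids.end());
    ids.erase(std::unique(ids.begin(), ids.end()), ids.end());
    return ids;   // upload weights for these experts only
}
```
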
Kawrakow
144d456717 Better CPU SWA (#757)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 11:58:16 +02:00
Kawrakow
4a6a6f17ee Alternative CUDA FA for SWA models (#754)
* Bounds for flash attention

* Add n_swa to FA parameters

* Fix it

* This seems very slightly better

* Using vec kernel when we have SWA

* Need also this

* f32 vec kernel

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-04 08:42:18 +02:00
Kawrakow
727f7b7d9f Refactor CUDA flash attention (#745)
* Factor out mma

* Factor out wmma

* Factor out vec

* Remove unnecessary includes from fattn.cu

* Move mma launch to fattn-mma-f16.cuh

* Slightly better PP

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 10:12:56 +02:00
Kawrakow
d29c21ecbc Set default value of GGML_SCHED_MAX_COPIES to 1 (#751)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 07:04:39 +02:00
Kawrakow
56e0f897ae Revert "CUDA: prompt processing optimizations for MoE models (#739)" (#748)
This reverts commit f22a9ef95a.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-09-02 06:55:48 +02:00
Kawrakow
8de297b795 Fused FFN_UP+FFN_GATE op (#741)
* Fused up+gate+unary for regular (not MoE) FFN - CPU

* WIP CUDA

* Seems to be working on CUDA

For a dense model we get 2-3% speedup for PP and ~0.6% for TG.

* Add command line option

This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-31 18:16:36 +03:00
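
A hypothetical scalar sketch of the fused FFN_UP + FFN_GATE + unary arithmetic (SiLU gating assumed); the real op fuses this at the GEMM level rather than element-wise over precomputed tensors:

```cpp
#include <cmath>
#include <cstddef>

static inline float silu(float x) { return x / (1.0f + std::exp(-x)); }

// Produce each FFN activation in one pass instead of materializing
// separate up and gate results and combining them afterwards.
void fused_up_gate_silu(const float* up, const float* gate, float* out, size_t n) {
    for (size_t i = 0; i < n; ++i)
        out[i] = silu(gate[i]) * up[i];
}
```
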
Kawrakow
d55e98519f CUDA: prompt processing optimizations for MoE models (#739)
* Skip the row id computation for the ffn_down op

Sadly, almost negligible performance gain.

* Also this doesn't do much

* Also this barely moves the needle

* This is slightly better

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-30 12:09:41 +03:00
Kawrakow
f529c3a808 Sanitize imatrix (#735)
* sanitize importance matrix: WIP

* sanitize importance matrix: iq4_k

* sanitize importance matrix: iq5_k, iq6_k

* sanitize imatrix: iq4_ks

* sanitize imatrix: iq4_kss

* sanitize imatrix: iq2_ks and iq2_kl

* sanitize imatrix: iq5_ks

* sanitize imatrix: iq4_nl_r4

* sanitize imatrix: q4_0_r8

* sanitize imatrix: q6_0_r4

* sanitize imatrix: iq4_xs_r8

* sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template

* sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4

* sanitize imatrix: repacked i-quants

* Minor

* Add more checks for iq3_k, iq3_ks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-29 09:08:15 +03:00
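
A hedged sketch of the sanitization idea: importance weights that are not finite (or not positive) would corrupt the weighted fits used during quantization, so they are replaced with a neutral value. The exact replacement used per quant type may differ:

```cpp
#include <cmath>
#include <cstddef>

// Replace unusable importance-matrix entries with a neutral weight.
void sanitize_imatrix(float* weights, size_t n) {
    for (size_t i = 0; i < n; ++i)
        if (!std::isfinite(weights[i]) || weights[i] <= 0.0f)
            weights[i] = 1.0f;   // neutral importance (illustrative choice)
}
```
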
Kawrakow
e760b4dc41 Check for NaNs while loading the model. (#727)
* Check for NaNs while loading the model.

* Also tell which experts have NaNs.

* Add command line option to validate quants

* Add checks for more quantization types

* Add checks for more quantization types

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 19:00:17 +03:00
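
A minimal sketch of the kind of check this adds, assuming the tensor (or expert slice) has been dequantized to f32 first; the actual validation hooks into the quant-type-specific loaders:

```cpp
#include <cmath>
#include <cstddef>
#include <cstdio>

// Scan a dequantized tensor for NaNs and report the first occurrence.
bool tensor_has_nan(const float* data, size_t n, const char* name) {
    for (size_t i = 0; i < n; ++i) {
        if (std::isnan(data[i])) {
            std::fprintf(stderr, "NaN in tensor %s at element %zu\n", name, i);
            return true;
        }
    }
    return false;
}
```
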
Kawrakow
ca5b6ab9b1 Fix typo 2025-08-27 14:43:44 +03:00
Kawrakow
1dcc34f70a Heuristics for mmq_id -> original threshold (#734)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:17:41 +03:00
Kawrakow
6afe9b48ab Sanitize importances for KT quantization (#720)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:04:15 +03:00
Kawrakow
3dc4dffed5 Fix avx2 GEMM mess (v2) (#724)
* This fixes confusion around Q8_0 on AVX2

* This does it for iq4_nl, including FA

* This does it for iq4_nl on Zen4, but FA does not work

* Slightly more clear

* Adding forgotten q8_0_r8 to num_rows()

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-27 08:03:47 +03:00
Kawrakow
ac4ec50f03 CUDA: muh faster prompt processing for MoE models and small u-batch sizes (#728)
* WIP: adding mainline mmq_id implementation

* This seems to work

* Now also -fmoe works

* WIP

* WIP

* WIP

* This works for mainline supported quants

* mmq_id: add iq2_k, iq2_k_r4

* mmq_id: don't assume row size is a multiple of type size (per row scales)

* mmq_id: don't assume row size is a multiple of type size

* mmq_id: add iq2_ks

So we are sure it works with per row scales

* mmq_id: add iq2_kl

* mmq_id: add iq3_ks

* mmq_id: adding iq3_k, iq3_k_r4

* mmq_id: add iq4_kss, iq4_ks, iq4_ks_r4

* mmq_id: adding iq4_k, iq4_k_r4

* mmq_id: adding iq5_ks, iq5_ks_r4

* mmq_id: adding iq5_k, iq5_k_r4, q6_0

* mmq_id: adding iq6_k

* mmq_id: add iq1_s_r4

* mmq_id: adding iq1_kt, iq2_kt

* mmq_id: add iq3_kt, iq4_kt

* Add CUDA fp8 header

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-26 13:30:35 +03:00
Kawrakow
d3aecc7f37 Log for debugging #721 (#722)
* Log for debugging #721

* Remove the 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 15:24:34 +03:00
Kawrakow
8e23ecdd96 Fix more Q8_0 repacking mess on AVX2 (#719)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-23 09:04:51 +03:00
Kawrakow
e962ce8c70 CUDA: faster IQ2_K, IQ2_KS, IQ2_K_R4 (#716)
* Use bperm trick for iq2_ks gemm -> 7% gain

* Use bperm trick for iq2_k gemm -> ~5% gain

* Use bperm trick for iq2_k_r4 gemm -> ~3% gain

* Use bperm trick for iq2_ks gemv -> ~7% gain

* Use bperm trick for iq2_k gemv -> ~3% gain

* Use bperm trick for iq2_k_r4 gemv -> ~7% gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 07:25:35 +03:00
Kawrakow
ca8c72ff1a AVX512+AVXVNNI GEMM implementation for quants using Q8_K for activations (#710)
* q8_k_r16: basics

* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+

PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.

* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+

* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+

* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+

* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+

* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+

* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+

* Fix AVX2

* Just always set num_rows to 16

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-22 06:27:07 +03:00
Kawrakow
c5f58e0270 CUDA: faster IQ3_K, IQ3_KS, IQ3_K_R4 (#714)
* Use bperm trick for iq3_ks - 5% PP performance gain

* Use bperm trick for iq3_k -> 5% PP performance gain

* Use bperm trick for iq3_k -> 8% PP performance gain

* Use bperm trick for iq3_k_r4 gemv -> ~5% faster

* Use bperm trick for iq3_k gemv -> ~3% faster

* Use bperm trick for iq3_k gemv -> 4.5% gain

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 19:08:57 +03:00
Kawrakow
9f3d062ba7 CUDA: faster prompt processing for 4-bit quants (#713)
* Use __byte_perm in get_int_from_table_16

* Use get_int_from_table_16 everywhere for 4-bit quants

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-21 15:57:35 +03:00
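
A CPU-side illustration of the get_int_from_table_16 idea: four 4-bit indices are looked up in a 16-entry byte table and repacked into a 32-bit word; __byte_perm is what makes the packing cheap on CUDA. The real helper processes both nibble halves of a word; this scalar version is simplified:

```cpp
#include <cstdint>

// Look up the low nibble of each byte of `nibbles` in a 16-entry table
// and pack the four result bytes back into one 32-bit value.
uint32_t int_from_table_16(uint32_t nibbles, const uint8_t table[16]) {
    uint32_t result = 0;
    for (int i = 0; i < 4; ++i) {
        const uint32_t idx = (nibbles >> (8 * i)) & 0x0F;
        result |= (uint32_t)table[idx] << (8 * i);
    }
    return result;
}
```
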
Kawrakow
a9eeef53f3 Fix q8_0 repacking issues on AVX2 (#708)
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.

For now a simple fix: just take the slow path, do not repack.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-19 19:49:58 +03:00
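
Conceptually, the interim fix is just a guard on the repack decision; the enum and function below are purely illustrative, not the actual ik_llama.cpp types:

```cpp
// If the activations were quantized with the Q8_0_X4 layout, skip repacking
// the weights to Q8_0_R8 (which would require Q8_2_X4 activations) and take
// the slower unrepacked path instead.
enum class ActLayout { Q8_0_X4, Q8_2_X4 };   // illustrative names only

bool should_repack_to_q8_0_r8(ActLayout activations) {
    return activations == ActLayout::Q8_2_X4;  // otherwise: slow path, no repack
}
```
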
usrlocalben
f98b1befdb remove curious assertions (#705)
This assertion can hit during prefill as MLA/KV tensors grow,
e.g. Kimi K2 n_ctx >= 32768.
2025-08-19 14:41:29 +03:00
Kawrakow
6b2c84b099 Revert "Better CPU prompt processing performance for SWA models (#696)" (#701)
This reverts commit 93a4f6089f.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 15:44:02 +03:00
Kawrakow
1ca612375e Fix GLM-4.5 attention (#700)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 14:31:03 +03:00
Kawrakow
d4d017766e Better CPU prompt processing performance for SWA models (#696)
* This does the trick for PP

* Compute mask bounds when creating the mask

* Set mask bounds for all supported SWA models

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-17 10:30:27 +03:00
Kawrakow
2e2abddaa8 Quick hack to improve TG performance for SWA models (#692)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 16:43:04 +03:00
Kawrakow
633e0617b0 Enable CUDA graphs for MoE models + GPT-OSS support (#689)
* gpt-oss: common

* gpt-oss: attention sinks, swiglu_oai

* gpt-oss: WIP llama

Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?

* gpt-oss: CPU seems to be working

It was the SWA that was missing in the previous commit.

There are issues with EOG tokens, so this still needs to be added.

* CUDA: ADD_ID

Just a copy from mainline

* gpt-oss: Seems to be working on CUDA

* gpt-oss: add sinks to the attn-vec kernels

* CUDA: add head size of 64 to new mma

Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.

* gpt-oss: add ability to use -fmoe (only CUDA for now)

* Move row sums to the right place

* Add sinks to iqk flash attention

* gpt_oss: Implement -fmoe on the CPU

* Simdify swiglu_oai

Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.

* llama: factor out model loader

* Builds successfully

* It runs, but mmap does not work

* Fix llama_mmap so mmap works

* Minor

* Fix CUDA after latest changes

* Attempt to use CUDA graphs with MoE models - not working

* CUDA graphs WIP - still not working

* CUDA graphs - seems to be working

Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.

* Make q8_0 cache work for DeepSeek models with CUDA graphs

* cuda: cpy for q6_0

* Fix llama_mmap on non-Linux platforms

* Adding forgotten file

* Iterating on Windows build failures

* cuda: re-add q8_0 -> q8_0 transpose

so mla = 2 can be used with CUDA graphs and q8_0 cache.

* Disable graphs without -fmoe

* Minor

* Turn graphs on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 09:18:07 +03:00
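
A hedged sketch of the attention-sink handling mentioned in this log: the per-head sink acts as an extra logit that participates in the softmax normalization but contributes no value row, which slightly dampens all attention weights:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Softmax over the scores with an additional sink logit in the denominator.
void softmax_with_sink(std::vector<float>& scores, float sink) {
    float m = sink;
    for (float s : scores) m = std::max(m, s);
    float denom = std::exp(sink - m);
    for (float& s : scores) { s = std::exp(s - m); denom += s; }
    for (float& s : scores) s /= denom;
}
```
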
Kawrakow
e23b2a7cc9 MXFP4 (#682)
* mxfp4: basics

* mxfp4: Zen4 GEMM

* mxfp4: repacked GEMM (AVX2/Zen4)

* mxfp4: AVX2 GEMM

* mxfp4: NEON GEMM

* mxfp4: repacked GEMM (NEON)

* mxfp4: Metal

* Fix quantized K cache without FA (#680)

* Prevent assert with quantized K cache and no FA

* Fix MMQ when running with quantized K cache without FA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Fix for Deepseek r1 parsing (#676)

* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing

* Fix DeepSeek R1 parsing issue with responses wrapped in think tags

Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.

Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.

* Implement proper reasoning handling following original llama.cpp patterns

- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration

Based on original llama.cpp implementation patterns.

* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue

- Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix

Next: Update server to automatically apply this configuration.

* Complete server integration fix for DeepSeek R1 thinking tag termination

- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users

* Add TDD test case for DeepSeek R1 thinking tag termination issue

- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution

* Add remaining TDD test changes for DeepSeek R1 thinking tag fix

* Add debug output after upstream merge

* Remove temporary benchmark and debug files

- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)

* Port cpu moe options from mainline (#672)

* Port cpu moe options from mainline

* Use strdup and int32_t to follow coding guidelines

* mxfp4: CUDA dequantize

* mxfp4: CUDA GEMV

* mxfp4: CUDA MMQ

* mxfp4: minor CUDA tweaks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
2025-08-09 08:40:18 +03:00
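
For the MXFP4 entries in the commit above, a sketch of the element math under the OCP MX convention the format follows: a block of 32 FP4 (E2M1) values shares one E8M0 power-of-two scale. The actual block layout in ik_llama.cpp may differ:

```cpp
#include <cmath>
#include <cstdint>

// E2M1 magnitudes: 0, 0.5, 1, 1.5, 2, 3, 4, 6; the top bit of the nibble is the sign.
static const float kE2M1[8] = {0.0f, 0.5f, 1.0f, 1.5f, 2.0f, 3.0f, 4.0f, 6.0f};

// Decode one MXFP4 element given the block's shared E8M0 scale byte.
float mxfp4_decode(uint8_t e8m0_scale, uint8_t fp4 /* low 4 bits used */) {
    const float scale = std::ldexp(1.0f, (int)e8m0_scale - 127); // 2^(e-127)
    const float mag   = kE2M1[fp4 & 0x7];
    return (fp4 & 0x8) ? -scale * mag : scale * mag;
}
```
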
Kawrakow
d95ac93027 Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA

* Fix MMQ when running with quantized K cache without FA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-08 13:51:14 +03:00
Kawrakow
ffd211849b Vulkan: add cmake options to build without coopmat(2) support (#674)
So I can test KHR coopmat and no coopmat.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-07 17:26:21 +03:00
Kawrakow
dffa0a95b3 IQ4_KSS improvements (#642)
* iq4_kss: slightly better quantization

* iq4_kss: CUDA MMQ

* iq4_kss: repack/convert to q8_k_r8 (AVX2)

* iq4_kss: repack/convert to q8_k_r8 (NEON)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-23 20:50:57 +02:00
Kawrakow
e1164e1fd8 Adding IQ1_KT - 1.75 bpw SOTA quants (#616)
* iq1_kt: basics

* iq1_kt: CUDA dequantize

Testing with LlaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.

* iq1_kt: CUDA MMQ

* iq1_kt: CUDA MMVQ

* iq1_kt: AVX2 GEMM/GEMV

* iq1_kt: convert/repack to q8_0_r8 (AVX2)

* iq1_kt: slightly faster GEMV

18.6 t/s -> 19.4 t/s

* iq1_kt: NEON GEMM/GEMV

Pathetic as usual

* iq1_kt: slightly faster NEON - still pathetic

* iq1_kt: tiny bit better GEMV on NEON

* iq1_kt: convert/repack to q8_0_r8 (NEON)

* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON

* Adding forgotten file

* iq1_kt: add to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 10:05:23 +02:00
Kawrakow
d0bc1f8296 IQ1_M GEMM for ARM_NEON (#631)
* iq1_m GEMM on NEON

* Set repacking threshold

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 09:49:59 +02:00
Kawrakow
3da192ac33 Remove forgotten change 2025-07-18 20:11:57 +03:00