* Fixing Gigachat support
* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)
* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix q5_0_r4
The issue was in the tail part. As almost all models have tensor
rows that are a multiple of 128, that part was never triggered in testing.
But the gpt-oss models have an embedding size of 2880, so we end
up there and trigger the bug.
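The bug class here is easy to leave untested: a remainder loop only runs when the row size is not a multiple of the SIMD block. A minimal sketch (hypothetical names, not the actual kernel) of why 2880-wide rows (22*128 + 64) reach the tail path while 128-multiple rows never do:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: sum a row in blocks of 128 plus a scalar tail. Rows whose size
// is a multiple of 128 never execute the tail loop, so a bug there goes
// unnoticed until a model like gpt-oss (row = 2880 = 22*128 + 64) hits it.
float row_sum(const std::vector<float>& x) {
    const size_t n = x.size();
    float sum = 0.f;
    size_t i = 0;
    for (; i + 128 <= n; i += 128)      // "SIMD" main loop, blocks of 128
        for (size_t j = 0; j < 128; ++j) sum += x[i + j];
    for (; i < n; ++i) sum += x[i];     // tail: runs only when n % 128 != 0
    return sum;
}
```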
* Fix q6_0_r4
Same fix as q5_0_r4
* Fix q4_0_r8
* Fix q5_0_r4 and q6_0_r4 also on Zen4
* Fix q4_0_r8 also on Zen4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.
Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions
Build requirements for ARM:
cmake .. -DGGML_CUDA=ON \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"
Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory
Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
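The errors above are what you get when NEON intrinsics are referenced without the header or the right `-march`. A minimal sketch (not the project's actual code) of the usual guard pattern, with a scalar fallback so it compiles on any platform:

```cpp
#include <cassert>

#if defined(__ARM_NEON)
#include <arm_neon.h>   // provides float32x4_t, vld1q_f32, ...
#endif

// Sum 4 floats, using NEON when available; scalar otherwise.
float sum4(const float* x) {
#if defined(__ARM_NEON)
    float32x4_t v = vld1q_f32(x);   // load 4 floats into a NEON register
    return vaddvq_f32(v);           // horizontal add (ARMv8 A64)
#else
    return x[0] + x[1] + x[2] + x[3];
#endif
}
```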
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
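The fused chain above mirrors the usual MoE router math: sigmoid the router logits, add a per-expert bias, take the top-k, then renormalize the selected weights (the fused sum_rows + div step). A scalar sketch under assumed names, not the actual fused kernel:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical sketch of the fused routing op: sigmoid(logits) + bias ->
// top-k expert indices -> renormalize the selected weights so they sum
// to 1 (the sum_rows + div step folded into the same pass).
std::vector<std::pair<int,float>> route(const std::vector<float>& logits,
                                        const std::vector<float>& bias, int k) {
    const int n = (int)logits.size();
    std::vector<float> score(n);
    for (int i = 0; i < n; ++i)
        score[i] = 1.f/(1.f + std::exp(-logits[i])) + bias[i];
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b){ return score[a] > score[b]; });
    float sum = 0.f;
    for (int i = 0; i < k; ++i) sum += score[idx[i]];
    std::vector<std::pair<int,float>> out;
    for (int i = 0; i < k; ++i) out.push_back({idx[i], score[idx[i]]/sum});
    return out;
}
```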
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
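Grouped expert routing (as in DeepSeek-style models) first picks the best scoring groups, then takes the top-k experts only from those groups. A minimal scalar sketch, assuming each group is scored by its best expert (the actual scoring rule may differ):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of grouped top-k: scores for n experts split into
// n_groups equal groups; keep the top_groups best groups (scored here by
// their max expert), then take the top-k over the surviving experts only.
std::vector<int> grouped_topk(const std::vector<float>& s,
                              int n_groups, int top_groups, int k) {
    const int gsize = (int)s.size()/n_groups;
    std::vector<int> g(n_groups);
    for (int i = 0; i < n_groups; ++i) g[i] = i;
    auto gscore = [&](int gi) {
        return *std::max_element(s.begin()+gi*gsize, s.begin()+(gi+1)*gsize); };
    std::partial_sort(g.begin(), g.begin()+top_groups, g.end(),
                      [&](int a, int b){ return gscore(a) > gscore(b); });
    std::vector<int> cand;                 // experts from surviving groups
    for (int i = 0; i < top_groups; ++i)
        for (int j = 0; j < gsize; ++j) cand.push_back(g[i]*gsize + j);
    std::partial_sort(cand.begin(), cand.begin()+k, cand.end(),
                      [&](int a, int b){ return s[a] > s[b]; });
    cand.resize(k);
    return cand;
}
```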
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
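Skipping chunks whose mask is entirely -infinity is free accuracy-wise, since exp(-inf) = 0 contributes nothing to the attention sum. A scalar sketch of the check (hypothetical name, not the kernel's actual code):

```cpp
#include <cassert>
#include <limits>

// Hypothetical sketch: before processing a K/V chunk in flash attention,
// check whether the mask for that chunk is all -inf. exp(-inf) = 0, so
// such a chunk contributes nothing and can be skipped entirely.
bool chunk_is_masked_out(const float* mask, int chunk_size) {
    const float ninf = -std::numeric_limits<float>::infinity();
    for (int i = 0; i < chunk_size; ++i)
        if (mask[i] != ninf) return false;
    return true;
}
```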
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
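The fused op computes act(gate(x)) * up(x) in one pass instead of separate unary and mul graph nodes. A scalar sketch with SiLU as the unary, assuming the gate and up projections have already been applied:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch of the fused FFN op: given the gate and up
// projection outputs, compute silu(gate) * up element-wise in one pass
// instead of separate unary and mul graph nodes.
std::vector<float> fused_up_gate_silu(const std::vector<float>& gate,
                                      const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float g = gate[i];
        out[i] = (g / (1.f + std::exp(-g))) * up[i];  // silu(g) * up
    }
    return out;
}
```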
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
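Validating quants at load time amounts to dequantizing each block and scanning for NaNs before the tensor is used. A minimal sketch of the scan on already-dequantized data (hypothetical naming):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: scan a dequantized tensor row for NaNs and report
// the first bad index (-1 if clean), as a load-time validation pass would.
int first_nan(const std::vector<float>& x) {
    for (size_t i = 0; i < x.size(); ++i)
        if (std::isnan(x[i])) return (int)i;
    return -1;
}
```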
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This fixes confusion around Q8_0 on AVX2
* This does it for iq4_nl, including FA
* This does it for iq4_nl on Zen4, but FA does not work
* Slightly more clear
* Adding forgotten q8_0_r8 to num_rows()
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q8_k_r16: basics
* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+
PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.
* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+
* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+
* Fix AVX2
* Just always set num_rows to 16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.
For now a simple fix: just take the slow path, do not repack.
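The guard can be read as: before repacking the weights, check that the activation packing the fused op already prepared matches what the repacked type consumes, and fall back to the slow path otherwise. A sketch with made-up enum names, purely to illustrate the mismatch:

```cpp
#include <cassert>

// Made-up enum sketching the mismatch: the Q8_0 GEMM consumes activations
// packed one way, while Q8_0_R8 needs a different packing. Repacking the
// weights after the activations were already quantized leaves the two out
// of sync.
enum class ActPacking { Q8_0_X4, Q8_2_X4 };

// Return true only when repacking is safe, i.e. the activations were
// prepared in the format the repacked type consumes; otherwise the
// caller takes the slow (non-repacked) path.
bool can_repack_to_r8(ActPacking prepared) {
    return prepared == ActPacking::Q8_2_X4;
}
```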
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models
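Computing mask bounds at creation time gives the attention kernel, per row, the first and last unmasked position, so it can skip everything outside that window; for SWA masks this range is narrow. A scalar sketch (hypothetical names):

```cpp
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sketch: for one mask row, record the [first, last] range of
// finite (non -inf) entries while the mask is built, so the attention
// kernel only iterates over that range. Returns {-1, -1} if fully masked.
std::pair<int,int> mask_bounds(const std::vector<float>& row) {
    const float ninf = -std::numeric_limits<float>::infinity();
    int first = -1, last = -1;
    for (int i = 0; i < (int)row.size(); ++i)
        if (row[i] != ninf) { if (first < 0) first = i; last = i; }
    return {first, last};
}
```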
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so this still needs to be addressed.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* mxfp4: basics
* mxfp4: Zen4 GEMM
* mxfp4: repacked GEMM (AVX2/Zen4)
* mxfp4: AVX2 GEMM
* mxfp4: NEON GEMM
* mxfp4: repacked GEMM (NEON)
* mxfp4: Metal
* Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA
* Fix MMQ when running with quantized K cache without FA
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2
* Implement basic tool choice
* Backport llama.cpp tool calls support
* Enhance function calls with improved chat parser and string utilities
- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components
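Of the string utilities listed, find_partial_stop is the interesting one for streaming: it finds where a partial match of a stop string begins at the end of the generated text, so output can be held back until the match resolves. A sketch of one way it can work (not necessarily the exact implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: return the position in `text` where a (possibly partial) match
// of `stop` begins at the end of the text, or npos if none. A server
// streaming tokens holds back text from that position until the match
// either completes (stop generating) or fails (flush it).
size_t find_partial_stop(const std::string& text, const std::string& stop) {
    if (!text.empty() && !stop.empty()) {
        for (size_t len = std::min(text.size(), stop.size()); len > 0; --len) {
            // does the last `len` chars of text equal the first `len` of stop?
            if (text.compare(text.size() - len, len, stop, 0, len) == 0)
                return text.size() - len;
        }
    }
    return std::string::npos;
}
```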
* Enhance function calling with unified streaming and parser improvements
- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation
* Replace hardcoded values in kimi_k2_parser.hpp with named constants
- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser
* Fix duplicate common_chat_parse definition
- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse
* Fix JSON assertion failure in function call parsing
- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
* Add comprehensive Qwen3 XML tool calling support with unit tests
- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
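The Qwen3 format wraps a JSON object in <tool_call>...</tool_call> tags. A minimal extraction sketch; the real parser also validates the JSON and handles partial (streaming) input:

```cpp
#include <cassert>
#include <string>

// Minimal sketch of Qwen3-style tool-call extraction: pull out the JSON
// payload between <tool_call> and </tool_call>. Returns "" when no
// complete tool call is present.
std::string extract_tool_call(const std::string& text) {
    const std::string open = "<tool_call>", close = "</tool_call>";
    const size_t b = text.find(open);
    if (b == std::string::npos) return "";
    const size_t e = text.find(close, b + open.size());
    if (e == std::string::npos) return "";
    return text.substr(b + open.size(), e - b - open.size());
}
```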
* Add DeepSeek R1 function calling support with comprehensive unit tests
- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility
Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
* Add partial parsing support for JSON and regex
- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality
* Add format_chat integration tests for Qwen3 tool injection
- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation
Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.
* Fix Qwen3 tool call parsing - pass model name to parser
Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.
* Fix non-streaming path to use model-specific parsing
Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
* Update Qwen3 function call handling in server and tests
- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support
* Add DeepSeek-R1 function call parsing support
Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>
Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality
Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.
* Update function_calls.md documentation for DeepSeek-R1 Format 4
- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers
* Fix merge conflict in test-function-calls.cpp
- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing
* Fix DeepSeek R1 parsing issue with responses wrapped in think tags
Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.
Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.
* Implement proper reasoning handling following original llama.cpp patterns
- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration
Based on original llama.cpp implementation patterns.
* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue
✅ Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix
Next: Update server to automatically apply this configuration.
* Complete server integration fix for DeepSeek R1 thinking tag termination
- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users
* Add TDD test case for DeepSeek R1 thinking tag termination issue
- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution
* Add remaining TDD test changes for DeepSeek R1 thinking tag fix
* Add debug output after upstream merge
* Remove temporary benchmark and debug files
- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)
* Port cpu moe options from mainline (#672)
* Port cpu moe options from mainline
* Use strdup and int32_t to follow coding guidelines
* mxfp4: CUDA dequantize
* mxfp4: CUDA GEMV
* mxfp4: CUDA MMQ
* mxfp4: minor CUDA tweaks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
* iq1_kt: basics
* iq1_kt: CUDA dequantize
Testing with LlaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.
* iq1_kt: CUDA MMQ
* iq1_kt: CUDA MMVQ
* iq1_kt: AVX2 GEMM/GEMV
* iq1_kt: convert/repack to q8_0_r8 (AVX2)
* iq1_kt: slightly faster GEMV
18.6 t/s -> 19.4 t/s
* iq1_kt: NEON GEMM/GEMV
Pathetic as usual
* iq1_kt: slightly faster NEON - still pathetic
* iq1_kt: tiny bit better GEMV on NEON
* iq1_kt: convert/repack to q8_0_r8 (NEON)
* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON
* Adding forgotten file
* iq1_kt: add to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Experiments for 2.6875 bpw quants
At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
* iq2_kl: basics
* iq2_kl: CUDA dequantize
* iq2_kl: small improvement in PPL
Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.
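The neighbour-scale trick is cheap because round-tripping a block through a candidate scale is a small loop. A scalar sketch with a toy quant range (hypothetical details, not the iq2_kl code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: squared error of round-trip quantization of a
// block with a given scale, quants clamped to a toy [-7, 7] range.
static float block_err(const std::vector<float>& x, float scale) {
    if (scale == 0.f) return 1e30f;
    float err = 0.f;
    for (float v : x) {
        int q = (int)std::lround(v / scale);
        q = q < -7 ? -7 : (q > 7 ? 7 : q);
        float d = v - q * scale;
        err += d * d;
    }
    return err;
}

// Try the chosen integer scale id and its two neighbours; keep whichever
// minimizes the (R)MSE -- the improvement described above.
int best_scale_id(const std::vector<float>& x, int id0, float step) {
    int best = id0;
    float best_err = block_err(x, id0 * step);
    for (int id : {id0 - 1, id0 + 1}) {
        float e = block_err(x, id * step);
        if (e < best_err) { best_err = e; best = id; }
    }
    return best;
}
```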
* iq2_kl: MMQ
Quite good: PP-512(L3-8B) = 8472 t/s.
* iq2_kl: MMVQ
We get PP-128(L3-8B) = 162 t/s.
Which means that this is not quite as good as it should be, as
(almost) same-bpw q2_K is at 170 t/s.
* iq2_kl: Zen4 GEMM/GEMV
Not particularly fast. I may need to think about rearranging the bits.
* iq2_kl: better Zen4
* iq2_kl: convert/repack to q8_k_r8 (AVX2)
* iq2_kl: AVX2 GEMM/GEMV
* iq2_kl: WIP NEON
The compiler started crashing!!!
* iq2_kl: NEON
Had to work around a compiler crash when using vzip2q_u8
by using vqtbl2q_u8 instead.
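vzip2q_u8 interleaves the upper halves of two 16-byte vectors; the same permutation can be expressed as a table lookup over the concatenated 32 bytes with a constant index vector, which is what vqtbl2q_u8 provides. A portable scalar model of the equivalence (not the NEON code itself):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of vzip2q_u8(a, b): interleave the upper 8 bytes of a and b.
std::array<uint8_t,16> zip2_u8(const std::array<uint8_t,16>& a,
                               const std::array<uint8_t,16>& b) {
    std::array<uint8_t,16> r{};
    for (int i = 0; i < 8; ++i) { r[2*i] = a[8+i]; r[2*i+1] = b[8+i]; }
    return r;
}

// Scalar model of the vqtbl2q_u8 workaround: the same permutation as a
// table lookup over the concatenated 32 bytes {a, b}, using the constant
// indices 8..15 (from a) interleaved with 24..31 (from b).
std::array<uint8_t,16> zip2_via_tbl(const std::array<uint8_t,16>& a,
                                    const std::array<uint8_t,16>& b) {
    std::array<uint8_t,32> tab{};
    for (int i = 0; i < 16; ++i) { tab[i] = a[i]; tab[16+i] = b[i]; }
    std::array<uint8_t,16> r{};
    for (int i = 0; i < 8; ++i) { r[2*i] = tab[8+i]; r[2*i+1] = tab[24+i]; }
    return r;
}
```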
* iq2_kl: convert/repack to q8_k_r8 (NEON)
* iq2_kl: Metal dequantize
* iq2_kl: Metal GEMV - pretty slow
* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)
* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)
* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 476 t/s up from 466 t/s.
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 492 t/s up from 476 t/s.
* Add iq2_kl to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_xxs
55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s.
* iq2_xs
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
* iq2_s
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
* iq3_xxs
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
* iq3_s
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
* q2_k
85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
* q3_K
45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
* q6_k
47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
* q4_k
58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.
As I had to add a new implementation for q8_1-quantized
activations, TG became slightly faster too
(25.1 -> 25.9 t/s).
* q5_k
54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
* iq4_xs
71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_kt and iq3_kt work with new int trellis
Much slower than the fp16 based trellis. I guess, Apple doesn't
have int8_t SIMD on the M2-Max GPU.
* q4_0
83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s
* q5_0
74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
* q6_0
74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
* q8_0
84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.
* iq4_nl
84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s
* q4_1
74.4 -> 115.4 t/s. There is no repacked variant
* q5_1
64.2 t/s -> 114.9 t/s. There is no repacked variant.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This seems slightly faster for IQ2_KT, IQ3_KT TG
* This looks better for iq4_kt TG
* WIP
* Cleanup
* With fancy simd also set func16
* Enable next_128() also on AVX2
Despite having just 16 vector registers it is still faster.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Removes errant ";" in front of 0xCBAC1FED in non-x86 code
```
error: expected primary-expression before ';' token
constexpr static uint32_t ka = ;0xCBAC1FED;
^
error: expected unqualified-id before numeric constant
constexpr static uint32_t ka = ;0xCBAC1FED;
^
```