Commit Graph

3832 Commits

Author SHA1 Message Date
Anton Sokolchenko
dc1746338c Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components
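
For readers unfamiliar with the string helpers named above, here is a minimal Python sketch of what such utilities typically do. The project's versions are C++ and their signatures are not shown in this log, so the bodies and the -1 return convention below are assumptions for illustration only.

```python
def starts_with(text: str, prefix: str) -> bool:
    # True when `text` begins with `prefix`.
    return text.startswith(prefix)


def ends_with(text: str, suffix: str) -> bool:
    # True when `text` ends with `suffix`.
    return text.endswith(suffix)


def find_partial_stop(text: str, stop: str) -> int:
    # Return the index where the longest suffix of `text` that is also a
    # prefix of `stop` begins, or -1 if there is none (assumed convention).
    for n in range(min(len(text), len(stop)), 0, -1):
        if text.endswith(stop[:n]):
            return len(text) - n
    return -1
```

In a streaming setting, text before the returned index can be emitted safely, while the remainder is held back until the next chunk shows whether the stop marker completes.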

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
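
The validation described above can be sketched roughly as follows. This is a Python illustration under assumptions, not the project's C++ code: `extract_tool_call` is a hypothetical name, and the returned `(name, arguments)` pair is chosen only for the example.

```python
import json


def extract_tool_call(call):
    # Reject anything whose 'function' field is not an object.
    function = call.get("function") if isinstance(call, dict) else None
    if not isinstance(function, dict):
        return None

    name = function.get("name")
    if not isinstance(name, str) or not name:
        return None

    # A missing 'arguments' field falls back to an empty JSON object.
    arguments = function.get("arguments", "{}")
    if not isinstance(arguments, str):
        arguments = json.dumps(arguments)
    return name, arguments
```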

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
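
As a rough illustration of the `<tool_call>` format named above, the following Python sketch pulls tool calls out of a response and returns the remaining text as content. The function name, the regular expression, and the choice to leave unparseable blocks in the content are assumptions made for this example; the actual parser is C++ and may behave differently.

```python
import json
import re

# Matches <tool_call>{...}</tool_call>, allowing whitespace around the JSON.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_qwen3_tool_calls(text):
    calls = []

    def consume(match):
        try:
            payload = json.loads(match.group(1))
        except json.JSONDecodeError:
            return match.group(0)  # leave malformed blocks untouched
        if not isinstance(payload, dict) or "name" not in payload:
            return match.group(0)
        calls.append({"name": payload["name"],
                      "arguments": payload.get("arguments", {})})
        return ""  # remove the parsed block from the visible content

    content = TOOL_CALL_RE.sub(consume, text).strip()
    return content, calls
```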

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
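
A hedged Python sketch of the native format and the `<think>` extraction listed above. The marker strings are copied from this log as displayed; the function name, the regular expressions, and the return shape are assumptions for illustration and do not reproduce the project's C++ parser.

```python
import json
import re

CALLS_BEGIN = "<|tool▁calls▁begin|>"
CALLS_END = "<|tool▁calls▁end|>"
SEP = "<|tool▁sep|>"
CALL_END = "<|tool▁call▁end|>"

# One call block: function<sep>name```json{...}```<call end>
CALL_RE = re.compile(
    r"function" + re.escape(SEP)
    + r"(?P<name>[^\n`]+?)\s*```json\s*(?P<args>.*?)\s*```\s*"
    + re.escape(CALL_END),
    re.DOTALL,
)
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def parse_deepseek_r1(text):
    # Pull reasoning out of the first <think>...</think> pair, if present.
    reasoning = ""
    think = THINK_RE.search(text)
    if think:
        reasoning = think.group(1).strip()
        text = text[:think.start()] + text[think.end():]

    # Collect every call block between the begin/end markers.
    calls = []
    begin = text.find(CALLS_BEGIN)
    if begin != -1:
        end = text.find(CALLS_END, begin)
        block_end = end if end != -1 else len(text)
        block = text[begin + len(CALLS_BEGIN):block_end]
        for call in CALL_RE.finditer(block):
            # json.loads raises on invalid JSON; a real parser would handle that.
            args = json.loads(call.group("args") or "{}")
            calls.append((call.group("name").strip(), args))
        tail = text[end + len(CALLS_END):] if end != -1 else ""
        text = text[:begin] + tail
    return reasoning, text.strip(), calls
```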

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality
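
One common way to make use of truncated JSON during streaming is to close whatever is still open before parsing. The Python sketch below shows that general idea only; it is an assumption about the technique, not a description of what json-partial.cpp does. It handles truncation inside a string value or after a complete value, and does not repair input cut off in the middle of a key, number, or literal.

```python
import json


def close_partial_json(fragment: str) -> str:
    closers = []        # pending "}" / "]" in opening order
    in_string = False
    escaped = False
    for ch in fragment:
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch == "{":
            closers.append("}")
        elif ch == "[":
            closers.append("]")
        elif ch in "}]" and closers:
            closers.pop()

    if escaped:
        fragment = fragment[:-1]   # drop a dangling backslash
    if in_string:
        fragment += '"'            # terminate the open string
    return fragment + "".join(reversed(closers))
```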

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
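
The routing problem described in the two fixes above can be pictured with a small Python sketch. The function name, the substring matching, and the returned labels are hypothetical; only the Kimi-K2 fallback and the deepseek-r1 / deepseek_r1 naming patterns come from this log.

```python
def detect_tool_format(model_name: str) -> str:
    # Case-insensitive substring matching is an assumption for this sketch.
    name = model_name.lower()
    if "qwen3" in name:
        return "qwen3"
    if "deepseek-r1" in name or "deepseek_r1" in name:
        return "deepseek_r1"
    # With no recognizable model name, parsing falls back to Kimi-K2.
    return "kimi_k2"
```

If the caller passes an empty model name, a Qwen3 response is routed to the Kimi-K2 parser, which is the failure mode these fixes address for both the streaming and non-streaming paths.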

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.
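
The "exception handling chain" mentioned above can be sketched as trying each format parser in turn and moving on when one fails. This Python version is a generic illustration: the parser callables, the use of `ValueError` as the failure signal, and the empty-list result are assumptions, not the project's actual interfaces.

```python
from typing import Callable, List, Sequence

Parser = Callable[[str], list]


def parse_with_fallback(text: str, parsers: Sequence[Parser]) -> List:
    # Try each parser in order; the first one that yields tool calls wins.
    for parser in parsers:
        try:
            calls = parser(text)
        except ValueError:
            continue  # this format did not match, try the next one
        if calls:
            return calls
    return []  # no format matched: the caller treats the text as content
```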

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing

* Fix DeepSeek R1 parsing issue with responses wrapped in think tags

Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.

Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.

* Implement proper reasoning handling following original llama.cpp patterns

- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration

Based on original llama.cpp implementation patterns.

* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue

- Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix

Next: Update server to automatically apply this configuration.

* Complete server integration fix for DeepSeek R1 thinking tag termination

- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users
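
The effect of the `reasoning_in_content` setting can be illustrated with a short Python sketch. The function name and return shape are hypothetical; the behaviour modelled is only what this log states: with the flag on, the `<think>` block stays in the content and the reasoning field is empty.

```python
import re

THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str, reasoning_in_content: bool):
    # Returns (reasoning, content).
    if reasoning_in_content:
        return "", text  # keep the <think> block visible in the content
    match = THINK_RE.search(text)
    if not match:
        return "", text
    reasoning = match.group(1).strip()
    content = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, content
```

With the flag off, a response that consists only of a `<think>` block yields empty content, which is the symptom users reported.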

* Add TDD test case for DeepSeek R1 thinking tag termination issue

- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution

* Add remaining TDD test changes for DeepSeek R1 thinking tag fix

* Add debug output after upstream merge

* Remove temporary benchmark and debug files

- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)
2025-08-08 13:56:44 +03:00
Kawrakow
d95ac93027 Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA

* Fix MMQ when running with quantized K cache without FA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-08 13:51:14 +03:00
Kawrakow
ffd211849b Vulkan: add cmake options to build without coopmat(2) support (#674)
So I can test KHR coopmat and no coopmat.

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-07 17:26:21 +03:00
Anton Sokolchenko
05a61510b9 Fix Qwen3 content extraction breaking code formatting (#661)
Problem:
- qwen3::extract_content_during_parsing() used aggressive regex to collapse multiple newlines
- This broke proper code formatting (e.g., PEP 8's 2 empty lines between functions)
- Affected non-tool-call streaming output where formatting is critical

Solution:
- Replace aggressive std::regex_replace(R"(\n\s*\n)", "\n") with gentle string_strip()
- Follow original llama.cpp patterns: only trim leading/trailing whitespace
- Preserve internal formatting including multiple newlines
- Add proper include for common.h to access string_strip function
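
To see why the regex was a problem, the following Python sketch applies the same pattern to a small snippet and compares it with trimming only the ends. The helper names are made up for this example, and Python's `re.sub` and `str.strip` stand in for `std::regex_replace` and `string_strip()`.

```python
import re


def collapse_blank_lines(text: str) -> str:
    # The old behaviour: any run of blank lines becomes a single newline.
    return re.sub(r"\n\s*\n", "\n", text)


def strip_edges(text: str) -> str:
    # The new behaviour: trim leading/trailing whitespace only.
    return text.strip()


code = "\n\ndef a():\n    pass\n\n\ndef b():\n    pass\n\n"
collapsed = collapse_blank_lines(code)
stripped = strip_edges(code)
```

Here `collapsed` no longer contains the two empty lines between the functions, while `stripped` keeps them.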

Changes:
- examples/server/parsers/qwen3_parser.hpp: Replace whitespace cleanup with string_strip()
- tests/test-function-calls.cpp: Add test_qwen3_whitespace_preservation() to prevent regression

Testing:
- PEP 8 compliance: 2 empty lines between functions preserved
- Tool call parsing: All Qwen3 tests continue to pass
- No regressions: Existing functionality maintained
- Follows original llama.cpp whitespace handling patterns
2025-08-07 08:22:01 +03:00
Anton Sokolchenko
f4051d9c3e Deepseek R1 function calls (more formats) (#652)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing
2025-08-07 08:15:57 +03:00
Thireus ☠
d65d5fe29e Add support for GLM-4.5 models (#668)
* GLM-4.5

* GLM-4.5

* GLM-4.5

* convert_hf_to_gguf.py compatibility bugfix with GLM-4.5

From @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3145913701

* Add ubergarm comments + my own

* Revert to llama.cpp script version that produced good BF16

See: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3147374559

* Support for jinja chat templates

See https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3148109962

* GLM-4.5 llama.cpp final port

* Handle TENSOR_SKIP

Ported the changes from:

f129567dc0
dcbbd2cb05

Except op info since ik_llama.cpp doesn't support this operation.

* Bugfix for TENSOR_SKIP

skip loading if a tensor has the TENSOR_SKIP flag - @ubergarm via https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155297198

* Update llama.cpp

Restore original GGML_ASSERT

* Fix chat template detection

Changes suggested by @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155927840

* Revert to original GGML_ASSERT
2025-08-07 07:55:00 +03:00
firecoperana
ddceb0a55d Merge pull request #648 from ikawrakow/fcp/missing_token_ps
Fix missing token per second for webui after function call update
2025-07-26 21:13:52 -05:00
Anton Sokolchenko
33daaf7310 Fix text generation endpoint (#654) 2025-07-26 19:36:48 -05:00
firecoperana
f443040d49 webui: move preset settings to top
webui: bug fix
2025-07-25 18:03:01 -05:00
firecoperana
981259fb8b bug fix no timings after tool update 2025-07-25 17:52:43 -05:00
Anton Sokolchenko
cfc8f5a61b Enable LLM function calls (#643) 2025-07-24 20:24:12 +02:00
Kawrakow
dffa0a95b3 IQ4_KSS improvements (#642)
* iq4_kss: slightly better quantization

* iq4_kss: CUDA MMQ

* iq4_kss: repack/convert to q8_k_r8 (AVX2)

* iq4_kss: repack/convert to q8_k_r8 (NEON)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-23 20:50:57 +02:00
Kawrakow
0486b5ad93 Update README.md 2025-07-23 19:38:54 +02:00
Kawrakow
d78df741ce Update AUTHORS 2025-07-23 18:14:51 +02:00
Anton Sokolchenko
9ee72225dc Function calling support for Kimi-K2 (#628)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
2025-07-23 18:11:42 +02:00
Thomas
eaa2510a28 Add GitHub data: filename sanitization (#640) 2025-07-23 13:31:53 +02:00
Kawrakow
3600d82e98 Fix pauses after a comma (#639)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-23 11:45:58 +02:00
Thomas
94aa54df76 Add GitHub data (#637) 2025-07-22 18:18:40 +02:00
Kawrakow
9513222ba5 Revert "Update README.md"
This reverts commit b48d71fec8.
t0002
2025-07-22 15:22:46 +03:00
Kawrakow
4ea000892d Add .mailmap 2025-07-22 14:53:50 +03:00
Kawrakow
c3cd543d77 Update README.md 2025-07-22 09:01:59 +02:00
firecoperana
18eeb48941 Webui: New Features for Conversations, Settings, and Chat Messages (#618)
* Webui: add Rename/Upload conversation in header and sidebar

webui: don't change modified date when renaming conversation

* webui: add a preset feature to the settings #14649

* webui: Add editing assistant messages #13522

Webui: keep the following message while editing assistant response.

webui: change icon to edit message

* webui: DB import and export #14347

* webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
fix sidebar being covered by main content #14082

---------

Co-authored-by: firecoperana <firecoperana>
2025-07-20 12:33:55 +02:00
Kawrakow
e1164e1fd8 Adding IQ1_KT - 1.75 bpw SOTA quants (#616)
* iq1_kt: basics

* iq1_kt: CUDA dequantize

Testing with LLaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.

* iq1_kt: CUDA MMQ

* iq1_kt: CUDA MMVQ

* iq1_kt: AVX2 GEMM/GEMV

* iq1_kt: convert/repack to q8_0_r8 (AVX2)

* iq1_kt: slightly faster GEMV

18.6 t/s -> 19.4 t/s

* iq1_kt: NEON GEMM/GEMV

Pathetic as usual

* iq1_kt: slightly faster NEON - still pathetic

* iq1_kt: tiny bit better GEMV on NEON

* iq1_kt: convert/repack to q8_0_r8 (NEON)

* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON

* Adding forgotten file

* iq1_kt: add to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 10:05:23 +02:00
Kawrakow
d0bc1f8296 IQ1_M GEMM for ARM_NEON (#631)
* iq1_m GEMM on NEON

* Set repacking threshold

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 09:49:59 +02:00
Kawrakow
3da192ac33 Remove forgotten change 2025-07-18 20:11:57 +03:00
Kawrakow
712eb7b45c GEMM for iq1_m (#630)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-18 18:55:43 +02:00
Thireus ☠
cc51044e72 Add GGML_MAX_CONTEXTS definition in CMakeLists.txt (#622)
* Add GGML_MAX_CONTEXTS definition in CMakeLists.txt

If this entry is missing, GGML_MAX_CONTEXTS is ignored

* Update CMakeLists.txt

add_compile_definitions for GGML_MAX_CONTEXTS
2025-07-17 08:50:42 +02:00
Thireus ☠
eddeaac009 Bump Windows max open files from 512 to 2048 (#620)
* Bump windows max open files from 512 to 2048

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=msvc-160

* Make _GGML_STDIO_TARGET dependent on GGML_MAX_CONTEXTS for Windows
2025-07-17 08:50:26 +02:00
ubergarm
5e357db589 Fixup kimi-k2 convert indentation (#617) 2025-07-16 15:24:20 +02:00
Thireus ☠
da38486de5 Bump GGML_MAX_CONTEXTS to allow loading more shards (#611)
* Bump GGML_MAX_CONTEXTS to allow loading more shards

This var prevents more than 64 shards from being loaded - Specifically relevant for large models such as DeepSeek R1.

* https://github.com/ikawrakow/ik_llama.cpp/pull/611#issuecomment-3072175559
2025-07-16 14:11:19 +02:00
ubergarm
d3ed217798 kimi-k2 convert script and chat template (#612)
* convert_hf_to_gguf for Kimi-K2-Instruct

Adapt mainline `PR14653` for tokenizer while maintaining proper MLA
tensors. Tested with this workflow using deepseek fp8_cast_bf16.py and
triton-cpu to upcast the fp8 safetensors to bf16 safetensors then used
this convert_hf_to_gguf.

* Add Kimi-K2 chat template

moonshotai/Kimi-K2-Instruct

https://github.com/ikawrakow/ik_llama.cpp/pull/609#issuecomment-3071259454

* kimi-k2 add ass to template to get response
2025-07-15 19:54:04 +02:00
Kawrakow
19c57dbe1d Vulkan: a fresh start (#608)
* It compiles

* Seems to be working with coopmat

* Vulkan needs f32 precision for flash attention

* Vulkan: fix u_batch > 4096/n_active_experts

for coopmat1. Without this fix we get an assert.
We get the same assert in mainline too.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-15 08:03:13 +02:00
Kawrakow
f375799f17 Adding IQ2_KL (#602)
* Experiments for 2.6875 bpw quants

At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
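
The bits-per-weight figures above can be checked with a little arithmetic. In this Python sketch, the 84-byte size of a q2_K super-block covering 256 weights is an assumption based on ggml's usual block layout rather than something stated in this log; the helper name is made up for the example.

```python
def bits_per_weight(block_bytes: int, weights_per_block: int) -> float:
    # Total bits in a block divided by the number of weights it encodes.
    return block_bytes * 8 / weights_per_block


# Assumed: q2_K stores 256 weights in an 84-byte super-block.
Q2_K_BPW = bits_per_weight(84, 256)

# "1/16 more bits per weight" than q2_K gives the 2.6875 bpw in the title.
NEW_BPW = Q2_K_BPW + 1 / 16
```

Under that assumption q2_K works out to 2.625 bpw, and adding 1/16 of a bit per weight gives 2.6875 bpw.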

* iq2_kl: basics

* iq2_kl: CUDA dequantize

* iq2_kl: small improvement in PPL

Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.

* iq2_kl: MMQ

Quite good: PP-512(L3-8B) = 8472 t/s.

* iq2_kl: MMVQ

We get PP-128(L3-8B) = 162 t/s.
This is not quite as good as it should be, since q2_K at
(almost) the same bpw is at 170 t/s.

* iq2_kl: Zen4 GEMM/GEMV

Not particularly fast. I may need to think about rearranging the bits.

* iq2_kl: better Zen4

* iq2_kl: convert/repack to q8_k_r8 (AVX2)

* iq2_kl: AVX2 GEMM/GEMV

* iq2_kl: WIP NEON

The compiler started crashing!!!

* iq2_kl: NEON

Had to work around a compiler crash when using vzip2q_u8 using
vqtbl2q_u8.

* iq2_kl: convert/repack to q8_k_r8 (NEON)

* iq2_kl: Metal dequantize

* iq2_kl: Metal GEMV - pretty slow

* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)

* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)

* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 476 t/s up from 466 t/s.

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 492 t/s up from 476 t/s.

* Add iq2_kl to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-14 18:55:08 +02:00
Aleksey Nikiforov
da8998c6c6 Ported kimi-k2 support from llama.cpp (#609)
Original patch by @gabriellarson:
https://github.com/ggml-org/llama.cpp/pull/14654

Co-authored-by: anikifoss <anikifoss>
2025-07-14 18:43:52 +02:00
Kawrakow
4f56069442 Add iq3_ks to constants.py (#606)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-13 19:14:26 +02:00
Nexes the Elder
e2cf466eaa Fix attn_v conditionality (#604)
To retain compatibility with: https://github.com/ikawrakow/ik_llama.cpp/pull/91
We need "else if" and not "if", otherwise the MOE and 70b condition takes precedence over the specified quant in the CLI.
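
The difference between the two forms can be shown with a small Python sketch. All names and the string labels are hypothetical; the real code is C++ and selects ggml quantization types, but the control-flow issue is the same.

```python
def pick_type_with_if(cli_type, is_moe_or_70b):
    new_type = "default"
    if cli_type is not None:
        new_type = cli_type
    if is_moe_or_70b:           # separate `if`: overrides the CLI choice
        new_type = "heuristic"
    return new_type


def pick_type_with_else_if(cli_type, is_moe_or_70b):
    new_type = "default"
    if cli_type is not None:
        new_type = cli_type
    elif is_moe_or_70b:         # only consulted when no CLI type was given
        new_type = "heuristic"
    return new_type
```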
2025-07-13 11:28:18 +02:00
Kawrakow
a6842ba601 Check if MMQ should be used before using it (#603)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-13 07:43:15 +02:00
saood06
02d675717e Support for dots.llm1 models (#573)
* Add llama.cpp changes for dots1 support

* Add python changes for dots1 support

* Fix to make it convert

* Remove V reshaping, remove BOS by default for dots1 and fix warmup to handle models without BOS

* Minor fix

* Remove commented lines
2025-07-10 02:37:36 -05:00
Kawrakow
4e2afbcd90 CUDA: Faster prompt processing for several quantization types (#595)
* cuda: slightly faster MMQ for iq3_k, iq3_k_r4

* cuda: slightly faster MMQ for iq4_k, iq4_k_r4

* cuda: slightly faster MMQ for iq4_ks_r4

* cuda: slightly faster MMQ for iq4_ks

* cuda: slightly faster MMQ for iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-10 09:27:28 +02:00
ubergarm
db49223e8c add hunyuan moe support for 561 (#565)
* add hunyuan moe

* Don't reshape Vcur

* Apply chat template fix from mainline PR14584
2025-07-09 10:29:40 +02:00
Kawrakow
6a56d5075d Faster prompt processing for IQ2_KS, IQ2_K, IQ2_K_R4 (#593)
* cuda: faster MMQ for iq2_ks, iq2_k, iq2_k_r4

* Lookup is still better for MMQ if we get 4 values at once

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-08 19:44:48 +02:00
Kawrakow
6970ef925f CUDA: small PP performance improvement for MoE models (#589)
* Trying to implement quantized fmoe - not working yet

* This works, but is slower than the non-working version

* quantize_mmq_q8_1_id

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-07 07:23:12 +02:00
Fizz~
27ff5bf57e Special handling of Seed Coder FIM tokens (#585)
* Special handling of Seed Coder FIM tokens

* vocab: Add Seed Coder pretokenizer

* Formatting fix

* Update llama.h
2025-07-06 12:13:55 +02:00
firecoperana
49d4d2630a Fix server crash when there is no DRY sampler (#588)
Co-authored-by: firecoperana <firecoperana>
2025-07-06 07:51:36 +02:00
Kawrakow
2fddc45a02 Vulkan: flash attention for DeepSeek models (#584)
* vulkan: support mixed/deepseekR1 FA head sizes (#14509)

* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes

* Fix the FA cherry-pick

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-05 15:14:12 +02:00
Kawrakow
b8784686e1 Adding forgotten file (#583)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-04 08:39:04 +02:00
Kawrakow
28e81fc761 Vulkan: adding GGML_OP_MULTI_ADD implementation (#582)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-04 08:33:43 +02:00
Kawrakow
93b7724bbb Vulkan: Disable multi-add for now (#581)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 18:31:48 +02:00
Kawrakow
8d4f0a61db Vulkan: add GGML_OP_FUSED_MUL_UNARY (#580)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 18:03:23 +02:00
Kawrakow
b445c83eb9 Vulkan: fused rms norm (#577)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 15:36:52 +02:00