Commit Graph

3826 Commits

Author SHA1 Message Date
firecoperana
ddceb0a55d Merge pull request #648 from ikawrakow/fcp/missing_token_ps
Fix missing tokens per second for webui after function call update
2025-07-26 21:13:52 -05:00
Anton Sokolchenko
33daaf7310 Fix text generation endpoint (#654) 2025-07-26 19:36:48 -05:00
firecoperana
f443040d49 webui: move preset settings to top
webui: bug fix
2025-07-25 18:03:01 -05:00
firecoperana
981259fb8b bug fix: no timings after tool update 2025-07-25 17:52:43 -05:00
Anton Sokolchenko
cfc8f5a61b Enable LLM function calls (#643) 2025-07-24 20:24:12 +02:00
Kawrakow
dffa0a95b3 IQ4_KSS improvements (#642)
* iq4_kss: slightly better quantization

* iq4_kss: CUDA MMQ

* iq4_kss: repack/convert to q8_k_r8 (AVX2)

* iq4_kss: repack/convert to q8_k_r8 (NEON)

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-23 20:50:57 +02:00
Kawrakow
0486b5ad93 Update README.md 2025-07-23 19:38:54 +02:00
Kawrakow
d78df741ce Update AUTHORS 2025-07-23 18:14:51 +02:00
Anton Sokolchenko
9ee72225dc Function calling support for Kimi-K2 (#628)
* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly; the conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
2025-07-23 18:11:42 +02:00
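The #628 entry above spells out the Qwen3 tool-call format (`<tool_call>{"name": "func", "arguments": {...}}</tool_call>`) and notes that a missing `arguments` field should fall back to `"{}"`. The sketch below illustrates that kind of extraction with nlohmann::json (vendored by llama.cpp as json.hpp); the function name and structure are hypothetical and are not the parser the PR adds.

```cpp
// Minimal sketch of pulling a Qwen3-style tool call out of generated text, assuming the
// <tool_call>{"name": "...", "arguments": {...}}</tool_call> format described above.
// Illustrative only: the names and structure are not the parser added by the PR.
#include <optional>
#include <string>
#include <nlohmann/json.hpp>

struct tool_call_sketch {
    std::string name;
    std::string arguments;   // JSON-encoded arguments object
};

static std::optional<tool_call_sketch> parse_qwen3_tool_call(const std::string & text) {
    static const std::string open  = "<tool_call>";
    static const std::string close = "</tool_call>";

    const size_t beg = text.find(open);
    if (beg == std::string::npos) return std::nullopt;
    const size_t end = text.find(close, beg + open.size());
    if (end == std::string::npos) return std::nullopt;   // incomplete while streaming: wait for more tokens

    const std::string body = text.substr(beg + open.size(), end - beg - open.size());
    const nlohmann::json j = nlohmann::json::parse(body, nullptr, /*allow_exceptions=*/false);
    if (j.is_discarded() || !j.is_object() || !j.contains("name")) return std::nullopt;

    tool_call_sketch call;
    call.name      = j["name"].get<std::string>();
    // Missing "arguments" falls back to "{}", mirroring the behaviour noted in the commit.
    call.arguments = j.value("arguments", nlohmann::json::object()).dump();
    return call;
}
```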
Thomas
eaa2510a28 Add GitHub data: filename sanitization (#640) 2025-07-23 13:31:53 +02:00
Kawrakow
3600d82e98 Fix pauses after a comma (#639)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-23 11:45:58 +02:00
Thomas
94aa54df76 Add GitHub data (#637) 2025-07-22 18:18:40 +02:00
Kawrakow
9513222ba5 Revert "Update README.md"
This reverts commit b48d71fec8.
t0002
2025-07-22 15:22:46 +03:00
Kawrakow
4ea000892d Add .mailmap 2025-07-22 14:53:50 +03:00
Kawrakow
c3cd543d77 Update README.md 2025-07-22 09:01:59 +02:00
firecoperana
18eeb48941 Webui: New Features for Conversations, Settings, and Chat Messages (#618)
* Webui: add Rename/Upload conversation in header and sidebar

webui: don't change modified date when renaming conversation

* webui: add a preset feature to the settings #14649

* webui: Add editing assistant messages #13522

Webui: keep the following message while editing an assistant response.

webui: change icon to edit message

* webui: DB import and export #14347

* webui: Wrap long numbers instead of infinite horizontal scroll (#14062)
fix sidebar being covered by main content #14082

---------

Co-authored-by: firecoperana <firecoperana>
2025-07-20 12:33:55 +02:00
Kawrakow
e1164e1fd8 Adding IQ1_KT - 1.75 bpw SOTA quants (#616)
* iq1_kt: basics

* iq1_kt: CUDA dequantize

Testing with Llama-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 fewer bpw for the same quality.

* iq1_kt: CUDA MMQ

* iq1_kt: CUDA MMVQ

* iq1_kt: AVX2 GEMM/GEMV

* iq1_kt: convert/repack to q8_0_r8 (AVX2)

* iq1_kt: slightly faster GEMV

18.6 t/s -> 19.4 t/s

* iq1_kt: NEON GEMM/GEMV

Pathetic as usual

* iq1_kt: slightly faster NEON - still pathetic

* iq1_kt: tiny bit better GEMV on NEON

* iq1_kt: convert/repack to q8_0_r8 (NEON)

* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON

* Adding forgotten file

* iq1_kt: add to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 10:05:23 +02:00
Kawrakow
d0bc1f8296 IQ1_M GEMM for ARM_NEON (#631)
* iq1_m GEMM on NEON

* Set repacking threshold

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 09:49:59 +02:00
Kawrakow
3da192ac33 Remove forgotten change 2025-07-18 20:11:57 +03:00
Kawrakow
712eb7b45c GEMM for iq1_m (#630)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-18 18:55:43 +02:00
Thireus ☠
cc51044e72 Add GGML_MAX_CONTEXTS definition in CMakeLists.txt (#622)
* Add GGML_MAX_CONTEXTS definition in CMakeLists.txt

If this entry is missing, GGML_MAX_CONTEXTS is ignored

* Update CMakeLists.txt

add_compile_definitions for GGML_MAX_CONTEXTS
2025-07-17 08:50:42 +02:00
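Entry #622 above points out that a CMake-level `GGML_MAX_CONTEXTS` value is silently ignored unless it is forwarded to the compiler via `add_compile_definitions`. A minimal sketch of the usual C-side default / override pattern, under the assumption that the project uses it (the value 64 is the historical limit mentioned in #611 below; the actual wiring in ik_llama.cpp may differ):

```cpp
// Sketch of a compile-time default that the build system can override.
// If CMake never adds -DGGML_MAX_CONTEXTS=<n> to the compile flags, this fallback
// silently wins, which is the situation #622 describes. The value 64 is the
// historical limit mentioned in #611; the real definition may differ.
#ifndef GGML_MAX_CONTEXTS
#define GGML_MAX_CONTEXTS 64
#endif
```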
Thireus ☠
eddeaac009 Bump Windows max open files from 512 to 2048 (#620)
* Bump windows max open files from 512 to 2048

https://learn.microsoft.com/en-us/cpp/c-runtime-library/reference/setmaxstdio?view=msvc-160

* Make _GGML_STDIO_TARGET dependent on GGML_MAX_CONTEXTS for Windows
2025-07-17 08:50:26 +02:00
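For reference, the CRT call behind #620 is `_setmaxstdio`, documented at the Microsoft link above. A hedged sketch of raising the limit; how the project derives the target from GGML_MAX_CONTEXTS (via _GGML_STDIO_TARGET) is not shown here.

```cpp
// Sketch of raising the MSVC CRT's simultaneously-open FILE* limit on Windows,
// per the _setmaxstdio documentation linked in the commit above.
#ifdef _WIN32
#include <cstdio>

static void raise_stdio_limit() {
    // The CRT default is 512 streams; 2048 matches the bump in #620.
    if (_setmaxstdio(2048) == -1) {
        std::fprintf(stderr, "warning: _setmaxstdio(2048) failed, keeping default limit\n");
    }
}
#endif
```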
ubergarm
5e357db589 Fixup kimi-k2 convert indentation (#617) 2025-07-16 15:24:20 +02:00
Thireus ☠
da38486de5 Bump GGML_MAX_CONTEXTS to allow loading more shards (#611)
* Bump GGML_MAX_CONTEXTS to allow loading more shards

This variable prevents more than 64 shards from being loaded, which is specifically relevant for large models such as DeepSeek R1.

* https://github.com/ikawrakow/ik_llama.cpp/pull/611#issuecomment-3072175559
2025-07-16 14:11:19 +02:00
ubergarm
d3ed217798 kimi-k2 convert script and chat template (#612)
* convert_hf_to_gguf for Kimi-K2-Instruct

Adapt mainline `PR14653` for the tokenizer while maintaining proper MLA
tensors. Tested with a workflow that uses DeepSeek's fp8_cast_bf16.py and
triton-cpu to upcast the fp8 safetensors to bf16 safetensors, then used
this convert_hf_to_gguf.

* Add Kimi-K2 chat template

moonshotai/Kimi-K2-Instruct

https://github.com/ikawrakow/ik_llama.cpp/pull/609#issuecomment-3071259454

* kimi-k2: add the assistant prefix to the template to get a response
2025-07-15 19:54:04 +02:00
Kawrakow
19c57dbe1d Vulkan: a fresh start (#608)
* It compiles

* Seems to be working with coopmat

* Vulkan needs f32 precision for flash attention

* Vulkan: fix u_batch > 4096/n_active_experts

for coopmat1. Without this fix we get an assert.
We get the same assert in mainline too.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-15 08:03:13 +02:00
Kawrakow
f375799f17 Adding IQ2_KL (#602)
* Experiments for 2.6875 bpw quants

At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.

* iq2_kl: basics

* iq2_kl: CUDA dequantize

* iq2_kl: small improvement in PPL

Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.

* iq2_kl: MMQ

Quite good: PP-512(L3-8B) = 8472 t/s.

* iq2_kl: MMVQ

We get PP-128(L3-8B) = 162 t/s, which means this is not quite as good as
it should be, as the (almost) same-bpw q2_K is at 170 t/s.

* iq2_kl: Zen4 GEMM/GEMV

Not particularly fast. I may need to think about rearranging the bits.

* iq2_kl: better Zen4

* iq2_kl: convert/repack to q8_k_r8 (AVX2)

* iq2_kl: AVX2 GEMM/GEMV

* iq2_kl: WIP NEON

The compiler started crashing!!!

* iq2_kl: NEON

Had to work around a compiler crash when using vzip2q_u8 by using
vqtbl2q_u8 instead.

* iq2_kl: convert/repack to q8_k_r8 (NEON)

* iq2_kl: Metal dequantize

* iq2_kl: Metal GEMV - pretty slow

* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)

* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)

* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 476 t/s up from 466 t/s.

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 492 t/s up from 476 t/s.

* Add iq2_kl to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-14 18:55:08 +02:00
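As a quick sanity check on the first bullet of #602: assuming q2_K's nominal 2.625 bpw, adding 1/16 bpw gives 2.625 + 0.0625 = 2.6875 bpw, which matches the IQ2_KL size quoted above.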
Aleksey Nikiforov
da8998c6c6 Ported kimi-k2 support from llama.cpp (#609)
Original patch by @gabriellarson:
https://github.com/ggml-org/llama.cpp/pull/14654

Co-authored-by: anikifoss <anikifoss>
2025-07-14 18:43:52 +02:00
Kawrakow
4f56069442 Add iq3_ks to constants.py (#606)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-13 19:14:26 +02:00
Nexes the Elder
e2cf466eaa Fix attn_v conditionality (#604)
To retain compatibility with https://github.com/ikawrakow/ik_llama.cpp/pull/91
we need "else if" and not "if"; otherwise the MoE and 70B condition takes precedence over the quant specified on the CLI.
2025-07-13 11:28:18 +02:00
Kawrakow
a6842ba601 Check if MMQ should be used before using it (#603)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-13 07:43:15 +02:00
saood06
02d675717e Support for dots.llm1 models (#573)
* Add llama.cpp changes for dots1 support

* Add python changes for dots1 support

* Fix to make it convert

* Remove V reshaping, remove BOS by default for dots1 and fix warmup to handle models without BOS

* Minor fix

* Remove commented lines
2025-07-10 02:37:36 -05:00
Kawrakow
4e2afbcd90 CUDA: Faster prompt processing for several quantization types (#595)
* cuda: slightly faster MMQ for iq3_k, iq3_k_r4

* cuda: slightly faster MMQ for iq4_k, iq4_k_r4

* cuda: slightly faster MMQ for iq4_ks_r4

* cuda: slightly faster MMQ for iq4_ks

* cuda: slightly faster MMQ for iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-10 09:27:28 +02:00
ubergarm
db49223e8c add hunyuan moe support for #561 (#565)
* add hunyuan moe

* Don't reshape Vcur

* Apply chat template fix from mainline PR14584
2025-07-09 10:29:40 +02:00
Kawrakow
6a56d5075d Faster prompt processing for IQ2_KS, IQ2_K, IQ2_K_R4 (#593)
* cuda: faster MMQ for iq2_ks, iq2_k, iq2_k_r4

* Lookup is still better for MMQ if we get 4 values at once

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-08 19:44:48 +02:00
Kawrakow
6970ef925f CUDA: small PP performance improvement for MoE models (#589)
* Trying to implement quantized fmoe - not working yet

* This works, but is slower than the non-working version

* quantize_mmq_q8_1_id

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-07 07:23:12 +02:00
Fizz~
27ff5bf57e Special handling of Seed Coder FIM tokens (#585)
* Special handling of Seed Coder FIM tokens

* vocab: Add Seed Coder pretokenizer

* Formatting fix

* Update llama.h
2025-07-06 12:13:55 +02:00
firecoperana
49d4d2630a Fix server crash when there is no DRY sampler (#588)
Co-authored-by: firecoperana <firecoperana>
2025-07-06 07:51:36 +02:00
Kawrakow
2fddc45a02 Vulkan: flash attention for DeepSeek models (#584)
* vulkan: support mixed/deepseekR1 FA head sizes (#14509)

* vulkan: better parameterize FA by head sizes

* vulkan: support mixed/deepseekR1 FA head sizes

* Fix the FA cherry-pick

---------

Co-authored-by: Jeff Bolz <jbolz@nvidia.com>
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-05 15:14:12 +02:00
Kawrakow
b8784686e1 Adding forgotten file (#583)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-04 08:39:04 +02:00
Kawrakow
28e81fc761 Vulkan: adding GGML_OP_MULTI_ADD implementation (#582)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-04 08:33:43 +02:00
Kawrakow
93b7724bbb Vulkan: Disable multi-add for now (#581)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 18:31:48 +02:00
Kawrakow
8d4f0a61db Vulkan: add GGML_OP_FUSED_MUL_UNARY (#580)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 18:03:23 +02:00
Kawrakow
b445c83eb9 Vulkan: fused rms norm (#577)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 15:36:52 +02:00
Kawrakow
1db6a073cb Do not crash when there is no DRY sampler (#578)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 15:26:52 +02:00
Kawrakow
2fb0b26a8f Fix debug build failure with RPC off (#579)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 15:26:28 +02:00
Kawrakow
c482d14b12 Change KQ mask padding to 64 (#574)
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-03 10:43:27 +02:00
Kawrakow
59e3f4ffe7 Fix CMakeLists (#571)
* Move Vulkan stuff inside if (GGML_VULKAN)

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-02 16:11:56 +02:00
Kawrakow
adc28f8852 Adding IQ3_KS quants (#566)
* iq3_ks: basics

* iq3_ks: CUDA dequantize

* iq3_ks: CUDA mmvq

* iq3_ks: mmq

* iq3_ks: faster mmq

* iq3_ks: Zen4

* iq3_ks: AVX2 convert to q8_k_r8

This gives us PP-512 = 360 t/s.

* iq3_ks: AVX2 GEMM/GEMV

* iq3_ks: NEON GEMM/GEMV

* iq3_ks: NEON convert to q8_k_r8

This gives us PP-512 = 164 t/s.

* iq3_ks: Metal dequantize

* iq3_ks: Metal gemv - pathetic performance

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-02 09:27:47 +02:00
Kawrakow
6215d9315c Minor CUDA PP speed improvement (#567)
* Slightly better q8_0_q8_1 kernel and iqk_ks tile loading

* Minor

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-02 09:11:33 +02:00