ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-24 07:04:11 +00:00

Author	SHA1	Message	Date
Iwan Kawrakow	3bc7acf1bd	Add command line option This time the option is ON by default, and one needs to turn it off via -no-fug or --no-fused-up-gate	2025-08-30 12:09:55 +03:00
Iwan Kawrakow	df066ced5e	Seems to be working on CUDA For a dense model we get 2-3% speedup for PP and ~0.6% for TG.	2025-08-30 12:09:54 +03:00
Iwan Kawrakow	4360ff545e	WIP CUDA	2025-08-30 12:09:54 +03:00
Iwan Kawrakow	06aacc3167	Fused up+gate+unary for regular (not MoE) FFN - CPU	2025-08-30 12:09:54 +03:00
Kawrakow	f22a9ef95a	CUDA: prompt processing optimizations for MoE models (#739 ) * Skip the row id computation for the ffn_down op Sadly, almost negligible performance gain. * Also this doesn't do much * Also this barely moves the needle * This is slightly better --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-30 12:09:41 +03:00
Kawrakow	46968d4ab1	Sanitize imatrix (#735 ) * sanitize importance matrix: WIP * sanitize importance matrix: iq4_k * sanitize importance matrix: iq5_k, iq6_k * sanitize imatrix: iq4_ks * sanitize imatrix: iq4_kss * sanitize imatrix: iq2_ks and iq2_kl * sanitize imatrix: iq5_ks * sanitize imatrix: iq4_nl_r4 * sanitize imatrix: q4_0_r8 * sanitize imatrix: q6_0_r4 * sanitize imatrix: iq4_xs_r8 * sanitize imatrix: iq4_xs_r8 and q3_k_r4 with a template * sanitize imatrix: q2_k_r4, q4_k_r4, q5_k_r4, q6_k_r4 * sanitize imatrix: repacked i-quants * Minor * Add more checks for iq3_k, iq3_ks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-29 09:08:15 +03:00
Kawrakow	872ac10b02	Make yarn_log_multiplier optional (#738 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-28 14:09:59 +03:00
Kawrakow	dac5b48398	Check for NaNs while loading the model. (#727 ) * Check for NaNs while loading the model. * Also tell which experts have NaNs. * Add command line option to validate quants * Add checks for more quantization types * Add checks for more quantizagtion types --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-27 19:00:17 +03:00
Iwan Kawrakow	03ce6fdb0d	Fix typo	2025-08-27 14:43:44 +03:00
Kawrakow	966a6ce93c	Heuristics for mmq_id -> original threshold (#734 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-27 08:17:41 +03:00
Kawrakow	931f04af53	Sanitize importances for KT quantization (#720 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-27 08:04:15 +03:00
Kawrakow	683d6f1fc8	Fix avx2 GEMM mess (v2) (#724 ) * This fixes confusion around Q8_0 on AVX2 * This does it for iq4_nl, including FA * This does it for iq4_nl on Zen4, but FA does not work * Slightly more clear * Adding forgotten q8_0_r8 to num_rows() --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-27 08:03:47 +03:00
Kawrakow	0cc32ff0b1	CUDA: muh faster prompt processing for MoE models and small u-batch sizes (#728 ) * WIP: adding mainline mmq_id implementation * This seems to work * Now also -fmoe works * WIP * WIP * WIP * This works for mainline supported quants * mmq_id: add iq2_k, iq2_k_r4 * mmiq_id: don't assume row size is multiple of type size (per row scales) * mmiq_id: don't assume row size is multiple of type size * mmq_id: add iq2_ks So we are sure it works with per row scales * mmq_id: add iq2_kl * mmq_id: add iq3_ks * mmq_id: adding iq3_k, iq3_k_r4 * mmq_id: add iq4_kss, iq4_ks, iq4_ks_r4 * mmq_id: adding iq4_k, iq4_k_r4 * mmq_id: adding iq5_ks, iq5_ks_r4 * mmq_id: adding iq5_k, iq5_k_r4, q6_0 * mmq_id: adding iq6_k * mmq_id: add iq1_s_r4 * mmq_id: adding iq1_kt, iq2_kt * mmq_id: add iq3_kt, iq4_kt * Add CUDA fp8 header --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-26 13:30:35 +03:00
Mohan Krishnan	50f7119dfd	Fix undefined template std::basic_string<char> (#726 ) Getting this error when compiling on Mac with clang 17 Simple fix, add the string header in src/llama-impl.h Co-authored-by: Mohan Krishnan <mohan.krishnan@grab.com>	2025-08-25 11:34:01 +03:00
saood06	af13c9a292	Add mikupad to ik_llama as an alternative WebUI (#558 ) * mikupad.html in ik_llama.cpp (functional but WIP) * Remove hardcoded extension and add error handling to extension loading * Update version number and add features array to version * Make version endpoint always accessible * Fix case with empty sql * Add useful error message when launched without sql file * Add sigma sampler * Update sigma step and max based on docs * Remove selectedSessionId and handle it with URL fragment * Export All (code only, no UI) * Add compression to server.cpp * Major UI work (and also add update backend endpoints to accomadate) * Finalize UI * Fix visual bug * fix merge conflict issue * Pull in full sqlite_modern_cpp repo for the license as it is not attached to source files * Make compression not show in sidebar if extension is not loaded * Finalize build, Put support behing LLAMA_SERVER_SQLITE3: command not found build option, and update error message to include the build option is not passed situation * Fix compile without flag on systems without it installed	2025-08-24 08:27:29 -05:00
Kawrakow	e008c0e192	Log for debugging #721 (#722 ) * Log for debugging #721 * Remove the 16 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-23 15:24:34 +03:00
Kawrakow	e919e89d5a	Fix more Q8_0 repacking mess on AVX2 (#719 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-23 09:04:51 +03:00
Kawrakow	9351cc3416	Remove scary warning about incompatible model (#717 ) * Remove scary warning about incompatible model * Minor --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-22 18:42:01 +03:00
Kawrakow	dfa6e2b5fa	CUDA: faster IQ2_K, IQ2_KS, IQ2_K_R4 (#716 ) * Use bperm trick for iq2_ks gemm -> 7% gain * Use bperm trick for iq2_k gemm -> ~5% gain * Use bperm trick for iq2_k_r4 gemm -> ~3% gain * Use bperm trick for iq2_ks gemv -> ~7% gain * Use bperm trick for iq2_k gemv -> ~3% gain * Use bperm trick for iq2_k_r4 gemv -> ~7% gain --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-22 07:25:35 +03:00
Kawrakow	3b94f0a73e	AVX512+AVXVNNI GEMM implementation for quants using Q8_K for activations (#710 ) * q8_k_r16: basics * q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+ PP performance is about the same as using q8_k_r8 on the Ryzen-7950X, so we expect nice gains on Zen5, and we don't need to wory about using 2 different q8_k_r8 implementations for fancy SIMD. * q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+ * q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+ * q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+ * q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+ * q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+ * q8_k_r16: iq1_s and iq1_m now uses q8_k_r16 on Zen4+ * q8_k_r16: q2_K and q3_K now uses q8_k_r16 on Zen4+ * q8_k_r16: iq2_ks and iq2_k now uses q8_k_r16 on Zen4+ * q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+ * q8_k_r16: iq3_ks and iq3_k now uses q8_k_r16 on Zen4+ * q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+ * q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+ * Fix AVX2 * Just always set num_rows to 16 --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-22 06:27:07 +03:00
Kawrakow	c8e4d6648c	Does this fix #690 ? (#711 ) * Does this fix #690? * Another attempt --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-21 19:17:33 +03:00
Kawrakow	05cd6994c8	CUDA: faster IQ3_K, IQ3_KS, IQ3_K_R4 (#714 ) * Use bperm trick for iq3_ks - 5% PP performance gain * Use bperm trick for iq3_k -> 5% PP performance gain * Use bperm trick for iq3_k -> 8% PP performance gain * Use bperm trick for iq3_k_r4 gemv -> ~5% faster * Use bperm trick for iq3_k gemv -> ~3% faster * Use bperm trick for iq3_k gemv -> 4.5% gain --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-21 19:08:57 +03:00
Kawrakow	78de7736e8	CUDA: faster prompt processing for 4-bit quants (#713 ) * Use __byte_perm in get_int_from_table_16 * Use get_int_from_table_16 everywhere for 4-bit quants --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-21 15:57:35 +03:00
Kawrakow	0cb6696943	Disable "...is not marked as EOG" messages (#712 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-20 16:47:14 +03:00
Kawrakow	2572d16399	Fix q8_0 repacking issues on AVX2 (#708 ) Q8_0 needs Q0_0_X4, but Q8_0_R8 needs Q8_2_X4. So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8, iqk_moe_fused_mul_unary fails because the activations were prepared as Q0_0_X4, but we now need Q8_2_X4. For now a simple fix: just take the slow path, do not repack. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-19 19:49:58 +03:00
Kawrakow	ee85af6d26	Disable experimental code that causes issues with MSVC (#707 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-19 18:09:49 +03:00
usrlocalben	6aef3c9415	remove curious assertions (#705 ) This assertion can hit during prefill as MLA/KV tensors grow, e.g. Kimi K2 n_ctx >= 32768.	2025-08-19 14:41:29 +03:00
g2mt	23fe18ce83	Port universal assisted decoding to llama-server (#699 ) * port universal assisted decoding to server * fix calls * fix LOG_INFO * fix llama_detokenize call * use emplace_back	2025-08-18 09:22:23 +03:00
Kawrakow	a3a523009e	Revert "Better CPU prompt processing performance for SWA models (#696 )" (#701 ) This reverts commit `93a4f6089f`. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-17 15:44:02 +03:00
Kawrakow	7d14f8ea79	Fix GLM-4.5 attention (#700 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-17 14:31:03 +03:00
Kawrakow	93a4f6089f	Better CPU prompt processing performance for SWA models (#696 ) * This does the trick for PP * Compute mask bounds when creating the mask * Set mask bounds for all supported SWA models --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-17 10:30:27 +03:00
Pavel Dudkouski	4bf5c8184b	Added "usage" to server response (#695 ) - Fixed "timings" calculation - Always send "timings" in last message	2025-08-17 02:25:14 +00:00
g2mt	b837773728	Port speculative decoding from upstream to llama-server (#645 ) * server : integrate speculative decoding * server: Fix field names * server: fix include, whitespace * fix compile errors in speculative.cpp * add llama_sampling_sample_and_accept_n to sampling * finish porting speculative decoding in server * port functions from common/speculative, common/sampling * remove arg * fix function names * init params_dft to none * correct value for n_ctx * prefix kv cache tensors with model name to avoid conflict * fix call arguments * fix spec decoding args * correct slot.id * use n_max * port the rest of sampling funcs * fix func arguments * slot.id starts at 1? * Revert "prefix kv cache tensors with model name to avoid conflict" This reverts commit `fbd5dfd866`. * disable draft logging * disable logging in speculative.cpp in mainline, these would be LOG_DEBUG, but since ik_llama doesnt support it, logging is disabled entirely * add more draft model parameters * fix * pass flash_attn * add speculative params for parity * set speculative params in launch_slot_with_task instead	2025-08-16 07:26:44 +03:00
Kawrakow	4239d259a6	Quick hack to improve TG performance for SWA models (#692 ) Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-15 16:43:04 +03:00
Kawrakow	fc06bc9d27	Enable CUDA graphs for MoE models + GPT-OSS support (#689 ) * gmp-oss: common * gpt-oss: attnetion sinks, swiglu_oai * gpt-oss: WIP llama Model loads and runs (CPU only), but PPL is much to high (~1500 for 1st batch vs ~200 in mainline). Is it because of SWA, because of vocab, or did I introduce a bug somewhere? * gpt-oss: CPU seems to be working It was the SWA thta was missing in the previous commit. There are issues with EOG tokens, so this still needs to be added. * CUDA: ADD_ID Just a copy from mainline * gpt-oss: Seems to be working on CUDA * gpt-oss: add sinks to the attn-vec kernels * CUDA: add head size of 64 to new mma Haven't turned it on yet, but observe slightly better PP and slightly worse TG performance with that. * gpt-oss: add ability to use -fmoe (only CUDA for now) * Move row sums to the write place * Add sinks to iqk flash attention * gpt_oss: Implement -fmoe on the CPU * Simdify swiglu_oai Turning it off for now as performance becomes more variable, so perhaps I'm running into thermal trottling imore often because of making the CPU work too hard. * llama: factor out model loader * Builds successfully * It runs, but mmap does not work * Fix llama_mmap so mmap works * Minor * Fix CUDA after latest changes * Attempt to use CUDA graphs with MoE models - not working * CUDA graphs WIP - still not working * CUDA graphs - seems to be working Likely not all MLA variants are working. I no longer remember why I added the q8_0 cpy that transposes the tensor, but if really needed, this is now missing. Also missing is q6_0. * Make q8_0 cache work for DeepSeek models with CUDA graphs * cuda: cpy for q6_0 * Fix llama_mmap on non-Linux platforms * Adding forgotten file * Iterating on Windows build failures * cuda: re-add q8_0 -> q8_0 transpose so mla = 2 can be used with CUDA graphs and q8_0 cache. * Disable graphs without -fmoe * Minor * Turn graphs on by default --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-15 09:18:07 +03:00
saood06	e082df47f2	Gracefully fail the decode instead of crashing for kshift Deepseek error (#688 ) * Gracefuly fail the decode instead of crashing for kshift Deepseek error) * fix formatting * minor	2025-08-13 13:12:40 +03:00
firecoperana	d99cf7cb71	Fix completions endpoint (#684 ) Co-authored-by: firecoperana <firecoperana>	2025-08-11 09:43:20 +03:00
firecoperana	d60c8f4d3b	add jinja template support (#677 ) Co-authored-by: firecoperana <firecoperana>	2025-08-09 12:50:30 +00:00
Kawrakow	7117c23de4	MXFP4 (#682 ) * mxfp4: basics * mxfp4: Zen4 GEMM * mxfp4: repacked GEMM (AVX2/Zen4) * mxfp4: AVX2 GEMM * mxfp4: NEON GEMM * mxfp4: repacked GEMM (NEON) * mxfp4: Metal * Fix quantized K cache without FA (#680) * Prevent assert with quantized K cache and no FA * Fix MMQ when running with quantized K cache without FA --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> * Fix for Deepseek r1 parsing (#676) * Implement function calling / tools for ik_llama.cpp for Kimi K2 * Implement basic tool choice * Backport llama.cpp tool calls support * Enhance function calls with improved chat parser and string utilities - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling - Improve function calls parsing with fallback to llama.cpp builder pattern - Add string utility functions (starts_with, ends_with, find_partial_stop) - Update README with function calls testing instructions - Enhance Kimi K2 parser and function calls documentation - Add comprehensive test suite for function calls - Update CMakeLists.txt and Makefile for new components * Enhance function calling with unified streaming and parser improvements - Fix streaming content cleanup to prevent function syntax in output - Unify content extraction patterns with llama.cpp approach - Improve Kimi K2 parser robustness and partial content handling - Add comprehensive test coverage for function call scenarios - Optimize chat message parsing and diff computation * Replace hardcoded values in kimi_k2_parser.hpp with named constants - Add compile-time constants for all token format markers - Add compile-time constants for XML format markers - Add compile-time constants for simple format patterns - Replace all hardcoded string literals with named constants - Use compile-time length calculation to avoid manual counting - Improve maintainability and reduce magic numbers throughout parser * Fix duplicate common_chat_parse definition - Remove duplicate implementation from chat-parser.cpp - Keep single implementation in chat.cpp following llama.cpp patterns - Resolves linker error: multiple definition of common_chat_parse * Fix JSON assertion failure in function call parsing - Add proper validation that 'function' field is an object before accessing nested keys - Handle missing 'arguments' field gracefully with default "{}" - Prevents crash when parsing malformed tool call JSON structures * Add comprehensive Qwen3 XML tool calling support with unit tests - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format - Add model detection and routing for Qwen3 vs Kimi-K2 formats - Create 8 comprehensive unit tests covering parsing, streaming, error handling - Fix token format cleaning bug in kimi_k2_parser.hpp processing order - Remove progressive parsing code and related utilities - Add tool injection support for Qwen3 format in server utils * Add DeepSeek R1 function calling support with comprehensive unit tests - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp - Update function_calls.hpp with DeepSeek R1 integration and content extraction - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration - Port exact implementation patterns from original llama.cpp for compatibility Key features: - Native DeepSeek R1 format: <｜tool▁calls▁begin｜>function<｜tool▁sep｜>name```json{}```<｜tool▁call▁end｜><｜tool▁calls▁end｜> - Reasoning content extraction from <think>...</think> tags - Multiple tool calls support with separate call blocks - Model detection for deepseek-r1, deepseek_r1 naming patterns - Integration with incremental parsing and streaming support * Add partial parsing support for JSON and regex - json-partial.h/cpp: JSON partial parsing functionality - regex-partial.h/cpp: Regex partial parsing functionality * Add format_chat integration tests for Qwen3 tool injection - Add test_qwen3_format_chat_integration() to validate tool injection pipeline - Test tool injection conditions and system message enhancement - Verify JSON formatting and anti-preamble instructions - Add comprehensive test documentation Tests confirm tool injection works correctly - conversational preamble issue is not in ik_llama.cpp but likely in UI configuration. * Fix Qwen3 tool call parsing - pass model name to parser Server was not passing model name to parse_chat_message_incremental(), causing Qwen3 to fall back to Kimi-K2 parser and return tool calls as content instead of proper tool_calls array. * Fix non-streaming path to use model-specific parsing Non-streaming responses were hardcoded to use Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of proper tool_calls array. Now uses same model detection as streaming path for consistency. * Update Qwen3 function call handling in server and tests - Enhanced server function call detection and response formatting - Improved test coverage for Qwen3 tool call scenarios - Refined XML parsing for better tool execution support * Add DeepSeek-R1 function call parsing support Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats: - Format 1: Standard function call syntax (already supported) - Format 2: Alternative function call patterns (already supported) - Format 3: Tools array format - function\n```json\n{"tools": [...]} - Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call> Key changes: - Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern - Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns - Integrated both parsers into exception handling chain for robust fallback - Added comprehensive TDD test coverage for all formats - Anonymized all confidential information while preserving functionality Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls but server failed to parse them correctly. * Update function_calls.md documentation for DeepSeek-R1 Format 4 - Added Format 4 (XML wrapped) documentation with examples - Updated implementation notes with correct parser order (3→4→1→2) - Marked all DeepSeek-R1 formats as working (July 2025 update) - Updated test status for Format 3 and 4 as passing - Added parse_deepseek_r1_xml_wrapped() function reference - Corrected implementation file line numbers * Fix merge conflict in test-function-calls.cpp - Removed incomplete merge conflict marker from line 3027 - Ensured all tests compile and pass successfully - All DeepSeek-R1 formats (1-4) working correctly - All streaming and content cleaning tests passing * Fix DeepSeek R1 parsing issue with responses wrapped in think tags Restore missing consume_rest() call from working PR #648 implementation. When responses don't contain tool calls, remaining content after reasoning parsing must be preserved as displayable content. Fixes issue where entire responses wrapped in <think> tags resulted in empty content output. * Implement proper reasoning handling following original llama.cpp patterns - Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax - Update try_parse_reasoning to match original llama.cpp logic exactly - Add TDD test case with reasoning_in_content=true for DeepSeek R1 - Following TDD: test should now pass with proper syntax configuration Based on original llama.cpp implementation patterns. * TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue ✅ Test passes with reasoning_in_content=true configuration - Content properly preserved: '<think>content</think>' displays fully - Reasoning field empty as expected - Following TDD: test-first approach validates the fix Next: Update server to automatically apply this configuration. * Complete server integration fix for DeepSeek R1 thinking tag termination - Server now automatically sets reasoning_in_content=true for DeepSeek R1 models - Fixes issue where responses wrapped in <think> tags appear empty to users * Add TDD test case for DeepSeek R1 thinking tag termination issue - Test reproduces the exact failure scenario reported by user - Validates that reasoning_in_content=true fixes the issue - Demonstrates empty content problem and working solution * Add remaining TDD test changes for DeepSeek R1 thinking tag fix * Add debug output after upstream merge * Remove temporary benchmark and debug files - Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality) - Remove tests/reproduce_bug.sh (debugging script, not needed for PR) * Port cpu moe options from mainline (#672) * Port cpu moe options from mainline * Use strdup and int32_t to follow coding guidelines * maxfp4: CUDA dequantize * mxfp4: CUDA GEMV * mxfp4: CUDA MMQ * mxfp4: minor CUDA tweaks --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com> Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>	2025-08-09 08:40:18 +03:00
Parsa	293f4aa433	Port cpu moe options from mainline (#672 ) * Port cpu moe options from mainline * Use strdup and int32_t to follow coding guidelines	2025-08-08 14:38:18 +03:00
Anton Sokolchenko	fa7a0f340e	Fix for Deepseek r1 parsing (#676 ) * Implement function calling / tools for ik_llama.cpp for Kimi K2 * Implement basic tool choice * Backport llama.cpp tool calls support * Enhance function calls with improved chat parser and string utilities - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling - Improve function calls parsing with fallback to llama.cpp builder pattern - Add string utility functions (starts_with, ends_with, find_partial_stop) - Update README with function calls testing instructions - Enhance Kimi K2 parser and function calls documentation - Add comprehensive test suite for function calls - Update CMakeLists.txt and Makefile for new components * Enhance function calling with unified streaming and parser improvements - Fix streaming content cleanup to prevent function syntax in output - Unify content extraction patterns with llama.cpp approach - Improve Kimi K2 parser robustness and partial content handling - Add comprehensive test coverage for function call scenarios - Optimize chat message parsing and diff computation * Replace hardcoded values in kimi_k2_parser.hpp with named constants - Add compile-time constants for all token format markers - Add compile-time constants for XML format markers - Add compile-time constants for simple format patterns - Replace all hardcoded string literals with named constants - Use compile-time length calculation to avoid manual counting - Improve maintainability and reduce magic numbers throughout parser * Fix duplicate common_chat_parse definition - Remove duplicate implementation from chat-parser.cpp - Keep single implementation in chat.cpp following llama.cpp patterns - Resolves linker error: multiple definition of common_chat_parse * Fix JSON assertion failure in function call parsing - Add proper validation that 'function' field is an object before accessing nested keys - Handle missing 'arguments' field gracefully with default "{}" - Prevents crash when parsing malformed tool call JSON structures * Add comprehensive Qwen3 XML tool calling support with unit tests - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format - Add model detection and routing for Qwen3 vs Kimi-K2 formats - Create 8 comprehensive unit tests covering parsing, streaming, error handling - Fix token format cleaning bug in kimi_k2_parser.hpp processing order - Remove progressive parsing code and related utilities - Add tool injection support for Qwen3 format in server utils * Add DeepSeek R1 function calling support with comprehensive unit tests - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp - Update function_calls.hpp with DeepSeek R1 integration and content extraction - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration - Port exact implementation patterns from original llama.cpp for compatibility Key features: - Native DeepSeek R1 format: <｜tool▁calls▁begin｜>function<｜tool▁sep｜>name```json{}```<｜tool▁call▁end｜><｜tool▁calls▁end｜> - Reasoning content extraction from <think>...</think> tags - Multiple tool calls support with separate call blocks - Model detection for deepseek-r1, deepseek_r1 naming patterns - Integration with incremental parsing and streaming support * Add partial parsing support for JSON and regex - json-partial.h/cpp: JSON partial parsing functionality - regex-partial.h/cpp: Regex partial parsing functionality * Add format_chat integration tests for Qwen3 tool injection - Add test_qwen3_format_chat_integration() to validate tool injection pipeline - Test tool injection conditions and system message enhancement - Verify JSON formatting and anti-preamble instructions - Add comprehensive test documentation Tests confirm tool injection works correctly - conversational preamble issue is not in ik_llama.cpp but likely in UI configuration. * Fix Qwen3 tool call parsing - pass model name to parser Server was not passing model name to parse_chat_message_incremental(), causing Qwen3 to fall back to Kimi-K2 parser and return tool calls as content instead of proper tool_calls array. * Fix non-streaming path to use model-specific parsing Non-streaming responses were hardcoded to use Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of proper tool_calls array. Now uses same model detection as streaming path for consistency. * Update Qwen3 function call handling in server and tests - Enhanced server function call detection and response formatting - Improved test coverage for Qwen3 tool call scenarios - Refined XML parsing for better tool execution support * Add DeepSeek-R1 function call parsing support Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats: - Format 1: Standard function call syntax (already supported) - Format 2: Alternative function call patterns (already supported) - Format 3: Tools array format - function\n```json\n{"tools": [...]} - Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call> Key changes: - Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern - Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns - Integrated both parsers into exception handling chain for robust fallback - Added comprehensive TDD test coverage for all formats - Anonymized all confidential information while preserving functionality Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls but server failed to parse them correctly. * Update function_calls.md documentation for DeepSeek-R1 Format 4 - Added Format 4 (XML wrapped) documentation with examples - Updated implementation notes with correct parser order (3→4→1→2) - Marked all DeepSeek-R1 formats as working (July 2025 update) - Updated test status for Format 3 and 4 as passing - Added parse_deepseek_r1_xml_wrapped() function reference - Corrected implementation file line numbers * Fix merge conflict in test-function-calls.cpp - Removed incomplete merge conflict marker from line 3027 - Ensured all tests compile and pass successfully - All DeepSeek-R1 formats (1-4) working correctly - All streaming and content cleaning tests passing * Fix DeepSeek R1 parsing issue with responses wrapped in think tags Restore missing consume_rest() call from working PR #648 implementation. When responses don't contain tool calls, remaining content after reasoning parsing must be preserved as displayable content. Fixes issue where entire responses wrapped in <think> tags resulted in empty content output. * Implement proper reasoning handling following original llama.cpp patterns - Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax - Update try_parse_reasoning to match original llama.cpp logic exactly - Add TDD test case with reasoning_in_content=true for DeepSeek R1 - Following TDD: test should now pass with proper syntax configuration Based on original llama.cpp implementation patterns. * TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue ✅ Test passes with reasoning_in_content=true configuration - Content properly preserved: '<think>content</think>' displays fully - Reasoning field empty as expected - Following TDD: test-first approach validates the fix Next: Update server to automatically apply this configuration. * Complete server integration fix for DeepSeek R1 thinking tag termination - Server now automatically sets reasoning_in_content=true for DeepSeek R1 models - Fixes issue where responses wrapped in <think> tags appear empty to users * Add TDD test case for DeepSeek R1 thinking tag termination issue - Test reproduces the exact failure scenario reported by user - Validates that reasoning_in_content=true fixes the issue - Demonstrates empty content problem and working solution * Add remaining TDD test changes for DeepSeek R1 thinking tag fix * Add debug output after upstream merge * Remove temporary benchmark and debug files - Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality) - Remove tests/reproduce_bug.sh (debugging script, not needed for PR)	2025-08-08 13:56:44 +03:00
Kawrakow	41f0d2e5de	Fix quantized K cache without FA (#680 ) * Prevent assert with quantized K cache and no FA * Fix MMQ when running with quantized K cache without FA --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-08 13:51:14 +03:00
Kawrakow	58f3bda0ae	Vulkan: add cmake options to build without coopmat(2) support (#674 ) So I can test KHR coopmat and no coopmat. Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2025-08-07 17:26:21 +03:00
Anton Sokolchenko	dee40cffb6	Fix Qwen3 content extraction breaking code formatting (#661 ) Problem: - qwen3::extract_content_during_parsing() used aggressive regex to collapse multiple newlines - This broke proper code formatting (e.g., PEP 8's 2 empty lines between functions) - Affected non-tool-call streaming output where formatting is critical Solution: - Replace aggressive std::regex_replace(R"(\n\s*\n)", "\n") with gentle string_strip() - Follow original llama.cpp patterns: only trim leading/trailing whitespace - Preserve internal formatting including multiple newlines - Add proper include for common.h to access string_strip function Changes: - examples/server/parsers/qwen3_parser.hpp: Replace whitespace cleanup with string_strip() - tests/test-function-calls.cpp: Add test_qwen3_whitespace_preservation() to prevent regression Testing: - ✅ PEP 8 compliance: 2 empty lines between functions preserved - ✅ Tool call parsing: All Qwen3 tests continue to pass - ✅ No regressions: Existing functionality maintained - ✅ Follows original llama.cpp whitespace handling patterns	2025-08-07 08:22:01 +03:00
Anton Sokolchenko	e484944bc0	Deepseek R1 function calls (more formats) (#652 ) * Implement function calling / tools for ik_llama.cpp for Kimi K2 * Implement basic tool choice * Backport llama.cpp tool calls support * Enhance function calls with improved chat parser and string utilities - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling - Improve function calls parsing with fallback to llama.cpp builder pattern - Add string utility functions (starts_with, ends_with, find_partial_stop) - Update README with function calls testing instructions - Enhance Kimi K2 parser and function calls documentation - Add comprehensive test suite for function calls - Update CMakeLists.txt and Makefile for new components * Enhance function calling with unified streaming and parser improvements - Fix streaming content cleanup to prevent function syntax in output - Unify content extraction patterns with llama.cpp approach - Improve Kimi K2 parser robustness and partial content handling - Add comprehensive test coverage for function call scenarios - Optimize chat message parsing and diff computation * Replace hardcoded values in kimi_k2_parser.hpp with named constants - Add compile-time constants for all token format markers - Add compile-time constants for XML format markers - Add compile-time constants for simple format patterns - Replace all hardcoded string literals with named constants - Use compile-time length calculation to avoid manual counting - Improve maintainability and reduce magic numbers throughout parser * Fix duplicate common_chat_parse definition - Remove duplicate implementation from chat-parser.cpp - Keep single implementation in chat.cpp following llama.cpp patterns - Resolves linker error: multiple definition of common_chat_parse * Fix JSON assertion failure in function call parsing - Add proper validation that 'function' field is an object before accessing nested keys - Handle missing 'arguments' field gracefully with default "{}" - Prevents crash when parsing malformed tool call JSON structures * Add comprehensive Qwen3 XML tool calling support with unit tests - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format - Add model detection and routing for Qwen3 vs Kimi-K2 formats - Create 8 comprehensive unit tests covering parsing, streaming, error handling - Fix token format cleaning bug in kimi_k2_parser.hpp processing order - Remove progressive parsing code and related utilities - Add tool injection support for Qwen3 format in server utils * Add DeepSeek R1 function calling support with comprehensive unit tests - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp - Update function_calls.hpp with DeepSeek R1 integration and content extraction - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration - Port exact implementation patterns from original llama.cpp for compatibility Key features: - Native DeepSeek R1 format: <｜tool▁calls▁begin｜>function<｜tool▁sep｜>name```json{}```<｜tool▁call▁end｜><｜tool▁calls▁end｜> - Reasoning content extraction from <think>...</think> tags - Multiple tool calls support with separate call blocks - Model detection for deepseek-r1, deepseek_r1 naming patterns - Integration with incremental parsing and streaming support * Add partial parsing support for JSON and regex - json-partial.h/cpp: JSON partial parsing functionality - regex-partial.h/cpp: Regex partial parsing functionality * Add format_chat integration tests for Qwen3 tool injection - Add test_qwen3_format_chat_integration() to validate tool injection pipeline - Test tool injection conditions and system message enhancement - Verify JSON formatting and anti-preamble instructions - Add comprehensive test documentation Tests confirm tool injection works correctly - conversational preamble issue is not in ik_llama.cpp but likely in UI configuration. * Fix Qwen3 tool call parsing - pass model name to parser Server was not passing model name to parse_chat_message_incremental(), causing Qwen3 to fall back to Kimi-K2 parser and return tool calls as content instead of proper tool_calls array. * Fix non-streaming path to use model-specific parsing Non-streaming responses were hardcoded to use Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of proper tool_calls array. Now uses same model detection as streaming path for consistency. * Update Qwen3 function call handling in server and tests - Enhanced server function call detection and response formatting - Improved test coverage for Qwen3 tool call scenarios - Refined XML parsing for better tool execution support * Add DeepSeek-R1 function call parsing support Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats: - Format 1: Standard function call syntax (already supported) - Format 2: Alternative function call patterns (already supported) - Format 3: Tools array format - function\n```json\n{"tools": [...]} - Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call> Key changes: - Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern - Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns - Integrated both parsers into exception handling chain for robust fallback - Added comprehensive TDD test coverage for all formats - Anonymized all confidential information while preserving functionality Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls but server failed to parse them correctly. * Update function_calls.md documentation for DeepSeek-R1 Format 4 - Added Format 4 (XML wrapped) documentation with examples - Updated implementation notes with correct parser order (3→4→1→2) - Marked all DeepSeek-R1 formats as working (July 2025 update) - Updated test status for Format 3 and 4 as passing - Added parse_deepseek_r1_xml_wrapped() function reference - Corrected implementation file line numbers * Fix merge conflict in test-function-calls.cpp - Removed incomplete merge conflict marker from line 3027 - Ensured all tests compile and pass successfully - All DeepSeek-R1 formats (1-4) working correctly - All streaming and content cleaning tests passing	2025-08-07 08:15:57 +03:00
Thireus ☠	47c3dc798c	Add support for GLM-4.5 models (#668 ) * GLM-4.5 * GLM-4.5 * GLM-4.5 * convert_hf_to_gguf.py compatibility bugfix with GLM-4.5 From @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3145913701 * Add ubergarm comments + my own * Revert to llama.cpp script version that produced good BF16 See: https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3147374559 * Support for jinja chat templates See https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3148109962 * GLM-4.5 llama.cpp final port * Handle TENSOR_SKIP Ported the hanges from: `f129567dc0` `dcbbd2cb05` Except op info since ik_llama.cpp doesn't support this operation. * Bugfix for TENSOR_SKIP skip loading if a tensor has the TENSOR_SKIP flag - @ubergarm via https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155297198 * Update llama.cpp Restore original GGLM_ASSERT * Fix chat template detection Changes suggested by @ubergarm - https://github.com/ikawrakow/ik_llama.cpp/pull/668#issuecomment-3155927840 * Revert to original GGML_ASSERT	2025-08-07 07:55:00 +03:00
firecoperana	ae0ba31fd0	Merge pull request #648 from ikawrakow/fcp/missing_token_ps Fix missing token per second for webui after function call update	2025-07-26 21:13:52 -05:00
Anton Sokolchenko	d65c8cea5a	Fix text generation endpoint (#654 )	2025-07-26 19:36:48 -05:00
firecoperana	608ff76170	webui: move preset settings to top webui:bug fix	2025-07-25 18:03:01 -05:00
firecoperana	82d9bc03a9	bug fix no timings after tool update	2025-07-25 17:52:43 -05:00

1 2 3 4 5 ...

3872 Commits