mirror of
https://github.com/ikawrakow/ik_llama.cpp.git
synced 2026-04-28 18:32:04 +00:00
MXFP4 (#682)
* mxfp4: basics
* mxfp4: Zen4 GEMM
* mxfp4: repacked GEMM (AVX2/Zen4)
* mxfp4: AVX2 GEMM
* mxfp4: NEON GEMM
* mxfp4: repacked GEMM (NEON)
* mxfp4: Metal
* Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA
* Fix MMQ when running with quantized K cache without FA
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2
* Implement basic tool choice
* Backport llama.cpp tool calls support
* Enhance function calls with improved chat parser and string utilities
  - Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
  - Improve function calls parsing with fallback to llama.cpp builder pattern
  - Add string utility functions (starts_with, ends_with, find_partial_stop)
  - Update README with function calls testing instructions
  - Enhance Kimi K2 parser and function calls documentation
  - Add comprehensive test suite for function calls
  - Update CMakeLists.txt and Makefile for new components
* Enhance function calling with unified streaming and parser improvements
  - Fix streaming content cleanup to prevent function syntax in output
  - Unify content extraction patterns with llama.cpp approach
  - Improve Kimi K2 parser robustness and partial content handling
  - Add comprehensive test coverage for function call scenarios
  - Optimize chat message parsing and diff computation
* Replace hardcoded values in kimi_k2_parser.hpp with named constants
  - Add compile-time constants for all token format markers
  - Add compile-time constants for XML format markers
  - Add compile-time constants for simple format patterns
  - Replace all hardcoded string literals with named constants
  - Use compile-time length calculation to avoid manual counting
  - Improve maintainability and reduce magic numbers throughout parser
* Fix duplicate common_chat_parse definition
  - Remove duplicate implementation from chat-parser.cpp
  - Keep single implementation in chat.cpp following llama.cpp patterns
  - Resolves linker error: multiple definition of common_chat_parse
* Fix JSON assertion failure in function call parsing
  - Add proper validation that 'function' field is an object before accessing nested keys
  - Handle missing 'arguments' field gracefully with default "{}"
  - Prevents crash when parsing malformed tool call JSON structures
* Add comprehensive Qwen3 XML tool calling support with unit tests
  - Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
  - Add model detection and routing for Qwen3 vs Kimi-K2 formats
  - Create 8 comprehensive unit tests covering parsing, streaming, and error handling
  - Fix token format cleaning bug in kimi_k2_parser.hpp processing order
  - Remove progressive parsing code and related utilities
  - Add tool injection support for Qwen3 format in server utils
* Add DeepSeek R1 function calling support with comprehensive unit tests
  - Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
  - Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
  - Update function_calls.hpp with DeepSeek R1 integration and content extraction
  - Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
  - Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
  - Port exact implementation patterns from original llama.cpp for compatibility
  Key features:
  - Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
  - Reasoning content extraction from <think>...</think> tags
  - Multiple tool calls support with separate call blocks
  - Model detection for deepseek-r1 and deepseek_r1 naming patterns
  - Integration with incremental parsing and streaming support
* Add partial parsing support for JSON and regex
  - json-partial.h/cpp: JSON partial parsing functionality
  - regex-partial.h/cpp: regex partial parsing functionality
* Add format_chat integration tests for Qwen3 tool injection
  - Add test_qwen3_format_chat_integration() to validate the tool injection pipeline
  - Test tool injection conditions and system message enhancement
  - Verify JSON formatting and anti-preamble instructions
  - Add comprehensive test documentation
  Tests confirm tool injection works correctly; the conversational preamble issue is not in ik_llama.cpp but likely in UI configuration.
* Fix Qwen3 tool call parsing - pass model name to parser
  The server was not passing the model name to parse_chat_message_incremental(), causing Qwen3 to fall back to the Kimi-K2 parser and return tool calls as content instead of a proper tool_calls array.
* Fix non-streaming path to use model-specific parsing
  Non-streaming responses were hardcoded to the Kimi-K2 format, causing Qwen3 XML tool calls to be returned as content instead of a proper tool_calls array. Now uses the same model detection as the streaming path for consistency.
* Update Qwen3 function call handling in server and tests
  - Enhanced server function call detection and response formatting
  - Improved test coverage for Qwen3 tool call scenarios
  - Refined XML parsing for better tool execution support
* Add DeepSeek-R1 function call parsing support
  Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
  - Format 1: standard function call syntax (already supported)
  - Format 2: alternative function call patterns (already supported)
  - Format 3: tools array format - function\n```json\n{"tools": [...]}
  - Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>
  Key changes:
  - Added parse_deepseek_r1_tools_array() following the original parse_prefixed_json_tool_call_array pattern
  - Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
  - Integrated both parsers into the exception handling chain for robust fallback
  - Added comprehensive TDD test coverage for all formats
  - Anonymized all confidential information while preserving functionality
  Resolves the tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls but the server failed to parse them correctly.
* Update function_calls.md documentation for DeepSeek-R1 Format 4
  - Added Format 4 (XML wrapped) documentation with examples
  - Updated implementation notes with the correct parser order (3→4→1→2)
  - Marked all DeepSeek-R1 formats as working (July 2025 update)
  - Updated test status for Formats 3 and 4 as passing
  - Added parse_deepseek_r1_xml_wrapped() function reference
  - Corrected implementation file line numbers
* Fix merge conflict in test-function-calls.cpp
  - Removed incomplete merge conflict marker from line 3027
  - Ensured all tests compile and pass successfully
  - All DeepSeek-R1 formats (1-4) working correctly
  - All streaming and content cleaning tests passing
* Fix DeepSeek R1 parsing issue with responses wrapped in think tags
  Restore the missing consume_rest() call from the working PR #648 implementation. When a response contains no tool calls, content remaining after reasoning parsing must be preserved as displayable content. Fixes the issue where responses wrapped entirely in <think> tags resulted in empty content output.
* Implement proper reasoning handling following original llama.cpp patterns
  - Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
  - Update try_parse_reasoning to match the original llama.cpp logic exactly
  - Add TDD test case with reasoning_in_content=true for DeepSeek R1
  - Following TDD: the test should now pass with the proper syntax configuration
  Based on original llama.cpp implementation patterns.
* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue
  ✅ Test passes with the reasoning_in_content=true configuration
  - Content properly preserved: '<think>content</think>' displays fully
  - Reasoning field empty as expected
  - Following TDD: the test-first approach validates the fix
  Next: update the server to apply this configuration automatically.
* Complete server integration fix for DeepSeek R1 thinking tag termination
  - Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
  - Fixes the issue where responses wrapped in <think> tags appear empty to users
* Add TDD test case for DeepSeek R1 thinking tag termination issue
  - Test reproduces the exact failure scenario reported by a user
  - Validates that reasoning_in_content=true fixes the issue
  - Demonstrates the empty content problem and the working solution
* Add remaining TDD test changes for DeepSeek R1 thinking tag fix
* Add debug output after upstream merge
* Remove temporary benchmark and debug files
  - Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
  - Remove tests/reproduce_bug.sh (debugging script, not needed for the PR)
* Port cpu moe options from mainline (#672)
  - Port cpu moe options from mainline
  - Use strdup and int32_t to follow coding guidelines
* mxfp4: CUDA dequantize
* mxfp4: CUDA GEMV
* mxfp4: CUDA MMQ
* mxfp4: minor CUDA tweaks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
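The commit message above gives the Qwen3 tool call wire format as <tool_call>{"name": "func", "arguments": {...}}</tool_call>. For illustration only, here is a minimal Python sketch of extracting that format from a model response; the actual parser in this commit is C++ (function_calls.hpp / chat-parser.cpp), and the names TOOL_CALL_RE and extract_tool_calls are hypothetical.

```python
import json
import re

# Qwen3 tool call payloads: <tool_call>{"name": ..., "arguments": {...}}</tool_call>
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str):
    """Split a model response into displayable content and parsed tool calls."""
    calls = []
    for m in TOOL_CALL_RE.finditer(text):
        try:
            obj = json.loads(m.group(1))
        except json.JSONDecodeError:
            continue  # skip malformed payloads (the C++ parser falls back instead of crashing)
        # Guard missing fields instead of asserting (cf. the "missing 'arguments'" fix above)
        calls.append({"name": obj.get("name", ""),
                      "arguments": obj.get("arguments", {})})
    # Everything outside <tool_call> blocks remains displayable content
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls
```

A real implementation also has to handle streaming, where a <tool_call> block may arrive only partially; that is what the json-partial/regex-partial components listed above address.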
This commit is contained in:
@@ -3697,6 +3697,147 @@ void quantize_row_q8_K128(const float * x, void * vy, int64_t k) {
    iqk_quantize_row_q8_K128(x, vy, k);
}

// ============================== MXFP4

namespace {
inline int best_index_mxfp4(float d, const int8_t * values, float x) {
    float best = std::abs(x - d*values[0]);
    int index = 0;
    for (int j = 1; j < 16; ++j) {
        float diff = std::abs(x - d*values[j]);
        if (diff < best) { best = diff; index = j; }
    }
    return index;
}
static void quantize_row_mxfp4_impl(int n_per_row, const float * x, char * cy,
        [[maybe_unused]] float * weight,
        const int8_t * values,
        [[maybe_unused]] const float * quant_weights,
        [[maybe_unused]] const int ntry) {

    GGML_ASSERT(n_per_row % QK_MXFP4 == 0);
    GGML_UNUSED(quant_weights);

    block_mxfp4 * y = (block_mxfp4 *)cy;

    //int last_ibl = -1;
    //float sigma2 = 0;

    //const uint8_t e = (uint8_t) (floorf(log2f(amax)) - 2 + 127);
    // -> log2f(amax) ~ e - 125 -> amax = 2^(e - 125)
    //const float d = GGML_E8M0_TO_FP32_HALF(e);

    for (int ib = 0; ib < n_per_row/QK_MXFP4; ++ib) {
        memset(&y[ib], 0, sizeof(block_mxfp4));
        const float * xb = x + ib*QK_MXFP4;
        //if (int ibl = ib/(QK_K/QK_MXFP4); ibl != last_ibl) {
        //    int n = std::min(QK_K, n_per_row - ib*QK_MXFP4);
        //    float sumx2 = 0;
        //    for (int j = 0; j < n; ++j) sumx2 += xb[j]*xb[j];
        //    sigma2 = 2.0f*sumx2/n;
        //    last_ibl = ibl;
        //}
        //if (quant_weights) {
        //    const float * qw = quant_weights + ib*QK_MXFP4;
        //    for (int j = 0; j < QK_MXFP4; ++j) weight[j] = qw[j] * sqrtf(sigma2 + xb[j]*xb[j]);
        //} else {
        //    for (int j = 0; j < QK_MXFP4; ++j) weight[j] = xb[j]*xb[j];
        //}
        float amax = 0;
        for (int j = 0; j < QK_MXFP4; ++j) {
            float ax = fabsf(xb[j]);
            amax = std::max(amax, ax);
        }
        if (!amax) {
            continue;
        }
        const uint8_t e = (uint8_t) (floorf(log2f(amax)) - 2 + 127);
        const float d = GGML_E8M0_TO_FP32_HALF(e);
        y[ib].e = e;
        for (int j = 0; j < QK_MXFP4/2; ++j) {
            uint8_t v0 = best_index_mxfp4(d, values, xb[j]);
            uint8_t v1 = best_index_mxfp4(d, values, xb[j+QK_MXFP4/2]);
            y[ib].qs[j] = v0 | (v1 << 4);
        }
    }
}
}

void quantize_row_mxfp4_ref(const float * x, block_mxfp4 * y, int64_t k) {
    quantize_mxfp4(x, (void *)y, 1, k, nullptr);
}

void quantize_row_mxfp4(const float * x, void * y, int64_t k) {
    quantize_mxfp4(x, (void *)y, 1, k, nullptr);
}

size_t quantize_mxfp4(const float * src, void * dst, int64_t nrows, int64_t n_per_row, const float * imatrix) {
    constexpr int kBlockSize = QK_MXFP4;
    GGML_ASSERT(n_per_row%kBlockSize == 0);
    auto row_size = ggml_row_size(GGML_TYPE_MXFP4, n_per_row);
    char * qrow = (char *)dst;
    float weight[kBlockSize];
    for (int64_t row = 0; row < nrows; ++row) {
        quantize_row_mxfp4_impl(n_per_row, src, qrow, weight, kvalues_mxfp4, imatrix, 7);
        src += n_per_row;
        qrow += row_size;
    }
    return nrows * row_size;
}

void dequantize_row_mxfp4(const block_mxfp4 * x, float * y, int64_t k) {
    constexpr int kBlockSize = QK_MXFP4;
    GGML_ASSERT(k%kBlockSize == 0);
    int nblock = k/kBlockSize;
    for (int ib = 0; ib < nblock; ++ib) {
        float d = GGML_E8M0_TO_FP32_HALF(x[ib].e);
        for (int j = 0; j < kBlockSize/2; ++j) {
            y[j             ] = d * kvalues_mxfp4[x[ib].qs[j] & 0xf];
            y[j+kBlockSize/2] = d * kvalues_mxfp4[x[ib].qs[j] >> 4];
        }
        y += kBlockSize;
    }
}

void vec_dot_mxfp4_q8_0_x4(int n, float * s, size_t bs, const void * vx, size_t bx, const void * vy, size_t by, int nrc) {
#if GGML_USE_IQK_MULMAT
    if (iqk_mul_mat(1, 1, n, GGML_TYPE_MXFP4, vx, 0, GGML_TYPE_Q8_K, vy, 0, s, 0, 0, 1)) {
        return;
    }
#endif
    GGML_ASSERT(n%QK_MXFP4 == 0);
    GGML_ASSERT(nrc == 1);
    GGML_UNUSED(bs);
    GGML_UNUSED(bx);
    GGML_UNUSED(by);
    //const block_mxfp4 * x = (const block_mxfp4 *)vx;
    //const block_q8_K * y = (const block_q8_K *)vy;
    //int nblock = n/QK_MXFP4;
    //float sumf = 0;
    //for (int ibl = 0; ibl < nblock; ++ibl) {
    //    //int sumi = 0;
    //    auto qy = y[ibl].qs;
    //    auto qx = x[ibl].qs;
    //    float db = d * y[ibl].d;
    //    for (int ib = 0; ib < QK_K/kBlockSize; ++ib) {
    //        float dl = db * ((x[ibl].scales[ib] & 254) - 127);
    //        //int ls = (x[ibl].scales[ib] & 254) - 127;
    //        const int8_t * values = iq4k_values + ((x[ibl].scales[ib] & 1) << 4);
    //        int suml = 0;
    //        for (int j = 0; j < kBlockSize/2; ++j) {
    //            suml += qy[j             ] * values[qx[j] & 0xf]
    //                  + qy[j+kBlockSize/2] * values[qx[j] >> 4];
    //        }
    //        sumf += dl * suml;
    //        //sumi += ls * suml;
    //        qy += kBlockSize;
    //        qx += kBlockSize/2;
    //    }
    //    //sumf += d * y[ibl].d * sumi;
    //}
    //*s = sumf;
}

namespace {
static void quantize_row_iq4_k_impl_bs128(const int super_block_size, const int block_size,
        int n_per_row, const float * x, char * cy,
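To make the shared-exponent arithmetic in quantize_row_mxfp4_impl and dequantize_row_mxfp4 easier to follow, here is a minimal Python sketch of the same scheme. The block size of 32 and the KVALUES table (the E2M1 value set scaled by 2, compensated by the halved scale d) are assumptions based on the MXFP4/OCP microscaling format; the authoritative table is kvalues_mxfp4 in the actual source.

```python
import math

# Assumed E2M1 value table scaled by 2, mirroring kvalues_mxfp4 in the diff
KVALUES = [0, 1, 2, 3, 4, 6, 8, 12, 0, -1, -2, -3, -4, -6, -8, -12]
QK_MXFP4 = 32  # assumed block size, per the MXFP4 spec

def e8m0_to_fp32_half(e: int) -> float:
    # GGML_E8M0_TO_FP32_HALF(e): 2^(e - 127) / 2
    return math.ldexp(1.0, e - 127 - 1)

def best_index_mxfp4(d: float, x: float) -> int:
    # Nearest representable value, as in best_index_mxfp4() in the diff
    return min(range(16), key=lambda j: abs(x - d * KVALUES[j]))

def quantize_block(xb):
    assert len(xb) == QK_MXFP4
    amax = max(abs(v) for v in xb)
    if amax == 0:
        return 0, [0] * (QK_MXFP4 // 2)          # all-zero block, as after memset()
    e = int(math.floor(math.log2(amax))) - 2 + 127  # shared E8M0 exponent
    d = e8m0_to_fp32_half(e)
    qs = []
    for j in range(QK_MXFP4 // 2):               # two 4-bit indices packed per byte
        v0 = best_index_mxfp4(d, xb[j])
        v1 = best_index_mxfp4(d, xb[j + QK_MXFP4 // 2])
        qs.append(v0 | (v1 << 4))
    return e, qs

def dequantize_block(e, qs):
    d = e8m0_to_fp32_half(e)
    lo = [d * KVALUES[b & 0xF] for b in qs]      # low nibbles: first half of block
    hi = [d * KVALUES[b >> 4] for b in qs]       # high nibbles: second half
    return lo + hi
```

Note how the exponent choice follows the commented derivation in the diff: with e = floor(log2(amax)) - 2 + 127, amax lands near 2^(e - 125), so the largest table entry (12, i.e. E2M1's 6 after the half scale) can still represent amax without clipping.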