* Fixing Gigachat support
* Gigachat: CUDA FA (needs 192 x 192 for MLA = 3)
* Gigachat: CPU FA (needs 192 x 192 for MLA = 3)
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix q5_0_r4
The issue was in the tail part. As almost all models have tensor
rows that are a multiple of 128, that part was never triggered in testing.
But the gpt-oss models have an embedding size of 2880, so we end
up there and trigger the bug.
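The bug class here is easy to leave untested: a remainder loop only runs when the row size is not a multiple of the SIMD block. A minimal sketch (hypothetical names, not the actual kernel) of why 2880-wide rows (22*128 + 64) reach the tail path while 128-multiple rows never do:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: sum a row in blocks of 128 plus a scalar tail. Rows whose size
// is a multiple of 128 never execute the tail loop, so a bug there goes
// unnoticed until a model like gpt-oss (row = 2880 = 22*128 + 64) hits it.
float row_sum(const std::vector<float>& x) {
    const size_t n = x.size();
    float sum = 0.f;
    size_t i = 0;
    for (; i + 128 <= n; i += 128)      // "SIMD" main loop, blocks of 128
        for (size_t j = 0; j < 128; ++j) sum += x[i + j];
    for (; i < n; ++i) sum += x[i];     // tail: runs only when n % 128 != 0
    return sum;
}
```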
* Fix q6_0_r4
Same fix as q5_0_r4
* Fix q4_0_r8
* Fix q5_0_r4 and q6_0_r4 also on Zen4
* Fix q4_0_r8 also on Zen4
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
This commit enables IQK quantization operations on ARM-based systems,
specifically tested on NVIDIA DGX Spark with GB10 Grace Blackwell.
Changes:
- Enable IQK_IMPLEMENT macro for ARM NEON operations
- Add arm_neon.h header include for ARM SIMD intrinsics
- Fix compilation errors related to missing NEON types and functions
Build requirements for ARM:
cmake .. -DGGML_CUDA=ON \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod+fp16" \
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod+fp16"
Tested on:
- Platform: NVIDIA DGX Spark (aarch64)
- CPU: GB10 Grace Blackwell Superchip
- Memory: 128GB unified memory
Fixes build errors:
- 'float32x4_t' does not name a type
- 'vld1q_f32' was not declared in this scope
- 'v_expf' was not declared in this scope
- Missing FP16 NEON intrinsics
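The errors above are what you get when NEON intrinsics are referenced without the header or the right `-march`. A minimal sketch (not the project's actual code) of the usual guard pattern, with a scalar fallback so it compiles on any platform:

```cpp
#include <cassert>

#if defined(__ARM_NEON)
#include <arm_neon.h>   // provides float32x4_t, vld1q_f32, ...
#endif

// Sum 4 floats, using NEON when available; scalar otherwise.
float sum4(const float* x) {
#if defined(__ARM_NEON)
    float32x4_t v = vld1q_f32(x);   // load 4 floats into a NEON register
    return vaddvq_f32(v);           // horizontal add (ARMv8 A64)
#else
    return x[0] + x[1] + x[2] + x[3];
#endif
}
```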
* Adding fused mul+multi_add + CPU implementation
* fused mul+multi_add: CUDA
* fused mul+multi_add: command line argument to disable it
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fuse sigmoid+add+grouped_topk+get_rows (CPU)
* Fix CPU + CUDA
but CUDA is somehow not 100% correct as I get a slightly different
PPL (lower!)
* Minor
* Fuse sigmoid+add+topk+get_rows (CUDA)
* Fuse sigmoid+add+topk+get_rows (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CPU)
* Fuse topk+view+get_rows+reshape+softmax (CUDA)
* cpu: turn off the openai topk fusing for now
Something is not right and I don't see the bug.
On the CPU one doesn't gain much if anything, so not a big loss.
* Also fuse sum_rows and div
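The fused chain above mirrors the usual MoE router math: sigmoid the router logits, add a per-expert bias, take the top-k, then renormalize the selected weights (the fused sum_rows + div step). A scalar sketch under assumed names, not the actual fused kernel:

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <numeric>
#include <utility>
#include <vector>

// Hypothetical sketch of the fused routing op: sigmoid(logits) + bias ->
// top-k expert indices -> renormalize the selected weights so they sum
// to 1 (the sum_rows + div step folded into the same pass).
std::vector<std::pair<int,float>> route(const std::vector<float>& logits,
                                        const std::vector<float>& bias, int k) {
    const int n = (int)logits.size();
    std::vector<float> score(n);
    for (int i = 0; i < n; ++i)
        score[i] = 1.f/(1.f + std::exp(-logits[i])) + bias[i];
    std::vector<int> idx(n);
    std::iota(idx.begin(), idx.end(), 0);
    std::partial_sort(idx.begin(), idx.begin() + k, idx.end(),
                      [&](int a, int b){ return score[a] > score[b]; });
    float sum = 0.f;
    for (int i = 0; i < k; ++i) sum += score[idx[i]];
    std::vector<std::pair<int,float>> out;
    for (int i = 0; i < k; ++i) out.push_back({idx[i], score[idx[i]]/sum});
    return out;
}
```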
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Better argsort (CPU)
* Attempt at grouped topk
* This seems to do the trick for grouped experts routing
* Cleanup
* Trying to merge, something is not right
* Working merged grouped top_k (CPU)
* Add command line option to enable grouped expert routing
* Add grouped expert routing option to llama-bench
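Grouped expert routing (as in DeepSeek-style models) first picks the best scoring groups, then takes the top-k experts only from those groups. A minimal scalar sketch, assuming each group is scored by its best expert (the actual scoring rule may differ):

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical sketch of grouped top-k: scores for n experts split into
// n_groups equal groups; keep the top_groups best groups (scored here by
// their max expert), then take the top-k over the surviving experts only.
std::vector<int> grouped_topk(const std::vector<float>& s,
                              int n_groups, int top_groups, int k) {
    const int gsize = (int)s.size()/n_groups;
    std::vector<int> g(n_groups);
    for (int i = 0; i < n_groups; ++i) g[i] = i;
    auto gscore = [&](int gi) {
        return *std::max_element(s.begin()+gi*gsize, s.begin()+(gi+1)*gsize); };
    std::partial_sort(g.begin(), g.begin()+top_groups, g.end(),
                      [&](int a, int b){ return gscore(a) > gscore(b); });
    std::vector<int> cand;                 // experts from surviving groups
    for (int i = 0; i < top_groups; ++i)
        for (int j = 0; j < gsize; ++j) cand.push_back(g[i]*gsize + j);
    std::partial_sort(cand.begin(), cand.begin()+k, cand.end(),
                      [&](int a, int b){ return s[a] > s[b]; });
    cand.resize(k);
    return cand;
}
```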
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Avoid computing FA chunks where the mask is -infinity
* Avoid computing FA chunks where the mask is -infinity also for f16/bf16
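Skipping chunks whose mask is entirely -infinity is free accuracy-wise, since exp(-inf) = 0 contributes nothing to the attention sum. A scalar sketch of the check (hypothetical name, not the kernel's actual code):

```cpp
#include <cassert>
#include <limits>

// Hypothetical sketch: before processing a K/V chunk in flash attention,
// check whether the mask for that chunk is all -inf. exp(-inf) = 0, so
// such a chunk contributes nothing and can be skipped entirely.
bool chunk_is_masked_out(const float* mask, int chunk_size) {
    const float ninf = -std::numeric_limits<float>::infinity();
    for (int i = 0; i < chunk_size; ++i)
        if (mask[i] != ninf) return false;
    return true;
}
```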
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fused up+gate+unary for regular (not MoE) FFN - CPU
* WIP CUDA
* Seems to be working on CUDA
For a dense model we get 2-3% speedup for PP and ~0.6% for TG.
* Add command line option
This time the option is ON by default, and one needs to turn it
off via -no-fug or --no-fused-up-gate
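The fused op computes act(gate(x)) * up(x) in one pass instead of separate unary and mul graph nodes. A scalar sketch with SiLU as the unary, assuming the gate and up projections have already been applied:

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch of the fused FFN op: given the gate and up
// projection outputs, compute silu(gate) * up element-wise in one pass
// instead of separate unary and mul graph nodes.
std::vector<float> fused_up_gate_silu(const std::vector<float>& gate,
                                      const std::vector<float>& up) {
    std::vector<float> out(gate.size());
    for (size_t i = 0; i < gate.size(); ++i) {
        const float g = gate[i];
        out[i] = (g / (1.f + std::exp(-g))) * up[i];  // silu(g) * up
    }
    return out;
}
```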
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Check for NaNs while loading the model.
* Also tell which experts have NaNs.
* Add command line option to validate quants
* Add checks for more quantization types
* Add checks for more quantization types
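Validating quants at load time amounts to dequantizing each block and scanning for NaNs before the tensor is used. A minimal sketch of the scan on already-dequantized data (hypothetical naming):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: scan a dequantized tensor row for NaNs and report
// the first bad index (-1 if clean), as a load-time validation pass would.
int first_nan(const std::vector<float>& x) {
    for (size_t i = 0; i < x.size(); ++i)
        if (std::isnan(x[i])) return (int)i;
    return -1;
}
```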
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This fixes confusion around Q8_0 on AVX2
* This does it for iq4_nl, including FA
* This does it for iq4_nl on Zen4, but FA does not work
* Slightly more clear
* Adding forgotten q8_0_r8 to num_rows()
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q8_k_r16: basics
* q8_k_r16: iq4_xs now uses q8_k_r16 on Zen4+
PP performance is about the same as using q8_k_r8 on the Ryzen-7950X,
so we expect nice gains on Zen5, and we don't need to worry about
using 2 different q8_k_r8 implementations for fancy SIMD.
* q8_k_r16: iq2_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_xs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq2_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_xxs now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_s now uses q8_k_r16 on Zen4+
* q8_k_r16: iq1_s and iq1_m now use q8_k_r16 on Zen4+
* q8_k_r16: q2_K and q3_K now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_ks and iq2_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq2_kl now uses q8_k_r16 on Zen4+
* q8_k_r16: iq3_ks and iq3_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq4_kss, iq4_ks, and iq4_k now use q8_k_r16 on Zen4+
* q8_k_r16: iq5_ks, iq5_k, and iq6_k now use q8_k_r16 on Zen4+
* Fix AVX2
* Just always set num_rows to 16
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.
For now a simple fix: just take the slow path, do not repack.
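The guard can be read as: before repacking the weights, check that the activation packing the fused op already prepared matches what the repacked type consumes, and fall back to the slow path otherwise. A sketch with made-up enum names, purely to illustrate the mismatch:

```cpp
#include <cassert>

// Made-up enum sketching the mismatch: the Q8_0 GEMM consumes activations
// packed one way, while Q8_0_R8 needs a different packing. Repacking the
// weights after the activations were already quantized leaves the two out
// of sync.
enum class ActPacking { Q8_0_X4, Q8_2_X4 };

// Return true only when repacking is safe, i.e. the activations were
// prepared in the format the repacked type consumes; otherwise the
// caller takes the slow (non-repacked) path.
bool can_repack_to_r8(ActPacking prepared) {
    return prepared == ActPacking::Q8_2_X4;
}
```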
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models
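Computing mask bounds at creation time gives the attention kernel, per row, the first and last unmasked position, so it can skip everything outside that window; for SWA masks this range is narrow. A scalar sketch (hypothetical names):

```cpp
#include <cassert>
#include <limits>
#include <utility>
#include <vector>

// Hypothetical sketch: for one mask row, record the [first, last] range of
// finite (non -inf) entries while the mask is built, so the attention
// kernel only iterates over that range. Returns {-1, -1} if fully masked.
std::pair<int,int> mask_bounds(const std::vector<float>& row) {
    const float ninf = -std::numeric_limits<float>::infinity();
    int first = -1, last = -1;
    for (int i = 0; i < (int)row.size(); ++i)
        if (row[i] != ninf) { if (first < 0) first = i; last = i; }
    return {first, last};
}
```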
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so this still needs to be addressed.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* mxfp4: basics
* mxfp4: Zen4 GEMM
* mxfp4: repacked GEMM (AVX2/Zen4)
* mxfp4: AVX2 GEMM
* mxfp4: NEON GEMM
* mxfp4: repacked GEMM (NEON)
* mxfp4: Metal
* Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA
* Fix MMQ when running with quantized K cache without FA
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2
* Implement basic tool choice
* Backport llama.cpp tool calls support
* Enhance function calls with improved chat parser and string utilities
- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components
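Of the string utilities listed, find_partial_stop is the interesting one for streaming: it finds where a partial match of a stop string begins at the end of the generated text, so output can be held back until the match resolves. A sketch of one way it can work (not necessarily the exact implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Sketch: return the position in `text` where a (possibly partial) match
// of `stop` begins at the end of the text, or npos if none. A server
// streaming tokens holds back text from that position until the match
// either completes (stop generating) or fails (flush it).
size_t find_partial_stop(const std::string& text, const std::string& stop) {
    if (!text.empty() && !stop.empty()) {
        for (size_t len = std::min(text.size(), stop.size()); len > 0; --len) {
            // does the last `len` chars of text equal the first `len` of stop?
            if (text.compare(text.size() - len, len, stop, 0, len) == 0)
                return text.size() - len;
        }
    }
    return std::string::npos;
}
```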
* Enhance function calling with unified streaming and parser improvements
- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation
* Replace hardcoded values in kimi_k2_parser.hpp with named constants
- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser
* Fix duplicate common_chat_parse definition
- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse
* Fix JSON assertion failure in function call parsing
- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
* Add comprehensive Qwen3 XML tool calling support with unit tests
- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
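The Qwen3 format wraps a JSON object in <tool_call>...</tool_call> tags. A minimal extraction sketch; the real parser also validates the JSON and handles partial (streaming) input:

```cpp
#include <cassert>
#include <string>

// Minimal sketch of Qwen3-style tool-call extraction: pull out the JSON
// payload between <tool_call> and </tool_call>. Returns "" when no
// complete tool call is present.
std::string extract_tool_call(const std::string& text) {
    const std::string open = "<tool_call>", close = "</tool_call>";
    const size_t b = text.find(open);
    if (b == std::string::npos) return "";
    const size_t e = text.find(close, b + open.size());
    if (e == std::string::npos) return "";
    return text.substr(b + open.size(), e - b - open.size());
}
```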
* Add DeepSeek R1 function calling support with comprehensive unit tests
- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility
Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
* Add partial parsing support for JSON and regex
- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality
* Add format_chat integration tests for Qwen3 tool injection
- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation
Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.
* Fix Qwen3 tool call parsing - pass model name to parser
Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.
* Fix non-streaming path to use model-specific parsing
Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
* Update Qwen3 function call handling in server and tests
- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support
* Add DeepSeek-R1 function call parsing support
Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>
Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality
Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.
* Update function_calls.md documentation for DeepSeek-R1 Format 4
- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers
* Fix merge conflict in test-function-calls.cpp
- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing
* Fix DeepSeek R1 parsing issue with responses wrapped in think tags
Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.
Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.
* Implement proper reasoning handling following original llama.cpp patterns
- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration
Based on original llama.cpp implementation patterns.
* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue
✅ Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix
Next: Update server to automatically apply this configuration.
* Complete server integration fix for DeepSeek R1 thinking tag termination
- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users
* Add TDD test case for DeepSeek R1 thinking tag termination issue
- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution
* Add remaining TDD test changes for DeepSeek R1 thinking tag fix
* Add debug output after upstream merge
* Remove temporary benchmark and debug files
- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)
* Port cpu moe options from mainline (#672)
* Port cpu moe options from mainline
* Use strdup and int32_t to follow coding guidelines
* mxfp4: CUDA dequantize
* mxfp4: CUDA GEMV
* mxfp4: CUDA MMQ
* mxfp4: minor CUDA tweaks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
* iq1_kt: basics
* iq1_kt: CUDA dequantize
Testing with LlaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.
* iq1_kt: CUDA MMQ
* iq1_kt: CUDA MMVQ
* iq1_kt: AVX2 GEMM/GEMV
* iq1_kt: convert/repack to q8_0_r8 (AVX2)
* iq1_kt: slightly faster GEMV
18.6 t/s -> 19.4 t/s
* iq1_kt: NEON GEMM/GEMV
Pathetic as usual
* iq1_kt: slightly faster NEON - still pathetic
* iq1_kt: tiny bit better GEMV on NEON
* iq1_kt: convert/repack to q8_0_r8 (NEON)
* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON
* Adding forgotten file
* iq1_kt: add to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Experiments for 2.6875 bpw quants
At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
* iq2_kl: basics
* iq2_kl: CUDA dequantize
* iq2_kl: small improvement in PPL
Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.
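The neighbour-scale trick is cheap because round-tripping a block through a candidate scale is a small loop. A scalar sketch with a toy quant range (hypothetical details, not the iq2_kl code):

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical sketch: squared error of round-trip quantization of a
// block with a given scale, quants clamped to a toy [-7, 7] range.
static float block_err(const std::vector<float>& x, float scale) {
    if (scale == 0.f) return 1e30f;
    float err = 0.f;
    for (float v : x) {
        int q = (int)std::lround(v / scale);
        q = q < -7 ? -7 : (q > 7 ? 7 : q);
        float d = v - q * scale;
        err += d * d;
    }
    return err;
}

// Try the chosen integer scale id and its two neighbours; keep whichever
// minimizes the (R)MSE -- the improvement described above.
int best_scale_id(const std::vector<float>& x, int id0, float step) {
    int best = id0;
    float best_err = block_err(x, id0 * step);
    for (int id : {id0 - 1, id0 + 1}) {
        float e = block_err(x, id * step);
        if (e < best_err) { best_err = e; best = id; }
    }
    return best;
}
```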
* iq2_kl: MMQ
Quite good: PP-512(L3-8B) = 8472 t/s.
* iq2_kl: MMVQ
We get PP-128(L3-8B) = 162 t/s.
Which means that this is not quite as good as it should be, as
(almost) same-bpw q2_K is at 170 t/s.
* iq2_kl: Zen4 GEMM/GEMV
Not particularly fast. I may need to think about rearranging the bits.
* iq2_kl: better Zen4
* iq2_kl: convert/repack to q8_k_r8 (AVX2)
* iq2_kl: AVX2 GEMM/GEMV
* iq2_kl: WIP NEON
The compiler started crashing!!!
* iq2_kl: NEON
Had to work around a compiler crash when using vzip2q_u8
by using vqtbl2q_u8 instead.
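vzip2q_u8 interleaves the upper halves of two 16-byte vectors; the same permutation can be expressed as a table lookup over the concatenated 32 bytes with a constant index vector, which is what vqtbl2q_u8 provides. A portable scalar model of the equivalence (not the NEON code itself):

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Scalar model of vzip2q_u8(a, b): interleave the upper 8 bytes of a and b.
std::array<uint8_t,16> zip2_u8(const std::array<uint8_t,16>& a,
                               const std::array<uint8_t,16>& b) {
    std::array<uint8_t,16> r{};
    for (int i = 0; i < 8; ++i) { r[2*i] = a[8+i]; r[2*i+1] = b[8+i]; }
    return r;
}

// Scalar model of the vqtbl2q_u8 workaround: the same permutation as a
// table lookup over the concatenated 32 bytes {a, b}, using the constant
// indices 8..15 (from a) interleaved with 24..31 (from b).
std::array<uint8_t,16> zip2_via_tbl(const std::array<uint8_t,16>& a,
                                    const std::array<uint8_t,16>& b) {
    std::array<uint8_t,32> tab{};
    for (int i = 0; i < 16; ++i) { tab[i] = a[i]; tab[16+i] = b[i]; }
    std::array<uint8_t,16> r{};
    for (int i = 0; i < 8; ++i) { r[2*i] = tab[8+i]; r[2*i+1] = tab[24+i]; }
    return r;
}
```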
* iq2_kl: convert/repack to q8_k_r8 (NEON)
* iq2_kl: Metal dequantize
* iq2_kl: Metal GEMV - pretty slow
* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)
* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)
* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 476 t/s up from 466 t/s.
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 492 t/s up from 476 t/s.
* Add iq2_kl to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_xxs
55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s.
* iq2_xs
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
* iq2_s
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
* iq3_xxs
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
* iq3_s
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
* q2_k
85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
* q3_K
45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
* q6_k
47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
* q4_k
58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.
As I had to add a new implementation for q8_1-quantized
activations, TG became slightly faster too
(25.1 -> 25.9 t/s).
* q5_k
54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
* iq4_xs
71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_kt and iq3_kt work with new int trellis
Much slower than the fp16 based trellis. I guess, Apple doesn't
have int8_t SIMD on the M2-Max GPU.
* q4_0
83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s
* q5_0
74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
* q6_0
74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
* q8_0
84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.
* iq4_nl
84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s
* q4_1
74.4 -> 115.4 t/s. There is no repacked variant
* q5_1
64.2 t/s -> 114.9 t/s. There is no repacked variant.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This seems slightly faster for IQ2_KT, IQ3_KT TG
* This looks better for iq4_kt TG
* WIP
* Cleanup
* With fancy simd also set func16
* Enable next_128() also on AVX2
Despite having just 16 vector registers it is still faster.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Removes errant ";" in front of 0xCBAC1FED in non-x86 code
```
error: expected primary-expression before ';' token
constexpr static uint32_t ka = ;0xCBAC1FED;
^
error: expected unqualified-id before numeric constant
constexpr static uint32_t ka = ;0xCBAC1FED;
^
```