Q8_0 needs Q8_0_X4, but Q8_0_R8 needs Q8_2_X4.
So, if we decide to repack a Q8_0 MoE tensor to Q8_0_R8,
iqk_moe_fused_mul_unary fails because the activations were
prepared as Q8_0_X4, but we now need Q8_2_X4.
For now a simple fix: just take the slow path, do not repack.
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This does the trick for PP
* Compute mask bounds when creating the mask
* Set mask bounds for all supported SWA models
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* gpt-oss: common
* gpt-oss: attention sinks, swiglu_oai
* gpt-oss: WIP llama
Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?
* gpt-oss: CPU seems to be working
It was the SWA that was missing in the previous commit.
There are issues with EOG tokens, so this still needs to be added.
* CUDA: ADD_ID
Just a copy from mainline
* gpt-oss: Seems to be working on CUDA
* gpt-oss: add sinks to the attn-vec kernels
* CUDA: add head size of 64 to new mma
Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.
* gpt-oss: add ability to use -fmoe (only CUDA for now)
* Move row sums to the right place
* Add sinks to iqk flash attention
* gpt_oss: Implement -fmoe on the CPU
* Simdify swiglu_oai
Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.
* llama: factor out model loader
* Builds successfully
* It runs, but mmap does not work
* Fix llama_mmap so mmap works
* Minor
* Fix CUDA after latest changes
* Attempt to use CUDA graphs with MoE models - not working
* CUDA graphs WIP - still not working
* CUDA graphs - seems to be working
Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.
* Make q8_0 cache work for DeepSeek models with CUDA graphs
* cuda: cpy for q6_0
* Fix llama_mmap on non-Linux platforms
* Adding forgotten file
* Iterating on Windows build failures
* cuda: re-add q8_0 -> q8_0 transpose
so mla = 2 can be used with CUDA graphs and q8_0 cache.
* Disable graphs without -fmoe
* Minor
* Turn graphs on by default
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* mxfp4: basics
* mxfp4: Zen4 GEMM
* mxfp4: repacked GEMM (AVX2/Zen4)
* mxfp4: AVX2 GEMM
* mxfp4: NEON GEMM
* mxfp4: repacked GEMM (NEON)
* mxfp4: Metal
* Fix quantized K cache without FA (#680)
* Prevent assert with quantized K cache and no FA
* Fix MMQ when running with quantized K cache without FA
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Fix for Deepseek r1 parsing (#676)
* Implement function calling / tools for ik_llama.cpp for Kimi K2
* Implement basic tool choice
* Backport llama.cpp tool calls support
* Enhance function calls with improved chat parser and string utilities
- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components
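A minimal sketch of what the three string helpers named above might look like. The signatures and the exact semantics of find_partial_stop are assumptions for illustration, not the code added by this commit:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Hypothetical stand-ins for the helpers named above; real signatures may differ.
static bool starts_with(const std::string & s, const std::string & prefix) {
    return s.size() >= prefix.size() && s.compare(0, prefix.size(), prefix) == 0;
}

static bool ends_with(const std::string & s, const std::string & suffix) {
    return s.size() >= suffix.size() &&
           s.compare(s.size() - suffix.size(), suffix.size(), suffix) == 0;
}

// Returns the offset in `text` at which a proper prefix of `stop` begins at the
// very end of `text`, or std::string::npos if the text does not end that way.
static std::size_t find_partial_stop(const std::string & text, const std::string & stop) {
    if (text.empty() || stop.empty()) return std::string::npos;
    // Try the longest possible partial match first.
    const std::size_t max_len = std::min(text.size(), stop.size() - 1);
    for (std::size_t len = max_len; len > 0; --len) {
        if (text.compare(text.size() - len, len, stop, 0, len) == 0) {
            return text.size() - len;
        }
    }
    return std::string::npos;
}
```

When streaming, a caller could hold back everything from the returned offset until more tokens arrive, so that a half-received marker such as `<tool` is never shown to the user.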
* Enhance function calling with unified streaming and parser improvements
- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation
* Replace hardcoded values in kimi_k2_parser.hpp with named constants
- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser
* Fix duplicate common_chat_parse definition
- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse
* Fix JSON assertion failure in function call parsing
- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
* Add comprehensive Qwen3 XML tool calling support with unit tests
- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
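As a rough illustration of the Qwen3 format above, the payload of each complete <tool_call>...</tool_call> block can be pulled out with plain string search. The function name is hypothetical, and the JSON payload is returned as raw text rather than parsed:

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the raw text between every complete
// <tool_call> ... </tool_call> pair. Incomplete blocks are ignored.
static std::vector<std::string> extract_tool_call_payloads(const std::string & text) {
    static const std::string open_tag  = "<tool_call>";
    static const std::string close_tag = "</tool_call>";
    std::vector<std::string> payloads;
    std::size_t pos = 0;
    while (true) {
        std::size_t begin = text.find(open_tag, pos);
        if (begin == std::string::npos) break;
        begin += open_tag.size();
        const std::size_t end = text.find(close_tag, begin);
        if (end == std::string::npos) break; // block not closed yet, e.g. mid-stream
        payloads.push_back(text.substr(begin, end - begin));
        pos = end + close_tag.size();
    }
    return payloads;
}
```

Each returned string would then go to a JSON parser to read the "name" and "arguments" fields.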
* Add DeepSeek R1 function calling support with comprehensive unit tests
- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility
Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support
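A simplified sketch of the reasoning extraction listed above, assuming a single <think>...</think> block. The function name and the handling of an unterminated block are illustrative choices, not the parser's actual behaviour:

```cpp
#include <cstddef>
#include <string>
#include <utility>

// Hypothetical helper: returns {reasoning, content}.
// Text outside the first <think>...</think> block is treated as content.
static std::pair<std::string, std::string> split_reasoning(const std::string & text) {
    static const std::string open_tag  = "<think>";
    static const std::string close_tag = "</think>";
    const std::size_t begin = text.find(open_tag);
    if (begin == std::string::npos) return {"", text};
    const std::size_t body = begin + open_tag.size();
    const std::size_t end  = text.find(close_tag, body);
    if (end == std::string::npos) {
        // Unterminated block: everything after the tag is still reasoning.
        return {text.substr(body), text.substr(0, begin)};
    }
    return {text.substr(body, end - body),
            text.substr(0, begin) + text.substr(end + close_tag.size())};
}
```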
* Add partial parsing support for JSON and regex
- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality
* Add format_chat integration tests for Qwen3 tool injection
- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation
Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.
* Fix Qwen3 tool call parsing - pass model name to parser
Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.
* Fix non-streaming path to use model-specific parsing
Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.
* Update Qwen3 function call handling in server and tests
- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support
* Add DeepSeek-R1 function call parsing support
Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>
Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality
Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.
* Update function_calls.md documentation for DeepSeek-R1 Format 4
- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers
* Fix merge conflict in test-function-calls.cpp
- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing
* Fix DeepSeek R1 parsing issue with responses wrapped in think tags
Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.
Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.
* Implement proper reasoning handling following original llama.cpp patterns
- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration
Based on original llama.cpp implementation patterns.
* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue
✅ Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix
Next: Update server to automatically apply this configuration.
* Complete server integration fix for DeepSeek R1 thinking tag termination
- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users
* Add TDD test case for DeepSeek R1 thinking tag termination issue
- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution
* Add remaining TDD test changes for DeepSeek R1 thinking tag fix
* Add debug output after upstream merge
* Remove temporary benchmark and debug files
- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)
* Port cpu moe options from mainline (#672)
* Port cpu moe options from mainline
* Use strdup and int32_t to follow coding guidelines
* mxfp4: CUDA dequantize
* mxfp4: CUDA GEMV
* mxfp4: CUDA MMQ
* mxfp4: minor CUDA tweaks
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
* iq1_kt: basics
* iq1_kt: CUDA dequantize
Testing with LLaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.
* iq1_kt: CUDA MMQ
* iq1_kt: CUDA MMVQ
* iq1_kt: AVX2 GEMM/GEMV
* iq1_kt: convert/repack to q8_0_r8 (AVX2)
* iq1_kt: slightly faster GEMV
18.6 t/s -> 19.4 t/s
* iq1_kt: NEON GEMM/GEMV
Pathetic as usual
* iq1_kt: slightly faster NEON - still pathetic
* iq1_kt: tiny bit better GEMV on NEON
* iq1_kt: convert/repack to q8_0_r8 (NEON)
* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON
* Adding forgotten file
* iq1_kt: add to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Experiments for 2.6875 bpw quants
At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
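For reference, the 1/16 figure follows if q2_K is taken at its usual 2.625 bpw (84 bytes per super-block of 256 weights):

```latex
2.6875 - 2.625 = 0.0625 = \tfrac{1}{16}\ \text{bpw}
```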
* iq2_kl: basics
* iq2_kl: CUDA dequantize
* iq2_kl: small improvement in PPL
Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.
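The neighbour check can be sketched roughly as follows. This is a simplified stand-in that rounds to a plain integer grid; the actual iq2_kl grid, scale encoding, and function names are not shown in this log and are assumed here:

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Squared error when the block is rounded to integers in [-qmax, qmax]
// and reconstructed with the given scale (hypothetical uniform grid).
static float block_error(const std::vector<float> & x, float scale, int qmax) {
    float err = 0.0f;
    for (float v : x) {
        int q = 0;
        if (scale != 0.0f) {
            q = (int) std::lround(v / scale);
            q = std::max(-qmax, std::min(qmax, q));
        }
        const float diff = v - scale * q;
        err += diff * diff;
    }
    return err;
}

// d is the super-block scale, ls the initially chosen integer block scale.
// Tries ls - 1 and ls + 1 as well and returns whichever gives the lowest error.
static int refine_block_scale(const std::vector<float> & x, float d, int ls, int qmax) {
    int   best     = ls;
    float best_err = block_error(x, d * ls, qmax);
    for (int cand : {ls - 1, ls + 1}) {
        const float err = block_error(x, d * cand, qmax);
        if (err < best_err) { best_err = err; best = cand; }
    }
    return best;
}
```

Minimizing the summed squared error over a fixed block is equivalent to minimizing its RMSE, so the square root is skipped.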
* iq2_kl: MMQ
Quite good: PP-512(L3-8B) = 8472 t/s.
* iq2_kl: MMVQ
We get PP-128(L3-8B) = 162 t/s.
Which means that this is not quite as good as it should be as
(almost) same bpw q2_K is at 170 t/s.
* iq2_kl: Zen4 GEMM/GEMV
Not particularly fast. I may need to think about rearranging the bits.
* iq2_kl: better Zen4
* iq2_kl: convert/repack to q8_k_r8 (AVX2)
* iq2_kl: AVX2 GEMM/GEMV
* iq2_kl: WIP NEON
The compiler started crashing!!!
* iq2_kl: NEON
Had to work around a compiler crash when using vzip2q_u8 by using
vqtbl2q_u8 instead.
* iq2_kl: convert/repack to q8_k_r8 (NEON)
* iq2_kl: Metal dequantize
* iq2_kl: Metal GEMV - pretty slow
* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)
* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)
* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 476 t/s up from 466 t/s.
* iq2_kl: slightly better Metal dequantize
PP-512 goes to 492 t/s up from 476 t/s.
* Add iq2_kl to constants.py
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_xxs
55.8 -> 167.5 t/s. iq2_xxs_r4 is at 93.7 t/s
* iq2_xs
46.4 -> 166.6 t/s. iq2_xs_r4 is at 72.3 t/s.
* iq2_s
42.8 t/s -> 166.8 t/s. iq2_s_r4 is at 71.1 t/s.
* iq3_xxs
51.8 t/s -> 165.6 t/s. iq3_xxs_r4 is at 84.6 t/s.
* iq3_s
46.0 t/s -> 162.0 t/s. iq3_s_r4 is at 79.4 t/s
* q2_k
85.7 t/s -> 168.1 t/s. q2_k_r4 is at 111.2 t/s.
* q3_K
45.7 t/s -> 170.8 t/s. q3_k_r4 is at 110.3 t/s.
* q6_k
47.7 t/s -> 124 t/s. q6_k_r4 is at 112.7 t/s.
* q4_k
58.2 t/s -> 114.8 t/s. iq4_k_r4 is at 130.9 t/s.
As I had to add a new implementation for q8_1-quantized
activations, TG became slightly faster too
(25.1 -> 25.9 t/s).
* q5_k
54.9 -> 114.9 t/s. q5_k_r4 is at 116.2 t/s.
* iq4_xs
71.2 -> 167.8 t/s. iq4_xs_r4 is at 138.6 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq2_kt and iq3_kt work with new int trellis
Much slower than the fp16 based trellis. I guess, Apple doesn't
have int8_t SIMD on the M2-Max GPU.
* q4_0
83.6 t/s -> 128.4 t/s. q4_0_r8 is at 123.5 t/s
* q5_0
74.2 t/s -> 128.5 t/s. q5_0_r4 is at 111.4 t/s.
* q6_0
74.2 t/s -> 128.8 t/s. q6_0_r4 is at 107.2 t/s.
* q8_0
84.5 -> 128.7 t/s. q8_0_r8 is at 131 t/s.
* iq4_nl
84.5 t/s -> 128.1 t/s. iq4_nl_r4 is at 120.4 t/s
* q4_1
74.4 -> 115.4 t/s. There is no repacked variant
* q5_1
64.2 t/s -> 114.9 t/s. There is no repacked variant.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* This seems slightly faster for IQ2_KT, IQ3_KT TG
* This looks better for iq4_kt TG
* WIP
* Cleanup
* With fancy simd also set func16
* Enable next_128() also on AVX2
Despite having just 16 vector registers it is still faster.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Removes errant ";" in front of 0xCBAC1FED in non-x86 code
```
error: expected primary-expression before ';' token
constexpr static uint32_t ka = ;0xCBAC1FED;
^
error: expected unqualified-id before numeric constant
constexpr static uint32_t ka = ;0xCBAC1FED;
^
```
* New iq4_kt trellis
The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.
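A scalar sketch of the generator described above, with the mask taken to be 0x3f3f3f3f (the low six bits of each byte). The function name is hypothetical and ka/kb are left as parameters, since only the ka value 0xCBAC1FED is quoted in this log (in the build error) and kb is not given:

```cpp
#include <cstdint>

// Scalar sketch of the integer trellis value for a given index.
// The multiply-add wraps modulo 2^32; the mask keeps 6 bits per byte,
// so each byte is in [0, 63] and the sum of the four bytes is in [0, 252].
static inline int trellis_value(uint32_t idx, uint32_t ka, uint32_t kb) {
    const uint32_t x = (ka * idx + kb) & 0x3f3f3f3fu;
    const int sum = int(x & 0xff) + int((x >> 8) & 0xff)
                  + int((x >> 16) & 0xff) + int(x >> 24);
    return sum - 126; // centred on zero: range [-126, 126], fits in int8_t
}
```

Summing four roughly uniform 6-bit values gives a bell-shaped distribution, which is presumably why the result can stand in for the earlier floating-point trellis.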
* Something is not working with the AVX2 dot product
* New iq4_kt: CUDA MMVQ
* New iq4_kt: CUDA MMQ
* For now have only iq4_kt use the new trellis
* Fix iq2_kt that got broken along the way
* New iq4_kt: AVX2 dot product finally works
We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.
* New iq4_kt: fix vanilla AVX2
* New iq4_kt: NEON implementation
We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.
* New iq4_kt: slightly faster NEON
* New iq4_kt: slightly faster NEON
* New iq4_kt: faster NEON
We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.
* Minor
* New iq4_kt trellis: not working Metal implementation
* Remove the extra 4 bytes of row meta data that is no longer used
* Cleanup
* Adding forgotten file
* Switching iq2_kt to new trellis - CUDA MMQ
* New iq2_kt: CUDA GEMV
* New iq2_kt: AVX2 dequantize
* New iq2_kt: AVX2 GEMM/GEMV
* Adding forgotten file
* New iq2_kt: NEON GEMM/GEMV
* New iq2_kt: slightly faster NEON GEMM
* New iq2_kt: Metal - very slow.
It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.
* Add missing break
* Trying @louiehelm's multiplier
* CPU
* iq3_kt: use integer trellis + CUDA dequantize and MMVQ
* iq3_kt: MMQ
* iq3_kt: AVX2 GEMM
* iq3_kt: AVX2 GEMV
* The trellis quants now need super-blocks of 256, so we need a check
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Repack q4_0 and q8_0 to q8_0_R8
q8_0 is fine, but I observe a very significant PPL increase
for q4_0. Best guess: precision loss with the 32 bit <-> 16 bit
scale conversions.
* Change q8_2_x4 to store int16_t sums
With that q4_0 now works.
I need to check all quants that use q8_2_x4!
* q5_0 and use a dequantizing template
* q6_0
129 t/s -> 296 t/s. q6_0_r4 is at 244 t/s.
* iq4_nl
137 t/s -> 293 t/s. iq4_nl_r4 is at 251 t/s.
* q4_1: 135 t/s -> 262 t/s
* q5_1: 125 t/s -> 253 t/s
* iq4_xs
178 t/s -> 363 t/s. iq4_xs_r4 is at 275 t/s.
* q2_K
202 t/s -> 364 t/s. q2_k_r4 is at 247 t/s.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* iq4_ks
203 t/s -> 357 t/s. iq4_ks_r4 is 242 t/s.
* iq4_k
175 t/s -> 353 t/s. iq4_k_r4 is 208 t/s.
PPL is actually lower!
* iq5_ks
180 t/s -> 359 t/s. iq5_ks_r4 is 210 t/s.
PPL is actually lower - 7.4160 vs 7.4494 for LLaMA-3.1-8B-Instruct
* iq5_k - accuracy loss is too big
* iq5_k - there was a bug with the shifts
...and that's why PPL was so high. It is also high on main.
This fixes it.
* iq6_k
148 t/s -> 350 t/s. There is no iq6_k_r4
PPL is actually lower because we have a bug in the existing
implementation!
* iq3_k
169 t/s -> 363 t/s. iq3_k_r4 is at 200 t/s.
* iq2_k
190 t/s -> 364 t/s. iq2_k_r4 is at 232 t/s.
* iq2_ks
200 t/s -> 367 t/s. There is no iq2_ks_r4.
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q6_K dequantizing GEMM
* Much easier: just use different vec_dot types!
* WIP
* Finally q6_K x q8_2_x4 dot product works
* Very slightly better
* We don't need the changes in ggml.c
* Fix AVX2
* iq2_xs
* Fix AVX2
* iq2_s
* q3_K
* Fix q8_k_r8 on Zen4
* q3_K: repack to q8_k_r8 instead of q8_0_r8
With that we hit 360 t/s for LLaMA-3.1-8B on a Ryzen-7950X.
q8_k_r8 is 386 t/s, so for a batch size of 512 repacking costs
~7% of the time taken by the actual GEMM.
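The ~7% can be reproduced from the two throughput numbers, assuming 386 t/s reflects the GEMM alone on already repacked data and 360 t/s includes the repacking:

```latex
\frac{t_{\text{repack}}}{t_{\text{GEMM}}}
  = \frac{1/360 - 1/386}{1/386}
  = \frac{386}{360} - 1 \approx 0.072
```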
* q3_K: don't scale when all quants in a block are <= 127 when repacking
* iq2_s: repack to q8_k_r8 instead of q8_0_r8
* iq2_xs: repack to q8_k_r8
* WIP
* iq2_xs: repack to q8_k_r8
* iq3_xxs: repack to q8_k_r8
* iq3_s: use q8_k_r8
* iq1_s: repack to q8_k_r8
* iq1_m: repack to q8_k_r8
* iq1_m: slightly faster
* Slightly faster
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* q4_K: dequantize to q8_1_r8 for batch >= 32
We get 268 t/s, up from 186 t/s.
* q4_K: GEMM with q8_2_X4
* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8
* Remove the scales, they are not needed
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Much faster iq2_xxs GEMM
PP-512 = 290 t/s vs ~110 t/s (iq2_xxs) or 148 t/s (iq2_xxs_r4) on main.
* iq2_xxs: q8_2_x4 GEMM
* iq2_xxs: use template for q8_2_x4 GEMM
* Fix AVX2
* Cleanup
* NEON is not working yet, so still use Q8_K GEMM
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Also do the dequantize approach for mul_mat_id
* Also do the dequantize approach for iqk_moe_fused_up_gate
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Experimenting with dequant + f32 GEMM
For iq4_kt this results in a massive PP improvement
from PP512 = ~42 t/s to PP512 = 128 t/s.
* Experimenting with dequant + f32 GEMM
iq2_kt: from PP512 = 57.3 t/s to PP512 = 135.0 t/s
iq3_kt: from PP512 = 43.8 t/s to PP512 = 131.4 t/s
* Experimenting with dequant + f16 GEMM on NEON
iq2_kt: PP512 = 79 t/s from 42 t/s
iq3_kt: PP512 = 81 t/s from 35 t/s
Also, found the reason why the f16 implementation for iq4_kt was
not working: it overflows. It works after multiplying with the row scale
before doing the multiply-adds.
* Experimenting with dequant + f16 GEMM on NEON
iq4_kt: PP512 = 86 t/s from 29 t/s
* Minor
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* WIP
* WIP
* WIP
* Testing Trellis quantization
Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.
* Testing Trellis quantization: 4-bit quantized block scales
rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.
* Testing Trellis quantization: playing with scales and generators
* iq2_kt: quantize / dequantize
I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
* iq2_kt: CUDA dequantize
so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.
* WIP
* WIP
* WIP - try larger blocks
With blocks of 32 and 16 bits per groups of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.
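The clustering trick can be sketched like this. The types and names are hypothetical, the group size of 8 is taken from the text, and how the clusters are built is left out:

```cpp
#include <array>
#include <cstddef>
#include <limits>
#include <vector>

using Point = std::array<float, 8>; // one group of 8 values

static float dist2(const Point & a, const Point & b) {
    float d = 0.0f;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const float diff = a[i] - b[i];
        d += diff * diff;
    }
    return d;
}

struct Cluster {
    Point            center;
    std::vector<int> members; // indices into the codebook
};

// Finds the nearest cluster centre first, then searches only that cluster's
// members. Returns the codebook index of the best member, or -1 if none.
static int find_in_nearest_cluster(const Point & x,
                                   const std::vector<Point> & codebook,
                                   const std::vector<Cluster> & clusters) {
    const Cluster * nearest = nullptr;
    float best_c = std::numeric_limits<float>::max();
    for (const auto & c : clusters) {
        const float d = dist2(x, c.center);
        if (d < best_c) { best_c = d; nearest = &c; }
    }
    if (!nearest) return -1;
    int   best   = -1;
    float best_d = std::numeric_limits<float>::max();
    for (int idx : nearest->members) {
        const float d = dist2(x, codebook[idx]);
        if (d < best_d) { best_d = d; best = idx; }
    }
    return best;
}
```

The result is approximate: the true nearest point can sit in a neighbouring cluster, which is the price paid for avoiding the full brute-force search.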
* iq2_kt - this is better
Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
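The bpw figures work out as follows, assuming one 4-bit scale per block of 32 (as in the earlier 4-bit block scale experiment), ignoring any per-row metadata, and taking iq2_xxs at 2.0625 bpw:

```latex
\frac{16}{8} + \frac{4}{32} = 2.125 = 2.0625 + 0.0625, \qquad
\frac{15}{8} + \frac{4}{32} = 2.0 = 2.0625 - 0.0625
```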
* iq2_kt - even better
Re-quantize after determining block scales
(at the expense of much longer quantization time).
* iq2_kt: CUDA dot product
Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.
* iq2_kt: very slightly faster CUDA dot product
* iq2_kt: f16 CUDA dot product
We arrive at 112 t/s.
* iq2_kt: faster f16 CUDA dot product
We arrive at 139 t/s (no FA), and 149 t/s (FA).
My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
* iq2_kt: faster f16 CUDA dot product
We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.
* Minor
* Adding iq3_kt
3.125 bpw. So far does not look good on the PPL vs bpw plot.
* Forgotten change
* WIP
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.
* WIP
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892
* iq3_kt WIP: slowly improving
PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.
* iq3_kt WIP: speed up quantization
Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.
* iq3_kt speed up quantization
Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!
* iq3_kt: CUDA dot product
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B, 4096) = 6.4179
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* Adding iq4_kt - not competitive at this point
* WIP
* WIP
* iq4_kt: CUDA dot product
* iq4_kt: minor tweaks
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B, 4096) = 6.3920
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B, 4096) = 6.3913
Ah, quantization is faster too. About 20% faster.
* iq3_kt: small improvements and faster quantization
* iq2_kt: SOTA
We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B, 4096) = 6.3825
Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.
* iq3_kt: small progress
* WIP
* iq4_kt: go to 4.0 bpw
15 bits per group of 4, plus 8 bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.
* iq4_kt: very slightly better
at the expense of much longer quantization time.
* iq4_kt: failed attempt to adjust CUDA dot product
It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.
* DRY
* DRY
* iq4_kt: CUDA dot product works
* DRY
* Report actual bpw
* Minor tweaks
* Checkpoint
Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).
I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.
* WIP for IQ2_KT
* WIP - working basic iq2_kt
* still super slow (0.17t/s eval)
* flatten 3inst iters + avx2 (0.3t/s eval)
* iq3_kt (0.3t/s eval) and renames
* wip buggy iq4_KT
* fix (0.22t/s eval)
* naming and remove unused fn
* cleanup
* more cleanup
* delete unused and noncompiling mmvq functions
* Some performance tweaks
* Slightly faster iq2_kt
* port Trellis struct to iq3_kt, iq4_kt
* oops untracked files
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
* Refactor iqk: WIP
* Refactor iqk: Factor out float GEMM (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for legacy quants (AVX2/AVX512)
* Refactor iqk: Factor out GEMM for k-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for i-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iqk-quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for 1-bit quants (AVX2/AVX512)
* Refactor iqk: fix AVX2
* Refactor iqk: Factor out GEMM for iq1_bn, iq2_bn, iq2_bn_r4
* Refactor iqk: Factor out GEMM for repacked legacy quants
* Refactor iqk: Factor out GEMM for q8_K_R8, q8_KV
* Refactor iqk: Factor out GEMM for repacked i-quants
* Refactor iqk: GEMM kernels are refactored on AVX2/AVX512
* Refactor iqk: factor out 1-bit quants (NEON)
* Refactor iqk: factor out k-quants (NEON)
* Refactor iqk: factor out floats (NEON)
* Also iq4_xs belongs to k-quants
* Refactor iqk: factor out iqk quants (NEON)
* Refactor iqk: factor out legacy quants (NEON)
* Refactor iqk: factor out repacked legacy quants (NEON)
* Refactor iqk: factor out repacked k-quants (NEON)
* Refactor iqk: factor out repacked iqk quants (NEON)
* Refactor iqk: GEMM kernels are refactored on NEON
* Refactor iqk: FA compiles
If it works is a different story.
Current compile time: 107.3 seconds on the Ryzen-7950X
* Refactor iqk: FA refactored (Zen4)
Compile time for the FA files is now ~21 seconds on my
Ryzen-7950X, so still slightly too long for my taste
but much better than the 142 seconds we had before.
* Adding forgotten file
* Most helpers don't need to be templates
Also hide Q4_0 and Q8_KV behind IQK_FA_ALL_QUANTS.
Compilation time drops to 14 seconds on the Ryzen-5975WX
* Fix bf16
* Refactor iqk: FA refactored (NEON)
* Forgotten MMQ ref and typo (#431)
* Adding forgotten iq5_k_r4
* Fix iq4_k_r4 on NEON
* Fix iq4_ks on NEON
It was broken before the refactoring (the shifts were not correctly
applied).
* Fix q8_0 on NEON
* Fix q6_0 K cache
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Nexes the Elder <124105151+Nexesenex@users.noreply.github.com>
* iq5_ks: basics
* iq5_ks: quantize
* iq5_ks: CUDA dequantize works
* iq5_ks: dot product works on CUDA
* iq5_ks: MMQ works
* iq5_ks: Zen4
* iq5_ks: AVX2
But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.
* iq5_ks: NEON
* iq5_ks: Metal dequantize
* iq5_ks: Metal dot product
---------
Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>