Commit Graph

27 Commits

Author SHA1 Message Date
Kawrakow
fc06bc9d27 Enable CUDA graphs for MoE models + GPT-OSS support (#689)
* gpt-oss: common

* gpt-oss: attention sinks, swiglu_oai

* gpt-oss: WIP llama

Model loads and runs (CPU only), but PPL is much too high
(~1500 for 1st batch vs ~200 in mainline).
Is it because of SWA, because of vocab, or did I introduce a bug somewhere?

* gpt-oss: CPU seems to be working

It was the SWA that was missing in the previous commit.

There are issues with EOG tokens, so this still needs to be added.

* CUDA: ADD_ID

Just a copy from mainline

* gpt-oss: Seems to be working on CUDA

* gpt-oss: add sinks to the attn-vec kernels

* CUDA: add head size of 64 to new mma

Haven't turned it on yet, but observe slightly better PP and slightly
worse TG performance with that.

* gpt-oss: add ability to use -fmoe (only CUDA for now)

* Move row sums to the right place

* Add sinks to iqk flash attention

* gpt_oss: Implement -fmoe on the CPU

* Simdify swiglu_oai

Turning it off for now as performance becomes more variable,
so perhaps I'm running into thermal throttling more often
because of making the CPU work too hard.

* llama: factor out model loader

* Builds successfully

* It runs, but mmap does not work

* Fix llama_mmap so mmap works

* Minor

* Fix CUDA after latest changes

* Attempt to use CUDA graphs with MoE models - not working

* CUDA graphs WIP - still not working

* CUDA graphs - seems to be working

Likely not all MLA variants are working.
I no longer remember why I added the q8_0 cpy that
transposes the tensor, but if really needed, this is now
missing. Also missing is q6_0.

* Make q8_0 cache work for DeepSeek models with CUDA graphs

* cuda: cpy for q6_0

* Fix llama_mmap on non-Linux platforms

* Adding forgotten file

* Iterating on Windows build failures

* cuda: re-add q8_0 -> q8_0 transpose

so mla = 2 can be used with CUDA graphs and q8_0 cache.

* Disable graphs without -fmoe

* Minor

* Turn graphs on by default

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-08-15 09:18:07 +03:00
Kawrakow
7117c23de4 MXFP4 (#682)
* mxfp4: basics

* mxfp4: Zen4 GEMM

* mxfp4: repacked GEMM (AVX2/Zen4)

* mxfp4: AVX2 GEMM

* mxfp4: NEON GEMM

* mxfp4: repacked GEMM (NEON)

* mxfp4: Metal

* Fix quantized K cache without FA (#680)

* Prevent assert with quantized K cache and no FA

* Fix MMQ when running with quantized K cache without FA

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

* Fix for Deepseek r1 parsing (#676)

* Implement function calling / tools for ik_llama.cpp for Kimi K2

* Implement basic tool choice

* Backport llama.cpp tool calls support

* Enhance function calls with improved chat parser and string utilities

- Add new chat.h/chat.cpp and chat-parser.h/chat-parser.cpp for better chat handling
- Improve function calls parsing with fallback to llama.cpp builder pattern
- Add string utility functions (starts_with, ends_with, find_partial_stop)
- Update README with function calls testing instructions
- Enhance Kimi K2 parser and function calls documentation
- Add comprehensive test suite for function calls
- Update CMakeLists.txt and Makefile for new components
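
The string utilities listed above can be illustrated with a short sketch. This is not the repository's implementation: the name `find_partial_stop` is borrowed from the list, but the signature and the "proper prefix only" behaviour are assumptions made for illustration. The idea is that a streaming server should hold back any text that might be the beginning of a stop or tool-call marker.

```python
def find_partial_stop(text: str, stop: str) -> int:
    """Return the index in `text` where a proper prefix of `stop` begins
    at the very end of `text`, or -1 if the tail cannot start `stop`.

    Hypothetical helper: the semantics are an assumption, not the
    behaviour of the function of the same name in the repository.
    """
    # Try the longest candidate prefix first so the earliest index wins.
    for n in range(min(len(stop) - 1, len(text)), 0, -1):
        if text.endswith(stop[:n]):
            return len(text) - n
    return -1


streamed = "Hello <tool"
cut = find_partial_stop(streamed, "<tool_call>")
safe_to_send = streamed if cut == -1 else streamed[:cut]
```

With this sketch, `safe_to_send` holds only the text before the possible marker, and the held-back tail is emitted later if the marker never completes.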

* Enhance function calling with unified streaming and parser improvements

- Fix streaming content cleanup to prevent function syntax in output
- Unify content extraction patterns with llama.cpp approach
- Improve Kimi K2 parser robustness and partial content handling
- Add comprehensive test coverage for function call scenarios
- Optimize chat message parsing and diff computation

* Replace hardcoded values in kimi_k2_parser.hpp with named constants

- Add compile-time constants for all token format markers
- Add compile-time constants for XML format markers
- Add compile-time constants for simple format patterns
- Replace all hardcoded string literals with named constants
- Use compile-time length calculation to avoid manual counting
- Improve maintainability and reduce magic numbers throughout parser

* Fix duplicate common_chat_parse definition

- Remove duplicate implementation from chat-parser.cpp
- Keep single implementation in chat.cpp following llama.cpp patterns
- Resolves linker error: multiple definition of common_chat_parse

* Fix JSON assertion failure in function call parsing

- Add proper validation that 'function' field is an object before accessing nested keys
- Handle missing 'arguments' field gracefully with default "{}"
- Prevents crash when parsing malformed tool call JSON structures
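
The validation described above can be sketched as follows. This is an illustrative Python rendering under assumptions, not the C++ code from the commit: the function name `extract_tool_call` and the exact return shape are hypothetical.

```python
import json


def extract_tool_call(raw: str):
    """Return (name, arguments_json) or None for malformed input.

    Hypothetical helper mirroring the checks named in the commit:
    'function' must be an object, and a missing 'arguments' field
    defaults to the string "{}".
    """
    try:
        call = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(call, dict):
        return None
    function = call.get("function")
    if not isinstance(function, dict):      # guard before touching nested keys
        return None
    name = function.get("name")
    if not isinstance(name, str):
        return None
    arguments = function.get("arguments", "{}")
    if not isinstance(arguments, str):      # normalise object form to a JSON string
        arguments = json.dumps(arguments)
    return name, arguments
```

Returning `None` instead of raising keeps a malformed tool call from taking down the request, which is the failure mode the commit describes.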

* Add comprehensive Qwen3 XML tool calling support with unit tests

- Implement Qwen3 XML parser with <tool_call>{"name": "func", "arguments": {...}}</tool_call> format
- Add model detection and routing for Qwen3 vs Kimi-K2 formats
- Create 8 comprehensive unit tests covering parsing, streaming, error handling
- Fix token format cleaning bug in kimi_k2_parser.hpp processing order
- Remove progressive parsing code and related utilities
- Add tool injection support for Qwen3 format in server utils
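
A minimal sketch of parsing the `<tool_call>{...}</tool_call>` format named above. It is an assumption-laden illustration in Python rather than the parser added by the commit; `parse_qwen3` and its return shape are hypothetical names.

```python
import json
import re

# Non-greedy JSON body; backtracking extends it to the brace that
# directly precedes the closing tag, so nested objects are captured.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_qwen3(text: str):
    """Split model output into (content, tool_calls)."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            obj = json.loads(match.group(1))
        except json.JSONDecodeError:
            continue                        # skip malformed blocks
        if isinstance(obj, dict) and isinstance(obj.get("name"), str):
            calls.append({"name": obj["name"],
                          "arguments": obj.get("arguments", {})})
    content = TOOL_CALL_RE.sub("", text).strip()
    return content, calls


content, calls = parse_qwen3(
    'Sure. <tool_call>{"name": "get_weather", '
    '"arguments": {"city": "Paris"}}</tool_call>'
)
```

Separating `content` from `calls` is what lets a server return a proper tool_calls array rather than leaving the XML in the visible text.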

* Add DeepSeek R1 function calling support with comprehensive unit tests

- Implement complete DeepSeek R1 tool call parsing in common_chat_parser.cpp
- Add DeepSeek R1 model detection and tool injection in deepseek_r1_tools.hpp
- Update function_calls.hpp with DeepSeek R1 integration and content extraction
- Update documentation to reflect support for Kimi-K2, Qwen3, and DeepSeek R1 models
- Add comprehensive unit tests for DeepSeek R1 reasoning, tool calls, and integration
- Port exact implementation patterns from original llama.cpp for compatibility

Key features:
- Native DeepSeek R1 format: <|tool▁calls▁begin|>function<|tool▁sep|>name```json{}```<|tool▁call▁end|><|tool▁calls▁end|>
- Reasoning content extraction from <think>...</think> tags
- Multiple tool calls support with separate call blocks
- Model detection for deepseek-r1, deepseek_r1 naming patterns
- Integration with incremental parsing and streaming support

* Add partial parsing support for JSON and regex

- json-partial.h/cpp: JSON partial parsing functionality
- regex-partial.h/cpp: Regex partial parsing functionality

* Add format_chat integration tests for Qwen3 tool injection

- Add test_qwen3_format_chat_integration() to validate tool injection pipeline
- Test tool injection conditions and system message enhancement
- Verify JSON formatting and anti-preamble instructions
- Add comprehensive test documentation

Tests confirm tool injection works correctly - conversational preamble
issue is not in ik_llama.cpp but likely in UI configuration.

* Fix Qwen3 tool call parsing - pass model name to parser

Server was not passing model name to parse_chat_message_incremental(),
causing Qwen3 to fall back to Kimi-K2 parser and return tool calls
as content instead of proper tool_calls array.

* Fix non-streaming path to use model-specific parsing

Non-streaming responses were hardcoded to use Kimi-K2 format,
causing Qwen3 XML tool calls to be returned as content instead
of proper tool_calls array. Now uses same model detection as
streaming path for consistency.

* Update Qwen3 function call handling in server and tests

- Enhanced server function call detection and response formatting
- Improved test coverage for Qwen3 tool call scenarios
- Refined XML parsing for better tool execution support

* Add DeepSeek-R1 function call parsing support

Implements comprehensive parsing for all 4 DeepSeek-R1 function call formats:
- Format 1: Standard function call syntax (already supported)
- Format 2: Alternative function call patterns (already supported)
- Format 3: Tools array format - function\n```json\n{"tools": [...]}
- Format 4: XML wrapped format - <tool_call>function</think>Name\n```json\n{...}```</tool_call>

Key changes:
- Added parse_deepseek_r1_tools_array() following original parse_prefixed_json_tool_call_array pattern
- Added parse_deepseek_r1_xml_wrapped() following Hermes-2-Pro XML wrapper patterns
- Integrated both parsers into exception handling chain for robust fallback
- Added comprehensive TDD test coverage for all formats
- Anonymized all confidential information while preserving functionality

Resolves tool_calls_count=0 issue where DeepSeek-R1 models generated valid tool calls
but server failed to parse them correctly.

* Update function_calls.md documentation for DeepSeek-R1 Format 4

- Added Format 4 (XML wrapped) documentation with examples
- Updated implementation notes with correct parser order (3→4→1→2)
- Marked all DeepSeek-R1 formats as working (July 2025 update)
- Updated test status for Format 3 and 4 as passing
- Added parse_deepseek_r1_xml_wrapped() function reference
- Corrected implementation file line numbers

* Fix merge conflict in test-function-calls.cpp

- Removed incomplete merge conflict marker from line 3027
- Ensured all tests compile and pass successfully
- All DeepSeek-R1 formats (1-4) working correctly
- All streaming and content cleaning tests passing

* Fix DeepSeek R1 parsing issue with responses wrapped in think tags

Restore missing consume_rest() call from working PR #648 implementation.
When responses don't contain tool calls, remaining content after reasoning
parsing must be preserved as displayable content.

Fixes issue where entire responses wrapped in <think> tags resulted in
empty content output.

* Implement proper reasoning handling following original llama.cpp patterns

- Add missing reasoning_format and reasoning_in_content fields to common_chat_syntax
- Update try_parse_reasoning to match original llama.cpp logic exactly
- Add TDD test case with reasoning_in_content=true for DeepSeek R1
- Following TDD: test should now pass with proper syntax configuration

Based on original llama.cpp implementation patterns.

* TDD SUCCESS: Fix DeepSeek R1 thinking tag termination issue

- Test passes with reasoning_in_content=true configuration
- Content properly preserved: '<think>content</think>' displays fully
- Reasoning field empty as expected
- Following TDD: test-first approach validates the fix

Next: Update server to automatically apply this configuration.

* Complete server integration fix for DeepSeek R1 thinking tag termination

- Server now automatically sets reasoning_in_content=true for DeepSeek R1 models
- Fixes issue where responses wrapped in <think> tags appear empty to users
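
The effect of the flag can be sketched in a few lines. This is a simplified Python illustration, not the server code: `parse_reasoning` is a hypothetical function, and only its `reasoning_in_content` parameter is named after the field mentioned in this log.

```python
def parse_reasoning(text: str, reasoning_in_content: bool = False):
    """Split '<think>...</think>rest' into (reasoning, content)."""
    open_tag, close_tag = "<think>", "</think>"
    stripped = text.lstrip()
    if not stripped.startswith(open_tag):
        return "", text                     # no reasoning block at all
    if reasoning_in_content:
        return "", text                     # keep everything visible to the user
    body = stripped[len(open_tag):]
    end = body.find(close_tag)
    if end == -1:
        return body.strip(), ""             # unterminated block: all reasoning
    # Whatever follows the closing tag must survive as content.
    return body[:end].strip(), body[end + len(close_tag):].strip()
```

With the flag off, a response that is wrapped entirely in think tags yields empty content, which matches the symptom reported; with the flag on, the full text is kept as content and the reasoning field stays empty.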

* Add TDD test case for DeepSeek R1 thinking tag termination issue

- Test reproduces the exact failure scenario reported by user
- Validates that reasoning_in_content=true fixes the issue
- Demonstrates empty content problem and working solution

* Add remaining TDD test changes for DeepSeek R1 thinking tag fix

* Add debug output after upstream merge

* Remove temporary benchmark and debug files

- Remove tests/benchmark-progressive-parsing.cpp (development tool, not part of core functionality)
- Remove tests/reproduce_bug.sh (debugging script, not needed for PR)

* Port cpu moe options from mainline (#672)

* Port cpu moe options from mainline

* Use strdup and int32_t to follow coding guidelines

* mxfp4: CUDA dequantize

* mxfp4: CUDA GEMV

* mxfp4: CUDA MMQ

* mxfp4: minor CUDA tweaks

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Co-authored-by: Anton Sokolchenko <wsevendays@gmail.com>
Co-authored-by: Parsa <61601745+TheLegendOfKitty@users.noreply.github.com>
2025-08-09 08:40:18 +03:00
Kawrakow
f989fb03bd Adding IQ1_KT - 1.75 bpw SOTA quants (#616)
* iq1_kt: basics

* iq1_kt: CUDA dequantize

Testing with LLaMA-3.1-8B-Instruct, we get almost the same PPL
as iq2_xxs, so about 0.2 bpw fewer bits for the same quality.

* iq1_kt: CUDA MMQ

* iq1_kt: CUDA MMVQ

* iq1_kt: AVX2 GEMM/GEMV

* iq1_kt: convert/repack to q8_0_r8 (AVX2)

* iq1_kt: slightly faster GEMV

18.6 t/s -> 19.4 t/s

* iq1_kt: NEON GEMM/GEMV

Pathetic as usual

* iq1_kt: slightly faster NEON - still pathetic

* iq1_kt: tiny bit better GEMV on NEON

* iq1_kt: convert/repack to q8_0_r8 (NEON)

* iq1_kt: very slightly faster convert/repack to q8_0_r8 on NEON

* Adding forgotten file

* iq1_kt: add to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-20 10:05:23 +02:00
Kawrakow
45fae1a144 Adding IQ2_KL (#602)
* Experiments for 2.6875 bpw quants

At least according to rmse, this is significantly better than
q2_K, while using only 1/16 more bits per weight.
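
The bits-per-weight figure can be checked with a short calculation. The sketch assumes the usual q2_K layout of 84 bytes per super-block of 256 weights; that layout is not stated in this log.

```python
from fractions import Fraction

# Assumed q2_K layout: 84 bytes for every 256 weights.
q2_k_bpw = Fraction(84 * 8, 256)            # 2.625 bits per weight
new_bpw = q2_k_bpw + Fraction(1, 16)        # "1/16 more bits per weight"

print(float(q2_k_bpw), float(new_bpw))
```

Under that assumption the sum is exactly the 2.6875 bpw quoted in the commit title.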

* iq2_kl: basics

* iq2_kl: CUDA dequantize

* iq2_kl: small improvement in PPL

Also check the two neighbouring values for the block scale
and use the one that minimizes RMSE.
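
A rough sketch of the neighbour check described above, written in Python for readability. The quantization levels, the scale step, and both function names are illustrative assumptions and do not come from the iq2_kl code.

```python
def rmse(xs, scale, levels):
    """RMSE when each x is rounded to the nearest scale * level."""
    err = 0.0
    for x in xs:
        best = min(levels, key=lambda q: abs(x - scale * q))
        err += (x - scale * best) ** 2
    return (err / len(xs)) ** 0.5


def refine_scale(xs, step, levels):
    """Pick the integer block scale among the initial guess and its
    two neighbours, keeping the one with the lowest RMSE."""
    amax = max(abs(x) for x in xs)
    lmax = max(abs(q) for q in levels)
    i0 = max(1, round(amax / (lmax * step)))      # initial guess from max |x|
    candidates = [i for i in (i0 - 1, i0, i0 + 1) if i >= 1]
    return min(candidates, key=lambda i: rmse(xs, i * step, levels))


block = [0.9, -0.4, 0.1, 0.35]                    # made-up weights
best = refine_scale(block, step=0.05, levels=[-2, -1, 0, 1])
```

The extra cost is two more RMSE evaluations per block, which is small next to the search over quantized values.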

* iq2_kl: MMQ

Quite good: PP-512(L3-8B) = 8472 t/s.

* iq2_kl: MMVQ

We get TG-128(L3-8B) = 162 t/s.
Which means that this is not quite as good as it should be, as
q2_K at (almost) the same bpw is at 170 t/s.

* iq2_kl: Zen4 GEMM/GEMV

Not particularly fast. I may need to think about rearranging the bits.

* iq2_kl: better Zen4

* iq2_kl: convert/repack to q8_k_r8 (AVX2)

* iq2_kl: AVX2 GEMM/GEMV

* iq2_kl: WIP NEON

The compiler started crashing!!!

* iq2_kl: NEON

Had to work around a compiler crash when using vzip2q_u8 using
vqtbl2q_u8.

* iq2_kl: convert/repack to q8_k_r8 (NEON)

* iq2_kl: Metal dequantize

* iq2_kl: Metal GEMV - pretty slow

* iq2_kl: Metal GEMV - slightly better (40 t/s -> 44.5 t/s)

* iq2_kl: Metal GEMV - slightly better (44.5 t/s -> 46.5 t/s)

* iq2_kl: Metal GEMV - slightly better (46.5 t/s -> 47.2 t/s)

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 476 t/s up from 466 t/s.

* iq2_kl: slightly better Metal dequantize

PP-512 goes to 492 t/s up from 476 t/s.

* Add iq2_kl to constants.py

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-14 18:55:08 +02:00
Kawrakow
3248a35992 Adding IQ3_KS quants (#566)
* iq3_ks: basics

* iq3_ks: CUDA dequantize

* iq3_ks: CUDA mmvq

* iq3_ks: mmq

* iq3_ks: faster mmq

* iq3_ks: Zen4

* iq3_ks: AVX2 convert to q8_k_r8

This gives us PP-512 = 360 t/s.

* iq3_ks: AVX2 GEMM/GEMV

* iq3_ks: NEON GEMM/GEMV

* iq3_ks: NEON convert to q8_k_r8

This gives us PP-512 = 164 t/s.

* iq3_ks: Metal dequantize

* iq3_ks: Metal gemv - pathetic performance

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-07-02 09:27:47 +02:00
Kawrakow
5236c98b41 CUDA: MMQ for iqX_r4 quants (#557)
* cuda: MMQ for iq2_k_r4

* cuda: MMQ for iq3_k_r4

* cuda: MMQ for iq4_k_r4

* cuda: MMQ for iq5_k_r4

* iqk_r4 quants: use MMQ only for batches < 1024 tokens

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-26 08:50:49 +02:00
Kawrakow
d85c64428e New IQ2_KT, IQ3_KT and IQ4_KT, V2 (#529)
* New iq4_kt trellis

The new trellis generates int8_t values via
sum_as_uint8_t[(ka * idx + kb) & 0x3f3f3f3f] - 126.
CUDA dequantize works.
AVX2 case Ny > 32 works, and we get 273 t/s for L3-8B.
PPL is on par or even slightly lower than original QTIP trellis.
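
One reading of the generator formula, sketched in Python. It assumes the mask is the four-byte pattern 0x3f3f3f3f and that `sum_as_uint8_t` means adding the four bytes of the masked 32-bit value. The multiplier and offset below are placeholders, not the constants used in the repository.

```python
MASK = 0x3F3F3F3F  # keeps the low 6 bits of each of the 4 bytes


def trellis_value(idx: int, ka: int, kb: int) -> int:
    """One int8 sample: byte sum of the masked 32-bit state, recentred by 126."""
    state = (ka * idx + kb) & 0xFFFFFFFF          # emulate 32-bit wrap-around
    masked = state & MASK
    byte_sum = sum((masked >> shift) & 0xFF for shift in (0, 8, 16, 24))
    return byte_sum - 126                         # each byte <= 63, so sum <= 252


KA, KB = 0x9E3779B1, 0x7F4A7C15                   # placeholder constants
samples = [trellis_value(i, KA, KB) for i in range(16)]
```

Because every masked byte is at most 63, the result always lies in [-126, 126] and fits an int8_t, and summing four roughly uniform bytes gives a bell-shaped distribution.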

* Something is not working with the AVX2 dot product

* New iq4_kt: CUDA MMVQ

* New iq4_kt: CUDA MMQ

* For now have only iq4_kt use the new trellis

* Fix iq2_kt that got broken along the way

* New iq4_kt: AVX2 dot product finally works

We get 13.6 t/s vs 8.4 t/s with the f16 trellis and f32 arithmetic.
Still somewhat slower than other quants, but no longer pathetic.

* New iq4_kt: fix vanilla AVX2

* New iq4_kt: NEON implementation

We get very respectable PP-512 = 120 t/s.
TG-128 is pathetic at 5.3 t/s, so 20+% slower than the f16 variant.

* New iq4_kt: slightly faster NEON

* New iq4_kt: slightly faster NEON

* New iq4_kt: faster NEON

We are now at 9.4 t/s, up from 6.6 t/s for the f16 trellis.

* Minor

* New iq4_kt trellis: not working Metal implementation

* Remove the extra 4 bytes of row meta data that is no longer used

* Cleanup

* Adding forgotten file

* Switching iq2_kt to new trellis - CUDA MMQ

* New iq2_kt: CUDA GEMV

* New iq2_kt: AVX2 dequantize

* New iq2_kt: AVX2 GEMM/GEMV

* Adding forgotten file

* New iq2_kt: NEON GEMM/GEMV

* New iq2_kt: slightly faster NEON GEMM

* New iq2_kt: Metal - very slow.

It seems Apple Silicon cannot quickly add 4 8-bit ints.
Or I don't know how to do it - but I didn't find anything
in the Metal Shading Language Specification.
So, performance is quite a bit worse than the original trellis.

* Add missing break

* Trying @louiehelm's multiplier

* CPU

* iq3_kt: use integer trellis + CUDA dequantize and MMVQ

* iq3_kt: MMQ

* iq3_kt: AVX2 GEMM

* iq3_kt: AVX2 GEMV

* The trellis quants now need super-blocks of 256, so we need a check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-18 16:20:54 +03:00
Kawrakow
8ffad187ab MMQ implementation for IQ4_KS_R4 and IQ5_KS_R4 (#493)
* MMQ for iq4_ks_r4

* MMQ for iq5_ks_r4

* Add forgotten file

* Another forgotten file

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 08:31:20 +03:00
Kawrakow
7e79665a31 CUDA implementation for IQ1_S_R4 (#492)
* iq1_s_r4: CUDA dequantize

* iq1_s_r4: CUDA GEMV

* iq1_s_r4: MMQ on CUDA

Requires Turing or better (will fall back to dequantize+cuBLAS on older cards).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-06-05 07:24:31 +03:00
Andrew Chan
a1c931c30c Trellis quants with CPU inference (#441)
* WIP

* WIP

* WIP

* Testing Trellis quantization

Using 12 bits per 8 weights I get a better rmse than
iq2_xxs. I still need to see how quantizing the group-of-8
scales will affect accuracy. By AVX2 SIMDifying the search
for the best code, LLaMA-3.1-8B gets quantized in 130 seconds
on the Ryzen-7950X CPU - sluggish but still acceptable.

* Testing Trellis quantization: 4-bit quantized block scales

rmse increases by just 3%, so this is beating iq2_xxs in terms
of rmse at the same 2.0625 bpw.

* Testing Trellis quantization: playing with scales and generators

* iq2_kt: quantize / dequantize

I now see that I was comparing apples to oranges:
iq2_xxs was using a weight of sigma^2/4 + x^2, while
the Trellis approach wasn't (weight = 1). Once I use the same weight,
iq2_kt is actually slightly worse than iq2_xxs in terms
of rmse, so does not look promising at this point.
Also, once each group of 8 Trellis values no longer has a
constant sum(q^2) that we can precompute, quantization
becomes significantly slower (476 seconds for LLaMA-3.1-8B).
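
The difference between the two error measures can be sketched as follows. The weight sigma^2/4 + x^2 is taken from the text above; treating sigma^2 as the mean of x^2 over the block is an assumption, and `weighted_rmse` is a hypothetical helper.

```python
def weighted_rmse(xs, qs, use_weights=True):
    """RMSE between weights xs and their quantized values qs.

    With use_weights=True each term is weighted by sigma^2/4 + x^2,
    otherwise every term has weight 1.
    """
    sigma2 = sum(x * x for x in xs) / len(xs)     # assumed definition of sigma^2
    num = den = 0.0
    for x, q in zip(xs, qs):
        w = sigma2 / 4 + x * x if use_weights else 1.0
        num += w * (x - q) ** 2
        den += w
    return (num / den) ** 0.5


xs = [1.0, 0.0]
qs = [0.5, 0.0]                                   # error sits on the large weight
plain = weighted_rmse(xs, qs, use_weights=False)
weighted = weighted_rmse(xs, qs, use_weights=True)
```

In this made-up example the weighted figure is larger than the plain one because the error falls on the largest-magnitude value, which is why comparing a weighted search against an unweighted one is misleading.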

* iq2_kt: CUDA dequantize

so we can run perplexity calcs.
As already indicated by rmse, the 2-bit trellis approach is
quite a bit worse than iq2_xxs.

* WIP

* WIP

* WIP - try larger blocks

With blocks of 32 and 16 bits per group of 8 the brute force
search becomes prohibitive in terms of CPU time (30+ minutes
for 8B LLaMA after SIMDifying with AVX2). The trick is to
group the points in clusters, find the nearest cluster,
and only search within the cluster.
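
The clustering trick can be sketched in plain Python. This is a toy illustration, not the quantization code: the function names, the 2-D points, and the centroids are all made up, and the result is only approximate when the true nearest point lies just across a cluster boundary.

```python
def dist2(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_clusters(points, centroids):
    """Assign every codebook point to its nearest centroid (done once)."""
    clusters = [[] for _ in centroids]
    for idx, p in enumerate(points):
        c = min(range(len(centroids)), key=lambda j: dist2(p, centroids[j]))
        clusters[c].append(idx)
    return clusters


def nearest_in_cluster(x, points, centroids, clusters):
    """Find the nearest centroid, then search only that cluster."""
    c = min(range(len(centroids)), key=lambda j: dist2(x, centroids[j]))
    members = clusters[c] or range(len(points))   # brute force if cluster is empty
    return min(members, key=lambda i: dist2(x, points[i]))


points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = [(0, 0.5), (10, 10.5)]
clusters = build_clusters(points, centroids)
hit = nearest_in_cluster((9.8, 10.9), points, centroids, clusters)
```

With N points split into about sqrt(N) clusters, each query touches on the order of 2 * sqrt(N) candidates instead of N.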

* iq2_kt - this is better

Using blocks of 32 and 16 bits per group of 8 weights
it beats iq2_xxs in terms of PPL by a significant margin.
It is 0.0625 bpw larger, but even if we go to 15 bits per
group of 8 (so 0.0625 bpw less than iq2_xxs), PPL is still
lower.
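
The bpw comparison above works out as follows, assuming one 4-bit scale per block of 32 (as in the earlier "4-bit quantized block scales" experiment) and ignoring any per-row metadata. The helper name `bpw` is hypothetical.

```python
def bpw(bits_per_group, group_size, scale_bits, block_size):
    """Bits per weight: payload bits plus the per-block scale."""
    return bits_per_group / group_size + scale_bits / block_size


IQ2_XXS_BPW = 2.0625

with_16_bits = bpw(16, 8, 4, 32)    # 2.0 + 0.125
with_15_bits = bpw(15, 8, 4, 32)    # 1.875 + 0.125

print(with_16_bits - IQ2_XXS_BPW, with_15_bits - IQ2_XXS_BPW)
```

Under those assumptions the two variants land exactly 0.0625 bpw above and below iq2_xxs.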

* iq2_kt - even better

Re-quantize after determining block scales
(at the expense of much longer quantization time).

* iq2_kt: CUDA dot product

Implemented as DMMV.
Very slow - just 81 t/s for LLaMA-3.1-8B.
Then again, Q2_K_S when forced to use DMMV only
gets 112 t/s vs 145 t/s via MMVQ. My memory is that
when the DMMV kernels were properly maintained/used,
DMMV was about on par with MMVQ for k-quants on my GPU.

* iq2_kt: very slightly faster CUDA dot product

* iq2_kt: f16 CUDA dot product

We arrive at 112 t/s.

* iq2_kt: faster f16 CUDA dot product

We arrive at 139 t/s (no FA), and 149 t/s (FA).

My RTX-4080 is ~20% slower than the RTX-6000 quoted in the
QTIP repository, so with FA (which I'm sure they also used)
we are at around ~180 t/s on their GPU, so almost matching
their performance.
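
The extrapolation to the other GPU is a one-line estimate. The sketch takes the "~20% slower" figure at face value and shows the two ways it can be read; neither number is a measurement.

```python
rtx4080_tps = 149.0      # FA figure quoted above
slowdown = 0.20          # "~20% slower", taken at face value

# Reading 1: the RTX-4080 delivers 80% of the RTX-6000 throughput.
estimate_a = rtx4080_tps / (1.0 - slowdown)
# Reading 2: the RTX-6000 is 20% faster than the RTX-4080.
estimate_b = rtx4080_tps * (1.0 + slowdown)

print(f"{estimate_b:.0f} to {estimate_a:.0f} t/s")
```

Both readings bracket the "around ~180 t/s" quoted in the text.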

* iq2_kt: faster f16 CUDA dot product

We arrive at 146 t/s (no FA), and 158 t/s (FA).
This is measured for LLaMA-3.1-8B with output.weight
left as f16.

* Minor

* Adding iq3_kt

3.125 bpw. So far does not look good on the PPL vs bpw plot.

* Forgotten change

* WIP

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.8322, which is
starting to be competitive/slightly better than other quants.

* WIP

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7892

* iq3_kt WIP: slowly improving

PPL(LLaMA-3.1-8B-Instruct, 8192) is now 6.7689 after shrinking
by 0.015 bpw by using iq4_k instead of q5_k for attn_v.

* iq3_kt WIP: speed up quantization

Nearly 60% improvement of quantization speed by having the
points belonging to a cluster copied to contiguous memory
during initialization, and then accessed sequentially while
searching for the closest point. LLaMA-3.1-8B now gets
quantized in ~150 seconds on the Ryzen-5975WX.

* iq3_kt speed up quantization

Same trick as last commit applied to iq2_kt. Here we get
an even larger speedup: quantization time on the Ryzen-5975WX
for LLaMA-3.1-8B drops to 195 seconds from 375 seconds!

* iq3_kt: CUDA dot product

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.2406
PPL(LLaMA-2-7B,            4096) = 6.4179

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* Adding iq4_kt - not competitive at this point

* WIP

* WIP

* iq4_kt: CUDA dot product

* iq4_kt: minor tweaks

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.1642
PPL(LLaMA-2-7B,            4096) = 6.3920

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 9.0297
PPL(LLaMA-2-7B,            4096) = 6.3913

Ah, quantization is faster too. About 20% faster.

* iq3_kt: small improvements and faster quantization

* iq2_kt: SOTA

We arrive at
PPL(LLaMA-3.1-8B-Instruct, 8192) = 8.9627
PPL(LLaMA-2-7B,            4096) = 6.3825

Quantization is faster too: ~200 seconds for LLaMA-3.1-8B
on Ryzen-5975WX.

* iq3_kt: small progress

* WIP

* iq4_kt: go to 4.0 bpw

15 bits per group of 4, plus 8 bit scales for blocks of 32.
This gives a slightly better PPL than iq4_kss.

* iq4_kt: very slightly better

at the expense of much longer quantization time.

* iq4_kt: failed attempt to adjust CUDA dot product

It was working for 4.125 bpw. But after changing to 4.0 bpw
there is something wrong and I don't see the bug.

* DRY

* DRY

* iq4_kt: CUDA dot product works

* DRY

* Report actual bpw

* Minor tweaks

* Checkpoint

Go to groups of 8 for iq3_kt. 2 x 8 = 16 bits for the magnitude
plus 1 bpw for the sign. It gives a visible improvement in the
PPL vs bpw plot, but that comes at the expense of much longer
quantization time (7.5 minutes for LLaMA-3.1-8B on the Ryzen-5975WX).

I also noticed that the 3INST generator is not actually generating a
Gaussian distribution. But going to a better generator means
readjusting all the hyper-parameters, so leaving it for later.

* WIP for IQ2_KT

* WIP - working basic iq2_kt

* still super slow (0.17t/s eval)

* flatten 3inst iters + avx2 (0.3t/s eval)

* iq3_kt (0.3t/s eval) and renames

* wip buggy iq4_KT

* fix (0.22t/s eval)

* naming and remove unused fn

* cleanup

* more cleanup

* delete unused and noncompiling mmvq functions

* Some performance tweaks

* Slightly faster iq2_kt

* port Trellis struct to iq3_kt, iq4_kt

* oops untracked files

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-23 09:17:52 +03:00
Kawrakow
3d92d7f802 Adding IQ5_KS - 5.25 bpw quants (#422)
* iq5_ks: basics

* iq5_ks: quantize

* iq5_ks: CUDA dequantize works

* iq5_ks: dot product works on CUDA

* iq5_ks: MMQ works

* iq5_ks: Zen4

* iq5_ks: AVX2

But it is not quite right, just like iq4_k, iq5_k, iq6_k, iq4_ks.
All these need fixing on AVX2.

* iq5_ks: NEON

* iq5_ks: Metal dequantize

* iq5_ks: Metal dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-15 16:02:39 +03:00
Kawrakow
ce2b0292e1 CUDA: faster FA TG for GQA models (#370)
* cuda: WIP MMA FA

* Use MMA for TG also when quantized

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2025-05-04 09:17:44 +03:00
Kawrakow
6b968f3894 Bitnet changes (#106)
* Adapting iq2_bn to work without separate scale tensors

Why? It is becoming burdensome to maintain the special Bitnet
conversion in convert_hf_to_gguf.py, so I think it is better
to make iq1_bn and iq2_bn just work with the mainline
conversion script (which does not generate scales).

* Adapting iq1_bn to work without separate scale tensors

* Adapting iq2_bn: CUDA dequantize

* Adapting iq2_bn: CUDA works

* Adapting iq1_bn: CUDA works

* Adapting iq1_bn, iq2_bn: NEON

* Adapting iq1_bn, iq2_bn: Metal

Dequantize works, but there is still something wrong
with the dot products.

* WIP

Absolutely don't see what is wrong with the iq1_bn and iq2_bn
vector dot product kernels.

* Remove iq1_tn and iq2_tn - Part 1

Now that iq1_bn and iq2_bn have per row scales, there is no
reason to also have iq1_tn and iq2_tn.

* Remove iq1_tn and iq2_tn - Part 2

* Bitnet: use the standard llm_build_kv to build self attention

My main motivation was to enable FA. But FA does not work anyway
because head size is 100 for the Bitnet ternary models
(and I had forgotten this little detail).

* Revert "Avoid rebuild of GGML graph for each token (#98)"

This reverts commit f2d315b46f.
As far as I can tell, the commit breaks Metal TG.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-25 13:08:43 +02:00
Kawrakow
76b97c8064 Adding IQ4_KSS: 4.0 bpw quants (#89)
* iq4_kss: WIP

* iq4_kss: CUDA dequantize works

So we can run perplexity. Sadly, the result does not look good
on the bpw vs quantization error plot.

* iq4_kss: slightly better quantization

* iq4_kss: another small quantization improvement

* iq4_kss: CUDA works

TG-128 performance is very decent with 131 t/s for LLaMA-3.1-8B.
In comparison, we have 123 t/s for q4_0 and 128 t/s for iq4_ks.
I.e., the reduced model size more than offsets the additional
bit fiddling required for iq4_kss.

* iq4_kss: new bit arrangement - CUDA and Zen4 work

Did not lose performance on CUDA. Zen4 is decent, but not great:
PP-512(LLaMA-3.1-8B) = 163 t/s.
TG-128 is of course better than other 4-bit quants due to smaller model size.
We get 14.5 t/s @ 8 threads.

* iq4_kss: ARM_NEON. Predictably very slow

* iq4_kss: Metal

PP is not too bad - just 10% slower than q4_0.
But TG is 30% slower, i.e., predictably bad.

* iq4_kss: somewhat faster Metal dot product

45.75 t/s -> 48.75 t/s.
Still 22% slower than q4_0

* iq4_kss: AVX2

Bad, but better than I expected.
PP-512(LLaMA-3.1-8B) = 167 t/s on the Ryzen-5950X.
I.e., with 32 AVX2 threads we get the performance of
16 Zen4 threads.

* iq4_kss: very slightly faster Metal dot product

48.7 t/s -> 49.3 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-16 15:18:26 +03:00
Kawrakow
910a134094 IQ2_KS: 2.1875 bpw non-linear quantization (#85)
* Experimenting

* iq2k: Try make_qx_quants for the scale

Slightly better for LLaMA-3.1, Gemma-2, slightly worse for
Qwen2.5

* iq2k with make_qx_quants: adjust scale

* iq2ks: basics

* iq2_ks: CUDA works

* iq2_ks: WIP

* iq2_ks: WIP

* iq2_ks: Zen4

* iq2_ks: AVX2

* iq2_ks: scalar dot product

* iq2_ks: ARM_NEON

* iq2_ks: Metal

* iq2_ks: faster Metal

LLaMA-3.1-8B:
PP-512 = 475.22 ± 0.37 t/s
TG-128 =  45.32 ± 0.03 t/s

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-13 13:34:30 +03:00
Kawrakow
b30c9e10d8 New SOTA quantization: 4.25 bpw IQ4_KS (#83)
* iq4_k_xxs: basics

* WIP + adding iq3_kl quantization mix

* iq4_xxs: this looks very viable compared to iq4_xs

At the same 4.25 bpw PPL is always better, for some models
significantly better. I'll rename to iq4_ks and keep it.

* iq4_xxs: CUDA dot product

We get TG-128 = 126 t/s for LLaMA-3.1-8B, compared to 123 t/s for q4_0.

* iq4_xxs: scalar CPU dot product

Also fix the breakage I caused with the dedicated work buffer
quantization portion when the multiplication is not done
via iqk_mul_mat.

* iq4_xxs: Zen4

I noticed that iq4_xs is wrong on Zen4 (and possibly AVX2).
Again the same mistake of packing int32_t back to int16_t,
which overflows occasionally (just occasionally, that's why the
result doesn't look completely wrong, so I didn't notice).
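
The failure mode can be reproduced with a small model of the narrowing step. The sketch assumes a saturating 32-to-16-bit pack (as done by instructions such as `_mm256_packs_epi32`); the commit does not say which instruction was involved, and the group size and function names are illustrative.

```python
INT16_MIN, INT16_MAX = -32768, 32767


def pack_saturate(v: int) -> int:
    """Narrow a 32-bit value to int16 with signed saturation."""
    return max(INT16_MIN, min(INT16_MAX, v))


def dot_narrowed(a, b, group=4):
    """Dot product where each group's int32 partial sum is packed to int16."""
    total = 0
    for i in range(0, len(a), group):
        partial = sum(x * y for x, y in zip(a[i:i + group], b[i:i + group]))
        total += pack_saturate(partial)           # information is lost here
    return total


small = dot_narrowed([10] * 4, [10] * 4)          # fits in int16, exact
large = dot_narrowed([127] * 4, [127] * 4)        # 64516 does not fit
```

Only partial sums beyond the int16 range are affected, which is why the result looks plausible most of the time and the bug is easy to miss.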

* Fix iq4_xs (Zen4)

* iq4_xxs: AVX2

* iq4_xxs: ARM_NEON

* iq4_xxs: Metal

* iq4_xxs: slightly faster TG on Metal

* iq4_xxs: rename to iq4_ks

After all, it is a smaller variant of iq4_k.

* iq3_kl: use iq4_ks instead of iq4_k/iq4_xs

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-09 12:54:40 +03:00
Kawrakow
cce49832c1 Adding Q6_0 (#77)
* Adding q6_0 - basics + AVX2/Zen4 working

* Adding q6_0: CUDA dequantize works, but not mmvq

* Adding q6_0: CUDA mmvq works

* Adding q6_0: CUDA cpy, so Q6_0 can be used for KV-cache

* Add q6_0 to CPU flash attention

Disappointing result: for LLaMA-3.2-1B, q6_0 K- and V-cache
gives about the same PPL as q8_0 K-cache and q4_0 V-cache,
while needing the exact same RAM.
I.e., what was the point?
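
The "exact same RAM" observation follows from the block sizes. The sketch assumes the usual layouts of 18, 26, and 34 bytes per block of 32 values for q4_0, q6_0, and q8_0, and that the K and V caches hold the same number of elements.

```python
from fractions import Fraction

BLOCK = 32
# Assumed bytes per block of 32 values (fp16 scale plus packed quants).
bytes_per_block = {"q4_0": 18, "q6_0": 26, "q8_0": 34}


def bpw(qtype: str) -> Fraction:
    """Bits per cached value for a given quantization type."""
    return Fraction(bytes_per_block[qtype] * 8, BLOCK)


def kv_bits_per_element(k_type: str, v_type: str) -> Fraction:
    """Average bits per cached value when K and V are the same size."""
    return (bpw(k_type) + bpw(v_type)) / 2


same = kv_bits_per_element("q6_0", "q6_0")
mixed = kv_bits_per_element("q8_0", "q4_0")
```

Both configurations come out at 6.5 bits per cached value, so the comparison is purely about quality.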

* q6_0: slightly better kv-cache result

Better than q8_0+q4_0, but not as good as q8_0+iq4_nl

* q6_0: works on ARM_NEON

* q6_0: dequantize works on Metal, but not vector dot product

* q6_0: it now works on Metal

Outperforms q5_0 by a significant margin. E.g.
| model                          |       size |     params | backend    | ngl | threads |          test |              t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | ------: | ------------: | ---------------: |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     44.02 ± 0.08 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         tg128 |     40.13 ± 0.12 |
| llama 8B Q6_0                  |   6.08 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    500.55 ± 0.32 |
| llama 8B Q5_0                  |   5.21 GiB |     8.03 B | Metal      | 100 |       4 |         pp512 |    448.02 ± 0.27 |

* q6_0: can now be used for kv-cache on Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-10-02 15:22:13 +03:00
Kawrakow
918ada20fa Add CUDA support for IQ1_TN (#45)
* iq1_tn: adding CUDA dequantize

* iq1_tn: adding CUDA dot product

* Delete commented out stuff

* Delete forgotten TODO

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-09-09 21:17:17 +03:00
Kawrakow
8f43e55103 Merge mainline - Aug 12 2024 (#17)
* Merge mainline

* Fix after merge

* Remove CI check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-12 15:14:32 +02:00
Iwan Kawrakow
cfb0410067 iq6_k: WIP (nothing works) 2024-08-09 16:00:31 +02:00
Kawrakow
a9f302ebe2 Adding IQ2_TN for use with ternary models (#13)
* iq2_tn: TriLM specific 2.0625 bpw quantization

Quantize/dequantize/scale dot product.

I get 46 t/s for the TriLM-3.9B without any SIMD!
Finally a compiler doing a decent job auto-vectorizing the
scalar implementation.

* iq2_tn: AVX512

Just reusing the k-quants template gets us to PP-512 = 376 t/s,
TG-128 = 47.6 t/s for TriLM-3.9B.

* iq2_tn: AVX512

With this tweak we get to PP-512 = 431 t/s.

* iq2_tn: AVX512

With this tweak we get TG-128 = 19.58 / 35.18 t/s for 1 / 2 threads.
At 4 threads we saturate at 48.41 t/s, and then performance slowly
degrades with increasing number of threads.

* iq2_tn: AVX2

PP512 = 440 t/s on the Ryzen-5975WX.
We should be able to do better.

* iq2_tn: initial NEON version

* iq2_tn: NEON

For TriLM-3.9B running on the M2-Max we get PP-512 = 193.5 t/s,
TG-128 = 75.5 t/s. This is in line with what we have for
iq2_bn and 3.3B Bitnet.

* iq2_tn: Metal

For TriLM-3.9B on a 30-core M2-Max we get PP-512 = 890 t/s,
TG-128 = 98.5 t/s.

* iq2_tn: CUDA

For TriLM-3.9B running on RTX-4080 we get PP-512 = 9936 t/s,
TG-128 = 299.2 t/s.

* iq2_tn: AVX2 PP improvement

We now get PP-512 = 490.73 t/s for TriLM-3.9B on the Ryzen-5975WX.
We have PP-512 = 636.61 t/s for Bitnet-3B quantized with iq2_bn.
Bitnet-3B is actually 3.4B, TriLM-3.9B is 3.99B, so we would
expect 3.43/3.99 * 636 = 546 t/s, so it seems we still have something
that is not quite optimal in iq2_tn.
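
The expectation above assumes prompt-processing speed scales inversely with parameter count. A short calculation with the unrounded Bitnet figure, using only numbers quoted in this commit:

```python
bitnet_params_b = 3.43       # Bitnet-3B, billions of parameters
trilm_params_b = 3.99        # TriLM-3.9B
bitnet_pp512 = 636.61        # t/s with iq2_bn
measured_pp512 = 490.73      # t/s with iq2_tn

# Expected speed if throughput were inversely proportional to model size.
expected_pp512 = bitnet_params_b / trilm_params_b * bitnet_pp512
shortfall = 1.0 - measured_pp512 / expected_pp512

print(f"expected {expected_pp512:.0f} t/s, shortfall {shortfall:.1%}")
```

The gap of roughly a tenth is the part attributed to iq2_tn not yet being optimal.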

* iq2_tn: small NEON improvement

For TriLM-3.9B we now get PP-512 = 206.6 t/s and TG-128 = 76.4 t/s.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-08-07 07:56:09 +02:00
Iwan Kawrakow
4f237d44f6 iq3_k: Basics
Quantize/dequantize, CUDA dequantize.
PPL of LLaMA-3.1-8B is better than iq3_s and iq3_m.
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
22d1568c1c iq5_k: CUDA dot product still not working 2024-08-01 09:38:06 +02:00
Iwan Kawrakow
5d341757bc iq5_k: Basics
Quantize/dequantize, CUDA dequantize
2024-08-01 09:38:06 +02:00
Iwan Kawrakow
c85e139c68 iq2_k: Basics
Quantize/dequantize, CUDA dequantize, AVX512 iqk_mul_mat.
2024-08-01 09:38:06 +02:00
Kawrakow
291066e6df IQ4_K: SOTA 4-bit quantization (#6)
* iq4_k: basics

* quantize/dequantize works
* CUDA dequantize works and one can run PPL calcs. I get
  PPL = 6.5258 for LLaMA-3.1-8B, which is 1.77% above fp16.
  In comparison, q4_K_S (same size) is 2.88% above fp16.
* TG on CUDA does not work. Johannes has changed the way i-quant dot
  products are done, so need to sort out what he had in mind
* iqk_mul_mat is not implemented.

* iq4_k: TG now works on CUDA

* iq4_k: AVX512 implementation

For LLaMA-3.1-8B we get PP-512 = 182.6 t/s, TG-128 = 13.6 t/s,
so almost the same as q4_K_S.

* iq4_k: AVX2 implementation

For LLaMA-3.1-8B we get PP-512 = 203.1 t/s, TG-128 = 12.9 t/s
on the Ryzen-5975WX.

* iq4_k: NEON implementation

For LLaMA-3.1-8B we get PP-512 = 60.7 t/s, TG-128 = 25.0 t/s
on the M2-Max. TG is on par with q4_K_S, PP is ~10% slower.

* iq4_k: Metal implementation

For LLaMA-3.1-8B we get PP-512 = 445 t/s, TG-128 = 46.3 t/s
on a 30-core M2-Max GPU. This is to be compared with (currently)
PP-512 = 460 t/s, TG-128 = 51 t/s for q4_K_S.

* iq4_k: scalar dot product

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-28 12:11:59 +02:00
Kawrakow
154e0d75fc Merge mainline llama.cpp (#3)
* Merging mainline - WIP

* Merging mainline - WIP

AVX2 and CUDA appear to work.
CUDA performance seems slightly (~1-2%) lower, as is so often
the case with llama.cpp/ggml after some "improvements" have been made.

* Merging mainline - fix Metal

* Remove check

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
2024-07-27 07:55:01 +02:00