* q4_K: dequantize to q8_1_r8 for batch >= 32

  We get 268 t/s, up from 186 t/s.

* q4_K: GEMM with q8_2_X4

* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8

* Remove the scales, they are not needed

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
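
The reason repacking pays off at batch >= 32: a direct 4-bit kernel must unpack nibbles and apply scales for every right-hand-side column, while a one-time dequantization to an 8-bit row-interleaved format pays that cost once and then runs a plain int8 GEMM over all columns. Below is a minimal sketch of that dispatch, not the actual ik_llama.cpp code: `Q4Block`, `Q8Block`, `repack_q4_to_q8`, and `vec_dot_q8` are hypothetical stand-ins, and the real `block_q4_K`/`q8_1_r8` types additionally carry 6-bit sub-block scales and mins and interleave 8 rows for SIMD-friendly loads.

```c++
// Sketch of the batch-size dispatch described above; all names are
// hypothetical simplifications of the real block_q4_K / q8_1_r8 types.
#include <cstdint>
#include <cstdio>

constexpr int kBlock = 32;

// Simplified 4-bit block: one scale, 32 nibbles packed into 16 bytes.
struct Q4Block {
    float   d;
    uint8_t qs[kBlock / 2];
};

// Simplified 8-bit block: one scale, 32 signed 8-bit quants.
struct Q8Block {
    float  d;
    int8_t qs[kBlock];
};

// One-time repack: unpack nibbles to int8 once, so the hot GEMM loop
// reads 8-bit values directly instead of unpacking them per column.
static Q8Block repack_q4_to_q8(const Q4Block & b) {
    Q8Block out;
    out.d = b.d;
    for (int i = 0; i < kBlock / 2; ++i) {
        out.qs[2*i + 0] = (int8_t)((b.qs[i] & 0x0F) - 8);
        out.qs[2*i + 1] = (int8_t)((b.qs[i] >>   4) - 8);
    }
    return out;
}

// Dot product of a repacked weight block with an 8-bit activation block.
static float vec_dot_q8(const Q8Block & w, const Q8Block & x) {
    int32_t acc = 0;
    for (int i = 0; i < kBlock; ++i) {
        acc += (int32_t)w.qs[i] * (int32_t)x.qs[i];
    }
    return w.d * x.d * (float)acc;
}

int main() {
    // Toy data: one 4-bit weight block, one 8-bit activation block.
    Q4Block w4 = {0.05f, {}};
    for (int i = 0; i < kBlock / 2; ++i) w4.qs[i] = (uint8_t)(i * 17);
    Q8Block x = {0.02f, {}};
    for (int i = 0; i < kBlock; ++i) x.qs[i] = (int8_t)(i - 16);

    const int n_batch = 64; // number of right-hand-side columns
    if (n_batch >= 32) {
        // Large batch: the repack cost is paid once and amortized
        // across all columns; each column then takes the int8 path.
        Q8Block w8 = repack_q4_to_q8(w4);
        std::printf("int8 path: %g\n", vec_dot_q8(w8, x));
    } else {
        // Small batch: a direct 4-bit kernel (not shown) avoids the
        // repack overhead.
        std::printf("direct q4 path\n");
    }
    return 0;
}
```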