ik_llama.cpp/src at b7768e203f3283d601c9d3a0cd34dfc76b40aa87 - ik_llama.cpp - Public git mirror

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-20 21:24:08 +00:00

Kawrakow b7768e203f Faster CPU prompt processing for Q4_K and Q5_K (#525)

* q4_K: dequantize to q8_1_r8 for batch >= 32

We get 268 t/s, up from 186 t/s.

* q4_K: GEMM with q8_2_X4

* q5_K: GEMM with q8_2_X4 and repack to q8_1_r8

* Remove the scales, they are not needed

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-06-13 07:58:15 +03:00
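
The commit message above describes switching Q4_K to a dequantize-and-repack path once the batch reaches 32, and reports 268 t/s versus 186 t/s. The following is only a minimal sketch of that kind of batch-size dispatch, not the repository's code: the names `MatmulPath`, `kRepackMinBatch`, `choose_path`, and `speedup` are hypothetical, and the only values taken from the commit message are the threshold of 32 and the two throughput figures. Presumably the repack is worthwhile for larger batches because its one-time cost is spread over more rows of the right-hand matrix, but the commit message does not state this.

```cpp
#include <cstddef>

// Hypothetical illustration; none of these names come from ik_llama.cpp.
enum class MatmulPath {
    DirectQuantized,  // operate on the quantized blocks as stored
    RepackThenGemm    // dequantize/repack first (e.g. to q8_1_r8), then run GEMM
};

// Threshold taken from the commit message ("for batch >= 32").
constexpr std::size_t kRepackMinBatch = 32;

// Choose the code path from the number of rows processed together.
constexpr MatmulPath choose_path(std::size_t batch_size) {
    return batch_size >= kRepackMinBatch ? MatmulPath::RepackThenGemm
                                         : MatmulPath::DirectQuantized;
}

// Ratio of new to old throughput in tokens per second.
constexpr double speedup(double new_tps, double old_tps) {
    return new_tps / old_tps;
}
```

With the figures quoted in the commit message, `speedup(268.0, 186.0)` is roughly 1.44, that is, about 44% higher prompt-processing throughput for that step.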

..

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

IQ1_M_R4 CUDA implementation (#494)

2025-06-05 19:13:51 +03:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

Faster CPU prompt processing for Q4_K and Q5_K (#525)

2025-06-13 07:58:15 +03:00

kompute @ 4565194ed7

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

kompute-shaders

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

CMakeLists.txt

Better strategy for GPU offload (#520)

2025-06-12 19:25:11 +03:00

ggml-aarch64.c

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-aarch64.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-alloc.c

Fix ARM_NEON build failure due to q8_2 (#303)

2025-04-01 13:48:20 +02:00

ggml-backend-impl.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-backend.c

Fix non rpc build error (#506)

2025-06-08 17:27:00 +03:00

ggml-blas.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-cann.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-common.h

Trellis quants with CPU inference (#441)

2025-05-23 09:17:52 +03:00

ggml-cuda.cu

Better strategy for GPU offload (#520)

2025-06-12 19:25:11 +03:00

ggml-impl.h

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-kompute.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-metal.m

Metal implementatio for the trellis quants. (#475)

2025-06-01 15:23:44 +03:00

ggml-metal.metal

Metal implementatio for the trellis quants. (#475)

2025-06-01 15:23:44 +03:00

ggml-quants.c

Trellis quants with CPU inference (#441)

2025-05-23 09:17:52 +03:00

ggml-quants.h

IQ1_M_R4: better 1.75 bpw quants (#187)

2025-02-06 14:08:52 +02:00

ggml-rpc.cpp

Fix non rpc build error (#506)

2025-06-08 17:27:00 +03:00

ggml-sycl.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-vulkan.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml.c

Faster CPU prompt processing for Q4_K and Q5_K (#525)

2025-06-13 07:58:15 +03:00