ik_llama.cpp/ggml/src at 957a6e79119ccb4c41ea4fb346db17ade733c2d8 - ik_llama.cpp - Public git mirror

ikawrakow/ik_llama.cpp

Mirror of https://github.com/ikawrakow/ik_llama.cpp.git, synced 2026-03-13 23:40:09 +00:00

Kawrakow 92ceda1d06 FlashMLA-3 for DeepSeek models on CUDA (#386)

* CUDA WIP: support for FlashMLA-3

* Much better

The issue was that I did not change the number of warps
used for 3D matrix multiplications (wk_b * kv_cache, MoE),
so we ended up using 4 warps for TG. By going to 1 warp
in these cases, we get a significant boost in TG performance
(tested with DeepSeek-Lite)

* Sadly, the previous commit was wrong

* Finalizing

* Also add these

* Minor

* Minor tweak

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-05-07 17:38:22 +03:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

FlashMLA-3 for DeepSeek models on CUDA (#386)

2025-05-07 17:38:22 +03:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

Fix DeepSeek q8_0 cache (#391)

2025-05-07 12:06:49 +03:00

kompute @ 4565194ed7

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

kompute-shaders

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

CMakeLists.txt

CUDA: faster FA TG for GQA models (#370)

2025-05-04 09:17:44 +03:00

ggml-aarch64.c

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-aarch64.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-alloc.c

Fix ARM_NEON build failure due to q8_2 (#303)

2025-04-01 13:48:20 +02:00

ggml-backend-impl.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-backend.c

FlashMLA-2 (CPU): faster and smaller compute buffer size (#253)

2025-03-13 12:07:43 +02:00

ggml-blas.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-cann.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-common.h

Add copyright notices (#317)

2025-04-07 10:43:26 +02:00

ggml-cuda.cu

FlashMLA-3 for DeepSeek models on CUDA (#386)

2025-05-07 17:38:22 +03:00

ggml-impl.h

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-kompute.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-metal.m

Metal: FA and FlashMLA (#310)

2025-04-03 17:54:25 +02:00

ggml-metal.metal

Metal: FA and FlashMLA (#310)

2025-04-03 17:54:25 +02:00

ggml-quants.c

Improved IQ1_M quantization (#327)

2025-04-13 10:37:55 +02:00

ggml-quants.h

IQ1_M_R4: better 1.75 bpw quants (#187)

2025-02-06 14:08:52 +02:00

ggml-rpc.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-sycl.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-vulkan.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml.c

CPU FA improvements (#351)

2025-04-29 07:19:43 +02:00