ik_llama.cpp/src at 1cb7e1bf39dceb8bea8152e589f92b55da25ad21 - ik_llama.cpp - Public git mirror

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-20 13:14:09 +00:00

Kawrakow 82c4f27332 Fuse the attention gate in Step-3.5-Flash (#1244)

* WIP

* This works but is slow

* Turn off the up / gate clamps for now

* OK we need the clamping

* Fuse the clamp (CUDA)

* Fuse the clamp (CPU)

* WIP

* Be able to use merged q, k, v

* Be able to use merged up/gate experts

* Fuse the clamp (CUDA mmvq)

* WIP: graph parallel for Step-3.5

* WIP

* This should be it

* Cleanup

* Fix merge

* Not working attempt to extend fused_mul_unary to the Step-3.5 case

* It works now, but performance gain is very minor

2026-02-07 07:56:58 +02:00

Merge vulkan code from mainline up to commit of 6/28/2025 (#563)

2025-07-02 08:49:42 +02:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

Fuse the attention gate in Step-3.5-Flash (#1244)

2026-02-07 07:56:58 +02:00

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

Step-3.5-Flash support (#1231)

2026-02-05 08:13:22 +02:00

kompute @ 4565194ed7

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

kompute-shaders

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

Port of Qwen3-VL support from mainline (#883)

2025-11-04 19:20:54 +02:00

CMakeLists.txt

Remove llamafile remnants (#1179)

2026-01-22 13:20:23 +02:00

ggml-aarch64.c

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-aarch64.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

ggml-alloc.c

Enable CUDA graphs for MoE models + GPT-OSS support (#689)

2025-08-15 09:18:07 +03:00

ggml-backend-impl.h

Merge vulkan code from mainline up to commit of 6/28/2025 (#563)

2025-07-02 08:49:42 +02:00

ggml-backend.cpp

Fix build failure when OpenMP is not available (#1171)

2026-01-22 12:26:23 +02:00

ggml-blas.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

ggml-cann.cpp

Merge vulkan code from mainline up to commit of 6/28/2025 (#563)

2025-07-02 08:49:42 +02:00

ggml-common.h

AVX512+AVXVNNI GEMM implementation for quants using Q8_K for activations (#710)

2025-08-22 06:27:07 +03:00

ggml-cuda.cu

Change default FA offset to ln(2) (#1235)

2026-02-05 13:42:53 +02:00

ggml-impl.h

MXFP4 (#682)

2025-08-09 08:40:18 +03:00

ggml-kompute.cpp

Merge vulkan code from mainline up to commit of 6/28/2025 (#563)

2025-07-02 08:49:42 +02:00

ggml-metal.m

MXFP4 (#682)

2025-08-09 08:40:18 +03:00

ggml-metal.metal

MXFP4 (#682)

2025-08-09 08:40:18 +03:00

ggml-quants.c

Fix avx2 GEMM mess (v2) (#724)

2025-08-27 08:03:47 +03:00

ggml-quants.h

IQ1_M_R4: better 1.75 bpw quants (#187)

2025-02-06 14:08:52 +02:00

ggml-rpc.cpp

server: improve speed of speculative decoding (#1119)

2026-01-10 08:01:22 +02:00

ggml-sycl.cpp

Merge vulkan code from mainline up to commit of 6/28/2025 (#563)

2025-07-02 08:49:42 +02:00

ggml-vulkan.cpp

Port of Qwen3-VL support from mainline (#883)

2025-11-04 19:20:54 +02:00

ggml.c

Fuse the attention gate in Step-3.5-Flash (#1244)

2026-02-07 07:56:58 +02:00