ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-27 00:24:11 +00:00

Files

Kawrakow 808f08787a AVX2 Flash Attention (#48)

* First version of AVX2 Flash attention

I took the Zen4 implementation and moved the
platform-specific parts into methods of a struct that provides
data loading/storing, conversions, multiply, add, etc.

This is most likely not optimal: the Zen4 strategy was
designed around having 32 512-bit registers, so it can
keep 4X more data in vector registers than AVX2 can
with 16 x 256-bit.

It still gives a small speedup (~4% at 2048 tokens) for Gemma-2b.

* Fix Zen4 parts broken by the AVX2 change

* Try smaller q_step - no improvement

* Fix ARM_NEON

I had forgotten to guard the AVX2/Zen4 implementation against __aarch64__.

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2024-09-10 19:17:04 +03:00
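
The approach described in the commit message above (platform-specific operations behind the methods of a struct, a guard against `__aarch64__`, and the 4X register-capacity gap) can be sketched roughly as follows. This is a minimal illustration, not code from the repository: `ScalarOps`, `Avx2Ops`, `NativeOps`, `dot`, and the two constants are hypothetical names, and a plain dot product stands in for the Flash Attention kernel.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX2__) && !defined(__aarch64__)
#include <immintrin.h>
#endif

// All names below are hypothetical and chosen only for illustration.

// Vector register capacity in bits: register count * register width.
constexpr int kZen4RegBits = 32 * 512;  // AVX-512 (Zen4): 32 x 512-bit
constexpr int kAvx2RegBits = 16 * 256;  // AVX2: 16 x 256-bit
static_assert(kZen4RegBits == 4 * kAvx2RegBits, "Zen4 holds 4X more data in registers");

// Portable fallback: a "vector" is a single float.
struct ScalarOps {
    using Vec = float;
    static constexpr std::size_t width = 1;
    static Vec zero() { return 0.0f; }
    static Vec load(const float* p) { return *p; }
    static Vec madd(Vec a, Vec b, Vec acc) { return a * b + acc; }
    static float reduce(Vec v) { return v; }
};

#if defined(__AVX2__) && !defined(__aarch64__)
// AVX2 variant: same interface, 8 floats per 256-bit register.
struct Avx2Ops {
    using Vec = __m256;
    static constexpr std::size_t width = 8;
    static Vec zero() { return _mm256_setzero_ps(); }
    static Vec load(const float* p) { return _mm256_loadu_ps(p); }
    // Separate multiply and add, so no FMA support is assumed.
    static Vec madd(Vec a, Vec b, Vec acc) { return _mm256_add_ps(_mm256_mul_ps(a, b), acc); }
    static float reduce(Vec v) {
        float tmp[8];
        _mm256_storeu_ps(tmp, v);
        float s = 0.0f;
        for (int i = 0; i < 8; ++i) s += tmp[i];
        return s;
    }
};
using NativeOps = Avx2Ops;
#else
using NativeOps = ScalarOps;
#endif

// Generic kernel: written once against the Ops interface.
template <typename Ops>
float dot(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    typename Ops::Vec acc = Ops::zero();
    std::size_t i = 0;
    for (; i + Ops::width <= n; i += Ops::width)
        acc = Ops::madd(Ops::load(a + i), Ops::load(b + i), acc);
    float sum = Ops::reduce(acc);
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar tail
    return sum;
}
```

Calling `dot<NativeOps>(a, b, n)` picks the AVX2 path when the compiler defines `__AVX2__` and falls back to the scalar path elsewhere, including on ARM builds. The kernel body stays the same across platforms; only the struct passed as `Ops` changes.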

Name            Last commit                                                 Date
cmake           Merge mainline llama.cpp (#3)                               2024-07-27 07:55:01 +02:00
include         Adding IQ1_TN - 1.6875 bpw for TriLM ternary models (#44)   2024-09-09 14:56:34 +03:00
src             AVX2 Flash Attention (#48)                                  2024-09-10 19:17:04 +03:00
.gitignore      Merge mainline llama.cpp (#3)                               2024-07-27 07:55:01 +02:00
CMakeLists.txt  Merge mainline - Aug 12 2024 (#17)                          2024-08-12 15:14:32 +02:00