ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-03-09 05:20:01 +00:00

Files

Iwan Kawrakow 1b834ac6e4 Flash attention: templated implementation

Needed to handle the different head sizes of different
LLMs, batch sizes that are not a multiple of 8, etc.

I see 2-3% performance degradation.

It is one of those things
that I don't understand, but really would like to:

I have an implementation of a function that depends on a compile-time
constant. I get performance X.
I then turn the implementation into a template, where the former
compile-time constant is a template parameter, and I instantiate the template
for a bunch of different values, one of which is the former compile-time
constant. I observe performance c*X, where c is almost always
less than 1, and depending on how unlucky we get, it can be as low
as 0.5 or so. But in my simple-minded understanding, I expect
the template instantiation with the former compile-time constant
to turn into the exact same function as the former non-templated
implementation, and so I expect the exact same performance.

i.e., if I have some function
void some_function(...) {
    constexpr int N = 128;
    ... // code that depends on N
}

and I now write
template <int N>
void some_function_T(...) {
    ... // same code as in some_function() that depends on N
}

and I say
void wrapper_function(int N) {
    switch (N) {
        case  64: some_function_T< 64>(); break;
        case 128: some_function_T<128>(); break;
        ...
    }
}
I expect wrapper_function(128) to have the exact same performance as
some_function() (run time of some_function() is long enough to have the
additional function call overhead be completely negligible).
This is the reason I'm using a template in the first place instead
of just having void some_function(int N).

But no. Tough luck.

2024-08-31 13:10:36 +03:00

cmake

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

include

Fused soft cap and SIMD-ified GeLU (#9)

2024-08-20 17:15:47 +03:00

src

Flash attention: templated implementation

2024-08-31 13:10:36 +03:00

.gitignore

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

CMakeLists.txt

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00