ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-05-11 08:30:19 +00:00

Files

Kawrakow cbe2bca1e6 Faster MLA prompt processing (#205)

* Do not allocate / report caches that are not used

It is either the standard KV cache or MLA cache, not both.

* Rename X_pe to X_rope

Much easier to follow, at least for my brain, when we have
  X_rope : rotational position encoding
  X_nope :         no position encoding
instead of X_pe and X_nope, where I was wondering wtf is 'pe'
and 'nope'.

* WIP

* WIP

* WIP

* WIP

* Warn user when disabling MLA

* MLA: compile time option to not use transposed KV cache

Cuts KV cache size nearly in half at the expense of slower
TG performance for long contexts (it becomes similar to
no-MLA).

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>

2025-02-13 11:50:20 +02:00

CMakeLists.txt

Be able to repack tensors at run time (#147)

2024-12-17 14:16:34 +01:00

llama-grammar.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

llama-grammar.h

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

llama-impl.h

Time to fix replace_all (#68)