Reduce size of compute buffers (#237)

* This reduces compute buffer size for MLA

* This should accomplish the same for standard attention

* Much better

* Better concat for contiguous tensors

If all the op does is concatenate the second tensor
to the first, why would we want a loop? A sketch of the idea follows.
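For contiguous tensors the whole concat reduces to two block copies. A minimal sketch of that fast path (plain C++, not the actual ik_llama.cpp code; the function name and signature are illustrative):

```cpp
// Minimal sketch of the contiguous fast path: when dst, src0 and src1 are all
// contiguous and the concatenation happens along the slowest-varying dimension,
// the bytes of src1 simply follow the bytes of src0 in the destination, so two
// memcpy calls replace the per-row copy loop.
#include <cstddef>
#include <cstring>

void concat_contiguous(void * dst, const void * src0, size_t nbytes0,
                       const void * src1, size_t nbytes1) {
    std::memcpy(dst, src0, nbytes0);                                // first tensor
    std::memcpy(static_cast<char *>(dst) + nbytes0, src1, nbytes1); // second tensor appended
}
```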

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
Commit: e787c00141 (parent: 472b4c37c1)
Author: Kawrakow
Date: 2025-03-01 08:25:27 +02:00
7 changed files with 236 additions and 79 deletions

@@ -175,7 +175,8 @@ struct gpt_params {
     bool simple_io         = false; // improves compatibility with subprocesses and limited consoles
     bool cont_batching     = true;  // insert new sequences for decoding on-the-fly
     bool flash_attn        = false; // flash attention
-    int  mla_attn          = false; // MLA 0: standard attention, 1: MLA with K and transposed V cache, 2: MLA with just K cache
+    int  mla_attn          = 0;     // MLA 0: standard attention, 1: MLA with K and transposed V cache, 2: MLA with just K cache
+    int  attn_max_batch    = 0;     // Max batch size to use when computing attention (only applicable if flash_attn = false)
     bool fused_moe_up_gate = false; // fused up*unary(gate) op for MoE models
     bool input_prefix_bos  = false; // prefix BOS to user inputs, preceding input_prefix
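For context on why the new attn_max_batch field shrinks the compute buffers: the KQ matrix for a batch of queries against the KV cache grows linearly with the batch size, so processing the queries in chunks of at most attn_max_batch tokens bounds the largest attention temporary. A back-of-the-envelope sketch of the buffer arithmetic, with all names hypothetical (the actual chunking lives in the graph-building code):

```cpp
// Illustrative only: rough number of f32 elements of KQ scratch needed when
// attention is computed in chunks of at most attn_max_batch queries
// (0 = no cap, matching the default above). All names here are hypothetical.
#include <algorithm>
#include <cstddef>
#include <cstdio>

size_t kq_scratch_floats(size_t n_tokens, size_t n_kv, size_t n_head, int attn_max_batch) {
    const size_t chunk = attn_max_batch > 0 ? std::min<size_t>(n_tokens, (size_t) attn_max_batch)
                                            : n_tokens;
    return n_head * chunk * n_kv; // one KQ slice per head for the current chunk
}

int main() {
    // e.g. 32 heads, a 2048-token batch against 32768 cached tokens:
    //   uncapped      -> 32*2048*32768 floats = 8 GiB of f32 scratch
    //   capped at 256 -> 32*256*32768  floats = 1 GiB
    std::printf("uncapped: %zu floats\n", kq_scratch_floats(2048, 32768, 32, 0));
    std::printf("capped:   %zu floats\n", kq_scratch_floats(2048, 32768, 32, 256));
    return 0;
}
```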