ik_llama.cpp/tests at 46862d725bf5ef4abed676b171cd53178ef72cdc - ik_llama.cpp - Public git mirror

ikawrakow/ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-26 08:04:09 +00:00

Iwan Kawrakow 46862d725b Add softcap to flash attention

Just CPU and CUDA for now (but, as we know, flash attention
on the CPU is useless in llama.cpp).

On CUDA this improves PP performance quite a bit, especially for
long contexts. E.g., for PP-16384, I now get 3777 t/s.
Without this change, one cannot use FA, and one gets 2300 t/s
(after fusing softcap and softmax), or 2000 t/s without the
fused softcap+softmax.

In comparison, mainline llama.cpp has PP-16384 = 1549 t/s before
PR-8542 (where Johannes Gaessler has also added softcap to FA),
and PP-16384 = 3097 t/s after this PR.

2024-08-26 18:22:29 +03:00
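
For readers unfamiliar with the operation named in the commit message, here is a minimal CPU sketch of softcap followed by softmax on one row of attention logits. It assumes softcap means the usual logit capping `softcap * tanh(x / softcap)`. The function name `softcap_softmax` is hypothetical and is not taken from the repository, and the actual CUDA kernels are organized differently.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Hypothetical helper for illustration only (not the repository's code).
// Applies softcap, then a numerically stable softmax, to one row of logits.
std::vector<float> softcap_softmax(const std::vector<float>& logits, float softcap) {
    std::vector<float> out(logits.size());
    if (logits.empty()) {
        return out;
    }

    // Pass 1: cap each logit into (-softcap, +softcap) and track the row maximum.
    float max_val = -INFINITY;
    for (size_t i = 0; i < logits.size(); ++i) {
        out[i] = softcap * std::tanh(logits[i] / softcap);
        max_val = std::max(max_val, out[i]);
    }

    // Pass 2: exponentiate relative to the maximum and accumulate the sum.
    float sum = 0.0f;
    for (float& v : out) {
        v = std::exp(v - max_val);
        sum += v;
    }

    // Pass 3: normalize so the row sums to 1.
    for (float& v : out) {
        v /= sum;
    }
    return out;
}
```

Doing the capping inside the same pass that finds the row maximum is what "fusing" softcap and softmax amounts to in this sketch: the capped logits are never written out as a separate intermediate tensor.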

.gitignore

tests : gitignore ggml-common.h

2024-03-09 14:17:11 +02:00

CMakeLists.txt

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

get-model.cpp

ci : add model tests + script wrapper (#4586)

2024-01-26 14:18:00 +02:00

get-model.h

ci : add model tests + script wrapper (#4586)

2024-01-26 14:18:00 +02:00

run-json-schema-to-grammar.mjs

json-schema-to-grammar improvements (+ added to server) (#5978)

2024-03-21 11:50:43 +00:00

test-autorelease.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-backend-ops.cpp

Add softcap to flash attention

2024-08-26 18:22:29 +03:00

test-c.c

Nomic Vulkan backend (#4456)

2024-01-29 15:50:50 -05:00

test-chat-template.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-double-float.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-grad0.cpp

ggml : refactor rope norm/neox (#7634)

2024-06-05 11:29:20 +03:00

test-grammar-integration.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-grammar-parser.cpp

grammars: x{min,max} repetition operator (#6640)

2024-06-06 10:07:06 +01:00

test-json-schema-to-grammar.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-llama-grammar.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-model-load-cancel.cpp

ggml : add numa options (#5377)

2024-02-16 11:31:07 +02:00

test-opt.cpp

code : normalize enum names (#5697)

2024-02-25 12:09:09 +02:00

test-quantize-fns.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-quantize-perf.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-rope.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-sampling.cpp

Merge mainline - Aug 12 2024 (#17)

2024-08-12 15:14:32 +02:00

test-tokenizer-0.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-tokenizer-0.py

py : logging and flake8 suppression refactoring (#7081)

2024-05-05 08:07:48 +03:00

test-tokenizer-0.sh

tests : fix test-tokenizer-0.sh

2024-05-28 15:04:09 +03:00

test-tokenizer-1-bpe.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-tokenizer-1-spm.cpp

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00

test-tokenizer-random.py

Merge mainline llama.cpp (#3)

2024-07-27 07:55:01 +02:00