* soft_cap_max: initial CPU version of fused softcap + soft_max

With this vanilla CPU implementation I'm already getting a ~3% speedup for Gemma-2-9b and a prompt of 8192 tokens.

* soft_cap_max: WIP - something is wrong with CUDA

* soft_cap_max: looks good on CPU and CUDA

* Add softcap to flash attention

Just CPU and CUDA for now (but, as we know, flash attention on the CPU is useless in llama.cpp). On CUDA this improves PP performance quite a bit, especially for long contexts. E.g., for PP-16384 I now get 3777 t/s. Without this change one cannot use FA, and one gets 2300 t/s (after fusing softcap and softmax), or 2000 t/s without the fused softcap+softmax. In comparison, mainline llama.cpp gets PP-16384 = 1549 t/s before PR-8542 (where Johannes Gaessler also added softcap to FA), and PP-16384 = 3097 t/s after that PR.

* soft_cap_max: Metal

* Flash attention with softcap: Metal

---------

Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>
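For reference, the fused operation amounts to applying the Gemma-2 style soft cap, softcap * tanh(x / softcap), and the softmax normalization in a single pass over each row of attention logits, instead of materializing the capped tensor and then running a separate softmax. Below is a minimal CPU sketch of that row-wise computation; the function name softcap_max_row and the parameter names are illustrative only, and the mask/ALiBi handling that the real ggml softmax op supports is omitted:

```cpp
// Reference sketch (not the repo's actual kernel): fused softcap + softmax
// over one row of attention logits, Gemma-2 style.
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <cstdio>
#include <vector>

static void softcap_max_row(float * x, size_t n, float scale, float softcap) {
    // 1. scale and soft-cap each logit in one pass, tracking the row maximum
    float max_val = -INFINITY;
    for (size_t i = 0; i < n; ++i) {
        x[i] = softcap * std::tanh(scale * x[i] / softcap);
        max_val = std::max(max_val, x[i]);
    }
    // 2. exponentiate (shifted by the max for numerical stability) and sum
    float sum = 0.0f;
    for (size_t i = 0; i < n; ++i) {
        x[i] = std::exp(x[i] - max_val);
        sum += x[i];
    }
    // 3. normalize to probabilities
    const float inv_sum = 1.0f / sum;
    for (size_t i = 0; i < n; ++i) {
        x[i] *= inv_sum;
    }
}

int main() {
    std::vector<float> logits = {1.0f, 2.0f, 3.0f, 4.0f};
    // example values: scale = 1/sqrt(head_dim), softcap = 50 (Gemma-2 attention cap)
    softcap_max_row(logits.data(), logits.size(), 1.0f/16.0f, 50.0f);
    for (float p : logits) printf("%.4f ", p);
    printf("\n");
    return 0;
}
```

The flash attention change applies the same tanh cap to the attention scores inside the FA kernel, which is why FA becomes usable for softcap models at all; the sketch above only illustrates the non-FA fused path.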