Commit Graph

12 Commits

Author SHA1 Message Date
Iwan Kawrakow
ad53eabf87 iqk_mul_mat: be independent of llamafile_sgemm (WIP)
* Remove iqk_mul_mat from llamafile_sgemm
* Pass tensor types and strides to iqk_mul_mat

It is marked WIP because it has only been tested on __aarch64__.
2024-06-22 12:02:50 +03:00
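The interface change described in this commit (passing element types and row strides explicitly instead of relying on llamafile_sgemm's layout assumptions) can be pictured with the following hypothetical sketch; the names and signature are illustrative only, not the actual iqk_mul_mat prototype.

```cpp
// Hypothetical sketch only: a matmul entry point that receives the element
// types and row strides of both operands, so any type/stride combination
// can be dispatched. Not the real iqk_mul_mat interface.
#include <cstddef>

enum class ElemType { F32, F16, Q8_0, Q4_0 };   // illustrative subset

bool iqk_mul_mat_sketch(int m, int n, int k,
                        ElemType type_a, const void * a, size_t stride_a,
                        ElemType type_b, const void * b, size_t stride_b,
                        float * c, size_t stride_c,
                        int ith, int nth) {      // thread index / thread count
    // Dispatch on the (type_a, type_b) pair; return false so the caller
    // can fall back to the generic ggml path for unsupported combinations.
    if (type_a == ElemType::F16 && type_b == ElemType::F32) {
        // ... run an f16 x f32 kernel over m/n/k using stride_a, stride_b,
        //     stride_c, splitting work between threads ith of nth ...
        return true;
    }
    return false;
}
```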
Iwan Kawrakow
9593e163db iqk_mul_mat: add ability to disable it 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
81cf6990f5 iqk_mul_mat: be able to handle any f16/f32 combination on AVX2
But only turning on f16 x f32 and f32 x f16 for now.
2024-06-22 12:02:50 +03:00
Iwan Kawrakow
9386b49918 iqk_mul_mat: fp16 for Arm
~2% slower than tinyBLAS - not sure why.
2024-06-22 12:02:50 +03:00
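For context on the building block such an Arm fp16 kernel rests on, here is a minimal NEON fp16 dot-product loop; it is an illustrative sketch (assuming fp16 vector arithmetic support and n divisible by 8), not the commit's code.

```cpp
// Illustrative NEON fp16 dot product; requires a compiler/target with
// __ARM_FEATURE_FP16_VECTOR_ARITHMETIC. Not the commit's kernel.
#include <arm_neon.h>

float dot_f16_neon(const float16_t * x, const float16_t * y, int n) {
    float16x8_t acc = vdupq_n_f16(0);
    for (int i = 0; i < n; i += 8) {                 // assumes n % 8 == 0
        acc = vfmaq_f16(acc, vld1q_f16(x + i), vld1q_f16(y + i));
    }
    // Widen to f32 before the horizontal sum to limit fp16 rounding error.
    float32x4_t lo = vcvt_f32_f16(vget_low_f16(acc));
    float32x4_t hi = vcvt_f32_f16(vget_high_f16(acc));
    return vaddvq_f32(vaddq_f32(lo, hi));
}
```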
Iwan Kawrakow
bc659e7de1 iqk_mul_mat: fp16 implementation for AVX2
This simple implementation beats jart's tinyBLAS by a
small margin (143 t/s vs 137 t/s for PP-512, TG is
4.75 t/s, so exactly the same as ggml).
2024-06-22 12:02:50 +03:00
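The AVX2 counterpart relies on F16C to widen halves to f32 and on FMA to accumulate. A minimal sketch of that building block, again illustrative rather than the commit's kernel, is:

```cpp
// Illustrative AVX2/F16C fp16 dot product: halves are widened to f32 with
// vcvtph2ps and accumulated with FMA. Assumes n % 8 == 0.
#include <immintrin.h>
#include <cstdint>

float dot_f16_avx2(const uint16_t * x, const uint16_t * y, int n) {
    __m256 acc = _mm256_setzero_ps();
    for (int i = 0; i < n; i += 8) {
        __m256 vx = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(x + i)));
        __m256 vy = _mm256_cvtph_ps(_mm_loadu_si128((const __m128i *)(y + i)));
        acc = _mm256_fmadd_ps(vx, vy, acc);
    }
    // Horizontal sum of the 8 accumulator lanes.
    __m128 lo = _mm256_castps256_ps128(acc);
    __m128 hi = _mm256_extractf128_ps(acc, 1);
    __m128 s  = _mm_add_ps(lo, hi);
    s = _mm_hadd_ps(s, s);
    s = _mm_hadd_ps(s, s);
    return _mm_cvtss_f32(s);
}
```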
Iwan Kawrakow
667bd4759c iqk_mul_mat: make it independent of sgemm 2024-06-22 12:02:50 +03:00
Iwan Kawrakow
19c578b413 iqk_mul_mat for llama.cpp 2024-06-22 12:02:49 +03:00
jojorne
84f6de17f6 Fix no gcc pragma on Windows (#7751) 2024-06-18 22:18:32 +10:00
Eve
465263d0cf sgemm : AVX Q4_0 and Q8_0 (#6891)
* basic avx implementation

* style

* combine denibble with load

* reduce 256 to 128 (and back!) conversions

* sse load

* Update sgemm.cpp

* oops

oops
2024-05-08 17:29:23 +03:00
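The "combine denibble with load" bullet refers to unpacking the two 4-bit quants packed into each Q4_0 byte right where the bytes are loaded. A minimal sketch of that unpacking step (illustrative, not the PR's code) is:

```cpp
// Illustrative "denibble" of one Q4_0 block with SSE: each byte packs two
// 4-bit quants, which are unpacked to signed values in [-8, 7].
#include <immintrin.h>
#include <cstdint>

// qs points at the 16 packed bytes of one Q4_0 block (32 quants).
static inline void denibble_q4_0(const uint8_t * qs, __m128i & lo, __m128i & hi) {
    const __m128i bytes = _mm_loadu_si128((const __m128i *)qs);
    const __m128i mask  = _mm_set1_epi8(0x0F);
    const __m128i off   = _mm_set1_epi8(8);
    lo = _mm_sub_epi8(_mm_and_si128(bytes, mask), off);                     // low nibbles
    hi = _mm_sub_epi8(_mm_and_si128(_mm_srli_epi16(bytes, 4), mask), off);  // high nibbles
}
```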
Justine Tunney
4b1c3c98b4 llamafile : use 64-bit integers in sgemm (#6928) 2024-04-26 17:05:33 +03:00
Justine Tunney
192090bae4 llamafile : improve sgemm.cpp (#6796)
* llamafile : improve sgemm.cpp

- Re-enable by default
- Fix issue described in #6716
- Make code more abstract, elegant, and maintainable
- Faster handling of weirdly shaped `m` and `n` edge cases

* Address review comments

* Help clang produce fma instructions

* Address review comments
2024-04-22 22:00:36 +03:00
Justine Tunney
8cc91dc63c ggml : add llamafile sgemm (#6414)
This change upstreams llamafile's cpu matrix multiplication kernels
which improve image and prompt evaluation speed. For starters, Q4_0
and Q8_0 weights should go ~40% faster on CPU. The biggest benefits
are with data types like f16 / f32, which process prompts 2x faster
thus making them faster than quantized data types for prompt evals.

This change also introduces bona fide AVX512 support since tinyBLAS
is able to exploit the larger register file. For example, on my CPU
llama.cpp llava-cli processes an image prompt at 305 tokens/second,
using the Q4_K and Q4_0 types, which has always been faster than if
we used f16 LLaVA weights, which at HEAD go 188 tokens/second. With
this change, f16 LLaVA performance leapfrogs to 464 tokens/second.

On Intel Core i9-14900K this change improves F16 prompt perf by 5x.
For example, using llama.cpp at HEAD with Mistral 7b f16 to process
a 215 token prompt runs at 13 tok/sec. This change includes fixes that
bring it to 52 tok/sec. That is mostly thanks to my vectorized outer product
kernels but also because I added support for correctly counting the
number of cores on Alderlake, so the default thread count discounts
Intel's new efficiency cores. Only Linux right now can count cores.

This work was sponsored by Mozilla, who has given permission to change
the license of this code from Apache 2.0 to MIT. To read more about
what's improved, and how it works, see: https://justine.lol/matmul/
2024-04-16 21:55:30 +03:00
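The "vectorized outer product kernels" mentioned above follow the classic register-tiling pattern: keep a small tile of C in vector registers and, for each k, broadcast scalars of A and FMA them against a vector of B. A minimal sketch of that pattern (illustrative only, not tinyBLAS itself) is:

```cpp
// Illustrative register-tiled outer-product micro-kernel (AVX + FMA).
// Computes a 4x8 tile: C[0..3][0..7] += A[0..3][0..K-1] * B[0..K-1][0..7],
// with lda/ldb/ldc given as strides in floats.
#include <immintrin.h>

void tile_4x8(const float * A, int lda, const float * B, int ldb,
              float * C, int ldc, int K) {
    __m256 c0 = _mm256_loadu_ps(C + 0*ldc);
    __m256 c1 = _mm256_loadu_ps(C + 1*ldc);
    __m256 c2 = _mm256_loadu_ps(C + 2*ldc);
    __m256 c3 = _mm256_loadu_ps(C + 3*ldc);
    for (int k = 0; k < K; ++k) {
        const __m256 b = _mm256_loadu_ps(B + k*ldb);       // one row of B
        c0 = _mm256_fmadd_ps(_mm256_set1_ps(A[0*lda + k]), b, c0);
        c1 = _mm256_fmadd_ps(_mm256_set1_ps(A[1*lda + k]), b, c1);
        c2 = _mm256_fmadd_ps(_mm256_set1_ps(A[2*lda + k]), b, c2);
        c3 = _mm256_fmadd_ps(_mm256_set1_ps(A[3*lda + k]), b, c3);
    }
    _mm256_storeu_ps(C + 0*ldc, c0);
    _mm256_storeu_ps(C + 1*ldc, c1);
    _mm256_storeu_ps(C + 2*ldc, c2);
    _mm256_storeu_ps(C + 3*ldc, c3);
}
```

Keeping the accumulator tile resident in registers across the whole k loop is what lets wider register files (such as AVX512's) pay off, since larger tiles amortize each load of A and B over more FMAs.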