ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-02-02 12:39:54 +00:00

Author	SHA1	Message	Date
Steven Roussey	9aec13530b	ggml : fix fprintf warnings (#1720 )	2023-06-08 10:12:28 +03:00
Georgi Gerganov	62fe685f4f	clang-tidy : restore dot file from accidental deletion	2023-06-08 10:09:08 +03:00
Kawrakow	a68dd950d7	metal : add Q4_K implementation (#1733 ) * Metal implementation for Q4_K Very slow for now: 42 ms / token, Q4_0 runs in 28 ms/token on my 30-core M2 Max GPU. * Optimizing Q4_K on metal The first token always takes longer, I guess because the metal kernel is being jit-compiled. So, using n = 128 to measure time. At this point Q4_K takes 29.5 ms / token compared to 27.2 ms / token for Q4_0. Quite a bit better than the initial attempt, but still not good enough. * Optimizing q4_K metal dot some more For n = 256 it is now 28.1 ms/token compared to 27 ms/token for q4_0. * Fix after merge with master --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com>	2023-06-08 10:08:23 +03:00
johnson442	2d1609eaaf	k-quants : add missing compile definition to CMakeLists (#1748 )	2023-06-08 10:02:48 +03:00
Georgi Gerganov	afb1e0cf32	k-quants : allow to optionally disable at compile time (#1734 ) * k-quants : put behind optional compile flag LLAMA_K_QUANTS * build : enable k-quants by default	2023-06-07 10:59:52 +03:00
jacobi petrucciani	4e1aea2999	flake : update to support metal on m1/m2 (#1724 )	2023-06-07 07:15:31 +03:00
Georgi Gerganov	e115880e31	readme : add June roadmap	2023-06-07 07:15:08 +03:00
Willy Tarreau	1ad7e68899	main: add the possibility to open the prompt cache read-only (#1640 ) The prompt cache constitutes a nice speed up when using the same prompt prefix across multiple evaluations, but when using it, it will also be updated, which is not always desirable. One use case is to have a large prompt containing some context and usage rules, and a second part containing variable data of the problem being studied. In this case it's desirable to be able to save the first part once, and to always reuse it as-is without updating it with the second part. The new argument --prompt-cache-ro enables this read-only mode on the prompt cache. The prompt's contents that match the cache are loaded from the cache but the rest is not modified. This allowed to reduce a total analysis time from 112s to 49.7s here, without having to backup and restore a copy of the prompt, which takes significant time at 500 MB. Signed-off-by: Willy Tarreau <w@1wt.eu>	2023-06-06 22:10:17 -04:00
Georgi Gerganov	fb21508b1b	llama : fix vram_scratch var	2023-06-06 22:54:39 +03:00
Georgi Gerganov	11757de5e1	llama : fix compile warnings	2023-06-06 22:41:53 +03:00
Johannes Gäßler	e957101084	Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703 ) * CUDA multi GPU + scratch ggml_cuda_compute_forward Tensor parallelism ggml_cuda_add ggml_cuda_rms_norm ggml_cuda_silu CUDA scratch buffer --main-gpu CLI option	2023-06-06 21:33:23 +02:00
Georgi Gerganov	32ef74369c	metal : add f16 support	2023-06-06 20:21:56 +03:00
LostRuins	698d0096d6	Clblast fixes + enhancements to save VRAM and offload more layers (#1675 ) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation * Clblast fixes + enhancements to save VRAM: 1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them. 2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer 3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it. * change max value size_t to use limits * removed flags from the CL pool malloc, apply code tidying suggestions.	2023-06-06 19:00:01 +02:00
Georgi Gerganov	0679702945	ggml : fix builds, add ggml-quants-k.o (close #1712 , close #1710 )	2023-06-06 10:18:03 +03:00
Georgi Gerganov	0aa222be28	gitignore : add .clang-tidy	2023-06-06 09:55:25 +03:00
Georgi Gerganov	391d82f905	llama : temporary disable Q6_K output quantization (#1711 )	2023-06-06 09:39:38 +03:00
Spencer Sutton	c54f0af06b	metal : add checks for buffer size (#1706 ) Co-authored-by: Spencer Sutton <Spencer.Sutton@precisely.com>	2023-06-06 06:28:17 +03:00
Yuval Peled	739a917df4	docs : add performance troubleshoot + example benchmark documentation (#1674 ) * test anchor link * test table * add benchmarks * Add performance troubleshoot & benchmark * add benchmarks * remove unneeded line --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 23:32:36 +03:00
Foul-Tarnished	ae6585dec5	readme : fix typo (#1700 ) Fix a typo in a command in README.md	2023-06-05 23:28:37 +03:00
mgroeber9110	38afdf54fd	llama : consistently catch and throw only exceptions deriving from std::exception (#1599 ) Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 23:24:29 +03:00
kiltyj	6600bfe1f7	metal : use shared buffers between CPU and GPU (#1696 ) * Use MTLDevice.newBufferWithBytesNoCopy to share buffers between CPU and GPU * Page-align buffers used by Metal * Remove trailing whitespace * Only import unistd.h for Metal builds * metal : remove unnecessary copies --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 23:24:04 +03:00
grahameth	3cdac3973b	ggml : fix internal overflow in ggml_time_us on Windows (#1702 ) Co-authored-by: grahameth <->	2023-06-05 23:11:49 +03:00
Georgi Gerganov	1b56d3c0cd	ci : disable auto tidy (#1705 )	2023-06-05 23:05:05 +03:00
Kawrakow	083663ff7d	ggml : add SOTA 2,3,4,5,6 bit k-quantizations (#1684 ) * Starting to add k-quantization to ggml I think it is better to have quantization separate from ggml. For now just adding the k-quants there, but it would be better to also factor out the existing ggml quantizations. * Adding Q3_K and Q8_K (de)-quantization * Q3_K now working on CUDA and AVX2/scalar CUDA is not ideal - ~50% slower than Q4_0 for single token prediction, about the same in batch mode (perplexity). CPU single token is ~55 ms (on Ryzen 7950X). * Some improvement for Q3_K on CUDA It is now ~22.5 ms/token on my GPU, so ~30% slower than Q4_0. * Some more CUDA optimizations for Q3_K Single token is now 20.5 ms/token (~20% slower than Q4_0). Perplexity is on par with Q4_0. * Adding Q4_K - scalar, AVX2, CUDA Performance is the same or perhaps very slightly better than Q4_0 on the CPU. On the GPU, single token prediction is ~10% better than Q4_0, batch mode (perplexity is about the same). * Adding Q6_K - scalar, AVX2, CUDA Performance is ~40% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 6-bit model is ~44% larger than the 4-bit. On the GPU, single token prediction is ~6% lower than Q4_0, batch mode (perplexity) is even closer (but still slower). * Adding Q5_K - scalar, AVX2, CUDA Performance is ~20% lower compared to Q4_K on the CPU. This is to be expected, considering that we are memory bound on the CPU and the 5-bit model is ~22% larger than the 4-bit. On the GPU, single token prediction is about the same as Q4_0 for both, single token and batch prediction. * Per convention, all QX_K quantizations use Q5_K for output.weight * Adding quantization mixes * Quantization mixes: didn't quite get what I wanted in the last commit * Q4_K dot product for ARM_NEON * Q6_K dot product for ARM_NEON * Q5_K dot product for ARM_NEON * Adding Q3_K dot for ARM_NEON It is 22% slower than Q4_K, despite the smaller model size. On x86_64, where we are memory bound, the Q3_K model is quite a bit faster than Q4_K. * A very slightly faster ARM_NEON Q3_K dot * Adding Q2_K - just CUDA for now Token prediction is pretty good - about 15.5 ms on a RTX 4080. Perplexity is about the same as Q4_K. * Adding scalar and AVX2 Q2_K dot * Adding ARM_NEON Q2_K dot About the same performance as Q4_K. * A slightly faster ARM_NEON Q2_K dot Single token prediction is now ~36 ms on M2 Max. The code is much simpler too. * Fixed bug in Q2_K CUDA dot product kernel Stranegly enough, for the few prompts I tried with the 7B model the responses looked perfectly reasonable. Only realized something is not quite right when I tried the larger models and started getting nonse back. In any case, Q2_K single token evaluation time on an RTX 4080 in a Ryzen7950X box iusing CUDA and model fully loaded on the GPU are ~15.5 ms for 7B, ~25.4 ms for 13B, and ~55.8 ms for 30B. The max number of layers that fit in VRAM for The 65B is 32. With that, we get ~330 ms per token, which is not that much faster than just running on the CPU (~470 ms per token). * Don't print zeros/NaNs when no count histogram has been collected * A 10% faster CUDA vector dot kernel for Q3_K Q3_K is now running at ~18.5 ms / token on CUDA, so the gap to Q4_0 is only 10%. It seems memory acccess pattern is more important for performance than the amount of computation the kernel does. * A slightly daster Q4_K AVX2 dot product For perplexity, where we are less memory bound, time per pass drops by ~5%. Barely measurable difference for single token prediction. * A slightly faster ARM_NEON A4_K dot product * Minor * Fix quantization error test We cannot possibly be expecting rmse < 0.002 for 2- and 3-bit quantization variants. * Fix docker build I have been sloppy with vector reinterpret casts on ARM_NEON. It seems clang is very forgiving in that regard. * Added forgotten ggml.o dependence on k_quants.h to the Makefile * Had unintentionally committed the Makefile with -Ofast enabled * ggml : rename k_quants -> ggml-quants-k, use lowercase in code --------- Co-authored-by: Iwan Kawrakow <iwan.kawrakow@gmail.com> Co-authored-by: Georgi Gerganov <ggerganov@gmail.com>	2023-06-05 22:56:18 +03:00
Henri Vasserman	c0bd595a01	Increase 3B scratch buffers. (#1698 ) The 128 MB was too optimistic. Too bad it is not dynamically computed.	2023-06-05 13:43:08 +03:00
Georgi Gerganov	dbfd086f3f	llama : fix Metal KV cache sync (close #1695 )	2023-06-05 10:19:03 +03:00
Georgi Gerganov	595b95c725	readme : update hot topics	2023-06-04 23:38:19 +03:00
Georgi Gerganov	442db16478	llama : Metal inference (#1642 ) * mtl : export the LLaMA computation graph * ci : disable temporary * mtl : adapt the MNIST example as starter * mtl : no need for mtl-export tool, add cli arg for main instead * mtl : export just a small part of the graph for now to make it easier * mtl : move MSL code into separate file for easy editing * mtl : initial get_rows_q4_0 kernel * mtl : confirmed get_rows_q4_0 is working correctly * mtl : add rms_norm kernel + confirm working * mtl : add mul kernel + confirm working * mtl : initial mul_mat Q4 kernel (wrong results) * mtl : mul_mat fixes (still wrong) * mtl : another mul_mat Q4 (still does not work) * mtl : working mul_mat q4 * ggml : fix handling of "view" ops in ggml_graph_import() * mtl : add rope kernel * mtl : add reshape and transpose handling * ggml : store offset as opt arg for ggml_view_xd() operators * mtl : add cpy kernel + handle view ops * mtl : confirm f16 x f32 attention mul mat * mtl : add scale kernel * mtl : add diag_mask_inf kernel * mtl : fix soft_max kernel * ggml : update ggml_nbytes() to handle non-contiguous tensors * mtl : verify V tensor contents * mtl : add f32 -> f32 cpy kernel * mtl : add silu kernel * mtl : add non-broadcast mul kernel * mtl : full GPU inference of the computation graph * mtl : optimize rms_norm and soft_max kernels * mtl : add f16 mat x f32 vec multiplication kernel * mtl : fix bug in f16 x f32 mul mat + speed-up computation * mtl : faster mul_mat_q4_0_f32 kernel * mtl : fix kernel signature + roll inner loop * mtl : more threads for rms_norm + better timing * mtl : remove printfs from inner loop * mtl : simplify implementation * mtl : add save/load vocab to ggml file * mtl : plug Metal inference into llama.cpp (very quick-n-dirty) * mtl : make it work with main example Lots of hacks but at least now it generates text * mtl : preparing for merge * mtl : clean-up ggml mtl interface + suport scratch / inplace * mtl : remove temp / debug code * metal : final refactoring and simplification * Revert "ci : disable temporary" This reverts commit 98c267fc77fe811082f672538fc91bcfc9072d63. * metal : add comments * metal : clean-up stuff, fix typos * readme : add Metal instructions * readme : add example for main	2023-06-04 23:34:30 +03:00
0cc4m	9223b3ab53	OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653 ) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation	2023-06-04 08:12:05 +02:00
Henri Vasserman	a65a9c322b	Add info about CUDA_VISIBLE_DEVICES (#1682 )	2023-06-03 16:35:20 +03:00
Jiří Podivín	d4a5347616	Docker: change to calling convert.py (#1641 ) Deprecation disclaimer was added to convert-pth-to-ggml.py	2023-06-03 15:11:53 +03:00
Evan Jones	1faeb8fd7c	Fix prompt cache saving and chat-persistent rollover (#1678 ) * Fix prompt cache saving and chat-persistent rollover (fixes #1670) * clang-tidy Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com> --------- Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>	2023-06-03 07:28:45 -04:00
Henri Vasserman	ee6461be54	OpenLLaMA 3B support (#1588 ) This adds support to llama.cpp to load the model. Currently missing are changes that are required from convert.py to convert the model correctly. It needs some changes to start reading the JSON configuration for HF models instead of deriving the values by guessing. Co-authored-by: FNsi <125447286+FNsi@users.noreply.github.com>	2023-05-30 21:24:22 +03:00
Georgi Gerganov	9bd2bdfba4	ggml : sync cgraph import / export API	2023-05-29 19:31:44 +03:00
Georgi Gerganov	6ae16e9c4f	ggml : fix bug in ggml_alibi	2023-05-29 19:30:49 +03:00
DannyDaemonic	5d769f35b3	Work around for recalculating logits in cached prompts (Fixes #1585 ) (#1609 ) * Work around for recalculating logits in cached prompts	2023-05-29 05:13:40 -07:00
Jiří Podivín	4541c5d448	Adding git in container package dependencies (#1621 ) Git added to build packages for version information in docker image Signed-off-by: Jiri Podivin <jpodivin@gmail.com>	2023-05-28 21:45:50 -07:00
Johannes Gäßler	e7b0672c25	LLAMA_DEBUG adds debug symbols (#1617 )	2023-05-28 21:01:02 +02:00
Kerfuffle	0907ab79e4	Only show -ngl option when relevant + other doc/arg handling updates (#1625 ) 1. Add a `LLAMA_SUPPORTS_GPU_OFFLOAD` define to `llama.h` (defined when compiled with CLBlast or cuBLAS) 2. Update the argument handling in the common example code to only show the `-ngl`, `--n-gpu-layers` option when GPU offload is possible. 3. Add an entry for the `-ngl`, `--n-gpu-layers` option to the `main` and `server` examples documentation 4. Update `main` and `server` examples documentation to use the new style dash separator argument format 5. Update the `server` example to use dash separators for its arguments and adds `-ngl` to `--help` (only shown when compiled with appropriate support). It will still support `--memory_f32` and `--ctx_size` for compatibility. 6. Add a warning discouraging use of `--memory-f32` for the `main` and `server` examples `--help` text as well as documentation. Rationale: https://github.com/ggerganov/llama.cpp/discussions/1593#discussioncomment-6004356	2023-05-28 11:48:57 -06:00
Vladimir Zorin	2de97bcba2	examples : add --alias option to gpt_params to set use friendly model name (#1614 )	2023-05-28 20:14:24 +03:00
Howard Su	139240d596	opencl : no need to allocate cl_mem on heap (#1612 )	2023-05-28 20:13:36 +03:00
Howard Su	b4e11a1e94	opencl : use strstr to check if fp16 supported (#1611 ) * Use strstr to check if fp16 supported * Ensure ext_buffer is null terminated	2023-05-28 20:09:56 +03:00
apcameron	9ef1c5c76e	ggml : add support for the RISCV architecture (#1616 )	2023-05-27 23:03:25 +03:00
Kerfuffle	f989fcc1ac	Include server in releases + other build system cleanups (#1610 ) Set `LLAMA_BUILD_SERVER` in workflow so the `server` example gets build. This currently only applies to Windows builds because it seems like only Windows binary artifacts are included in releases. Add `server` example target to `Makefile` (still uses `LLAMA_BUILD_SERVER` define and does not build by default) Fix issue where `vdot` binary wasn't removed when running `make clean`. Fix compile warnings in `server` example. Add `.hpp` files to trigger workflow (the server example has one).	2023-05-27 11:04:14 -06:00
Henri Vasserman	c31487b41a	Add documentation about CLBlast (#1604 ) Installing, compiling and using.	2023-05-27 18:47:55 +03:00
Henri Vasserman	6a7be3e13b	[CI] Fix openblas (#1613 ) * Fix OpenBLAS build * Fix `LLAMA_BLAS_VENDOR` CMake variable that should be a string and not a boolean.	2023-05-27 17:24:06 +03:00
Georgi Gerganov	621bc1797b	ggml : add ggml_tensor_overhead()	2023-05-27 16:19:56 +03:00
Henri Vasserman	2097ef1048	[CI] CLBlast: Fix directory name (#1606 )	2023-05-27 14:18:25 +02:00
Georgi Gerganov	97a29d7fab	ggml : sync ggml core (minor additions, e.g. ggml_get_tensor_by_name())	2023-05-27 12:23:16 +03:00
Kerfuffle	b9242e02d9	Some improvements to loading the session with --prompt-cache (#1550 ) Improvements to loading the session with `--prompt-cache` in the `main` example. 1. Fix an issue where the `--seed` parameter was ignored when loading a cached prompt. 2. When loading a cached prompt, you previously had to specify the saved prompt (or a prefix of it) again. This pull changes that behavior to default to the prompt that was cached if a prompt wasn't specified by the user.	2023-05-25 20:18:01 -06:00

... 39 40 41 42 43 ...

2639 Commits