ik_llama.cpp

mirror of https://github.com/ikawrakow/ik_llama.cpp.git synced 2026-01-30 19:19:57 +00:00

Author	SHA1	Message	Date
slaren	882a16a127	llama : ggml-backend integration (#4766) * llama : ggml-backend integration * ggml-backend : add names to buffers * fix unmap after loading * batched-bench : add tensor_split param * llama : check for null tensor_split * ggml-backend : increase GGML_MAX_BACKENDS * improve graph splitting, partial fix for --no-kv-offload * cuda : add ggml-backend split buffer support * cuda : do not create buffer types for devices that don't exist (fixes usage without CUDA devices available) * ggml : fix null backend dereference (#4807) * ggml : fix null backend dereference * ggml : also check ggml_backend_is_cpu * test-backend-ops : check buffer allocation failures * llama : add cparam (split_mode) and command line argument (--split-mode, -sm) to configure the split mode (none, layer or row) * ggml : fix mul_mat_id work size * llama : rewrite session kv load/set without graphs * minor * llama : only initialize used backends, free backends on context free * llama : abort ctx if cuda backend init fails * llama : rewrite lora with ggml-backend and compute on CPU ggml-ci * llama : only map to a backend buffer the region of the file mapping containing the tensors used in the buffer * opencl : add ggml-backend buffer type * cuda : only use batched_cublas with batched mat muls (fixes fp16 tg perf) * llama : on Metal, by default offload the full model ggml-ci * metal : page align the data ptr (#4854) * Apply suggestions from code review Co-authored-by: Johannes Gäßler <johannesg@5d6.de> * cuda : fix split buffer free * address review comments * llama-bench : add split-mode parameter * fix whitespace * opencl : fix double initialization * server : add --split-mode parameter * use async copy and compute to improve multi-gpu performance ggml-ci * use async memcpys to copy the graph outputs to the CPU * fix opencl * use a host buffer for the cpu compute buffer for faster copies to the gpu --------- Co-authored-by: Georgi Gerganov <ggerganov@gmail.com> Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2024-01-12 20:07:38 +01:00
Jared Van Bortel	3490c2eb19	build : fix build info generation and cleanup Makefile (#3920) * cmake : fix joining of REAL_GIT_DIR * fix includes with help from include-what-you-use * make : remove unneeded deps and add test-rope target * fix C includes in C++ source files * Revert "fix includes with help from include-what-you-use" This reverts commit 635e9fadfd516d4604a0fecf4a854bfb25ad17ae.	2023-12-01 00:23:08 +02:00
shibe2	1c980e81e3	CLBlast: Add outer loops over src0 for broadcasting in mulmat Reduce repeated dequantization of the same data.	2023-10-20 22:30:52 +04:00
shibe2	3537868cbd	opencl : fix element-wise multiplication (#3656)	2023-10-18 15:09:22 +03:00
shibe2	99629835fc	CLBlast: Fix temporary buffer size for f16 conversion (wsize) Fix buffer overflow. Reduce the size to fit just one 2D slice. Assert sufficient size.	2023-10-17 21:02:30 +04:00
shibe2	8f78e4d46e	CLBlast: Fix matrix-vector multiplication (#3544)	2023-10-12 21:59:47 +02:00
shibe2	0cb7a718e6	CLBlast: Fix handling of on-device tensor data Fix uploading tensor data to device, including 3D, 4D, and non-contiguous tensors. Use correct offsets into data that is already in VRAM. Correct handling of OpenCL events when multiple commands are queued.	2023-10-05 18:25:23 +04:00
shibe2	4103006018	CLBlast: Add broadcast support for matrix multiplication (#3402) Broadcast src0 into src1 across dimensions 2 and 3 when needed. This is required for models that use GQA.	2023-10-02 21:26:15 +02:00
shibe2	1f4f0754e3	ggml-opencl.cpp: Make private functions static (#3300)	2023-09-21 14:10:26 -04:00
slaren	8d85c7d12c	ggml-opencl : store GPU buffer in ggml_tensor::extra (#2994)	2023-09-04 14:59:52 +02:00
Wentai Zhang	b3912e82f1	opencl : fix a bug in ggml_cl_pool_malloc() for ggml_cl_mul_mat_f32() (#2955) Co-authored-by: Wentai Zhang <wentaizhang@tencent.com>	2023-09-03 11:46:44 +03:00
Howard Su	b29cd6ba6d	Fix opencl by wrap #if-else-endif with \n (#2086)	2023-07-07 05:34:18 +02:00
Govlzkoy	1d97415a70	[ggml] fix index for ne03 value in ggml_cl_mul_f32 (#2088)	2023-07-04 07:50:00 +08:00
LostRuins	f3aefd46e3	Porting the improved K-Quant CUDA kernels to OpenCL (#1966) * Added broken new q4k quant * xx + ib0 * Fix q2_k fast kernel * Use preprocessor for QK_K * Add q6_k fast matmul kernel * ported q3k speedup successfully * ported q2k and q5k speedups * remove old dot kernels and template * fixed global const struct types * fixing address spaces * fixed string too long CI issue --------- Co-authored-by: 0cc4m <picard12@live.de>	2023-06-29 05:56:43 +02:00
Howard Su	49fdf2d578	ggml : fix warnings under MSVC (#1908)	2023-06-17 18:46:15 +03:00
0cc4m	d7df87f386	opencl : support k-quants (#1836) * Porting q2_k kernel to OpenCL * Set global and local sizes for kernel calls for dequantizing k-quants * Added q6_k kernel * Fix q4_k opencl struct order * Replace uchar with uint8_t * Finish dequant kernels * Added OpenCL DMMV kernels * Fix q2_k, improve code * Fix q3_k * Shorten switch statements * Improve code formatting --------- Co-authored-by: Concedo <39025047+LostRuins@users.noreply.github.com>	2023-06-16 21:59:49 +03:00
Howard Su	df687e822c	Leverage mmap for offloading tensors to GPU (#1597) * Rebase to latest * Show progress * Add assert to make sure we only allocate temp buffer for non-CPU backend tensor Co-authored-by: Johannes Gäßler <johannesg@5d6.de> --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de>	2023-06-12 14:44:16 +02:00
Robert Sung-wook Shin	6a4f97263d	OpenCL: Add release memory (#1741) * Add opencl release memory * Rename function name	2023-06-09 18:24:40 +02:00
Johannes Gäßler	e957101084	Multi GPU support, CUDA refactor, CUDA scratch buffer (#1703) * CUDA multi GPU + scratch ggml_cuda_compute_forward Tensor parallelism ggml_cuda_add ggml_cuda_rms_norm ggml_cuda_silu CUDA scratch buffer --main-gpu CLI option	2023-06-06 21:33:23 +02:00
LostRuins	698d0096d6	Clblast fixes + enhancements to save VRAM and offload more layers (#1675) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation * Clblast fixes + enhancements to save VRAM: 1. Change all Clblast buffers to CL_MEM_READ_WRITE, as the pool malloc currently doesn't properly handle them. 2. When recycling buffers in pool malloc, always assign the SMALLEST available buffer that fits, instead of the FIRST available buffer 3. When failing to recycle a buffer in pool malloc (all too small), instead recycle the largest available free buffer by resizing it. * change max value size_t to use limits * removed flags from the CL pool malloc, apply code tidying suggestions.	2023-06-06 19:00:01 +02:00
0cc4m	9223b3ab53	OpenCL: Fix duplication of layers in VRAM and RAM, add GPU mul kernel (#1653) * Use events instead of clFinish, where possible * OpenCL: Don't load gpu layers into RAM, add mul_f32 kernel * Reduce queueing overhead for contiguous tensors by using single mul kernel call * Adapt to #1612 cl_mem malloc changes * Reduce code duplication between cuda and opencl branches * Improve implementation	2023-06-04 08:12:05 +02:00
Howard Su	139240d596	opencl : no need to allocate cl_mem on heap (#1612)	2023-05-28 20:13:36 +03:00
Howard Su	b4e11a1e94	opencl : use strstr to check if fp16 supported (#1611) * Use strstr to check if fp16 supported * Ensure ext_buffer is null terminated	2023-05-28 20:09:56 +03:00
Maarten ter Huurne	ce89052a49	Fix handling of "invalid property" when creating OpenCL command queue (#1565) The `clCreateCommandQueue()` function will return the code `CL_INVALID_QUEUE_PROPERTIES` when passed unsupported properties, not `CL_INVALID_PROPERTY` as the original code was checking for.	2023-05-23 19:01:15 +03:00
0cc4m	3955ecde57	OpenCL Token Generation Acceleration (#1459) * Move back to C++ for OpenCL * Refactor OpenCL code to work more like the CUDA code, add missing functions * Deduplicate dequant kernels * Add OpenCL compile options * Use compile args for preprocessing constants * Restore default platform + device selection by id behavior --------- Co-authored-by: Johannes Gäßler <johannesg@5d6.de> Co-authored-by: Henri Vasserman <henv@hot.ee>	2023-05-23 00:33:24 +03:00

25 Commits
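
Commit 698d0096d6 above describes the pool-malloc recycling policy only in words: hand out the smallest free buffer that fits, and if every free buffer is too small, recycle the largest free one by resizing it. The following is a minimal sketch of that selection rule, assuming the pool can be modeled as plain byte sizes rather than `cl_mem` handles. `PoolSlot`, `PickResult`, and `pick_slot` are hypothetical names chosen for illustration; they are not taken from ggml-opencl.cpp, and the real code may differ in detail.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical model of one pooled buffer: its size and whether it is handed out.
struct PoolSlot {
    std::size_t size;
    bool        in_use;
};

// Hypothetical result type for the selection.
struct PickResult {
    int  index;        // slot to hand out, or -1 if the pool has no free slot
    bool needs_resize; // true when the chosen slot is smaller than the request
};

// Select a free slot for a request of `want` bytes:
//   1. prefer the smallest free slot that is large enough (best fit);
//   2. otherwise fall back to the largest free slot, which the caller must resize;
//   3. otherwise report that nothing is free.
PickResult pick_slot(const std::vector<PoolSlot> & pool, std::size_t want) {
    int best    = -1; // smallest free slot with size >= want
    int largest = -1; // largest free slot of any size

    for (int i = 0; i < static_cast<int>(pool.size()); ++i) {
        if (pool[i].in_use) {
            continue;
        }
        if (pool[i].size >= want && (best == -1 || pool[i].size < pool[best].size)) {
            best = i;
        }
        if (largest == -1 || pool[i].size > pool[largest].size) {
            largest = i;
        }
    }

    if (best != -1) {
        return {best, false};
    }
    if (largest != -1) {
        return {largest, true};
    }
    return {-1, false};
}
```

With free slots of 64, 256, and 128 bytes and a request for 100 bytes, this sketch picks the 128-byte slot (index 2) rather than the first slot that fits (the 256-byte one at index 1), which is the VRAM-saving behavior the commit message describes.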