composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-14 10:09:41 +00:00

Author	SHA1	Message	Date
zjing14	be753a8db1	Add Gemm instances for performance improvement (#1018) * improve kpad * more tuning parameters * f16_f8_fp16 * cut test time * add f16_f8_fp16 * add f16_f8_f16 * testing instances for skinny cases * format * clean * add fp16_f8_fp16 * clang-format * add grouped gemm instalces * fixed profile grouped_gemm * clean * clean * clean * clean * clean * add missing instance func * fixed inferface --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: root <root@sh5-1e707-rc06-38.mkm.dcgpu> [ROCm/composable_kernel commit: `98fd41f597`]	2023-11-07 09:09:58 -06:00
zjing14	91e1cf6750	Revert "Grouped Gemm with looping over the tiles. (#788)" (#982) This reverts commit 43fe5037d4ff9d07365e5d3b8f5b31676a8ff9da. [ROCm/composable_kernel commit: `c99323be6e`]	2023-10-11 14:27:29 -05:00
Adam Osewski	34b77070f3	Grouped Gemm with looping over the tiles. (#788) * Introduce LocalBlockToCTileMap. * Change the signature of CalculateBottomIndex() function which now does not accept any argument. The B2C map which is already passed as an argument to the kernel Run function is calculating block's local id already outside at kernel entry point __global__ function. The LocalB2C map stores as members local block ID. * Use LocalBlockToCTile map in device ops. * First draft of tile loop work distribution. * Fix typo. * Simplify kernel arguments. Calculate descriptors & B2C maps on the device. * Use looping kernel. * Fix B2C constructor. * Fix Navi21 errors. * Calculate tile start/end in device kernel. * Change Run API to accept user provided workspace buffer. * Add new line at EOF. * Move Gemm KernelArguments to device op interface. * Remove unused code. * Update API. * Launch grid size which is min of occupancy vs tile count * Get back to use constant memory for gemm descriptors. * Remove unused code. * Add default virtual method implementation. * Update comments to conform with doxygen style. * Fix doc style and unused parameters. * Add thread cluster lengths to kernel name. * Remove old splitk impl and replace it with tile looping one. * Modify instances. * set KPerBlock to 64 * maximize wherever possible vector load size. * Fix instances cluster lengths. * Change comment style. * Use 128b store where possible in instances. * Update test cases, since KPerBlock has doubled. * Update output stream operator for Sequence. * Add pipeline version to GroupedGEMM device op type string. * Fix pipeline version type logging. * Fix input tensors type after merge. * Fix compiler error. * Fix output stream operator for Pipeline version. * Store using 128b. * Set of instances with kpb 32/64 * Limit number of instances * Remove commented out instances. * Fix function name. * Limit the number of instances. Add pipline version to the regular instances * Change thr cluster layout for reading B tensor. * disabled failed instances --------- Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Jing Zhang <jizha@amd.com> [ROCm/composable_kernel commit: `a4f72a314a`]	2023-10-10 22:21:15 -05:00
zjing14	c79ecbccfb	Grouped Gemm with Fixed K and N with SplitK (#818) * move all arguments into device * add b2c_tile_map * add examples * add SetDeviceKernelArgs * dedicated fixed_nk solution * init client api * add grouped_gemm_bias example * add a instance * add instances * formatting * fixed cmake * Update EnableCompilerWarnings.cmake * Update cmake-ck-dev.sh * clean; fixed comments * fixed comment * add instances for fp32 output * add instances for fp32 output * add fp32 out client example * fixed CI * init commit for kbatch * add splitk gridwise * format * fixed * clean deviceop * clean code * finish splitk * fixed instances * change m_loops to tile_loops * add setkbatch * clean code * add splitK+bias * add instances * opt mk_nk instances * clean examples * fixed CI * remove zero * finished non-zero * clean * clean code * optimized global_barrier * fixed ci * fixed CI * removed AddBias * format * fixed CI * fixed CI * move 20_grouped_gemm to 21_grouped_gemm --------- Co-authored-by: Jing Zhang <jizha@amd.com> [ROCm/composable_kernel commit: `f5ec04f091`]	2023-08-31 09:22:12 -05:00
Illia Silin	65eccfd426	do not build gfx941/942 targets during daily QA runs (#758) [ROCm/composable_kernel commit: `d140bdc9fa`]	2023-06-16 12:13:16 -07:00
Illia Silin	48347d8653	Enable gfx941 and gfx942 architectures. (#752) * enable gfx941/942 targets * fix clang format * fix the cmake logic for multiple targets * fix cmake syntax for looping over targets * add gfx941/942 support for gemm_xdl instances [ROCm/composable_kernel commit: `027e46ee82`]	2023-06-15 08:20:59 -07:00
Illia Silin	dda83a196e	Syncing up from internal repo to enable MI300. (#690) * enable gfx940 * switch between intrinsic mfma routines on mi100/200 and mi300 * fix mfma_int8 on MI300 * disable 2 int8 examples on MI300 * Update cmake-ck-dev.sh * restore gitignore file * modify Jenkinsfile to the internal repo --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com> [ROCm/composable_kernel commit: `4feebedd41`]	2023-04-28 18:22:59 -05:00
Haocong WANG	ec634a3d32	Add CMake Option "USE_OPT_NAVI3X" (#647) * Add CMake Option "USE_OPT_NAVI3X" * remove navi3x opt compile option from cmake script [ROCm/composable_kernel commit: `4e097ad283`]	2023-03-29 14:07:33 -05:00
Rostyslav Geyyer	81187d3553	Update cmake-ck-dev.sh script (#641) Co-authored-by: Rosty Geyyer <rosty.geyyer@amd.com> [ROCm/composable_kernel commit: `fa998675fc`]	2023-03-15 18:38:11 -05:00
zjing14	84a4731c15	disable tensor contraction f64 on MI100 (#602) [ROCm/composable_kernel commit: `209baee299`]	2023-02-23 16:59:37 -08:00
zjing14	af49d3cc89	Add contraction_fp64 example (#570) * add contraction_bilinear * add contraction_scale_xdl_fp64 * reduce tile size to avoid register spill --------- Co-authored-by: root <root@ctr-ubbsmc16.amd.com> [ROCm/composable_kernel commit: `24c9ee1d22`]	2023-02-15 12:00:58 -06:00
rocking5566	9052e8501c	Conv perlayer int8 quantization (#471) * Add conv2d requant example * Fix bash error * Rename example * 1. Rename gemm quantization 2. shares the requantization lambda function with conv * Refine declare type * Add conv bias relu quantization exmaple * clang format * Fix compile error due to merge develop * Fix CI error * Extract quantization post operation into another file * Support quantization for non piecewise linear function * Add instance for conv quantization * Add convolution quantization factory * Add convolution quantization client example * Add more instances with different template parameters * clang format * Sync the naming with the develop [ROCm/composable_kernel commit: `226bc02b73`]	2022-11-02 13:56:26 -06:00
Chao Liu	34f18d8e24	update document: Readme, contributors, citation, (#463) * update cmake script * update readme * Update README.md * add citation * add images * Update README.md * update * Update README.md * Update CONTRIBUTORS.md * Update README.md * Update CITATION.cff * Update README.md * Update CITATION.cff [ROCm/composable_kernel commit: `473ba5bc4a`]	2022-10-03 00:48:24 -05:00

13 Commits