composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-15 10:37:44 +00:00

Author	SHA1	Message	Date
Andriy Roshchenko	ece7edc492	Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386 ) * Add CMakePresets configurations. * Add ConvScale+ReLU Functor and an Example * Account for ReLU FLOPs. * Add instances of 3D convolutions with ConvscaleRelu operation. * Implement Client Example * Cleanup [ROCm/composable_kernel commit: `802a8a1df1`]	2024-07-16 08:51:49 -07:00
Bartłomiej Kocot	c885afdaae	Support access per groups and filter3x3 in grouped conv fwd (#1382 ) * Support access per groups and filter3x3 in grouped conv fwd * Fixes for large cases * Fixes for large tensors [ROCm/composable_kernel commit: `82e8a78a3f`]	2024-07-12 11:08:42 -07:00
Harisankar Sadasivan	45802765e0	Universal streamk with atomics (#1360 ) * universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile). * Update README.md * fixing clang-format issues * removed conflicts in struct members between streamk and universal streamk * corrected arg parsing for streamk and universal streamk * added stream-k policies for 3 tile and 4 tile * fixed argument type issue with parsing cmd args * changes suggested in PR review are made- removing comments and correcting copyright * file permissions updated * added default value support for grid_size and streamk-policy selection set to -1 * print messages for arguments * print messages for arguments * print messages for arguments1 [ROCm/composable_kernel commit: `75e622f02f`]	2024-07-05 21:40:30 -07:00
jakpiase	605ac804c4	Add structural sparsity xdlops (#1363 ) * Implemented smfmac xdlops * add reviewer comments [ROCm/composable_kernel commit: `eaa870a1ab`]	2024-07-04 12:00:14 +02:00
Jun Liu	fa73739812	Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372 ) [ROCm/composable_kernel commit: `959073842c`]	2024-07-03 23:34:38 -07:00
jakpiase	b9523bfc3f	Add structural sparsity gemm instruction tests (#1309 ) * first version of smfmac test * add reviewer comments * add reviewer suggestions [ROCm/composable_kernel commit: `ed21948bcd`]	2024-06-27 11:30:32 +02:00
Illia Silin	57ae3ae99f	Merging the gfx12 code into public repo. (#1362 ) [ROCm/composable_kernel commit: `941d1f7ce0`]	2024-06-27 00:33:34 -07:00
arai713	3f6437ede4	CK Instance Gen (#1145 ) * Format * Format * Format * Remove const * Use the right template * Format * Format * add row/col instances * Add missing file * fixed * fixing block to etile error * Format * Updates * Format * fixed rrr layout * generating a sample JSON file: currently contains includes, prologue/epilogue and instances * version where the json is passed into the instances to generate a key * updated run function to just launch kernel * updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer * adding in testing files * cleaned up comments, still need to work on including header files * removed unneeded files * removed/commented out JSON implementation * added fusion(prologue/epilogue) into instance generation * working on instance selection * added instance selection, need to fix instance validation * removed block2etile map validity check for testing purposes * test running: failing due to incorrect files/input * all grid descs/ptrs completed, but device file not found * Update test and embed modules * Restore older version * added convolution operation, written test, debugging generated code for compilation * attempting to include CK in host directory: _Float16 error * CK header file issues * slight fix * don't crash when hip can't report total memory * dump generated code to a file * changing sizes * creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed * some fixes to call the device code * separating test files for conv and gemm * completed arg ptr, now have linking errors * clang format fix * resolved linker issues in conv test * remove dependency on libutility from ck * resolved num dim error * properly passing arg ptr, errors with passing typenames: redefinition/redeclaration * undo the commenting of device function * hand created kernel 
code to find rtc issues * dump the full src to file * resolved redeclaration errors, cleaned up errors for Amber's kernel code * debugging purposes: redeclaration error * config files * resolved errors for NumTensor and redeclaration, formatted version.h * resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type * WIP: close to getting kernel compiled * WIP: fixing rtc errors * fixed sequence errors, formatting, still one error with run fcn * yay: kernel compiles and runs * updated templated/generated version to run and compile * minor fixes * working generated example, resolved memory access error due to padding * adding in reference kernel, validation failing against reference * debugging: printing kernel argsz * reduced error in results * debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues * working validation (using reference convolution) with prologue function for both hard-coded and generated version * WIP: create an alt version that creates Argument on the device * wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments * wip: making necessary methods device code * added grid descs, working on grid pointers, errors with stl numerics * wip: updating kernel args - issue, replacing some std functions * replaced std::accumulate call with temp hardcoded version * wip: args causing memory issue * Construct Argument object inside the kernel and use it to call convolution device function. 
Code runs and verification passes * adding object file dump * temporary hardcoding of grid size, can remove device op inst + arg ptr * minor fix for grid size * added modified example where arg ptr is created on the device for generated version as well * removed device op instance and arg ptr from modified examples * moving device op file for testing purposes and to properly build CK * commenting out print-outs * adjust compiler args to produce a valid ELF file * temporary removal of validation * reverting compiler args back for working example * retrieve necessary arguments from generated template parameters in correct format * calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly * scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example * temporary change to generate ELF format binary object file * removed unecessary code, added comments * formatting fix * cleaned up code, added new tests, restructured library: move helper into CK * refactored launch parameter calculation to be more concise * renamed files and variables for more clarity/uniformity * more code cleaning, removed debug statements * moved majority of my files into codegen directory, running properly * updated Embed.cmake(string_view) in codegen directory * updated host directory to match Embed.cmake as well * added old tests in * updated instance generation methods to be more concise * removed layout from launch parameter calculation * working test * fixed issue with verification, all instances working * updated verification in other tests * removed duplicate matrix padder file, removed code dumps * removed old hard-coded tests * removed old host directory, all files in codegen directory now * fixed copyright in files * commenting out validation * renamed files * made changes for review: fixed copyright, renamed files for clarity, removed comments, 
refactored code * updated headers * removing duplicate file for fwd conv to gemm, merging with original file * fix building codegen with clang++ directly * resolving build error from conv_fwd_to_gemm * fix for previous error * renaming tests * created common test file * cleaned up code, added comments * renamed device op * fixed typos in comments * removed extra space * code cleanup: resolving Amber's comments * removed wrapper struct for matrix padder, fixed template * cleaned up if statements for better readability --------- Co-authored-by: Paul <pfultz2@yahoo.com> Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com> Co-authored-by: illsilin <Illia.Silin@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `3e9711f0cb`]	2024-06-25 16:37:35 -05:00
carlushuang	723dd9813e	WA for rocm-6.2+ s constrait for buffer resource (#1346 ) * WA for rocm-6.2+ s constrait for buffer resource * add missing memory clobber [ROCm/composable_kernel commit: `fa129c1a5d`]	2024-06-21 11:00:13 -05:00
Bartłomiej Kocot	cc0dd8a45e	Fix cmake warnings (#1342 ) * Cmake add -Wno-nvcc-compt * Remove template without initialization list * dpp remove template without init list * Fixes [ROCm/composable_kernel commit: `510325a468`]	2024-06-21 09:47:58 +02:00
ThruptiRajLakshmanaGowda	428cefd1b5	Adding Missed Activation Functions for Grouped 2D/3D Convolutions (#1348 ) * Initial Push * First Push * Fixed Clang format * Resolve merge conflict * Addressed review comments * Addressed review comments * Addressed review comments [ROCm/composable_kernel commit: `0162a5f6ba`]	2024-06-20 09:24:54 -05:00
Bartłomiej Kocot	e1c3bf298d	Add read_first_lane function for int64 (#1347 ) [ROCm/composable_kernel commit: `8faec23cb4`]	2024-06-18 15:05:30 -05:00
jakpiase	163a866a5b	Switch to universal gemm in grouped gemm tile loop (#1335 ) * switch to universal gemm in grouped gemm tile loop * minor fixes * add reviewers comments --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `e2d139201b`]	2024-06-18 09:01:49 -05:00
Bartłomiej Kocot	856b54e58b	Fix continous dim selection in contraction (#1336 ) * Fix continous dim selection in contraction * Fixes [ROCm/composable_kernel commit: `933951ed48`]	2024-06-18 10:26:49 +02:00
zjing14	4847f3beb4	disabled lds direct load inline asm (#1331 ) [ROCm/composable_kernel commit: `e02103168a`]	2024-06-16 20:33:47 -05:00
Bartłomiej Kocot	d413c30ff4	Support large tensors in grouped conv fwd (#1332 ) * Support large tensors in grouped conv fwd * Multi ABD fixes * Fix calculate element space size [ROCm/composable_kernel commit: `dc1e9c5df9`]	2024-06-14 09:53:03 -05:00
Rostyslav Geyyer	25ae51c6f0	Add a convinvscale op, related instances and examples (#1307 ) * Update the element op * Add an example * Add instances * Add a client example * make sure new instances only build on gfx9 * Update element op and its handling * Format * Update instances to take element op as an argument * Update examples to use random scale values * Format * Update client example with random scales * Format --------- Co-authored-by: illsilin <Illia.Silin@amd.com> [ROCm/composable_kernel commit: `ce66277a76`]	2024-06-10 14:48:49 -05:00
Bartłomiej Kocot	41c68496e6	Integrate universal gemm with conv forward (#1320 ) * Integrate universal gemm with conv fwd * Fix conv fwd wmma test * Fix instances * Remove direct load check [ROCm/composable_kernel commit: `ac58cc5d1d`]	2024-06-05 13:01:29 -05:00
Rostyslav Geyyer	fec15d8c40	Add a scale op, related instances and examples (#1242 ) * Add a scale op * Update the element op * Add instances * Add an example * Add a client example * Add a flag check * Revert flag check addition * Fix flag check * Update d strides in example * Update d strides in client example * Apply suggestions from code review Update copyright header Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Move the example * Move the client example * Update element op * Update example with the new element op * Add scalar layout * Update example * Update kernel for scalar Ds * Revert kernel changes * Update element op * Update example to use scales' pointers * Format * Update instances * Update client example * Move element op to unary elements * Update element op to work with values instead of pointers * Update instances to take element op as an argument * Update examples to use random scale values --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `cb0645bedc`]	2024-06-04 19:28:15 -05:00
zjing14	551be3cb67	Post-merge fix of PR 1300 (#1313 ) * add f8 gemm with multiD for both row/col wise * change compute_type to fp8 * changed tuning parameters in the example * add rcr example * post-merge fix * fix * reduce init range [ROCm/composable_kernel commit: `6fb1f4e03f`]	2024-05-31 22:46:41 -07:00
zjing14	fe0f89d95d	add f8 gemm multiD with both row/col wise scale (#1300 ) * add f8 gemm with multiD for both row/col wise * change compute_type to fp8 * changed tuning parameters in the example * add rcr example [ROCm/composable_kernel commit: `80db62f08d`]	2024-05-28 12:04:22 -05:00
Bartłomiej Kocot	b4b436d29a	Optimize grouped conv bwd weight for small M and N (#1303 ) * Optimize grouped conv bwd weight for small M and N * Fixes [ROCm/composable_kernel commit: `fd72380aeb`]	2024-05-22 21:01:01 +02:00
Illia Silin	ca0015bf39	aggregate device macros in ck_tile config header (#1297 ) [ROCm/composable_kernel commit: `06b891c5c2`]	2024-05-20 08:34:45 -07:00
Illia Silin	0003dce849	replace the ENV macro with CK_ENV (#1296 ) [ROCm/composable_kernel commit: `1274861a9d`]	2024-05-17 10:42:51 -07:00
Illia Silin	ca31c8515e	remove wrong use of nonexistent class members (#1290 ) [ROCm/composable_kernel commit: `c44137838e`]	2024-05-15 08:08:17 -07:00
jakpiase	290ac20e62	Add unit tests for grouped gemm two stage (#1256 ) * add unit tests for grouped gemm two stage * add reviewers suggestions --------- Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `3e3471d5d2`]	2024-05-15 10:03:39 +02:00
Illia Silin	254758813f	Code clean-up (#1285 ) * code clean-up * remove the profiling output samples [ROCm/composable_kernel commit: `566b6480a2`]	2024-05-10 09:41:39 -07:00
Bartłomiej Kocot	70f51bb03f	Change output gemm type to AccDataType in two stage conv bwd wei (#1283 ) [ROCm/composable_kernel commit: `8346af9c68`]	2024-05-10 10:57:42 +02:00
Adam Osewski	675a16e3b8	Fix MakeArgument (#1284 ) [ROCm/composable_kernel commit: `a0ae1c6133`]	2024-05-09 09:42:41 -07:00
Adam Osewski	3ede1f58e6	Add vector instruction coherency bits for gfx94 targets. (#1268 ) [ROCm/composable_kernel commit: `3c043cd10b`]	2024-05-09 07:30:17 -07:00
Illia Silin	ffe52d2d30	fix the output formatting (#1282 ) [ROCm/composable_kernel commit: `fdbf8ccbd7`]	2024-05-08 16:11:54 -07:00
Bartłomiej Kocot	b6a17bc3e2	Add two stage grouped conv bwd weight kernel (#1280 ) [ROCm/composable_kernel commit: `0b6b5d1785`]	2024-05-08 09:53:24 +02:00
Illia Silin	e88d576926	Enable logging in CK with environment variable. (#1278 ) * enable logging using environment variable * update ck.hpp header * fix typo * fix clang format * Update include/ck/utility/env.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> --------- Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `bf42097646`]	2024-05-07 16:26:43 -07:00
Illia Silin	ba9ffb86c7	add missing vector header (#1275 ) [ROCm/composable_kernel commit: `08d51d9bc4`]	2024-05-02 11:27:59 -07:00
Rostyslav Geyyer	e0669ecf6c	Mark unneeded instances as "getting deprecated" (#1265 ) * Add a flag * Add flag check and messages --------- Co-authored-by: root <root@aus-g7-rogeyyer.amd.com> [ROCm/composable_kernel commit: `6ced3c12ff`]	2024-04-29 12:00:55 -07:00
Haocong WANG	1a34c500a6	[GEMM] UniversalGemm update (#1262 ) * Add bf16 instances * Add bf16 gemm universal example * tempsave * Add guard to navi compilation * workground on a specific mixed gemm instance ( bring back it when compiler fix upload) * fix formatting condition statement issue * solve conflict --------- Co-authored-by: Jun Liu <Liu.Jun@amd.com> [ROCm/composable_kernel commit: `764164b488`]	2024-04-26 12:56:07 -05:00
Rostyslav Geyyer	2d642d2737	Add element op (#1259 ) [ROCm/composable_kernel commit: `f044ff71fb`]	2024-04-26 12:55:45 -05:00
zjing14	ce67c185b4	bf16A_Int8B with fastgelu/bias (#1264 ) * changed the copy function to v7r2 * adding multi_abd * in-progress * add post-load oob check * debugging * adjust instances * add run_lds * add elemntwise_op * replace multi_abd_device with v3 * clean up * clean * clean * Added LDSType * profiling * adjust oobcheck * add missing file * refactor * clean * add examples [ROCm/composable_kernel commit: `0d0150db20`]	2024-04-26 07:26:30 -05:00
Adam Osewski	2b452ad135	Grouped GEMM Multiple D tile loop. (#1247 ) * Overload output stream operator for LoopScheduler and PiplineVersion * Add Run overload accepting grid descriptors MK. * Add __device__ keyword for CalculateGridSize * Create device op GroupedGemmMultipleD * Add GroupedGemm MultipleD Tile Loop implementation. * Add an example for GroupedGemm MultipleD tile loop. * Device Op GroupedGEMMTileLoop. * Bunch of small changes in exmaple. * CkProfiler * Remove unused tparam. * Fix include statement. * Fix output stream overloads. * Do not make descriptors and check validity untill we find group. * Fix gemm desc initialization. * Revert device op * Fix compilation for DTYPES=FP16 * Validate tensor transfers paramters. * Validate on host only NK dims if M is not known. * Fix bug. * A convenient debug func for selecting threads. * Fix has main k block loop bug. * Make sure that b2c has up to date tile offset. * Output stream operator for Sequence type. * Cmake file formatting. [ROCm/composable_kernel commit: `b4032629e5`]	2024-04-25 15:12:53 -05:00
ltqin	b4f3b8e693	Universal gemm flush cache (#1251 ) * add flush cache to device op * add flush cache parameter to ckProfiler * change calculate size a and b method * chang evaluation time method foro AVERAGE to MEDIAN * format code * adjust some code * fix core dumped * remove loop call flush icache in kernel * remove loop(outer) call flush icache --------- Co-authored-by: letaoqin <letaoqin@amd.com> [ROCm/composable_kernel commit: `f448d179b7`]	2024-04-25 15:07:14 -05:00
Bartłomiej Kocot	ac08f8a3a1	Fix contraction IsSupported checks (#1257 ) [ROCm/composable_kernel commit: `b1f8ae379b`]	2024-04-23 22:59:39 +02:00
Bartłomiej Kocot	6578635cb3	Refactor elementwise kernels (#1222 ) * Refactor elementwise kernels * Instances fixes * Fix cmake * Fix max pool bwd test * Update two stage gemm split k * Restore elementwise scale for hiptensor backward compatiblity * Fix Acc data type check in conv fwd multiple abd * Disable conv fp64 fwd example * Update grouped conv weight multi d [ROCm/composable_kernel commit: `ad1597c499`]	2024-04-19 13:31:17 +02:00
Bartłomiej Kocot	d001bea12f	Add grouped conv bwd weight multi d kernel (#1237 ) * Add grouped conv bwd weight multi d kernel * Reference fix * Fix cmake files * bwd weight scale only xdl * Fixes * Fix client conv fwd example [ROCm/composable_kernel commit: `fd923b6d86`]	2024-04-18 23:35:04 +02:00
zjing14	4ddb546fe5	Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978 ) * added an example grouped_gemm_multi_abd * fixed ci * add setElementwiseOp * changed API * clean code: add multiA into example * fixed v7r2 copy * add transpose * clean * fixed vector_load check * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update 
include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * add reduce * testing * add example_b16_i8 * refactor example * clean * add mpading * disable reduce for kbatch = 1 * seperate reduce device op * add reduce op * add guard for workspace_size * add instances * format * fixed * add client example * add a colmajor * add instances * Update cmake-ck-dev.sh * Update profile_gemm_splitk.cpp * Update gridwise_gemm_xdlops_v2r4r2.hpp * format * Update profile_gemm_splitk.cpp * fixed * fixed * adjust test * adjust precision loss * adjust test * fixed * add bf16_i8 scale bias * fixed scale * fixed scale elementwise_op * revert contraction deviceop changes * fixed * Add AddFastGelu * Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example" This reverts commit `3b5d001efd`, reversing changes made to `943199a991`. * add Scales into elementwise * add gemm_multi_abd client example * add client examples * add rcr and crr * add grouped gemm client example * add grouped gemm client example * add instance for rcr crr * format * fixed * fixed cmake * fixed * fixed client_example * format * fixed contraction isSupport * Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> * Update device_reduce_threadwise.hpp * clean * Fixes * Fix example --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: Bartłomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `12865fbf28`]	2024-04-15 21:09:45 -05:00
Haocong WANG	ceaecc86ca	[GEMM] Gemm universal device operation (#1154 ) * Optimize GEMM on MI200/300: 1. Add new blockwise gemm pipeline 2. Add irregular splitk intances * clang format + typo fix * Fix a bug * initial commit * Add more instances to irregular splitk * blkgemm pipeline v1~4 prototype * Sanity Checked. Known issue: 1. Poor performance of splitk 2. Register spill on blkgemmpipeline v3 * Sanity and Performance fix: 1. fix a bug related to sanity in grouped b2c mapping 2. fix a bug related to sanity and performance in splitk offset * Sanity and API update: 1. Remove prefetch stage 2. Fix valid check bug 3, Add first gemm_universal instance into ckProfiler * Add NN instances for gemm universal * 1. Add NT instances for gemm_universal 2. Fix a bug about Kpadding in gemm_universal * Fix a bug regarding padding Odd K number * remove kernel print * Fix KPadding bug... * Update safety check * another try to fix kpadding.. * Sanity checked * new instances.. * clang format+typo fix * remove clang format script's change * Add non-hotloop compile option * 1. Add fp16xfp8 example 2. pull packed convert f8 from pr1150 * Some miscs.. opt and fix * Add pipeline description docs * Split universal gemm instance library to cut profiler compiling time * uncomment cmakefile * Fix a bug caused by blockwise_gemm_pipe_v2 * reduce default splitk to 1 * Add 224x256x64 tile size * update, including: 1. Experiment pipeline 5~7 2. Optimization for pipeline 4 3. Organized instance library * temp save * temp save * Permuted lds layout, sanity and function checked * clang format * Move OOB check from RunRead to RunWrite, for better software pipeline. 
TODO: agpr spill when NN layout * clangformat * A/B splitpipe scheduler for v3 * Fix two bugs * bug fix * fix a bug in oob check * Example for mixed fp16_fp8 gemm * Clean experimental code blocks * Add mixed precision gemm into profiler * tempsave * optimize m/n major lds layout * Add RRR GEMM mixed precision instances * Optimize f8 matrix transpose * Add test_gemm_universal * A/B spilt schedule for blkpip v5 * Take ds_read2 into iglp scheduling scheme * format * fixed cmake * Add llvm-option into CI cmake flag --------- Co-authored-by: Jing Zhang <jizhan@amd.com> [ROCm/composable_kernel commit: `f83e9701e9`]	2024-04-13 21:03:18 -05:00
Illia Silin	7b8026faa9	[HotFix] pass XDL and WMMA macros to libs that use CK (#1234 ) [ROCm/composable_kernel commit: `d7f05fb996`]	2024-04-11 16:40:45 -07:00
jakpiase	1438bdd38c	Add Grouped Gemm Multiple D SplitK TwoStage (#1212 ) * Support A/B/C elementwise ops. * First part of GGEMM multiD splitk two stage. * WIP - changes for debuggin. * tmp save * working version * added bf16@int8 version * fixes * add reviewers sugestions * pre-commited missing files * switched to ifs from elseifs --------- Co-authored-by: Adam Osewski <Adam.Osewski@amd.com> [ROCm/composable_kernel commit: `c701071666`]	2024-04-04 11:01:33 +02:00
Rostyslav Geyyer	0b8e766e55	Add instances for conv_scale with fp8@bf8->fp8 (#1220 ) * Update device op api to support BComputeType * Add example * Add instances * Add profiler mode * Add client example * Update copyright year * Add BComputeType check * Fix compute types [ROCm/composable_kernel commit: `a61e73bc56`]	2024-04-03 09:08:08 -05:00
Bartłomiej Kocot	0adb068ce8	Introduce combined elementwise ops (#1217 ) * Introduce combined elementwise ops * Introduce refrence elementwise [ROCm/composable_kernel commit: `9a194837af`]	2024-04-02 17:23:49 -05:00
Illia Silin	1f4d13b2b5	Split the instances by architecture. (#1223 ) * parse examples inside the add_example_executable function * fix the example 64 cmake file * add xdl flag to the gemm_bias_softmax_gemm_permute example * add filtering of tests based on architecture type * enable test_grouped_gemm for gfx9 only * enable test_transpose only for gfx9 * only linnk test_transpose if it gets built * split the gemm instances by architectures * split gemm_bilinear,grouped_conv_bwd_weight instances by targets * split instances by architecture * split grouped_conv instances by architecture * fix clang format * fix the if-else logic in group_conv headers * small fix for grouped convolution instances * fix the grouped conv bwd weight dl instances * fix client examples * only enable client examples 3 and 4 on gfx9 * set the gfx9 macro * make sure the architecture macros are set by cmake * use separate set of xdl/wmma flags for host code * sinmplify the main cmake file * add conv_fwd_bf8 instance declaration [ROCm/composable_kernel commit: `ae57e5938e`]	2024-04-02 09:42:17 -07:00

446 Commits