composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-13 02:27:33 +00:00

Author	SHA1	Message	Date
Aviral Goel	004784ef98	chore(copyright) update library wide CMakeLists.txt copyright header template (#3313) * chore(copyright) update library wide CMakeLists.txt files copyright header template * Fix build --------- Co-authored-by: Sami Remes <samremes@amd.com>	2025-11-28 13:49:54 -08:00
AviralGoelAMD	4e49e0228b	chore(copyright): update copyright header for test directory	2025-11-19 17:43:28 -07:00
linqunAMD	321627aec5	Extend XDL kernel to Support RDNA3/4 - Part 4 (#2724) * Fix example * fix build error * update pk_i4 & moe test case * fix all instance build (examples) * fix batched_gemm_gemm (example) * disable example_gemm_bias_softmax_gemm_permute on gfx11 * remove unnecessary disable gfx11 * update tests * update tests2	2025-09-12 08:17:07 -07:00
Kiefer van Teutem	7330ec37ee	Implement batched gemm gemm for RDNA (3 and 4) (#2612 ) * Create new copies of existing device struct and gridwise struct for batched_gemm_softmax_gemm and disable the softmax part. Still based on old wmma pipelines. Also copy the example and remove the softmax part from the reference calculation. Works and results match reference except for tiny float errors in problem 2. * Turn DeviceBatchedGemmGemm_Wmma_CShuffleV3 into a proper DeviceBatchedGemmGemm derived class, with the right argument and invoker functions. Update example to use new definitions. * Remove unused cross-attention and self-attention kernels, arguments, and invokers. Also remove other unused Argument types. * Remove masking related code, test unusual sizes in example. * Remove remaining softmax related code from GridwiseBatchedGemmGemm_wmma_cshuffle_v3 and example. * Remove code related to numDims, bias, and TensorSpec from Device struct and example. * Add layout template parameters to device struct * Move (NPerBlock, LTilePerBlock) device struct template arguments up by two places to match XDL template argument ordering. * Merge accumulation data types into one type to match XDL device struct. * Remove NPerWmma template parameter from device struct and just set it equal to LPerWmma. Now device struct template params exactly match those for XDL batched gemm gemm. * Add support for RCCR layout and test this in example * Add batched_gemm_gemm_wmma to instance library + profiler, and add gtest just like for xdl. * Add RCCR instance and additional RCRR instance to library. * Remove unused permute and alpha related code. Time all tests. Fix B1 strides in argument verification. * Remove references to G0, G1 in favor of batch, reduce dimensionality of length and stride arrays. * Managed to replace old wmma gridwise pipeline and blockwise struct with new wmma blockwise pipeline. Some cleanup required but all tests pass. 
* Make TransposeC a proper template parameter that gets passed all the way from BlockGemmPipeline_Selector to WmmaGemm so we can use the correct settings for batched gemm gemm as well as regular gemm. Gemm universal tests now pass again. * Replace old LoopSched and PipelineVer params with BlockwiseGemm pipeline equivalents, and use these in instance factory. The v3 pipeline does not work yet, but v1 works for intrawave and interwave. * Adapt the A wave descriptor to deal with RDNA4 wmma. This fixes batched gemm gemm functionality on RDNA4. * Fixed two aspects of the v3 pipeline that were incorrect: First of all the blockwise copy operator was invoked once too many in all cases (RunRead and move window), which broke batched gemm gemm when the blockwise pipeline was used multiple times. Furthermore we should be using the mainloop (hotloop) for num_k_loop >=2 instead of num_k_loop >=3. Now we can support any K dimension. * Remove num prefetch parameter from gridwise struct since we don't use it and it doesn't do anything. * Remove unused non-lds paths. * Test and update the IsSupportedArgument() and CheckValidity() functions for all layouts + padding modes and various problem sizes. * Add a lot of instances to the profiler with various blocksizes and pipelines, all verified. * Add support for BF16: instance library, tests, and examples. * Add examples for int8 and fp8, had to add type_convert_sp template specializations for the latter. * Template the library instance lists and add default padding instances. * Move memory calculations from the kernel to the Argument constructor. Also actually parse and use the user-provided batch strides. * Actually parse and use user-provided regular strides. * More refactor: remove references to multiple dims per dims, and g0 / g1. Also move xdl specific test utils out of generic test util header. * Small post-rebase-on-develop fix due to bscale-related pipeline changes. All tests rerun + tested bscale and regular gemm. 
* Introduce the correct GetCThreadDescriptor function in the blockwise gemm pipelines for the TransposeC=true case. It turns out to be identical for our batched gemm gemm (gemm0) usecases, but could theoretically be different for wmma_gemm instances with smaller-than-4-byte output data size. * Remove unused NumPrefetch template parameter, we don't need to match the XDL template params one-to-one. * Implement proper TailNum and HasMainLoop template parameters for the v3 pipeline. Now the Run() function knows at compile time whether there are 1, 2, or more loops in total, and adds or removes sections accordingly. It still uses the blockwise copy operators the correct amount of times. * Add print lambda with env check and file and func to device and gridwise level compatibility error messages. Also respect compatibility in example script. * RDNA3 does not support fp8	2025-09-04 14:10:24 -07:00
Illia Silin	ae57e5938e	Split the instances by architecture. (#1223) * parse examples inside the add_example_executable function * fix the example 64 cmake file * add xdl flag to the gemm_bias_softmax_gemm_permute example * add filtering of tests based on architecture type * enable test_grouped_gemm for gfx9 only * enable test_transpose only for gfx9 * only link test_transpose if it gets built * split the gemm instances by architectures * split gemm_bilinear,grouped_conv_bwd_weight instances by targets * split instances by architecture * fix clang format * fix the if-else logic in group_conv headers * small fix for grouped convolution instances * fix the grouped conv bwd weight dl instances * fix client examples * only enable client examples 3 and 4 on gfx9 * set the gfx9 macro * make sure the architecture macros are set by cmake * use separate set of xdl/wmma flags for host code * simplify the main cmake file * add conv_fwd_bf8 instance declaration	2024-04-02 09:42:17 -07:00
Illia Silin	bba085d2b5	Refactoring cmake files to build data types separately. (#932 ) * refactor cmake files for the tests * refactor cmake files for examples * fix cmake for gemm example * fix the cmake file for all examples * add splitting by data types in gemm_splitk instance header * rename test to reflect only dl instances are used * clean up CI workspace, update cmake for instances * change the jenkinsfile syntax * build all instances except DL on gfx11 * move workspace cleanup after stages * clean up workspace after every stage * isolate data types in grouped_conv_fwd header * isolate dl instances for grouped_conv2d_fwd * fix syntax * fix cmake and batchnorm instances * fix typo * fix reduction instances * fix grouped_conv headers * fix syntax * replace parsing logic for instances, replace bfp16 with bf16 * fix the client examples build * clean up DTYPES from instances cmake files * update the parsing logic in cmake files * make an exception for reduction kernels * update few remaining cmake files to handle DTYPES * fix syntax * fix cmake conflicts * replace f8 with fp8 test name * resolve conflicts for dpp instances	2023-09-20 22:15:56 -07:00
Illia Silin	08eb176929	Allow building CK for specific data types and split off last remaining DL instances. (#830) * properly split conv_nd_bwd_data instances * split conv2d_fwd instance data types * split the gemm, conv2d_fwd and batched_gemm_softmax_gemm * split the tests by data types where possible * filter examples by DTYPES * split few remaining examples by DTYPES * filter most instances by DTYPES * add new lines at end of headers, fix grouped_gemm profiler * fix syntax * split the ckprofiler instances by DTYPES * split the conv2d and quantization DL and XDL instances * fix the splitting of conv2d DL instances * split softmax and pool_fwd tests for fp16 and fp32 types * fix syntax * fix the dl_int8 quantization instances isolation	2023-08-07 14:56:10 -07:00
Illia Silin	027e46ee82	Enable gfx941 and gfx942 architectures. (#752) * enable gfx941/942 targets * fix clang format * fix the cmake logic for multiple targets * fix cmake syntax for looping over targets * add gfx941/942 support for gemm_xdl instances	2023-06-15 08:20:59 -07:00
Illia Silin	b94fd0b227	update copyright headers (#726)	2023-05-31 18:46:57 -05:00
Illia Silin	d821d1e54f	Enable gemm_dl and other kernels on Navi3x. (#714) * enable dl kernels on navi3 * do not build xdl tests and examples on Navi * run tests before building everything on jenkins * disable gemm_bilinear on gfx1030 * add gpu targets to installer on Navi * put tests in the same order as before * reduce the number of navi targets in CI * build CI installed for gfx940 as well * only build for MI300 during QA runs	2023-05-23 11:23:16 -05:00
Po Yen Chen	8784a72e23	Modularize ckProfiler operations (#514 ) * Re-structure ckProfiler source files * Rename profiler.cpp to main.cpp * Modularize ckProfiler operations * Add description for profiler operations * Use longer name to avoid name collision * Use macro to delay expansion * Use std::move() to avoid object copying * Prohibit users from calling dtor * Use macro to eliminate redundant code * Make friend function hidden * Add missing include directive <iostream> * Fix wrong include directives * Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>	2022-12-01 15:15:02 -06:00
Adam Osewski	3048028897	Refactor device op implementations into `impl` subdirectory. (#420 ) * Move kernel implementation files under impl directory. * Update examples paths. * Update device kernel impl include paths. * Update tensor operation instances include paths. * Update profiler and tests include paths. * Clang-format * Update include paths for batched gemm reduce * Refactor UnitTest ConvNDBwdWeight. * Refactor fwd and bwd data convND UT. * Fix used test macro. * Fix include path. * Fix include paths. * Fix include paths in profiler and tests. * Fix include paths. Co-authored-by: Adam Osewski <aosewski@amd.com>	2022-10-13 09:05:08 -05:00
Anthony Chang	868e5c555b	Fused attention instances & padding tests (#395) * modify comment * trim unnecessary check * add gemm spec in kernel name * add TNTT gemm_gemm + atten kernel instances * refactor attention padding to better fit in unit tests This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op. Also added compile-time conditionals that load OOB value as NaN only after padding is enabled * add adhoc padding test for atten * shrink input value range for attention kernel validation to avoid occasional error by 1e-3 Still unsure whether this kind of deterministic floating point accuracy issue is expected or not. May want to try exact same approach as the GPU kernel in the host reference GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then, shrink the input value range as it is less likely to produce errors of around ~1e-3. * attention kernel proper granular padding for all 4 dims * IsSupportedArgument checks * test more padded cases * block PadK specialization in attention kernels * workaround clang crash for gfx908 (gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class VGPR_32: Cannot scavenge register without an emergency spill slot!" this falls back to a less ideal way of handling NPadding in fused attention kernel * comment out kernels giving wrong results on MI100; MI200 doesn't seem affected	2022-09-06 14:38:56 -05:00
Anthony Chang	fe52c94c98	GemmGemm TNNT instances (#399) * add gemm_gemm TNNT instance * sanitize Gemm1KPack * disable instances that failed validation on mi100	2022-09-06 13:38:01 -05:00
Anthony Chang	f4047c9418	Implement padding and sanity checks for fused GEMM+GEMM (#376) * GemmPadder and GemmGemmPadder * proper padding using GemmGemmPadder * test gemm_gemm padding * properly check size K in IsSupportedArgument() * properly check size requirement given SrcScalarPerVector in IsSupportedArgument() * comment * format	2022-08-23 10:01:02 -05:00
Anthony Chang	c20a75b07d	Fused GEMM+GEMM (#351) * initial stub for gemm_gemm_xdl_cshuffle * set up example code * compiles * prevent integer overflow * harmonize interface between ref_gemm and ref_batched_gemm * batched_gemm_gemm * fix example * host tensor gen: diagonal pattern in lowest two-dimensions only * make c descriptors containing only integral constants * clean up * add BlockwiseGemmXdlops_v2 while exploring a unified approach * implement proper interface * tidy up example * fix compilation warnings * coarsely controlled 2nd gemm padding * remove rocm-cmake's hard requirement for certain revision * clang-format * resolve merge conflict * fix compilation error on gfx10 * adds acc0 elementwise op to interface * add gemm_gemm instances and tests * avoid LDS data hazard * fix build Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-08-13 09:18:58 -05:00

16 Commits