* Create new copies of existing device struct and gridwise struct for batched_gemm_softmax_gemm and disable the softmax part. Still based on old wmma pipelines. Also copy the example and remove the softmax part from the reference calculation. Works and results match reference except for tiny float errors in problem 2.
* Turn DeviceBatchedGemmGemm_Wmma_CShuffleV3 into a proper DeviceBatchedGemmGemm derived class, with the right argument and invoker functions. Update example to use new definitions.
* Remove unused cross-attention and self-attention kernels, arguments, and invokers. Also remove other unused Argument types.
* Remove masking related code, test unusual sizes in example.
* Remove remaining softmax related code from GridwiseBatchedGemmGemm_wmma_cshuffle_v3 and example.
* Remove code related to numDims, bias, and TensorSpec from Device struct and example.
* Add layout template parameters to device struct
* Move (NPerBlock, LTilePerBlock) device struct template arguments up by two places to match XDL template argument ordering.
* Merge accumulation data types into one type to match XDL device struct.
* Remove NPerWmma template parameter from device struct and just set it equal to LPerWmma. Now device struct template params exactly match those for XDL batched gemm gemm.
* Add support for RCCR layout and test this in example
* Add batched_gemm_gemm_wmma to instance library + profiler, and add gtest just like for xdl.
* Add RCCR instance and additional RCRR instance to library.
* Remove unused permute and alpha related code. Time all tests. Fix B1 strides in argument verification.
* Remove references to G0, G1 in favor of batch, reduce dimensionality of length and stride arrays.
* Managed to replace old wmma gridwise pipeline and blockwise struct with new wmma blockwise pipeline. Some cleanup required but all tests pass.
* Make TransposeC a proper template parameter that gets passed all the way from BlockGemmPipeline_Selector to WmmaGemm so we can use the correct settings for bacthed gemm gemm as well as regular gemm. Gemm universal tests now pass again.
* Replace old LoopSched and PipelineVer params with BlockwiseGemm pipeline equivalents, and use these in instance factory. The v3 pipeline does not work yet, but v1 works for intrawave and interwave.
* Adapt the A wave descriptor to deal with RDNA4 wmma. This fixes batched gemm gemm functionality on RDNA4.
* Fixed two aspects of the v3 pipeline that were incorrect: First of all the blockwise copy operator was invoked once too many in all cases (RunRead and move window), which broke batched gemm gemm when the blockwise pipeline was used multiple times. Furthermore we should be using the mainloop (hotloop) for num_k_loop >=2 instead of num_k_loop >=3. Now we can use support any K dimension.
* Remove num prefetch parameter from gridwise struct since we don't use it and it doesn't do anything,
* Remove unused non-lds paths.
* Test and update the IsSupportedArgument() and CheckValidity() functions for all layouts + padding modes and various problem sizes.
* Add a lot of instances to the profiler with various blocksizes and pipelines, all verified.
* Add support for BF16: instance library, tests, and examples.
* Add examples for int8 and fp8, had to add type_convert_sp template specializations for the latter.
* Template the library instance lists and add default padding instances.
* Move memory calculations from the kernel to the Argument contructor. Also actually parse and use the user-provided batch strides.
* Actually parse and use user-provided regular strides.
* More refactor: remove references to multiple dims per dims, and g0 / g1. Also move xdl specific test utils out of generic test util header.
* Small post-rebase-on-develop fix due to bscale-related pipeline changes. All tests rerun + tested bscale and regular gemm.
* Introduce the correct GetCThreadDescriptor function in the blockwise gemm pipelines for the TransposeC=true case. It turns out to be identical for our batched gemm gemm (gemm0) usecases, but could theoretically be different for wmma_gemm instances with smaller-than-4-byte output data size.
* Remove unused NumPrefetch template parameter, we don't need to match the XDL template params one-to-one.
* Implement proper TailNum and HasMainLoop template parameters for the v3 pipeline. Now the Run() function knows at compile time whether there are 1, 2, or more loops in total, and adds or removes sections accordingly. It still uses the blockwise copy operators the correct amount of times.
* Add print lambda with env check and file and func to device and gridwise level compatibility error messages. Also respect compatibility in example script.
* RDNA3 does not support fp8
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only linnk test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* sinmplify the main cmake file
* add conv_fwd_bf8 instance declaration
* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
* enable gfx941/942 targets
* fix clang format
* fix the cmake logic for multiple targets
* fix cmake syntax for looping over targets
* add gfx941/942 support for gemm_xdl instances
* enable dl kernels on navi3
* do not build xdl tests and examples on Navi
* run tests before building everything on jenkins
* disable gemm_bilinear on gfx1030
* add gpu targets to installer on Navi
* put tests in the same order as before
* reduce the number of navi targets in CI
* build CI installed for gfx940 as well
* only build for MI300 during QA runs
* Re-structure ckProfiler source files
* Rename profiler.cpp to main.cpp
* Modularize ckProfiler operations
* Add description for profiler operations
* Use longer name to avoid name collision
* Use macro to delay expansion
* Use std::move() to avoid object copying
* Prohibit users from calling dtor
* Use macro to eliminate redundant code
* Make friend function hidden
* Add missing include directive <iostream>
* Fix wrong include directives
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* Move kernel implementation files under impl directory.
* Update examples paths.
* Update device kernel impl include paths.
* Update tensor operation instances include paths.
* Update profiler and tests include paths.
* Clang-format
* Update include paths for batched gemm reduce
* Refactor UnitTest ConvNDBwdWeight.
* Refactor fwd and bwd data convND UT.
* Fix used test macro.
* Fix include path.
* Fix include paths.
* Fix include paths in profiler and tests.
* Fix include paths.
Co-authored-by: Adam Osewski <aosewski@amd.com>
* modify comment
* trim unnecessary check
* add gemm spec in kernel name
* add TNTT gemm_gemm + atten kernel instances
* refactor attention padding to better fit in unit tests
This streamlines usage where "ResetNaNToMinusInf" is now hidden from user facing device op.
Also added compile-time conditionals that load OOB value as NaN only after padding is enabled
* add adhoc padding test for atten
* shrink input value range for attention kernel validation to avoid occasional error by 1e-3
Still unsure whether this kind of deterministic floating point accurary issue is expected
or not. May want to try exact same approach as the GPU kernel in the host reference
GEMM+Softmax+GEMM function to see if the accuracy discrepancy goes away. Until then,
shrink the input value range as it is less likely to produce errors of around ~1e-3.
* attention kernel proper granular padding for all 4 dims
* IsSupportedArgument checks
* test more padded cases
* block PadK specialization in attention kernels
* workaround clang crash for gfx908
(gfx908 only) workaround for compiler crash in fused kernels on mainline #9110; #10738 seems ok
error message was "fatal error: error in backend: Error while trying to spill VGPR0 from class
VGPR_32: Cannot scavenge register without an emergency spill slot!"
this fall back to less ideal way of handle NPadding in fused attention kernel
* comment out kernels giving wrong results on MI100; MI200 doesn't seem affected
* GemmPadder and GemmGemmPadder
* proper padding using GemmGemmPadder
* test gemm_gemm padding
* properly check size K in IsSupportedArgument()
* properly check size requirement given SrcScalarPerVector in IsSupportedArgument()
* comment
* format
* initial stub for gemm_gemm_xdl_cshuffle
* set up example code
* compiles
* prevent integer overflow
* harmonize interface between ref_gemm and ref_batched_gemm
* batched_gemm_gemm
* fix example
* host tensor gen: diagonal pattern in lowest two-dimensions only
* make c descriptors containing only integral constants
* clean up
* add BlockwiseGemmXdlops_v2 while exploring an unified approach
* implement proper interface
* tidy up example
* fix compilation warnings
* coarsely controlled 2nd gemm padding
* remove rocm-cmake's hard requirement for certain revision
* clang-format
* resolve merge conflict
* fix compilation error on gfx10
* adds acc0 elementwise op to interface
* add gemm_gemm instances and tests
* avoid LDS data hazard
* fix build
Co-authored-by: Chao Liu <chao.liu2@amd.com>