* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
1. Add base class GridwiseGemm_xdl_cshuffle_base for all gridwise_gemm_xdl classes.
- to select correct LDS layout and epilogue behavior , three additional parameters is added.
- ForceNaiveLdsLayout: disable XOR based LDS layout when it is true
- DirectLoad: pipeline only use directload, we need force naive layout and ignore any padding on gfx9
- IsMxGemm: epilogue has two addtional dimensions
2. Move all LDS descriptor layout related fucntion to base class, including
- GetABlockDescriptor_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BK0PerBlock_NPerBlock_BK1
- GetCShuffleBlockDescriptor_MBlock_MPerBlock_NBlock_NPerBlock
3. Move several LDS related helper funtions to base class, including
- GetSharedMemoryNumberOfByte
- GetABlockDescriptor_AKB_AK0PerBlock_MPerBlock_AK1
- GetBBlockDescriptor_BKB_BK0PerBlock_NPerBlock_BK1
- GetCBlockDescriptor_MBlock_NXdlPerWave_MWaveMPerXdl_NBlock_NXdlPerWave_NWaveNPerXdl
4. Move all c epilogue related code to base class, and 4 kind of implementation are provided
- RunEpilogueNoShuffle
- RunEpilogue
- RunMultiDEpilogue
- RunMoeEpilogue
This change significantly improves compile-time performance by reducing template
instantiation depth for sequence generation and merging operations:
Optimizations:
- sequence_gen: Reduce instantiation depth from O(log N) to O(1) by using
__make_integer_seq to generate indices in a single step, then applying the
functor via pack expansion
- uniform_sequence_gen: Similarly optimized to O(1) depth using __make_integer_seq
with a helper that applies a constant value via pack expansion
- sequence_merge: Reduce depth from O(N) to O(log N) using binary tree reduction
strategy. Added direct concatenation specializations for 1-4 sequences to
avoid recursion in common cases, falling back to binary tree merging for 5+
sequences
Documentation:
- Added extensive inline comments explaining why sequence_merge cannot achieve
O(1) depth like sequence_gen (requires computing cumulative sequence lengths
from heterogeneous inputs, inherently requiring recursion)
- Documented the binary tree reduction approach and why it's superior to fold
expressions for this use case
Testing:
- Added comprehensive unit tests for uniform_sequence_gen with different values,
sizes, and edge cases
- Added tests for sequence_gen with custom functors (double, square, identity,
constant) to verify the new implementation works with arbitrary functors
- Added tests for sequence_merge with 4, 5, and many sequences to verify both
the direct concatenation path and binary tree reduction path
- Added tests for empty sequence edge cases
* Additional flavors for WMMA conv fwd large tensor
- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accomodate extra tensors
* changed strategy to handle GemmArgs array
* Adding generic instance
* Added generic instance to clamp and bias_clamp ops
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation
* wip: many fixes in implementation of batched gemm gemm multiple d
* wip: batched gemm gemm multiple d gridwise op compiling, not working yet
* fix: incorrect d0 grid indexing in batched gemm gemm multipled
* feat: add instances for batched gemm add relu gemm add
* chore: configure instance with low vector transfer size for odd sizes
* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense
* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes
* fix: disable odd size tests on XDL archs
* chore: removed temporary logging
* chore: update some references to C tensor to E tensor
* Tentative fix for example template params
* Tentative fix for non-multi-D batched gemm gemm device impl.
* Tentative fix for xdl example template params
* Tentative fix for profiler build on gfx90a
* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions
* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply
* fix: make the gemm1 data types match what happens in the device op
* feat: add d0s/d1s datatypes and layouts to the device op type string
* chore: change element-wise op so addition happens in fp32
* chore: add static asserts for gemm0/gemm1 calculated wave sizes
* chore: also updated other element-wise ops to use fp32 calculations
* chore: log number of supported instances
* chore: update instance comment
* chore: disable kernel timing in example by default
* fix: gemm1 wave size calculation
* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions
* chore: remove increased tolerance in batched gemm gemm multiple d example
* chore: add comment explaining that verification fails for certain input values
* chore: clarify instance comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* feat: grouped gemm tile loop support for RDNA4
* fix: removed extra parameter from grouped gemm example instance
* fix: FP8 check incorrectly enabling FP8 on RDNA3
* Take split_k into account when checking 2GB tensor limit.
* Revert "Take split_k into account when checking 2GB tensor limit."
This reverts commit adf35c91be.
* Optimize grouped conv bwd wei split_k off calc
(cherry picked from commit 6f61dd56c5)
* Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp
(cherry picked from commit b33877c10f)
* Fix tensor descriptors and stride calculations
* Don't miss half of the elements
* Fix buffer size calculations
* Disable hack if stride not divisible by k_batch
* Clean up comments
* Disallow hack in non-contiguous edge cases
* Index -> Dim
* Fix broken test
* Refactor applicability checks into separate function
* fix missed variable name
* Fix variable name in info print
* update V3 2GB check
* No more regression, use templates instead
* Code deduplication
* Regression fix for cshuffle
* arch-guarded atomic_add implementations for gfx11
* Similar for half(4|8)_t as well
* Only use both offset hacks at the same time
* Revert "arch-guarded atomic_add implementations for gfx11"
This reverts commit 3883fe6935.
This reverts commit 5311ec608d.
* Reapply "arch-guarded atomic_add implementations for gfx11"
This reverts commit 1972adeddc.
* Only remove float4 atomic_add
* Refactor to single flag
* Consolidate template parameters
* Consolidate flag in transformers
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* Convolution bwd weight device implementation
* Merge branch 'grouped_conv_bwd_weight_device_impl_wmma' into 'feature/conv_bwd_weight_wmma'
Convolution bwd weight device implementation
See merge request amd/ai/composable_kernel!38
* Fix bug and disable splitK=-1 tests for wmma
* Add generic instances for bf16 f32 bf16
* check gridwise level validity in device impl for 1 stage D0
* Fix bugs in device implementation:
- rdna3 compilation error
- gridwise layouts (need to be correct to ensure that CheckValidaity()
works correctly)
* Add padding in conv to gemm transformers for 1x1Stride1Pad0 specialization
* Remove workaround for 1x1Stride1Pad0 conv specialization
* Add instances for xdl parity (for pipeline v1)
* Add two stage instances (xdl parity)
* Add multiple Ds instances
* Add examples
* Uncomment scale instances
* Fix copyright
* Fix examples compilation
* Add atomic add float4
* Fix compilation error
* Fix instances
* Compute tolerances in examples instead of using default ones
* Compute tolerances instead of using default ones in bilinear and scale tests
* Merge branch 'grouped_conv_bwd_weight_instances_examples' into 'feature/conv_bwd_weight_wmma'
Grouped conv: Instances and example bwd weight
See merge request amd/ai/composable_kernel!47
* Device implementation of explicit gemm for grouped conv bwd weight
Based on batched gemm multiple D
* Add instances for pipeline v1 and v3
* Add support for occupancy-based splitk
* Fix ckProfiler dependencies
* Review fixes
* Merge branch 'explicit_bwd_weight' into 'feature/conv_bwd_weight_wmma'
Device implementation of explicit gemm for grouped conv bwd weight
See merge request amd/ai/composable_kernel!52
* Fix cmake file for tests
* fix clang format
* fix instance factory error
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Revert "Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test."
This reverts commit d20c869d3d.
* Disable splitk for 2stage xdl on rdna (bug to be fixed)
* Fix add_test_executable
* Always ForceThreadTileTransfer for now, WaveTileTransfer does not work for convolution yet.
* Grab device and gridwise files from bkp branch, this should enable splitK support for convolution and also we no longer ForceThreadTileTransfer for explicit gemm. Also grab some updates from 7e7243783008b11e904f127ecf1df55ef95e9af2 to fix building on clang20.
* Fix bug in various bwd wei device implementations / profiler where the occupancy based split_k value could not be found because the Argument did not derive from ArgumentSplitK, leading to incorrect error tolerances.
* Actually print the reason when a device implementation is not supported.
* Print number of valid instances in profiler and tests.
* Fix clang format for Two Stage implementation
* Fix copyright
* Address review comments
* Fix explicit conv bwd weight struct
* Fix gridwise common
* Fix gridwise ab scale
* Remove autodeduce 1 stage
* Restore example tolerance calculation
* Fix compilation error
* Fix gridwise common
* Fix gridwise gemm
* Fix typo
* Fix splitk
* Fix splitk ab scale
* Adapt all grouped conv bwd weight vanilla Xdl instances to 16x16. MRepeat doubled for all but 12 of them (some static assert failure). Also added custom reduced profiler target for building grouped conv bwd weight vanilla only profiler. Verified with gtest test.
* Reduce instances to only the tuned wmma V3 ones for implicit v1 intra and explicit v1 intra pad/nopad.
* Add explicit oddMN support with custom tuned instances
* Add two stage instances based on the parameters from the tuned cshuffle V3 instances. CShuffleBlockTranserScalarPerVector adapted to 4, and mergegroups fixed to 1 for now. No more special instance lists.
* Replace cshuffle non-v3 lists with v3 lists, making sure to not have duplications. Also removing stride1pad0 support for NHWGC since we can use explicit for those cases.
* Remove some instances that give incorrect results (f16 NHWGC)
* Add bf16 f32 bf16 instances based on tuned b16 NHWGC GKYXC instances.
* Add back some generic instances to make sure we have the same shape / layout / datatype support as before the instance selection process.
* Add instances for scale and bilinear based on the bf16 NHWGC GKYXC tuning. Keep generic instances for support.
* Disable two stage f16 instances which produce incorrect results.
* Remove more instances which fail verification, for bf16_f32_bf16 and for f16 scale / bilinear.
* Disable all non-generic two-stage instances in the instance lists for NHWGC. They are never faster and support is already carried by CShuffleV3 and Explicit.
* Remove unused instance lists and related add_x_instance() functions, fwd declarations, cmakelists entries. Also merge the "wmma" and "wmma v3" instance list files, which are both v3.
* Re-enable all xdl instances (un-16x16-adapted) and dl instances. Remove custom ckProfiler target.
* Remove straggler comments
* Remove [[maybe_unused]]
* Fix clang format
* Remove unwanted instances. This includes all instances which are not NHWGCxGKYXC and F16 or BF16 (no mixed in-out types).
* Add comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kiefer van Teutem <50830967+krithalith@users.noreply.github.com>
Old sequence sort code was showing up on build profiles. Convert it to constexpr functions for much more efficient build-time execution. The sorting is still O(N^2), but our sequences are small enough it executes quickly. This reduced compilation time of a small convolution by more than 10% and time overall time spent in the compiler on a narrow build by %6.
We only want to import enums and types into the builder reflection code. But, some of the enums are included in much larger files or even big trees of include files. This leads to unintended mixing of code and very confusing interactions and symbol conflicts. We organize the includes and extract two new enum-only headers to help with decoupling in CK. This refactoring is critical if we want to include reflection in a device-operator "describe" method.
* Remove a few unnecessary includes from headers in builder/reflect/.
* Extract enums scheduler and pipeline to their own headers so they can be used without importing other code.
* Order includes alphabetically for better organization.
The immediate goal is to unblock reflection integration, and this type of cleanup helps the flexibility and robustness of the CK header library.
* Initial implementation:
- add new thread group transfer supporting transpose instruction
- refactor AB transfer to switch between thread and wave tiles methods
* Add some comments and remove explicit wave and lane calculations
* Remove compiler option for performance
* fp16 example: use tuned instance
* Missing cleanup
* Integrate wave transfer in existing gemm and batched gemm instances
* Add fast instances
* extend implementation for 8 bit datatypes
packed types not supported
* Address review comments
* Optimize pipeline v1 and re-introduce compiler option
* Disable wave tile approach for b scale gemm
* Fix for clang20
* Avoid code duplication of amd_global_load_transpose_to_vgpr function
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances
* add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances
* add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances
* tf32:conv:add instances for base class DeviceConvFwd
* tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD
* tf32:conv:add instances for base class DeviceGroupedConvBwdWeight
* add tf32 in profiler
* remove gnhwc/ngchw/ngcdhw instances
* remove non-ndhwgc/nhwgc/nhwc instances
* add check in IsSupportedArgument()
* Remove C++20 code
C++20 features should not be used in CK. Remove all C++20 code.
* fix c++17 build
* format
* fix merge issue
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* upgrade default docker to rocm7.0.1
* turn on build and test on gfx950 by default
* use rocm-dev instead of rocm
* link libhiprtc for codegen targets
* resolving codegen compilation errors: removed calls to other std functions, resolved issues with int32_t: needed the correct header, put use of e8m0 into header guards
---------
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
* Revert "Revert "feature:tf32:add initial conv3d fwd kernel support (#2763)" (#2848)"
This reverts commit 03b59f8c76.
* fix compile error on gf12x
* only run tf32 example on gfx942
* only build tf32 instance on gfx942
* ckProfiler:only support tf32 in gfx942
* delete unuseful messages
* Create new copies of existing device struct and gridwise struct for batched_gemm_softmax_gemm and disable the softmax part. Still based on old wmma pipelines. Also copy the example and remove the softmax part from the reference calculation. Works and results match reference except for tiny float errors in problem 2.
* Turn DeviceBatchedGemmGemm_Wmma_CShuffleV3 into a proper DeviceBatchedGemmGemm derived class, with the right argument and invoker functions. Update example to use new definitions.
* Remove unused cross-attention and self-attention kernels, arguments, and invokers. Also remove other unused Argument types.
* Remove masking related code, test unusual sizes in example.
* Remove remaining softmax related code from GridwiseBatchedGemmGemm_wmma_cshuffle_v3 and example.
* Remove code related to numDims, bias, and TensorSpec from Device struct and example.
* Add layout template parameters to device struct
* Move (NPerBlock, LTilePerBlock) device struct template arguments up by two places to match XDL template argument ordering.
* Merge accumulation data types into one type to match XDL device struct.
* Remove NPerWmma template parameter from device struct and just set it equal to LPerWmma. Now device struct template params exactly match those for XDL batched gemm gemm.
* Add support for RCCR layout and test this in example
* Add batched_gemm_gemm_wmma to instance library + profiler, and add gtest just like for xdl.
* Add RCCR instance and additional RCRR instance to library.
* Remove unused permute and alpha related code. Time all tests. Fix B1 strides in argument verification.
* Remove references to G0, G1 in favor of batch, reduce dimensionality of length and stride arrays.
* Managed to replace old wmma gridwise pipeline and blockwise struct with new wmma blockwise pipeline. Some cleanup required but all tests pass.
* Make TransposeC a proper template parameter that gets passed all the way from BlockGemmPipeline_Selector to WmmaGemm so we can use the correct settings for bacthed gemm gemm as well as regular gemm. Gemm universal tests now pass again.
* Replace old LoopSched and PipelineVer params with BlockwiseGemm pipeline equivalents, and use these in instance factory. The v3 pipeline does not work yet, but v1 works for intrawave and interwave.
* Adapt the A wave descriptor to deal with RDNA4 wmma. This fixes batched gemm gemm functionality on RDNA4.
* Fixed two aspects of the v3 pipeline that were incorrect: First of all the blockwise copy operator was invoked once too many in all cases (RunRead and move window), which broke batched gemm gemm when the blockwise pipeline was used multiple times. Furthermore we should be using the mainloop (hotloop) for num_k_loop >=2 instead of num_k_loop >=3. Now we can use support any K dimension.
* Remove num prefetch parameter from gridwise struct since we don't use it and it doesn't do anything,
* Remove unused non-lds paths.
* Test and update the IsSupportedArgument() and CheckValidity() functions for all layouts + padding modes and various problem sizes.
* Add a lot of instances to the profiler with various blocksizes and pipelines, all verified.
* Add support for BF16: instance library, tests, and examples.
* Add examples for int8 and fp8, had to add type_convert_sp template specializations for the latter.
* Template the library instance lists and add default padding instances.
* Move memory calculations from the kernel to the Argument contructor. Also actually parse and use the user-provided batch strides.
* Actually parse and use user-provided regular strides.
* More refactor: remove references to multiple dims per dims, and g0 / g1. Also move xdl specific test utils out of generic test util header.
* Small post-rebase-on-develop fix due to bscale-related pipeline changes. All tests rerun + tested bscale and regular gemm.
* Introduce the correct GetCThreadDescriptor function in the blockwise gemm pipelines for the TransposeC=true case. It turns out to be identical for our batched gemm gemm (gemm0) usecases, but could theoretically be different for wmma_gemm instances with smaller-than-4-byte output data size.
* Remove unused NumPrefetch template parameter, we don't need to match the XDL template params one-to-one.
* Implement proper TailNum and HasMainLoop template parameters for the v3 pipeline. Now the Run() function knows at compile time whether there are 1, 2, or more loops in total, and adds or removes sections accordingly. It still uses the blockwise copy operators the correct amount of times.
* Add print lambda with env check and file and func to device and gridwise level compatibility error messages. Also respect compatibility in example script.
* RDNA3 does not support fp8
Update Blockwise and Gridwise files to support both wave32 & wave64.
1. Calculate WaveSize from template parameter, instead of hard code it to 64, some "64" is also replace with WaveSize
2. Move BN0Shuffled and BK0Shuffled to device side. we can't get correct mfma inst info in host side.
3. Update b_thread_offset_n and b_thread_offset_k in gridwise_gemm_xdl_cshuffle_v3_b_scale.hpp for gfx11. in gfx11, input data is duplicated for each 16 threads, it is different with all of others.
4. Modify a1_threadwise_copy in gridwise_batched_*gemm*gemm for gfx11. for gfx11, we need duplicate input and swizzle A if transposeC isn't enabled.
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
* revrt back to atomic add using casting