* create a filter to build only libs required by hiptensor
* allow building libs for miopen and hiptensor at the same time
* tweak the lib filtering logic one more time
* CMakeLists.txt hack for Windows.
* Add Windows build instructions.
* Fix type issue with variadic min function.
* Use std::common_type to fix the variadic min/max functions.
* Enable CPU guard compilation on Windows.
* Suppress warnings related to std::getenv on Windows platform.
* Git ignore the output directory on Windows platform.
* Powershell script for running tests and generating reports.
* Improve test logging.
* Disable non-conv tests.
* Fix Debug build on Windows.
* More debug build changes.
* Update Windows build instructions.
* Enable all tests.
* Test fixes.
* Suppress not found linker options warning.
* Update unsigned long literals and format specifiers to work correctly in Windows
* Fix conv 3D bwd weight bilinear tests on Windows.
* Revert changes on .gitignore.
* Clean-up CMake project file for Windows builds.
* clang-format
* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.
* Fix building examples on Windows.
* Update Readme.
* Remove the suppression of the deprecated warnings.
* Remove Windows specific min/max implementations from CK Tile math core.
* Remove unnecessary no-op on Windows.
---------
Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
* Added install of CK_Tile headers when using CK_EXPERIMENTAL_BUILDER. MIOpen needs this since the builder uses features from CK Tile and the CK Tile install is excluded when doing a narrow build for MIOpen
* Changed algorithm concept type checks to be concepts instead of constexpr bool functions. This improves compiler error messages when using these concepts in static_asserts
---------
Co-authored-by: Daryl Hawkins <DarylHawkins@amd.com>
* wip: grouped_gemm implementation based on wmma kernel + example for fp16
* chore: clean up grouped_gem_wmma_splitk_fp16 example
* chore: add cmake options to fully disable XDL or WMMA kernels
* feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts)
* chore: add grouped gemm wmma bf16 example
* refactor: reuse more code between instance factory functions
* chore: turn test failure if not all batch sizes are supported into a warning
* chore: made failing of test on unsupported instances conditional to not break old tests
* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value
* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()
* fix: stray comma after parameter list
* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran
* chore: update copyright in header comments
* nit: minor feebdack
* refactor: unified XDL / wma tests
* fix: properly disable FP8 instances when ONLY targeting gfx11
* refactor: add v3 suffix to grouped_gemm device struct name
* fix: small typos in example code
* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags
* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths
* fix: make sure to not add instance library to group if library was skipped
* fix: make sure xdl grouped gemm doesnt fail the new test
* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case
* fix: examples not working since dependent types and functions were moved to ck namespace in develop
* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances
* chore: replace/add copyright headers with new format
- Added support for static_distributed_tensor<...>
- Added support for tile_distribution<...>
- Added support for tensor_view<...>
- Added support for tensor_descriptor<...>
Now type-print handles both:
1. Type aliases (::BottomTensorView, ::TensorDesc, etc.)
2. Full types with no runtime storage (static_distributed_tensor, etc.)
Shows [from type] indicator for all type-only extractions.
Example: type-print dst_tensor
Works even when 'p dst_tensor' shows 'Cannot access memory'
Co-Authored-By: Claude <noreply@anthropic.com>
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
32-bit dword) is used to store results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There are no transpose loading for RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging , the proper fix is in develop already.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers output so this job looks hanging.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.
* [CK_BILDER] Add compile-time reflection for a convolution instance
Introduce InstanceTraits template metaprogramming framework to enable runtime introspection of device kernel template parameters without requiring implementation knowledge. This reflection system extracts configuration details (block sizes, data types, layouts, tuning parameters) directly from kernel specializations through template
pattern matching. In particular, the GetInstanceString method returns a string that uniquely idenitfies the kernel, by explicitly serializing all template paramter values.
This provides critical functionality for MIOpen integration, since the existing GetTypeString method is ambiguous, and only captures some of the template paramters.
The implementation uses a two-level design: a primary InstanceTraits template declaration in instance_traits.hpp serves as the interface, while kernel-specific specializations (e.g., for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle_V3) provide the actual extraction logic. This separation allows the reflection system to scale to additional kernel types without modifying the core interface.
Key architectural decisions:
- Forward-declare device kernels in instance_traits.hpp to avoid circular dependencies, since device implementation headers will include the reflection headers
- Use compile-time constants and type aliases to expose kernel parameters, enabling zero-overhead introspection
- Provide a templated instance_string() function that generates human-readable kernel configuration strings by serializing all template parameters in order, useful for debugging and kernel identification
- Guard reflection integration with preprocessor definition CK_EXPERIMENTAL_BUILDER to keep it opt-in until the API stabilizes
- Add GetInstanceString() virtual method to BaseOperator, allowing runtime polymorphic access to compile-time kernel information
This infrastructure also enables upcoming higher-level semantic reflection abstractions (like ConvTraits) to query kernel configurations programmatically.
Includes unit tests validating both the trait extraction accuracy and the string generation format.
Add experimental builder infrastructure for composable_kernel
- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.
This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.
* Code drop for gtest to test atomic_add for a tensor
* Adding additional test cases
* Fix clang errors in CI pipeline
* Updated test cases
* Fix the Navi card atomic add problem
* solved the define problem
* add more print out traces
* Fix the float4 missing case
* solved the gfx9 errors
* Address the comment
---------
Co-authored-by: Khushbu <khuagarw@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Configure C++ standard with a CMake variable.
Defaults to C++20, but can be set to C++17 to test backwards compatibility.
* Add validation for allowed C++ standards.
* build CK in rehl8 docker with std=c++17
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Problem
=======
relocation R_X86_64_32 out of range: 5405348154 is not in [0, 4294967295]
Solution
========
The problem was caused due the limitation comes from the 32 bit offsets
used in original DWARF standard.
We have the option to switch to 64bit offset for your libs which free
us from 4G size boundary.
add -gdwarf64 and -Og to avoid this limit.
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG
* tweaks to the miopen specific build. add way to skip clang-tidy checks and a way to skip some custom build targets MIOpen also has.
* move the tidy if statment
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Prepare files for DeviceGemm_Wmma_CShuffleV3
* Implement main part of CShuffleV3 with block pipeline v3 for WMMA
* Remove unused functions and template params for A/B descriptors
* Support both gfx11 and gfx12
* Enable SplitK for gfx12 and disable for gfx11
* Added RowColRow layout for DeviceGemmV2 fp16
* Added more instances for Row, Col, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Row, Row data layout
* Added instances for DeviceGemm_Wmma_CShuffleV3, Col, Col, Row data layout
* Added more instances for DeviceGemm_Wmma_CShuffleV3, Row, Row, Row data layout
* Fix formatting
* Add documentation
Based on e5ad48a784
* Enable gemm_universal profiling for gfx11/12
* Add WMMA intrinsics for F8/BF8
* Support F8/BF8 DeviceGemm_Wmma_CShuffleV3, add basic instances
* Add BF16 instances and tests
* Fix test_gemm_universal_wmma_fp8 by adding CK_USE_WMMA_FP8
---------
Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
* New branch for codegen changes
* Fix verify function for int4
* pk_int4 codegen
* Update to review comments
* Remove codegen directory and rename filenames
* Remove extra files; clean up CMake file
* New branch for codegen changes
* Fix verify function for int4
* pk_int4 codegen
* Update to review comments
* Remove codegen directory and rename filenames
* Remove extra files; clean up CMake file
* code changes for single instance
* config file rename, added few more combinations in json file
* Fix cmake file
* Addressing review comments
* Reverting files changed by merge to develop
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Add runtime check in example_gemm_xdl_streamk for gfx950
* Add runtime check in grouped conv fwd examples for gfx950
* Disable CK_USE_AMD_MFMA_GFX950
* Add new instances for gfx950
* Fix test_gemm_universal on gfx950