mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-06-10 16:28:38 +00:00
2709abee7ee4c836d9932375ea2ac82fa05933ca
3108 Commits
| Author | SHA1 | Message | Date | |
|---|---|---|---|---|
|
|
2709abee7e |
[rocm-libraries] ROCm/rocm-libraries#4577 (commit a36922c)
[CK_TILE] FMHA BWD Launcher Interface (#4577) ## Motivation Reduce memory usage; Be prepared to implement optimizations of reducing nsplits in deterministic cases. ## Technical Details This PR introduces a new launcher interface for the FMHA backward operation, replacing direct function calls with a more structured approach. The launcher encapsulates kernel dispatch logic and provides access to computed metadata like the number of dQ acc splits. **Changes:** - Added `fmha_bwd_launcher` class that wraps kernel execution and exposes `dq_acc_splits` - Moved `fmha_bwd_traits` construction earlier in the execution flow to support launcher initialization - Refactored code generation to produce both legacy API and new launcher constructor ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
94dd8b6955 |
[rocm-libraries] ROCm/rocm-libraries#4958 (commit 713881f)
bf8 and bf16 support for Universal GEMM in Tile Engine (#4958) ## Motivation Currently we have only fp8 and fp16 datatype support for universal GEMM in Tile Engine with this PR support for bf8 and bf16 datatype will be added during the CI phase ## Technical Details Adding bf8 and bf16 support ## Test Plan NA ## Test Result NA ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a49a464f74 |
[rocm-libraries] ROCm/rocm-libraries#5045 (commit 64a5502)
[CK] Address a bunch of errors associated with targeting gfx1200 on Windows (#5045) ## Motivation Still addressing errors that are blocking the merge of TheRock PR: https://github.com/ROCm/TheRock/actions/runs/22545831304/job/65308264096?pr=3382 ## Technical Details 1. There are multiple fmha python scripts that are writing native paths which are confusing cmake. I addressed one of these in an earlier PR https://github.com/ROCm/rocm-libraries/pull/4812 and now I'm addressing more that are exposed with gfx1200 target: ``` [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` 2. In the following compiler error we see gemm_prec_str<ADataType, BDataType> being passed as a function to concat(...), instead of being evaluated with the parenthesis operator(), i.e., gemm_prec_str<ADataType, BDataType>(). There are multiples instances of this, I wonder what non-msvc compilers do here: ``` [composable_kernel] FAILED: [code=1] example/ck_tile/38_block_scale_gemm/CMakeFiles/tile_example_gemm_quant.dir/gemm_bquant_quantgrouped_mx_bf16bf8.cpp.obj [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm/gemm_bquant_quantgrouped_mx_bf16bf8.cpp:4: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/example/ck_tile/38_block_scale_gemm\run_gemm_quant_example.inc:17: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:7: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/concat.hpp:119:21: error: implicit conversion between pointer-to-function and pointer-to-object is a Microsoft extension [-Werror,-Wmicrosoft-cast] [composable_kernel] 119 | ((oss << sep << rest), ...); [composable_kernel] | ^~~~ [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/ops/gemm_quant/kernel/gemm_quant_kernel.hpp:248:16: note: in instantiation of function template specialization 'ck_tile::concat<char, char[11], std::basic_string<char> (), std::basic_string<char>>' requested here [composable_kernel] 248 | return concat('_', "gemm_quant", gemm_prec_str<ADataType, BDataType>, GemmPipeline::GetName()); [composable_kernel] | ^ ``` There are plenty of other places where we use gemm_prec_str with the operator(), so I'm pretty sure these were just typos...but I'd like some eyes on it. 3. There are 2 tests that fail to build on Windows, which I've excluded from the build but will open bug tickets for: 1. gemm_weight_preshuffle 2. grouped_gemm_preshuffle Here's a sample of the compiler error for these tests: ``` [composable_kernel] [16/19] Building HIP object test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] FAILED: [code=1] test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj [composable_kernel] E:\TheRock\build\core\clr\dist\lib\llvm\bin\clang++.exe -DCK_ENABLE_BF16 -DCK_ENABLE_BF8 -DCK_ENABLE_FP16 -DCK_ENABLE_FP32 -DCK_ENABLE_FP64 -DCK_ENABLE_FP8 -DCK_ENABLE_INT8 -DCK_TILE_USE_WMMA=1 -DCK_TIME_KERNEL=1 -DCK_USE_OCP_FP8 -DCK_USE_WMMA -DCK_USE_WMMA_FP8 -DCK_USE_XDL -DDPP_KERNELS -DUSE_PROF_API=1 -D__HIP_PLATFORM_AMD__=1 -D__HIP_PLATFORM_HCC__=1 -D__HIP_ROCclr__=1 -IE:/TheRock/rocm-libraries/projects/composablekernel/profiler/include -IE:/TheRock/rocm-libraries/projects/composablekernel -IE:/TheRock/rocm-libraries/projects/composablekernel/library/include -IE:/TheRock/rocm-libraries/projects/composablekernel/include -IE:/TheRock/build/ml-libs/composable_kernel/build/include -IE:/TheRock/build/base/half/stage/include -isystem E:/TheRock/build/core/clr/dist/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest/include -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/gtest-src/googletest -isystem E:/TheRock/build/ml-libs/composable_kernel/build/_deps/getopt-src/src -O3 -DNDEBUG -std=gnu++20 --offload-arch=gfx1200 -D_DLL -D_MT -Xclang --dependent-lib=msvcrt -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Wno-missing-field-initializers -Wno-error=deprecated-declarations -Wall -Wextra -Wcomment -Wendif-labels -Wformat -Winit-self -Wreturn-type -Wsequence-point -Wswitch -Wtrigraphs -Wundef -Wuninitialized -Wunreachable-code -Wunused -Wno-reserved-identifier -Wno-option-ignored -Wsign-compare -Wno-extra-semi-stmt -Wno-unused-template -Weverything -Wno-c++98-compat -Wno-c++98-compat-pedantic -Wno-conversion -Wno-double-promotion -Wno-exit-time-destructors -Wno-extra-semi -Wno-float-conversion -Wno-gnu-anonymous-struct -Wno-gnu-zero-variadic-macro-arguments -Wno-missing-prototypes -Wno-nested-anon-types -Wno-padded -Wno-return-std-move-in-c++11 -Wno-shorten-64-to-32 -Wno-sign-conversion -Wno-unknown-warning-option -Wno-unused-command-line-argument -Wno-weak-vtables -Wno-covered-switch-default -Wno-unsafe-buffer-usage -Wno-unused-lambda-capture -Wno-nvcc-compat -Wno-c++20-compat -Wno-bit-int-extension -Wno-pass-failed -Wno-switch-default -Wno-unique-object-duplication -fbracket-depth=1024 -Wno-nrvo -Werror -Weverything -fcolor-diagnostics -Wno-c++20-extensions -Wno-global-constructors -Wno-undef -DCK_TILE_USE_OCP_FP8 -MD -MT test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -MF test\ck_tile\grouped_gemm_preshuffle\CMakeFiles\test_ck_tile_grouped_gemm_preshuffle.dir\test_grouped_gemm_preshuffle.cpp.obj.d -o test/ck_tile/grouped_gemm_preshuffle/CMakeFiles/test_ck_tile_grouped_gemm_preshuffle.dir/test_grouped_gemm_preshuffle.cpp.obj -x hip -c E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle.cpp:8: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host.hpp:6: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/host/check_err.hpp:16: [composable_kernel] In file included from E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core.hpp:89: [composable_kernel] E:/TheRock/rocm-libraries/projects/composablekernel/include\ck_tile/core/utility/env.hpp:110:31: warning: 'getenv' is deprecated: This function or variable may be unsafe. Consider using _dupenv_s instead. To disable deprecation, use _CRT_SECURE_NO_WARNINGS. See online help for details. [-Wdeprecated-declarations] [composable_kernel] 110 | const char* vp = std::getenv(name); [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Windows Kits\10\include\10.0.22621.0\ucrt\stdlib.h:1183:20: note: 'getenv' has been explicitly marked deprecated here [composable_kernel] 1183 | _Check_return_ _CRT_INSECURE_DEPRECATE(_dupenv_s) [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:368:55: note: expanded from macro '_CRT_INSECURE_DEPRECATE' [composable_kernel] 368 | #define _CRT_INSECURE_DEPRECATE(_Replacement) _CRT_DEPRECATE_TEXT( \ [composable_kernel] | ^ [composable_kernel] C:\Program Files (x86)\Microsoft Visual Studio\2022\BuildTools\VC\Tools\MSVC\14.44.35207\include\vcruntime.h:358:47: note: expanded from macro '_CRT_DEPRECATE_TEXT' [composable_kernel] 358 | #define _CRT_DEPRECATE_TEXT(_Text) __declspec(deprecated(_Text)) [composable_kernel] | ^ [composable_kernel] clang++: error: clang frontend command failed due to signal (use -v to see invocation) [composable_kernel] AMD clang version 22.0.0git (https://github.com/ROCm/llvm-project.git a2dc42b87c63e686377a69f09ea23aec7550babc+PATCHED:e4d5bf498b7b8626bb9716f1f5a5946d45025918) [composable_kernel] Target: x86_64-pc-windows-msvc [composable_kernel] Thread model: posix [composable_kernel] InstalledDir: E:\TheRock\build\core\clr\dist\lib\llvm\bin [composable_kernel] clang++: note: diagnostic msg: Error generating preprocessed source(s). [composable_kernel] ninja: build stopped: subcommand failed. [composable_kernel FAILED WITH CODE 1 in 238 seconds] ninja: build stopped: subcommand failed. ``` ## Test Plan Wait for internal CI and make sure build compiles locally. ## Test Result Waiting on CI ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
f798c36fdd |
[rocm-libraries] ROCm/rocm-libraries#4943 (commit ea40212)
[CK] Updating CI skip logic (#4943) ## Motivation The CI skip logic has two issues that prevented it from working correctly: 1. **Incorrect file patterns**: After migrating from standalone repo to `rocm-libraries`, file paths now include the `projects/composablekernel/` prefix (e.g., `projects/composablekernel/docs/README.md`). The skip patterns were still checking for paths starting with `docs/`, which never matched. 2. **Incomplete build type support**: Jenkins multibranch pipelines provide different environment variables for PR builds (`$CHANGE_TARGET`, `$CHANGE_ID`) vs branch builds (`$BRANCH_NAME`). The previous logic only compared `HEAD~1..HEAD` for branch builds, which missed changes from multi-commit pushes and didn't properly handle feature branch builds. When CI skipped or ran, there was no visibility into which files triggered the decision, making it difficult to diagnose issues. You can now see which files triggered the CI run. ## Technical Details PR builds: Compares all commits against origin/$CHANGE_TARGET. Feature branch builds: Uses git merge-base to find divergence point from develop and checks all touched files since then. Scheduled develop builds are unaffected. These builds are forced to run from the pipeline parameters. Example log output for PR Builds: <img width="647" height="260" alt="image" src="https://github.com/user-attachments/assets/c8673a81-acb2-4fb2-acbb-1c07b5ab3b69" /> Example log output for Branch Builds: <img width="488" height="287" alt="image" src="https://github.com/user-attachments/assets/fbb17ba7-eb2c-42a4-b820-b2a8b9e479c4" /> ## Test Plan Pre-PR validation (branch builds): Push commits with only documentation changes → CI should skip. I will have to verify this after this PR is merged! Push commits with code changes → CI should run Push commits that modify then revert code → CI should run (catching reverts) Verify debug output clearly shows skip/run decision Post-PR validation (PR builds): Create PR with only doc changes → CI should skip. I will have to verify this after this PR is merged! Create PR with mixed doc + code changes → CI should run and log which files triggered it Verify debug output clearly shows skip/run decision ## Test Result All branch build checks succeeded. All PR build checks succeeded. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
a1bbe89982 |
[rocm-libraries] ROCm/rocm-libraries#4301 (commit 0821c9f)
test: Add umbrella test targets for CK Tile operations (#4301) ## Proposed changes Adds operation-specific umbrella test targets for CK Tile to enable running all tests for a specific operation without running the entire test suite. This improves the development workflow by allowing faster iteration when working on specific operations. ## Motivation Previously, developers working on CK Tile operations could only: - Run individual test executables one at a time - Run global labels (, , ) which test the entire codebase - Build all tests for an operation but had no simple way to run them all This made it cumbersome to validate changes to a specific operation (e.g., GEMM quantization) without either running tests individually or running the entire test suite. ### Documentation - - Comprehensive testing guide with usage examples and implementation details ## Usage Examples # Run all GEMM tests with 256 parallel jobs ninja -j256 ck_tile_gemm_tests # Run all GEMM block scale (quantization) tests ninja -j256 ck_tile_gemm_block_scale_tests # Run all GEMM StreamK tests ninja -j256 ck_tile_gemm_streamk_tests ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3654](https://github.com/ROCm/composable_kernel/pull/3654) 🧑💻 Originally authored by @AviralGoelAMD --------- Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> |
||
|
|
c754ee3df0 |
[rocm-libraries] ROCm/rocm-libraries#5036 (commit 0bee213)
[CK] Switch compiler branch from staging to develop and upgrade sccache. (#5036) ## Motivation Upgrade to official sccache version 0.14, since it now supports hip. Also, switching daily builds from amd-staging to develop compiler branch, since it should be more stable. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
468fbec35d |
[rocm-libraries] ROCm/rocm-libraries#4800 (commit 9dcf0cf)
[CK Profiler] Instance selection for grouped conv profilers (#4800) ## Motivation This PR adds instance selection support for ckProfiler grouped convolution operations (forward, backward data, backward weight), allowing users to run specific kernel instances rather than sweeping all available instances. When profiling or debugging convolution kernels, users often need to test specific kernel configurations without running the full instance sweep. This is particularly useful for: - Debugging a specific failing instance - Profiling a known-best configuration - Quick validation during development ## Technical Details **Features added**: - `--instance <id>` flag to run only the N-th valid instance (0-indexed) - `--list-instances` flag to list all valid instances without running any kernels - Named arguments can appear anywhere on the command line - Best instance index is now printed with results for reference - Python script support via `-ii` / `--instance_index` arguments **Design decisions**: - Named arguments (`--instance`, `--list-instances`) instead of positional to avoid conflicts with existing parameters - Instance index refers to the N-th valid instance (0-indexed), not the global instance index - Auto-disable verification when `--list-instances` is used for fast enumeration - Shared utilities in `profiler_arg_utils.hpp` to deduplicate parsing logic ## Test Plan Manual testing with various scenarios: List all valid instances: ```bash ./bin/ckProfiler grouped_conv_fwd <usual args> --list-instances ``` Run only instance 5: ```bash ./bin/ckProfiler grouped_conv_fwd <usual args> --instance 5 ``` Test cases: - Single instance selection - List instances mode - Out-of-bounds instance index (verified warning messages) - No instance flag (runs all instances - default behavior) - All three operations (fwd, bwd_data, bwd_weight) ## Test Result All test scenarios passed: - Instance selection correctly filters kernel executions - List mode enumerates valid instances without running kernels - Invalid indices produce appropriate warnings without crashing - Default behavior (all instances) unchanged when flags not provided - Consistent behavior across all three grouped convolution operations ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
554e2cae4c |
[rocm-libraries] ROCm/rocm-libraries#5038 (commit 6e74de7)
[CK_BUILDER] Update developer notes in the CK Builder source directories (#5038) ## Motivation This PR updates the developer notes for the CK Tile builder. It captures the current state of the implementation in more detail, and frames the description around the need to have true facade. There is no functional change, only better alignment of developer notes with the current code. This doc clearly explains the current technical debt: that we have created many facades that expose the implementation details. There is an expanded section on reflection that explains how unified reflection will help clarify the unified builder design. Additional changes are just better accounting for the current state of the code, including previously undocumented operations. A few typos and cosmetic issues are cleaned up, too. |
||
|
|
11fa35bea6 |
[rocm-libraries] ROCm/rocm-libraries#4984 (commit 962b047)
[CK_TILE] Reduce Register Spills in Stream-K Reductions (#4984) ## Motivation In CK Tile Stream-K, kernels using one of two non-atomic reduction strategies (i.e., linear, tree) have high register spill count, with the tree reduction generally being worse. These changes act a first step to help decrease the register spill count. ## Technical Details ### Problem 1: Unvectorized access to partials In both the linear and tree reductions, workgroups write partials results to a global buffer; another workgroup will later read this data. When the initial logic to support reading and writing to the partials buffer was added (see https://github.com/ROCm/composable_kernel/pull/3107), the tile distribution encoding used to read from and write to partials matches the register layout for the accumulator of the mfma instruction used for the kernel. Since we do not currently use the transposed register layout for the accumulator, we end with an encoding that is not optimized for writing to HBM. For example: Consider the register layout of the `v_mfma_f32_16x16x32_fp8_fp8` instruction. ```bash ./matrix_calculator.py --architecture gfx942 --instruction v_mfma_f32_16x16x32_fp8_fp8 --register-layout --C-matrix ``` <img width="1113" height="537" alt="image" src="https://github.com/user-attachments/assets/afc8f556-08cc-4224-a6e5-b5edabc5fc02" /> The above shows that threads are responsible for consecutive elements down a column of the C tile. If we use this distribution to read and write to partials with C in row major, then threads are unable to perform vectorized reads and writes. Note: thread 0 is shown in red and thread 1 is shown in green. Since the C-shuffle Epilogue only supports C in row major, reading and writing to partials is highly unoptimized. ### Problem 2: Missed opportunity for SPGR use in tree reduction loop Since the reduction occurs between workgroups, all threads in the workgroup follow the same execution paths in the tree reduction logic, hence various variables should be using SGPRs, but they are not. ### Implemented Solutions 1. Add a new tile distribution encoding that is optimized for accessing partials in HBM. This encoding does not change the data assignment to threads, it merely changes the addresses to which they write/read in the partials buffer. For example, continuing with the `v_mfma_f32_16x16x32_fp8_fp8` instruction, the new encoding would result in threads writing in the following layout: <img width="517" height="342" alt="image" src="https://github.com/user-attachments/assets/93b5e0ea-bafc-47b8-89bb-c40ba75cb202" /> This layout ensures that each thread writes along a row, enabling `buffer_{store|load}_dwordx4` instructions (i.e., vectorized accesses). This helps reduce register usage due to requiring fewer offset calculations. 2. To force SGPR usage in the tree reduction loop, I make use of CK Tile's `amd_wave_read_first_lane` which is a wrapper around `__builtin_amdgcn_readfirstlane`. This helps reduce VGPR spills in the tree reduction. _These changes do not fully eliminate register spills. Future work will aim to further reduce spills. But these changes make good progress._ ## Test Plan Added tests for different warp tile sizes to validate that the new encoding works with different `WarpGemm` variants. ## Test Result All tests pass locally on all gfx9 architectures. Some results for decreases in register spills on gfx942: (BL = baseline) | Kernel | SGPR Spill (BL) | SGPR Spill (new) | SGPR Delta | SGPR % | VGPR Spill (BL) | VGPR Spill (new) | VGPR Delta | VGPR % | |--------|------------------:|------------------:|-----------:|-------:|-------------------:|------------------:|-----------:|-------:| | fp16 linear F/F/F/T 256x256x32 2x2x1 32x32x16 | 223 | 0 | -223 | -100.0% | 21 | 20 | -1 | -4.8% | | fp16 tree F/F/F/T 256x256x32 2x2x1 32x32x16 | 233 | 11 | -222 | -95.3% | 443 | 23 | -420 | -94.8% | | fp8 linear F/F/F/F 256x256x32 2x2x1 32x32x32 | 221 | 3 | -218 | -98.6% | 12 | 6 | -6 | -50.0% | | fp8 tree F/F/F/F 256x256x32 2x2x1 32x32x32 | 230 | 14 | -216 | -93.9% | 396 | 12 | -384 | -97.0% | ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
6a334e5086 |
[rocm-libraries] ROCm/rocm-libraries#4804 (commit 832dd0e)
Add Tile Distribution Encoding Register Mapping debug utility for MFMA / WMMA unification work. (#4804) ## Motivation This PR adds a small utility that allows you to use Tile Distribution Encodings to directly map matrix elements to register locations and vice versa. It can also print forward and backward layout mappings similar to the Matrix Calculator utility. The utility is not meant for index calculations in actual kernels, but rather as a debugging tool and probably for automated verification of the policy structs in the new WMMA / MFMA unification design. ## Technical Details Tile Distribution Encodings are a core part of CK Tile which can define the relationship between register and intrinsic matrix fragment elements. They allow for any mapping based on unmerge and merge transformations. Also, they allow for a special "Repeat" dimensions which acts like an additional matrix dimension and allows for replication of certain matrix elements. The new mapping utility can deal with all aspects. ## Test Plan Since this is a debug utility there is nothing to directly test, but there is an example file that defines four different Tile Distribution Encodings and prints their forward and backward mappings, along with some extra parameters. ## Test Result ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
b834c3a895 |
[rocm-libraries] ROCm/rocm-libraries#5011 (commit b31a678)
[CK] Fix aiter tests in CI (#5011) ## Motivation Updates the CK/AITER CI Docker build to source Composable Kernel either from `ROCm/rocm-libraries` (via sparse-checkout) or directly from `ROCm/composable_kernel`, aiming to make aiter tests reliable in CI. **Changes:** - Added a build arg to toggle fetching CK from `ROCm/rocm-libraries` (enabled by default). - Implemented sparse-checkout + local re-init/commit flow to materialize CK into a local `ck/` directory. - Updated aiter’s CK vendoring step to clone from the locally prepared `ck/` directory. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
481ca26b93 |
[rocm-libraries] ROCm/rocm-libraries#4294 (commit 6601702)
Cleanup and refactoring related to tile loading (#4294) ## Proposed changes Cleanup and refactoring done while implementing mixed precision for fp16/bf16 x fp8 Key changes: - Renamed load_interleaved_pk_type.hpp to load_and_convert_tile.hpp and refactored the API to use consistent naming conventions - Updated load_tile_transpose functions to use output parameters instead of return values for consistency - Removed unused variable declarations and simplified type deduction logic - Define load_tile_with_elementwise to use tuple types explicitly for clarity ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [X] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3505](https://github.com/ROCm/composable_kernel/pull/3505) 🧑💻 Originally authored by @SamiAario-AMD --------- Co-authored-by: Sami Aario <samaario@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
1d1c649c3e |
[rocm-libraries] ROCm/rocm-libraries#4518 (commit dd161dc)
[CK_TILE] Fix CShuffleEpilogue test to use correct GEMM accumulator distribution (#4518) ## Summary The test was using LDS distribution to create the accumulator tile, but CShuffleEpilogue expects the GEMM accumulator distribution that BlockGemm produces. This mismatch caused incorrect data permutation. ## Changes - Use WarpGemmDispatcher to get correct accumulator distribution encoding - Load test input from host-initialized global memory for deterministic verification - Shard tests by data type (FP16, FP8) with gfx950-specific FP8 tests - Extract scale tests into separate target for better organization - Implement exact permutation verification (all unique values appear once) - Reduce tile size from 256x256 to 128x128 to fit in unique fp16 range - Add parameterized test configurations for various warp layouts and MFMA types ## Test plan - [x] Run new cshuffle epilogue tests 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com> --------- Co-authored-by: Claude <noreply@anthropic.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> |
||
|
|
dd51de47ff |
[rocm-libraries] ROCm/rocm-libraries#4313 (commit 080ac66)
[CK] Fix gptoss sink (#4313) ## Motivation This PR removes conditional logic for handling infinity values in the sink mechanism across multiple FMHA pipeline implementations, defaulting sink_size to 0 and adding a constraint in the kernel selection logic. ## Technical Details Changes: Removed __builtin_isinf_sign(sink_v) checks and conditional initialization of LSE accumulators across 7 pipeline files Added default initialization (= 0) for sink_size in 4 argument structs Added F_sink == "f" constraint to kernel compatibility checking ## Test Plan Local test ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> |
||
|
|
d080cfd2dc |
[rocm-libraries] ROCm/rocm-libraries#4873 (commit 580ad4f)
[CK] CK Tile improvements and fixes for depthwise merged convolutions forward (#4873) ## Motivation Performance benchmarks showed that old CK's depthwise merged convolutions are much faster than CK Tile's ones. ## Technical Details After investigation it showed up that the requirement that A/CVectorload is a multiple of gemm's rightmost dimension is too strict in case of processing multiple groups, because if tensor is in NHWGC/NHWGK format, then if C/K is equal to 1, we can use vectorloads on the G dimension, which is added by this PR. Filter5x5 specialization was also added, because some models are using it, it's similar to 3x3, the only difference is the window size. This addition was needed, because of the differences of tensor descriptor transformations betweeen CK and CK Tile. In old CK the case of grouped depthwise 5x5 convs was supported via Default specialization, but in CK Tile that case was not working properly. ## Test Plan Performance was tested by our internal test suite, which contains several DL models. ## Test Result Tests results showed significant performance uplift for depthwise(3x3, 5x5) cases --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> |
||
|
|
41070044bd |
[rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)
Add generate_identity_sequences helper and replace lambdas with named functors (#4828) ## Summary - Add `generate_identity_sequences<N>()` helper that returns `Tuple<Sequence<0>, Sequence<1>, ..., Sequence<N-1>>` - Replace lambdas with named functors in `transform_tensor_descriptor` - Add `unpack_and_merge_sequences` helper functor - Reduces `transform_tensor_descriptor` instantiations from 388 to 32 (92% reduction) ## Motivation Multiple call sites use `generate_tuple([](auto i) { return Sequence<i>{}; }, Number<N>{})` pattern. A named helper reduces lambda instantiations. Additionally, each lambda in `transform_tensor_descriptor` creates a unique closure type, causing the function to be instantiated separately for every call site. Named functors share a single type, so the compiler reuses the same instantiation. ## Changes ### Part 1: generate_identity_sequences helper - Replaces common lambda pattern for generating identity sequences - Each lambda expression creates a unique closure type, causing separate template instantiations at every call site - Named helper shares a single type across all uses ### Part 2: Named functors in transform_tensor_descriptor - Add `unpack_and_merge_sequences` helper to replace lambda in `GetNumOfHiddenDimension` - Use `generate_identity_sequences` in `matrix_padder.hpp` ## Test Plan - [x] Added 7 unit tests: - 4 tests for `generate_identity_sequences` - 3 tests for `unpack_and_merge_sequences` - [ ] Waiting for full CI ## Related PRs This PR merges the functionality from: - ROCm/composable_kernel#3588 (generate_identity_sequences helper) - ROCm/composable_kernel#3589 (Named functors in transform_tensor_descriptor) Part of PR stack for issue #4229 (Reduce CK/CKTile Build Times) **Note:** This PR supersedes #4283, ROCm/composable_kernel#3588 and ROCm/composable_kernel#3589, which can be closed once this is merged. --- 🔁 Imported from [ROCm/composable_kernel#3628](https://github.com/ROCm/composable_kernel/pull/3628) 🧑💻 Originally authored by @tenpercent Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> |
||
|
|
a67aaa1b96 |
[rocm-libraries] ROCm/rocm-libraries#4875 (commit e35e3f2)
[CK] Port non-grouped convolution instances to the grouped kernels (#4875) ## Motivation Port non-grouped convolution instances to the grouped kernels to deprecated older non-grouped implementations. ## Technical Details Add the same instances as non-grouped but using grouped kernel. ## Test Plan test_grouped_convnd_fwd ## Test Result pass ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-724 |
||
|
|
6777af8c60 |
[rocm-libraries] ROCm/rocm-libraries#4821 (commit 9456e0f)
[CK TILE] Refactor MX FLATMM example (#4821) Refactor the MX FLATMM example to support more pipelines across different architectures. This work facilitates the NPI team roadmap. |
||
|
|
fab78539d1 |
[rocm-libraries] ROCm/rocm-libraries#4975 (commit 5bee6f0)
[CK] Add gfx1103 to GPU target lists (#4975) Motivation We need to fix multi-architecture CI build convergence for the gfx110X-all shard (ROCm/TheRock#3499). The gfx110X-all CI shard targets gfx1100–gfx1103, but gfx1103 is missing from CK's default CK_GPU_TARGETS lists. While CK's source code already fully supports gfx1103 (architecture enums, compiler defines, WMMA intrinsics, device detection), the CMake target lists omit it, which prevents standalone builds from including gfx1103 by default. This is a prerequisite for the corresponding TheRock change that adds gfx1100–gfx1103 to the `_ck_supported_gfx_targets` allowlist in ml-libs/CMakeLists.txt. Technical Details Add gfx1103 to the default CK_GPU_TARGETS fallback lists in projects/composablekernel/CMakeLists.txt: - Line 220: comment documenting supported GPU_ARCHS values - Line 227: target list for HIP < 6.3 (non-Windows) - Line 229: target list for HIP 6.3–6.4 (non-Windows) - Line 231: target list for HIP 6.4–6.4.43483 (non-Windows) The newest HIP version block (≥ 6.4.43483) already uses gfx11-generic, which covers all gfx11 family targets including gfx1103, so no change is needed there. No source code changes are required — all architecture-specific support for gfx1103 is already in place: - include/ck/ck.hpp: __gfx1103__ included in __gfx11__ macro - include/ck_tile/core/arch/arch.hpp: GFX1103 enum and device property mappings - include/ck_tile/core/config.hpp: CK_TILE_ARCH_GFX1103 flag - include/ck/host_utility/device_prop.hpp / include/ck_tile/host/device_prop.hpp: is_gfx11_supported() includes gfx1103 Test Plan - Configure CK standalone build with -DGPU_TARGETS="gfx1103" and verify it configures without warnings and compiles successfully. - After the companion TheRock PR lands, verify the gfx110X-all CI shard builds CK and produces a CK-enabled libMIOpen.so matching the structure of other shards (no "gfx110X is not supported by composable kernel" warnings). Test Result I configured with gfx1103 and built with `ninja -j 192` on an in-memory filesystem in 49 minutes. The windows build was successful and took 2 1/2 hours on 192 cores. |
||
|
|
2e92a9ab11 |
[rocm-libraries] ROCm/rocm-libraries#4963 (commit cb6bbf6)
[CK][CK Tile] Fix batched gemm kernel 2 lds (#4963) ## Motivation Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Technical Details Fix 2 lds batched gemm universal gemm call. Disable split k for not valid atomic add instruction size. ## Test Plan CI overall ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
f72b9a3cbd |
[rocm-libraries] ROCm/rocm-libraries#4750 (commit c065793)
[CK_BUILDER] ck builder conv transfer fix (#4750) ## Motivation This PR fixes how CK Builder is validating transfer vector size and adds proper validation for LDS transfer vector size as well. ## Changes: * [__source vector dim__] -- Before this PR the data transfer validation logic didn't allow to set the source vectorized dimension to 1. However there are CK instances that are doing this when the group merging is used. This is used only for `DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle` kernel. * [__valid vector size__] -- Before this PR the validation logic concerned only single instruction maximum vector size. However our buffer loading logic has implemented support for loading more values through multiple buffer instructions. This again was discovered to be used in some of the convolution instances. Thus this behavior was reflected in validation logic. * [__valid LDS vector size__] -- Before this PR the LDS vector size validation was done in the same way as VMEM. This PR adds proper LDS vector size validation based on the available LDS instruction sizes. ## Test Plan Run CK BUILDER conv fwd factories tests ## Test Result All CK BUILDER conv fwd factories work (except DL one & ck tile since they're not yet added now) ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
4cd4f0adee |
[rocm-libraries] ROCm/rocm-libraries#4582 (commit 990a00d)
[CK_Builder] added bwd data kernels to builder factory (#4582) This PR adds bwd data wmma and xdl kernels to the ck builder, their instance and conv traits as well as tests for the above. --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> Co-authored-by: John Shumway <jshumway@amd.com> |
||
|
|
2a7d95aecd |
[rocm-libraries] ROCm/rocm-libraries#4816 (commit 17ff961)
[CK] Add split-K support for ABQuantGrouped in block_scale_gemm (#4816) ## Changes ### Split-K support in `gemm_quant_kernel.hpp` - **`SplitKBatchOffset`**: Added `aq_group_offset` and `aq_k_split_offset` fields (mirroring the existing `bq_*` fields for B) to track each split-K batch's position within the AQ scale tensor. For `ABQuantGrouped`, both offsets are computed from `k_id * KRead` divided by `AQuantGroupSize::kK`. - **`MakeAQBlockWindow`**: Added an `aq_group_offset` parameter (defaulting to 0 for non-split-K paths) so the AQ tensor view's K-group dimension reflects only the remaining K-groups from the split-K offset, consistent with how `MakeBQBlockWindow` handles the BQ tensor. - **`RunGemm`**: Threads the `aq_k_split_offset` through to `MakeAQBlockWindow` when in split-K mode. ### Constraints in `IsSupportedArgument()` Four constraints gate split-K (`k_batch > 1`) for ABQuantGrouped: 1. **Mode check** — split-K is only allowed for `BQuantGrouped` (no preshuffle) or `ABQuantGrouped` (no `APreshuffleQuant`). Any other quant mode with `k_batch > 1` returns `false`. 2. **B quant group alignment** — `KRead` (per-batch K slice) must be divisible by `BQuantGroupSize::kK`. Each batch must operate on complete B quantization groups; a partial group would require splitting a scale value across batches. 3. **A quant group alignment** (new, ABQuantGrouped only) — `KRead` must also be divisible by `AQuantGroupSize::kK` for the same reason applied to the AQ scale tensor. 4. **Minimum 2 K-tile iterations per batch** (new) — The software-pipelined GEMM kernels (CompV3 family) prefetch one tile ahead, so they require `per_batch_num_loop = KRead / KPerBlock >= 2`. When `KRead == KPerBlock` (i.e. each batch is exactly one tile), the prefetch reads into the next batch's memory region and produces incorrect results. Configurations where `K == k_batch * KPerBlock` are therefore rejected. ### Example update (`run_gemm_quant_example.inc`) Updated the comment above the `IsSupportedArgument` call to document that split-K is now supported for both `BQuantGrouped` (no preshuffle) and `ABQuantGrouped` (no `APreshuffleQuant`). ## Unit Tests Two new test files covering decode and prefill tile shapes across a range of `k_batch` values (2–8), data types (FP8, BF8), and quantization group sizes (1×1×128 and 1×128×128 for B): - `test_gemm_quant_abquant_splitk_decode.cpp` — uses the decode tile shape (M=16, N=64, K_tile=256) - `test_gemm_quant_abquant_splitk_prefill.cpp` — uses the prefill tile shape (M=128, N=128, K_tile=128) Each test calls `run_test_with_validation` which runs the kernel and checks correctness against a CPU reference. Configurations excluded from tests are annotated with comments explaining which constraint they violate (typically the `per_batch_num_loop >= 2` requirement). ## Prerequisites This PR depends on #4429, which must be merged before this can be merged. --------- Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> |
||
|
|
d775806672 |
[rocm-libraries] ROCm/rocm-libraries#4431 (commit ca33816)
[CK] updated github repo link (#4431) The location of the github repo has changed; the landing page of the docs needs to reflect this. Updated only the git repo links in the docs folder. Also added info to the install doc about how to do a sparse checkout. Updated some refs that were messed up while I was at it. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
720d7fa02b |
[rocm-libraries] ROCm/rocm-libraries#4592 (commit 45f76cb)
Tile Engine support for gfx950 (#4592) ## Motivation This PR adds support for the gfx950 GPU architecture to the Tile Engine in Composable Kernel library, focusing on GEMM operations with FP8 and BF8 data types. ## Technical Details Added gfx950-specific MFMA warp GEMM implementations with conditional compilation. Updated default GEMM configuration parameters for tile sizes and warp configurations. Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950 hardware. ## Test Plan Tile engine itself is a benchmarking utility, so if it passes the CI it will be tested automatically. ## Test Result Tile engine itself is a benchmarking utility, so if it passes the CI it will be tested automatically. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Thrupti Raj Lakshmana Gowda<ThruptiRaj.LakshmanaGowda@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> |
||
|
|
08e8deb6dc |
[rocm-libraries] ROCm/rocm-libraries#4883 (commit 56347bb)
[CK] Disable test_fmha_fwd_fp8fp16 on gfx90a by default. (#4883) ## Motivation Since gfx90a has no native support for FP8 datatype, all FP8 tests should be disabled there by default. ## Technical Details The test_fmha_fwd_fp8fp16 is the last failing test in CK on gfx90a with staging compiler. ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
743552b6fd |
[rocm-libraries] ROCm/rocm-libraries#4340 (commit 70a312f)
Implement device_grouped_gemm_fixed_nk_bias for RDNA4 (#4340) ## Proposed changes Summary: - Modified implementation for grouped_gemm_fixed_nk_bias - FP16 WMMA examples - WMMA instances - Profiler for grouped_gemm_fixed_nk_bias - Add WMMA instances to existing tests **This PR depends on PR https://github.com/ROCm/rocm-libraries/pull/4299 and should be merged after it. Only the last 6 commits are in the scope of this PR.** ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [x] I have added inline documentation which enables the maintainers with understanding the motivation - [x] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
0bad76b1de |
[rocm-libraries] ROCm/rocm-libraries#4415 (commit b3b4af7)
[CK] Remove duplicated XDL/WMMA tests (#4415) ## Motivation When we started the RDNA4 support, the XDL instances were not supporting WMMA instructions, so we duplicated some tests. In this issue, we simplified most of the duplicated test files into common test files. ## Technical Details The following tests were unified: - `batched_gemm` - `batched_gemm_gemm` - `gemm_add` - `gemm_universal` - `grouped_convnd_bwd_data` The following tests were duplicated exactly, and copied into two files with `_xdl` and `_wmma` suffixes. Now they are unified in one single file without suffix: - `gemm_multi_abd` - `gemm_b_scale` There is still an apparent duplication which is a special case, namely `test_grouped_convnd_bwd_weight_interface_{suffix}` where `{suffix}` is `xdl` or `wmma`. However, the WMMA code relies on an old implementation, and is expected to be removed in the future. In addition, it differs from the XDL implementation significantly. Therefore, it was decided to keep both files separate instead of attempting any unification. ## Test Plan `CMakeLists.txt` files were modified to support the new, unified tests. In particular, testing was done for `gfx90a`, `gfx1201` and `gfx11` architectures. ## Test Result All tests passed successfully on all three tested architectures. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Fernando Jiménez <fernando.jimenez@streamhpc.com> |
||
|
|
63652036b0 |
[rocm-libraries] ROCm/rocm-libraries#4898 (commit 753f2ac)
Create operation support matrix for CK Tile Engine (#4898) Introduce operation support matrix for CK Tile kernels detailing data types, layouts, and GPU targets. ## Motivation The tile engine currently supports a subset of CK Tile operations, but there is no in-repo reference that maps which operations, data types, layouts, and GPU targets are covered by the tile engine versus only available through hand-written examples or tests. This makes it difficult for developers to know what the tile engine already handles, what requires manual integration, and where coverage gaps exist. This PR introduces an operation support matrix as a markdown file in tile_engine/, intended to be maintained as a living document alongside the code. Because it lives in the repository rather than an external wiki or PDF, it can be reviewed and updated in the same pull requests that add or extend tile engine operations, keeping it accurate as coverage evolves. ## Technical Details Documentation only change. ## Test Plan N/A ## Test Result N/A ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
13d1a117ee |
[rocm-libraries] ROCm/rocm-libraries#4872 (commit ca623f7)
[CK] Small improvements for grouped conv backward weight (#4872) ## Motivation Improvements for CK Tile convolution builder run function and atol/rtol calculations. ## Technical Details - Add preprocessing function for wrw when k_batch is larger than 1 for builder run function - Divide num acums by number of groups to get real number of accums ## Test Plan CI wrw tests ## Test Result pending ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. AICK-783 |
||
|
|
0610c3a462 |
[rocm-libraries] ROCm/rocm-libraries#4812 (commit bb5a4dd)
[CK] Use as_posix() instead of str() for paths in fmha_fwd_appendkv.py (#4812) ## Motivation This is causing a failing PR for Windows: https://github.com/ROCm/TheRock/pull/3382 ``` [composable_kernel configure] -- Jenga kernel files to be generated: B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_bf16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psskddv_nlogits_nbias_mask_nskip_nsquant_ntrload.cpp;B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_api.cpp [composable_kernel configure] CMake Error at example/ck_tile/50_sparse_attn/CMakeLists.txt:61 (add_library): [composable_kernel configure] Syntax error in cmake code when parsing string [composable_kernel configure] [composable_kernel configure] B:\build\ml-libs\composable_kernel\build\example\ck_tile\50_sparse_attn\fmha_jenga_fwd_d128_fp16_batch_b128x128x32x128x32x128_r4x1x1_r4x1x1_w32x32x16_w32x32x16_qr_async_vr_psddv_nlogits_nbias_nmask_nskip_nsquant_ntrload.cpp [composable_kernel configure] [composable_kernel configure] Invalid character escape '\b'. ``` ## Technical Details The file: [fmha_fwd_appendkv.py](https://github.com/ROCm/rocm-libraries/compare/users/brockhargreaves-amd/ck/fix-windows-cmake-path-problem?expand=1#diff-bef22bf9ba21eb93c725493ecc7edcb6f2a8f0a9a173dcfca6bda7a9f4eced78) writes a bunch of paths to a text file which is later parsed by cmake. When passing a pathlib.Path to str(), str() converts to a native path, in this case / to \\ on Windows which confuses cmake. In this case we need to write paths with forward slashes and then pass those onward to cmake. ## Test Plan 1. Ensure this doesn't impact existing CI. 2. Ensure compilation of Windows pass locally. ## Test Result 1. Passes existing CI 2. This fixes the compilation error locally. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
6225173e07 |
[rocm-libraries] ROCm/rocm-libraries#4819 (commit b995a0b)
[CK] Fix windows build issues (#4819) ## Motivation Full build on Windows is currently broken due to compiler errors, this PR should help fix that. This is also holding up the following PR in the TheRock: https://github.com/ROCm/TheRock/pull/3382 ## Technical Details 1. I don't see a good reason to be nesting a windows include inside the ck_tile namespace. It was causing compiler errors too: Windows.h comes with min and max, which was conflicting with ck_tile::min and ck_tile::max, so I moved it out. I also defined NOMINMAX to prevent this inclusion in the future. 2. The TRUE/FALSE macros are already used by Windows.h, which causes an error. So I've opted for True/False. You can see this pattern in other rocm-libraries. 3. The M_PI macro isn't available, at least in the WIN32_LEAN_AND_MEAN context, from \<cmath\> on Windows. We'll be able to use std::numbers::v_pi\<float\> when we have C++20 support. 4. There was a missing \<chrono\> include. ## Test Plan Test locally and make sure this doesn't impact existing CI. ## Test Result Compiles locally and passes existing ci. ## Submission Checklist - [ x ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
cb60fdd58d |
[rocm-libraries] ROCm/rocm-libraries#4425 (commit 513cf9f)
[CK] Implement device grouped gemm fixed nk multi abd for rdna4 (#4425) ## Motivation Add support for grouped gemm multi ABD fixed NK. MR ## Technical Details Changes from the reverted PR: - Device struct for grouped gemm with multiple ABD and fixed NK (DeviceGroupedGemm_Wmma_Multi_ABD_Fixed_NK). - Wmma versions of existing example codes: 59_grouped_gemm_multi_ABD - Unit tests for both new wmma implementation and the reference xdl code (previously missing) - Note: Some Xdl instances were commented out because of unit test failures. As mentioned apparently for xdl this feature was missing tests so our assumption is either there is an implemenetation bug or these instances were not set up correctly. Has the potential for a follow-up issue. - Generic ck profiler interface with the purpose of calling unit tests. - Gemm instances with specific elementwise operations for gemm bias gelu calculations. - Added class for grouped gemm multi ABD reference calculations. Fix epilogue selection in device implementation that caused unit test failures ## Test Plan Covered by added unit tests ## Test Result CI successfully passing ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Zoltán Lakatos <zoltan.lakatos@streamhpc.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
d244aaa1c0 |
[rocm-libraries] ROCm/rocm-libraries#4791 (commit 6cc17c6)
[CK][CK TILE] Improve oob check (#4791) ## Motivation Improve OOB checks. Remove permutes which have been generated by thread buffer zero clear. at now in assembly there is only condmask instead of permute + condmask. Change number of KPack for generated instances ## Technical Details Remove permute instructions from assembly ## Test Plan test_grouped_convnd_fwd_tile ## Test Result passed ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: jakpiase <jakpia21@gmail.com> |
||
|
|
aca46a090f |
[rocm-libraries] ROCm/rocm-libraries#4430 (commit 3bcf68c)
[CK] Add project root marker for monorepo compatibility (#4430) ## Summary - Add `.ck-project-root` marker file at the composablekernel project root - Update `find_project_root()` in `script/tools/common.sh` to look for this marker instead of `.git` - Fixes project root detection when CK is part of the rocm-libraries monorepo ## Background Since the project was moved into the monorepo, the `.git` directory is at the monorepo root rather than the CK project root. This caused `find_project_root()` to return the wrong path, breaking tools in `script/tools/`. ## Test plan - [x] Verify `find_project_root` returns correct path from any CK subdirectory - [x] Verify `ck-build --help` works - [x] Verify `ck-configure --help` works Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> |
||
|
|
c28d5c3f74 |
[rocm-libraries] ROCm/rocm-libraries#4295 (commit fa2cfc8)
[CK_TILE] Refactor `UniversalGemm::MakeA/B/C/DBlockViews` to allow caller to pass desciptors directly (#4295) ## Proposed changes Currently `UniversalGemmKernel::MakeA/B/C/DBlockViews` directly create tensor views from strides and sizes. This refactors the descriptor creation out and add overloaded definitions, allowing descriptors to be created separately by the caller instead of passing explicit strides, with no functional changes. This will enable further refactoring of `RunGemm` to do likewise, enabling derived kernels like BatchedContractionKernel to avoid creating separate versions (PR [#3457](https://github.com/ROCm/composable_kernel/pull/3457)). ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion Since the logic within the MakeXBlockviews chains together operations on tuples, and thus the descriptors are also passed as such, adding a template parameter for the type of the input tuple was the simplest option to enable the overload without too much verbiage. However, for `MakeCBlockView` this adds a complications as the templated definitions are prone to overlap. This for now is avoided by just moving the arguments around for the descriptor version, which avoids the collision. It's not a great solution, so feel free to suggest a better one. --- 🔁 Imported from [ROCm/composable_kernel#3467](https://github.com/ROCm/composable_kernel/pull/3467) 🧑💻 Originally authored by @amd-meskelin --------- Co-authored-by: Matti Eskelinen <matti.eskelinen@amd.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> |
||
|
|
023ba6848e |
[rocm-libraries] ROCm/rocm-libraries#4267 (commit 3c5d95e)
[CK_TILE] Extend support of mix precision microscaling BQuant (#4267) ## Proposed changes Supported types combinations using BQuant=e8m0: - A=bf16 - B=bf16,bf8,fp4 Summary: - remove usage of `pk_fp4_raw_t`: consistent with other implementations and avoid taking into account of the packed size explicitly. In general, the raw type should not be used because CK Tile internally takes care of the PackedSize, so using the raw type adds unnecessary complexity to the implementation - handle microscaling by checking for `e8m0` type for BQuant (previous implementation was inconsistent) - add support for scaling instructions in `DequantPack8` - mx pipeline: - extend existing pipeline to support different B types - add support to scale and cast before writing to LDS or after reading from LDS (this can be defined in the `Problem` by the user) - block gemm: - mx pipeline is now using block gemm BQuant - block gemm BQuant can now load from LDS and apply scale and then call block gemm universal operator. This adds new functionalities and remove code duplication - warp gemm: - add case to support 128bit ds_read/write for both A and B when A=16bit and B=8bit - add examples and tests: note that some tests for bf16/fp4 already existed but were removed during previous tests refactoring. I added them again and other relevant tests for new types combinations ## Checklist Please put an `x` into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [ ] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [ ] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [ ] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [ ] I have run `clang-format` on all changed files - [ ] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3689](https://github.com/ROCm/composable_kernel/pull/3689) 🧑💻 Originally authored by @EnricoDeg --------- Co-authored-by: Enrico Degregori <enrico@streamhpc.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Enrico Degregori <73224202+EnricoDeg@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
40552f0ec6 |
[rocm-libraries] ROCm/rocm-libraries#4756 (commit 79bc2ca)
[CK_TILE] Update Stream-K Reduction Strategy Enum (#4756) ## Motivation Currently, Stream-K has 3 reduction options: 1) atomics, 2) The reduction described in the Stream-K paper, and 3) a tree reduction. The reduction strategy described in the original Stream-K paper has the starting workgroup of each tile sequentially accumulating partial results of other contributing workgroups in the tile, which requires a linear number of steps. Hence, for clarity, this works updates the naming of the `StreamKReductionStrategy` enum members to better describe the existing reduction strategy options. ## Technical Details Prior to this change, the enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Reduction = 1u, TreeReduction = 2u }; ``` But, the distinction between `Reduction` and `TreeReduction` is not very clear and has some redundancy. Hence, the updated enum is as follows: ```cpp enum StreamKReductionStrategy : uint32_t { Atomic = 0u, Linear = 1u, Tree = 2u }; ``` All references to `StreamKReductionStrategy` were updated to reflect this change. ## Test Plan No new functionality was added, so no new tests were added; I just validated existing tests and examples. ## Test Result All tests passed locally. ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. |
||
|
|
0f89ace29e |
[rocm-libraries] ROCm/rocm-libraries#4355 (commit e7f6909)
[CK TILE] Refactor sequence_reverse_inclusive_scan (#4355) ## Proposed changes Refactor ck tile `sequence_reverse_inclusive_scan` from recursive to for-loop. Tracking issue: #4229 This pull request introduces a new lightweight array type, `static_array`, and refactors the sequence utilities to use it for improved constexpr support and simplicity. The changes also include updates to the build system to add container-related tests. **Core Library Improvements:** * Added a new header `static_array.hpp` that defines the `static_array` type, a constexpr-friendly array with basic accessors and no custom constructors. * Updated includes in `core.hpp` and `sequence.hpp` to import `static_array`. [[1]](diffhunk://#diff-14b406eccf59794051a16c0c9c1a7e11234324bfdd107a5bbe0f173cd25bcddcR44) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76R7) **Refactoring to Use `static_array`:** * Refactored sequence utilities in `sequence.hpp` to use `static_array` instead of the previously forward-declared `array` type, including in histogram and array generation logic. [[1]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1108-R1133) [[2]](diffhunk://#diff-5042e5b47bb2ba78bbab2d284338cf0503bc8fb76a7d631cc2684ad6ca832a76L1130-R1146) * Rewrote the implementation of `sequence_reverse_inclusive_scan` to use `static_array` for intermediate storage, improving constexpr evaluation and clarity. **Build System and Testing:** * Added a new test subdirectory for container tests and a GoogleTest executable for `unit_sequence.cpp` to the CMake build configuration. [[1]](diffhunk://#diff-5d35ff7555d3f0b438d45cde06b661eb1332cdbec66287ac7ec3c478d688aae5R5) [[2]](diffhunk://#diff-1f54f0d2b431b7fc74f7b4ffb66e80c381c904c3383b1d27987467e3482d6d7aR1-R7) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
9113812816 |
[rocm-libraries] ROCm/rocm-libraries#4704 (commit 17662f9)
[CK_TILE] Fix FP8 MXGEMM numerical error in async load path (#4704) ## Summary Fixes FP8 MXGEMM producing half the expected result (e.g., 128 instead of 256 with all 1s input). **Bug introduced in:** `b7de1e14cea70681a23cd1a136df42910c776e4a` - "[CK_TILE] Add blockscale GEMM support for EightWarps on gfx950 (#4280)" ## Root Cause In the `static_move_ys=true` code path in `tile_window.hpp`, the IMM optimization computes `lds_ys_offset` using a default-constructed tensor descriptor: ```cpp make_tensor_coordinate(decltype(tensor_descriptor){}, idx_ys_offset) ``` This default-constructed descriptor has different strides than the actual DRAM tensor descriptor used for dram_ys_offset. When these offsets are mixed in the address calculation: ```cpp imm_valid = lds_ys_offset % IMM_RANGE; // From wrong descriptor wave_offset = dram_ys_offset - imm_valid; // From correct descriptor ``` The final address wave_offset + imm_valid ≠ dram_ys_offset, causing incorrect memory accesses. Fix ```cpp Set imm_valid = 0 to bypass the IMM optimization and ensure the full offset is passed through wave_offset: constexpr auto imm_valid = 0; // Avoids inconsistency between lds_ys_offset and dram_ys_offset ``` This disables the 12-bit immediate field optimization in the buffer_load_lds instruction but guarantees correctness. A proper fix would require making the DRAM tensor descriptor constexpr, which is not feasible since tensor strides depend on runtime parameters (LDA, LDB). --------- Co-authored-by: ThomasNing <thomas.ning@amd.com> |
||
|
|
725cbb8813 |
[rocm-libraries] ROCm/rocm-libraries#4649 (commit 642e7e3)
[CK] Updated pre-commit entry points (#4649) ## Motivation Pre-commit fails after the transition to the monorepo. This fixes it. ## Technical Details - ## Test Plan Try to commit on CK with pre-commit enabled. ## Test Result Pre-commit should pass. (Scripts are correctly found) ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Signed-off-by: Chris Tsiaousis <chris.tsiaousis@streamhpc.com> |
||
|
|
841e6b89d1 |
[rocm-libraries] ROCm/rocm-libraries#4584 (commit 42efd1d)
[CK_TILE][FMHA] Support gfx11 (#4584) ## Motivation Add support of gfx11 architectures (RDNA3) to FMHA. ## Technical Details Distributions (matrix elements to lane registers mapping) of gfx11 WMMA are completely different from distributions of gfx9 MFMA and gfx12 WMMA. There are two cases in FMHA where this difference matters: * usage of results (matrix C) of one GEMM as input (matrix A) of another GEMM. * random number generation for dropout (implementation for gfx9 MFMA, gfx12 WMMA and host validation produce the same results). Both cases are solved by a special remapping implemented using `__builtin_amdgcn_permlanex16` and `__builtin_amdgcn_perm`. Additional changes: * FMHA tests are now build and run only for those types for which instances exist (gfx11 supports only fp16 and bf16). * Two fixes for uninitialized values (`mask.sink` and `do_fp8_static_quant`): they may contain garbage resulting in incorrect dispatching logic, sometimes tests report that there are no instance available for current parameters. * Small fix to remove expcnt(0) from s_waitcnt instruction on gfx11 when they are not requested (i.e. every time), likely has no effect on performance but makes disassembly a bit clearer. ## Test Plan ``` ninja test_ck_tile_fmha bin/test_ck_tile_fmha_fwd_fp16 bin/test_ck_tile_fmha_fwd_bf16 bin/test_ck_tile_fmha_bwd_fp16 bin/test_ck_tile_fmha_bwd_bf16 ``` ## Test Result All tests must pass (some tests may be skipped). ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
c7c5a018ed |
[rocm-libraries] ROCm/rocm-libraries#4762 (commit 5598eb5)
Revert "[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3" (#4762) Reverts ROCm/rocm-libraries#4638 unfortunately, this PR interfered with the PR#4299 and caused build errors for gfx11: In file included from /rocm-libraries/projects/composablekernel/library/src/tensor_operation_instance/gpu/grouped_gemm_fixed_nk/device_grouped_gemm_wmma_fixed_nk_bf16_bf16_bf16_mk_kn_mn_instance.cpp:7: In file included from /rocm-libraries/projects/composablekernel/library/include/ck/library/tensor_operation_instance/gpu/grouped_gemm/device_grouped_gemm_wmma_fixed_nk_instance.hpp:11: /rocm-libraries/projects/composablekernel/include/ck/tensor_operation/gpu/device/impl/device_grouped_gemm_wmma_fixed_nk.hpp:553:21: error: no matching function for call to 'CheckValidity' 553 | if(!GridwiseGemm::CheckValidity( | ^~~~~~~~~~~~~~~~~~~~~~~~~~~ |
||
|
|
ea790cecfe |
[rocm-libraries] ROCm/rocm-libraries#4705 (commit 845bc39)
[ci] Adding composablekernel to TheRock CI (#4705) Workflow files under `projects/composablekernel/.github/workflows` do not get picked up in GitHub workflows. This will allow composable kernel changes to be build and tested properly CI tests will prove functionality |
||
|
|
981c6adfe5 |
[rocm-libraries] ROCm/rocm-libraries#4638 (commit 305ec71)
[ck] Support VGPR estimate in GridwiseGemm_wmma_cshuffle_v3 (#4638) 1. Add GetEstimateVgprCount to estimate the VGPR usage in GridwiseGemm_wmma_cshuffle_v3 2. Add IsValidCompilationParameter to disable kernel which use too many vgprs. - Currently, the threashold is AvailableVgprCount * 1.25 3. Modify examples to avoid test is disabled on gfx11 It is port from internal repo PR[#192](https://github.com/ROCm/composable_kernel/issues/192) ## Motivation <!-- Explain the purpose of this PR and the goals it aims to achieve. --> ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. --------- Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> |
||
|
|
725cd57d43 |
[rocm-libraries] ROCm/rocm-libraries#4556 (commit 15730e7)
fix: correct ULP calculation in get_absolute_threshold for BF16 tolerance (#4556) ## Motivation BF16 grouped GEMM tests were failing on gfx1201 with errors like: ``` Error: Incorrect results! out[5457621] != ref[5457621]: -66 != -65.5 max err: 0.5, number of errors: 1 ``` The calculated absolute tolerance (atol ~0.26) was too small to account for legitimate hardware vs software BF16 conversion differences (0.5 ULP). ## Changes 1. **Discrete exponent calculation**: Changed from continuous `log2()` to `floor(log2())` to match actual IEEE 754 floating-point exponent levels 2. **Full ULP for output_error**: Changed from 0.5 to 1.0 ULP to account for hardware `__bf16` vs software `float_to_bf16()` conversion differences ## Calculation Example For the failing case with value ~66: **Before (incorrect):** ``` expo = log2(66) = 6.044... atol = 2^(6.044 - 7) * 0.5 = 2^(-0.956) * 0.5 ≈ 0.26 Error 0.5 > 0.26 → Test fails ❌ ``` **After (correct):** ``` discrete_expo = floor(log2(66)) = 6 atol = 2^(6 - 7) * 1.0 = 2^(-1) * 1.0 = 0.5 Error 0.5 ≤ 0.5 → Test passes ✓ ``` The ULP for values in [64, 128) is 2^(-1) = 0.5, and the error of 0.5 is exactly 1 ULP, which is the maximum expected difference between hardware and software BF16 conversions at tie cases. ## Rationale Hardware and software BF16 conversions can differ by up to 1 ULP at tie cases due to different rounding strategies (hardware vs IEEE 754 round-to-nearest-even). The discrete exponent ensures ULP is calculated correctly for all values within an exponent range. **Modified file**: `projects/composablekernel/include/ck_tile/host/check_err.hpp` |
||
|
|
263288a383 |
[rocm-libraries] ROCm/rocm-libraries#4299 (commit 668cd49)
173 implement device grouped gemm fixed nk for rdna4 (#4299) ## Proposed changes This PR adds an RDNA4 implementation of the device_grouped_gemm_fixed_nk instance library using for WMMA. The implementation is based on the existing DeviceGroupedGemm_Xdl_Fixed_NK design and reuses the same high-level structure, but replaces the XDL kernel with a WMMA-based one. It uses the GridwiseGemm_wmma_cshuffle_v3 kernel. At this stage, the focus is functional correctness and compatibility, not performance tuning. ## Technical Details - Device struct for grouped gemm fixed NK - Example code for the WMMA version - Unit tests for both new wmma implementation and the reference XDL code (previously missing) - Generic ck profiler interface with the purpose of calling unit tests. ## Checklist Please put an into the boxes that apply. You can also fill these out after creating the PR. If you're not sure, please don't hesitate to ask. - [x] I have added tests relevant to the introduced functionality, and the unit tests are passing locally - [x] I have added the test to REGRESSION_TESTS list defined at the top of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more than 30 seconds to run. - [ ] I have added inline documentation which enables the maintainers with understanding the motivation - [ ] I have removed the stale documentation which is no longer relevant after this pull request - [x] (If this change is user-facing) I have added release notes which provide the end users with a brief summary of the improvement from this pull request - [x] I have run on all changed files - [x] Any dependent changes have been merged ## Discussion If this is a relatively large or complex change, feel free to start a discussion by explaining why you chose the solution you did and what alternatives you considered --- 🔁 Imported from [ROCm/composable_kernel#3668](https://github.com/ROCm/composable_kernel/pull/3668) 🧑💻 Originally authored by @bidlekm --------- Co-authored-by: Marton Bidlek <marton.bidlek@streamhpc.com> Co-authored-by: Erwin Terpstra <erwin.terpstra@streamhpc.com> Co-authored-by: bidlekm <bidlekmarton@gmail.com> Co-authored-by: assistant-librarian[bot] <assistant-librarian[bot]@users.noreply.github.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> |
||
|
|
e13d1a228d |
[rocm-libraries] ROCm/rocm-libraries#4655 (commit f8d76d1)
Update CMakeLists.txt (#4655) ## Motivation Tile Engine is an internal benchmarking tool and it need not be built everytime which would impact the build time with this PR we are excluding build for stream k operator in Tile Engine. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [x] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
6b9df93342 |
[rocm-libraries] ROCm/rocm-libraries#4652 (commit 39a5a53)
Revert "[CK] Add new fwd conv fp16/bf16 instances optimized for unit group size." (#4652) PR ROCm/rocm-libraries#4275 contains CK fwd conv instances optimized for `gfx950` and they do not compile for other architectures such as `gfx940`. To ensure that the optimized instances are compiled only for `gfx950`, compile-time guard `#if defined(CK_USE_GFX950)` was used. This approach works correctly when we compile for a single architecture, but when we compile simultaneously for multiple architectures, flag `CK_USE_GFX950` is set for non-gfx950 archs as well. As a result, the multi-arch compilation fails. The problem doesn't appear in the ROCm libraries CI/CD pipeline since only one architecture is compiled at a time. Hence, the CI/CD passed for the original PR. Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> |
||
|
|
c49eb518e1 |
[rocm-libraries] ROCm/rocm-libraries#4297 (commit 5ff580c)
moe flatmm xcd remap (#4297) co-authors: @Chi-Chu319 @juuso-oskari Added XCD remapping for flatmm moe <html xmlns:v="urn:schemas-microsoft-com:vml" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"> <head> <meta name=ProgId content=Excel.Sheet> <meta name=Generator content="Microsoft Excel 15"> <link id=Main-File rel=Main-File href="file:///C:/Users/tianxiwu/AppData/Local/Temp/msohtmlclip1/01/clip.htm"> <link rel=File-List href="file:///C:/Users/tianxiwu/AppData/Local/Temp/msohtmlclip1/01/clip_filelist.xml"> <style> <!--table {mso-displayed-decimal-separator:"\."; mso-displayed-thousand-separator:"\,";} @page {margin:.75in .7in .75in .7in; mso-header-margin:.3in; mso-footer-margin:.3in;} tr {mso-height-source:auto;} col {mso-width-source:auto;} br {mso-data-placement:same-cell;} td {padding-top:1px; padding-right:1px; padding-left:1px; mso-ignore:padding; color:black; font-size:11.0pt; font-weight:400; font-style:normal; text-decoration:none; font-family:Arial, sans-serif; mso-font-charset:0; mso-number-format:General; text-align:general; vertical-align:bottom; border:none; mso-background-source:auto; mso-pattern:auto; mso-protection:locked visible; white-space:nowrap; mso-rotate:0;} --> </style> </head> <body link="#467886" vlink="#96607D"> batch | Mixtral (tflops, wip_355) | Mixtral-7B (tflops, our branch) | perf boost -- | -- | -- | -- 64 | 865.424 | 995.455 | 15.0% 256 | 886.336 | 1020.96 | 15.2% 1024 | 890.808 | 1022.53 | 14.8% </body> </html> --- 🔁 Imported from [ROCm/composable_kernel#3161](https://github.com/ROCm/composable_kernel/pull/3161) 🧑💻 Originally authored by @Chi-Chu319 --------- Co-authored-by: Tianxing Wu <chi0chu319@gmail.com> Co-authored-by: Tianxing Wu <tianxing.wu@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> Co-authored-by: systems-assistant[bot] <systems-assistant[bot]@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> |