composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-21 05:19:20 +00:00

Author	SHA1	Message	Date
assistant-librarian[bot]	4d2856612c	Merge commit '4c2c18ef486641d1493f3dc272a1e0e079676308' into develop	2026-01-22 02:55:52 +00:00
Michał Kulikowski	04f7e1fce4	[CK][Examples] Extending support for rdna3/4 part 4: (#3264 ) * [CK][Examples] Extending support for rdna3/4 part 4: -example_gemm_xdl_streamk -example_gemm_xdl_fp16_fp8_v3 -example_gemm_xdl_fp16_v3 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> * [CK][Examples] Revert example\01_gemm\gemm_xdl_streamk parameters change. Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> --------- Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `4c2c18ef48`]	2026-01-21 18:10:16 -08:00
assistant-librarian[bot]	aadd581b8d	Merge commit '1040d9b1f53945867d78d0bbcf03de65ee01aea3' into develop	2026-01-21 18:24:44 +00:00
Robin Voetter	2b54a86c04	[CK_BUILDER] Replace reference conv with old ck implementation (#3604 ) * ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts This information is already in the SIGNATURE, so it's pointless to pass it separately. This streamlines the interface of those functions a bit. Also touches up the style of those files in general. * ck-builder: implement reference conv using old ck The old ck implementation is more featureful and better tested. * ck-builder: replace test_reference_execution reference with old ck This strips out the ck-tile gpu reference implementation completely. * ck-builder: clean up test_reference_execution - Remove unnecessary messages - Replace EXPECT_TRUE(true) with EXPECT_NO_THROW() [ROCm/composable_kernel commit: `1040d9b1f5`]	2026-01-21 19:18:47 +01:00
andrew clark	5a27de45e5	Sanitizing URL-encoded characters from the image file name (#3622 ) [ROCm/composable_kernel commit: `0fbb3bb8c4`]	2026-01-21 11:00:53 -07:00
assistant-librarian[bot]	579d2eb5fb	Merge commit 'f41f37da969d8f0dbcf590b72e5ac8e74e8846b6' into develop	2026-01-21 16:34:17 +00:00
Yi DING	0bb1c90674	Add CMakePresets.json (#3284 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `f41f37da96`]	2026-01-21 08:04:24 -08:00
assistant-librarian[bot]	8fbde9114b	Merge commit 'fcc9372c009c8e0a23fece77b582da83b04a654f' into develop	2026-01-21 02:52:11 +00:00
Yi DING	a0935f7669	[CK_TILE] Fix Int32 Overflow in Deterministic FMHA BWD (#3615 ) [ROCm/composable_kernel commit: `fcc9372c00`]	2026-01-21 09:54:46 +08:00
assistant-librarian[bot]	b2c76ff10f	Merge commit 'd5ae81b2922773f7cdf4a02a2e1fd57d0e4df851' into develop	2026-01-20 22:14:29 +00:00
Erwin Terpstra	b079841b10	Implement batched gemm add relu gemm add for rdna4 (#3391 ) * wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation * wip: many fixes in implementation of batched gemm gemm multiple d * wip: batched gemm gemm multiple d gridwise op compiling, not working yet * fix: incorrect d0 grid indexing in batched gemm gemm multiple d * feat: add instances for batched gemm add relu gemm add * chore: configure instance with low vector transfer size for odd sizes * chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense * fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes * fix: disable odd size tests on XDL archs * chore: removed temporary logging * chore: update some references to C tensor to E tensor * Tentative fix for example template params * Tentative fix for non-multi-D batched gemm gemm device impl. * Tentative fix for xdl example template params * Tentative fix for profiler build on gfx90a * chore: improve device batched gemm gemm multi D comment to include all ops and dimensions * chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply * fix: make the gemm1 data types match what happens in the device op * feat: add d0s/d1s datatypes and layouts to the device op type string * chore: change element-wise op so addition happens in fp32 * chore: add static asserts for gemm0/gemm1 calculated wave sizes * chore: also updated other element-wise ops to use fp32 calculations * chore: log number of supported instances * chore: update instance comment * chore: disable kernel timing in example by default * fix: gemm1 wave size calculation * fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions * chore: remove increased tolerance in batched gemm gemm multiple d example * chore: add comment explaining that verification fails for certain input values * chore: clarify instance comment --------- Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com> [ROCm/composable_kernel commit: `d5ae81b292`]	2026-01-20 13:06:59 -08:00
assistant-librarian[bot]	5f61470a1f	Merge commit '91b4102a59c6013d3faeb54f250cf577b2f129ce' into develop	2026-01-20 19:35:23 +00:00
Max Podkorytov	8b842250da	Add persistent async input scheduler for GEMM kernels (#3520 ) Add signal-based synchronization for persistent GEMM kernels where input data becomes available incrementally. Uses modulo wraparound (like PyTorch's AsyncMM) for chunk index calculation: chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks Key components: - PersistentAsyncInputScheduler struct with tiles_per_chunk_m, chunk_signals, tile_idx_pivot_m, and num_chunks fields - wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency - IsSupportedArgument validation for scheduler parameters - Example demonstrating async input scheduling with simulated producer - GTest unit tests covering all layout combinations [ROCm/composable_kernel commit: `91b4102a59`]	2026-01-20 10:37:09 -08:00
assistant-librarian[bot]	58bb88f499	Merge commit '8f75869408210cb85e9eb7ff639c4c9dad1331cb' into develop	2026-01-20 18:17:53 +00:00
Linjun-AMD	e227e837be	Revert "[CK_TILE][FMHA] Add new tile size for async (#3586 )" (#3613 ) This reverts commit 217ac48fd83deef3d0d5084815689e8c79958cc1. [ROCm/composable_kernel commit: `8f75869408`]	2026-01-20 09:40:54 -08:00
Estevan Vedovelli	8e5475654b	Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions (#3598 ) * Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions Enables hipTensor to access the WMMA HW functionalities for these combinations of datatype on gfx11 and gfx12. * Fix change to contraction scale tests * Fix clang-format [ROCm/composable_kernel commit: `7d8bca7ddc`]	2026-01-20 09:39:57 -08:00
assistant-librarian[bot]	6a0cbcb01d	Merge commit '4d58c70e6cf76ce6cb40aa6035ebccbb28493f71' into develop	2026-01-20 17:18:34 +00:00
Cong Ma	364ad3d521	[CK TILE GEMM] Add bf8 support to tile engine streamk generator (#3543 ) [ROCm/composable_kernel commit: `4d58c70e6c`]	2026-01-20 10:01:33 -07:00
assistant-librarian[bot]	a7320b9717	Merge commit '6300ad3c62298dc6fdddfcf19ecd074f7f08fa96' into develop	2026-01-20 16:18:17 +00:00
music-dino	750bd72b3d	Batched gemm softmax gemm descriptor fix (#3564 ) * Add rocm to prefix path for codegen * Fix issue with c0_matrix_mask construction [ROCm/composable_kernel commit: `6300ad3c62`]	2026-01-20 07:25:30 -08:00
assistant-librarian[bot]	43058803dc	Merge commit 'b09121f86066381f3662fdbdee6a810849a8a1a7' into develop	2026-01-20 10:16:09 +00:00
Wojciech Laskowski	6ad65bc855	WMMA support for batched_gemm_reduce (#3332 ) Summary: - added new device impl of Batched GEMM Reduce for WMMA - added instance library - added WMMA impl to the Batched GEMM Reduce tests [ROCm/composable_kernel commit: `b09121f860`]	2026-01-20 10:50:46 +01:00
assistant-librarian[bot]	38c7251ed1	Merge commit '0727e85e523aac7a1e82af00f44081cc67f5cde0' into develop	2026-01-20 06:20:32 +00:00
Bartłomiej Kocot	85c5741492	[CK_BUILDER] Add grouped conv fwd ck tile profiler (#3518 ) * [BUILDER] Add grouped conv fwd ck tile profiler * [CK TILE] Fix grouped conv kernels splitk and double lds * Updates * Fixes * Move to ckProfiler * Fixes * fix * fix * Change instances to empty list by default * fix * fix * Update grouped_convolution_signatures.hpp * Update grouped_convolution_forward_tile_algs.hpp * [CK TILE] Add grouped convolution forward tests (#3556) * [CK TILE] Add grouped convolution forward tests * fix jenkins * fixes * comments fixes * unit test * unit test fix * Move instances outside builder * fix includes * clang format fix * readme fix * fix includes * fixes [ROCm/composable_kernel commit: `0727e85e52`]	2026-01-19 22:29:01 -07:00
assistant-librarian[bot]	895404d62b	Merge commit '0517d43d312356c62cc33bea4f0ecc5613e87079' into develop	2026-01-20 00:37:44 +00:00
Cong Ma	c42cd28370	[CK TILE] remove dependency on std chrono (#3599 ) * [CK TILE] remove dependency on std chrono * Apply suggestions from code review Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `0517d43d31`]	2026-01-19 15:31:02 -08:00
Linjun-AMD	ecda0fe2e9	[CK_TILE][FMHA] Add new tile size for async (#3586 ) * add new tile size for async Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f3aafb9555`]	2026-01-19 15:22:33 -08:00
assistant-librarian[bot]	b5bde883eb	Merge commit '98abfa4ade0f7b5204adf4da00e95be9453dce74' into develop	2026-01-19 21:13:18 +00:00
Max Podkorytov	8bd33c4a35	Optimize clang-format check in Jenkins CI (#3597 ) This change improves the clang-format CI check to be faster and not depend on git being available in the build environment. Changes: - Use `find` instead of `git ls-files` (no git dependency) - Check all C++ files: .h, .hpp, .cpp, .h.in, .hpp.in, .cpp.in, *.cl - Exclude build/ and include/rapidjson directories - Use parallel processing with 8 cores (-P 8) for ~8x speedup - Show only errors with unified diff format (-u) - Clear error messages: "ERROR: <file> needs formatting" - Preserve original logic: run clang-format only when RUN_CPPCHECK=false, or run both clang-format and cppcheck when RUN_CPPCHECK=true Performance: - Sequential processing: ~93 seconds for 5,899 files - Parallel with 8 cores: ~12 seconds for 5,899 files - Per-file processing time: ~15ms This reduces CI time while maintaining code formatting standards. [ROCm/composable_kernel commit: `98abfa4ade`]	2026-01-19 12:23:06 -08:00
assistant-librarian[bot]	17b4f104b2	Merge commit '66d6a1cfa6807866487becc87cba95a0965f51f9' into develop	2026-01-19 16:15:25 +00:00
dependabot[bot]	ae64f66966	Bump rocm-docs-core[api_reference] from 1.31.2 to 1.31.3 in /docs/sphinx (#3602 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.2 to 1.31.3. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.2...v1.31.3) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `66d6a1cfa6`]	2026-01-19 07:41:59 -08:00
assistant-librarian[bot]	3d4bb495f8	Merge commit '1a6d1b59ef7358e4f07afcc0a163af7aa4b985a9' into develop	2026-01-19 10:16:14 +00:00
Adam Osewski	a9ff38bc89	[CK_BUILDER] Convolution forward transfer concepts. (#3535 ) * Rename member variable to better reflect its actual meaning. * Add transfer checks for conv fwd xdl. * Validate tensor layouts & vector size conv fwd v3. * Add combined transfer concepts. * Add transfer concepts for conv fwd factories. * Fix clang format * Add helper instruction to get max mem vector instruction width. * Apply review comments. * Rename thread cluster access(->arrange) order concept * Fix merge artifacts. * Add generic access order limits into block transfer concept. [ROCm/composable_kernel commit: `1a6d1b59ef`]	2026-01-19 10:54:10 +01:00
assistant-librarian[bot]	e60d79a9a1	Merge commit 'fe40a5d13941b64162cffce9496d1d94a90f80a5' into develop	2026-01-17 08:14:43 +00:00
Erwin Terpstra	9c660bfbe3	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method [ROCm/composable_kernel commit: `fe40a5d139`]	2026-01-17 08:30:27 +01:00
assistant-librarian[bot]	9c4010cd17	Merge commit 'f9104ef9b3b794f8e02757cbf2935818f5389dac' into develop	2026-01-17 00:38:39 +00:00
Cong Ma	487f1beee9	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584 ) [ROCm/composable_kernel commit: `f9104ef9b3`]	2026-01-16 15:27:39 -08:00
assistant-librarian[bot]	73b0cfde4e	Merge commit '3f735c127b8e78b702a31e19cb6e0e35eda3588a' into develop	2026-01-16 19:13:41 +00:00
Johannes Graner	b12d70ae04	[CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594 ) * Fix large case init bounds * Revert "Fix large case init bounds" This reverts commit `1abca05c6f`. * Restore CPU initialization for do_verification != 2 [ROCm/composable_kernel commit: `3f735c127b`]	2026-01-16 10:56:53 -08:00
logicat	fb918acff9	Remove unnecessary hip_fp16 include from stream_config (#3549 ) [ROCm/composable_kernel commit: `fec81109f1`]	2026-01-16 10:40:05 -08:00
John Shumway	0b3ee64c89	Disable CK Builder for SLES15 in Jenkins CI (#3581 ) 1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder 2. Added a detailed comment explaining why this is necessary: - SLES15 is a legacy platform with limited C++20 ecosystem support - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 do not reliably support all C++20 features required by the experimental CK Builder [ROCm/composable_kernel commit: `2d233c838a`]	2026-01-16 10:36:23 -08:00
spolifroni-amd	f7614e006b	CK Tile: fix some issues (#3557 ) * Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `427d4fb9e9`]	2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda	f9ff023328	Fixing GEMM Multi D on Tile Engine (#3583 ) [ROCm/composable_kernel commit: `de8ee379ad`]	2026-01-16 10:17:21 -08:00
assistant-librarian[bot]	d9030f5343	Merge commit '644cdbe3c92f9af16067e539edb4a13e6b9e7c86' into develop	2026-01-16 02:52:08 +00:00
John Shumway	d4990deb79	Merge pull request #3573 from ROCm/jshumway/builder-readme [ROCm/composable_kernel commit: `644cdbe3c9`]	2026-01-15 17:55:04 -08:00
assistant-librarian[bot]	7b405e44b0	Merge commit '086a1f8861ef8c81db854e7f2749458b69121617' into develop	2026-01-15 17:20:33 +00:00
Max Podkorytov	f6d1bb77e0	Add LLM-agnostic Docker and build analysis tools (#3576 ) This commit introduces utility tools for building, testing, and analyzing Composable Kernel. The tools are designed to be LLM-agnostic and can be used with any AI assistant or directly from the command line. Tools Added: ============ 1. ck-docker - Docker container management - Start/stop ROCm-enabled containers - Build targets with CMake + Ninja - Run tests with gtest filters - Auto-detect GPU targets (gfx950, gfx942, etc.) - Per-user, per-branch container naming to avoid conflicts 2. ck-build-analysis - Build time profiling - Uses Clang's -ftime-trace for compilation analysis - Aggregates statistics across multiple trace files - Identifies template instantiation bottlenecks - Generates detailed Markdown reports with: * Compilation phase breakdown * Top expensive instantiations * Template family analysis * Data-driven optimization recommendations - Configurable granularity (1µs to 500µs) - PEP 723 compliant Python script with auto-dependency management via uv Key Features: ============= - LLM-agnostic design (works with any AI assistant) - Zero-configuration setup with automatic dependency installation - Comprehensive documentation in script/tools/README.md - Security hardening (input validation, no command injection) - Multi-file trace aggregation for accurate build analysis - Jinja2-based report generation for customizable output Implementation: =============== - script/tools/ck-docker - Main Docker orchestration script - script/tools/ck-build-analysis - Build analysis orchestration - script/tools/common.sh - Shared utilities (container mgmt, GPU detection) - script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer - script/tools/templates/ - Jinja2 templates for report generation - script/tools/README.md - Comprehensive documentation Directory Structure: ==================== script/tools/ ├── README.md # Main overview ├── README_ck-docker.md # ck-docker documentation ├── README_ck-build-analysis.md # ck-build-analysis documentation ├── ck-docker # Docker orchestration script ├── ck-build-analysis # Build analysis orchestration ├── common.sh # Shared utilities ├── analyze_build_trace.py # Python analyzer (PEP 723) └── templates/ └── build_analysis_report.md.jinja # Report template The tools follow Unix philosophy: do one thing well, compose easily, and work from both CLI and programmatic contexts. [ROCm/composable_kernel commit: `086a1f8861`]	2026-01-15 08:30:23 -08:00
assistant-librarian[bot]	f0f4dbbffc	Merge commit 'f57395689b92ca1f644e6e549e763f6c293ced22' into develop	2026-01-15 16:19:30 +00:00
dependabot[bot]	fcdc0f7fee	Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `f57395689b`]	2026-01-15 07:49:06 -08:00
Michal Kulikowski	eb0080ab85	[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation -Fixing args num in ck examples 68/69 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> [ROCm/composable_kernel commit: `e1f2a44096`]	2026-01-15 16:43:02 +01:00
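
The "Add persistent async input scheduler for GEMM kernels (#3520)" entry gives the chunk-index formula `chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks`. The host-side sketch below models only that arithmetic, assuming non-negative integer tile indices and positive sizes. The function name `chunk_index` is hypothetical, it is not CK's API, and it does not reproduce how `PersistentAsyncInputScheduler` evaluates the formula on the device (the commit names the pivot field `tile_idx_pivot_m`).

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical host-side model (not CK's API) of the formula quoted in #3520:
//   chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
// Integer division groups consecutive, pivot-shifted tiles into chunks;
// the modulo wraps indices past the last chunk back around to chunk 0.
constexpr int32_t chunk_index(int32_t tile_idx,
                              int32_t tile_idx_pivot,
                              int32_t tiles_per_chunk,
                              int32_t num_chunks)
{
    // Assumed preconditions: non-negative indices, positive sizes.
    assert(tile_idx >= 0 && tile_idx_pivot >= 0);
    assert(tiles_per_chunk > 0 && num_chunks > 0);
    return ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks;
}

// Worked example: 8 tiles, 2 tiles per chunk, 4 chunks, pivot of 2 tiles.
static_assert(chunk_index(0, 2, 2, 4) == 1, "tiles 0-1 map to chunk 1");
static_assert(chunk_index(2, 2, 2, 4) == 2, "tiles 2-3 map to chunk 2");
static_assert(chunk_index(4, 2, 2, 4) == 3, "tiles 4-5 map to chunk 3");
static_assert(chunk_index(6, 2, 2, 4) == 0, "tiles 6-7 wrap to chunk 0");
```

With a pivot of 2 tiles, the first tiles land in chunk 1 instead of chunk 0, and the last two tiles wrap around to chunk 0. This is the wraparound behavior the commit compares to PyTorch's AsyncMM.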


4050 Commits