composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-17 09:08:35 +00:00

Author	SHA1	Message	Date
Linjun-AMD	ecda0fe2e9	[CK_TILE][FMHA] Add new tile size for async (#3586 ) * add new tile size for async Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update example/ck_tile/01_fmha/codegen/ops/fmha_fwd.py Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `f3aafb9555`]	2026-01-19 15:22:33 -08:00
Max Podkorytov	8bd33c4a35	Optimize clang-format check in Jenkins CI (#3597 ) This change improves the clang-format CI check to be faster and not depend on git being available in the build environment. Changes: - Use `find` instead of `git ls-files` (no git dependency) - Check all C++ files: .h, .hpp, .cpp, .h.in, .hpp.in, .cpp.in, *.cl - Exclude build/ and include/rapidjson directories - Use parallel processing with 8 cores (-P 8) for ~8x speedup - Show only errors with unified diff format (-u) - Clear error messages: "ERROR: <file> needs formatting" - Preserve original logic: run clang-format only when RUN_CPPCHECK=false, or run both clang-format and cppcheck when RUN_CPPCHECK=true Performance: - Sequential processing: ~93 seconds for 5,899 files - Parallel with 8 cores: ~12 seconds for 5,899 files - Per-file processing time: ~15ms This reduces CI time while maintaining code formatting standards. [ROCm/composable_kernel commit: `98abfa4ade`]	2026-01-19 12:23:06 -08:00
dependabot[bot]	ae64f66966	Bump rocm-docs-core[api_reference] from 1.31.2 to 1.31.3 in /docs/sphinx (#3602 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.2 to 1.31.3. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.2...v1.31.3) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.3 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `66d6a1cfa6`]	2026-01-19 07:41:59 -08:00
Adam Osewski	a9ff38bc89	[CK_BUILDER] Convolution forward transfer concepts. (#3535 ) * Rename member variable to better reflect its actual meaning. * Add transfer checks for conv fwd xdl. * Validate tensor layouts & vector size conv fwd v3. * Add combined transfer concepts. * Add transfer concepts for conv fwd factories. * Fix clang format * Add helper instruction to get max mem vector instruction width. * Apply review comments. * Rename thread cluster access(->arrange) order concept * Fix merge artifacts. * Add generic access order limits into block transfer concept. [ROCm/composable_kernel commit: `1a6d1b59ef`]	2026-01-19 10:54:10 +01:00
Erwin Terpstra	9c660bfbe3	Implement batched gemm bias permute for RDNA4 (#3534 ) * feat: test setup for batched contraction (aka batched gemm multiple d e permute) * wip: device struct for WMMA batched contraction multiple d based on new gridwise op * feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases * fix: failure to resolve template parameters when calling new function overload * fix: passing reference type as parameter instead of underlying types * fix: merge error caused duplicate definitions * fix: make sure constness of template and parameters types match * fix: don't compile batched contraction test on unsupported architectures * feat: add example for new wmma implementation, and consolidate example code between platforms * style: return inline instead of with branch * chore: add extra assert on vector memory access sizes * chore: clean up some unused variables * fix: correct tail number calculation, added small cases and extra instances to the test * fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method [ROCm/composable_kernel commit: `fe40a5d139`]	2026-01-17 08:30:27 +01:00
Cong Ma	487f1beee9	[CK TILE QUANT GEMM] use OverrideADataType in aquant pipeline (#3584 ) [ROCm/composable_kernel commit: `f9104ef9b3`]	2026-01-16 15:27:39 -08:00
Johannes Graner	b12d70ae04	[CK Profiler] Restore CPU tensor initialization when verification is not done on GPU (#3594 ) * Fix large case init bounds * Revert "Fix large case init bounds" This reverts commit `1abca05c6f`. * Restore CPU initialization for do_verification != 2 [ROCm/composable_kernel commit: `3f735c127b`]	2026-01-16 10:56:53 -08:00
logicat	fb918acff9	Remove unnecessary hip_fp16 include from stream_config (#3549 ) [ROCm/composable_kernel commit: `fec81109f1`]	2026-01-16 10:40:05 -08:00
John Shumway	0b3ee64c89	Disable CK Builder for SLES15 in Jenkins CI (#3581 ) 1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder 2. Added a detailed comment explaining why this is necessary: - SLES15 is a legacy platform with limited C++20 ecosystem support - While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 does not reliably support all C++20 features required by the experimental CK Builder [ROCm/composable_kernel commit: `2d233c838a`]	2026-01-16 10:36:23 -08:00
spolifroni-amd	f7614e006b	CK Tile: fix some issues (#3557 ) * Adding CK Tile documentation * Updates based on feedback * Fix tile window API description * Fix remaining images * add documentation about flush_cache and rotating_buffer functionality in ck_tile * Supplement the documentation * light edit of the ck tile conceptual doc --------- Co-authored-by: Vidyasagar <vanantha@amd.com> Co-authored-by: AviralGoelAMD <aviral.goel@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `427d4fb9e9`]	2026-01-16 10:34:44 -08:00
Thrupti Raj Lakshmana Gowda	f9ff023328	Fixing GEMM Multi D on Tile Engine (#3583 ) [ROCm/composable_kernel commit: `de8ee379ad`]	2026-01-16 10:17:21 -08:00
John Shumway	d4990deb79	Merge pull request #3573 from ROCm/jshumway/builder-readme [ROCm/composable_kernel commit: `644cdbe3c9`]	2026-01-15 17:55:04 -08:00
Max Podkorytov	f6d1bb77e0	Add LLM-agnostic Docker and build analysis tools (#3576 ) This commit introduces utility tools for building, testing, and analyzing Composable Kernel. The tools are designed to be LLM-agnostic and can be used with any AI assistant or directly from the command line. Tools Added: ============ 1. ck-docker - Docker container management - Start/stop ROCm-enabled containers - Build targets with CMake + Ninja - Run tests with gtest filters - Auto-detect GPU targets (gfx950, gfx942, etc.) - Per-user, per-branch container naming to avoid conflicts 2. ck-build-analysis - Build time profiling - Uses Clang's -ftime-trace for compilation analysis - Aggregates statistics across multiple trace files - Identifies template instantiation bottlenecks - Generates detailed Markdown reports with: * Compilation phase breakdown * Top expensive instantiations * Template family analysis * Data-driven optimization recommendations - Configurable granularity (1µs to 500µs) - PEP 723 compliant Python script with auto-dependency management via uv Key Features: ============= - LLM-agnostic design (works with any AI assistant) - Zero-configuration setup with automatic dependency installation - Comprehensive documentation in script/tools/README.md - Security hardening (input validation, no command injection) - Multi-file trace aggregation for accurate build analysis - Jinja2-based report generation for customizable output Implementation: =============== - script/tools/ck-docker - Main Docker orchestration script - script/tools/ck-build-analysis - Build analysis orchestration - script/tools/common.sh - Shared utilities (container mgmt, GPU detection) - script/tools/analyze_build_trace.py - PEP 723 compliant Python analyzer - script/tools/templates/ - Jinja2 templates for report generation - script/tools/README.md - Comprehensive documentation Directory Structure: ==================== script/tools/ ├── README.md # Main overview ├── README_ck-docker.md # ck-docker documentation ├── README_ck-build-analysis.md # ck-build-analysis documentation ├── ck-docker # Docker orchestration script ├── ck-build-analysis # Build analysis orchestration ├── common.sh # Shared utilities ├── analyze_build_trace.py # Python analyzer (PEP 723) └── templates/ └── build_analysis_report.md.jinja # Report template The tools follow Unix philosophy: do one thing well, compose easily, and work from both CLI and programmatic contexts. [ROCm/composable_kernel commit: `086a1f8861`]	2026-01-15 08:30:23 -08:00
dependabot[bot]	fcdc0f7fee	Bump rocm-docs-core[api_reference] from 1.31.1 to 1.31.2 in /docs/sphinx (#3577 ) Bumps [rocm-docs-core[api_reference]](https://github.com/ROCm/rocm-docs-core) from 1.31.1 to 1.31.2. - [Release notes](https://github.com/ROCm/rocm-docs-core/releases) - [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md) - [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.31.1...v1.31.2) --- updated-dependencies: - dependency-name: rocm-docs-core[api_reference] dependency-version: 1.31.2 dependency-type: direct:production update-type: version-update:semver-patch ... Signed-off-by: dependabot[bot] <support@github.com> Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com> [ROCm/composable_kernel commit: `f57395689b`]	2026-01-15 07:49:06 -08:00
Michal Kulikowski	eb0080ab85	[CK][Examples] Fixing stride issues in ck examples 14/65/68/69 by workaround - Bypassing hostTensor validation - Fixing args num in ck examples 68/69 Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com> [ROCm/composable_kernel commit: `e1f2a44096`]	2026-01-15 16:43:02 +01:00
Yung-sheng Tu	97f2fa2912	Implement device_gemm_universal_preshuffle_instance for RDNA4 (#3429 ) * add device_gemm_wmma_cshuffle_v3_b_preshuffle.hpp * add examples * add instances to test * remove duplicate code between examples [ROCm/composable_kernel commit: `6df2d70143`]	2026-01-15 07:19:31 -08:00
Estevan Vedovelli	09d084bfb4	Fix error when building with -DCMAKE_BUILD_TYPE=Debug (#3541 ) Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `e30207985a`]	2026-01-15 09:35:24 -05:00
Jeff Huang	445ec888ba	[FMHA] Enable page size 16 for batch prefill kernel (#3568 ) * [FMHA] Enable page size 16 for batch prefill kernel * Refactor batch prefill KV offset logic to simplify template arguments - Remove redundant `kLog2PageSize` and `kIsVTileFitsInPage` from template args. - Add static assert to forbid `page_size=1` with vectorized layout. [ROCm/composable_kernel commit: `993d3e2f0e`]	2026-01-15 22:11:44 +08:00
John Shumway	753043b27a	[CK_BUILDER] Convert convolution traits to a struct with factory functions (#3547 ) * Factor helpers out of conv_traits.hpp * Create a non-templated conv_traits struct * Migrate to new instance-specific instance_to_conv_traits functions * Clean up reflection concepts * Clean up ConvTraits helpers * Update testing for convolution traits This is a lot of cleanup on tests to have verbose coverage of feature extraction, explicit tests for each supported device kernel, and simple, readable test code. * Address reviewer comments and resolve merge conflict [ROCm/composable_kernel commit: `5122637215`]	2026-01-15 10:03:21 +01:00
John Shumway	c756e421db	Update README.md files to match recent code changes This is mostly adjustments to enum values so that the docs align correctly with the current code. Also updated the calendar scope of the project to extend through March 2026. [ROCm/composable_kernel commit: `df7ee270a6`]	2026-01-15 02:15:29 -05:00
Illia Silin	3827441343	add aiter test_batch_prefill and simplify jenkins file a bit (#3570 ) [ROCm/composable_kernel commit: `8705fdcb0c`]	2026-01-14 14:07:47 -08:00
Emily Martins	8661ee5a16	Disable CK Tile Stream-K reduction tests (#3559 ) The test_ck_tile_streamk_reduction test suite seems to have transient failures; hence, we are disabling these tests for now. We will re-enable them once the bug is resolved. [ROCm/composable_kernel commit: `7f912909ca`]	2026-01-14 14:02:21 -07:00
John Shumway	c744f9015e	[CK_BUILDER] Update owners file for more reviews for CK Builder (#3572 ) Adding owners permissions for two leading developers on the CK Builder subproject to help with reviews on that project, especially in the EU time zones. Remove aska-0096, who has left AMD [ROCm/composable_kernel commit: `f08fb3f748`]	2026-01-14 12:43:55 -08:00
Bartłomiej Kocot	8c72adabeb	Disable ActiveWorkgroupsPerCU for different arch in wmma kernels (#3566 ) [ROCm/composable_kernel commit: `a346cfa960`]	2026-01-14 12:37:12 -08:00
Bartłomiej Kocot	9aea6a52ed	Fix grouped conv bwd data wmma check (#3562 ) [ROCm/composable_kernel commit: `a07c8e38bd`]	2026-01-14 11:04:37 -08:00
Khushbu Agarwal	7da4e47a5f	[CK_Tile] Support for group size 128 for Preshuffle quant for 2d block scale gemm (#3462 ) * formatted * formatted * formatting * formatting * formatting * [CK TILE GEMM] Refactor block_scale_gemm examples - Split cpp file to reduce building time - Support multiple GemmConfig * [CK TILE GEMM] Refactor block_scale_gemm examples - Update Readme * enable prefill shapes * [CK TILE GEMM] Refactor block_scale_gemm examples - Add support for rowcol and tensor GEMM operations * [CK TILE GEMM] Refactor block_scale_gemm examples - Update README * adding preshuffle quant as new parameter and its associated new files * remove debugging statements * adding test * enable preshuffle quant with permuteN * updating readme and corresponding gemmconfigs * updating cmake file * fixing CI failures for grouped quant gemm * debugging permuteN * debugging * debugging PermuteN * initial commit * resolving merge conflicts * adding test cases * initial commit with prints * debugging * fine-grained working * debugging medium grained * fixing the tile window * formatting * enabling prefill shapes * working prefill shapes * formatted * clean up * code cleanup * bug fix after merging with develop * G128 working for both prefill and decode shapes for preshufflequant * clean up after merging with develop * fixing group 64 for decode shapes * non preshufflequant working for group size 128 * enable preshuffleb and preshufflequant with various group sizes * reduce build time by splitting example into diff datatype files * Adding tests for preshuffleQuant * address review comment * fix for gfx1201 * compile time fix for gfx1201 * clang formatted --------- Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com> [ROCm/composable_kernel commit: `118afa455c`]	2026-01-14 10:00:19 -08:00
Ville Pietilä	2eb573a0e2	Build CK on Windows (#3458 ) * CMakeLists.txt hack for Windows. * Add Windows build instructions. * Fix type issue with variadic min function. * Use std::common_type to fix the variadic min/max functions. * Enable CPU guard compilation on Windows. * Suppress warnings related to std::getenv on Windows platform. * Git ignore the output directory on Windows platform. * Powershell script for running tests and generating reports. * Improve test logging. * Disable non-conv tests. * Fix Debug build on Windows. * More debug build changes. * Update Windows build instructions. * Enable all tests. * Test fixes. * Suppress not found linker options warning. * Update unsigned long literals and format specifiers to work correctly in Windows * Fix conv 3D bwd weight bilinear tests on Windows. * Revert changes on .gitignore. * Clean-up CMake project file for Windows builds. * clang-format * Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms. * Fix building examples on Windows. * Update Readme. * Remove the suppression of the deprecated warnings. * Remove Windows specific min/max implementations from CK Tile math core. * Remove unnecessary no-op on Windows. --------- Co-authored-by: User <user@example.com> Co-authored-by: Ville Pietilä <none> Co-authored-by: John Afaganis <john.afaganis@amd.com> Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `1fc5a3f3ac`]	2026-01-14 07:31:45 -08:00
Johannes Graner	b313b8eaea	[CK] Refactor GPU verification kernel to gather error stats on GPU (#3551 ) * Refactor GPU verification kernel to gather error stats on GPU * Check if result is all zero * non-negative error count doesn't need custom Atomics * Remove unnecessary AtomicMaxFloat function * Simpler warp reduction, remove passed flag * Move verification header to include * Fix header path in test * Fix block reduction loop [ROCm/composable_kernel commit: `f173642087`]	2026-01-14 16:04:50 +01:00
Johannes Graner	923923ac8d	[CK Profiler] Initialize tensors on GPU in CK profiler (#3550 ) * Initialize tensors on GPU in CK profiler * Kick CI [ROCm/composable_kernel commit: `3ccb15ea02`]	2026-01-14 16:04:14 +01:00
Linjun-AMD	75ea587550	[CK_TILE][FMHA] Enable gpt-oss sink (#3490 ) * Enable gptoss sink Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * add gptoss sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update CHANGELOG.md Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix test args error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update test_fmha_fwd.cpp * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Revert "update sink test" This reverts commit `970b4f1686`. * update sink test Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update valid sink_v in splitkv pipeline Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * fix lse error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix clangformat error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix aiter scale error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_pipeline_qr_ks_vs.hpp * div scale_s for sink_value Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * update sink_value with bias Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Fix typo in dropout parameter in fmha_batch_prefill_kernel * Update block_fmha_batch_prefill_pipeline_qr_ks_vs_async.hpp * Update example_fmha_fwd.cpp * Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs_async_trload.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/fmha/pipeline/block_fmha_fwd_splitkv_pipeline_nwarp_sshuffle_qr_ks_vs.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * optimized some code Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * fix splitkv error Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * update sink reference Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> * Update fmha_fwd_runner.hpp * Update smoke_test_fwd_sink.sh --------- Signed-off-by: Linjun-AMD <Jun.Lin@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com> [ROCm/composable_kernel commit: `717ed0b59f`]	2026-01-14 21:32:06 +08:00
Enrico Degregori	ad907f8d54	Add support for direct store in epilogue and padding support for wave transfer without transpose (#3465 ) - Add support for direct store in epilogue instead of cshuffle - Add padding support for wave transfer without transpose - Add wave transfer with interleaved layout to support direct store - Enable new functionalities on GEMMs - Add optional new functionality support for grouped convolution fwd - Add some fast instances for grouped convolution fwd with new functionalities (proper tuning needed) [ROCm/composable_kernel commit: `693ff3bbb3`]	2026-01-14 11:02:19 +01:00
Thrupti Raj Lakshmana Gowda	e231cfb3dc	[CK TILE ENGINE] CI fix for Basic Tile Engine (#3554 ) * memory op changes * memory op changes * Fixing TILE_ENGINE_BASIC in Tile Engine * Removing gfx90a from Tile Engine Run * [CK TILE ENGINE] increasing ci configs for BASIC case * Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `51027474af`]	2026-01-13 16:20:30 -08:00
Thomas Ning	0c8c232a0a	Shuffle fix for gfx950 (#3491 ) * solve compiler issue * solve the gfx950 mfma shuffle regression * refactor jenkinsfile to handle arch name better * [CK TILE] set divisor to count of thread along k dimension * fix the compiler error * solve degradation * Finish the multiplies fix * fix the scales * solve compilation error * solve the composes * solve the error of tile sweeper * fix the test and example * fix for gfx950 --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com> Co-authored-by: Cong Ma <congma13@amd.com> [ROCm/composable_kernel commit: `00c46785a8`]	2026-01-13 09:21:29 -08:00
Ville Pietilä	e40687bfc3	[CK_BUILDER] Add bwd weight factories (#3509 ) * Add placeholder test. * Initial conv bwd weight factory. * Conv builder test refactoring. * Add missing pieces to bwd weight factory. * Improve compile time error message when no matching factory is found. * Use macro to ensure automatic matching between concepts and their string representations. * Improve compile time diagnostics. * Small improvements. * Improve missing member/wrong type compile-time errors. * Improve compile time diagnostics. * Concept bug fixes. * Remove debug assert. * Update algorithm signature diagnostics. * Factory bug fixes. * First functional version of bwd weight conv factory. * Refactor handling of GEMM-K batch template parameter in conv bwd weight factory. * Concept improvements. * Improve concept diagnostics. * Introduce a common size type for concepts. * Update compiletime diagnostics to use the size type. * Update conv specialization enum. * Fix fwd conv builder tests. * Fix smoke tests. * Separate bwd weight and bwd data tests into separate targets. * Clean-up CK Tile builder tests. * Add bwd weight XDL CShuffle V3 factory. * Build conv bwd weight v3 instances successfully. * Add instance traits for DeviceGroupedConvBwdWeight_Xdl_CShuffleV3. * Test fix. * Add instance traits for bwd weight algorithms. * Add unit tests for instance strings. * Build new instance traits unit tests but exclude WMMA for now. * Added factory for DeviceGroupedConvBwdWeightTwoStage_Xdl_CShuffle. * Conv bwd weight DL factory. * Final implementation for bwd weight DL factory. * Add test for creating DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle instance. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Xdl_CShuffle * Treat ref algorithm the same way as real algorithms in the dispatcher. * Refactor large tensor support and WMMA configuration. * Add factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffleV3. * Update Readme. * Fix WMMA bwd weight tests. * Added factory and tests for DeviceGroupedConvBwdWeightTwoStage_Wmma_CShuffleV3. * Factory and tests for DeviceGroupedConvBwdWeight_Wmma_CShuffle. * Dispatching for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffle. * Add factory for DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 * Fix DeviceGroupedConvBwdWeightMultipleD_Wmma_CShuffleV3 factory and compute types for input and output tensor in bwd weight convs. * Fix fwd factories after refactoring. * clang-format * Move compile-time diagnostics to a separate branch. * Fix ref algorithm dispatching. * Fix smoke tests. * clang-format * Fix factory for regular WMMA conv bwd weight. * Clarify builder Readme. * Remove obsolete test file. * Fix test after merge. * clang-format * Remove the C++26 extensions. * Unify conv elementwise ops and layout definitions for fwd and bwd directions. * Remove old layout and elementwise ops. * Unify handling of conv tensor types between fwd and bwd directions. * Unify block transfer for fwd and bwd directions. Rename ThreadSliceDim to ThreadClusterRank. * Make BlockTransferDescriptor concept parametrized. Introduce a common TileTransferParameters concept for conv algorithms. * clang-format --------- Co-authored-by: Ville Pietilä <> [ROCm/composable_kernel commit: `9908a87c31`]	2026-01-13 18:12:38 +02:00
Po Yen Chen	18b676b24c	fix incorrect List import in reduce_parameter.py (#3555 ) [ROCm/composable_kernel commit: `710fa1fd3d`]	2026-01-13 20:03:05 +05:30
Erwin Terpstra	d69aeffd0d	Implement grouped gemm tile loop for RDNA4 (#3304 ) * feat: grouped gemm tile loop support for RDNA4 * fix: removed extra parameter from grouped gemm example instance * fix: FP8 check incorrectly enabling FP8 on RDNA3 [ROCm/composable_kernel commit: `eb041079a3`]	2026-01-13 07:14:23 +01:00
Jeff Huang	0d13ef7329	[CK Tile] Fix FMHA LSE calculation and potential division by zero (#3326 ) This commit addresses numerical stability issues in the BlockFmhaPipelineQRKSVS pipeline when bias has -inf masking values: 1. Explicitly handle the case where the accumulated exponential sum (l) is zero. In this case, the LSE is now correctly set to negative infinity, preventing log(0) errors. 2. Extend the zero-check protection in the normalization step to cover the ELEMENTWISE_BIAS case, preventing potential division by zero. [ROCm/composable_kernel commit: `141f77aa12`]	2026-01-13 13:52:26 +08:00
Jeff Huang	99b88be5fb	[FMHA] Support page_size=1 (linear layout) in batch prefill pipeline (#3545 ) - Enable page_size=1 support in batch prefill codegen (linear layout only). - Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly. - Relax `kPageBlockSize` alignment assertion for the page_size=1 case. [ROCm/composable_kernel commit: `c9f112b026`]	2026-01-13 12:04:43 +08:00
ZheWang	0a2c5c6262	fix mxfp8-gemm example failure (#3531 ) Co-authored-by: ZheWang <zhewan@amd.com> [ROCm/composable_kernel commit: `a575acb245`]	2026-01-13 10:26:45 +08:00
Aviral Goel	d4718f5f31	WIP: extract MakeALdsDescriptor() from child to parent class for code readability (#3392 ) Co-authored-by: Thomas Ning <Thomas.Ning@amd.com> [ROCm/composable_kernel commit: `5aaa031350`]	2026-01-12 09:51:58 -08:00
Aviral Goel	3096269434	refactor: remove Default scheduler implementation as it is not used anymore (#3542 ) * refactor: remove Default scheduler implementation as it is not used anymore * refactor: remove dead code from gemm universal kernel * chore: add descriptive comments about amd intrinsic hardware sync instructions * fix: label existing memory pipeline for aquant as intrawave [ROCm/composable_kernel commit: `e809861d49`]	2026-01-12 09:51:06 -08:00
Johannes Graner	c89e55681e	[CK profiler] Perform verification on GPU when using GPU reference (#3482 ) * Simple verification kernel for ckProfiler * Verification kernel unit tests * Explicit synchronization * Address review comments [ROCm/composable_kernel commit: `18c2ff6019`]	2026-01-12 12:12:41 +01:00
kabrahamAMD	706a75f6d9	addressed review comments from PR3459 (#3526 ) Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `20f66c1e6b`]	2026-01-12 09:47:00 +01:00
Robin Voetter	feabf9c026	ck-builder: tensor input/output reflection (#3536 ) This adds some utilities to automatically generate UniqueInputs, UniqueOutputs, alloc_inputs, alloc_outputs, and validate, based on a Inputs::reflect() and Outputs::reflect(). [ROCm/composable_kernel commit: `b352a68606`]	2026-01-12 09:45:53 +01:00
yadaish	684ebd42da	moe fp8 blockscale use nt (#3524 ) * nt on fp8 blockscale * some improve and tests needs to be fixed * update * fix format * revert useless change * revert any change in amd_buffer_coherence [ROCm/composable_kernel commit: `32408c8bc0`]	2026-01-12 10:48:10 +08:00
damien-lejeune	693548d8b2	Dlejeune/ck tile 2d multiple reductions (#3147 ) * WIP * Add Unit tests for the Multi Reduction Kernel * clang format * Rename multiblock to threadwise * Multiblock WIP * Fix multi reduce multi block unit tests * Multi Reduce Tile Engine: WIP * refactoring + try addressing precision error * Fix multiops examples * Cleanup * Clean up tile engine's reduce op * Update changelog * Fix remod/clang * Fix dates * Fix documentation & missing file * Fix comments * Use the update_tile api in the multi-block kernel * Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests * Add TilePartitioner * Cleanup * Add warning when no data to process, in the example * Refactoring Reduce kernel Tile Partitioner + cleanup * Move the tile partitioner to its own file * Add missing includes * Fix copyright header with update_amd_copyright_headers.py * Fix change of interface in Reduce2dProblem --------- Co-authored-by: Damien Lejeune <damien.lejeune@amd.com> Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com> [ROCm/composable_kernel commit: `4216d43da8`]	2026-01-09 11:16:37 +01:00
Robin Voetter	a77b9e56fd	[CK_BUILDER] Debug utilities (#3528 ) * ck-builder: make toString to_string We are using snake case for CK-Builder * ck-builder: add debug.hpp with tensor descriptor printing function This adds some initial functionality to debug.hpp, a header which will be used to house some debug utilities. * ck-builder: abstract nd-iteration Abstracting this makes it easier to test, clearer, and allows us to use it elsewhere (such as in debug.hpp soon) * ck-builder: tensor printing * ck-builder: rename INT32 to I32 This makes it more in line with the other data type definitions. [ROCm/composable_kernel commit: `e3884bbf05`]	2026-01-08 10:14:13 +01:00
Thrupti Raj Lakshmana Gowda	011d705947	Removing memop from chshuffle (#3530 ) [ROCm/composable_kernel commit: `770a14494e`]	2026-01-07 23:34:43 -08:00
Johannes Graner	c427b9ba2a	[CK] Allow tensors larger than 2GB in grouped conv bwd weight (#3169 ) * Take split_k into account when checking 2GB tensor limit. * Revert "Take split_k into account when checking 2GB tensor limit." This reverts commit `adf35c91be`. * Optimize grouped conv bwd wei split_k off calc (cherry picked from commit 2115642ee59050dabd81393c1b8f03b34adc05aa) * Update gridwise_gemm_xdl_cshuffle_conv_v3.hpp (cherry picked from commit 900d4d4b466f5730ae1189370d3c96267c35ea69) * Fix tensor descriptors and stride calculations * Don't miss half of the elements * Fix buffer size calculations * Disable hack if stride not divisible by k_batch * Clean up comments * Disallow hack in non-contiguous edge cases * Index -> Dim * Fix broken test * Refactor applicability checks into separate function * fix missed variable name * Fix variable name in info print * update V3 2GB check * No more regression, use templates instead * Code deduplication * Regression fix for cshuffle * arch-guarded atomic_add implementations for gfx11 * Similar for half(4\|8)_t as well * Only use both offset hacks at the same time * Revert "arch-guarded atomic_add implementations for gfx11" This reverts commit `3883fe6935`. This reverts commit `5311ec608d`. * Reapply "arch-guarded atomic_add implementations for gfx11" This reverts commit `1972adeddc`. * Only remove float4 atomic_add * Refactor to single flag * Consolidate template parameters * Consolidate flag in transformers --------- Co-authored-by: Bartlomiej Kocot <barkocot@amd.com> [ROCm/composable_kernel commit: `ee2c35b92d`]	2026-01-08 08:02:02 +01:00
Bartłomiej Kocot	5b70f71374	[CK TILE] Fix grouped conv kernels splitk and double lds (#3527 ) [ROCm/composable_kernel commit: `bc497beffb`]	2026-01-08 07:59:38 +01:00
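
The FMHA LSE fix listed above (#3326) describes two guards: setting the LSE to negative infinity when the accumulated exponential sum `l` is zero, and skipping the normalization divide in that case. The following is a minimal scalar sketch of that idea in Python, not the CK Tile implementation; the function name `softmax_row_with_lse` and the choice to substitute 0 for a `-inf` row maximum are assumptions made for illustration.

```python
import math

NEG_INF = -math.inf


def softmax_row_with_lse(scores):
    """Softmax and log-sum-exp for one attention row (illustrative only).

    A fully masked row (every score is -inf) yields an exponential sum of
    zero; the guards below avoid log(0) and division by zero in that case.
    """
    m = max(scores)
    # Assumption: use 0 as the reference maximum for a fully masked row so
    # that (-inf) - (-inf) never produces NaN.
    m_safe = 0.0 if m == NEG_INF else m
    p = [math.exp(s - m_safe) for s in scores]
    l = sum(p)  # accumulated exponential sum
    if l == 0.0:
        # Guard 1: LSE is -inf instead of evaluating log(0).
        # Guard 2: skip the divide by l and emit zeros.
        return [0.0] * len(scores), NEG_INF
    return [x / l for x in p], m_safe + math.log(l)
```

Calling `softmax_row_with_lse([NEG_INF, NEG_INF])` takes the guarded branch, while a partially masked row such as `[0.0, NEG_INF]` follows the normal path.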


2909 Commits