Mirror of https://github.com/ROCm/composable_kernel.git
Synced 2026-05-14 10:09:41 +00:00
35df1d1b79203bf5274ecf7e1aaef27368da42db
2566 Commits
35df1d1b79
[CK TILE] Grouped conv fwd split image (#2970)
* Refactor split-image implementation: simplify code and remove redundant variables
* Add padding debug output to split-image implementation
- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing
* Fix sign comparison warning after rebase with origin/develop
- Cast blockIdx from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane
* Fix Split-N with groups bug and clean up unused parameters
- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, the stride needs to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N
* Implement recursive queue-based split-image detection and calculation
- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper
Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).
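The binary split described above (10GB→5GB+5GB, 61→30+31) can be sketched as a tiny helper. This is a minimal illustration, not the CK Tile API; `split_extent` is a hypothetical name.

```cpp
#include <cassert>
#include <utility>

// Split a spatial extent into LEFT and RIGHT halves; for odd sizes the
// extra element goes to the RIGHT piece (e.g. 61 -> 30 + 31).
std::pair<long, long> split_extent(long extent)
{
    const long left  = extent / 2;       // rounds down
    const long right = extent - left;    // picks up the remainder
    return {left, right};
}
```

Applied recursively, this is what turns one oversized dimension into a tree of pieces whose sizes differ by at most one element.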
* WIP: Split-Image investigation - found architecture mismatch
- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers
Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.
* Add 1D split-image implementation for grouped convolution (N=1 working)
Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.
Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)
Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)
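The LEFT/RIGHT bookkeeping above can be sketched as follows. This is a hedged host-side illustration of the idea only: the `Piece` struct, `split_w`, and the offset formula `w_half * stride - pad_left` are assumptions drawn from the commit description, not the actual CK Tile data structures.

```cpp
#include <cassert>

struct Piece
{
    long w_out_begin; // first output column computed by this piece
    long w_out_len;   // number of output columns
    long pad_left;    // padding kept by this piece
    long pad_right;
    long in_offset;   // physical element offset into the input (padding has no memory backing)
};

// Binary W-split for a 1D convolution: LEFT keeps the left padding, RIGHT
// keeps the right padding, and RIGHT's physical offset is where its first
// output column starts reading the input.
void split_w(long w_out, long pad_left, long pad_right, long stride,
             Piece& left, Piece& right)
{
    const long w_half = w_out / 2;
    left = {0, w_half, pad_left, 0, 0};
    // Output column o reads input starting at o * stride - pad_left, so the
    // RIGHT piece begins at w_half * stride minus the padding it no longer sees.
    right = {w_half, w_out - w_half, 0, pad_right, w_half * stride - pad_left};
}
```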
Test Results (N=1):
✅ 94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)
Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)
Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper
This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.
* Add basic split-image implementation for 1D/2D/3D grouped convolution
This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.
Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)
Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification
Threshold: 100KB (configurable in transformer)
Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors
Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination
* Refactor split-image to unified structure for 1D/2D/3D
Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.
Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic
Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
- 1D: stride = G * C
- 2D: stride = W_in * G * C
- 3D: stride = H_in * W_in * G * C
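The dimension-specific strides listed above are just the element distance between adjacent rows of the outermost spatial dimension in the channel-last layouts. A sketch (the helper name is hypothetical):

```cpp
#include <cassert>

// Element stride of the split (outermost spatial) dimension for the
// channel-last grouped layouts: NWGC (1D), NHWGC (2D), NDHWGC (3D).
long split_dim_stride(int spatial_dims, long H, long W, long G, long C)
{
    switch(spatial_dims)
    {
    case 1: return G * C;         // NWGC: step to the next W position
    case 2: return W * G * C;     // NHWGC: step to the next H row
    case 3: return H * W * G * C; // NDHWGC: step to the next D slab
    default: return 0;
    }
}
```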
Test results (all passing):
- 1D: 36/36 tests ✅
- 2D: 20/20 tests ✅
- 3D: 28/28 tests ✅
- Total: 84/84 (100%)
All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations
* Add safety checks for split-image in all dimensions
Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.
Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds
If checks fail, falls back to normal kernel launch.
Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback
Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.
Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations
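The three safety conditions can be folded into one predicate; when it returns false the invoker falls back to a normal launch. A sketch under stated assumptions: the boundary formulas follow the standard convolution index mapping `o * stride + k * dilation - pad_left`, and the function name mirrors but is not the actual `is_possible_to_split_*` helpers.

```cpp
#include <cassert>

// Safety predicate for halving one spatial dimension of a convolution.
// out_len / in_len: output / input extent of that dimension (e.g. Wo / W).
bool is_possible_to_split(long out_len, long in_len, long stride,
                          long dilation, long filter, long pad_left)
{
    if(out_len <= 1) // can't split a single output element
        return false;
    const long half = out_len / 2;
    // RIGHT piece must start at or after the left padding boundary.
    const long right_in_begin = half * stride - pad_left;
    if(right_in_begin < 0)
        return false;
    // LEFT piece's last filter tap must land inside the input.
    const long left_in_end = (half - 1) * stride + (filter - 1) * dilation - pad_left;
    if(left_in_end >= in_len)
        return false;
    return true;
}
```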
* Fix Split-N + Split-Image compatibility issue
Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.
Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
(e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset
Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset
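The per-batch formula from step 3 is simple enough to state as host-side arithmetic. A sketch, with a hypothetical helper name; in the real kernel this runs inside `operator()` per workgroup, with `batch_stride` computed from the original (pre-split) dimensions:

```cpp
#include <cassert>
#include <cstdint>

// Element offset of one batch's LEFT or RIGHT piece:
//   ptr = base + batch * batch_stride + spatial_offset
// batch_stride uses the ORIGINAL spatial dimensions; spatial_offset selects
// the piece within each batch.
std::int64_t piece_offset(std::int64_t batch_idx, std::int64_t batch_stride,
                          std::int64_t spatial_offset)
{
    return batch_idx * batch_stride + spatial_offset;
}
```

Applying `spatial_offset` here, per batch, instead of once to the base pointer in the invoker is exactly what makes Split-N and Split-Image compose.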
Changes:
- grouped_convolution_forward_kernel.hpp:
* Added spatial_offset_in/out fields to KernelArgs
* Apply batch + spatial offset in operator()
- grouped_convolution_forward_invoker.hpp:
* Keep base pointer, store spatial offset in kargs
* Copy batch_stride from temp_kargs (has original dimensions)
- transform_conv_fwd_to_gemm.hpp:
* Add debug output for split-image calculation
Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20
* Implement unified threshold for Split-N and Split-Image
This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.
Key Changes:
============
1. Transformer (transform_conv_fwd_to_gemm.hpp):
- Moved TwoGB constant to public section for unified access
- CalculateSplitImage() now takes no parameters
- Uses internal threshold: TwoGB / sizeof(CDataType)
- Calculates offsets using N_ (after Split-N) for correctness
2. Kernel (grouped_convolution_forward_kernel.hpp):
- GetSplitImageInfo() simplified to take no parameters
- Forwards to transformer's CalculateSplitImage()
- Clean interface with unified threshold internally
3. Invoker (grouped_convolution_forward_invoker.hpp):
- Removed redundant threshold calculation
- Simplified to call kargs.GetSplitImageInfo() with no params
- Clean early-return pattern (no unnecessary else blocks)
- Removed duplicate/dead code paths
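The unified threshold boils down to one constant and one conversion to elements. A minimal sketch of the idea (the constant name follows the commit; the surrounding class is omitted):

```cpp
#include <cassert>
#include <cstdint>

// Single source of truth: a 2 GB descriptor limit, expressed in elements of
// the output data type so every component compares against the same number.
constexpr std::int64_t TwoGB = std::int64_t{1} << 31;

template <typename CDataType>
constexpr std::int64_t element_threshold()
{
    return TwoGB / static_cast<std::int64_t>(sizeof(CDataType));
}
```

Because the threshold lives in one place, neither the kernel's `GetSplitImageInfo()` nor the invoker needs it as a parameter.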
Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values
Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios
* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)
The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.
Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass
The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach
* Implement recursive split-image with depth limit (MAX_DEPTH=10)
Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
1. Piece size below threshold (optimal case)
2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10
This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.
Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.
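The queue-based splitting with its two stopping conditions can be sketched as below. This is a simplified model, not the CK Tile `SplitPiece`/BFS code: real pieces also carry padding flags and per-dimension extents.

```cpp
#include <cassert>
#include <queue>
#include <vector>

struct SplitPiece
{
    long offset; // cumulative element offset of this piece
    long extent; // spatial extent of this piece
    int  depth;  // recursion depth (root = 0)
};

// Queue-based binary splitting with two stopping conditions: the piece fits
// under the threshold, or the depth limit is hit (at most 2^max_depth pieces).
std::vector<SplitPiece> split_recursively(long extent, long threshold, int max_depth = 10)
{
    std::vector<SplitPiece> done;
    std::queue<SplitPiece> todo;
    todo.push({0, extent, 0});
    while(!todo.empty())
    {
        SplitPiece p = todo.front();
        todo.pop();
        if(p.extent <= threshold || p.depth >= max_depth)
        {
            done.push_back(p); // launched as-is even if still above threshold
            continue;
        }
        const long left = p.extent / 2;
        // LEFT inherits the parent offset; RIGHT accumulates parent + local.
        todo.push({p.offset, left, p.depth + 1});
        todo.push({p.offset + left, p.extent - left, p.depth + 1});
    }
    return done;
}
```

The depth cap guarantees termination: even a pathological extent produces at most 1024 pieces at `max_depth = 10`.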
* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification
* Add comment explaining MAX_DEPTH capacity for 2GB threshold
* Refactor: move recursive split-image logic to transformer
- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold
* Apply clang-format-18 formatting
- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes
* Fix clang-format-18 issues in forward kernel
- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Split-Image implementation with temporary fixed divider
- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix
Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases
* Fix 1D split-image padding issue with per-piece dimensions
- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils
* Fix 2D/3D split-image with independent split factors per dimension
Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.
Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.
Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified
* Remove unused split-image struct fields
Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity
Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass
* Refactor split-image invoker code for improved readability
- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status
* Refactor split-image code and remove debug prints
- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification
* Add split-image safety constraints and refactor to utils
- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting
* Change split-image from runtime to compile-time branching
Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'
Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false
Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization
Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression
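The runtime-to-compile-time change follows the standard `if constexpr` pattern. A host-side sketch of the shape of the change (the real branch sits in the GPU kernel's `operator()`; the function here is hypothetical):

```cpp
#include <cassert>

// Compile-time branching: the split-image path is selected by a template
// parameter, so the non-split kernel variant contains no split code at all
// and the discarded branch is dead-code eliminated.
template <bool EnableSplitImage>
int launches_needed(int num_spatial_pieces)
{
    if constexpr(EnableSplitImage)
    {
        return num_spatial_pieces; // split path: one logical launch per piece
    }
    else
    {
        return 1; // plain variant: always a single launch, no piece logic compiled in
    }
}
```

The invoker then instantiates both `Kernel<true>` and `Kernel<false>` and picks one on the host, where the branch is cheap.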
* Add split-image example as separate binary
- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary
* Replace linear search with binary search in find_piece_id
- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle
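The O(log n) ownership lookup is a classic "last piece whose start is ≤ index" binary search. A sketch of the idea behind `find_piece_id` (signature and container are assumptions; the kernel version works on a fixed-size array):

```cpp
#include <cassert>
#include <vector>

// Given sorted piece start offsets, return the id of the piece that owns a
// global index: the largest i with piece_begin[i] <= index.
int find_piece_id(const std::vector<long>& piece_begin, long index)
{
    int lo = 0, hi = static_cast<int>(piece_begin.size()) - 1;
    while(lo < hi)
    {
        const int mid = (lo + hi + 1) / 2; // bias up so lo always advances
        if(piece_begin[mid] <= index)
            lo = mid;
        else
            hi = mid - 1;
    }
    return lo;
}
```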
* Simplify split-image code and fix integer overflow
- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors
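The overflow fix matters precisely in the large-tensor regime this PR targets: the element count `N*H*W*G*C` exceeds 32-bit range well before the 2 GB threshold check would fire. A minimal sketch of the safe form:

```cpp
#include <cassert>
#include <cstdint>

// Accumulate tensor sizes in 64-bit: promoting the FIRST factor forces the
// whole product to be computed in int64, avoiding 32-bit overflow.
std::int64_t tensor_elements(int N, int H, int W, int G, int C)
{
    return std::int64_t{N} * H * W * G * C;
}
```

Writing `N * H * W * G * C` with plain `int` operands and casting only the result would overflow first and cast the garbage afterwards.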
* Trigger CI rerun - fix merge conflicts
* Fix merge conflict markers
* Fix clang-format: remove space before {}
* Fix clang-format: comment wrapping and Swish constructor
* Rename split_image to large_tensor for clarity
- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'
* Update comments and include in large_tensor example
- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp
* Remove test code, restore 2GB threshold
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix build errors after develop merge and complete rename to large_tensor
This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.
Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
- TransformConvFwdToGemm signature changed in develop
- NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts
* Restore 2GB threshold for split-image
Changed threshold from 100MB (testing) back to 2GB for production use.
* Fix const-correctness in ds_ptr cast
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply clang-format-18
* Update clang-format-18 formatting
* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
d17d3f0766
test(grouped_gemm): add unit tests for grouped_gemm bquant with preshuffleB true (#3119)
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* add bquant to grouped_gemm
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm
* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat: add bf8 support
* chore: remove unnecessary decltype usage
* chore: add default quant_mode to function signature as fallback
* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel
Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.
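The per-problem parameters described above follow directly from each problem's K extent. A hedged sketch of the arithmetic (names echo the commit message; the actual device-side signatures differ):

```cpp
#include <cassert>

// Number of main-loop iterations over K for one GEMM problem in the group.
int num_loops(int K, int KPerBlock)
{
    return (K + KPerBlock - 1) / KPerBlock; // ceiling division
}

// The pipelined "hot loop" only exists when there are more K iterations
// than the pipeline's prefetch stages; otherwise only prologue/epilogue run.
bool has_hot_loop(int K, int KPerBlock, int prefetch_stages)
{
    return num_loops(K, KPerBlock) > prefetch_stages;
}
```

Computing these per problem on the device, instead of once from defaults, is what fixes groups whose problems have different K dimensions.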
* chore: set default quant mode in function signature
* test: add additional test cases to cover edge case of no hotloop
* change code based on comments
* WIP: bquant preshuffle b compiles but gives numerical error
* feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel
* refactor: refactor code after merge commit
* chore: remove print statements
* test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases
---------
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
10e844d93c
[CK TILE ENGINE] GEMM Multi D Restructure (#3121)
* Renaming old code
* Adding GEMM code with new Architecture
* Partial Progress : Errors
* Partial Progress : Working code
* Changes to element wise function
* Removing Debugging statements
* Working GEMM Multi D code
* Removing Stale Code
* Address Copilot review comments
* Address Copilot review comments
* Changes to validation file
* Changes to common code snippets
* Creating common folder
* Removing duplicate files
* Pointing to right common file
* Pointing to right common file
* Pointing to right common file
* Changing to VERBOSE
* Changing CMAKE messages to verbose
* Updating Cmake with right layout datatype configs
* Working code for GEMM Multi D
49500a1b3d
[CK-tile] unhardcode the number of LDS banks from universal gemm policy (#3130)
Fixes LDS bank conflicts on gfx950 for universal gemm v3 pipeline
Replaces hardcoded LDS layer calculations with dynamic computation using the new architecture helpers
Adds architecture-specific helper function get_n_lds_banks()
Changes function attributes from CK_TILE_HOST_DEVICE to CK_TILE_DEVICE in universal gemm policy
71bd07a783
WMMA gemm_add_relu_add_layernorm (#2989)
* Summary:
- Refactor epilogue (with CShuffle) to support fused operations:
- EpilogueCShuffleBase holds common parts
- EpilogueCShuffle: runs CShuffle and write out
- EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out
- Extend thread transfer v7r3:
- Support for intermediate data type different from src and dst type
- New functionality to write to dst buffer and keep data (to be able to use them for additional operations)
* Address review comments
4f47945979
Fix synchronization issue in fwd qr pipeline with dropout (#3135)
BlockFmhaPipelineQRKSVS reuses LDS for K and dropout so there must be
block_sync_lds between loading k_lds_window by gemm_0 and storing
dropout randval.
46dd130e26
Add the last two forward instance traits. (#3134)
* Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle
* Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d
* A few small changes to fix broken instance traits.
9950df2ae7
Adding new alert failure patterns (#3122)
* Adding GPU not found pattern
Also, failurePatterns does not need to be global. Moved variable to live in the failure notifications function scope.
* Testing new failure type
* Testing failure
* Removing the forced failure test
* Adding an additional failure pattern
a987c5dc2e
Add copyright notices to missing files (#3133)
abc0a0b77f
Kabraham/fix block gemm v1 b scale (#3129)
* fixed synchronization issue in block gemm pipeline v1 that caused b_scale to fail
* run clang-format
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
fcabe28158
[CK TILE] Clear output buffers for grouped conv bwd (#3127)
acec30dd09
[CK_TILE] Add mxfp4 flatmm (#3080)
* Squashed commit of the following:
  commit 3e1a851dad834776efbe4fe365ac82c4ed312010 (Ding, Yi <yi.ding@amd.com>, Thu Oct 23 06:10:54 2025 +0000): Fix & clean after rebase
  commit 1edf485092f44411da9a1796a4a6b72d5cdb67c6 (Ding, Yi <yi.ding@amd.com>, Wed Oct 22 10:46:13 2025 +0000): Squashed commit of the following:
  commit 5276b28a51dac7b5d2106fbae8e78de190ee0de1 (mtgu0705 <mtgu@amd.com>, Mon Sep 22 02:04:27 2025 -0500): fix bandwidth calculation
  commit d645bb20c6d879154c30ecd82bbff4d2a9206750 (mtgu0705 <mtgu@amd.com>, Mon Sep 22 00:58:59 2025 -0500): updates
  commit 0fa7e6b88aaf81a36034aa7607746de295de4263 (mtgu0705 <mtgu@amd.com>, Fri Sep 19 00:39:46 2025 -0500): fix a bug, set the A DS_read preload size to 4 for MXFP4
  commit 50cafa824e2267f2b2f0dfeeb93e69a673630c61 (mtgu0705 <mtgu@amd.com>, Thu Sep 18 01:19:03 2025 -0500): fix a_wrap preload issue for large MPerBlock.
  commit e6333bbbc6ef540e24f92095040085f1ed59041e (mtgu0705 <mtgu@amd.com>, Wed Sep 17 21:34:03 2025 -0500): optimized the VGPR repack issue for MXFP4
  commit e99e4932c401b9f6d1893dd5044c2827d6b3f145 (Gino Lu <gino.lu@amd.com>, Wed Sep 17 04:19:44 2025 -0500): fix time error
  commit 4586ce6da7fba0514f2e01a8124c76b7d494e124 (mtgu0705 <mtgu@amd.com>, Wed Sep 17 03:58:00 2025 -0500): updated, function passed.
  commit c4f25e7579573db5681b9160f6bdb1349f3566f1 (mtgu0705 <mtgu@amd.com>, Tue Sep 16 22:21:39 2025 -0500): fix, function partially passed
  commit a51b56eb6b00b99a4e8d2802dbf5b5b5277b54d8 (mtgu0705 <mtgu@amd.com>, Tue Sep 16 03:01:12 2025 -0500): fix, reference function passed, next check kernel function
  commit 5b02643ebab18960e8f9ba66c6bd2f91774f9cae (Gino Lu <gino.lu@amd.com>, Tue Sep 16 02:29:01 2025 -0500): let pack/unpack return pk_fp4_t
  commit 76d37c5d4b17530e95c6fced31bff66a35d54b8f (mtgu0705 <mtgu@amd.com>, Mon Sep 15 20:50:26 2025 -0500): fix
  commit e5be3e162b9a20e5355bd556d2b27afb6d8bf085 (Gino Lu <gino.lu@amd.com>, Mon Sep 15 05:51:06 2025 -0500): fix bug
  commit 39a024efe4aa773df589712b1290803bb5ab5d1d (mtgu0705 <mtgu@amd.com>, Mon Sep 15 04:02:05 2025 -0500): fix core dump issue, function is not correct.
  commit 16c49d268cfe065b5112b960b2d852b26552686a (mtgu0705 <mtgu@amd.com>, Mon Sep 15 03:03:02 2025 -0500): updates, build pass
  commit fe7a961852dee6eff3be3cf1e0d0fabec5cd42ee (mtgu0705 <mtgu@amd.com>, Mon Sep 15 00:05:18 2025 -0500): updates
  commit aaf9fe8022a72df59e04e4d5886dca3ba9c23400 (Gino Lu <gino.lu@amd.com>, Sun Sep 14 23:40:28 2025 -0500): fix bug
  commit a3da89290e1553b85fbf1171c07e93ac0f5584db (Gino Lu <gino.lu@amd.com>, Fri Sep 12 03:28:50 2025 -0500): fix interface
  commit c5ff747e72d877461ba61dc19a0fe15527d3161e (Gino Lu <gino.lu@amd.com>, Fri Sep 12 02:53:50 2025 -0500): add interface in warp_gemm_impl
  commit 0a48d369e601cc798589fc59e0784bdbfc0a22f9 (mtgu0705 <mtgu@amd.com>, Wed Sep 10 05:03:08 2025 -0500): updates some fixes.
  commit aaa2beca30ff5546d171a2028d1894fd4e131d4e (mtgu0705 <mtgu@amd.com>, Tue Sep 9 04:37:42 2025 -0500): fix after merge ginolu/add_wgmfma_dispatcher
  commit bf87449b09cba690922b2f3f78ba39bf1b1e472e (Merge: 05ab58e3d 991d7fdbb; mtgu0705 <mtgu@amd.com>, Mon Sep 8 22:09:15 2025 -0500): Merge remote-tracking branch 'origin/ginolu/add_wgmfma_dispatcher' into mtgu/cktile_mxfp4_flatmm_dev
  commit 05ab58e3de2b708aceda63d704089c0fa89437ae (mtgu0705 <mtgu@amd.com>, Mon Sep 8 21:42:47 2025 -0500): update mx flatmm tail pipeline
  commit 991d7fdbb726d65091a91b5cc2800f798a6661fc Merge: ad046084a
2ff4da7949
[CK_BUILDER] Generalize convolution factory to build arbitrary device operations. (#3116)
Generalize the current convolution factory in CK Builder to be able to build instances of any relevant convolution device operation. The main changes are:
* Added new enums FwdGroupConvDeviceOperation, BwdDataGroupConvDeviceOperation, and BwdWeightGroupConvDeviceOperation that contain the device operations for which the builder should be able to build instances.
* Created a union structure GroupConvDeviceOp that can represent a single value of the fwd, bwd weight, or bwd data device operations. This would be more naturally represented by a std::variant, but we cannot use std::variant in NTTPs because it is not a structural type.
* Introduced a new member device_operation in the ConvSignatureDescriptor concept that assumes GroupConvDeviceOp value.
* Added predicates to be used in creation ConvFactory specialization for the different device operation. When we add support for a new device operation, we'll just create a new ConvFactory specialization with appropriate predicates.
* Changed handling of the convolution layouts (GroupConvLayout1D, GroupConvLayout2D, GroupConvLayout3D) to use the union based handling, i.e., there's now a GroupConvLayout union struct that can hold a single value of the 1D, 2D, or 3D layouts. This simplifies the handling of the different layouts as we get rid of templatized convolution signature.
These code changes allow developers to work more easily in parallel when adding new device operations.
* Fix building CK Builder instance traits after the introduction of direct load template parameter in CK.
* Fix clang-formatting.
f65d76ed37
[CK_BUILDER] Rename CK Builder test targets with consistent prefix test_ckb (#3114)
* Rename CK Builder test targets with consistent prefix test_ckb.
* Add test_ckb_all target to build all CK Builder tests.
* Update Readme for CK Builder.
0a49238dd5
Fixed building CK Tile grouped conv fwd bias clamp example. (#3124)
d76e2879d0
Lwpck 3550: Implement and test fixed precision fp8 x bf8 (#2963)
* HasHotLoop is a constexpr
* Remove an unused function
* Remove some unused include statements
* Add implementation and tests for fp8 x bf8 weight preshuffle GEMM
* Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs
* Remove two barrier calls that HotLoopScheduler already calls
* No need to suppress a variable that hasn't been declared
* Replace six arg_parser arguments with constexpr literals
* Simplify run_gemm_test_prec_type
* The strides don't need to be passed via arg_parser as we use their default values
* The layouts don't need to be passed as arguments twice
* Pass M N and K as regular arguments, not using the argument parser
* We can now remove the argument parser
* Add a common file for precision types to be used in testing
* Convert basic and universal GEMM tests to use gtest
* Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes
* Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on
* Added a changelog entry
* Add missing copyright statements
* ifndef-define-endif is not needed with pragma once
* Fix a comment
* Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp
* Disable the unreliable test MoeSortingCase4
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
4694b1b4a7
[CK_TILE] Improve grouped conv kernel name generation (#3028)
* Improve the grouped conv kernel name generation in CK Tile.
* Fix building CShuffle epilogue tests.
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
f5200e06c6
Jimniu/ ck tile gemm stride validation (#2710)
* Add stride validation for gemm_basic
* change default stride statement
* Fix build error
* Fix pre-commit failure
* Addressed PR comments
* clear the redundant code
* clang format
---------
Co-authored-by: mkumar16-amd <mkumar16@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>

9a012c3135
[CK_TILE] Support WMMA (gfx12) in FMHA (#2528)
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0; K repeat is required to correctly distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped; it worked on gfx9
because they both equal 8, while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependencies allows supporting:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one K tile and one V tile; this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows specifying which word (16 bits of a
32-bit dword) is used to store the results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to produce more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly because of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There is no transpose loading for RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging; the proper fix is already in develop.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers its output, so this job appears to hang.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.
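One fix above notes that `__builtin_amdgcn_readfirstlane` works with 32-bit values while the Philox seed and offset are 64-bit, so they got truncated. A minimal sketch of that truncation and the split-into-halves fix; `read_first_lane_u32` is a hypothetical Python stand-in for the intrinsic, not CK code:

```python
# Model of a 32-bit lane broadcast: anything above bit 31 is lost.
MASK32 = 0xFFFFFFFF

def read_first_lane_u32(value):
    # Hypothetical stand-in for __builtin_amdgcn_readfirstlane (32-bit only).
    return value & MASK32

def broadcast_u64_truncating(value):
    # Buggy pattern: pushing a 64-bit value through one 32-bit broadcast.
    return read_first_lane_u32(value)

def broadcast_u64_split(value):
    # Fixed pattern: broadcast low and high halves separately, recombine.
    lo = read_first_lane_u32(value & MASK32)
    hi = read_first_lane_u32((value >> 32) & MASK32)
    return (hi << 32) | lo

seed = 0x1_0000_002A  # a 64-bit seed exceeding the 32-bit range
print(hex(broadcast_u64_truncating(seed)))  # 0x2a: upper word lost
print(hex(broadcast_u64_split(seed)))       # 0x10000002a: preserved
```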

2f0242c5ab
Add instance traits for two more grouped forward convolutions (#3112)

abccb649d1
[CK_Tile] Merge multiple convolution groups into a single GEMM batch (#2986)
* Fix compilation of the grouped conv examples.
* Fix grouped conv bwd weight example output in CK Tile.
* Add number of groups to merge to ck tile grouped gemm example.
* Initial set of tests for TransformConvBwdWeightToGemm.
* Added unit tests for TransformConvBwdWeightToGemm when conv groups are merged.
* WIP: Tensor transformations.
* Add unit tests for coordinate transforms.
* Fully working conv group merging for TransformConvBwdWeightToGemm.
* WIP: Merged conv groups offset calculation.
* Added unit tests for tensor view.
* WIP: Merged conv groups epilogue.
* Enable running multiple conv groups per batch.
* Add tests for tile_distribution_encoding.
* Change example to match optimally depthwise convolution with merged groups.
* Add more tests for tensor view.
* Integration test for reading diagonal blocks from grouped distributed tensor.
* Improved integration test.
* Improve test for accessing diagonal blocks.
* Added integration test for cshuffle epilogue LDS tile distribution.
* Add more logging.
* Increase the max number of reported errors.
* WIP: merged conv groups GEMM epilogue changes.
* LDS to global memory copy.
* Fix tile window size for c block.
* Integration test for CShuffle epilogue.
* Improved CShuffle test.
* WIP: Separate epilogue for merged conv groups.
* Tile example parameters changes to match depthwise conv.
* Offset fixes.
* Epilogue fixes.
* Working baseline for depthwise convolution with merged conv groups.
* Fix build.
* Initial unit tests for tensor descriptor.
* Add one more unit test for tensor view.
* WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding.
* Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding.
* Add more comments, disable debug code.
* Remove debug and other dead code.
* Code clean-up for bwd tensor transformations.
* Enable running multiple GEMM batches of merged conv groups.
* Add compile check for assumed row-major layout.
* Fix strides in 1D conv to gemm transformation.
* WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases.
* Fix case k > 1 and c=1.
* Remove debug code.
* Make MPerGroup and NPerGroup template parameters.
* Add additional check for non-supported c > 1 case.
* WIP: Put back the generic tensor descriptors for convolutions.
* Fix tensor descriptors.
* Remove the obsolete template parameters.
* Add more instances.
* Fix bugs in merged conv groups tensor descriptors.
* Fix tensor descriptors for merged conv groups when K > 1.
* Remove debug output.
* Remove dead code.
* Fix merge conflicts.
* Code clean-up.
* Remove unused code.
* Run clang-formatting.
* Remove debug prints and obsolete tests.
* Check that number of convolution groups is multiple of merged groups.
* Fix build after removing obsolete functionality.
* Remove obsolete enumeration.
* Fix new unit projects.
* Remove unnecessary includes.
* Fix passing the number of merged groups.
* Remove unrelated tests.
* Fix IsSupportedArgument for bwd weight conv kernel.
* Fix clang formatting.
* Fix the bwd weight conv to gemm mapping for num merged groups > 1.
* GEMM config for conv group merging.
* Fix clang-formatting.
* Remove obsolete comment.
* Fix typos in comment strings.
* Increase the max number of reported errors when testing against reference implementation.
* Rename gemm_config to conv_config.
* Rename GemmConfig to ConvConfig and move NumGroupsToMerge into ConvConfig.
* Change num_groups_to_merge to a boolean flag in the ck tile grouped conv example.
* Run clang-format.
* Add number of merged groups into kernel name string.
* Remove group merging flag from CK Tile grouped conv example.

332a0e1696
Added failure pattern check (#3111)

801546f608
Grouped conv fwd with direct load (#3082)
* Grouped conv fwd with direct load
* fix
* fix
* Add IsSupported check
* Fix
* fix inductor

edea16ce14
[CK_TILE] Add indexing to pooling operator (Lwpck 3892) (#3013)
* Add indexing support to pooling operator
- Add IndexDataType template parameter to pooling problem and kernel
definitions
- Enable pooling kernel to output indices of selected elements during
max/absmax pooling
- Add overloaded operators for Max and AbsMax that track when values
change using bool changed parameter
- Support optional index buffer allocation and management in device
memory
- Modify BlockReduce2d classes to handle index tensors alongside value
tensors
- Add separate shared memory allocation for index data in cross-warp
reductions
- Create validate_pool_indices function to verify index correctness
- Modify pool3d.cpp example to demonstrate index output functionality
- Add tests for index output
* fixes
* Refactor BlockReduce2D functions to get rid of auxiliary private types.
* comment resolutions and some changes to block_reduce2d
- index reference implementation improved
- reduce_operator.hpp cleaned up
- updated the block_reduce2d.hpp to have index calculation for
BlockReduce2dLinearCrossWarpSync as well
* conditionally used variable declaration improvement
- the conditionally used variables are needed only when indexing is
enabled; to inform the compiler that they may be unused, declare them
with the smallest size possible. This may allow better optimization
compared to the previous declarations
* comment resolutions
* lexical ordering of the indices
- introduced accumulate methods that handle the intermediate steps if
needed to order the indices
* add reduce_operator_accumulate.hpp to core.hpp
---------
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

9ad15a658c
[CK_TILE] fmha: Add query padding support to backward pass (#3097)
* [CK_TILE] fmha: Add query padding support to backward pass
Introduces support for query sequence padding (q_padding) in the FMHA backward pass kernels.
- Passing `seqlen_q_ptr` to the backward kernels to distinguish logical from physical sequence lengths.
- Updating `OGradDotO`, `ConvertQGrad`, and `DQDKDV` kernels to respect logical lengths and handle zero-length sequences.
- Aligning LSE indexing in the forward kernel with the padded layout for consistency.
- Adding a new GTest suite (`test_fmha_bwd_kernel_padding.cpp`) with comprehensive tests for various padding scenarios, including zero-length
sequences and deterministic mode.
* fix clang format
* Adapt fmha_bwd_runner.cpp to new q, kv sequence padding
Add backward q/kv sequence padding unit tests.
* [CK_TILE] fmha: Unify sequence length and padding handling
Refactor the handling of sequence lengths and padding in the
FMHA forward and backward kernels to provide a more unified and flexible
interface.
- Replaced `seqstart_padded_*_ptr` with a more robust system that uses
`seqstart_*_ptr` for physical sequence lengths and introduces
`seqlen_*_ptr` and `cu_seqlen_*_ptr` for logical (unpadded) lengths.
- Established a clear order of precedence for determining sequence
length: cumulative lengths (`cu_seqlen_*_ptr`) take priority,
followed by per-sequence lengths (`seqlen_*_ptr`), and finally
physical lengths derived from `seqstart_*_ptr`.
- Clarified the distinction between "group mode" and "batch mode" and
how sequence lengths are handled in each case.
- Renamed `cu_seqlen_kv_ptr` to `cu_seqlen_k_ptr` for consistency.
- Updated comments and documentation to reflect the new argument
structure and usage.
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

dc1cd3df0c
[CK_BUILDER] Clean-up fwd conv builder implementation (#3110)

39e77ae650
[CK_TILE] Top-K with Sigmoid kernel (#3062)
* Add sigmoid option to topk_softmax
* fix formatting
* add to changelog
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Use else if
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

b09d802931
[CK_BUILDER] Factory tests (#3071)
This pull request adds some initial "factory tests": these check that the instances which are used in MIOpen are actually present in CK. The main reason for this is documentation and sanity checking. It's likely that these tests will get outdated fast, so we'll have to maintain them, but fortunately this is quite straightforward and shouldn't take a lot of time once they are in place.

96f8b985b7
Add option to build ckProfiler packages for individual architectures. (#3105)
* refactor package generation, add dedicated switch
* allow building packages not only on gfx9
* enable last stage to post packages
* stash packages from different arch into separate stashes
* build packages daily automatically

cd5eeca2b0
[CK][Examples] Fix for example_grouped_gemm_multiple_d_dl_fp16 - corrected stride for B matrix. (#3104)
Fix for example_elementwise_layernorm_blockwise - corrected cmdline.
Signed-off-by: Michal Kulikowski <Michal.Kulikowski@amd.com>

02a10e6946
Fix multiple test failures with staging compiler. (#3103)
* fix sync issues with staging compiler
* fix codegen
* use separate sync for gfx11

8d51d0ef4d
[CK_TILE] Fixed multi-abd GEMM test, NaN problem (#2979)
* Multi-ABD NaN problem
* Rollback tests
---------
Co-authored-by: root <root@splinter-126-008d.aus.dcgpu>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

dfbc489a6b
[CK_TILE] Add Bquant to Grouped Gemm (#3063)
* update test cases
* format codes
* use GTEST_FAIL
* add bquant to grouped_gemm
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* fix a bug in test_grouped_gemm_util
* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm
* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat: add bf8 support
* chore: remove unnecessary decltype usage
* chore: add default quant_mode to function signature as fallback
* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel
Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.
* chore: set default quant mode in function signature
* test: add additional test cases to cover edge case of no hotloop
* chore: clang formatting
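The runtime-pipeline fix above computes has_hot_loop, num_loop, and tail_number per GEMM problem instead of sharing one default across the group. A simplified sketch of the idea; the names and the assumed prefetch depth of 2 are illustrative, not the actual CK Tile formulas:

```python
# Simplified sketch (hypothetical names, not the CK Tile API) of deriving
# per-problem pipeline parameters from that problem's K dimension.
# Assumes a double-buffered pipeline (prefetch depth 2).

def ceil_div(a, b):
    return (a + b - 1) // b

def pipeline_params(K, k_per_block, prefetch_stages=2):
    num_loop = ceil_div(K, k_per_block)        # main-loop iterations
    has_hot_loop = num_loop > prefetch_stages  # enough iterations to overlap
    tail_number = num_loop % prefetch_stages   # iterations left after the hot loop
    return num_loop, has_hot_loop, tail_number

# Two problems in one group with different K must get different parameters:
print(pipeline_params(K=4096, k_per_block=64))  # (64, True, 0)
print(pipeline_params(K=96, k_per_block=64))    # (2, False, 0)
```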
---------
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

96c8bba2e4
Add name member to CK elementwise operations. (#3102)

c237ad2950
[CK_BUILDER] Test and fix instance traits utils. (#3096)
* Refactor instance_traits_util and add unit tests
* Address reviewer comments.
Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection.
Also add a comment to the cmake to explain the unit test target test_conv_builder.
* Addressed more reviewer comments.
* Remove duplicate PassThrough::name
Accidentally added this field to the end of the struct, too. The `name` field should be at the start of the struct for consistency.

b97849e066
Fix AITER tests. (#3106)
* change base docker image for aiter
* do not add group irc to aiter docker
* add user and group jenkins
* pip install ninja
* update permissions for /home/jenkins

df355e12a8
[CK_TILE] Stream-K Gemm Example for fp8 and bf8 (#3041)
* Addition of streamk fp8 example for CK Tile
* Adding in bf8 streamk example in CK Tile
* Refactoring fp8/bf8 unit tests
Refactored the unit tests for fp8/bf8 to utilize the test harness.
Implemented smoke tests with layouts: CCR, CRR, RCR, RRR for fp8/bf8.
The tests are using 128x128x32 for the tile configuration, as other
configurations revealed implementation gaps that are currently being
documented.

f32ef6ed17
Ck tile engine gemm (#2982)
* Partial Progress : CK Tile Engine GEMM
* Partial Progress : CK Tile Engine GEMM
* Partial Progress : Working GEMM Code
* Partial Progress : Working GEMM Code
* Changing Jenkins to remove preshuffle
* Partial Progress : CK TILE ENGINE GEMM Debugging
* Partial Progress : Removing changes that are not GEMM
* Partial Progress : Validation of full block size in GEMM
* Changes in Jenkins to run only fp16 and bf16
* Addressing Review Comments
* Partial Progress : Addressing CI issues
* Partial Progress - Running GEMM for fp16, bf16 and rcr
* Clang
* Adding fp8 and bf8
* Adding fp8 and bf8
* Adding additional architrcture
* Limited datatypes and layouts
* Adding k_block_per_cu in test config
* Changes to failing CI errors
* Changes to failing CI errors
* Validation for GEMM
* Adding Layout support
* Adding Validations
* Adding layout in jenkins
* Update on Jenkins
* Distribution validation for GEMM
* Resolving merge conflicts
* Solving merge conflicts

e10a11323a
Fix quant scale matrix layout for block scale gemm (#3079)
* Adding support for TiledPermuteN
* Adding test
* moving shuffle functions to common place
* resolving commit hook
* fix formatting

5dc38c98bf
Added Support for tile_grouped_gemm_preshuffle example (#2993)
* Added Support for tile_grouped_gemm_preshuffle example
* Resolved PR comments + Added unit tests for preshuffle with persistent
* Fixed CMake Build config error
* Fix clang error that caused CI to fail
* Fix clang formatting
* Fix clang issue
* Fix errors causing test cases to fail
* Fix grouped_gemm_preshuffle unit test failure
* Resolve PR comments
* Cleaned code + removed unnecessary changes
* Update test/ck_tile/grouped_gemm_preshuffle/test_grouped_gemm_preshuffle_util.hpp
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* Fix clang formatting
* Made changes to improve code readability
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

d859b04023
[CK_BUILDER] First fwd convolution builder implementation (#3070)
* Add experimental builder infrastructure for composable_kernel
- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.
This establishes the directory structure for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.
* Fix clang formatting.
* Fix CMake build infrastructure for experimental builder
- Add experimental/builder CMakeLists.txt with proper subdirectory structure
- Add placeholder include/ck_tile/builder CMakeLists.txt for header installation
- Fix gtest.cmake to use include_guard to prevent multiple inclusions
- Update root CMakeLists.txt to include full builder directory instead of just tests
* Scope C++20 setting to the test code
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Remove redundant GTest::gtest linkage
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Introduce basic types, and convolution algorithm concepts and limits.
* Add convolution signature concepts.
* Add convolution factory.
* Finalize conv factory implementation for fwd convolutions.
* Add type definitions for testing.
* Add placeholder test.
* Add convolution builder definition.
* Fully functional fwd conv builder.
* Test improvements.
* Clean-up include headers.
* Enable the limit checks for the convolution algorithm parameters.
* Remove dead code.
* clang formatting.
* Add more tests and missing conv specialization argument.
* clang formatting.
* Add explicit handling of the tensor layouts.
* Add complete 2D/3D layout support to CK Builder
- Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW
- Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK
- Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support
- Add 3 tests for new 2D/3D layouts
- All tests pass (5/5)
* Add tests for remaining 2D/3D layouts
- Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0
- Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last)
- All 7 tests pass (complete coverage for all 2D/3D forward layouts)
* Change enum converters to consteval.
* 7 tests with pipeline and specialization:
| Test # | Dim | Type | Layout | Pipeline | Specialization |
|--------|-----|------|----------------------|----------|-------------------------|
| 1 | 2D | BF16 | NHWGC_GKYXC_NHWGK | V1 | DEFAULT |
| 2 | 2D | FP16 | GNHWC_GKYXC_GNHWK | V3 | FILTER_1X1_PAD0 |
| 3 | 2D | FP32 | NGCHW_GKCYX_NGKHW | V4 | FILTER_1X1_STRIDE1_PAD0 |
| 4 | 2D | BF16 | NHWGC_GKYXC_NHWGK | V5 | FILTER_3x3 |
| 5 | 3D | FP32 | NGCDHW_GKCZYX_NGKDHW | V1 | FILTER_1X1_PAD0 |
| 6 | 3D | BF16 | GNDHWC_GKZYXC_GNDHWK | V3 | DEFAULT |
| 7 | 3D | FP16 | NDHWGC_GKZYXC_NDHWGK | V4 | FILTER_1X1_PAD0 |
* Add missing convolution layouts and provide better compile-time error in instance traits.
* Fix clang formatting.
* Changed I8 -> S8.
* Fix signature.
* Rename concepts and corresponding members.
* Rename LDS related parameters.
* Remove ODD_C specialization. Add V2 pipeline.
* Add missing types.
* Add elementwise operation to the conv signature.
* Improve compile-time error message for unsupported elementwise ops.
* Separate different fwd conv builder tests into separate compilation units.
* Fix layout to string and add name to old CK PassThrough elementwise op.
* Enable both CK and CK Tile tensor layouts in instance traits.
* Fix clang-format.
---------
Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: JH-Leon-KIM-AMD <jeonghyun.kim@amd.com>

3b8e9864c6
[CK_TILE] Add conv fwd + bias + clamp example (#3012)
* Implement argument passing to element-wise functions for fwd convolution
* Add files for fwd + bias + clamp example
* Implement Bias
* Implement Clamp
* Elementwise function composition
* Composition unit test
* Implement fwd + bias + clamp example
* Simplify argument passing and composition
* elfunc -> bias_and_clamp
* Rename function to specify example
* Move element-wise function instantiation to kernel
* Make bias a runtime tensor
* No ugly namespace aliasing
* Initialize element-wise function on host
* Remove function initialization helper, simplify Compose initialization
* Remove unintended LSP compatibility patch
* Clean up includes and unused code
* Switch names in cshuffle epilogue
* Move CDElementwise to conv traits
* Re-add required include
* Initialize bias in same way as other tensors
* Better type specification for ds pointer
* Disable 1D convolution
* Add warning for non-group-constant bias

cbf24c87c6
[CK_TILE] Stream-K operator() Reboot (#3064)
* Persistent Stream-K Kernel Implementation
This change implements an operator() function in the
reboot::StreamKKernel class that is enabled when the Persistent flag is
set to true. In this case, the data-parallel portion and the Stream-K
portion of the kernel are fully persistent.
The changes were made in the reboot namespace. A future PR will remove
the old Stream-K kernel class and remove the reboot namespace.
* Unit Tests for Persistent Stream-K Kernel
This change contains the inital test suite for the Persitent Stream-K
Kernel. The files contain "reboot" in the name; a future PR will remove
tests for the old Stream-K Kernel and remove the "reboot" naming.
A future commit will add tests for the non-persistent kernel.
Also added estimate_num_wgs_per_tile to the StreamKTilePartitionerBase
class. This allows us to estimate the number of accumulations done per
macro tile in C to use during validation when computing relative and
absolute tolerance.
* Adding implementation for the Non-Persistent Stream-K kernel
This code is adding the operator() function for the Non-Persistent Stream-K
kernel. Persistency of the kernel is determined through a template argument.
The Non-Persistent kernel will allocate additional workgroups for the data
parallel section, leading to a different structure for processing the data
parallel and Stream-K sections.
There has been an addition to the TilePartitioner to get access to whether
Persistent has been set to true or false in the StreamKKernel.
* Adding in the tests for the Non-Persistent Stream-K kernel
* Refactor Stream-K Reboot Unit Tests
This commit makes the following changes:
- Update test cases to determine M, N, and K based on the number of CUs.
This ensures that each test case is one of Edge Case, SK Only, DP
Only, or DP + 2 Tile SK regardless of the architecture.
- Since the DP + 2 Tile SK test case takes long to run, this change
moves this case into a separate .inc file and labels it as an extended
test.
- Since the extended test takes > 30 seconds to run, this test is added
to the list of regression tests.
* Fix spelling errors in comments for test cases
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Changes based on review
Removed const volatile for typenames
Set up alias for is_tuple_t
Naming changes for clarity: GemmCommon -> BaseGemm
Moved std::enable_if_t out of template parameters and changed to a return type for operator()
Added constructor for StreamKKernelArgs to clarify UniversalGemm inheritance
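The estimate_num_wgs_per_tile helper mentioned above (used to widen validation tolerances, since each extra partial accumulation adds a rounding step) can be approximated like this; the formula is an illustrative guess at an even Stream-K work split, not the actual CK Tile implementation:

```python
# Illustrative estimate (assumed formula, not the CK Tile API) of how many
# workgroups contribute partial sums to one C macro tile in Stream-K.

def ceil_div(a, b):
    return (a + b - 1) // b

def estimate_num_wgs_per_tile(num_tiles, K, k_per_block, num_wgs):
    iters_per_tile = ceil_div(K, k_per_block)      # K iterations per C tile
    total_iters = num_tiles * iters_per_tile       # work across all tiles
    iters_per_wg = ceil_div(total_iters, num_wgs)  # even split over workgroups
    # A tile's iterations can straddle a workgroup boundary, hence the +1.
    return min(num_wgs, ceil_div(iters_per_tile, iters_per_wg) + 1)

# 304 workgroups sharing 64 tiles of 32 K-iterations each:
print(estimate_num_wgs_per_tile(num_tiles=64, K=2048, k_per_block=64, num_wgs=304))
```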
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

facd83876e
Add .cline* files to .gitignore (#3101)
Developers who use cline on the code base need to ignore .cline* directories like .cline_storage and .clinerules. Using a wildcard to ignore any other cline-related directories.

b0c0571809
Fix multi-abd tests bug (#3099)

66310cc5bf
Jenkins Alerts Notifications (#3086)
* Testing minimal pipeline
* Update Jenkinsfile
* Testing webhook
* Testing webhook
* Testing webhook
* Testing build log output
* Testing log retrieval
* Testing
* Testing pattern matching
* Fixing regex
* Testing error detection
* Testing log formatting
Including additional context around log failure.
* Testing notification message format
* Update Jenkinsfile
* Notification formatting
* Testing secure interpolation
* Testing string interpolation
* Notification format
* Fixing markdown
* Testing markdown
* Testing markdown
* Revert "Testing markdown"
This reverts commit

20ef4380d7
Ck tile engine preshuffle (#2919)
* Partial Progress: Preshuffle working code for datatype
* Partial Progress: Preshuffle cleanup
* Working code for default config with min/max step
* Partial Progress: PermuteN implemented in validation
* Partial Progress: PermuteN changes in Preshuffle
* CK Tile Engine Preshuffle complete
* CK Tile Engine: Preshuffle layout validation
* CK Tile Engine Preshuffle validation
* Preshuffle validation check
* CK Tile Engine Preshuffle: fixing validation cases
* Addressing PR review comments
* Changes in config
* Addressing review comments
* Adding additional architecture in Jenkins
* Partial Progress: selective datatypes and layouts
* Limited datatypes and layouts
* Addressing CI errors
* Datatype updates
* Datatype updates
* Datatype changes to Preshuffle
* Addressing review comments
* Addressing review comments
* Datatype changes
* Changes to CMake
* Update on Jenkins
* Formatting with pre-commit
* Ruff formatting
a3261e87a3
[CK Builder] Add missing tf32 type to reflection. (#3090)
We need to check all the architectures for build errors; this missing tf32 type surfaced as a build failure when compiling for different Instinct architectures.
75a0f41bb0
[CK_Builder] Add name member to unary elementwise ops & update builder traits. (#3093)
* Add name member to unary elementwise ops.
* Update elementwise_op_name to check for name attribute.
* Require that the layout is derived from BaseTensorLayout struct.
93a92cf2da
[CK_BUILDER] Add inline string diff for tests (#3067)
Adds new testing functionality: an inline diff for string comparison.
Example usage:
EXPECT_THAT("Actual string", ck_tile::test::StringEqWithDiff("Expected string"));
Failure message:
Value of: "Actual string"
Expected: "Expected string"
Actual: "Actual string" (of type char [14]),
Diff: "[Expe|A]ct[ed|ual] string"
The inline-diff function uses the Wagner-Fischer algorithm to find the minimum edit distance and generate the diff markers; the algorithm has O(N^2) complexity. Optional color codes can be enabled through the matcher.
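For reference, a minimal sketch of the Wagner-Fischer dynamic program such a matcher can build on (this computes only the edit distance; the actual matcher would additionally backtrack through the table to emit the `[old|new]` diff markers):

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// Wagner-Fischer edit distance: fill an (N+1) x (M+1) table where
// d[i][j] is the minimum number of single-character insertions,
// deletions, and substitutions turning a[0..i) into b[0..j).
// O(N*M) time and space.
int edit_distance(const std::string& a, const std::string& b)
{
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1));
    for (std::size_t i = 0; i <= a.size(); ++i) d[i][0] = static_cast<int>(i);
    for (std::size_t j = 0; j <= b.size(); ++j) d[0][j] = static_cast<int>(j);
    for (std::size_t i = 1; i <= a.size(); ++i)
        for (std::size_t j = 1; j <= b.size(); ++j)
        {
            const int subst = (a[i - 1] == b[j - 1]) ? 0 : 1;
            d[i][j] = std::min({d[i - 1][j] + 1,          // deletion
                                d[i][j - 1] + 1,          // insertion
                                d[i - 1][j - 1] + subst}); // substitution / match
        }
    return d[a.size()][b.size()];
}
```

A backtracking pass from `d[N][M]` to `d[0][0]` recovers which operation produced each cell, which is what the diff-marker generation uses.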