composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-25 23:34:46 +00:00

Author	SHA1	Message	Date
assistant-librarian[bot]	7148cc6371	Merge commit '31c019f5891f75a2c9a26cb3d3e61c63596e4c30' into develop	2025-11-04 19:11:52 +00:00
Vidyasagar Ananthan	4d72320b51	Chunk Ctests so we don't run into large number of tests error (#3050) * Chunk Ctests so we don't run into large number of tests error * Addressing feedback from copilot [ROCm/composable_kernel commit: `31c019f589`]	2025-11-04 10:31:32 -08:00
assistant-librarian[bot]	8c8fec6769	Merge commit '5abe4109e0c30993b9e1afe00f95154939043859' into develop	2025-11-04 18:15:42 +00:00
Cong Ma	0343c4e1fe	Introduces the new partitioner to implement the reduction StreamK kernel. (#3107) * Introduces the new partitioner to implement the reduction StreamK kernel * Add more doc text to functions * Add persistent-dp option to streamk example * Update example/ck_tile/40_streamk_gemm/README.md [ROCm/composable_kernel commit: `5abe4109e0`]	2025-11-04 10:32:17 -07:00
assistant-librarian[bot]	4d94ea61e1	Merge commit '13ba06f1e75a28037c78c9d75f660f4ab7877d27' into develop	2025-11-04 17:11:25 +00:00
Thomas Ning	1a8f824938	fix the blockscale 2d case (#3148) Co-authored-by: Aviral Goel <aviral.goel@amd.com> [ROCm/composable_kernel commit: `13ba06f1e7`]	2025-11-04 11:55:23 -05:00
assistant-librarian[bot]	32a26d371b	Merge commit '0be0288f58879123c228373525c4b438d354694f' into develop	2025-11-04 15:13:12 +00:00
John Shumway	a9d0980ad9	[CK_BUILDER] Update copyright messages. (#3150) * Update copyright messages. Copyright messages should no longer include a year. This PR updates all 38 source files to the new format. * Switch to (C) from unicode copyright symbol. The unicode in comments was causing compilation errors. [ROCm/composable_kernel commit: `0be0288f58`]	2025-11-04 15:35:16 +01:00
John Shumway	52204ff4e5	[CK_BUILDER] Add backward weight instance traits for xdl cshuffle. (#3143) * Add backward weight instance traits for xdl cshuffle. To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file. * Fix copyright notices. * Remove (c) symbol, replace with (C). Having UTF-8 in source caused an error with code generation. [ROCm/composable_kernel commit: `6dbee64886`]	2025-11-04 15:34:00 +01:00
assistant-librarian[bot]	5b7defb9da	Merge commit '8681ced9629f6e952afa5b77c5f3549d60920efa' into develop	2025-11-04 14:12:38 +00:00
Bartłomiej Kocot	052c043d99	[CK TILE] Refactor Conv configs and Conv Elementwise (#3151) * [CK TILE] Refactor Conv configs and Conv Elementwise * fix [ROCm/composable_kernel commit: `8681ced962`]	2025-11-04 15:04:53 +01:00
assistant-librarian[bot]	58d420c0a4	Merge commit '99f38e4d9bedcf1b09d58653c354f042f8c509ae' into develop	2025-11-04 00:35:23 +00:00
Bartłomiej Kocot	a3a55b00d7	[CK TILE] Refactor grouped conv fwd large tensor (#3144) [ROCm/composable_kernel commit: `99f38e4d9b`]	2025-11-04 00:34:48 +01:00
assistant-librarian[bot]	a0410f0a05	Merge commit 'c7ded76cc784f0b4d2c24d3985cb587ad22cbd7f' into develop	2025-11-03 21:11:57 +00:00
Vidyasagar Ananthan	c9e7b735c0	Adding note on CMake convenience script (#3139) * Adding note on convenience script * Addressing feedback * Update README.md reword --------- Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com> [ROCm/composable_kernel commit: `c7ded76cc7`]	2025-11-03 12:21:57 -08:00
assistant-librarian[bot]	a8059a2e58	Merge commit '507d81c3af51b81f15b946a2a4bef7f594620292' into develop	2025-11-03 20:14:18 +00:00
Enrico Degregori	9575bcd099	Fix splitk preshuffle (#3137) * Fix splitK multiply_multiply_wp * Add tests for gemm_multiply_multiply_wp * Add tests for gemm_universal_preshuffle (KBatch = 1) * Add tests gemm_blockscale_wp * Fix splitk gemm universal preshuffle * Run new tests on arch supporting fp8 * Restore example * Fix strides profiler * Fix tests * Fix clang format * Finalize profiler preshuffle with tolerances * Minor improvements to splitk related changes * Address review comments: clang format and ckProfiler typo * Remove b_k_split_offset from SplitKBatchOffset struct [ROCm/composable_kernel commit: `507d81c3af`]	2025-11-03 11:59:01 -08:00
assistant-librarian[bot]	7ce8c0cf8f	Merge commit '057b7d43b4f1edd4bc6e881403588af8c8e96fd4' into develop	2025-11-03 18:14:59 +00:00
Thomas Ning	bf0dc8ce56	fix the compv4 and async pipeline when tile handler is 1 (#3141) [ROCm/composable_kernel commit: `057b7d43b4`]	2025-11-03 09:37:35 -08:00
assistant-librarian[bot]	8a049e4de5	Merge commit '2ec57a8e704f55b545877f6e4f545ebda4a21833' into develop	2025-11-03 17:12:19 +00:00
Emily Martins	b00303a831	Replace CK_TILE_PIPELINE macros with a common enum This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3, CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum called GemmPipeline to reduce code duplication. [ROCm/composable_kernel commit: `2ec57a8e70`]	2025-11-03 09:35:05 -07:00
assistant-librarian[bot]	bc26a7282b	Merge commit 'afe1ff618df6fb28532331560f9b40a0b396a1da' into develop	2025-11-03 16:13:52 +00:00
Michael Mcminn	699f7daae3	Ud fix moe sorting gfx908 (#2720) * Adding a ds permute fallback for the gfx908 and older for row_newbcast:7 instruction * Better macro for selecting ROW_NEWBCAST * clang-format the update --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com> [ROCm/composable_kernel commit: `afe1ff618d`]	2025-11-03 07:31:31 -08:00
assistant-librarian[bot]	33df038b64	Merge commit 'd405641f06162f2a6b1bf15f890caa7105beebe4' into develop	2025-11-03 10:13:50 +00:00
msaffari-amd	7c8d79af33	Ck tile engine gemm unit tests expand test coverage (#3025) * initial commit for testing datatypes, layouts and traits * correct warp tile size for small datatype config to make a validate instance for fp16, bf16, fp8 * add tile size coverage test * Cover more tests, parallel instance generation, documentation * update cmakelist to run more tests * initial codes to support add test params in json file * add configurable problem sizes for different tests * modify README.md * clean test_gemm_simple code * correct padding coverage test * Add comprehensive and quick tile size config files * remove fp64 from datatypes * update documents. manage selecting tile_size config (quick or Comprehensive) * correct padding test problem sizes * update comprehensive test and correct documents * Skip GEMM tests with unsupported arguments instead of failing * change gen_single instead of gen_indivisual because of an issue. add splitk tests to tile_size_quick_config * clean CMakeList, remod py file * Refactor test configs: Rename tile_size to coverage, remove separate traits config, clean cmakefile, readme * update fp32, fp8 to test all layouts, clean documents and comments * limit fp32 test layouts to rcr because of compilation error on some gpus * remove fp32 because of the removing from gemm_instance_builder, make quick test smaller, updating comments * Fix fp8/bf8 test failures on gfx950 by adding OCP FP8 format support * Reduce quick_coverage test count from ~250 to ~144 for faster CI [ROCm/composable_kernel commit: `d405641f06`]	2025-11-03 10:29:16 +01:00
assistant-librarian[bot]	ead8f4df80	Merge commit '3ae3992c18045446f1b733b306265efbd14c5d57' into develop	2025-11-03 07:13:15 +00:00
Ville Pietilä	aeeed60666	[CK_BUILDER] Add conv factories for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle and DeviceGroupedConvFwdMultipleD_Wmma_CShuffle (#3138) * Add device operation to conv signature. Use unions to hold conv layouts and device operations. * Add predicates for all device op instances. * Use the device op signature for validation. * Fix ckb CMakeLists.txt file for tests. * Fix building CK Builder instance traits after the introduction of direct load template parameter in CK. * Fix clang-formatting. * Add factory for DeviceGroupedConvFwdMultipleABD_Xdl_CShuffle device op. * Add conv factory for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle * Rename elements per wave per shuffle member in the epilogue concept. * clang-format * Add concepts and types for optional device op template parameters. * Add optional compute, direct load, and loop scheduler arguments to conv factory. * Add number of groups to merge template parameter. * clang-format. [ROCm/composable_kernel commit: `3ae3992c18`]	2025-11-03 09:03:25 +02:00
assistant-librarian[bot]	b84217c5a7	Merge commit '16e85cf179fd8e98f56d664642d37a6775d7bc4d' into develop	2025-11-03 01:41:17 +00:00
Sami Remes	9f069d6e35	[CK_TILE] B matrix 2D block scale gemm (#3074) * Refactor quant group size to be configurable for M/N/K, not just K * add some asserts for configurations not implemented * start setting of group size for N dimension * enable 2d for reference quant gemm * WIP: trying to figure out tile dstr and/or indexing for scale matrix * WIP * Fix handling of n dim blocks in tile windows etc * remove commented code and enable all tests again * fix formatting * Add more specialized tile distributions * Enable NWarps replication for bquant tile dstr * fix formatting * fix format * Fix some issues from the merge * fix formatting * one more fix to tile dstr, and revert debug initialization * Remove commented code Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * simplify conditions that are needed for tile distributions * only enable the working group sizes in tests * fix formatting * Update tile distribution for 2D bquant * add some documentation and 2d block scale example * fix formatting * Add in Changelog and restructure the quant 2d example * fix CMake * support the change for blockscale 2d * fix the test file --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> Co-authored-by: Cong Ma <congma13@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> [ROCm/composable_kernel commit: `16e85cf179`]	2025-11-02 16:49:20 -08:00
assistant-librarian[bot]	316be5c6b2	Merge commit '73f637894da54ac2014d3f7be675f1bf75a689c1' into develop	2025-11-02 04:15:35 +00:00
Aviral Goel	f4b880d058	refactor: remove gemm preshuffle pipeline v1 by removing all references from codebase (#3132) * test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf * refactor: deprecate gemm preshuffle pipeline v1 by removing all references from codebase * Revert "test: temporarily disable flaky test_ck_tile_moe_sorting_2d_buf" This reverts commit `573c08a085`. [ROCm/composable_kernel commit: `73f637894d`]	2025-11-02 00:06:28 -04:00
assistant-librarian[bot]	c09de8fa6f	Merge commit '45be7415864b839cc27b0455bc6eae177b4832cf' into develop	2025-11-01 20:11:51 +00:00
Aviral Goel	5be796d8a5	fix: fix bug in print tile window when printing bf8/fp8 tiles (#3120) * fix: fix bug in print tile window when printing bf8/fp8 tiles * test(print_tile_window_range): add unit tests to maintain function integrity * fix: fp8 numerical mismatch error on gfx950 by adding DCK_TILE_USE_OCP_FP8 [ROCm/composable_kernel commit: `45be741586`]	2025-11-01 15:28:07 -04:00
assistant-librarian[bot]	420486464a	Merge commit 'ab1a8356b6f0cd2a92392663d81c8e6ee78e4123' into develop	2025-11-01 14:11:22 +00:00
Bartłomiej Kocot	b2aa37f3f5	Add 2GB limitation for grouped conv bwd weight (#3054) [ROCm/composable_kernel commit: `ab1a8356b6`]	2025-11-01 14:16:45 +01:00
assistant-librarian[bot]	e065ebfb86	Merge commit '1fbb47ad304566a90a374cef4731f1a257e5e179' into develop	2025-11-01 13:15:56 +00:00
JH-Leon-KIM-AMD	5f45985732	[CK TILE] Grouped conv fwd split image (#2970) * Refactor split-image implementation: simplify code and remove redundant variables * Add padding debug output to split-image implementation - Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp - Verified padding works correctly with all tests passing * Fix sign comparison warning after rebase with origin/develop - Cast blockIdX from unsigned to signed index_t for comparisons - Integrated with new GetOutputTileIndex logic from upstream - Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane * Fix Split-N with groups bug and clean up unused parameters - Fixed batch stride calculation to include G dimension for grouped convolutions - When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups - Removed unused multi-split parameters (we only support 2-way split) - All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N * Implement recursive queue-based split-image detection and calculation - Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp - Implement recursive binary splitting algorithm (10GB→5GB+5GB→...) - Correctly handle odd dimensions (61→30+31) - Calculate proper offsets for each split piece - Update invoker to use split-image helper Note: Split detection and calculation work correctly but kernel launching for individual pieces requires kernel modification to handle different spatial dimensions (unlike Split-N which uses blockIdx.z).
* WIP: Split-Image investigation - found architecture mismatch - Split-N modifies N_ directly in transformer constructor - Split-Image needs different approach due to varying dimensions - Added split calculation logic for 1D and 2D convolutions - Still facing memory issues when creating piece transformers Key finding: Split-N uses blockIdx.z for parallel execution, while Split-Image needs sequential execution of non-uniform pieces. * Add 1D split-image implementation for grouped convolution (N=1 working) Implements split-image for 1D convolution to handle large tensors that exceed memory thresholds. This is a critical milestone with N=1 fully working and tested. Key Changes: - Invoker: Add split-image logic that splits W dimension in half - Transformer: Add SplitConvProblem helper for recursive splitting - Calculate offsets for LEFT and RIGHT pieces - Launch two kernels sequentially (LEFT then RIGHT) Implementation Details: - Binary split: divides W dimension by 2 - LEFT piece: W=0 to W/2, keeps left padding, removes right padding - RIGHT piece: W/2 to W, removes left padding, keeps right padding - Offset calculation accounts for stride, dilation, and padding - Physical memory offset (no padding in memory) Test Results (N=1): ✅ 94/94 tests passing - Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups) - Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries) - Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops) Known Limitations: - Only works with N=1 (single batch) - N>1 fails when split-image triggers (offset calculation issue with Split-N) - Root cause: Split-N modifies N in transformer, but offset calculated in invoker - Solution planned: Move offset calculation to transformer (next phase) Files Modified: - grouped_convolution_forward_invoker.hpp: Add split-image logic - transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper This commit represents a stable, tested 1D split-image implementation for N=1 
cases. It's an important milestone before extending to N>1 and multi-dimensional splits. * Add basic split-image implementation for 1D/2D/3D grouped convolution This is a working baseline implementation that splits large spatial dimensions to handle memory constraints. Implementation: - 1D: W-split for NWGC layout (36/36 tests passing) - 2D: H-split for NHWGC layout (20/20 tests passing) - 3D: D-split for NDHWGC layout (verified working) Features: - Binary split of outermost spatial dimension - Sequential LEFT/RIGHT kernel launches - Proper padding adjustment at split boundaries - Offset calculation for pointer arithmetic - Debug output for verification Threshold: 100KB (configurable in transformer) Known limitations: - No safety checks for edge cases (to be added) - Offset calculated before Split-N (incompatible with N>1, to be fixed) - No recursive splitting for very large tensors Next steps: - Add safety checks (is_possible_to_split_) - Move offset calculation to transformer (after Split-N) - Test with N>1 + split-image combination Refactor split-image to unified structure for 1D/2D/3D Unified the three separate dimension-specific blocks into a single common implementation with dimension-specific stride calculations. 
Benefits: - Reduced code from 636 → 348 lines (45% reduction) - Eliminated code duplication - Easier to maintain and extend - Single source of truth for split logic Implementation: - Common: Binary split, offset calc, padding adjustment, kernel launch - Dimension-specific: Stride calculation only - 1D: stride = G * C - 2D: stride = W_in * G * C - 3D: stride = H_in * W_in * G * C Test results (all passing): - 1D: 36/36 tests ✅ - 2D: 20/20 tests ✅ - 3D: 28/28 tests ✅ - Total: 84/84 (100%) All test scenarios verified: - Varying channels, padding, stride, dilation - Filter sizes (1x1 pointwise to 7x7) - Multiple groups (G=1,2,4) - Odd dimensions - Complex combinations * Add safety checks for split-image in all dimensions Added is_possible_to_split safety checks to prevent crashes when splitting is not feasible. Safety checks verify: 1. Output dimension > 1 (can't split single element) 2. RIGHT piece starts after left padding 3. LEFT piece ends within input bounds If checks fail, falls back to normal kernel launch. Verified for all dimensions: - 1D (W-split): Wo=1 case triggers fallback - 2D (H-split): Ho=1 case triggers fallback - 3D (D-split): Do=1 case triggers fallback Original 84 tests still pass - they use normal configurations that naturally satisfy safety conditions. Safety checks protect against pathological edge cases with: - Very small spatial dimensions - Extreme stride/dilation combinations - Invalid padding configurations * Fix Split-N + Split-Image compatibility issue Fixed critical bug where Split-N and Split-Image working together caused ~50% incorrect results due to wrong batch stride calculation. Problem: - Batch stride was calculated using MODIFIED spatial dimensions (e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000) - Spatial offset was applied globally in invoker, not per-batch in kernel - Each batch (blockIdx.z) got wrong memory offset Solution: 1. Store spatial offset in kargs (don't apply to pointer in invoker) 2. 
Copy correct batch_stride from temp_kargs to left/right kargs 3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset Changes: - grouped_convolution_forward_kernel.hpp: * Added spatial_offset_in/out fields to KernelArgs * Apply batch + spatial offset in operator() - grouped_convolution_forward_invoker.hpp: * Keep base pointer, store spatial offset in kargs * Copy batch_stride from temp_kargs (has original dimensions) - transform_conv_fwd_to_gemm.hpp: * Add debug output for split-image calculation Results: - N=1 tests: 84/84 passing (100%) - N>1 tests: Now all passing (previously ~50% errors) - Tested: 1D, 2D, 3D with N=1,2,4,8,16,20 * Implement unified threshold for Split-N and Split-Image This commit consolidates threshold management for both Split-N and Split-Image operations into a single source of truth, eliminating code duplication and fixing offset calculation issues. Key Changes: ============ 1. Transformer (transform_conv_fwd_to_gemm.hpp): - Moved TwoGB constant to public section for unified access - CalculateSplitImage() now takes no parameters - Uses internal threshold: TwoGB / sizeof(CDataType) - Calculates offsets using N_ (after Split-N) for correctness 2. Kernel (grouped_convolution_forward_kernel.hpp): - GetSplitImageInfo() simplified to take no parameters - Forwards to transformer's CalculateSplitImage() - Clean interface with unified threshold internally 3. 
Invoker (grouped_convolution_forward_invoker.hpp): - Removed redundant threshold calculation - Simplified to call kargs.GetSplitImageInfo() with no params - Clean early-return pattern (no unnecessary else blocks) - Removed duplicate/dead code paths Benefits: ========= - Single source of truth: TwoGB defined once in transformer - No parameter passing for threshold between components - Correct offset calculation using N_ (post-Split-N) - Cleaner code with no duplication - All tests passing: 1D/2D/3D with various N values Testing: ======== - Split-Image only (N=1, large spatial): PASS - Split-N only (N>1, small spatial): PASS - Both splits active (N>1, large spatial): PASS - No splits (N=1, small spatial): PASS - CPU verification correct for all scenarios * Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded) The old recursive queue-based implementation has been replaced by the new CalculateSplitImage() method which is simpler and correctly handles Split-N + Split-Image interaction. Changes: - Wrapped lines 381-1078 in #if 0...#endif - Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded() - Preserved for reference but disabled from compilation - No functional changes - all tests still pass The new implementation (CalculateSplitImage at line ~2163) provides: - Correct offset calculation using N_ (after Split-N) - Simpler binary split logic - Better integration with unified threshold approach * Implement recursive split-image with depth limit (MAX_DEPTH=10) Changes: - Add depth tracking to SplitPiece struct - Implement two stopping conditions: 1. Piece size below threshold (optimal case) 2. Depth >= MAX_DEPTH (prevents infinite recursion) - Remove MAX_PIECES limit in favor of depth-based control - Support up to 2^10 = 1024 pieces with depth 10 This allows handling extreme tensor sizes while ensuring termination. Pieces larger than threshold will still launch correctly if depth limit reached. 
Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification. * Summary of recursive split-image implementation: - Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces) - Two stopping conditions: size below threshold OR max depth reached - Cumulative offset tracking through all recursion levels - LEFT piece inherits parent offset, RIGHT accumulates (parent + local) - Per-batch spatial offset application in kernel operator() - Batch stride uses original dimensions (before split) - Works with Split-N: split-N first, then recursive split-image - Handles odd dimensions, padding, stride, dilation correctly - All 1D/2D/3D tests pass with CPU verification * Add comment explaining MAX_DEPTH capacity for 2GB threshold * Refactor: move recursive split-image logic to transformer - Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp - Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic) - Encapsulate SplitPiece struct and BFS splitting algorithm in transformer - Remove unused includes (queue, vector) from invoker - Add documentation comment for AreDescriptorsSmallerThan2GB() - Improve code organization and reusability - No performance overhead (static template function, compiler inlines) - All tests passing with 2GB production threshold * Apply clang-format-18 formatting - Format invoker and transformer files with clang-format-18 - Fix brace placement and alignment - No functional changes * Fix clang-format-18 issues in forward kernel - Remove extra blank lines - Fix line wrapping for template calls - Consolidate GetSplitImageInfo() to single line * Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * 
Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Split-Image implementation with temporary fixed divider - Implemented spatial dimension splitting (Split-Image) for large tensors - Added piece-based coordinate transformation for 1D/2D/3D convolutions - Integrated Split-N (batch splitting) with automatic threshold detection - Fixed M dimension calculation to include batch: M = N × spatial_size - Added spatial offset support in kernel arguments - Verified 20/20 test cases passing for Split-Image alone - Known issue: Split-N + Split-Image combination needs coordinate fix Implementation Details: - Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values - Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts - Piece descriptors track spatial boundaries and block ranges - No performance overhead for N=1 cases * Fix 1D split-image padding issue with per-piece dimensions - Store actual size per piece to handle non-uniform splits - Remove dead code from transform utils * Fix 2D/3D split-image with independent split factors per dimension Problem: Single split factor caused non-uniform pieces when dimensions didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed. Solution: Independent split factor selection for W, H, D dimensions. Each dimension gets optimal factor based on its own size. 
Test Results: - 1D: 42/42 pass (100%) - 2D: 25/25 pass (100%) - Total: 67/67 combinations verified * Remove unused split-image struct fields Cleanup of split-image implementation: - Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct - These fields were declared but never used in the kernel - Per-piece dimensions are already stored in pieces[] array - Reduces struct size and improves code clarity Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass * Refactor split-image invoker code for improved readability - Extract piece calculation logic into calculate_piece lambda helper - Extract kernel args population into populate_split_image_kargs lambda - Use aggregate initialization for cleaner struct population - Reduce nesting depth and improve maintainability - Fix outdated comment about split-image implementation status * Refactor split-image code and remove debug prints - Extract GPU kernel helper lambdas for better readability - Remove all split-image debug print statements - Set memory threshold to 2GB for production - All tests pass with CPU verification * Add split-image safety constraints and refactor to utils - Add MAX_TOTAL_PIECES=64 limit to prevent segfault - Move calculate_spatial_piece to library utils - Add layout validation (NWGC, NHWGC, NDHWGC only) - Fix hierarchical splitting to respect piece limits - Add proper documentation and formatting * Change split-image from runtime to compile-time branching Response to @bartekxk review comment: Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)' Changes: - Add EnableSplitImage template parameter to kernel - Change runtime if to compile-time if constexpr - Update invoker to instantiate kernel variants with true/false Benefits: - Eliminates runtime branching in GPU kernel - Dead code elimination (each variant is smaller) - Better compiler optimization Files modified: 2 Lines changed: 20 total (6 in kernel, 14 in invoker) Tests: 27/27 passed 
(100%) Performance: No regression * Add split-image example as separate binary - Create grouped_convolution_forward_split_image example - Add grouped_convolution_forward_split_image_invoker.hpp - Update CMakeLists.txt to build split_image binary * Replace linear search with binary search in find_piece_id - Change O(n) to O(log n) for finding piece ownership - Matches reference implementation in large_tensor_cshuffle * Simplify split-image code and fix integer overflow - Extract lambda functions to static helper methods - Pre-calculate constants in invoker - Fix integer overflow in tensor size calculation for large tensors * Trigger CI rerun - fix merge conflicts * Fix merge conflict markers * Fix clang-format: remove space before {} * Fix clang-format: comment wrapping and Swish constructor * Rename split_image to large_tensor for clarity - Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp - Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp - Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor - Updated comments to refer to 'large tensor' instead of 'split-image' * Update comments and include in large_tensor example - Updated header comments to use 'large tensor' terminology - Fixed include path to use large_tensor_invoker.hpp * Remove test code, restore 2GB threshold * Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Fix build errors after develop merge and complete rename to large_tensor This commit addresses compilation errors from the develop merge and completes the rename from split_image to large_tensor. Changes: 1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp 2. 
Fix template parameter order in large_tensor_invoker.hpp - TransformConvFwdToGemm signature changed in develop - NumGroupsToMerge and SplitN parameters swapped positions 3. Fix missing template parameter in GroupedConvFwdHostArgs 4. Fix EpiloguePipeline scope in kernel (merge conflict) 5. Update binary name references in test scripts * Restore 2GB threshold for split-image Changed threshold from 100MB (testing) back to 2GB for production use. * Fix const-correctness in ds_ptr cast * Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Apply clang-format-18 * update c++ 18 format * Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp --------- Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `1fbb47ad30`]	2025-11-01 14:18:16 +02:00
assistant-librarian[bot]	d560ad2092	Merge commit '8f1274d9b655c2584b3643acac07ef813f31238e' into develop	2025-10-31 19:11:51 +00:00
Aviral Goel	658fb530ab	test(grouped_gemm): add unit tests for grouped_gemm bquant with preshuffleB true (#3119) * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * add bquant to grouped_gemm * add tensorwise quant in grouped gemm * fix example issue * update test cases * format codes * clang format * use GTEST_FAIL * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * fix a bug in test_grouped_gemm_util * skip test when use wmma on grouped_quant kernel * change cmake * tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm * Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> * feat: add bf8 support * chore: remove unnecessary decltype usage * chore: add default quant_mode to function signature as fallback * fix: pass correct runtime pipeline params in grouped_gemm bquant kernel Calculate has_hot_loop, num_loop, and tail_number on device side for each GEMM problem instead of using default values. This fixes incorrect results when different problems in the group have different K dimensions.
* chore: set default quant mode in function signature * test: add additional test cases to cover edge case of no hotloop * change code based on comments * WIP: bquant preshuffle b compiles but gives numerical error * feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel * refactor: refactor code after merge commit * chore: remove print statements * test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases --------- Co-authored-by: kyle-256 <Kyle.Zhao@amd.com> Co-authored-by: ThomasNing <thomas.ning@amd.com> Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com> [ROCm/composable_kernel commit: `8f1274d9b6`]	2025-10-31 12:07:06 -07:00
Thrupti Raj Lakshmana Gowda	27dc4d9833	[CK TILE ENGINE] GEMM Multi D Restructure (#3121 ) * Renaming old code * Adding GEMM code with new Architecture * Partial Progress : Errors * Partial Progress : Working code * Changes to element wise function * Removing Debugging statements * Working GEMM Multi D code * Removing Stale Code * Address Copilot review comments * Address Copilot review comments * Changes to validation file * Changes to common code snippets * Creating common folder * Removing duplicate files * Pointing to right common file * Pointing to right common file * Pointing to right common file * Changing to VERBOSE * Changing CMAKE messages to verbose * Updating Cmake with right layout datatype configs * Working code for GEMM Multi D [ROCm/composable_kernel commit: `a33d98f8e2`]	2025-10-31 12:02:46 -07:00
Max Podkorytov	b7a073f769	[CK-tile] unhardcode the number of LDS banks from universal gemm policy (#3130 ) * Fixes LDS bank conflicts on gfx950 for universal gemm v3 pipeline * Replaces hardcoded LDS layer calculations with dynamic computation using the new architecture helpers * Adds architecture-specific helper function get_n_lds_banks() * Changes function attributes from CK_TILE_HOST_DEVICE to CK_TILE_DEVICE in universal gemm policy [ROCm/composable_kernel commit: `04efd282cf`]	2025-10-31 11:58:11 -07:00
Enrico Degregori	e6be7bcc2a	WMMA gemm_add_relu_add_layernorm (#2989 ) * Summary: - Refactor epilogue (with CShuffle) to support fused operations: - EpilogueCShuffleBase holds common parts - EpilogueCShuffle: runs CShuffle and write out - EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out - Extend thread transfer v7r3: - Support for intermediate data type different from src and dst type - New functionality to write to dst buffer and keep data (to be able to use them for additional operations) * Address review comments [ROCm/composable_kernel commit: `4ebc48a3cd`]	2025-10-31 11:19:26 -07:00
assistant-librarian[bot]	96199abfbe	Merge commit 'e9596228ff7f6ddb68fbd2f0f9e964cfb6af61cf' into develop	2025-10-31 18:15:38 +00:00
Anton Gorenko	2136eddf8a	Fix synchronization issue in fwd qr pipeline with dropout (#3135 ) BlockFmhaPipelineQRKSVS reuses LDS for K and dropout so there must be block_sync_lds between loading k_lds_window by gemm_0 and storing dropout randval. [ROCm/composable_kernel commit: `e9596228ff`]	2025-10-31 09:44:52 -07:00
assistant-librarian[bot]	3d78c17295	Merge commit '5ed2046bee509cd907b9e609ae18a871864f1738' into develop	2025-10-31 15:12:07 +00:00
John Shumway	a8a377ca53	Add the last two forward instance traits. (#3134 ) * Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle * Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d * A few small changes to fix broken instance traits. [ROCm/composable_kernel commit: `5ed2046bee`]	2025-10-31 07:52:42 -07:00
andrew clark	d2474f5396	Adding new alert failure patterns (#3122 ) * Adding GPU not found pattern Also, failurePatterns does not need to be global. Moved variable to live in the failure notifications function scope. * Testing new failure type * Testing failure * Removing the forced failure test * Adding an additional failure pattern [ROCm/composable_kernel commit: `1977e4b96a`]	2025-10-31 07:38:31 -07:00
John Afaganis	c6b0458d1d	Add copyright notices to missing files (#3133 ) [ROCm/composable_kernel commit: `3f996ee738`]	2025-10-31 07:35:11 -07:00
kabrahamAMD	b7429e620c	Kabraham/fix block gemm v1 b scale (#3129 ) * fixed synchronization issue in block gemm pipeline v1 that caused b_scale to fail * run clang-format --------- Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com> [ROCm/composable_kernel commit: `a7c52e8afa`]	2025-10-31 07:19:01 -07:00
assistant-librarian[bot]	70ac1657a1	Merge commit 'c2d79314469f569c13c205ff5383f284c90d7445' into develop	2025-10-31 13:20:09 +00:00

4046 Commits