[CK_TILE] Add CK Tile bwd weight profiler
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit
## Motivation
To compare old CK and CK Tile, we need to extend the current CK profiler
to support running also CK Tile instance with the same API. In order to
have the same instance coverage in CK Tile compared to the old CK, I've
added code generation from old CK configurations to CK Tile instances
using the CK Builder.
## Technical Details
- The codegen python script for CK Tile fwd convs is extended to support
also bwd weight and bwd data.
- The generated instances are added to the CMake build (target
`device_grouped_conv_bwd_weight_tile_instance`s).
- A new profiler op (`grouped_conv_bwd_weight_tile`) has been added to
the CK Profiler.
[CK] CK Tile improvements and fixes for depthwise merged
convolutions forward (#4873)
## Motivation
Performance benchmarks showed that old CK's depthwise merged
convolutions are much faster than CK Tile's ones.
## Technical Details
After investigation it showed up that the requirement that A/CVectorload
is a multiple of gemm's rightmost dimension is too strict in case of
processing multiple groups, because if tensor is in NHWGC/NHWGK format,
then if C/K is equal to 1, we can use vectorloads on the G dimension,
which is added by this PR. Filter5x5 specialization was also added,
because some models are using it, it's similar to 3x3, the only
difference is the window size. This addition was needed, because of the
differences of tensor descriptor transformations betweeen CK and CK
Tile. In old CK the case of grouped depthwise 5x5 convs was supported
via Default specialization, but in CK Tile that case was not working
properly.
## Test Plan
Performance was tested by our internal test suite, which contains
several DL models.
## Test Result
Tests results showed significant performance uplift for depthwise(3x3,
5x5) cases
[CK][CK TILE] Add has hot loop check for pipeline v1
## Motivation
Add has hot loop check for pipeline v1 (v1 basic and v1 basic async).
Enable more tests which have been fixed by this change.
## Technical Details
Hot loop has been executed without num loop check.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
Passed
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-651
AICK-663
[CK] CK Tile grouped convolution direct load
## Motivation
CK Tile grouped convolution forward direct load support.
## Technical Details
Basic pipeline for direct load and new instances for forward for v1 and
v4 pipelines.
## Test Plan
test_grouped_convnd_fwd_tile
## Test Result
CI pending
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
AICK-130
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* hacky poc
* fix tile engine
* kill the lambda
* maybe fix test build
* more fixes
* clang-format
* save temp
* clang-format
* mostly fix examples
* clang-format
* remove dead code
* more cleanup
* fix fmha bwd build (default epilogue set/add appears to be broken)
* fix default epilogue tests but not correctness
* clang-format
* fix bquant
* clang-format
* cleanup dead code
* rearrange make windows for readability
* restore changes to IsSupportedArgument
* fix smoke-builder
* clang-format
* fixup rename class
* build fixes
* clang-format
* fix builder
* fixup
* remove set from builder tests
* fix test
* clang-format
* re-refactor the kernels
* clang-format
* fix header license
* remove memory operation from conv bwd test
* clang-format
* clang-format example,include
* clang-format test
* build fixes
* clang-format
* solve compilation error
* fix the CI
* solve compilation error
* clang format
* solve merge conflict
* solve merge conflict
* solve the gfx11 error
* solve test error
* moar build fixes
* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* fix for splitk if splitk < grid
* add different splitk implementation
* minor bugfix for streamk gemm
* Add test
---------
Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
* First version of split-K autodeduction.
* Fix circular dependency and kernel construction.
* Fix tolerance calculation for bwd weight example.
* Simplify kernel construction.
* Fix kernel launching bug for split-K autodeduce.
* Add split-K autodeduction support for the two stage example.
* Fix a corner case.
* Fix clang-format.
* Fix clang-format for inc files.
* Add missing header.
* Prevent too large split-K values.
* Fix formatting.
* Add unit tests for IsSupportedArgument in grouped bwd conv.
* clang-format.
* Fix merge conflicts.
* Address feedback from code review.
* clang-format
* Fix new tests after merge.
---------
Co-authored-by: Ville Pietilä <>
* Merge fwd conv groups in CK Tile.
* Fix building CK fwd convs.
* Add number of merged groups to conv fwd kernel name.
* Get number of merged groups from conv config.
* Rename GemmConfig to ConvConfig.
* Clean-up TODOs.
* Check that number of conv groups must be divisible by the number of merged groups.
* Improve error handling in the conv fwd example.
* Fix clang-format.
* Fix group offsets.
* Fix merge problem.
* Address feedback from code review.
* Fix clang-formatting.
* Refactor split-image implementation: simplify code and remove redundant variables
* Add padding debug output to split-image implementation
- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing
* Fix sign comparison warning after rebase with origin/develop
- Cast blockIdX from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane
* Fix Split-N with groups bug and clean up unused parameters
- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N
* Implement recursive queue-based split-image detection and calculation
- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper
Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).
* WIP: Split-Image investigation - found architecture mismatch
- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers
Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.
* Add 1D split-image implementation for grouped convolution (N=1 working)
Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.
Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)
Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)
Test Results (N=1):
✅ 94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)
Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)
Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper
This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.
* Add basic split-image implementation for 1D/2D/3D grouped convolution
This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.
Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)
Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification
Threshold: 100KB (configurable in transformer)
Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors
Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination
* Refactor split-image to unified structure for 1D/2D/3D
Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.
Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic
Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
- 1D: stride = G * C
- 2D: stride = W_in * G * C
- 3D: stride = H_in * W_in * G * C
Test results (all passing):
- 1D: 36/36 tests ✅
- 2D: 20/20 tests ✅
- 3D: 28/28 tests ✅
- Total: 84/84 (100%)
All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations
* Add safety checks for split-image in all dimensions
Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.
Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds
If checks fail, falls back to normal kernel launch.
Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback
Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.
Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations
* Fix Split-N + Split-Image compatibility issue
Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.
Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
(e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset
Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset
Changes:
- grouped_convolution_forward_kernel.hpp:
* Added spatial_offset_in/out fields to KernelArgs
* Apply batch + spatial offset in operator()
- grouped_convolution_forward_invoker.hpp:
* Keep base pointer, store spatial offset in kargs
* Copy batch_stride from temp_kargs (has original dimensions)
- transform_conv_fwd_to_gemm.hpp:
* Add debug output for split-image calculation
Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20
* Implement unified threshold for Split-N and Split-Image
This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.
Key Changes:
============
1. Transformer (transform_conv_fwd_to_gemm.hpp):
- Moved TwoGB constant to public section for unified access
- CalculateSplitImage() now takes no parameters
- Uses internal threshold: TwoGB / sizeof(CDataType)
- Calculates offsets using N_ (after Split-N) for correctness
2. Kernel (grouped_convolution_forward_kernel.hpp):
- GetSplitImageInfo() simplified to take no parameters
- Forwards to transformer's CalculateSplitImage()
- Clean interface with unified threshold internally
3. Invoker (grouped_convolution_forward_invoker.hpp):
- Removed redundant threshold calculation
- Simplified to call kargs.GetSplitImageInfo() with no params
- Clean early-return pattern (no unnecessary else blocks)
- Removed duplicate/dead code paths
Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values
Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios
* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)
The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.
Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass
The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach
* Implement recursive split-image with depth limit (MAX_DEPTH=10)
Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
1. Piece size below threshold (optimal case)
2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10
This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.
Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.
* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification
* Add comment explaining MAX_DEPTH capacity for 2GB threshold
* Refactor: move recursive split-image logic to transformer
- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold
* Apply clang-format-18 formatting
- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes
* Fix clang-format-18 issues in forward kernel
- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Split-Image implementation with temporary fixed divider
- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix
Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases
* Fix 1D split-image padding issue with per-piece dimensions
- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils
* Fix 2D/3D split-image with independent split factors per dimension
Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.
Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.
Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified
* Remove unused split-image struct fields
Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity
Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass
* Refactor split-image invoker code for improved readability
- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status
* Refactor split-image code and remove debug prints
- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification
* Add split-image safety constraints and refactor to utils
- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting
* Change split-image from runtime to compile-time branching
Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'
Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false
Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization
Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression
* Add split-image example as separate binary
- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary
* Replace linear search with binary search in find_piece_id
- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle
* Simplify split-image code and fix integer overflow
- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors
* Trigger CI rerun - fix merge conflicts
* Fix merge conflict markers
* Fix clang-format: remove space before {}
* Fix clang-format: comment wrapping and Swish constructor
* Rename split_image to large_tensor for clarity
- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'
* Update comments and include in large_tensor example
- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp
* Remove test code, restore 2GB threshold
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix build errors after develop merge and complete rename to large_tensor
This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.
Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
- TransformConvFwdToGemm signature changed in develop
- NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts
* Restore 2GB threshold for split-image
Changed threshold from 100MB (testing) back to 2GB for production use.
* Fix const-correctness in ds_ptr cast
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply clang-format-18
* update c++ 18 format
* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Improve the grouped conv kernel name generation in CK Tile.
* Fix building CShuffle epilogue tests.
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Fix compilation of the grouped conv examples.
* Fix grouped conv bwd weight example output in CK Tile.
* Add number of groups to merge to ck tile grouped gemm example.
* Initial set of tests for TransformConvBwdWeightToGemm.
* Added unit tests for TransformConvBwdWeightToGemm conv groups are merged.
* WIP: Tensor transformations.
* Add unit tests for coordinate transforms.
* Fully working conv group merging for TransformConvBwdWeightToGemm.
* WIP: Merged conv groups offset calculation.
* Adde unit tests for tensor view.
* WIP: Merged conv groups epilogue.
* Enable running multiple conv groups per batch.
* Add tests for tile_distribution_encoding.
* Change example to match optimally depthwise convolution with merged groups.
* Add more tests for tensor view.
* Integration test for reading diagonal blocks from grouped distributed tensor.
* Improved integration test.
* Improve test for accessing diagonal blocks.
* Added integration test for cshuffle epilogue LDS tile distribution.
* Add more logging.
* Increase the max number of reported errors.
* WIP: merged conv groups GEMM epilogue changes.
* LDS to global memory copy.
* Fix tile window size for c block.
* Integration test for CShuffle epilogue.
* Improved CShuffle test.
* WIP: Separate epilogue for merged conv groups.
* Tile example parameters changes to match depthwise conv.
* Offset fixes.
* Epilogue fixes.
* Working baseline for depthwise covolution with merged conv groups.
* Fix build.
* Initial unit tests for tensor descriptor.
* Add one more unit test for tensor view.
* WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding.
* Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding.
* Add more comments, disable debug code.
* Remove debug and other dead code.
* Code clean-up for bwd tensor transformations.
* Enable running multiple GEMM batches of merged conv groups.
* Add compile check for assumed row-mjor layout.
* Fix strides in 1D conv to gemm transformation.
* WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases.
* Fix case k > 1 and c=1.
* Remove debug code.
* Make MPerGroup and NPerGroup template parameters.
* Add additional check for non-supported c > 1 case.
* WIP: Put back the generic tensor descriptors for convolutions.
* Fix tensor descriptors.
* Remove the obsolete template parameters.
* Add more instances.
* Fix bugs in merged conv groups tensor descriptors.
* Fix tensor descriptors for merged conv groups when K > 1.
* Remove debug output.
* Remove dead code.
* Fix merge conflicts.
* Code clean-up.
* Remove unused code.
* Run clang-formatting.
* Remove debug prints and obsolete tests.
* Check that number of convolution groups is multiple of merged groups.
* Fix build after removing obsolete functionality.
* Remove obsolete enumeration.
* Fix new unit projects.
* Remove unnecessary includes.
* Fix passing the number of merged groups.
* Remove unrelated tests.
* Fix IsSupportedArgument for bwd weight conv kernel.
* Fix clang formatting.
* Fix the bwd weight conv to gemm mapping for num merged groups > 1.
* GEMM config for conv group merging.
* Fix clang-formatting.
* Remove obsolete comment.
* Fix typos in comment strings.
* Increase the max number of reported errors when testing against reference implementation.
* Rename gemm_config to conv_config.
* Rename GemmConfig to ConvConfig and move NumGroupsToMerge into ConvConfig.
* Change num_groups_to_merge to a boolean flag in the ck tile grouped conv example.
* Run clang-format.
* Add number of merged groups into kernel name string.
* Remove group merging flag from CK Tile grouped conv example.
* Implement argument passing to element-wise functions for fwd convolution
* Add files for fwd + bias + clamp example
* Implement Bias
* Implement Clamp
* Elementwise function composition
* Composition unit test
* Implement fwd + bias + clamp example
* Simplify argument passing and composition
* elfunc -> bias_and_clamp
* Rename function to specify example
* Move element-wise function instantiation to kernel
* Make bias a runtime tensor
* No ugly namespace aliasing
* Initialize element-wise function on host
* Remove function initialization helper, simplify Compose initialization
* Remove unintended LSP compatibility patch
* Clean up includes and unused code
* Switch names in cshuffle epilogue
* Move CDElementwise to conv traits
* Re-add required include
* Initialize bias in same way as other tensors
* Better type specification for ds pointer
* Disable 1D convolution
* Add warning for non-group-constant bias
* Have a workable version for SGPR
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* substitute with the new sgpr read api
* update the CHANGELOG
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
* change to static for logic
* have a workable version for atomic add
* Revert "have a workable version for atomic add"
This reverts commit 792377a590c26cfff9c8f545d9a9e8484a7422eb.
## What's New
Add Split-N support for grouped convolution forward to handle tensors >2GB by splitting the batch dimension.
## Bug Fix
Fixed 32-bit integer overflow that caused crashes with 6+ splits:
- Use `long_index_t` for batch offset calculations
- Remove redundant GemmM initialization in constructors
## How It Works
- Automatically splits batch dimension when tensor exceeds 2GB
- Uses grid.z dimension for parallel processing of splits
- Each split processes a subset of batches independently
## Testing
Verified with tile_example_grouped_conv_fwd:
- n=3000 (6 splits) ✓
- n=3500 (7 splits) ✓
- n=10480 (40 splits) ✓
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>