* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.
- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
the streamk implementation reaches stability.
* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM
* [CK TILE STREAMK] Refactor: Extract common utility functions.
* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation
* Add pre-commit
* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK
* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK
* [CK TILE STREAMK] Update Jenkinsfile
* [CK TILE Engine] Update StreamK tile engine help message
Remove default value messages as they are automatically printed
* [CK TILE Engine] Update StreamK tile engine
- Remove namespace reboot
* [CK TILE Engine] Update StreamK tile engine
- Fix merge error
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch
* fix cpp17 compile error in the ck-tile examples
---------
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
* Wrap ck host utitlies in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.
* Add using declarations to client examples.
* pooling example readme update
- The updated readme explains the transformations of the pooling kernel
using a mermaid diagram
* Update example/ck_tile/36_pooling/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* resolve comments
---------
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Extend AK1 / BK1 support:
- Add support for AK1 != BK1
- Add support for AK1, BK1 > 8
- Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction
* fix clang format
* Introduces the new partitioner to implement the reduction StreamK kernel
* Add more doc text to functions
* Add persistent-dp option to streamk example
* Update example/ck_tile/40_streamk_gemm/README.md
This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3,
CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum
called GemmPipeline to reduce code duplication.
* Refactor quant group size to be configurable for M/N/K, not just K
* add some asserts for configurations not implemented
* start setting of group size for N dimension
* enable 2d for reference quant gemm
* WIP: trying to figure out tile dstr and/or indexing for scale matrix
* WIP
* Fix handling of n dim blocks in tile windows etc
* remove commented code and enable all tests again
* fix formatting
* Add more specialized tile distributions
* Enable NWarps replication for bquant tile dstr
* fix formatting
* fix format
* Fix some issues from the merge
* fix formatting
* one more fix to tile dstr, and revert debug initialization
* Remove commented code
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* simplify conditions that are needed for tile distributions
* only enable the working group sizes in tests
* fix formatting
* Update tile distribution for 2D bquant
* add some documentation and 2d block scale example
* fix formatting
* Add in Changlog and restructure the quant 2d example
* fix CMake
* support the change for blockscale 2d
* fix the test file
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Refactor split-image implementation: simplify code and remove redundant variables
* Add padding debug output to split-image implementation
- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing
* Fix sign comparison warning after rebase with origin/develop
- Cast blockIdX from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane
* Fix Split-N with groups bug and clean up unused parameters
- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N
* Implement recursive queue-based split-image detection and calculation
- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper
Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).
* WIP: Split-Image investigation - found architecture mismatch
- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers
Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.
* Add 1D split-image implementation for grouped convolution (N=1 working)
Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.
Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)
Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)
Test Results (N=1):
✅ 94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)
Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)
Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper
This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.
* Add basic split-image implementation for 1D/2D/3D grouped convolution
This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.
Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)
Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification
Threshold: 100KB (configurable in transformer)
Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors
Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination
* Refactor split-image to unified structure for 1D/2D/3D
Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.
Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic
Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
- 1D: stride = G * C
- 2D: stride = W_in * G * C
- 3D: stride = H_in * W_in * G * C
Test results (all passing):
- 1D: 36/36 tests ✅
- 2D: 20/20 tests ✅
- 3D: 28/28 tests ✅
- Total: 84/84 (100%)
All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations
* Add safety checks for split-image in all dimensions
Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.
Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds
If checks fail, falls back to normal kernel launch.
Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback
Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.
Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations
* Fix Split-N + Split-Image compatibility issue
Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.
Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
(e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset
Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset
Changes:
- grouped_convolution_forward_kernel.hpp:
* Added spatial_offset_in/out fields to KernelArgs
* Apply batch + spatial offset in operator()
- grouped_convolution_forward_invoker.hpp:
* Keep base pointer, store spatial offset in kargs
* Copy batch_stride from temp_kargs (has original dimensions)
- transform_conv_fwd_to_gemm.hpp:
* Add debug output for split-image calculation
Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20
* Implement unified threshold for Split-N and Split-Image
This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.
Key Changes:
============
1. Transformer (transform_conv_fwd_to_gemm.hpp):
- Moved TwoGB constant to public section for unified access
- CalculateSplitImage() now takes no parameters
- Uses internal threshold: TwoGB / sizeof(CDataType)
- Calculates offsets using N_ (after Split-N) for correctness
2. Kernel (grouped_convolution_forward_kernel.hpp):
- GetSplitImageInfo() simplified to take no parameters
- Forwards to transformer's CalculateSplitImage()
- Clean interface with unified threshold internally
3. Invoker (grouped_convolution_forward_invoker.hpp):
- Removed redundant threshold calculation
- Simplified to call kargs.GetSplitImageInfo() with no params
- Clean early-return pattern (no unnecessary else blocks)
- Removed duplicate/dead code paths
Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values
Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios
* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)
The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.
Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass
The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach
* Implement recursive split-image with depth limit (MAX_DEPTH=10)
Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
1. Piece size below threshold (optimal case)
2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10
This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.
Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.
* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification
* Add comment explaining MAX_DEPTH capacity for 2GB threshold
* Refactor: move recursive split-image logic to transformer
- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold
* Apply clang-format-18 formatting
- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes
* Fix clang-format-18 issues in forward kernel
- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Split-Image implementation with temporary fixed divider
- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix
Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases
* Fix 1D split-image padding issue with per-piece dimensions
- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils
* Fix 2D/3D split-image with independent split factors per dimension
Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.
Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.
Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified
* Remove unused split-image struct fields
Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity
Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass
* Refactor split-image invoker code for improved readability
- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status
* Refactor split-image code and remove debug prints
- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification
* Add split-image safety constraints and refactor to utils
- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting
* Change split-image from runtime to compile-time branching
Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'
Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false
Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization
Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression
* Add split-image example as separate binary
- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary
* Replace linear search with binary search in find_piece_id
- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle
* Simplify split-image code and fix integer overflow
- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors
* Trigger CI rerun - fix merge conflicts
* Fix merge conflict markers
* Fix clang-format: remove space before {}
* Fix clang-format: comment wrapping and Swish constructor
* Rename split_image to large_tensor for clarity
- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'
* Update comments and include in large_tensor example
- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp
* Remove test code, restore 2GB threshold
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix build errors after develop merge and complete rename to large_tensor
This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.
Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
- TransformConvFwdToGemm signature changed in develop
- NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts
* Restore 2GB threshold for split-image
Changed threshold from 100MB (testing) back to 2GB for production use.
* Fix const-correctness in ds_ptr cast
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply clang-format-18
* update c++ 18 format
* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* add bquant to grouped_gemm
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm
* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat: add bf8 support
* chore: remove unnecessary decltype usage
* chore: add default quant_mode to function signature as fallback
* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel
Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.
* chore: set default quant mode in function signature
* test: add additional test cases to cover edge case of no hotloop
* change code based on comments
* WIP: bquant preshuffle b compiles but gives numerical error
* feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel
* refactor: refactor code after merge commit
* chore: remove print statements
* test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases
---------
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Pass hdim to tile_example_fmha_fwd in fp8 tests
* Add WMMA support to fwd FMHA pipelines
* Tune tile sizes a bit for less spilling
fp16 256 is still quite slow
* Fix Q grad tile distribution for warp size = 32 and hdim >= 256
With AccDataType = float and warp size = 32, K0 becomes 0, K repeat is required to correcty distribute the tile.
* Use code based on BlockDropout in BlockDropoutBwd
* Fix split KV combine kernel for gfx12 (warp size 32) and make it more universal
* Fix LSE LDS tensor descriptors: kMaxSplits and kM0 were swapped, it worked on gfx9
because they both equal to 8 while on gfx12 they are 8 and 4;
* Fix Oacc LDS tensor descriptor: it was transposed even though its shape=[4 * kM0, kN1],
it worked on gfx9 because 4 * kM == kN1 == 32;
* Removing these hidden dependecies allows to support:
* any number of warps (power-of-2), not only 4;
* kN1 = 16, not only 32;
* any number of splits;
* Rename ids like o_acc_4 and Oacc4 to eliminate confusion: kNumWarps doesn't have to be 4 now
* Replace hard-coded kN1 in dispatch code with the requested tile size
* Add gfx12-specific tile sizes for split KV
* Pass GPU architecture to kernel generation scripts
This is still a temporary solution.
* Build and run FMHA CI tests for gfx12
* Fix issue after merging
* Fix bwd tile sizes
The current pipelines always read only one tile K and V tile, this
requires bk0 == bhdq and bk2 == bhdv (kK0 == kQKHeaddim and
kK2 == kVHeaddim).
* Use hardware f32->f8 on gfx12, remove v_perm
__builtin_amdgcn_perm is not needed because
__builtin_amdgcn_cvt_pk_fp8_f32 allows to specify which word (16 bit of
32-bit dword) is used to store results (two f8 values).
* Update changelog
* Add WMMA support to pagedkv
* Fix scripts after rebasing
* Support 16x16 (MFMA, WMMA) and 32x32 (MFMA) tiles in fwd and bwd BlockDropout
Add comments with dropout implementation details
Fix performance regression of fwd+dropout
* Remove some usage of type punning (reinterpret_cast with ref or ptr) in Philox;
* "scalarize" seed and offset, they may come either from kernel args or from device memory
(presumably loaded with vector loads).
These changes help the compiler to procude more optimal code and reduce register spilling.
Use WarpGemmDispatcher instead of explicit WarpGemmMfma... to get CWarpDstrEncoding
Use code based on BlockDropout in BlockDropoutBwd
Refactor BlockDropout (fwd)
Implement BlockDropout (fwd) for WMMA
Originally BlockDropout only supported 32x32 tiles (IsWG32 = true),
this version supports 16x16 tiles.
If MPerBlock > MWarp * 16, it can generate numbers for two 16x16 tiles, similarly
to BlockDropoutBwd.
Implement BlockDropoutBwd for WMMA
Remove MakeRandValLds* functions unused in BlockDropoutBwd
Remove unused Run overload from BlockDropoutBwd
* Fix regression with philox seed and offset when they exceed 32-bit int
__builtin_amdgcn_readfirstlane works with 32-bit values, seed and offset
are 64-bit so they get truncated.
* Fix names after cherry-picking
* Fix selection of a fallback tile based on bm0
The assumption that the largest bm0 == 128 is not always true for
current fp32 tiles.
* Do not use filters related to qr_async_trload
They disable tiles/pipelines which are valid for gfx12.
* Use different dstr encoding when C is transposed
* Do not call GetQKBlockGemm (and hence WarpGemmDispatcher) in host code
Some WarpGemmDispatcher instantiations are defined only
for specific archs and undefined on host.
Calculations related to sched barriers are moved from Pipeline's public
fields into pipeline's operator().
* Fix incorrect name WarpGemmMfmaFp8Fp8F32M32N32K16SwizzleBTransposedCDistribution
Correct name is WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution
because it's 32x32x16 with IterateK = 2 so K = 32, also all tiles used
in codegen scripts are 32, 32, 32.
* Generalize usages of WarpGemmDispatcher for MFMA and WMMA
WarpGemmMfmaFp8Fp8F32M32N32K32SwizzleBTransposedCDistribution is still
used explicitly becaus of swizzle factor = 4.
* Mark has_load_tr as maybe_unused
There are no transpose loading for RDNA.
* Remove CK_TILE_USE_MFMA/WMMA from fmha-related code
* Detect BlockSize on host based on warp size of the current device
If kBlockSize == kNumWarps * get_warp_size(), the kernel is launched with
kBlockSize / 2 because on host get_warp_size() == 64 always.
* Fix calculation of grid size for combine kernel with warp size = 32
* Add missing includes and header
* Support multiple archs in one binary for fwd
* Support multiple archs in one binary for fwd_splitkv, fwd_appendkv, pagedkv_prefill
* Support multiple archs in one binary for bwd
* trload kernels are compiled only for gfx950;
* instances with padding are checked after instances without padding so
they can be used as fallbacks (similarly to fwd);
* Extract common code from register_traits
* Revert "Fix regression with philox seed and offset when they exceed 32-bit int"
To simplify merging , the proper fix is in develop already.
* Support new numerical d paddings in trait ordering checks
* Build fp32 tests only on gfx9
* Do not use hardcoded M0 = 64 for dot bwd kernel
* Use textwrap.indent from standard library
* Make fp8 pipelines on gfx12 consistent with gfx9
* Update tests for current pipelines
* Make ninja check more responsive in CI
ninja buffers output so this job looks hanging.
* Support fp8fp32 by limiting O vector size
The fp32 output type requires storing 8 * sizeof(float) = 32 bytes,
which is not implemented (here 8 is the number of C values per lane for
v_wmma_f32_16x16x16...).
* Remove unused cmake options
* Unify including amd_buffer_addressing.hpp/_builtins.hpp
* Temporarily use amd_buffer_addressing.hpp on >=gfx10
amd_buffer_addressing_builtins.hpp uses inline asm for loads/stores
which is not compatible with >=gfx10:
* 1 scalar for exec masks instead of 2,
* gfx12 uses different instruction names etc.
* Update asm in bf16 conversions to work with warp 32
* Do not generate splitkv/appendkv with vlayout=col for consistency with fwd
* Add arch tags to kernels/host funcs, compile for each arch separately
* Add kM0 to fmha_bwd_dot_do_o kernel name to match filename
* Add workaround for miscompilation of bwd with padded hdim
SWDEV-559729: v_wmma instructions can be incorrectly placed in divergent
branches used to store padded tensors (when some lanes are inactive due
to padding). Inline asm with dummy dependencies on VGPRs of the tensors
prevents the compiler doing this.
* Fix add_gtest_executable for absolute paths
Some tests (like gemm_tile_engine) pass absolute paths to source files.
In CI the branch name is a part of the root dir, and if the branch name
contains "wmma", "xdl" etc., files can be incorrectly excluded.
* Run only hdim 128 smoke tests for fp8fp32
There are no instances for hdim 64 and 256.
* Format py with ruff to simplify merging develop
* Fix incorrect var name
* Codegen for gfx9,gfx950 when --targets is not specified
Aiter and Pytorch require changes for passing their targets to the codegen scripts.
With this temporary solution the files are generated but not all of them
have to be really built (depending on the used --offload-arch=).
* Combine arch-related values into ArchTrait
This more centralized approach removes duplication of various formatting templates.
* Try a workaround for Jenkins error "groovyjarjarasm.asm.MethodTooLargeException: Method too large"
Some code is extracted into a function.