* Wrap ck host utitlies in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.
* Add using declarations to client examples.
* Use vectorized stores for dropout randvals
With no kPadSeqLenK the kernel uses 2 buffer_store_dwordx2 instead of
16 buffer_store_byte. This requires less registers and reduces spilling.
* Calculate dropout randvals for storing and applying only once
Even though it may add a small overhead when storing is not required,
it uses significantly less registers and hence no spilling.
* Reapply "[CK_TILE] Non-K Major from old CK to CK-Tile (#2442)" (#3017)
This reverts commit e4298e55c7.
* WIP
* take Y2 as the AK1/BK1 value, that is the 'vector size' after shuffle
* use get_n_lds_banks()
* clang-format
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Add missing copyright statements
* Use ck_tile::host_tensor_descriptor instead of a custom lambda
* Refactor use of check_data_type in test classes
* Use TEST_SUITE_NAME with TYPED_TEST_SUITE
* Remove an unused namespace
* Make dim3 const
* Add BF8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x BF8 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF16 x BF16 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add BF8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F8 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Add F16 x I4 tests for CompV3 in test_gemm_pipeline_kernel_types.hpp
* Skip failing tests of F16 x I4 for CompV3 with K == 2 * K_Tile
* Add missing precision type combinations to CompV4 from CompV3
* Move the INT8 tests around for consistency with KernelTypesCompV3Wmma
* Add missing precision type combinations to CompV3Wmma from CompV3
* Remove the basic and universal tests and their dependencies
* On __gfx950__, avoid using transposed loading of A with datatype pk_int4_t of B
* Use ADataType and BDataType instead of ComputeDataType for WarpGemm
* Explicitly set some return types to void
* Use more general typenames in InterleavedPKTypeLoader
* Add load_interleaved_pk_type.hpp to common.hpp
* Use std::is_same_v in load_int4_tile
* Add handling of LoadTranspose to load_int4_tile
* Factor out common code in several places using load_int4_tile
* Add support for pk_int4_t using load_int4_tile
* Fix formatting
* Extend AK1 / BK1 support:
- Add support for AK1 != BK1
- Add support for AK1, BK1 > 8
- Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction
* fix clang format
* Introduces the new partitioner to implement the reduction StreamK kernel
* Add more doc text to functions
* Add persistent-dp option to streamk example
* Update example/ck_tile/40_streamk_gemm/README.md
* Add backward weight instance traits for xdl cshuffle.
To keep instance test file sizes reasonable, we start a new test_bwd_weight_instances_traits.cpp test file.
* Fix copyright notices.
* Remove (c) symbol, replace with (C).
Having UTF-8 in source caused an error with code generation.
This change replaces pipeline macros like CK_TILE_PIPELINE_COMPUTE_V3,
CK_TILE_PIPELINE_MEMORY, etc in the CK Tile examples with a common enum
called GemmPipeline to reduce code duplication.
* Adding a ds permute fallback for the gfx908 and older for row_newbcast:7 instruction
* Better macro for selecting ROW_NEWBCAST
* clang-format the update
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Refactor quant group size to be configurable for M/N/K, not just K
* add some asserts for configurations not implemented
* start setting of group size for N dimension
* enable 2d for reference quant gemm
* WIP: trying to figure out tile dstr and/or indexing for scale matrix
* WIP
* Fix handling of n dim blocks in tile windows etc
* remove commented code and enable all tests again
* fix formatting
* Add more specialized tile distributions
* Enable NWarps replication for bquant tile dstr
* fix formatting
* fix format
* Fix some issues from the merge
* fix formatting
* one more fix to tile dstr, and revert debug initialization
* Remove commented code
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* simplify conditions that are needed for tile distributions
* only enable the working group sizes in tests
* fix formatting
* Update tile distribution for 2D bquant
* add some documentation and 2d block scale example
* fix formatting
* Add in Changlog and restructure the quant 2d example
* fix CMake
* support the change for blockscale 2d
* fix the test file
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* fix: fix bug in print tile window when printing bf8/fp8 tiles
* test(print_tile_window_range): add unit tests to maintain function integrity
* fix: fp8 numerical mismatch error on gfx950 by adding DCK_TILE_USE_OCP_FP8
* Refactor split-image implementation: simplify code and remove redundant variables
* Add padding debug output to split-image implementation
- Added debug prints for padding calculations in transform_conv_fwd_to_gemm.hpp
- Verified padding works correctly with all tests passing
* Fix sign comparison warning after rebase with origin/develop
- Cast blockIdX from unsigned to signed index_t for comparisons
- Integrated with new GetOutputTileIndex logic from upstream
- Updated to use amd_wave_read_first_lane instead of __builtin_amdgcn_readfirstlane
* Fix Split-N with groups bug and clean up unused parameters
- Fixed batch stride calculation to include G dimension for grouped convolutions
- When moving between batches in NHWGC/NWGC/NDHWGC layouts, need to account for all groups
- Removed unused multi-split parameters (we only support 2-way split)
- All tests now pass: G=1 with Split-N, G>1 with Split-N, G>1 without Split-N
* Implement recursive queue-based split-image detection and calculation
- Add LaunchKernelWithSplitIfNeeded() helper method in transform_conv_fwd_to_gemm.hpp
- Implement recursive binary splitting algorithm (10GB→5GB+5GB→...)
- Correctly handle odd dimensions (61→30+31)
- Calculate proper offsets for each split piece
- Update invoker to use split-image helper
Note: Split detection and calculation work correctly but kernel launching
for individual pieces requires kernel modification to handle different
spatial dimensions (unlike Split-N which uses blockIdx.z).
* WIP: Split-Image investigation - found architecture mismatch
- Split-N modifies N_ directly in transformer constructor
- Split-Image needs different approach due to varying dimensions
- Added split calculation logic for 1D and 2D convolutions
- Still facing memory issues when creating piece transformers
Key finding: Split-N uses blockIdx.z for parallel execution,
while Split-Image needs sequential execution of non-uniform pieces.
* Add 1D split-image implementation for grouped convolution (N=1 working)
Implements split-image for 1D convolution to handle large tensors that
exceed memory thresholds. This is a critical milestone with N=1 fully
working and tested.
Key Changes:
- Invoker: Add split-image logic that splits W dimension in half
- Transformer: Add SplitConvProblem helper for recursive splitting
- Calculate offsets for LEFT and RIGHT pieces
- Launch two kernels sequentially (LEFT then RIGHT)
Implementation Details:
- Binary split: divides W dimension by 2
- LEFT piece: W=0 to W/2, keeps left padding, removes right padding
- RIGHT piece: W/2 to W, removes left padding, keeps right padding
- Offset calculation accounts for stride, dilation, and padding
- Physical memory offset (no padding in memory)
Test Results (N=1):
✅ 94/94 tests passing
- Comprehensive tests: 36/36 (channels, padding, stride, dilation, filters, groups)
- Edge case tests: 31/31 (odd dimensions, extreme parameters, boundaries)
- Stress tests: 27/27 (maximum dimensions, up to 91.4 TFlops)
Known Limitations:
- Only works with N=1 (single batch)
- N>1 fails when split-image triggers (offset calculation issue with Split-N)
- Root cause: Split-N modifies N in transformer, but offset calculated in invoker
- Solution planned: Move offset calculation to transformer (next phase)
Files Modified:
- grouped_convolution_forward_invoker.hpp: Add split-image logic
- transform_conv_fwd_to_gemm.hpp: Add SplitConvProblem helper
This commit represents a stable, tested 1D split-image implementation
for N=1 cases. It's an important milestone before extending to N>1
and multi-dimensional splits.
* Add basic split-image implementation for 1D/2D/3D grouped convolution
This is a working baseline implementation that splits large spatial
dimensions to handle memory constraints.
Implementation:
- 1D: W-split for NWGC layout (36/36 tests passing)
- 2D: H-split for NHWGC layout (20/20 tests passing)
- 3D: D-split for NDHWGC layout (verified working)
Features:
- Binary split of outermost spatial dimension
- Sequential LEFT/RIGHT kernel launches
- Proper padding adjustment at split boundaries
- Offset calculation for pointer arithmetic
- Debug output for verification
Threshold: 100KB (configurable in transformer)
Known limitations:
- No safety checks for edge cases (to be added)
- Offset calculated before Split-N (incompatible with N>1, to be fixed)
- No recursive splitting for very large tensors
Next steps:
- Add safety checks (is_possible_to_split_*)
- Move offset calculation to transformer (after Split-N)
- Test with N>1 + split-image combination
* Refactor split-image to unified structure for 1D/2D/3D
Unified the three separate dimension-specific blocks into a single
common implementation with dimension-specific stride calculations.
Benefits:
- Reduced code from 636 → 348 lines (45% reduction)
- Eliminated code duplication
- Easier to maintain and extend
- Single source of truth for split logic
Implementation:
- Common: Binary split, offset calc, padding adjustment, kernel launch
- Dimension-specific: Stride calculation only
- 1D: stride = G * C
- 2D: stride = W_in * G * C
- 3D: stride = H_in * W_in * G * C
Test results (all passing):
- 1D: 36/36 tests ✅
- 2D: 20/20 tests ✅
- 3D: 28/28 tests ✅
- Total: 84/84 (100%)
All test scenarios verified:
- Varying channels, padding, stride, dilation
- Filter sizes (1x1 pointwise to 7x7)
- Multiple groups (G=1,2,4)
- Odd dimensions
- Complex combinations
* Add safety checks for split-image in all dimensions
Added is_possible_to_split safety checks to prevent crashes when
splitting is not feasible.
Safety checks verify:
1. Output dimension > 1 (can't split single element)
2. RIGHT piece starts after left padding
3. LEFT piece ends within input bounds
If checks fail, falls back to normal kernel launch.
Verified for all dimensions:
- 1D (W-split): Wo=1 case triggers fallback
- 2D (H-split): Ho=1 case triggers fallback
- 3D (D-split): Do=1 case triggers fallback
Original 84 tests still pass - they use normal configurations
that naturally satisfy safety conditions.
Safety checks protect against pathological edge cases with:
- Very small spatial dimensions
- Extreme stride/dilation combinations
- Invalid padding configurations
* Fix Split-N + Split-Image compatibility issue
Fixed critical bug where Split-N and Split-Image working together
caused ~50% incorrect results due to wrong batch stride calculation.
Problem:
- Batch stride was calculated using MODIFIED spatial dimensions
(e.g., W=50000 after split) instead of ORIGINAL dimensions (W=100000)
- Spatial offset was applied globally in invoker, not per-batch in kernel
- Each batch (blockIdx.z) got wrong memory offset
Solution:
1. Store spatial offset in kargs (don't apply to pointer in invoker)
2. Copy correct batch_stride from temp_kargs to left/right kargs
3. Apply formula in operator(): ptr = base + (batch × stride) + spatial_offset
Changes:
- grouped_convolution_forward_kernel.hpp:
* Added spatial_offset_in/out fields to KernelArgs
* Apply batch + spatial offset in operator()
- grouped_convolution_forward_invoker.hpp:
* Keep base pointer, store spatial offset in kargs
* Copy batch_stride from temp_kargs (has original dimensions)
- transform_conv_fwd_to_gemm.hpp:
* Add debug output for split-image calculation
Results:
- N=1 tests: 84/84 passing (100%)
- N>1 tests: Now all passing (previously ~50% errors)
- Tested: 1D, 2D, 3D with N=1,2,4,8,16,20
* Implement unified threshold for Split-N and Split-Image
This commit consolidates threshold management for both Split-N and
Split-Image operations into a single source of truth, eliminating
code duplication and fixing offset calculation issues.
Key Changes:
============
1. Transformer (transform_conv_fwd_to_gemm.hpp):
- Moved TwoGB constant to public section for unified access
- CalculateSplitImage() now takes no parameters
- Uses internal threshold: TwoGB / sizeof(CDataType)
- Calculates offsets using N_ (after Split-N) for correctness
2. Kernel (grouped_convolution_forward_kernel.hpp):
- GetSplitImageInfo() simplified to take no parameters
- Forwards to transformer's CalculateSplitImage()
- Clean interface with unified threshold internally
3. Invoker (grouped_convolution_forward_invoker.hpp):
- Removed redundant threshold calculation
- Simplified to call kargs.GetSplitImageInfo() with no params
- Clean early-return pattern (no unnecessary else blocks)
- Removed duplicate/dead code paths
Benefits:
=========
- Single source of truth: TwoGB defined once in transformer
- No parameter passing for threshold between components
- Correct offset calculation using N_ (post-Split-N)
- Cleaner code with no duplication
- All tests passing: 1D/2D/3D with various N values
Testing:
========
- Split-Image only (N=1, large spatial): PASS
- Split-N only (N>1, small spatial): PASS
- Both splits active (N>1, large spatial): PASS
- No splits (N=1, small spatial): PASS
- CPU verification correct for all scenarios
* Comment out outdated split-image code (SplitConvProblem/LaunchKernelWithSplitIfNeeded)
The old recursive queue-based implementation has been replaced by the
new CalculateSplitImage() method which is simpler and correctly handles
Split-N + Split-Image interaction.
Changes:
- Wrapped lines 381-1078 in #if 0...#endif
- Old methods: SplitConvProblem() and LaunchKernelWithSplitIfNeeded()
- Preserved for reference but disabled from compilation
- No functional changes - all tests still pass
The new implementation (CalculateSplitImage at line ~2163) provides:
- Correct offset calculation using N_ (after Split-N)
- Simpler binary split logic
- Better integration with unified threshold approach
* Implement recursive split-image with depth limit (MAX_DEPTH=10)
Changes:
- Add depth tracking to SplitPiece struct
- Implement two stopping conditions:
1. Piece size below threshold (optimal case)
2. Depth >= MAX_DEPTH (prevents infinite recursion)
- Remove MAX_PIECES limit in favor of depth-based control
- Support up to 2^10 = 1024 pieces with depth 10
This allows handling extreme tensor sizes while ensuring termination.
Pieces larger than threshold will still launch correctly if depth limit reached.
Tested with H=100 (4 levels), H=2000 (6 levels), H=4000 (9 levels) - all pass CPU verification.
* Summary of recursive split-image implementation:
- Recursive queue-based splitting with depth limit (MAX_DEPTH=10, up to 1024 pieces)
- Two stopping conditions: size below threshold OR max depth reached
- Cumulative offset tracking through all recursion levels
- LEFT piece inherits parent offset, RIGHT accumulates (parent + local)
- Per-batch spatial offset application in kernel operator()
- Batch stride uses original dimensions (before split)
- Works with Split-N: split-N first, then recursive split-image
- Handles odd dimensions, padding, stride, dilation correctly
- All 1D/2D/3D tests pass with CPU verification
* Add comment explaining MAX_DEPTH capacity for 2GB threshold
* Refactor: move recursive split-image logic to transformer
- Move LaunchWithRecursiveSplit() from invoker to transform_conv_fwd_to_gemm.hpp
- Simplify invoker from ~250 lines to ~140 lines (removed 110 lines of inline logic)
- Encapsulate SplitPiece struct and BFS splitting algorithm in transformer
- Remove unused includes (queue, vector) from invoker
- Add documentation comment for AreDescriptorsSmallerThan2GB()
- Improve code organization and reusability
- No performance overhead (static template function, compiler inlines)
- All tests passing with 2GB production threshold
* Apply clang-format-18 formatting
- Format invoker and transformer files with clang-format-18
- Fix brace placement and alignment
- No functional changes
* Fix clang-format-18 issues in forward kernel
- Remove extra blank lines
- Fix line wrapping for template calls
- Consolidate GetSplitImageInfo() to single line
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Split-Image implementation with temporary fixed divider
- Implemented spatial dimension splitting (Split-Image) for large tensors
- Added piece-based coordinate transformation for 1D/2D/3D convolutions
- Integrated Split-N (batch splitting) with automatic threshold detection
- Fixed M dimension calculation to include batch: M = N × spatial_size
- Added spatial offset support in kernel arguments
- Verified 20/20 test cases passing for Split-Image alone
- Known issue: Split-N + Split-Image combination needs coordinate fix
Implementation Details:
- Split factors: 4 (1D), 4×4 (2D), 4×4×4 (3D) - temporary fixed values
- Batch strides properly calculated for NWGC/NHWGC/NDHWGC layouts
- Piece descriptors track spatial boundaries and block ranges
- No performance overhead for N=1 cases
* Fix 1D split-image padding issue with per-piece dimensions
- Store actual size per piece to handle non-uniform splits
- Remove dead code from transform utils
* Fix 2D/3D split-image with independent split factors per dimension
Problem: Single split factor caused non-uniform pieces when dimensions
didn't divide evenly. Result: 18/25 (72%) 2D padding combinations failed.
Solution: Independent split factor selection for W, H, D dimensions.
Each dimension gets optimal factor based on its own size.
Test Results:
- 1D: 42/42 pass (100%)
- 2D: 25/25 pass (100%)
- Total: 67/67 combinations verified
* Remove unused split-image struct fields
Cleanup of split-image implementation:
- Removed unused piece_d, piece_h, piece_w fields from SplitImageInfo struct
- These fields were declared but never used in the kernel
- Per-piece dimensions are already stored in pieces[] array
- Reduces struct size and improves code clarity
Tested: 1D/2D/3D convolutions with split-image, padding, stride all pass
* Refactor split-image invoker code for improved readability
- Extract piece calculation logic into calculate_piece lambda helper
- Extract kernel args population into populate_split_image_kargs lambda
- Use aggregate initialization for cleaner struct population
- Reduce nesting depth and improve maintainability
- Fix outdated comment about split-image implementation status
* Refactor split-image code and remove debug prints
- Extract GPU kernel helper lambdas for better readability
- Remove all split-image debug print statements
- Set memory threshold to 2GB for production
- All tests pass with CPU verification
* Add split-image safety constraints and refactor to utils
- Add MAX_TOTAL_PIECES=64 limit to prevent segfault
- Move calculate_spatial_piece to library utils
- Add layout validation (NWGC, NHWGC, NDHWGC only)
- Fix hierarchical splitting to respect piece limits
- Add proper documentation and formatting
* Change split-image from runtime to compile-time branching
Response to @bartekxk review comment:
Convert 'if(kargs.num_spatial_pieces > 1)' to 'if constexpr(EnableSplitImage)'
Changes:
- Add EnableSplitImage template parameter to kernel
- Change runtime if to compile-time if constexpr
- Update invoker to instantiate kernel variants with true/false
Benefits:
- Eliminates runtime branching in GPU kernel
- Dead code elimination (each variant is smaller)
- Better compiler optimization
Files modified: 2
Lines changed: 20 total (6 in kernel, 14 in invoker)
Tests: 27/27 passed (100%)
Performance: No regression
* Add split-image example as separate binary
- Create grouped_convolution_forward_split_image example
- Add grouped_convolution_forward_split_image_invoker.hpp
- Update CMakeLists.txt to build split_image binary
* Replace linear search with binary search in find_piece_id
- Change O(n) to O(log n) for finding piece ownership
- Matches reference implementation in large_tensor_cshuffle
* Simplify split-image code and fix integer overflow
- Extract lambda functions to static helper methods
- Pre-calculate constants in invoker
- Fix integer overflow in tensor size calculation for large tensors
* Trigger CI rerun - fix merge conflicts
* Fix merge conflict markers
* Fix clang-format: remove space before {}
* Fix clang-format: comment wrapping and Swish constructor
* Rename split_image to large_tensor for clarity
- Renamed grouped_convolution_forward_split_image.cpp -> grouped_convolution_forward_large_tensor.cpp
- Renamed grouped_convolution_forward_split_image_invoker.hpp -> grouped_convolution_forward_large_tensor_invoker.hpp
- Updated CMakeLists.txt target name: tile_example_grouped_conv_fwd_split_image -> tile_example_grouped_conv_fwd_large_tensor
- Updated comments to refer to 'large tensor' instead of 'split-image'
* Update comments and include in large_tensor example
- Updated header comments to use 'large tensor' terminology
- Fixed include path to use large_tensor_invoker.hpp
* Remove test code, restore 2GB threshold
* Update include/ck_tile/ops/grouped_convolution/utils/transform_conv_fwd_to_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix build errors after develop merge and complete rename to large_tensor
This commit addresses compilation errors from the develop merge and
completes the rename from split_image to large_tensor.
Changes:
1. Fix CDEElementWise typo in grouped_convolution_forward_invoker.hpp
2. Fix template parameter order in large_tensor_invoker.hpp
- TransformConvFwdToGemm signature changed in develop
- NumGroupsToMerge and SplitN parameters swapped positions
3. Fix missing template parameter in GroupedConvFwdHostArgs
4. Fix EpiloguePipeline scope in kernel (merge conflict)
5. Update binary name references in test scripts
* Restore 2GB threshold for split-image
Changed threshold from 100MB (testing) back to 2GB for production use.
* Fix const-correctness in ds_ptr cast
* Update include/ck_tile/ops/grouped_convolution/kernel/grouped_convolution_forward_kernel.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Apply clang-format-18
* update c++ 18 format
* Apply clang-format-18 to transform_conv_fwd_to_gemm.hpp
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* add bquant to grouped_gemm
* add tensorwise quant in grouped gemm
* fix example issue
* update test cases
* format codes
* clang format
* use GTEST_FAIL
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* fix a bug in test_grouped_gemm_util
* skip test when use wmma on grouped_quant kernel
* change cmake
* tests(quant_grouped_gemm): add unit tests to cover bquant in grouped_gemm
* Update test/ck_tile/grouped_gemm_quant/test_grouped_gemm_util_quant.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Update example/ck_tile/17_grouped_gemm/quant_grouped_gemm.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* feat: add bf8 support
* chore: remove unnecessary decltype usage
* chore: add default quant_mode to function signature as fallback
* fix: pass correct runtime pipeline params in grouped_gemm bquant kernel
Calculate has_hot_loop, num_loop, and tail_number on device side for each
GEMM problem instead of using default values. This fixes incorrect results
when different problems in the group have different K dimensions.
* chore: set default quant mode in function signature
* test: add additional test cases to cover edge case of no hotloop
* change code based on comments
* WIP: bquant preshuffle b compiles but gives numerical error
* feat(grouped_gemm_quant): bquant with preshuffleB support added to grouped_gemm example & kernel
* refactor: refactor code after merge commit
* chore: remove print statements
* test(grouped_gemm): split test cases by quant mode to reduce compilation time and add bquant-preshuffleB mode test cases
---------
Co-authored-by: kyle-256 <Kyle.Zhao@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Fixes LDS bank conflicts on gfx950 for universal gemm v3 pipeline
Replaces hardcoded LDS layer calculations with dynamic computation using the new architecture helpers
Adds architecture-specific helper function get_n_lds_banks()
Changes function attributes from CK_TILE_HOST_DEVICE to CK_TILE_DEVICE in universal gemm policy
* Summary:
- Refactor epilogue (with CShuffle) to support fused operations:
- EpilogueCShuffleBase holds common parts
- EpilogueCShuffle: runs CShuffle and write out
- EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out
- Extend thread transfer v7r3:
- Support for intermediate data type different from src and dst type
- New functionality to write to dst buffer and keep data (to be able to use them for additional operations)
* Adress review comments
BlockFmhaPipelineQRKSVS reuses LDS for K and dropout so there must be
block_sync_lds between loading k_lds_window by gemm_0 and storing
dropout randval.
* Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle
* Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d
* A few small changes to fix broken instance traits.
* fixed synchronization issue in block gemm pipeline v1 that caused b_scale to fail
* run clang-format
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
* HasHotLoop is a constexpr
* Remove an unused function
* Remove some unused include statements
* Add implementation and tests for fp8 x bf8 weight preshuffle GEMM
* Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs
* Remove two barrier calls that HotLoopScheduler already calls
* No need to suppress a variable that hasn't been declared
* Replace six arg_parser arguments with constexpr literals
* Simplify run_gemm_test_prec_type
* The strides don't need to be passed via arg_parser as we use their default values
* The layouts don't need to be passed as arguments twice
* Pass M N and K as regular arguments, not using the argument parser
* We can now remove the argument parser
* Add a common file for precision types to be used in testing
* Convert basic and universal GEMM tests to use gtest
* Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes
* Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on
* Added a changelog entry
* Add missing copyright statements
* ifndef-define-endif is not needed with pragma once
* Fix a comment
* Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp
* Disable the unreliable test MoeSortingCase4
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Improve the grouped conv kernel name generation in CK Tile.
* Fix building CShuffle epilogue tests.
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>