* build: reduce build time for bqaunt unit tests by splitting into multiple cpp
* reduce the test case & add the gfx10 support
* fix: copyright header for new file
* chore: add copyright to pass the CI
* build: Hot fix to reduce massive build time by just disabling the instances
* Update include/ck_tile/core/config.hpp
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: khushbu <khuagarw@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Fix some inconsistencies with OverrideBDatatype
* fix formatting
* Fix BGlobalPrefetch, no static
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* CK Tile Stream-K Tree Reduction
This change adds the first implementation of the Stream-K tree reduction
strategy into CK Tile. The tree reduction reduces the the number of
steps for accumulating results for a tile from O(N) to O(logN) where N
is the number of workgroups contributing to a C tile.
Additionally, in the original non-atomic reduction strategy, atomics
were used to set the flags buffer and to read from the flags buffer.
Howeover, through investigation with the tree reduciton, atomics with
default (relaxed) semantics were not enough to guarantee workgroups
would not read stale data, leading to incorrect results. Stronger
acquire/release memory orderings are too expensive. So, this change
also eliminates the use of atomics for setting the flags. Instead, we
leverage cache modifiers (e.g., GLC) to avoid writing to cache, thereby
avoiding the use of atomics.
Prelimiary tests were also added for the normal reduction and tree
reduction. More will be added in a future PR via tile engine.
* Move Stream-K kernel files to a subdirectory
* Cleanup Code Style & Handle Unsupported Reductions
This change makes the following small changes:
- Add an explicit else block for unimplemented reduction strategies
- Clarify type of sk_flags_ptr via auto*
- Add description for extra_iters_before_me variable
* Run new copyright script on new files
Introduces a polymorphic describe() method to BaseOperator that enables runtime introspection of kernel configurations through a unified interface.
Key changes:
* Add virtual describe() method to BaseOperator returning Description objects
* Implement describe() in 6 device operation classes (conv fwd/bwd variants)
* Create conv_describe.hpp with factory function for ConvDescription
* Extract type definitions to conv_types.hpp to resolve circular dependencies
* Add InstanceStringDescription for kernels without full ConvDescription support
Other Improvements:
* Update tests to use describe() instead of GetInstanceString()
* Remove circular dependency include from conv_traits.hpp
* Add ODD_C to ConvFwdSpecialization enum and fix OddC mapping
* Replace silent fallback in conv_layout() with compile-time error
This provides a foundation for runtime kernel introspection and better tooling support for analyzing and debugging kernel configurations.
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* support bf16*mxfp4 gemm
* rebase bf16*fp4 example to develop branch
* Clean up commented debug code in GEMM kernel
* rename example folder
* rebase to new develop
* fix clang format
* update code according to reviewer's comment
* Update README.md
* update code according to reviewer's comment
* update code according to reviewer's comment
* Update CMakeLists.txt
* Update README.md
* Update CMakeLists.txt
* Delete files
* Delete files
* Add unit tests
* Update test_gemm_quant_base.hpp
* merge bf16*fp4 example to develop branch
* fix clang format
* fix clang format
* Update CMakeLists.txt
* fix ci test
* fix clang format
* resolve conflicts
---------
Co-authored-by: eliotwang <charyang@smci355-ccs-aus-m10-29.cs-aus.dcgpu>
Co-authored-by: ShaoChunLee <Shao-Chun.Lee@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Support gemm_ab_scale:
- Add tests
- Integrate scaling implementation in multiple D
- Generalize existing b_scale for ab_scale
- Add instances
- Generalize implementation for ScaleBlockM, ScaleBlockN, ScaleBlockK
- Add support for all layouts supported by xdl
- Fix splitk xdl
* Fix copyright
* Wmma support for gemm_blockscale_wp (#3315)
* Support for preshuffle with ab scale
- add support for b preshuffle in GridwiseGemm_wmma_cshuffle_v3_ab_scale
- add support for AScaleLayout amnd BScaleLayout (can be different
from ALayout and BLayout, respectively)
- add Run method in v1 pipeline to support preshuffle + scaling
- add support for preshuffle gemms in common invoker
- Add splitk support
* Fix copyright header
Old sequence sort code was showing up on build profiles. Convert it to constexpr functions for much more efficient build-time execution. The sorting is still O(N^2), but our sequences are small enough it executes quickly. This reduced compilation time of a small convolution by more than 10% and time overall time spent in the compiler on a narrow build by %6.
* First version of split-K autodeduction.
* Fix circular dependency and kernel construction.
* Fix tolerance calculation for bwd weight example.
* Simplify kernel construction.
* Fix kernel launching bug for split-K autodeduce.
* Add split-K autodeduction support for the two stage example.
* Fix a corner case.
* Fix clang-format.
* Fix clang-format for inc files.
* Add missing header.
* Prevent too large split-K values.
* Fix formatting.
* Add unit tests for IsSupportedArgument in grouped bwd conv.
* clang-format.
* Fix merge conflicts.
* Address feedback from code review.
* clang-format
* Fix new tests after merge.
---------
Co-authored-by: Ville Pietilä <>
* WIP: preparing to add transpose bq support
* WIP: handle both row/col layout for BQ windows/tile dstr
* Fix build
* WIP: adding some test, debugging numerical errors
* Fix all but pkint4 tests
* Remove test_gemm_quant_typed.cpp again
* update disabled tests
* add conversion from pkint4 for b matrix
* fix formatting
* fix formatting
* Fix tr_load and use override b datatype for clarity
* fix formatting
* make bquant preshuffle tests bqlayout column-major
* wip: add aquant to grouped gemm quant example
* fix: properly handle hot loop count in aquant pipeline
* fix: add separate GemmConfig structs for AQuant, automatically select the correct one
* feat: finish support for a non-persistent kernel invocation for grouped gemm quant, and add support code to example
* refactor: cleaned up grouped gemm quant example a bit by reusing pipeline selection logic
* chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants
* feat: add quant grouped gemm tests cases for aquant (regular and transpose C) and non-persistent kernel
* fix: update base pipeline classes according to changes in develop branch
* Revert "chore: add warp gemm dispatchers for a couple of TransposeC K=32 variants"
This reverts commit b3fd4d326d.
* feat: remove aquant config from grouped gemm quant example, update to add persistency as runtime parameter
* chore: removed work-around for aquant bug that has been fixed
* chore: fix typo in command-line parameters
* fix: correct K warp tile size for gfx950
* chore: incorrect warp tile configuration on gfx942
Sometimes there are not enough wait-states between v_mfma_f32... and v_accvgpr_read_b32 instructions if they are separated by s_cbranch.
The workaround is to read accvgprs to vgpr before branching.
We only want to import enums and types into the builder reflection code. But, some of the enums are included in much larger files or even big trees of include files. This leads to unintended mixing of code and very confusing interactions and symbol conflicts. We organize the includes and extract two new enum-only headers to help with decoupling in CK. This refactoring is critical if we want to include reflection in a device-operator "describe" method.
* Remove a few unnecessary includes from headers in builder/reflect/.
* Extract enums scheduler and pipeline to their own headers so they can be used without importing other code.
* Order includes alphabetically for better organization.
The immediate goal is to unblock reflection integration, and this type of cleanup helps the flexibility and robustness of the CK header library.
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* replace qr with async pipeline
* Add fp8fp32 to DTYPE_BITS
* Add kAlignmentRandVal to avoid compile fail
* format
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution
This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).
Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability
The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
* update gpu reference for ck tile grouped conv
* correct c++ 18 format
* Add GPU Reference Implementations for Old CK Convolution
This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.
Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
* Fixed BF16 NaN issue (use type_convert instead of static_cast)
* Fixed FP8/BF8 arithmetic (accumulate in float)
* Fixed uninitialized variables
* All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)
- Created backward data GPU reference (naive_conv_bwd_data.hpp)
* Implements input gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
* Implements weight gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Integrated with old CK examples
* Forward: 10 XDL examples now support do_verification=2
* Backward data: Integrated with example/17_convnd_bwd_data/
* Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
* Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types
Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints
Result: Old CK GPU references can replace CPU references for verification
with 50-100x performance improvement for large tensors.
* Apply clang-format to old CK GPU reference files
* Fix C++17 compatibility: use brace initialization for aggregate types
* add get_rtol, get_atl and consistency cout message
* Use triple bracket syntax for kernel launch per review feedback
Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.
All tests still passing after this change.
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode
- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)
* Use ConvDims structure to simplify GPU reference kernels
Replace 24 individual parameters with ConvDims structure per review feedback.
- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
* Use get_block_id() and get_thread_id() helpers in CK Tile
Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.
Updated 3 CK Tile GPU reference kernels per review feedback.
* Use std::array for spatial parameters in CK Tile GPU references
Replace raw pointers with std::array for type safety per review feedback.
- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters
* Use NDimSpatial+3 for stride array sizes
Replace hardcoded [10] with [NDimSpatial+3] per review feedback.
Array sizes now correctly reflect actual dimensions needed.
* Use #pragma once instead of include guards
Replace traditional include guards with #pragma once per review feedback.
Updated 3 Old CK GPU reference headers.
* Fix element-wise operation output in Old CK GPU references
Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.
This ensures element-wise operations are correctly applied to output.
* Initialize element-wise operation variables
Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.
Updated backward data and backward weight kernels.
* Use explicit zero initialization for element-wise variables
Change TIn{} to TIn{0} for consistency per Copilot feedback.
All 3 kernels now use consistent zero initialization.
* Fix copyright headers to match existing style
- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range
Addresses consistency feedback.
* Rename GPU reference files: add _gpu suffix
* Refactor index calculations: use std::array and extract to helper functions
* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* feat(gemm_quant): add RRR and CRR layout support for aquant gemm
* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm
* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests
* fix: test cases compilation failure due to PR# 2095
* fix: make condition to filter out tests for gfx950 more explicit
* need to support the gfx950
* fix: add layout suppot for gfx950
* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)
* WIP: add support for pk_int4_t for aquant mode layouts RR and CR
* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant
* fix: compile time error for gfx950
* fix: minor bug where is_a_load_tr_v() was mising
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* docs: update changelog with new layout support info
* Update CHANGELOG.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: break test instances into multiple cpp files to reduce build time (#3319)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* refactor: break test instances into multiple cpp files to reduce build time
* chore: rename file for better code readability
* fix: merge conflict resolution
* fix: remove memory pipeline because new layout is not compatible
* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()
* refactor: address review comments
* solve the conflict
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Merge fwd conv groups in CK Tile.
* Fix building CK fwd convs.
* Add number of merged groups to conv fwd kernel name.
* Get number of merged groups from conv config.
* Rename GemmConfig to ConvConfig.
* Clean-up TODOs.
* Check that number of conv groups must be divisible by the number of merged groups.
* Improve error handling in the conv fwd example.
* Fix clang-format.
* Fix group offsets.
* Fix merge problem.
* Address feedback from code review.
* Fix clang-formatting.
* Add help for example
* Refactore the compute reference batched contraction to manage stride-aware calculation and some code cleanings
* Add stride-aware reference for batched contraction with independent D tensor layouts
* Add -num_d argument for runtime D tensor count selection in batched contraction
* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs
* Add descriptor-based architecture for batched contraction multi-dimensional stride support
* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0
* Add complete multi-dimensional stride support via descriptors
* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm
* Clean up batched contraction: remove old UniversalGemmKernel path
* Clean up batched contraction: remove legacy paths and finalize docs
* Optimize batched contraction example: pass dimension sizes not vectors
* correct the reference calculation, unsigned int to int
* Fix batched_contraction C++17 build errors for gfx90a CI
* wip: grouped_gemm implementation based on wmma kernel + example for fp16
* chore: clean up grouped_gem_wmma_splitk_fp16 example
* chore: add cmake options to fully disable XDL or WMMA kernels
* feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts)
* chore: add grouped gemm wmma bf16 example
* refactor: reuse more code between instance factory functions
* chore: turn test failure if not all batch sizes are supported into a warning
* chore: made failing of test on unsupported instances conditional to not break old tests
* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value
* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()
* fix: stray comma after parameter list
* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran
* chore: update copyright in header comments
* nit: minor feebdack
* refactor: unified XDL / wma tests
* fix: properly disable FP8 instances when ONLY targeting gfx11
* refactor: add v3 suffix to grouped_gemm device struct name
* fix: small typos in example code
* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags
* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths
* fix: make sure to not add instance library to group if library was skipped
* fix: make sure xdl grouped gemm doesnt fail the new test
* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case
* fix: examples not working since dependent types and functions were moved to ck namespace in develop
* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances
* chore: replace/add copyright headers with new format