When I asked for a description of operators that didn't have ConvTraits, I was getting very long confusing errors about ConvTraits not being defined. Now we get specific errors explaining which concepts are violated, making it easier to know which code to generalize or update.
* Add concepts to conv_traits.hpp to get better error message.
* Put the correct requires clauses in the right places to get descriptive error messages.
* General cleanup of functions in conv_traits.hpp to make functions easier to read.
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* Separate layouts into separate entities for input, weight, and output tensors.
* Add test for handling bias tensor layouts.
* Use instance string in builder tests.
* Add handling of output bias data types and layouts.
* Generalize handling of the elementwise ops.
* Test fix.
* Create builder for layouts.
* Layout builder improvements.
* Improve layout builder.
* Simplify bias layout handling.
* Code clean-up.
* Move layout utils into separate file.
* Remove hard-coded layout combinations.
* Small code clean-up.
* Move data type utils into a separate file.
* Add data types, layouts, and elementwise ops per conv tensor.
* Builder bug fixes after refactoring.
* Working baseline.
* Make signature definition look nice in the test code.
* Move TensorConfig into test implementations.
* Fix all fwd conv builder tests.
* Fix conv traits and descriptors tests.
* More factory assets under a separate directory.
* Fix building conv traits.
* Fix clang-format.
* Add Readme doc to describe the design.
* Add link to main Readme. Fix links in the builder design doc.
* Clean-up data type/layout/elementwise op conversions.
* Switch from dimension and tensor type specific layouts to a flat list of tensor layouts.
* Fix clang-formatting.
* Fix clang-format for test code.
* Simplify fwd conv signature definitions in the test code.
* Remove accidental edits.
* Fix comment string.
* Fix instance factory after rebase.
* Fix tests after rebase.
* Unify layout handling.
* Add more conv layout unit tests.
* Clang-format.
* Fix merge conflicts.
* Improve elementwise op handling.
---------
Co-authored-by: Ville Pietilä <>
This PR fixes a mismatch caused when PR #3146 was merged out of sync with develop, which made its intended changes ineffective. This PR reapplies those changes to move DataTypeTraits into a common file to mitigate code duplication.
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* replace qr with async pipeline
* Add fp8fp32 to DTYPE_BITS
* Add kAlignmentRandVal to avoid compile fail
* format
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* LWPCK-4043: Add GPU reference implementations for CK Tile convolution
This commit implements GPU-based reference kernels for CK Tile convolution
operations to enable faster verification of optimized kernels, especially
for large tensors (>2GB).
Changes:
- Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution
- Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data
- Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight
- Integrate GPU references with test infrastructure (replace -v=2 error)
- Support for 1D, 2D, and 3D convolutions
- Generic data type support (FP16, BF16, FP32)
- Grid-stride loop pattern for scalability
The GPU references use a simple, readable implementation that prioritizes
correctness over performance. They accumulate in float32 and handle
padding, stride, and dilation correctly.
* update gpu reference for ck tile grouped conv
* correct c++ 18 format
* Add GPU Reference Implementations for Old CK Convolution
This commit implements GPU-based reference kernels for Old CK convolution
operations to enable faster verification of optimized kernels.
Changes:
- Fixed old CK forward GPU reference (naive_conv_fwd.hpp)
* Fixed BF16 NaN issue (use type_convert instead of static_cast)
* Fixed FP8/BF8 arithmetic (accumulate in float)
* Fixed uninitialized variables
* All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed)
- Created backward data GPU reference (naive_conv_bwd_data.hpp)
* Implements input gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Created backward weight GPU reference (naive_conv_bwd_weight.hpp)
* Implements weight gradient computation
* Verified equal to CPU reference
* Handles 1D, 2D, 3D convolutions
- Integrated with old CK examples
* Forward: 10 XDL examples now support do_verification=2
* Backward data: Integrated with example/17_convnd_bwd_data/
* Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only)
* Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU)
Testing:
- 50 comprehensive tests created
- 42/42 tests passing (100% success rate)
- CPU and GPU verification produce identical results
- Verified across multiple dimensions, sizes, and data types
Limitations:
- GPU references support standard convolution only (G=1)
- Fused operations (DL variants) not supported
- Some tests blocked by optimized kernel size constraints
Result: Old CK GPU references can replace CPU references for verification
with 50-100x performance improvement for large tensors.
* Apply clang-format to old CK GPU reference files
* Fix C++17 compatibility: use brace initialization for aggregate types
* add get_rtol, get_atl and consistency cout message
* Use triple bracket syntax for kernel launch per review feedback
Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski.
This is more idiomatic HIP/CUDA style and equally correct.
All tests still passing after this change.
* Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode
- Replace manual error checking with HIP_CHECK_ERROR macro
- Add v=3 verification mode (GPU ref vs CPU ref direct comparison)
- Consistent output format across all examples
- All tests passing (7/7 v=3 tests pass for FP16)
* Use ConvDims structure to simplify GPU reference kernels
Replace 24 individual parameters with ConvDims structure per review feedback.
- Add conv_common.hpp with ConvDims and helper function
- Update kernel signatures: 24 params → 1 structure
- Remove duplicate extraction code from host files
* Use get_block_id() and get_thread_id() helpers in CK Tile
Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions.
Updated 3 CK Tile GPU reference kernels per review feedback.
* Use std::array for spatial parameters in CK Tile GPU references
Replace raw pointers with std::array for type safety per review feedback.
- Add conv_common.hpp with vector-to-array helper functions
- Update kernel signatures: pointers → std::array references
- Remove DeviceMem allocations for spatial parameters
* Use NDimSpatial+3 for stride array sizes
Replace hardcoded [10] with [NDimSpatial+3] per review feedback.
Array sizes now correctly reflect actual dimensions needed.
* Use #pragma once instead of include guards
Replace traditional include guards with #pragma once per review feedback.
Updated 3 Old CK GPU reference headers.
* Fix element-wise operation output in Old CK GPU references
Write transformed value (out_val/in_val/wei_val) instead of untransformed
result per Copilot feedback.
This ensures element-wise operations are correctly applied to output.
* Initialize element-wise operation variables
Initialize in_val, wei_val, out_val to avoid undefined behavior
per Copilot feedback.
Updated backward data and backward weight kernels.
* Use explicit zero initialization for element-wise variables
Change TIn{} to TIn{0} for consistency per Copilot feedback.
All 3 kernels now use consistent zero initialization.
* Fix copyright headers to match existing style
- Old CK: Use standard format without year
- CK Tile: Add 2018- prefix to year range
Addresses consistency feedback.
* Rename GPU reference files: add _gpu suffix
* Refactor index calculations: use std::array and extract to helper functions
* Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Create Description::instance_string() function
To expose more reflection capabilities in MIOpen, we add the instance_string functionality to the ckr::Description class. This PR introduces a base class, adds the instance_string method, and implements the method by injecting the Traits::instance_string method through the ConvDescription constructor.
This will enable us to replace the specialized get_instance_string() method on device operations with a describe() method in a subsequent PR.
* Test describe().instance_string()
Update the instance string tests to also call `ckr::describe<Instance>().instance_string()`. This documents that the xld kernels are supported with describe(), but WMMA and DL kernels are not yet supported. Also update namespace and add a HasConvTraits concept.
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* feat(gemm_quant): add RRR and CRR layout support for aquant gemm
* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm
* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests
* fix: test cases compilation failure due to PR# 2095
* fix: make condition to filter out tests for gfx950 more explicit
* need to support the gfx950
* fix: add layout suppot for gfx950
* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)
* WIP: add support for pk_int4_t for aquant mode layouts RR and CR
* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant
* fix: compile time error for gfx950
* fix: minor bug where is_a_load_tr_v() was mising
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* docs: update changelog with new layout support info
* Update CHANGELOG.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: break test instances into multiple cpp files to reduce build time (#3319)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* refactor: break test instances into multiple cpp files to reduce build time
* chore: rename file for better code readability
* fix: merge conflict resolution
* fix: remove memory pipeline because new layout is not compatible
* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()
* refactor: address review comments
* solve the conflict
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Refactor the builder factory code into multiple files and subdirectories and a ck_tile::builder::factory namespace.
The factory implements compile-time dispatch from high-level signature and algorithm descriptors to our existing specialized convolution kernel implementations.
Major changes in this PR:
Dispatch logic is explicit in the function make_conv_instance instead of implicit in template specialization selection.
Helper code is moved to a subdirectory builder/factory/helpers.
Helpers now have unit tests.
Factories are moved to their own files.
Code moved to namespaces ck_tile::builder::factory and ck_tile::builder::factory::internal.
This does not yet fix the problem of bad error messages, but the make_conv_instance function makes the poor error messages clear. The choice of algorithm must be much more robust (perhaps with explicit enumeration in the algorithm descriptor), so that the dispatch doesn't fail.
Quality changes:
Making dispatch explicit rather than implicit will improve robustness, readability, maintainability, testability, and extensibility.
Separating code into separate files and subdirectories helps readability and extensibility.
Adding unit tests for helpers documents behavior and will enable more complex logic and functionality.
Separating files (especially unit tests) helps clarify includes and dependencies and makes code easier to refactor.
* Merge fwd conv groups in CK Tile.
* Fix building CK fwd convs.
* Add number of merged groups to conv fwd kernel name.
* Get number of merged groups from conv config.
* Rename GemmConfig to ConvConfig.
* Clean-up TODOs.
* Check that number of conv groups must be divisible by the number of merged groups.
* Improve error handling in the conv fwd example.
* Fix clang-format.
* Fix group offsets.
* Fix merge problem.
* Address feedback from code review.
* Fix clang-formatting.
* Add help for example
* Refactore the compute reference batched contraction to manage stride-aware calculation and some code cleanings
* Add stride-aware reference for batched contraction with independent D tensor layouts
* Add -num_d argument for runtime D tensor count selection in batched contraction
* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs
* Add descriptor-based architecture for batched contraction multi-dimensional stride support
* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0
* Add complete multi-dimensional stride support via descriptors
* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm
* Clean up batched contraction: remove old UniversalGemmKernel path
* Clean up batched contraction: remove legacy paths and finalize docs
* Optimize batched contraction example: pass dimension sizes not vectors
* correct the reference calculation, unsigned int to int
* Fix batched_contraction C++17 build errors for gfx90a CI
* wip: grouped_gemm implementation based on wmma kernel + example for fp16
* chore: clean up grouped_gem_wmma_splitk_fp16 example
* chore: add cmake options to fully disable XDL or WMMA kernels
* feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts)
* chore: add grouped gemm wmma bf16 example
* refactor: reuse more code between instance factory functions
* chore: turn test failure if not all batch sizes are supported into a warning
* chore: made failing of test on unsupported instances conditional to not break old tests
* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value
* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()
* fix: stray comma after parameter list
* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran
* chore: update copyright in header comments
* nit: minor feebdack
* refactor: unified XDL / wma tests
* fix: properly disable FP8 instances when ONLY targeting gfx11
* refactor: add v3 suffix to grouped_gemm device struct name
* fix: small typos in example code
* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags
* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths
* fix: make sure to not add instance library to group if library was skipped
* fix: make sure xdl grouped gemm doesnt fail the new test
* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case
* fix: examples not working since dependent types and functions were moved to ck namespace in develop
* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances
* chore: replace/add copyright headers with new format
* [CKBuilder] Update the testing documentation
Now that we have clear sets for smoke tests and regression test, we rearange the CMakeLists.txt file to be organized and have description and instructional comments.
Move all the test targets that compile quickly into the smoke test suite.
Update the builder README.md to reflect this new test organization and functionality.
* Update experimental/builder/README.md
Clarify integration tests description from review comment.
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Correct README.md
The regression tests here still run very fast like smoke tests, but can take minutes or even tens of minutes to compile. We test most of the builder functionality without compiling heavily-templated kernel code, but these regression tests do an expensive full build of the CK kernels.
---------
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
The ConvDescription::detailed command wasn't using TreeFormatter::writeLast correctly, which led to extra lines being drawn in the tree view. It's a simple fix, just a cosmetic improvment out reflection output (ASCII art).
Remove obsolete feature for extracting a description from a builder, since this should apply directly to the instance type. Also add some documentation, including a README.md for reflection.
* Add validity checks for MoE FlatMM scatter and enable bf16 hardware atomic
* correct clang-format
* removed unused rtol_atol variable from example code
* clang format correction
* remove unused varable max_accumulated_value from example
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.
- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
the streamk implementation reaches stability.
* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM
* [CK TILE STREAMK] Refactor: Extract common utility functions.
* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation
* Add pre-commit
* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK
* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK
* [CK TILE STREAMK] Update Jenkinsfile
* [CK TILE Engine] Update StreamK tile engine help message
Remove default value messages as they are automatically printed
* [CK TILE Engine] Update StreamK tile engine
- Remove namespace reboot
* [CK TILE Engine] Update StreamK tile engine
- Fix merge error
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch
* fix cpp17 compile error in the ck-tile examples
---------
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Some of the device implementation templates have macros like GridwiseGemmMultiABDTemplateParameters that can cause build errors if multiple files are included together. This error comes up with our builder code.
To clean up the macros and make them safer, we follow these follow rules:
* Use more specific names to avoid duplication.
* Undefine the macro after it is used to avoid leaking out of the file scope.
* Use a prefix CK_ on the macro to avoid conflicting with other libraries.
* Use all caps with underscores for preprocessor macro names.
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
* build and run ck_builder tests
* add test_ckb_all to targets
* fix syntax
* fix test path
* Update CMake targets for builder testing in CI (#3290)
Our existing CMake only had build targets. Update CMakeLists.txt to have CTEST targets:
* smoke-builder
* regression-builder
* check-builder
Co-authored-by: John Shumway <jshumway@amd.com>
* use check-builder target
* get rid of test_ckb_all target
* call ninja check-builder separately
---------
Co-authored-by: John Shumway <jshumway@amd.com>