* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* feat(gemm_quant): add RRR and CRR layout support for aquant gemm
* test(gemm_quant): add unit tests for RRR and CRR layout support for aquant gemm
* fix: compilation error on gfx950 by omitting support for the gpu in example and unit tests
* fix: test cases compilation failure due to PR# 2095
* fix: make condition to filter out tests for gfx950 more explicit
* need to support the gfx950
* fix: add layout suppot for gfx950
* Extend pk_int4_t support for block_scale_gemm aquant CR and RR layout (#3277)
* WIP: add support for pk_int4_t for aquant mode layouts RR and CR
* test(block_scale_gemm): add unit tests for CRR and RRR layout when data type is int4 && aquant
* fix: compile time error for gfx950
* fix: minor bug where is_a_load_tr_v() was mising
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant (#3318)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* docs: update changelog with new layout support info
* Update CHANGELOG.md
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* refactor: break test instances into multiple cpp files to reduce build time (#3319)
* feat(block_scale_gemm): Add layout Col-Col-Row-Col (ABC-Aquant) for tensors in aquant
* test: add unit tests for new layout support CCRC for aquant block scale gemm
* refactor: break test instances into multiple cpp files to reduce build time
* chore: rename file for better code readability
* fix: merge conflict resolution
* fix: remove memory pipeline because new layout is not compatible
* build: resolve build errors for gfx950 by modifying is_a_load_tr() & is_b_load_tr()
* refactor: address review comments
* solve the conflict
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
* Merge fwd conv groups in CK Tile.
* Fix building CK fwd convs.
* Add number of merged groups to conv fwd kernel name.
* Get number of merged groups from conv config.
* Rename GemmConfig to ConvConfig.
* Clean-up TODOs.
* Check that number of conv groups must be divisible by the number of merged groups.
* Improve error handling in the conv fwd example.
* Fix clang-format.
* Fix group offsets.
* Fix merge problem.
* Address feedback from code review.
* Fix clang-formatting.
* Add help for example
* Refactore the compute reference batched contraction to manage stride-aware calculation and some code cleanings
* Add stride-aware reference for batched contraction with independent D tensor layouts
* Add -num_d argument for runtime D tensor count selection in batched contraction
* Add stride vector arguments in example code for testing non-contiguous batched contraction inputs
* Add descriptor-based architecture for batched contraction multi-dimensional stride support
* Add multi-dimensional non-contiguous stride support to batched contraction, num_d = 0
* Add complete multi-dimensional stride support via descriptors
* Enable vectorization in descriptor-based batched contraction. Add pad_tensor_view to local RunGemm
* Clean up batched contraction: remove old UniversalGemmKernel path
* Clean up batched contraction: remove legacy paths and finalize docs
* Optimize batched contraction example: pass dimension sizes not vectors
* correct the reference calculation, unsigned int to int
* Fix batched_contraction C++17 build errors for gfx90a CI
* wip: grouped_gemm implementation based on wmma kernel + example for fp16
* chore: clean up grouped_gem_wmma_splitk_fp16 example
* chore: add cmake options to fully disable XDL or WMMA kernels
* feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts)
* chore: add grouped gemm wmma bf16 example
* refactor: reuse more code between instance factory functions
* chore: turn test failure if not all batch sizes are supported into a warning
* chore: made failing of test on unsupported instances conditional to not break old tests
* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value
* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()
* fix: stray comma after parameter list
* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran
* chore: update copyright in header comments
* nit: minor feebdack
* refactor: unified XDL / wma tests
* fix: properly disable FP8 instances when ONLY targeting gfx11
* refactor: add v3 suffix to grouped_gemm device struct name
* fix: small typos in example code
* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags
* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths
* fix: make sure to not add instance library to group if library was skipped
* fix: make sure xdl grouped gemm doesnt fail the new test
* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case
* fix: examples not working since dependent types and functions were moved to ck namespace in develop
* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances
* chore: replace/add copyright headers with new format
* [CK TILE STREAMK] Introduce initial support for tile engine in streamk GEMM.
- This commit lays the groundwork for integrating the tile engine into streamk GEMM.
It focuses on creating benchmark executables for streamk GEMM.
- Additional scripts like test_benchmark.sh and gemm_benchmark.py will be added once
the streamk implementation reaches stability.
* [CK TILE STREAMK] Enable CI to execute tile engine benchmarks for StreamK GEMM
* [CK TILE STREAMK] Refactor: Extract common utility functions.
* [CK TILE STREAMK] Revise tile engine of streamk to align with the updated implementation
* Add pre-commit
* [CK TILE STREAMK] Add 'dp_persistent' and 'reduction_strategy' in output of CK TILE STREAMK
* [CK TILE STREAMK] Fix a bug about value of 'dp_persistent' of CK TILE STREAMK
* [CK TILE STREAMK] Update Jenkinsfile
* [CK TILE Engine] Update StreamK tile engine help message
Remove default value messages as they are automatically printed
* [CK TILE Engine] Update StreamK tile engine
- Remove namespace reboot
* [CK TILE Engine] Update StreamK tile engine
- Fix merge error
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch
* fix cpp17 compile error in the ck-tile examples
---------
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
- Add conditional selection to use v3 pipeline when PreshuffleQuant is true
- Add static assertion in memory pipeline to prevent PreshuffleQuant usage
- Restore BaseBQuantGemmPipelineAgBgCrCompV3 for BQuant cases
- Update BaseGemmPipeline selection to handle all quant modes properly
* Remove old CK Tile Stream-K implementation
The original CK Stream-K implementation was based on old CK's Stream-K
block to C tile map. However, this implementation did not align with the
original Stream-K paper. Thus, we implemented a new tile partitioner and
associated Stream-K kernel, which was placed in the reboot namespace.
Now that the new Stream-K implementation is ready, this change removes
all artifacts of the old implementation. Specifically, the following
changes were made:
- Removes old Stream-K tile partitioner from CK Tile
- Removes the reboot namespace such that the new implementation resides
in the ck_tile namespace only.
- Adds tests for bf8 and fp8 using the new implementation
- Removes tests for the old implementation
- Remove the v2 suffix from the new CK Tile Tile Partitioner
derived classes.
- Updates Stream-K Kernel ops file to use /** commenting style.
* Remove v2 from tile partitioner validation function names
* Wrap ck host utitlies in CK namespace.
The CK and CK-Tile source code bases are incompatible because CK is not properly using namespaces everywhere. In particular, we need to put hip_check_error in the ck namespace.
Move all functions in include/ck_/host_utility that were in global namespace into the ck namespace.
There may be additional namespace problems like this, and it's possible we'll have namespace clashes. But it is good design to properly guard our to code bases (CK and CKTile) so that they can both coexist. Moreover, estabilishing this compatiblity is essential if we are going to allow the builder to instantiate kernels from either template library.
* Add using declarations to test code.
After moving some of the untils into the ck namespace, most examples and a few tests had to be updated to recognize the new namespace declarations. We add using declarations to individual compute units for functions that were previously in the global namespace.
* Add using declarations to client examples.
- Added support for static_distributed_tensor<...>
- Added support for tile_distribution<...>
- Added support for tensor_view<...>
- Added support for tensor_descriptor<...>
Now type-print handles both:
1. Type aliases (::BottomTensorView, ::TensorDesc, etc.)
2. Full types with no runtime storage (static_distributed_tensor, etc.)
Shows [from type] indicator for all type-only extractions.
Example: type-print dst_tensor
Works even when 'p dst_tensor' shows 'Cannot access memory'
Co-Authored-By: Claude <noreply@anthropic.com>
* pooling example readme update
- The updated readme explains the transformations of the pooling kernel
using a mermaid diagram
* Update example/ck_tile/36_pooling/README.md
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* resolve comments
---------
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Extend AK1 / BK1 support:
- Add support for AK1 != BK1
- Add support for AK1, BK1 > 8
- Introduce KInner template parameter for pipelines when loading multiple tiles with one instruction
* fix clang format
* Introduces the new partitioner to implement the reduction StreamK kernel
* Add more doc text to functions
* Add persistent-dp option to streamk example
* Update example/ck_tile/40_streamk_gemm/README.md