[CK_TILE] add tf32 support
## Proposed changes
TF32 is already supported in CK on gfx942 and gfx950. This PR introduces
initial TF32 support in CK_TILE on the same targets.
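
For readers unfamiliar with the format: TF32 keeps the sign bit and 8 exponent bits of fp32 but only 10 explicit mantissa bits. The sketch below illustrates that precision loss on the host side. It is not the CK_TILE implementation; the function name `to_tf32` and the choice of rounding mode are assumptions made for illustration only, and it handles finite inputs only.

```python
import struct


def to_tf32(x: float, round_nearest: bool = True) -> float:
    """Reduce a finite fp32 value to TF32 precision (10 explicit mantissa bits).

    Hypothetical helper for illustration; the rounding mode used by the
    hardware or by CK_TILE is not stated in this PR and is assumed here.
    """
    # Reinterpret the value as its 32-bit IEEE-754 pattern.
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    if round_nearest:
        # Round to nearest, ties to even, over the 13 discarded mantissa bits.
        bits += 0x0FFF + ((bits >> 13) & 1)
    # Clear the 13 low mantissa bits that TF32 does not keep.
    bits &= 0xFFFFE000
    return struct.unpack("<f", struct.pack("<I", bits))[0]


if __name__ == "__main__":
    for value in (1.0, 1.0 + 2**-10, 1.0 + 2**-12, 3.14159265):
        print(value, "->", to_tf32(value))
```

Values that differ from a TF32-representable number by less than half a unit in the 10th mantissa bit collapse onto it, which is why tolerances for TF32 verification are typically looser than for fp32.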
## Checklist
Please put an `x` into the boxes that apply. You can also fill these out
after creating the PR. If you're not sure, please don't hesitate to ask.
- [ ] I have added tests relevant to the introduced functionality, and
the unit tests are passing locally
- [ ] I have added the test to REGRESSION_TESTS list defined at the top
of CMakeLists.txt in tests/CMakeLists.txt, **IF** the test takes more
than 30 seconds to run.
- [ ] I have added inline documentation which helps the maintainers
understand the motivation
- [ ] I have removed the stale documentation which is no longer relevant
after this pull request
- [ ] (If this change is user-facing) I have added release notes which
provide the end users with a brief summary of the improvement from this
pull request
- [x] I have run `clang-format` on all changed files
- [ ] Any dependent changes have been merged
## Discussion
[CK TILE ENGINE] Add grouped_gemm operator to Tile Engine
(gfx942/gfx950) (#4996)
## Motivation
The grouped_gemm CK Tile kernel exists (e.g.,
`example/17_grouped_gemm/`) but has no Tile Engine wrapper. Grouped GEMM
handles multiple independent GEMM problems with varying M/N/K dimensions
in a single kernel launch. This PR adds the Tile Engine infrastructure
for automated kernel generation, benchmarking, and profiling of grouped
GEMM kernels.
Jira: AICK-809
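
As a rough host-side picture of what "multiple independent GEMM problems with varying M/N/K" means, the sketch below computes each problem separately and returns one output per group. It is a plain-Python reference for illustration only and says nothing about how `ck_tile::GroupedGemmKernel` schedules the work on the GPU; the names `GemmProblem`, `gemm`, and `grouped_gemm` are hypothetical.

```python
from dataclasses import dataclass
from typing import List

Matrix = List[List[float]]


@dataclass
class GemmProblem:
    """One independent problem in the group (hypothetical container)."""
    a: Matrix  # shape M x K
    b: Matrix  # shape K x N


def gemm(a: Matrix, b: Matrix) -> Matrix:
    """Naive reference C = A @ B for row-major nested lists."""
    m, k, n = len(a), len(b), len(b[0])
    assert all(len(row) == k for row in a), "inner dimensions must match"
    return [
        [sum(a[i][p] * b[p][j] for p in range(k)) for j in range(n)]
        for i in range(m)
    ]


def grouped_gemm(problems: List[GemmProblem]) -> List[Matrix]:
    """Each group may have its own M, N, K; outputs are returned per group."""
    return [gemm(p.a, p.b) for p in problems]


if __name__ == "__main__":
    group = [
        GemmProblem(a=[[1.0, 2.0]], b=[[3.0], [4.0]]),             # 1x2 times 2x1
        GemmProblem(a=[[1.0, 0.0], [0.0, 1.0]], b=[[5.0, 6.0],
                                                   [7.0, 8.0]]),   # 2x2 times 2x2
    ]
    for c in grouped_gemm(group):
        print(c)
```

A reference of this kind is what a `--verify`-style check would compare the device result against, group by group.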
## Technical Details
- Created Tile Engine wrapper under `tile_engine/ops/gemm/grouped_gemm/`
following the `gemm_universal` template
- Files added: `CMakeLists.txt`, `grouped_gemm_common.hpp`,
`grouped_gemm_benchmark.hpp`, `grouped_gemm_profiler.hpp`,
`grouped_gemm_benchmark.py`, `grouped_gemm_benchmark_single.cpp`,
`grouped_gemm_instance_builder.py`, `configs/`
- Supported datatypes: fp16, fp8, bf16, bf8
- Supported layouts: rcr, rrr, ccr, crr
- Target GPUs: gfx942, gfx950
- CK Tile kernel: `ck_tile::GroupedGemmKernel` from
`include/ck_tile/ops/gemm/kernel/grouped_gemm_kernel.hpp`
- Instance builder extends `GemmKernelBuilder` base class
- Registered in `tile_engine/ops/gemm/CMakeLists.txt`
- Updated Jenkinsfile to build and benchmark grouped_gemm targets in CI
- Benchmark infrastructure includes JSON output, CSV export, and
verification support
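
To make the size of the instance space concrete, the sketch below enumerates the (datatype, layout) combinations listed above. It assumes the three-letter layout codes name the A, B, and C matrices in that order, with `r` for row-major and `c` for column-major; the helper names are hypothetical and this is not the code in `grouped_gemm_instance_builder.py`.

```python
from itertools import product

# Values taken from the lists above.
DATATYPES = ["fp16", "fp8", "bf16", "bf8"]
LAYOUTS = ["rcr", "rrr", "ccr", "crr"]

# Assumed meaning of each letter in a layout code.
_MAJOR = {"r": "row", "c": "column"}


def decode_layout(layout: str) -> dict:
    """Map a code such as 'rcr' to per-matrix majorness (assumed A, B, C order)."""
    if len(layout) != 3 or any(ch not in _MAJOR for ch in layout):
        raise ValueError(f"unsupported layout code: {layout!r}")
    a, b, c = (_MAJOR[ch] for ch in layout)
    return {"A": a, "B": b, "C": c}


def enumerate_instances() -> list:
    """List every (datatype, layout) pair a builder would need to cover."""
    return [
        {"datatype": dt, "layout": lo, **decode_layout(lo)}
        for dt, lo in product(DATATYPES, LAYOUTS)
    ]


if __name__ == "__main__":
    instances = enumerate_instances()
    print(len(instances), "combinations")
    print(instances[0])
```

With four datatypes and four layouts this yields 16 combinations, before any tile-size or pipeline variants are multiplied in.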
## Test Plan
- CMake configure succeeds for grouped_gemm targets
- Kernel instance builder generates valid kernel headers for all
(datatype, layout) combinations
- At least one kernel binary compiles and runs per datatype/layout
combination
- Correctness passes with `--verify 1` on gfx942/gfx950
## Test Result
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
Tile Engine support for gfx950
## Motivation
This PR adds support for the gfx950 GPU architecture to the Tile Engine
in Composable Kernel library, focusing on GEMM operations with FP8 and
BF8 data types.
## Technical Details
Added gfx950-specific MFMA warp GEMM implementations with conditional
compilation.
Updated default GEMM configuration parameters for tile sizes and warp
configurations.
Added Jenkins CI pipeline stage for testing TILE_ENGINE_GEMM on gfx950
hardware.
## Test Plan
Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.
## Test Result
Tile engine itself is a benchmarking utility, so if it passes the CI it
will be tested automatically.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding a couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* memory op changes
* memory op changes
* Fixing TILE_ENGINE_BASIC in Tile Engine
* Removing gfx90a from Tile Engine Run
* [CK TILE ENGINE] increasing ci configs for BASIC case
* Setting RUN_TILE_ENGINE_BASIC_TESTS to ON by default
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* hacky poc
* fix tile engine
* kill the lambda
* maybe fix test build
* more fixes
* clang-format
* save temp
* clang-format
* mostly fix examples
* clang-format
* remove dead code
* more cleanup
* fix fmha bwd build (default epilogue set/add appears to be broken)
* fix default epilogue tests but not correctness
* clang-format
* fix bquant
* clang-format
* cleanup dead code
* rearrange make windows for readability
* restore changes to IsSupportedArgument
* fix smoke-builder
* clang-format
* fixup rename class
* build fixes
* clang-format
* fix builder
* fixup
* remove set from builder tests
* fix test
* clang-format
* re-refactor the kernels
* clang-format
* fix header license
* remove memory operation from conv bwd test
* clang-format
* clang-format example,include
* clang-format test
* build fixes
* clang-format
* solve compilation error
* fix the CI
* solve compilation error
* clang format
* solve merge conflict
* solve merge conflict
* solve the gfx11 error
* solve test error
* moar build fixes
* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
This renames the typeToStr struct in the common utilities to DataTypeTraits and removes all duplication of DataTypeTraits across files in CK Tile.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* remove EXCLUDE_FROM_ALL from ck-tile examples
-> +15 min build time w/ 64 threads for a single arch
* fix cpp17 compile error in the ck-tile examples
---------
Co-authored-by: khuagarw <khuagarw@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
* Renaming old code
* Adding GEMM code with new Architecture
* Partial Progress : Errors
* Partial Progress : Working code
* Changes to element wise function
* Removing Debugging statements
* Working GEMM Multi D code
* Removing Stale Code
* Address Copilot review comments
* Address Copilot review comments
* Changes to validation file
* Changes to common code snippets
* Creating common folder
* Removing duplicate files
* Pointing to right common file
* Pointing to right common file
* Pointing to right common file
* Changing to VERBOSE
* Changing CMAKE messages to verbose
* Updating Cmake with right layout datatype configs
* Working code for GEMM Multi D
* Partial Progress : CK Tile Engine GEMM
* Partial Progress : CK Tile Engine GEMM
* Partial Progress : Working GEMM Code
* Partial Progress : Working GEMM Code
* Changing Jenkins to remove preshuffle
* Partial Progress : CK TILE ENGINE GEMM Debugging
* Partial Progress : Removing changes that are not GEMM
* Partial Progress : Validation of full block size in GEMM
* Changes in Jenkins to run only fp16 and bf16
* Addressing Review Comments
* Partial Progress : Addressing CI issues
* Partial Progress - Running GEMM for fp16,bf16 and rcr
* Clang
* Adding fp8 and bf8
* Adding fp8 and bf8
* Adding additional architecture
* Limited datatypes and layouts
* Adding k_block_per_cu in test config
* Changes to failing CI errors
* Changes to failing CI errors
* Validation for GEMM
* Adding Layout support
* Adding Validations
* Adding layout in jenkins
* Update on Jenkins
* Distribution validation for GEMM
* Resolving merge conflicts
* Solving merge conflicts
* Reading gpuname from target for gemm in ck tile engine
* Reading gpuname from target for gemm preshuffle in ck tile engine
* Reading gpuname from target for gemm preshuffle in ck tile engine
* Get GPU changes for GEMM Multi D in TILE ENGINE
* Addressing errors for gpu name in cktileengine
* initial work on adding support of gfx12 in tile_engine for GEMM benchmarking
* add stage("Run TILE_ENGINE_GEMM Tests on gfx1201") to Jenkins config
* make tile_[m/n/k] validation arch dependent
* Making edits to identify individual compilation issues.
* Minor fix for blob txt files not being created.
* Fixing compilation issues.
* Fixing ordering bug.
* Adding python profiling functionality.
* Setting individual build as default.
* Setting gpu target filtering for tile engine to gfx90a, gfx942 and gfx950.
* update the default running parameters and settings
* Fixing bug with benchmarking, shifting file generation to build instead of config.
* Updating fixes.
* Fixing json output and parsing.
* Disable ccache for tile engine gemm ops because we don't need it.
* Removing duplicate type definition.
* Improving json printing.
* Add the flexibility of different layout and more warp tile support
* Fix extra flag in name of individual kernels.
* Fixing bug with booleans.
* Solve the first patch of the post merge conflict
* Compilation fixes, and cosmetic improvements.
* Yet again compilation fixes after latest changes from develop.
* Fixing python benchmarking script.
---------
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
Co-authored-by: Vidyasagar Ananthan <vanantha@amd.com>
* Enable persistent kernel in tile_engine and use tail handler
* Fix formatting
* Add persistent to default_config.json
* Remove extra newlines and add persistent also to user config
* Reduce instances from default_config.json
* add persistent to benchmark.json and custom_ci_config.json
* changed the config file to have few instances
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: ThomasNing <thomasning@amd.com>
* Modify CMakeLists to allow for splitting.
* Modify CMakeLists for data and layout logic.
* Run tests and get build artifact.
* Test new Cmakelists for speedup.
* Further improvements for speedup.
* turn off the FMHA
* turn off the automatic tile engine gemm
* minor fix
* disable the transpose test first
* Address the comment
* Jenkinsfile
* change the make thread to 64
* change the compile thread to 32
* Try to use with less OS memory space
* Have the Unity build batch size to 2
* reduce the chunk size
---------
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
* update gpu_timer for rotating buffer to match hipBLASLt's implementation
* timing fix
* Updating gpu timer for old ck as well
* Revert "Updating gpu timer for old ck as well"
This reverts commit 958cd1bc99.
* code clean up with runtime argument; function rename
* code cleanup
* general timer fixes
* bug fix
* clang formatted
* addressing review comments
* clang formatted
* Addressing review comments
* CI fix
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Updating runtime log message for CK TILE ENGINE
* CKTile layout from config
* CKTile custom config for CI
* Documentation for Layout Changes
* CKTile Layout changes to Jenkins
* Fixing Clang Format
* Changes to Jenkins file to fix error
* fix(cmake-ck-dev): no longer sets invalid values as gpu arch
* style(py files): ruff formatting
* fix(cmake-ck-release): no longer sets invalid values as gpu arch
* chore(cmake-tile_engine): add reminder to uncomment user config json
* Changes to Jenkins file to address more cases
* Changes to Jenkins to fix Error
* Changes to Jenkins file for fixing an error
* Update Jenkinsfile (#2517)
* Update Jenkinsfile
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* updates to support int8 in 03_gemm example
* added comments, using aliases, helper functions
* test(gemm_universal): add test cases for int8 gemm pipeline
* fix(test_gemm): fix for failing unit test for int8
* test(ck_tile): add int8 unit test for gemm universal
* refactor(gemm_universal): GPU reference verification for GEMM code improved
* style(gemm_universal): removed extra comments and did clang format
* merging recent changes to universal gemm to tile_engine
* ck tile engine integration work
* feat(tile_engine): add int8 support to tile engine ops/gemm
* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8
* style: Format code with clang-format-12
* refactor(tile_engine): address review comments
* style: removed unhelpful comments & unused variables.
* build: tile engine uses default config
* feat: add int8 support for CK_TILE GEMM
* style: added trailing commas to codegen_utils.py
* refactor: tile engine
* refactor: formatting and code review
* refactor: code formatting for python files
* fix: suppress build warning
* add support for gfx950
* refactor:KWarpTile size in gemms util
* Fix the branch and wrap up the k warp tile
* Add bf8 integration
* refactor: clang format and rebase
---------
Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
* [CK_TILE] Refine fp8 in flatmm
1. Replace USING_MFMA_16x16x32 & USING_MFMA_16x16x32 with constexpr
2. Add an additional const check to avoid build error in HotLoopScheduler
3. Refine shuffleb to support both tile 32x32 and 16x16
4. Support command option -init
5. Move Gemm warp definition to a separate struct
* fix clang format
* fix clang format
* keep default behavior unchanged (warp tile = 16x16)
* fix tile engine build error
* fix a typo in codegen_utils.py
* address review comments
* address review comments
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* - elevate important build messages to log level STATUS
- comment out the rest (temporarily)
* - marked all low importance build messages as log_level=DEBUG