* Initial implementation:
- add new thread group transfer supporting transpose instruction
- refactor AB transfer to switch between thread and wave tiles methods
* Add some comments and remove explicit wave and lane calculations
* Remove compiler option for performance
* fp16 example: use tuned instance
* Missing cleanup
* Integrate wave transfer in existing gemm and batched gemm instances
* Add fast instances
* extend implementation for 8 bit datatypes
packed types not supported
* Address review comments
* Optimize pipeline v1 and re-introduce compiler option
* Disable wave tile approach for b scale gemm
* Fix for clang20
* Avoid code duplication of amd_global_load_transpose_to_vgpr function
* rebased on top of develop
* fixed missing shuffling and wrong indexing
* added tests for batched_b_scale
* added missing files
* fixed wrong stride computation and removed k batching (for now) due to precision issues
* reinstated k-batching with PRNG constrained to -1..1
* added specialization of GeneratorTensor_3 for int4 and fixed internal overflow
* added k-batching to reference and increased tolerances for test
* changed gemm_b_scale and gemm_universal tests to use correct parameters
* addressed review comments
* ported fixes back to non-batched version of b_scale
* addressed review comments
* run clang-format on older commits
* add type-conversion to AccDataType and then to CDataType to exactly mimic GPU's behavior
* added newline at end of file
* reflected changes from multi-abd branch in batched b_scale
* fixed gfx11 issue
* changed range for pk_i4 to -1...1 (-0.5...0.5 never really made sense for i4 anyway and should always have caused compiler errors, but since there was no int4 specialization of GeneratorTensor_3 until now, this passed)
* run clang format
* set range of i4 generation to 0...1 for upstream tests to pass. This replicated previous behavior, which however means that it is NOT properly tested.
* reduced range for pk_i4 even further to 0..0
* removed failing xdl instances. The failures were uncovered now that the tests are fixed
* removed generation of int4 values entirely
* divide B buffer by BPackedSize
---------
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
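The two-step conversion mentioned above (to AccDataType first, then to CDataType) matters because rounding twice can give a different result than rounding once. A minimal sketch of the idea follows. The type aliases and helper names are hypothetical stand-ins chosen for illustration, not the types or functions used by the actual reference implementation.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-ins for the kernel's template parameters.
using HostAccType = double;       // what a naive host reference might accumulate in
using AccDataType = float;        // accumulator type assumed for the device side
using CDataType   = std::int32_t; // output element type assumed for this sketch

// Naive reference: converts the host accumulator straight to the output type.
inline CDataType convert_direct(HostAccType v)
{
    // Out-of-range floating-point to integer conversion is undefined behavior.
    assert(v >= -2147483648.0 && v <= 2147483647.0);
    return static_cast<CDataType>(v);
}

// GPU-like reference: round to AccDataType first, then convert to CDataType.
inline CDataType convert_via_acc(HostAccType v)
{
    const AccDataType acc = static_cast<AccDataType>(v);
    assert(acc >= -2147483648.0f && acc < 2147483648.0f);
    return static_cast<CDataType>(acc);
}
```

With these assumed types, `convert_direct(16777217.0)` keeps the value 16777217, while `convert_via_acc(16777217.0)` yields 16777216 because 2^24 + 1 is not representable in `float`. A host reference that skips the intermediate step can therefore disagree with the device result even when both are computed correctly.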
The following changes were made
- Renamed iter to iter_start
- Renamed tile_iter to tile_iter_start
- Moved documentation from member variables to getters
- Removed double underscore from extra_iters_before_me variable
- Defined parent header in impl file
- Removed unused includes
There are 2 derived structs based on whether Stream-K is persistent or not.
If it's persistent, both the data parallel and Stream-K sections
are persistent. If it's non-persistent, only the
Stream-K section is persistent, while the data parallel section will have
separate workgroups allocated for it. Both structs will have a template
argument for Persistent.
The 2 derived classes will inherit common variables and functions from the
Stream-K TilePartitioner base class. There are additional variables for the
differing data parallel sections that will be added to each derived class,
that are in charge of the indexing/bookkeeping for the data parallel sections.
The only additional function that will differ between the 2 structs is GridSize(),
as the non-persistent will allocate extra workgroups for data parallel.
Unit tests for the derived structs are included.
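The split described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the struct and member names are hypothetical, and the policy of sending full rounds of tiles to the data parallel (DP) section and the remainder to the Stream-K (SK) section is an assumption of this sketch, not necessarily what the partitioner does.

```cpp
#include <cassert>

// Hypothetical base: bookkeeping shared by both variants.
struct StreamKTilePartitionerBase
{
    int num_tiles;      // total number of output tiles
    int num_workgroups; // workgroups that run the Stream-K section

    StreamKTilePartitionerBase(int tiles, int workgroups)
        : num_tiles(tiles), num_workgroups(workgroups)
    {
        assert(tiles >= 0 && workgroups > 0);
    }

    // Assumed policy: full rounds of tiles are data parallel, the rest is Stream-K.
    int dp_tiles() const { return (num_tiles / num_workgroups) * num_workgroups; }
    int sk_tiles() const { return num_tiles - dp_tiles(); }
};

template <bool Persistent>
struct StreamKTilePartitioner;

// Persistent: the same workgroups loop over their DP tiles, then do Stream-K work.
template <>
struct StreamKTilePartitioner<true> : StreamKTilePartitionerBase
{
    using StreamKTilePartitionerBase::StreamKTilePartitionerBase;

    int dp_tiles_per_workgroup() const { return dp_tiles() / num_workgroups; }
    int GridSize() const { return num_workgroups; }
};

// Non-persistent: one extra workgroup is launched per DP tile.
template <>
struct StreamKTilePartitioner<false> : StreamKTilePartitionerBase
{
    using StreamKTilePartitionerBase::StreamKTilePartitionerBase;

    int GridSize() const { return num_workgroups + dp_tiles(); }
};
```

For 10 tiles and 4 workgroups, this sketch assigns 8 tiles to the DP section and 2 to the SK section. The persistent variant launches 4 workgroups, while the non-persistent variant launches 12, which is the only place the two structs differ in their public interface here.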
To better align with the original Stream-K paper, this change implements
a new Stream-K tile partitioner base class. This class will handle the
Stream-K setup that is common to both a persistent and non-persistent DP
section. A later change will implement derived classes to handle the
differences between persistent and non-persistent DP.
This change also includes unit tests for the base tile partitioner.
* GH-2368 Adding a basic glossary
GH-2368 Minor edits
GH-2368 Adding missing READMEs and standardization.
resolving readme updates
GH-2368 Minor improvements to documentation.
Improving some readmes.
Further improvement for readmes.
Cleaned up the documentation in 'client_example' (#2468)
Update for PR
Update ACRONYMS.md to remove trivial terms
Update ACRONYMS.md to provide detailed explanations for BF16 and BF8 formats
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Apply suggestion from @spolifroni-amd
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
Update README.md to clarify CK Tile API description and remove outdated references to the Tile Engine.
revise 37_transpose readme
revise 36_copy readme
Remove references to the Tile Engine in README files for 19_gemm_multi_d and 35_batched_transpose, and update distribution links for clarity.
Remove references to the Tile Engine in multiple README files and update distribution links for consistency and clarity.
Remove references to the Tile Engine in README files across multiple examples
Refine README files by removing outdated references to the Tile Engine
* Updates based on PR feedback 1
* Updates based on PR feedback 2
* Updates based on PR feedback 3
* Updates based on PR feedback 4
* Updates based on PR feedback 5
* Updates based on PR feedback 6
* Updates based on PR feedback 7
* Updates based on PR feedback 8
* Content Modification of CK Tile Example
* Modify the ck_tile gemm config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* [CK_TILE] Correct BlockWarps calculation and fix smoke-test in rmsnorm
* Update rmsnorm host reference
* Update tree reduction of rmsnorm for reference host
* Fix cross warp for m > 1 cases
* Add RMSNorm model selectable option for host reference
* Fix save_unquant cases
* Update reference rmsnorm forward function to use enum for model sensitivity
* Update reference rmsnorm calculation for model sensitivity
* Fix m warp for layernorm
* Adjust parameter of reference for twoPass
* Fix clang format
* Run clang-format-overwrite.sh to fix formatting issue
* fix clang format
---------
Co-authored-by: MHYang <mengyang@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Our current cmake/gtest.cmake file does not enable gmock. Gmock provides the matchers we need for more readable unit tests. This PR enables gmock and does a little cleanup in gtest.cmake:
* Enable BUILD_GMOCK by default (was previously disabled)
* Patch gtest-src/googlemock/CMakeLists.txt for broken include path.
* Add configuration to gmock if the target is used.
No other changes in this PR, but I've verified I can use gmock matchers correctly once I include these changes in other code.
* Enable the adapted LDS B layout for Row-Major
* fix formatting
* Implement specialized col-major A LDS block descriptor
* Fix formatting
* Use VecLoadSize for AK1/BK1
* Fix some thread access pattern values
* Use GetVectorSizeA for A
* Fix formatting
* Add extra condition to avoid division by zero
* disable layout for wave32
* remove extra else
* fix formatting
* Fix formatting
* Rename one remaining TileDistributionEncodingPattern2D
* Use integer ceil division
* revert remod.py changes
* also revert utility.hpp
* use getA/BTileAccessPattern everywhere
* use integer_divide_ceil for AK0 too
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
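The switch to integer ceil division for AK0 in the list above avoids silently dropping a partial chunk when K is not a multiple of AK1. A minimal sketch of the idiom is shown here. The helper `ceil_div` is a hypothetical stand-in written for this example, not the library's own `integer_divide_ceil`.

```cpp
#include <cassert>

// Stand-in helper for illustration; rounds the quotient up instead of down.
inline int ceil_div(int a, int b)
{
    assert(a >= 0 && b > 0); // also guards against division by zero
    return (a + b - 1) / b;
}

// Number of AK1-sized chunks needed to cover K elements (assumed meaning of AK0).
inline int compute_ak0(int k, int ak1) { return ceil_div(k, ak1); }
```

With K = 100 and AK1 = 8, plain integer division gives 12 and leaves 4 elements uncovered, while `compute_ak0(100, 8)` gives 13. When K is an exact multiple, such as 96, both approaches give 12.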
* Initial commit. create batched_contraction_kernel file
* initial problem definition
* implement initial example to launch kernel
* add universal gemm to contraction. initial phase
* complete implementation for special case all Dims are 1 and no Ds
* clean code
* initial changes to support multi dimensional G
* more progress in implementing multiple G
* tmp commit
* manage dynamic NumDimG in kernel
* improving example for multi M,N,K,G handling. start generalizing kernel. it is a temporary commit
* implement the example for general Multi dimension G M N K and test different reference calculation algorithms
* 2 functions for reference using multi dimensional and flat indexing
* clean the code for multi-dimensional G, M, N, K contraction and add some logs
* Add Make descriptor function in kernel for merging Ms, Ns, Ks for A, B, E
* some cleaning on kernel
* clean the code for calculating the offsets from flatten batch number
* Start adding MultiD support to kernel and example
* more changes to manage multi D in kernel and example
* manage passing multi d to kernel and testing.
* complete multi D support in kernel. modify example code to support it
* Correct algorithm to calc the correct offset values for D tensor batches and some code cleaning
* Minor fix
* Generalize example code for variable NumD tensors and apply cleanup based on review feedback
* Refactored code and addressed review feedback
* refactoring, cleaning, add documents, in kernel side and example codes
* Optimize batch offset calculation in kernel
* Inline CalculateBatchOffset in batched contraction kernel, update CHANGELOG.md
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
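The batch offset calculation from a flattened batch number, mentioned several times in the list above, can be sketched as follows. This is a host-side illustration under assumptions: the function name and signature are hypothetical, and the last G dimension is assumed to vary fastest in the flattened index.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Decompose a flat batch index into per-dimension G indices and combine them
// with the tensor's G strides to get the element offset of that batch.
inline std::int64_t calculate_batch_offset(std::int64_t flat_batch,
                                           const std::vector<std::int64_t>& g_lengths,
                                           const std::vector<std::int64_t>& g_strides)
{
    assert(g_lengths.size() == g_strides.size());
    assert(flat_batch >= 0);

    std::int64_t offset = 0;
    // Peel indices off starting from the fastest-varying (last) G dimension.
    for(std::size_t i = g_lengths.size(); i-- > 0;)
    {
        assert(g_lengths[i] > 0);
        const std::int64_t idx = flat_batch % g_lengths[i];
        flat_batch /= g_lengths[i];
        offset += idx * g_strides[i];
    }
    return offset;
}
```

For G lengths {2, 3} and strides {300, 100}, flat batch 4 maps to indices (1, 1) and an offset of 400. Using the strides rather than the lengths for the final sum is what lets the same code handle tensors whose batch dimensions are not densely packed, for example strides {1000, 100} give an offset of 1100 for the same batch.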
* Add initial fp16_mem_128x128x32_2x2x1_32x32x16_NonPersistent test suite
* Account for stride when computing K offsets for A and B tensor
This change ensures that the correct stride is used when computing the K
offsets into the A and B tensors in the Stream-K Kernel's operator()
function. This ensures that the kernel executes correctly regardless of
whether A and B are row or column major.
* Move helper code to test_gemm_streamk_util.hpp
* Separate tests into smoke/regression/extended. Add bf16 datatype
* Run clang-format
* Refactor combinatorial macro expansion and naming
* Adjust the initialization values to account for better tolerance on bf16
* Correct BF16 datatypes in comments
* Move the extended tests under the REGRESSION_TESTS label
* Apply suggestions from code review
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Emily Martins <emily.martins@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
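The stride fix for K offsets described above comes down to whether K is the contiguous dimension of the tensor. A minimal sketch, assuming A is M x K and B is K x N with a single leading-dimension stride each. The enum and function names are hypothetical and are not taken from the kernel.

```cpp
#include <cassert>
#include <cstdint>

enum class Layout { RowMajor, ColumnMajor };

// A is M x K. Row-major: element (m, k) sits at m * stride_a + k, so K is contiguous.
// Column-major: element (m, k) sits at m + k * stride_a, so each K step jumps a stride.
inline std::int64_t a_k_offset(std::int64_t k, Layout layout, std::int64_t stride_a)
{
    assert(k >= 0 && stride_a > 0);
    return layout == Layout::RowMajor ? k : k * stride_a;
}

// B is K x N. Row-major: element (k, n) sits at k * stride_b + n, so each K step
// jumps a stride. Column-major: element (k, n) sits at k + n * stride_b.
inline std::int64_t b_k_offset(std::int64_t k, Layout layout, std::int64_t stride_b)
{
    assert(k >= 0 && stride_b > 0);
    return layout == Layout::RowMajor ? k * stride_b : k;
}
```

Starting at K index 32 with a stride of 512, a row-major A advances by 32 elements while a column-major A advances by 16384. Treating both cases as a plain `k` offset would only be correct for one layout per tensor.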
Fixes compilation error on SLES15 with GCC 7 for gfx942 builds:
error: 'vector' may not intend to support class template argument deduction [-Werror,-Wctad-maybe-unsupported]
Changes:
- Explicitly specify template argument for `std::vector<mode_enum>` instead of relying on C++17 CTAD
- Maintains compatibility with both older (GCC 7) and newer compilers
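The change above can be illustrated with a small sketch. The enumerator names below are hypothetical, since the real `mode_enum` is defined in the library; only the form of the declaration matters here.

```cpp
#include <cassert>
#include <vector>

// Hypothetical enum for illustration only.
enum class mode_enum { batch, group };

inline std::vector<mode_enum> all_modes()
{
    // Relies on C++17 class template argument deduction (CTAD) and may trigger
    // -Wctad-maybe-unsupported, which becomes an error under -Werror:
    //   std::vector modes{mode_enum::batch, mode_enum::group};

    // Explicit template argument: no deduction needed, accepted by older compilers.
    std::vector<mode_enum> modes{mode_enum::batch, mode_enum::group};
    assert(modes.size() == 2);
    return modes;
}
```

Spelling out the element type costs a few characters but removes the dependency on deduction guides, so the same line compiles whether or not the toolchain treats CTAD on `std::vector` as supported.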
* debugging
* debugging for prefill shapes
* comment unused code
* fix for prefill shapes
* clearing up the code
* add int4 to universal gemm example
* clang formatted
* adding test for prefill shapes in block scale gemm
* lil improv on the block pipeline
* Address Review Comment
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* reuse local prefetch logic from compute v4 pipeline
add single-tile test
explicit lambda capture
reuse lds block descriptors from base policy for the transposed case
match the test case kernel configuration with compute v4
* add comments
* add instances of device_grouped_conv_fwd_xdl_f32_comp_instances
* add instances of device_grouped_conv_fwd_xdl_f32_tf32_mem_instances
* add instances of device_grouped_conv_fwd_xdl_large_tensor_f32_tf32_instances
* tf32:conv:add instances for base class DeviceConvFwd
* tf32:conv:add instances for base class DeviceGroupedConvBwdDataMultipleD
* tf32:conv:add instances for base class DeviceGroupedConvBwdWeight
* add tf32 in profiler
* remove gnhwc/ngchw/ngcdhw instances
* remove non-ndhwgc/nhwgc/nhwc instances
* add check in IsSupportedArgument()
See build error log from
https://github.com/ROCm/composable_kernel/issues/2271#issuecomment-3150218542
This PR makes vector element access constexpr-safe by avoiding operator[] on
ext_vector_type(2) and replaces those sites in the pk_fp4 conversions so they
can be used in constant expressions. operator[] on ext_vector_type(2)
isn't allowed in constant expressions, which caused "constexpr function never
produces a constant expression" with a note at x[0]. Using `bit_cast` to a
trivial array representation keeps it constexpr-compatible.
Signed-off-by: Hollow Man <hollowman@opensuse.org>
* Pooling 2D/3D with reference
* Tests & cleanup
- added test for pooling
- cleanup
- removed 2d example
* Comment resolution
- README added
- example target name rectified
- appropriate arg description and comments added
* clang-format
* appropriate blocksize calc
* modifications for future indexing addition
- instead of transforming views we now transform the descriptors, so
that the same descriptor can be re-used for index tensor in the future
* some basic fixes
* comment resolutions
* comment resolutions
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Add KBatch support for gemm_ab_scale
* Revert kernel parameters change
* Remove printing
* fix formatting
* fix check
* Use {} in if
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* WIP: add memory pipeline boiler plate code that compiles and works for one block
* WIP: tail handling works for memory pipeline
* WIP: numerical errors appear to be gone after adding block_sync_lds()
* fix: numerical error with memory pipeline by adding block_sync_lds() and new tail handler
* refactor: remove debug print statements and lints
* fix: remove redundant sync barriers
* chore: remove lint
* fix: remove unused code from tile handler and remove redundant block_sync_lds()
* fix: correct parent struct name for memory pipeline
* fix: remove static assert check from parent struct and add it to child struct because not all child structs need the static assert
* fix: defer block sync lds to just before prefill
* Update Jenkinsfile
Adding logic to skip CI checks when a commit contains changes to non-relevant files like docs, .md, licenses, and .github workflow files.
* Update Jenkinsfile
* Update Jenkinsfile
* Update Jenkinsfile
Testing skip env var
* Update Jenkinsfile
Fixing syntax
* Update Jenkinsfile
Simplifying CI check logic
* Update Jenkinsfile
Testing skipping logic on stages.
* Update Jenkinsfile
Removing post block. The status for skipped stages is already reported.
* Testing Docs
Testing that modifications to files in the docs folder do not trigger the build and test stages.
* Testing Multifile Trigger
Removed Jenkinsfile from the skip patterns. Reversed change to docs file. This test should not skip CI checks.
* Clean code
Renamed setup stage to be more descriptive.
Added pipeline env variable for consistency.
Moved performance test results stage conditional up a level so the parent stage appropriately reports the status if it is skipped.
* Fixing syntax error
* Updated CRON Flags
Added the FORCE_CI flag to the CRON instructions. This will ensure CI does not skip the job.
* Updating logging
Making logs more explicit.
* Comment update
Cleaning comments.
* Update Jenkinsfile
Reverting performance reports when condition.
* Parallel Test
Testing stage status with parallel stages
* Update Jenkinsfile
* Update Jenkinsfile
Removing stages for quick testing
* Update Jenkinsfile
* Testing skipped parallel stages
Testing the addition of a coordination stage to always pass and give an update to skipped parent stages with parallel sub-stages.
* Testing parallel stages
Adding coordination stage to test if parent check status is correctly updated.
* Simplified performance results stage
Removed parent stage as there are no other parallel stages to execute (yet).
* Testing final clean up stage
* Testing check status update
Testing - forcing status to update after a stage skip.
* Testing results stage skip
* Removing test stage
* Testing pipeline
* Testing post status updates
* Process Test Results Post Event Update
The stage will report success when it skips or is successful.
* Testing non-relevant file change
This should skip build and test in CI
* Reverting test
updating regex file patterns to use strings instead of regex literal syntax.
* Fixing file matching regex
* Testing docs modification
* Fixing default env var value
* Correcting env var assignment
* Pipeline test
Updating docs file. Should skip ci.
* Testing Pipeline
Setting default run ci state.
* Adding debugging
* Removing debugging
* Pipeline test
Should skip pipeline
* Pipeline Test
Mixed files to trigger a CI run
* Adding additional status updates
The parent stage sometimes remains in pending even if the child stage completes when skipped. Added an additional status update for the parent stage.
* Fixing variable name
* Moving stage names
Moved the performance stage names to a single location because they are referenced multiple times. This reduces errors with typos in the future.
* Revert "Moving stage names"
This reverts commit 7cf6743e54.
* Update Jenkinsfile
Handle both truly empty arrays and arrays containing only empty strings.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
---------
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>