* Additional flavors for WMMA conv fwd large tensor
- added F16/BF16 clamp operation
- added F16/BF16 bias_clamp operation
- small modification to the device code to accommodate extra tensors
* changed strategy to handle GemmArgs array
* Adding generic instance
* Added generic instance to clamp and bias_clamp ops
* Addition of Stream-K tests using Tile Engine
This change adds an implementation for generating Stream-K tests using Tile Engine.
This will generate various test executables for different combinations based on the
config files. This addition has simple tests running for bf16 and fp16, with both
atomic and reduction strategies and compv3 pipeline. The tests rely on the implementation
of Stream-K in Tile Engine.
* integrating addition of tree reduction and editing the README
* temporarily removing parallel and tree reduction from configs while bugs regarding them are being resolved
* WIP POC of dispatcher
* Dispatcher python workflow setup.
* Dispatcher cleanup and updates.
Further dispatcher cleanup and updates.
Build fixes
Improvements and python to CK example
Improvements to readme
* Fixes to python paths
* Cleaning up code
* Improving dispatcher support for different arch
Fixing typos
* Fix formatting errors
* Cleaning up examples
* Improving code generation
* Improving and fixing C++ examples
* Adding conv functionality (fwd,bwd,bwdw) and examples.
* Fixes based on feedback.
* Further fixes based on feedback.
* Adding stress test for autogeneration and autocorrection, and fixing preshuffle bug.
* Another round of improvements based on feedback.
* Trimming out unnecessary code.
* Fixing the multi-D implementation.
* Using gpu verification for gemms and fixing convolutions tflops calculation.
* Fix counter usage issue and arch filtering per ops.
* Adding changelog and other fixes.
* Improve examples and resolve critical bugs.
* Reduce build time for python examples.
* Fixing minor bug.
* Fix compilation error.
* Improve installation instructions for dispatcher.
* Add docker based installation instructions for dispatcher.
* Fixing arch-based filtering to match tile engine.
* Remove dead code and fix arch filtering.
* Minor bugfix.
* Updates after rebase.
* Trimming code.
* Fix copyright headers.
* Consolidate examples, cut down code.
* Minor fixes.
* Improving python examples.
* Update readmes.
* Remove conv functionality.
* Cleanup following conv removal.
* Added bias_bnorm_clamp instances.
* fwd_bias_bnorm_clamp comp instances
* fwd_bias_bnorm_mem_inter and mem_intra instances
* fwd_bias_bnorm_merged_group_instances
* fwd_bias_bnorm_clamp_conv3d_bf16 and f16 instances
* Device level changes for fwd_bias_bnorm_clamp
* Added the test to the regression test list.
* Removed the part 2 and 2x instances
* Removed the irrelevant checks in wmma
* Refactored the instances to adapt to new device implementation
* Updated the reference and include files
* enabling tests
* Added missing profiler
* Added missing instance entry, deleted by mistake
* Reduce bias bnorm clamp instances to only a single generic one.
* Clean up cmakelists file
* clang-format
* Change bias bnorm clamp tests to use monotone initialization values, so that tiny off-integer gemm results on RDNA3 do not blow up.
* Renaming some instance lists and add functions to be more standardized.
* Commented out non default instances.
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
* add block scale parameters to kernel
* add block scale to kernel
* add smoke test
* format
* Revert "format"
This reverts commit 356c3c9706.
* only format my code
* format py
* fix auto not allowed in function prototype
* change instance tttt to ttff
* fix structured binding issue
* change s_acc elementwise op
* async pipeline add block scale
* add P quantization using shift exp2
* precompute (m - shift) once per row
* change blk scale seqstrt ptr name
* fix some names
* fix for deduction guide
* fix some comments
* add P scale to qr_ksvs_pipeline
* add comment to idx_identity
* change the method of calculating descale block index
* unify naming style: use block_scale_ as name prefix
* unify naming style
* update the CHANGELOG.md
* Add FP8 block scale quantization support for FMHA forward kernel
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* ck-builder: remove SPATIAL_DIM parameter from ConvTensorLayouts
This information is already in the SIGNATURE, so it's pointless to pass it
separately. This streamlines the interface of those functions a bit. Also
touches up the style of those files in general.
* ck-builder: implement reference conv using old ck
The old ck implementation is more featureful and better tested.
* ck-builder: replace test_reference_execution reference with old ck
This strips out the ck-tile gpu reference implementation completely.
* ck-builder: clean up test_reference_execution
- Remove unnecessary messages
- Replace EXPECT_TRUE(true) with EXPECT_NO_THROW()
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implementation
* wip: many fixes in implementation of batched gemm gemm multiple d
* wip: batched gemm gemm multiple d gridwise op compiling, not working yet
* fix: incorrect d0 grid indexing in batched gemm gemm multiple d
* feat: add instances for batched gemm add relu gemm add
* chore: configure instance with low vector transfer size for odd sizes
* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense
* fix: update device_batched_gemm_gemm_wmma to work with new gridwise changes
* fix: disable odd size tests on XDL archs
* chore: removed temporary logging
* chore: update some references to C tensor to E tensor
* Tentative fix for example template params
* Tentative fix for non-multi-D batched gemm gemm device impl.
* Tentative fix for xdl example template params
* Tentative fix for profiler build on gfx90a
* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions
* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply
* fix: make the gemm1 data types match what happens in the device op
* feat: add d0s/d1s datatypes and layouts to the device op type string
* chore: change element-wise op so addition happens in fp32
* chore: add static asserts for gemm0/gemm1 calculated wave sizes
* chore: also updated other element-wise ops to use fp32 calculations
* chore: log number of supported instances
* chore: update instance comment
* chore: disable kernel timing in example by default
* fix: gemm1 wave size calculation
* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions
* chore: remove increased tolerance in batched gemm gemm multiple d example
* chore: add comment explaining that verification fails for certain input values
* chore: clarify instance comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
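The chunk index calculation above can be sketched on the host side as follows. This is only an illustration in Python, not the kernel code: the class `AsyncInputSchedule` and its `is_ready` helper are hypothetical names, and treating a signal value of 1 as "chunk available" is an assumption. Only the field names and the modulo formula come from the description above.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AsyncInputSchedule:
    # Hypothetical host-side stand-in; field names follow the commit message.
    tiles_per_chunk_m: int
    tile_idx_pivot_m: int
    num_chunks: int

    def chunk_idx(self, tile_idx_m: int) -> int:
        # chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
        # Integer division; indices are assumed to be non-negative.
        shifted = tile_idx_m + self.tile_idx_pivot_m
        return (shifted // self.tiles_per_chunk_m) % self.num_chunks

    def is_ready(self, chunk_signals: Sequence[int], tile_idx_m: int,
                 expected: int = 1) -> bool:
        # Assumption: a chunk may be consumed once its signal equals `expected`.
        return chunk_signals[self.chunk_idx(tile_idx_m)] == expected


sched = AsyncInputSchedule(tiles_per_chunk_m=2, tile_idx_pivot_m=3, num_chunks=4)
chunks = [sched.chunk_idx(t) for t in range(8)]
signals = [0, 1, 1, 0]  # simulated producer has published chunks 1 and 2
ready = [sched.is_ready(signals, t) for t in range(8)]
print(chunks)  # [1, 2, 2, 3, 3, 0, 0, 1]
print(ready)
```

With a pivot of 3, tile 0 starts in chunk 1 and the mapping wraps back to chunk 0 at tile 5, which is the modulo wraparound the formula describes.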
* Add support for fp16 + compute fp16 and bf16 + compute bf16 contractions
Enables hipTensor to access the WMMA HW functionalities
for these data type combinations on gfx11 and gfx12.
* Fix change to contraction scale tests
* Fix clang-format
This change improves the clang-format CI check to be faster and not
depend on git being available in the build environment.
Changes:
- Use `find` instead of `git ls-files` (no git dependency)
- Check all C++ files: *.h, *.hpp, *.cpp, *.h.in, *.hpp.in, *.cpp.in, *.cl
- Exclude build/ and include/rapidjson directories
- Use parallel processing with 8 cores (-P 8) for ~8x speedup
- Show only errors with unified diff format (-u)
- Clear error messages: "ERROR: <file> needs formatting"
- Preserve original logic: run clang-format only when RUN_CPPCHECK=false,
or run both clang-format and cppcheck when RUN_CPPCHECK=true
Performance:
- Sequential processing: ~93 seconds for 5,899 files
- Parallel with 8 cores: ~12 seconds for 5,899 files
- Per-file processing time: ~15ms
This reduces CI time while maintaining code formatting standards.
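The file selection and parallel check described above can be illustrated with a small Python sketch. It is not the CI script itself, which uses `find` with `-P 8`. The function names are hypothetical, excluding directories relative to the repository root is an assumption, and `clang-format --dry-run --Werror` is swapped in for the diff-based check.

```python
import subprocess
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

SUFFIXES = (".h", ".hpp", ".cpp", ".h.in", ".hpp.in", ".cpp.in", ".cl")
EXCLUDED_DIRS = ("build", "include/rapidjson")  # assumed relative to the root


def collect_sources(root):
    """Return sorted relative paths of C++ sources outside excluded dirs."""
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        rel = path.relative_to(root).as_posix()
        if not path.is_file() or not rel.endswith(SUFFIXES):
            continue
        if any(rel.startswith(d + "/") for d in EXCLUDED_DIRS):
            continue
        found.append(rel)
    return sorted(found)


def clang_format_ok(path):
    # Swapped-in check: exits non-zero if clang-format would change the file.
    result = subprocess.run(
        ["clang-format", "--dry-run", "--Werror", str(path)],
        capture_output=True,
    )
    return result.returncode == 0


def find_unformatted(root, checker=clang_format_ok, workers=8):
    """Check files in parallel and return one error message per bad file."""
    files = collect_sources(root)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda rel: checker(Path(root) / rel), files))
    return [f"ERROR: {rel} needs formatting"
            for rel, ok in zip(files, results) if not ok]


with tempfile.TemporaryDirectory() as tmp:
    for rel in ("src/a.cpp", "src/b.hpp.in", "build/c.cpp",
                "include/rapidjson/d.h", "docs/e.md"):
        target = Path(tmp) / rel
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text("int x;\n")
    collected = collect_sources(tmp)
    # Fake checker so the demo runs without clang-format installed.
    errors = find_unformatted(tmp, checker=lambda p: p.suffix != ".cpp")

print(collected)
print(errors)
```

The reported timings give 93 / 12 ≈ 7.75, which is consistent with the quoted ~8x speedup on 8 cores.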
* Rename member variable to better reflect its actual meaning.
* Add transfer checks for conv fwd xdl.
* Validate tensor layouts & vector size conv fwd v3.
* Add combined transfer concepts.
* Add transfer concepts for conv fwd factories.
* Fix clang format
* Add helper instruction to get max mem vector instruction width.
* Apply review comments.
* Rename thread cluster access(->arrange) order concept
* Fix merge artifacts.
* Add generic access order limits into block transfer concept.
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)
* wip: device struct for WMMA batched contraction multiple d based on new gridwise op
* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases
* fix: failure to resolve template parameters when calling new function overload
* fix: passing reference type as parameter instead of underlying types
* fix: merge error caused duplicate definitions
* fix: make sure constness of template and parameters types match
* fix: don't compile batched contraction test on unsupported architectures
* feat: add example for new wmma implementation, and consolidate example code between platforms
* style: return inline instead of with branch
* chore: add extra assert on vector memory access sizes
* chore: clean up some unused variables
* fix: correct tail number calculation, added small cases and extra instances to the test
* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
* Fix large case init bounds
* Revert "Fix large case init bounds"
This reverts commit 1abca05c6f.
* Restore CPU initialization for do_verification != 2
1. Added `-DCK_EXPERIMENTAL_BUILDER=OFF` to the `setup_args` to explicitly disable the experimental builder
2. Added a detailed comment explaining why this is necessary:
- SLES15 is a legacy platform with limited C++20 ecosystem support
- While the ROCm compiler supports C++20, the older system libraries and standard library implementation on SLES15 do not reliably support all C++20 features required by the experimental CK Builder
* Adding CK Tile documentation
* Updates based on feedback
* Fix tile window API description
* Fix remaining images
* add documentation about flush_cache and rotating_buffer functionality in ck_tile
* Supplement the documentation
* light edit of the ck tile conceptual doc
---------
Co-authored-by: Vidyasagar <vanantha@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Factor helpers out of conv_traits.hpp
* Create a non-templated conv_traits struct
* Migrate to new instance-specific instance_to_conv_traits functions
* Clean up reflection concepts
* Clean up ConvTraits helpers
* Update testing for convolution traits
This is a lot of cleanup on tests to have verbose coverage of feature
extraction, explicit tests for each supported device kernel, and
simple, readable test code.
* Address reviewer comments and resolve merge conflict
This is mostly adjustments to enum values so that the docs align correctly with the current code.
Also updated the calendar scope of the project to extend through March 2026.
The test_ck_tile_streamk_reduction test suite seems to have transient
failures; hence, we are disabling these tests for now. We will re-enable
them once the bug is resolved.
Adding owners permissions for two leading developers on the CK Builder subproject to help with reviews on that project, especially in the EU time zones.
Remove aska-0096, who has left AMD