* WIP: add splitk to bquant
* feat: add support for bf8i4 and fp8i4 by calculating correct stride for packed data types
* chore: remove temporary test script
* fix: incorrect tile window length for splitted bq tensor window
* chore: improve comments
* test: add unit tests to cover bquant splitk functionality
* fix: conflict resolution by renaming variables
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Adding remaining flavors for grouped conv fwd
As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu
* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.
* Do not build f8 / bf8 only flavor tests on RDNA3
* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.
* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.
* adding int8 and fp16 overloads to the elementwise operations
* fixed copilot nits
* Addressing review comments:
- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files
* clang-format
* reduced no of tests
---------
Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
* chore: split block scale example instances in more separate files to speed up compile times
* wip: fp4 scaffolding for abquant
* feat: add fp4 decoding-while-loading to abquant pipeline
* feat: add support for fp4 CPU verification in abquant
* chore: add time tracking to reference calculation
* feat: add a4w4 test for blockscale gemm
* feat: optimize reference calculation by preconverting values to AccType
* feat: add fp4 to fp8 look-up table
* fix: reference to wrong ComputeDataType field in QuantProblem
* feat: type utilities for determining MFMA compute types
* feat: packed fp4 for abquant weight preshuffle
* feat: add separate tests for a4w4 base case, padding and preshuffleB
* fix: fp4 conversion on gfx950 attempting to use non-supported method
* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size
* chore: add fp4 preshuffleb mode to block scale example
* chore: sanity check for packed types being 1 byte
* chore: clarify tensor dimension indices with constants
* chore: replace traits check with specialized check for packed types
* style: some minor refactoring and cleanup
* fix: correct conversion table for FNUZ fp8
* chore: add fp4 instances to main abquant instances again
* chore: use same initialization branch for int4 and fp4
* chore: add missing initialization for fp4 in block scale gemm example
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Add multi AB support to wave transfer
* Improviments to multi ABD examples
* Add instances and use intrawave v1 instead of interwave
* Apply changes to other transfers
* Wave transfer: add support for multiple internal vgpr buffers
* Fix compilation error gfx11
* initial commit
* preshuffleQuant support for ABQuant
* fix mxfp4 to use correct QuantGroupSize
* addressing review comments and seperated Preshufflequant for A and B
* updated grouped gemm example for updated traits definition
* fix for CI failure
* updated grouped_gemm_abquant test for updated traits definition
* updated grouped_gemm_abquant test for updated traits definition
* Add padding support with transpose
Also move check before writing storing is_src_valid during reading
* Add/modify instances to use wave transfer for gemm universal
Condition is changed so now the vectorsize of vmem reading and lds
writing must be equal to 8 in order to use the wave transfer
* Fix clang format
* Modify example
* Fix bwd data
* Add restriction for wave transfer with padding and transpose
Add test case which shows this limitation
* Fix validity checks 8 bit types
* Add validity check gemm_bias_add_reduce
* Add validity check grouped gemm tile loop
* Fix validity checks new flavours
* Minor fixes
* Fix clang format
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* test: add unit test for aquant memory pipeline
* WIP: host level interwave pipeline compiles
* WIP: interwave implementation computes correct GEMM result when no aquant
* WIP: quantization works for subset of problem shapes
* WIP: quantization works for subset of problem shapes
* WIP: interwave memory pipeline passes local test
* feat: Add interwave pipeline implementation for memory pipline in aquant
* fix: compilation error on gfx950
* chore: remove debug statements from the code
* test: resolve merge conflict
* test: remove non rcr unit tests from test suite
* add block scale parameters to kernel
* add block scale to kernel
* add smoke test
* format
* Revert "format"
This reverts commit 356c3c9706.
* only format my code
* format py
* fix auto not allowd in function prototype
* change instance tttt to ttff
* fix structured binding issue
* change s_acc elementwise op
* async pipeline add block scale
* add quantation P using shift exp2
* precompute (m - shift) once per row
* change blk scale seqstrt ptr name
* fix some name
* fix for deduction guide
* fix some comments
* add P scale to qr_ksvs_pipeline
* add comment to idx_identity
* change the method of calculating descale block index
* unify naming style: use block_scale_ as name prefix
* unify naming style
* update the CHANGELOG.md
* Add FP8 block scale quantization support for FMHA forward kernel
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* wip: test suite for batched gemm multiple d gemm multiple d, working on gridwise implenentation
* wip: many fixes in implementation of batched gemm gemm multiple d
* wip: batched gemm gemm multiple d gridwise op compiling, not working yet
* fix: incorrect d0 grid indexing in batched gemm gemm multipled
* feat: add instances for batched gemm add relu gemm add
* chore: configure instance with low vector transfer size for odd sizes
* chore: add some more validation to device batched gemm gemm multiple d, and removed template parameter that didn't really make sense
* fix: upate device_batched_gemm_gemm_wmma to work with new gridwise changes
* fix: disable odd size tests on XDL archs
* chore: removed temporary logging
* chore: update some references to C tensor to E tensor
* Tentative fix for example template params
* Tentative fix for non-multi-D batched gemm gemm device impl.
* Tentative fix for xdl example template params
* Tentative fix for profiler build on gfx90a
* chore: improve device batched gemm gemm multi D comment to include all ops and dimensions
* chore: explicitly call ck::make_tuple to prevent issues when std::make_tuple would apply
* fix: make the gemm1 data types match what happens in the device op
* feat: add d0s/d1s datatypes and layouts to the device op type string
* chore: change element-wise op so addition happens in fp32
* chore: add static asserts for gemm0/gemm1 calculated wave sizes
* chore: also updated other element-wise ops to use fp32 calculations
* chore: log number of supported instances
* chore: update instance comment
* chore: disable kernel timing in example by default
* fix: gemm1 wave size calculation
* fix: make sure batched gemm multiple d gemm multiple d profiler performs correct type conversions
* chore: remove increased tolerance in batched gemm gemm multiple d example
* chore: add comment explaining that verification fails for certain input values
* chore: clarify instance comment
---------
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>
Add signal-based synchronization for persistent GEMM kernels where
input data becomes available incrementally. Uses modulo wraparound
(like PyTorch's AsyncMM) for chunk index calculation:
chunk_idx = ((tile_idx + tile_idx_pivot) / tiles_per_chunk) % num_chunks
Key components:
- PersistentAsyncInputScheduler struct with tiles_per_chunk_m,
chunk_signals, tile_idx_pivot_m, and num_chunks fields
- wait_eq_wave method using __builtin_amdgcn_s_sleep for power efficiency
- IsSupportedArgument validation for scheduler parameters
- Example demonstrating async input scheduling with simulated producer
- GTest unit tests covering all layout combinations
* Add support to fp16 + compute fp16 and bf16 + compute bf16 contractions
Enables hipTensor to access the WMMA HW functionalities
for these combinations of datatype on gfx11 and gfx12.
* Fix change to contraction scale tests
* Fix clang-format
* feat: test setup for batched contraction (aka batched gemm multiple d e permute)
* wip: device struct for WMMA batched contraction multiple d based on new gridwise op
* feat: working batched contraction on RDNA, non-naive tensor descriptors for gridwise_gemm_wmma_cshuffle_v3, test setup for odd cases
* fix: failure to resolve template parameters when calling new function overload
* fix: passing reference type as parameter instead of underlying types
* fix: merge error caused duplicate definitions
* fix: make sure constness of template and parameters types match
* fix: don't compile batched contraction test on unsupported architectures
* feat: add example for new wmma implementation, and consolidate example code between platforms
* style: return inline instead of with branch
* chore: add extra assert on vector memory access sizes
* chore: clean up some unused variables
* fix: correct tail number calculation, added small cases and extra instances to the test
* fix: properly support wave transfer by generating correct grid descriptors dependent on the transfer method
* formatted
* formatted
* formatting
* formatting
* formatting
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Split cpp file to reduce building time
- Support multiple GemmConfig
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update Readme
* enable prefill shapes
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Add support for rowcol and tensor GEMM operations
* [CK TILE GEMM] Refactor block_scale_gemm examples
- Update README
* adding preshuffle quant as new parameter and its associated new files
* remove debugging statements
* adding test
* enable preshuffle quant with permuteN
* updating readme and correcponding gemmconfigs
* updating cmake file
* fixing CI failures for grouped quant gemm
* debugging permuteN
* debugging
* debugging PermuteN
* initial commit
* resolving merge conflicts
* adding test cases
* initial commit with prints
* debugging
* fine-grained working
* debugging medium grained
* fixing the tile window
* formatting
* enabling prefill shapes
* working prefill shapes
* formatted
* clean up
* code cleanup
* bug fix after merging with develop
* G128 working for both prefill and decode shapes for preshufflequant
* clean up after merging with develop
* fixing group 64 for decode shapes
* non preshufflequant working for group size 128
* enable preshuffleb and preshufflequant with variour group sizes
* reduce build time by splitting example into diff datatype files
* Adding tests for preshuffleQuant
* address review comment
* fix for gfx1201
* compile time fix for gfx1201
* clang formatted
---------
Co-authored-by: Cong Ma <congma13@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Agarwal <khuagarw@ctr2-alola-login-03.amd.com>
* CMakeLists.txt hack for Windows.
* Add Windows build instructions.
* Fix type issue with variadic min function.
* Use std::common_type to fix the variadic min/max functions.
* Enable CPU guard compilation on Windows.
* Suppress warnings related to std::getenv on Windows platform.
* Git ignore the output directory on Windows platform.
* Powershell script for running tests and generating reports.
* Improve test logging.
* Disable non-conv tests.
* Fix Debug build on Windows.
* More debug build changes.
* Update Windows build instructions.
* Enable all tests.
* Test fixes.
* Suppress not found linker options warning.
* Update unsigned long literals and format specifiers to work correctly in Windows
* Fix conv 3D bwd weight bilinear tests on Windows.
* Revert changes on .gitignore.
* Clean-up CMake project file for Windows builds.
* clang-format
* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.
* Fix building examples on Windows.
* Update Readme.
* Remove the suppression of the deprecated warnings.
* Remove Windows specific min/max implementations from CK Tile math core.
* Remove unnecessary no-op on Windows.
---------
Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
* solve compiler issue
* solve the gfx950 mfma shuffle regression
* refactor jenkinsfile to handle arch name better
* [CK TILE] set divisor to count of thread along k dimension
* fix the compiler error
* solve degradation
* Finish the multiplies fix
* fix the scales
* solve compilation error
* solve the composes
* solve the error of tile sweeper
* fix the test and example
* fix for gfx950
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
* feat: grouped gemm tile loop support for RDNA4
* fix: removed extra parameter from grouped gemm example instance
* fix: FP8 check incorrectly enabling FP8 on RDNA3
- Enable page_size=1 support in batch prefill codegen (linear layout only).
- Implement per-token page lookup in `kv_offset_array_transform` for page_size=1 to handle 3D input tensors correctly.
- Relax `kPageBlockSize` alignment assertion for the page_size=1 case.
* nt on fp8 blockscale
* some improve and tests needs to be fixed
* update
* fix format
* revert useless change
* revert any change in amd_buffer_coherence
* WIP
* Add Unit tests for the Multi Reduction Kernel
* clang format
* Rename multiblock to threadwise
* Multiblock WIP
* Fix multi reduce multi block unit tests
* Multi Reduce Tile Engine: WIP
* refactoring + try addressing precision error
* Fix multiops examples
* Cleanup
* Clean up tile engine's reduce op
* Update changelog
* Fix remod/clang
* Fix dates
* Fix documentation & missing file
* Fix comments
* Use the update_tile api in the multi-block kernel
* Unify threadwise/multiblock into a single kernel + default multiblock output to float in tests
* Add TileParitioner
* Cleanup
* Add warning when no data to process, in the example
* Refactoring Reduce kernel Tile Partioner + cleanup
* Move the tile partioner to its own file
* Add missing includes
* Fix copyright header with update_amd_copyright_headers.py
* Fix change of interface in Reduce2dProblem
---------
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* update grouped_gemm blockwise kernel
* update config
* update kernel
* update examples
* remove test code for now
* sync test files with origin/develop
* update example
* fix code lint
* fix code-lint
* update test code
* run clang format
* run pre-commit
* update api
* add page_block_size parameter
* add is_sglang_layout to parameters
* add kv_offset_array_transform to batch async for page size 16
* add kv_last_page_lens to kernel
* change kv layout to [num_total_pages, page_block_size, hdim]
* format
* - enable codegen of batch_prefill kernels
- create new problem struct BlockFmhaBatchPrefillPipelineProblem for
batch prefill kernels
- generate different page sizes of batch prefill kernels (1, 16)
* 1. fix wrong calculation of page id in kv_offset_array_transform in gfx950
2. support page size 1024
* fix python format
* change kv cache layout to [num_blocks, num_kv_heads, head_size/x,
block_size, x] and [num_blocks, num_kv_heads, block_size/X, head_size, X]
* 1. Introduced `kVectorSize` in BlockFmhaBatchPrefillPipelineProblem instead of using hardcode values
2. Makes batch prefill kernel traits structures inherent from fmha fwd
traits
3. Add some static check for Page size, vector size, hdim, ..., etc.
* [Refactor] Replace is_sglang_layout with Enums for KV cache configuration
Refactored `fmha_batch_prefill` to use `BlockAttentionKVCacheMemoryLayoutEnum` (VECTORIZED/LINEAR) and `BlockAttentionKVCacheLookupTableEnum` (SGLANG_1D/VLLM_2D) instead of a single
boolean.
**Changes:**
* Added Enum definitions in `block_attention_kvcache_layout_enum.hpp`.
* Updated Kernel, Pipeline, and Traits to template on these Enums.
* Implemented `kv_offset_array_transform` logic based on `kKVMemoryLayout`.
* Refactored `PageBlockTableKargs` to adapt to `kKVLookupTable`.
* Updated CodeGen scripts to support new parameters.
This decouples memory layout from the paging mechanism, enabling flexible KV cache configurations.
* 1. remove batch prefill pipeline with sk_pad=false
2. correct some comments
3. add static assert to make sure v offsets is in same page within a tile.
* fix vgpr spill count
* remove unnecessary t2s functions
* add fp8 support for receipt 200 and 600 in fmha_bath_prefill.py
* support linear kv cache layout
* Remove block_table_ptr from fwd_batch_prefill_args. Instead, reuse
kv_page_indices as a pointer of the lookup table.
* 1. merge multiple transforms into single transform.
2. add static check to make sure vlayout is row-major.
* move FmhaFwdCommonKargs::seqlen_k_ptr to VllmPageTableKargs.
* update changelog
---------
Co-authored-by: ltqin <letaoqin@amd.com>
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
* initial poc
* factor out common parts in operator()
* cv4
* rest of the universal gemm pipelines
* fix test
* remove boilerplate from tile engine
* fix example
* fix example
* format
* fix tests build for gemm
* remove base pipeline codegen from gemm instance builder
* unify v3 logic with the rest of universal gemm pipelines
* fix build for multi abd test
* fix test gemm multi d
* fix build for weight preshuffle
* fix grouped gemm test
* fix grouped gemm multi d test
* fix grouped gemm preshuffle
* fix grouped gemm example except for quant
* fix gemm preshuffle
* fix splitk 2 stage example
* fix batched gemm example
* fix multid example
* fix multiabd example
* fix batched gemm test
* fixup
* fix examples build
* fix grouped gemm test build
* fix smoke builder
* hacky poc
* fix tile engine
* kill the lambda
* maybe fix test build
* more fixes
* clang-format
* save temp
* clang-format
* mostly fix examples
* clang-format
* remove dead code
* more cleanup
* fix fmha bwd build (default epilogue set/add appears to be broken)
* fix default epilogue tests but not correctness
* clang-format
* fix bquant
* clang-format
* cleanup dead code
* rearrange make windows for readability
* restore changes to IsSupportedArgument
* fix smoke-builder
* clang-format
* fixup rename class
* build fixes
* clang-format
* fix builder
* fixup
* remove set from builder tests
* fix test
* clang-format
* re-refactor the kernels
* clang-format
* fix header license
* remove memory operation from conv bwd test
* clang-format
* clang-format example,include
* clang-format test
* build fixes
* clang-format
* solve compilation error
* fix the CI
* solve compilation error
* clang format
* solve merge conflict
* solve merge conflict
* solve the gfx11 error
* solve test error
* moar build fixes
* remove AtomicAddRequiresKBatchGreaterThanOne test since the property is removed from the kernel scope
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Previously, the code used unsigned long for literals and format specifiers to represent 64-bit unsigned values. While this worked on Linux, it caused compatibility issues on Windows.
The C++ standard does not guarantee that long is 64 bits. On LP64 systems (e.g., Linux), long maps to 64-bit values, but on LLP64 systems (e.g., Windows), long maps to 32-bit values. This discrepancy led to incorrect behavior when assuming unsigned long was always 64-bit.
This commit updates all relevant literals and format specifiers to explicitly use 64-bit unsigned types, ensuring consistent behavior across platforms.
* Added device level implementation for bwd_data_wmma_v3.
* Added first instance of bwd_data_wmma_v3(f16).
* Add support for bwd data in gridwise implementation
Some changes are general for convolution and some are specific for bwd
data. We need to generalize them once we have fwd, bwd data and bwd
weight
* Initial device implementation of bwd data
* Remove unused template parameters in device impl
* Add one instance for different layout
initial check of device implementation
* Add tests for splitk and for different layouts
* Appended more instances to wmma_v3_f16.
* Added conv_2d bf16 wmma_v3 instances.
* Added conv_3d_bf16 wmma_v3_instances.
* Added conv_3d_f16_wmma_v3_instances.
* Added SplitN test cases for wmma.
* Conv3d_bwd_data_scale_wmma_v3 instances.
* Conv3d_bwd_data_bilinear_wmma_v3_instances
* Renaming the device level instances file to common name , since it is defined for different DataTypes.
* Renaming the instances and fixing typo
* Added the test cases to regression test list
* NCHW support for wmma_v3
* Examples for bf16 and f16 bwd_data_wmma_v3
* Added transpose conditons for device impl
* fixing bugs
* Added the gemm_args array implmentation
* WIP debug conv bwd
* fix splitk
* Grouped gemm fix
* Update CmakeLists with EOF
* Added more instances for tests
* Fixed the run time error in examples and removed 3d conv examples.
* Fixed a typo.
* Updated CmakeLists to removed the 3d convultion deleted files
* Added print error statements for unsupoorted argument
* Added the merge conflict related changes
* Fixed compilation error
* Fixed the InstanceFactory duplication error.
* Removed the print statements and added logs to Arg function
* All the merge conflict related errors resolved
* Added d_tensor tests.
* Added the missing example types of wmm_v3
* Merge error fix
* Corrected the instance name
* Reverted the bias relu change
* Revereted the transpose load local change
* Updated the regression test list with bwd_data_scale
* Revert "Revereted the transpose load local change"
This reverts commit 0b7281edb2bf008e407006690a00621174d9d19b.
* Revert "Merge error fix"
This reverts commit f3c85daa474b1b83d10c8a3ce077354e71d91a2b.
* Reverting the local change
* Added merge error fix
* Build error fix due to merge conflicts
* Added bias_relu example for wmma_v3
* Modified the main method in dtensor tests
* Updated the dtensor tests to pick all the shapes
* Updated the dtensor test shapes.
* Updated the mem operations in tests.
* Added reference func
* Fixed typos in device impl
* Added new header file and modified the include file for 3d tests
* Renamed the test file and added reference func call.
* clang format fix
* Added ignore params
* Modified device impl and tests
* Removed debug print statements and updated dtensor test shapes
* Fixing merge conflicts
* Fixing more merge conflicts
* Fixed copyrights
* Updated the tuned instances to bilinear and scale.
* Adding tuned instances to vanilla wmma_v3
* Removed all unused instances and modified test layouts.
* Cleaned up all instances , reverted back fwd fp16 instances and updated tuned fp16 instances.
* Fix clang format
* Updated tuned f16/-genric instances
* Formatting the instances file
* Fixed copyrights and clang issues
* Nonsense commit to force git to force
* Removed the transpose instances
* Added verified genric instances
* Fixing namespace errors
* Added todo for failing shapes
* Formatting instance file
* Fix instance list formatting
* Removing unnecessary formats
* Renamed the common file
* Unification of xdl and wmma bwd_data tests
* Updated Cmake
* Added all layout types and deleted code.
* Updated Cmake to add the condition to all tests.
---------
Co-authored-by: Enrico Degregori <enrico@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: kiefer <kiefer.van.teutem@streamhpc.com>