[CK_TILE] MX GEMM non-preshuffled RCR layout
## Motivation
Implements a GEMM with MX scaling for fp4 and fp8 in non-preshuffled
layouts using async pipeline.
## Technical Details
<!-- Explain the changes along with any relevant GitHub links. -->
## Test Plan
<!-- Explain any relevant testing done to verify this PR. -->
## Test Result
<!-- Briefly summarize test outcomes. -->
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
* updates to support int8 in 03_gemm example
* added comments, using aliases, helper functions
* test(gemm_universal): add test cases for int8 gemm pipeline
* fix(test_gemm): fix for failing test unit test for int8
* test(ck_tile): add int8 unit test for gemm universal
* refactor(gemm_universal): GPU reference verification for GEMM code improved
* style(gemm_universal): removed extra comments and did clang format
* merging recent changes to universal gemm to tile_engine
* ck tile engine integration work
* feat(tile_engine): add int8 support to tile engine ops/gemm
* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8
* style: Format code with clang-format-12
* refactor(tile_engine): address review comments
* style: removed unhelpful comments & unused variables.
* build: tile engine uses default config
* feat: add int8 support for CK_TILE GEMM
* style: added trailing commas to codegen_utils.py
* refactor: tile engine
* refactor: formatting and code review
* refactor: code formatting for python files
* fix: suppress build warning
* add support for gfx950
* refactor:KWarpTile size in gemms util
* Fix the branch and wrap up the k warp tile
* Add bf8 integration
* refactor: clang format and rebase
---------
Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
* Avoid passing indices (std::vector) by value to host tensor's operator()
Each access requires 2 allocations and copies of the vector.
* Remove 1 unneeded vector copy from the slowest part of fmha_bwd's verification
* Compute ds_hp_host_ref in parallel
This sequntial ForEach is the slowest part of validation and it benefits
from parallel computation.
* Do not use ForEach for simple copy and conversion of large tensors
These tensors all have the same shape {nhead, real_seqlen_q, real_seqlen_k} and
can be copied/converted without complex computations of linear indices.
* not using structures under ck_tile/ops for ck_tile/host
* update as constexpr function
* Rename fn
* Update other examples.
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Use dictionary to config all the functions
* Add init codegen logic for fmha fwd appendkv
* Call HIP_CHECK_ERROR() macro to get real source info
* Setup meaningfull arguments
* Sync kernel name with the codegen
* Add knew/vnew tensors to the kernel argument
* Fix wrong K values after appending
* Fix vnew append errro
* Extract common logics
* Fix Vnew tile dstr for row major case
* Conditionally add fwd_splitkv API in fmha_fwd example
* Conditionally add call to fmha_fwd_splitkv()
* Remove "EXAMPLE_" prefix of cmake variables
* Regsiter API handlers automatically
* Early return if 0 < s_k_new is not supported
* Show message if we are ignoring option
* Unify CMakeLists.txt coding style
* Set num_splits=1 if split-kv is not supported
* Add length/stride getters for HostTensor
* Add RoPE example utilities
* Add reference_rotary_position_embedding() (not implemented)
* Finish reference_rotary_position_embedding() impl
* Fix typo of HostTensor<>::get_length()
* Fix compilation errors
* Fix wrong answer when interleaved=false
* Fix wrong answer when interleaved=true
* Append K/V in the host verification code
* Simplify K appending logics
* Simplify v_host_ref definition
* Reduce input/output dimensions
* Rename function: add "batched" prefix
* Apply RoPE on host side
* Rename RoPE utility function
* Fix wrong tensor size
* Avoid invoking deprecated method 'find_module'
* Pass RoPE kernel args
* Create Rotary Cos/Sin tile windows in kernel
* Add compute data type alias for RoPE
* Randomly generate seqlen_knew if needed
* Fix seqlen_knew enabling check logic
* Add minimum seqlen_k to generate compliance kvcache
* Fix compilation error in debug mode
* Fix wrong boundaries
* Fix wrong seqlen_k for kvcache
* Rename variables used in distributio encoding
* Fix rotary cos/sin tensor/tile size
* Add constraint to the rotary_dim option
* Remove unused inner namespace
* Add dram distribution for rotary_cos/rotary_sin (interleaved)
* Only apply interleaved RoPE on Knew for now
* Fix wrong thread starting offset
* Instantiate multiple kernels for RoPE approaches
* Clean-up pipeline
* Fix error in RoPE host reference
* Handle RoPE half-rotated logics
* Support 8x rotary_dim under half-rotated RoPE
* Add comment
* Apply elementwise function to the loaded tiles
* Unify parameter/variable naming style
* Remove constness from q_ptr
* Add code blocks for q_tile
* Apply RoPE to q_tile
* Remove debug print code in kernel
* Fix wrong knew/vnew appending positions
* Use better naming for tile indices
* Add make_tile_window() for adding distribution only
* Skip code if # of block is more than needed
* Move thread locating logics into policy
* Remove always true static_assert()
* Rename header
* Rename RotaryEmbeddingEnum
* Extract rotary embedding logic out
* Re-order parameters
* Align naming of some tile size constants
* Rename more tile size constants
* Fix wrong grid size
* Fix wrong shape of knew_host/vnew_host
* Fix wrong index into knew_host/vnew_host
* Fix wrong rotary_cos/rotary_sin memory size for Q
* Extract Q/Knew vector size to helper methods
* Use different rotary_cos/rotary_sin distr for Q/Knew
* Update host/device specifiers
* Fix wrong data type for Q rotary_cos/rotary_sin
* Remove RoPEComputeDataType type alias
* Shift rotary_cos/rotary_sin by cache_seqlen_k
* Add comment for why I just 't' for all padding flags
* Align commit message to the real comment
* Fix wrong pipeline
* Rename utility function
* Disable host verification if API not exist
* Fix wrong rope key for fp8 pipeline
* Allow only apply RoPE on Q (without append KV)
* Add append-kv smoke tests
* Remove debug statements
* Remove more debug statements
* Re-arrange the 'set +x' command
* Remove no-longer used method in pipeline
* Add missing init code
* Refine pipeline padding settings
* Enlarge rotary_dim limit (8 -> 16)
* Enlarge KPerThread for rotary_interleaved=false
* Update rotary_dim range in smoke_test_fwd.sh
* Add template argument 'kIsPagedKV' for splitkv kernels
* Launch splitkv kernel if given page_block_size
* Fix wrong kernel name
* Fix seqlen_k_min for pre-fill case (1 -> 0)
* Add copy_const<> type trait
* Add another make_tile_window()
* Introduce 'TileWindowNavigator' types
* Simplify TileWindowNavigator interfaces
* Fix tile window navigation bugs
* Disable calling fmha_fwd()
* Remove ununnecessary data members
* Simplify more make_tile_window() overloads
* Move V tile through TileWindowNavigator
* Fix uneven split checking logic
* Move code after decide seqlen_q/seqlen_k
* Make sure we always start reading complete tile
* Use 128 as minimus page_block_size
* Fix wrong origin for bias
* Add batch_stride_k/batch_stride_v in group mode
* Unify origin
* Add missing kernel arguments for group mode
* Add paged-kv codegen logic for appendkv kernels
* Add block_table kernel args for appendkv kernel
* Add tile navigators to the appendkv kernel
* Fix wrong tensor descriptor lengths
* Pass re-created tile window to pipeline
* Fix wrong strides for appendkv kernel
* Allow transit tile_window to another page-block
* Handle cross-page-block write
* Donot perform write again if already in last page-block
* Always add fmha_fwd() api
* Add missing group mode argument
* Remove debug macro usages
* Rename option s_k_new to s_knew
* Separate splitkv/non-splitkv args/traits
* Remove fmha_fwd_dispatch()
* Fix compilation errors
* Remove dropout code in splitkv kernel
* Allow problem types without define kHasDropout attr
* Use generic lambda to init traits objects
* Separate more non-splitkv & splitkv traits/args
* Display more info for specific kernels
* Show more detailed warning message
* Rename 'max_num_blocks' to 'max_num_page_blocks'
* Remove no-longer used pipeline files
* Wrap code by #if directives
* Move functors to the begining of validation code
* Use generic lambda to init all the api traits/args
* Fix wrong seqlen for kvcache
* Add missing comment
* Rename TileWindowNavigator to PageBlockNavigator
* Only expose necessary methods (not attributes)
* Re-order pipeline paremeters
* Refine smoke_test_fwd.sh
* Fix wrong arugment count
* Make tile window directly via PageBlockNavigator
* Remove unused template paremeter
* Remove group mode from appendkv kernel
* Fix skcheck logic
* Fix wrong syntax in skcheck expr
* Use meaningful options in smoke test
* Remove options
* Fix formatting
* Fix more format
* Re-organize bash functions
* Pass cache_batch_idx to kernels
* Support cache_batch_idx in example
* Fix compilation error
* Add more appendkv test
* Add more case for appendkv
* Fix unexisted attribute
* Remove 0 < seqlen_knew constraint
* Clarify the case in warning message
* Remove macro checking
* Force batch mode when invoking appendkv & splitkv apis
* Fix mode overriding logics
* Fix wrong parameter name
* Randomize seqlen_k if use kvcache
* Use randomized seqlen_k for kvcache
* Avoid using too small rotary_cos & rotary_sin
* Rename parameter
* Add seqlen_q & seqlen_k rules
* Add comment
* Add more comments
* Fix compilation errors
* Fix typo in comment
* Remove type argument
* Avoid seqlen_k=0 for kvcache
* Revert "Avoid seqlen_k=0 for kvcache"
This reverts commit 21c4df89e4.
* Fix wrong uneven split checking logics
* Only randomize kvcache seqlen_k if 1 < batch
* Return earlier if split is empty
* Revert "Only randomize kvcache seqlen_k if 1 < batch"
This reverts commit b9a4ab0d7e.
* Re-order seqlen_k_start adjustment logics
* Fix compilation errors
* Re-format script
* Find executable from folder automatically
* Fix kvcache seqlen_k generating logic
* Make comment more clear
* Fix wrong knew/vew appending logic on host
* Add s_barrier to sync threads
* Revert "Add s_barrier to sync threads"
This reverts commit d3f550f30c.
* Support only using 1 row of rotary_cos/rotary_sin
* Rotate Q in different way
* Unify tensor view creation logics
* Fix wrong argument
* Add mask to switch how we use the rotary_cos/sin
* Move attr from traits to problem
* Move has_mask to fmha_fwd_appendkv_args
* Support use uint32_t as SAD operand in Alibi<>
* Use sad_u32() in splitkv kernels
* Store tensor views in PageBlockNavigator
* Use stored tensor view to update tile windows
* Enlarge tensor view size
* Remove debug code
* Fix wrong tensor view size
* Wrap tensor view into PageBlockNavigator
* Add DataType member to PageBlockNavigator
* Remove unnecessary member functions
* Refind macro use
* Fix typo
* Add blank line between directives and actual code
* Re-format files
* Remove type in comment
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* enable gfx940
* switch between intrinsic mfma routines on mi100/200 and mi300
* fix mfma_int8 on MI300
* disable 2 int8 examples on MI300
* Update cmake-ck-dev.sh
* restore gitignore file
* modify Jenkinsfile to the internal repo
* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* initial enablement of gfx950
* fix clang format
* disable examples 31 and 41 int8 on gfx950
* add code
* fix build wip
* fix xx
* now can build
* naming
* minor fix
* wip fix
* fix macro for exp2; fix warpgemm a/b in transposedC
* unify as tuple_array
* Update the required Python version to 3.9
* Update executable name in test scripts
* re-structure tuple/array to avoid spill
* Merge function templates
* Fix format
* Add constraint to array<> ctor
* Re-use function
* Some minor changes
* remove wrong code in store_raw()
* fix compile issue in transpose
* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'
* let more integral_constant->constant, and formating
* make sure thread_buffer can be tuple/array
* temp fix buffer_store spill
* not using custom data type by default, now we can have ISA-level same code as opt_padding
* fix compile error, fp8 not ready now
* fix fp8 duplicated move/shift/and/or problem
* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
* fix scratch in fp8 kernel
* update some readme
* fix merge from upstream
* sync with upstream
* sync upstream again
* sync 22
* remove unused
* fix clang-format
* update README of ck_tile example
* fix several issue
* let python version to be 3.8 as minimal
* remove ck_tile example from default cmake target like all/install/check
* remove mistake
* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
* fix some bug in group-mode masking and codegen. update README
* F8 quantization for FMHA forward (#1224)
* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
* Add element function to fmha api
* Adjust P elementwise function
* Fix bug of elementwise op, our elementwise op is not inout
* Add some elementwise op, prepare to quantization
* Let generate.py can generate different elementwise function
* To prevent compiler issue, remove the elementwise function we have not used.
* Remove f8 pipeline, we should share the same pipeline even in f8
* Remove remove_cvref_t
* Avoid warning
* Fix wrong fp8 QK/KV block gemm setting
* Check fp8 rounding error in check_err()
* Set fp8 rounding error for check_err()
* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
* 1. codgen the f8 api and kernel
2. f8 host code
* prevent warning in filter mode
* Remove not-in-use elementwise function kargs
* Remove more not-in-use elementwise function kargs
* Small refinements in C++ source files
* Use conditional_t<> to simplify code
* Support heterogeneous argument for binary function types
* Re-use already-existing scales<> functor template
* Fix wrong value produced by saturating
* Generalize the composes<> template
* Unify saturates<> implementation
* Fix type errors in composes<>
* Extend less_equal<>
* Reuse the existing template less_equal<> in check_err()
* Add equal<float> & equal<double>
* Rename check_err() parameter
* Rename check_err() parameter
* Add FIXME comment for adding new macro in future
* Remove unnecessary cast to void
* Eliminate duplicated code
* Avoid dividing api pool into more than 2 groups
* Use more clear variable names
* Use affirmative condition in if stmt
* Remove blank lines
* Donot perfect forwarding in composes<>
* To fix compile error, revert generate.py back to 4439cc107d
* Fix bug of p element function
* Add compute element op to host softmax
* Remove element function in api interface
* Extract user parameter
* Rename pscale and oscale variable
* rename f8 to fp8
* rename more f8 to fp8
* Add pipeline::operator() without element_functor
* 1. Remove deprecated pipeline enum
2. Refine host code parameter
* Use quantization range as input
* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description
* Refine description
* prevent early return
* unify _squant kernel name in cpp, update README
* Adjust the default range.
* Refine error message and bias range
* Add fp8 benchmark and smoke test
* fix fp8 swizzle_factor=4 case
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>