* Add shortcut to RMSNorm
* Modify test for adding shortcut for RMSNorm
* Add fused parameter into tests
* 1. Add YDataType. 2. rmsnorm2d_fwd_traits_ from rmsnorm2d_fwd.hpp to rmsnorm2d_fwd_api.cpp and rmsnorm2d_fwd_instance_common.hpp
* 1. Supports various stride and percisions.
* Add support of Epilogue
* Add fuse and epilogue support to rmsnorm ref
* Modify rmsnorm example
* Refactor tests/examples
* Bug fix for newly added tests/examples
* Bug fix for new tests 2
* Modify smoke test scripts
remove dbg code
* Supports non-smooth dyanmic quant
* Update Rmsnorm2dFwd::GetName()
* rename xscale and prec_sx to smoothscale and prec_sm
Bug fix after rename
Remove files
* change example_rmsnorm2d_fwd.cpp
* update performance calculator
* Fix issue in two-pass when fuse add is enabled
* Remove comment of beta
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
* CK-Tile GEMM with memory bound pipeline.
* Memory bound gemm pipeline.
* Fix not closed namespace.
* Block gemm mem pipeline draft.
* Do not use ck_tile:: within ck_tile namespace.
* Refactoring & Move Layout info to pipeline problem.
* Get hot loop and TailNum information before lunching kernel.
* Fixes in pipeline.
* Add comment to load_tile_raw and change variable naming style.
* Few small changes & formatting.
* Do not use macro.
* Add gtests.
* Use AccDataType for Output of MFMA instruction.
* Formatting.
* Refactor gemm examples.
* Switch over to current block gemm.
* Use currently available pipeline policy.
* Refactoring and review comment.s
* Fixes after merge.
* Add missing include.
* Add load tile overload which accepts output tensor as parameter.
* This give 8% perf boost at the cost of using more registers.
* Rename example.
* Small changes.
* Fix compilation err and lower K.
* Support different layouts for A/B
* Fix vector size for different layouts.
* Rename Alignment into VectorSize
* Unblock tests.
* Add reduce2d new api
* Prevent user use cross warp reduction
* Fix bug of std caculation
* Add rmsnorm2d
* Add rmsnorm small example
* Remove static assert to prevent compile fail
* Add script to test performance and correctness
* Add missing cmake change
* refine naming
* refine example of rmsnorm
* Fix bug of rmsnorm
* Refine naming
* Fix cmake
* clang format
* Refine pipeline name
* Add add_rmsnorm2d_rdquant kernel
* Add reduce op
* host verification
* Fix bug of one pass pipeline
* Refine tile size
* Add two pass pipeline
* Rename two pass to three pass
* Fix bug of kSaveX == false
* Add instance library
* Add test script
* Fix bug of x verification
* Add save_x to trait
* Add README
* Move reduce2d into reduce folder
* Fix bug of welford when number of m warp > 1
* remove reduncant comment
* 1. move 06_rmsnorm2d to 10_rmsnorm2d
2. move 07_add_rmsnorm2d_rdquant to 11_add_rmsnorm2d_rdquant
* clang format and add missing header
* Add host validation of add + layernorm2d + rsquant
* Revert "Add host validation of add + layernorm2d + rsquant"
This reverts commit 936cb45797.
* Remove deprecated flag
* port layernorm
* change warp_welford.hpp
* Update warpshuffle
* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()
* refine welford max count calculation
* unify layernorm api
* Rename file
* Remove save mean and inv std
* Revert "refine welford max count calculation"
This reverts commit 022365802b.
* Fix order of parameter
* refine welford max count calculation again
* Remove fp32 instances
* Fix bug of padding
* refactor api
* Support bf16
* Extract common function
* Refine arg of operator()
* Add kMThreadPerBlock to template parameter
* clang format
* Refine variable name
* Refine file name
* remove redundant line
* refactor layernorm2d pipeline and add block-per-block utility
* fix name
* rename more
* add more block-per-tile instance
* remove duplicated define
* update instance for 2048, 1024 case
* support up to 2048 now
* opt loading
* add n1536
* Add two pass pipeline
* format
* Fix incorrect type
* parallel compilation
* Use smaller N
* fix 2p pass
* Support Repeat_M in distribution
* Refine nameing
* Add reduce example
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
* ake the cshuffle compilable
* modify Mhe reference on gpu and cpu. Correaccess of cshuffle
* fix the cpu reference code
* Complete the in tile shuffle logic
* restructure the kernel template input
* change the naming pattern of ck_tile gemm pipeline
* Re-format files using remod.py
* Solve the fmha conflict with gemm
* Comment Addressed from Carlus
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Finished the feature of gpu verification
* Add the ck_tile_gemm test in the CI CD
* add the include of tensor_layou in reference_gemm
* Comment Addressed
* split ck_tile fhma and gemm tests into separate stages
* restructure the reference gemm
* restructure a new reference_gemm api that could read the device mem
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
* Fix: Clang Format, API fixed from fmha
* fix with better naming convention
* revert back the pipeline code of fmha
* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.
* clang format with the reference_gemm file
* convert the clang format with the remod.py
* Changed the format and variable name of the kernel gemm_shape and partitioner
---------
Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
* Use dictionary to config all the functions
* Add init codegen logic for fmha fwd appendkv
* Call HIP_CHECK_ERROR() macro to get real source info
* Setup meaningfull arguments
* Sync kernel name with the codegen
* Add knew/vnew tensors to the kernel argument
* Fix wrong K values after appending
* Fix vnew append errro
* Extract common logics
* Fix Vnew tile dstr for row major case
* Conditionally add fwd_splitkv API in fmha_fwd example
* Conditionally add call to fmha_fwd_splitkv()
* Remove "EXAMPLE_" prefix of cmake variables
* Regsiter API handlers automatically
* Early return if 0 < s_k_new is not supported
* Show message if we are ignoring option
* Unify CMakeLists.txt coding style
* Set num_splits=1 if split-kv is not supported
* Add length/stride getters for HostTensor
* Add RoPE example utilities
* Add reference_rotary_position_embedding() (not implemented)
* Finish reference_rotary_position_embedding() impl
* Fix typo of HostTensor<>::get_length()
* Fix compilation errors
* Fix wrong answer when interleaved=false
* Fix wrong answer when interleaved=true
* Append K/V in the host verification code
* Simplify K appending logics
* Simplify v_host_ref definition
* Reduce input/output dimensions
* Rename function: add "batched" prefix
* Apply RoPE on host side
* Rename RoPE utility function
* Fix wrong tensor size
* Avoid invoking deprecated method 'find_module'
* Pass RoPE kernel args
* Create Rotary Cos/Sin tile windows in kernel
* Add compute data type alias for RoPE
* Randomly generate seqlen_knew if needed
* Fix seqlen_knew enabling check logic
* Add minimum seqlen_k to generate compliance kvcache
* Fix compilation error in debug mode
* Fix wrong boundaries
* Fix wrong seqlen_k for kvcache
* Rename variables used in distributio encoding
* Fix rotary cos/sin tensor/tile size
* Add constraint to the rotary_dim option
* Remove unused inner namespace
* Add dram distribution for rotary_cos/rotary_sin (interleaved)
* Only apply interleaved RoPE on Knew for now
* Fix wrong thread starting offset
* Instantiate multiple kernels for RoPE approaches
* Clean-up pipeline
* Fix error in RoPE host reference
* Handle RoPE half-rotated logics
* Support 8x rotary_dim under half-rotated RoPE
* Add comment
* Apply elementwise function to the loaded tiles
* Unify parameter/variable naming style
* Remove constness from q_ptr
* Add code blocks for q_tile
* Apply RoPE to q_tile
* Remove debug print code in kernel
* Fix wrong knew/vnew appending positions
* Use better naming for tile indices
* Add make_tile_window() for adding distribution only
* Skip code if # of block is more than needed
* Move thread locating logics into policy
* Remove always true static_assert()
* Rename header
* Rename RotaryEmbeddingEnum
* Extract rotary embedding logic out
* Re-order parameters
* Align naming of some tile size constants
* Rename more tile size constants
* Fix wrong grid size
* Fix wrong shape of knew_host/vnew_host
* Fix wrong index into knew_host/vnew_host
* Fix wrong rotary_cos/rotary_sin memory size for Q
* Extract Q/Knew vector size to helper methods
* Use different rotary_cos/rotary_sin distr for Q/Knew
* Update host/device specifiers
* Fix wrong data type for Q rotary_cos/rotary_sin
* Remove RoPEComputeDataType type alias
* Shift rotary_cos/rotary_sin by cache_seqlen_k
* Add comment for why I just 't' for all padding flags
* Align commit message to the real comment
* Fix wrong pipeline
* Rename utility function
* Disable host verification if API not exist
* Fix wrong rope key for fp8 pipeline
* Allow only apply RoPE on Q (without append KV)
* Add append-kv smoke tests
* Remove debug statements
* Remove more debug statements
* Re-arrange the 'set +x' command
* Remove no-longer used method in pipeline
* Add missing init code
* Refine pipeline padding settings
* Enlarge rotary_dim limit (8 -> 16)
* Enlarge KPerThread for rotary_interleaved=false
* Update rotary_dim range in smoke_test_fwd.sh
* Add template argument 'kIsPagedKV' for splitkv kernels
* Launch splitkv kernel if given page_block_size
* Fix wrong kernel name
* Fix seqlen_k_min for pre-fill case (1 -> 0)
* Add copy_const<> type trait
* Add another make_tile_window()
* Introduce 'TileWindowNavigator' types
* Simplify TileWindowNavigator interfaces
* Fix tile window navigation bugs
* Disable calling fmha_fwd()
* Remove ununnecessary data members
* Simplify more make_tile_window() overloads
* Move V tile through TileWindowNavigator
* Fix uneven split checking logic
* Move code after decide seqlen_q/seqlen_k
* Make sure we always start reading complete tile
* Use 128 as minimus page_block_size
* Fix wrong origin for bias
* Add batch_stride_k/batch_stride_v in group mode
* Unify origin
* Add missing kernel arguments for group mode
* Add paged-kv codegen logic for appendkv kernels
* Add block_table kernel args for appendkv kernel
* Add tile navigators to the appendkv kernel
* Fix wrong tensor descriptor lengths
* Pass re-created tile window to pipeline
* Fix wrong strides for appendkv kernel
* Allow transit tile_window to another page-block
* Handle cross-page-block write
* Donot perform write again if already in last page-block
* Always add fmha_fwd() api
* Add missing group mode argument
* Remove debug macro usages
* Rename option s_k_new to s_knew
* Separate splitkv/non-splitkv args/traits
* Remove fmha_fwd_dispatch()
* Fix compilation errors
* Remove dropout code in splitkv kernel
* Allow problem types without define kHasDropout attr
* Use generic lambda to init traits objects
* Separate more non-splitkv & splitkv traits/args
* Display more info for specific kernels
* Show more detailed warning message
* Rename 'max_num_blocks' to 'max_num_page_blocks'
* Remove no-longer used pipeline files
* Wrap code by #if directives
* Move functors to the begining of validation code
* Use generic lambda to init all the api traits/args
* Fix wrong seqlen for kvcache
* Add missing comment
* Rename TileWindowNavigator to PageBlockNavigator
* Only expose necessary methods (not attributes)
* Re-order pipeline paremeters
* Refine smoke_test_fwd.sh
* Fix wrong arugment count
* Make tile window directly via PageBlockNavigator
* Remove unused template paremeter
* Remove group mode from appendkv kernel
* Fix skcheck logic
* Fix wrong syntax in skcheck expr
* Use meaningful options in smoke test
* Remove options
* Fix formatting
* Fix more format
* Re-organize bash functions
* Pass cache_batch_idx to kernels
* Support cache_batch_idx in example
* Fix compilation error
* Add more appendkv test
* Add more case for appendkv
* Fix unexisted attribute
* Remove 0 < seqlen_knew constraint
* Clarify the case in warning message
* Remove macro checking
* Force batch mode when invoking appendkv & splitkv apis
* Fix mode overriding logics
* Fix wrong parameter name
* Randomize seqlen_k if use kvcache
* Use randomized seqlen_k for kvcache
* Avoid using too small rotary_cos & rotary_sin
* Rename parameter
* Add seqlen_q & seqlen_k rules
* Add comment
* Add more comments
* Fix compilation errors
* Fix typo in comment
* Remove type argument
* Avoid seqlen_k=0 for kvcache
* Revert "Avoid seqlen_k=0 for kvcache"
This reverts commit 21c4df89e4.
* Fix wrong uneven split checking logics
* Only randomize kvcache seqlen_k if 1 < batch
* Return earlier if split is empty
* Revert "Only randomize kvcache seqlen_k if 1 < batch"
This reverts commit b9a4ab0d7e.
* Re-order seqlen_k_start adjustment logics
* Fix compilation errors
* Re-format script
* Find executable from folder automatically
* Fix kvcache seqlen_k generating logic
* Make comment more clear
* Fix wrong knew/vew appending logic on host
* Add s_barrier to sync threads
* Revert "Add s_barrier to sync threads"
This reverts commit d3f550f30c.
* Support only using 1 row of rotary_cos/rotary_sin
* Rotate Q in different way
* Unify tensor view creation logics
* Fix wrong argument
* Add mask to switch how we use the rotary_cos/sin
* Move attr from traits to problem
* Move has_mask to fmha_fwd_appendkv_args
* Support use uint32_t as SAD operand in Alibi<>
* Use sad_u32() in splitkv kernels
* Store tensor views in PageBlockNavigator
* Use stored tensor view to update tile windows
* Enlarge tensor view size
* Remove debug code
* Fix wrong tensor view size
* Wrap tensor view into PageBlockNavigator
* Add DataType member to PageBlockNavigator
* Remove unnecessary member functions
* Refind macro use
* Fix typo
* Add blank line between directives and actual code
* Re-format files
* Remove type in comment
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* Add layernorm2d forward
* Refind file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
* enable gfx940
* switch between intrinsic mfma routines on mi100/200 and mi300
* fix mfma_int8 on MI300
* disable 2 int8 examples on MI300
* Update cmake-ck-dev.sh
* restore gitignore file
* modify Jenkinsfile to the internal repo
* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* initial enablement of gfx950
* fix clang format
* disable examples 31 and 41 int8 on gfx950
* add code
* fix build wip
* fix xx
* now can build
* naming
* minor fix
* wip fix
* fix macro for exp2; fix warpgemm a/b in transposedC
* unify as tuple_array
* Update the required Python version to 3.9
* Update executable name in test scripts
* re-structure tuple/array to avoid spill
* Merge function templates
* Fix format
* Add constraint to array<> ctor
* Re-use function
* Some minor changes
* remove wrong code in store_raw()
* fix compile issue in transpose
* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'
* let more integral_constant->constant, and formating
* make sure thread_buffer can be tuple/array
* temp fix buffer_store spill
* not using custom data type by default, now we can have ISA-level same code as opt_padding
* fix compile error, fp8 not ready now
* fix fp8 duplicated move/shift/and/or problem
* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
* fix scratch in fp8 kernel
* update some readme
* fix merge from upstream
* sync with upstream
* sync upstream again
* sync 22
* remove unused
* fix clang-format
* update README of ck_tile example
* fix several issue
* let python version to be 3.8 as minimal
* remove ck_tile example from default cmake target like all/install/check
* remove mistake
* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
* fix some bug in group-mode masking and codegen. update README
* F8 quantization for FMHA forward (#1224)
* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
* Add element function to fmha api
* Adjust P elementwise function
* Fix bug of elementwise op, our elementwise op is not inout
* Add some elementwise op, prepare to quantization
* Let generate.py can generate different elementwise function
* To prevent compiler issue, remove the elementwise function we have not used.
* Remove f8 pipeline, we should share the same pipeline even in f8
* Remove remove_cvref_t
* Avoid warning
* Fix wrong fp8 QK/KV block gemm setting
* Check fp8 rounding error in check_err()
* Set fp8 rounding error for check_err()
* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
* 1. codgen the f8 api and kernel
2. f8 host code
* prevent warning in filter mode
* Remove not-in-use elementwise function kargs
* Remove more not-in-use elementwise function kargs
* Small refinements in C++ source files
* Use conditional_t<> to simplify code
* Support heterogeneous argument for binary function types
* Re-use already-existing scales<> functor template
* Fix wrong value produced by saturating
* Generalize the composes<> template
* Unify saturates<> implementation
* Fix type errors in composes<>
* Extend less_equal<>
* Reuse the existing template less_equal<> in check_err()
* Add equal<float> & equal<double>
* Rename check_err() parameter
* Rename check_err() parameter
* Add FIXME comment for adding new macro in future
* Remove unnecessary cast to void
* Eliminate duplicated code
* Avoid dividing api pool into more than 2 groups
* Use more clear variable names
* Use affirmative condition in if stmt
* Remove blank lines
* Donot perfect forwarding in composes<>
* To fix compile error, revert generate.py back to 4439cc107d
* Fix bug of p element function
* Add compute element op to host softmax
* Remove element function in api interface
* Extract user parameter
* Rename pscale and oscale variable
* rename f8 to fp8
* rename more f8 to fp8
* Add pipeline::operator() without element_functor
* 1. Remove deprecated pipeline enum
2. Refine host code parameter
* Use quantization range as input
* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description
* Refine description
* prevent early return
* unify _squant kernel name in cpp, update README
* Adjust the default range.
* Refine error message and bias range
* Add fp8 benchmark and smoke test
* fix fp8 swizzle_factor=4 case
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>