* Add more printing to core cktile
* Revert other changes in static encoding pattern
* Refactor to using a free print() function
* Remove loops and print just the containers
* Print tuple with better formatting, fix sequence compilation
* Add some tests for print utility
* Add print utility header
* Print for static_encoding_pattern
* add buffer_view printing
* Align vector_traits
* Fix formatting
* Lower-case enum strings
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Remove empty comment lines
* Fix test with lower-case too
* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y
* Add test_print_common.hpp
* add print.hpp in core.hpp
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Add tests for host convesion f32/f16 to f8
* Add tests for host convesion from f8 to f32/f16
* Fix UB and corner cases in f32/f16 to/from f8 conversion
* There are UBs when very small values are converted to f8: bitshifts
can be larger that type width. Using unsigned long long does not help
because exponent_diff >= 64 in such cases. This causes that values
like 2.117582368e-22 are converted to non-zero f8 in host validation
of FMHA tests, test_f8 crashes with segfault in completely irrelevant
code like GTest internals or produces non-deterministic results etc.
* Fix FNUZ conversion to return NaN for NaN inputs.
* Fix compilation error (due to uint8_t << 8) in OCP e5m2 to f16
conversion.
* Replace some magic numbers with values from numeric_traits
* Build tests only on devices supporting the type
* ck_tile kernel for gemm with groupwise quantized A or B tensor.
This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers.
* [Block Scale Gemm] Only run gemm quant examples on __gfx94__
- Only run gemm quant examples on __gfx94__ for usage of
`v_cvt_pk_fp8_f32`
- Format the code
* [Block Scale Gemm] Remove Bquant Gemm BlockScale
This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.
* [Block Scale Gemm] Format code with clang-format-12
The latest clang-format (v19) in ROCm 7.0 generate different result than
clang-format-12 which is used in CK CI.
Format code with clang-format-12 for consistency.
* [Block Scale Gemm] Split the k direction loop
- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
to make the logic clearer.
- Disable C transposition.
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm
* [Block Scale Gemm] Update copyright
* test
* Add TailHandler
* Move TileDistributionEncodingPatternAQ
* Refactor
* refactor
* fix bug
* fix bug
* help solve the PR comment
* Format the code
* [Block Scale Gemm] Add unit tests
* [Block Scale Gemm] Add support to 16x16x32 MFMA
- Add support to 16x16x32 MFMA
- Fix a bug when exchange data crossing lanes
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* [draft] Add pk_fp4 and test
* Add hw conversion for fp4
* Refine test code and pk_fp4 constructor.
* fix test indent
* modify according to comment.
* fix clang-format
* modify according comments.
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
* updates to support int8 in 03_gemm example
* added comments, using aliases, helper functions
* test(gemm_universal): add test cases for int8 gemm pipeline
* fix(test_gemm): fix for failing test unit test for int8
* test(ck_tile): add int8 unit test for gemm universal
* refactor(gemm_universal): GPU reference verification for GEMM code improved
* style(gemm_universal): removed extra comments and did clang format
* merging recent changes to universal gemm to tile_engine
* ck tile engine integration work
* feat(tile_engine): add int8 support to tile engine ops/gemm
* feat(tile_engine): added 32 32 16 mfma instances to tile engine for int8
* style: Format code with clang-format-12
* refactor(tile_engine): address review comments
* style: removed unhelpful comments & unused variables.
* build: tile engine uses default config
* feat: add int8 support for CK_TILE GEMM
* style: added trailing commas to codegen_utils.py
* refactor: tile engine
* refactor: formatting and code review
* refactor: code formatting for python files
* fix: suppress build warning
* add support for gfx950
* refactor:KWarpTile size in gemms util
* Fix the branch and wrap up the k warp tile
* Add bf8 integration
* refactor: clang format and rebase
---------
Co-authored-by: zjli2013 <leezhengjiang@gmail.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Khushbu Agarwal <khuagarw@amd.com>
* add prenorm/postnorm support, refactor using generate.py
* update README
* update README
* fix format
* update some description and fix format
* update format
* format
* use non-raw for loading
* format and update n4096
* dynamic-quant ready
* update readme
* support fused dynamic-quant
* update fused-quant, with smooth
* update README
* update args
* update some based on comment
* Add ceil_to_qualified_tile_length()
* Rename kK0BlockLength to kQKHeaddim
* Add kSubQKHeaddim concept to support headdim96
* Fix in math.hpp to avoid using __half interfaces
* Add LdsBufferSequence instance for headdim96
* Update in fmha_fwd/fmha_fwd_splitkv codegen to support hd96 testing
* Disable hd96 instance generation in codegen fmha_fwd and fmha_fwd_splitkv to save compiling time
* Reformat one file
* Fix text alignment in fmha_fwd_splitkv.py
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Use dictionary to config all the functions
* Add init codegen logic for fmha fwd appendkv
* Call HIP_CHECK_ERROR() macro to get real source info
* Setup meaningfull arguments
* Sync kernel name with the codegen
* Add knew/vnew tensors to the kernel argument
* Fix wrong K values after appending
* Fix vnew append errro
* Extract common logics
* Fix Vnew tile dstr for row major case
* Conditionally add fwd_splitkv API in fmha_fwd example
* Conditionally add call to fmha_fwd_splitkv()
* Remove "EXAMPLE_" prefix of cmake variables
* Regsiter API handlers automatically
* Early return if 0 < s_k_new is not supported
* Show message if we are ignoring option
* Unify CMakeLists.txt coding style
* Set num_splits=1 if split-kv is not supported
* Add length/stride getters for HostTensor
* Add RoPE example utilities
* Add reference_rotary_position_embedding() (not implemented)
* Finish reference_rotary_position_embedding() impl
* Fix typo of HostTensor<>::get_length()
* Fix compilation errors
* Fix wrong answer when interleaved=false
* Fix wrong answer when interleaved=true
* Append K/V in the host verification code
* Simplify K appending logics
* Simplify v_host_ref definition
* Reduce input/output dimensions
* Rename function: add "batched" prefix
* Apply RoPE on host side
* Rename RoPE utility function
* Fix wrong tensor size
* Avoid invoking deprecated method 'find_module'
* Pass RoPE kernel args
* Create Rotary Cos/Sin tile windows in kernel
* Add compute data type alias for RoPE
* Randomly generate seqlen_knew if needed
* Fix seqlen_knew enabling check logic
* Add minimum seqlen_k to generate compliance kvcache
* Fix compilation error in debug mode
* Fix wrong boundaries
* Fix wrong seqlen_k for kvcache
* Rename variables used in distributio encoding
* Fix rotary cos/sin tensor/tile size
* Add constraint to the rotary_dim option
* Remove unused inner namespace
* Add dram distribution for rotary_cos/rotary_sin (interleaved)
* Only apply interleaved RoPE on Knew for now
* Fix wrong thread starting offset
* Instantiate multiple kernels for RoPE approaches
* Clean-up pipeline
* Fix error in RoPE host reference
* Handle RoPE half-rotated logics
* Support 8x rotary_dim under half-rotated RoPE
* Add comment
* Apply elementwise function to the loaded tiles
* Unify parameter/variable naming style
* Remove constness from q_ptr
* Add code blocks for q_tile
* Apply RoPE to q_tile
* Remove debug print code in kernel
* Fix wrong knew/vnew appending positions
* Use better naming for tile indices
* Add make_tile_window() for adding distribution only
* Skip code if # of block is more than needed
* Move thread locating logics into policy
* Remove always true static_assert()
* Rename header
* Rename RotaryEmbeddingEnum
* Extract rotary embedding logic out
* Re-order parameters
* Align naming of some tile size constants
* Rename more tile size constants
* Fix wrong grid size
* Fix wrong shape of knew_host/vnew_host
* Fix wrong index into knew_host/vnew_host
* Fix wrong rotary_cos/rotary_sin memory size for Q
* Extract Q/Knew vector size to helper methods
* Use different rotary_cos/rotary_sin distr for Q/Knew
* Update host/device specifiers
* Fix wrong data type for Q rotary_cos/rotary_sin
* Remove RoPEComputeDataType type alias
* Shift rotary_cos/rotary_sin by cache_seqlen_k
* Add comment for why I just 't' for all padding flags
* Align commit message to the real comment
* Fix wrong pipeline
* Rename utility function
* Disable host verification if API not exist
* Fix wrong rope key for fp8 pipeline
* Allow only apply RoPE on Q (without append KV)
* Add append-kv smoke tests
* Remove debug statements
* Remove more debug statements
* Re-arrange the 'set +x' command
* Remove no-longer used method in pipeline
* Add missing init code
* Refine pipeline padding settings
* Enlarge rotary_dim limit (8 -> 16)
* Enlarge KPerThread for rotary_interleaved=false
* Update rotary_dim range in smoke_test_fwd.sh
* Add template argument 'kIsPagedKV' for splitkv kernels
* Launch splitkv kernel if given page_block_size
* Fix wrong kernel name
* Fix seqlen_k_min for pre-fill case (1 -> 0)
* Add copy_const<> type trait
* Add another make_tile_window()
* Introduce 'TileWindowNavigator' types
* Simplify TileWindowNavigator interfaces
* Fix tile window navigation bugs
* Disable calling fmha_fwd()
* Remove ununnecessary data members
* Simplify more make_tile_window() overloads
* Move V tile through TileWindowNavigator
* Fix uneven split checking logic
* Move code after decide seqlen_q/seqlen_k
* Make sure we always start reading complete tile
* Use 128 as minimus page_block_size
* Fix wrong origin for bias
* Add batch_stride_k/batch_stride_v in group mode
* Unify origin
* Add missing kernel arguments for group mode
* Add paged-kv codegen logic for appendkv kernels
* Add block_table kernel args for appendkv kernel
* Add tile navigators to the appendkv kernel
* Fix wrong tensor descriptor lengths
* Pass re-created tile window to pipeline
* Fix wrong strides for appendkv kernel
* Allow transit tile_window to another page-block
* Handle cross-page-block write
* Donot perform write again if already in last page-block
* Always add fmha_fwd() api
* Add missing group mode argument
* Remove debug macro usages
* Rename option s_k_new to s_knew
* Separate splitkv/non-splitkv args/traits
* Remove fmha_fwd_dispatch()
* Fix compilation errors
* Remove dropout code in splitkv kernel
* Allow problem types without define kHasDropout attr
* Use generic lambda to init traits objects
* Separate more non-splitkv & splitkv traits/args
* Display more info for specific kernels
* Show more detailed warning message
* Rename 'max_num_blocks' to 'max_num_page_blocks'
* Remove no-longer used pipeline files
* Wrap code by #if directives
* Move functors to the begining of validation code
* Use generic lambda to init all the api traits/args
* Fix wrong seqlen for kvcache
* Add missing comment
* Rename TileWindowNavigator to PageBlockNavigator
* Only expose necessary methods (not attributes)
* Re-order pipeline paremeters
* Refine smoke_test_fwd.sh
* Fix wrong arugment count
* Make tile window directly via PageBlockNavigator
* Remove unused template paremeter
* Remove group mode from appendkv kernel
* Fix skcheck logic
* Fix wrong syntax in skcheck expr
* Use meaningful options in smoke test
* Remove options
* Fix formatting
* Fix more format
* Re-organize bash functions
* Pass cache_batch_idx to kernels
* Support cache_batch_idx in example
* Fix compilation error
* Add more appendkv test
* Add more case for appendkv
* Fix unexisted attribute
* Remove 0 < seqlen_knew constraint
* Clarify the case in warning message
* Remove macro checking
* Force batch mode when invoking appendkv & splitkv apis
* Fix mode overriding logics
* Fix wrong parameter name
* Randomize seqlen_k if use kvcache
* Use randomized seqlen_k for kvcache
* Avoid using too small rotary_cos & rotary_sin
* Rename parameter
* Add seqlen_q & seqlen_k rules
* Add comment
* Add more comments
* Fix compilation errors
* Fix typo in comment
* Remove type argument
* Avoid seqlen_k=0 for kvcache
* Revert "Avoid seqlen_k=0 for kvcache"
This reverts commit 21c4df89e4.
* Fix wrong uneven split checking logics
* Only randomize kvcache seqlen_k if 1 < batch
* Return earlier if split is empty
* Revert "Only randomize kvcache seqlen_k if 1 < batch"
This reverts commit b9a4ab0d7e.
* Re-order seqlen_k_start adjustment logics
* Fix compilation errors
* Re-format script
* Find executable from folder automatically
* Fix kvcache seqlen_k generating logic
* Make comment more clear
* Fix wrong knew/vew appending logic on host
* Add s_barrier to sync threads
* Revert "Add s_barrier to sync threads"
This reverts commit d3f550f30c.
* Support only using 1 row of rotary_cos/rotary_sin
* Rotate Q in different way
* Unify tensor view creation logics
* Fix wrong argument
* Add mask to switch how we use the rotary_cos/sin
* Move attr from traits to problem
* Move has_mask to fmha_fwd_appendkv_args
* Support use uint32_t as SAD operand in Alibi<>
* Use sad_u32() in splitkv kernels
* Store tensor views in PageBlockNavigator
* Use stored tensor view to update tile windows
* Enlarge tensor view size
* Remove debug code
* Fix wrong tensor view size
* Wrap tensor view into PageBlockNavigator
* Add DataType member to PageBlockNavigator
* Remove unnecessary member functions
* Refind macro use
* Fix typo
* Add blank line between directives and actual code
* Re-format files
* Remove type in comment
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* Add layernorm2d forward
* Refind file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
error: no viable conversion from returned value of type '__half' to function return type 'fp16_hip_t' (aka '_Float16')
Co-authored-by: carlushuang <carlus.huang@amd.com>
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* enable gfx940
* switch between intrinsic mfma routines on mi100/200 and mi300
* fix mfma_int8 on MI300
* disable 2 int8 examples on MI300
* Update cmake-ck-dev.sh
* restore gitignore file
* modify Jenkinsfile to the internal repo
* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* initial enablement of gfx950
* fix clang format
* disable examples 31 and 41 int8 on gfx950
* add code
* fix build wip
* fix xx
* now can build
* naming
* minor fix
* wip fix
* fix macro for exp2; fix warpgemm a/b in transposedC
* unify as tuple_array
* Update the required Python version to 3.9
* Update executable name in test scripts
* re-structure tuple/array to avoid spill
* Merge function templates
* Fix format
* Add constraint to array<> ctor
* Re-use function
* Some minor changes
* remove wrong code in store_raw()
* fix compile issue in transpose
* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'
* let more integral_constant->constant, and formating
* make sure thread_buffer can be tuple/array
* temp fix buffer_store spill
* not using custom data type by default, now we can have ISA-level same code as opt_padding
* fix compile error, fp8 not ready now
* fix fp8 duplicated move/shift/and/or problem
* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
* fix scratch in fp8 kernel
* update some readme
* fix merge from upstream
* sync with upstream
* sync upstream again
* sync 22
* remove unused
* fix clang-format
* update README of ck_tile example
* fix several issue
* let python version to be 3.8 as minimal
* remove ck_tile example from default cmake target like all/install/check
* remove mistake
* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
* fix some bug in group-mode masking and codegen. update README
* F8 quantization for FMHA forward (#1224)
* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
* Add element function to fmha api
* Adjust P elementwise function
* Fix bug of elementwise op, our elementwise op is not inout
* Add some elementwise op, prepare to quantization
* Let generate.py can generate different elementwise function
* To prevent compiler issue, remove the elementwise function we have not used.
* Remove f8 pipeline, we should share the same pipeline even in f8
* Remove remove_cvref_t
* Avoid warning
* Fix wrong fp8 QK/KV block gemm setting
* Check fp8 rounding error in check_err()
* Set fp8 rounding error for check_err()
* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
* 1. codgen the f8 api and kernel
2. f8 host code
* prevent warning in filter mode
* Remove not-in-use elementwise function kargs
* Remove more not-in-use elementwise function kargs
* Small refinements in C++ source files
* Use conditional_t<> to simplify code
* Support heterogeneous argument for binary function types
* Re-use already-existing scales<> functor template
* Fix wrong value produced by saturating
* Generalize the composes<> template
* Unify saturates<> implementation
* Fix type errors in composes<>
* Extend less_equal<>
* Reuse the existing template less_equal<> in check_err()
* Add equal<float> & equal<double>
* Rename check_err() parameter
* Rename check_err() parameter
* Add FIXME comment for adding new macro in future
* Remove unnecessary cast to void
* Eliminate duplicated code
* Avoid dividing api pool into more than 2 groups
* Use more clear variable names
* Use affirmative condition in if stmt
* Remove blank lines
* Donot perfect forwarding in composes<>
* To fix compile error, revert generate.py back to 4439cc107d
* Fix bug of p element function
* Add compute element op to host softmax
* Remove element function in api interface
* Extract user parameter
* Rename pscale and oscale variable
* rename f8 to fp8
* rename more f8 to fp8
* Add pipeline::operator() without element_functor
* 1. Remove deprecated pipeline enum
2. Refine host code parameter
* Use quantization range as input
* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description
* Refine description
* prevent early return
* unify _squant kernel name in cpp, update README
* Adjust the default range.
* Refine error message and bias range
* Add fp8 benchmark and smoke test
* fix fp8 swizzle_factor=4 case
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>