* ake the cshuffle compilable
* modify Mhe reference on gpu and cpu. Correaccess of cshuffle
* fix the cpu reference code
* Complete the in tile shuffle logic
* restructure the kernel template input
* change the naming pattern of ck_tile gemm pipeline
* Re-format files using remod.py
* Solve the fmha conflict with gemm
* Comment Addressed from Carlus
---------
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
* Add a gpu gemm reference kernel
* Switch to gpu reference in gemm examples
* Remove redundant arguments
* Update all related examples
* Update more examples
* Try less threads per block
* Try even less threads per block
* Add support for all matrix layouts
* Increase block size
* Clean up
* Remove hardcoded strides
* Clean up
* Try a column-major case
* Revert back to row-major
* Run both CPU and GPU veriffication
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix text alignment of ArgParser::print()
* Update example README files
* Clarify make-ck-dev.sh <arch> usage
* Only keep some of the argument from '-?' output
* Undo command line output changes in README
* Only keep existing argument on doc and update description
* Fix text alignment
* Make cmake-ck-*.sh compatible with 'sh' command
* Simplify the codes in splitkv_combine pipeline
* Always set kPadSeqLenK=true for fmha splitkv kernels
* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Fix compile error
* Add one pass pipeline
* Extract creating tile_window to operator()
* clang format
* reduce duplicated code
* do not hardcode
* Support padding in layernorm
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Adding seed and offset pointer support to the philox random number generator.
* Separating seed and offset pointer checks with different condition statements.
* Changes include, adding support for device seed and offset pointers, union is used to store seed/offset values and device pointers to minimize device SGPRs.
* Correcting a typo in the readme file
* Re-format files using remod.py
* Use STL type for API parameters
* Use simpler struct design for drop_seed & drop_offset
* Undo unnecessary changes
* Sync kargs style for fmha_fwd.hpp/.cpp
* Use templated union to reduce code
* Use structured binding to make code more readable
---------
Co-authored-by: Sudhir Kylasa <sukylasa@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
* Use same layout for o_acc and o tensor
* Use better param names in partitioner
* Remove redundant kargs 'max_seqlen_q'
* Use better param names in splitkv kernel
* Add comment for additional kernel arguments
* Sync empty loop early return logics between pipelines
* Pass more arguments to cmake in scripts
* Align backslashes
* Fix wrong o_acc tensor view strides
* Change o_acc layout if o_perm=0
* Handle whole row masked via attn_bias
* Use use vector width = 1 for o_acc
* Use more even split sizes
* Finished the feature of gpu verification
* Add the ck_tile_gemm test in the CI CD
* add the include of tensor_layou in reference_gemm
* Comment Addressed
* split ck_tile fhma and gemm tests into separate stages
* restructure the reference gemm
* restructure a new reference_gemm api that could read the device mem
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Legacy support: customized filesystem
* Update cmakefile for python alternative path
* fix build issues
* CK has no boost dependency
* More fixes to issues found on legay systems
* fix clang format issue
* Check if blob is correctly generated in cmake
* fix the python issues
* add a compiler flag for codegen when using alternative python
* use target_link_options instead of target_compile_options
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Checkpoint: Finished with the tile example & kernel verification, working on the different matrix layout
* Finished the Matrix Layout feature set up. Note: Need to modify the inner block to solve the shuffle problem in the future.
* Fix: Clang Format, API fixed from fmha
* fix with better naming convention
* revert back the pipeline code of fmha
* Fixed: Addressed the comments and merge the GEMM shape of GEMM Operator and FMHA Operator to one.
* clang format with the reference_gemm file
* convert the clang format with the remod.py
* Changed the format and variable name of the kernel gemm_shape and partitioner
---------
Co-authored-by: thomasning <thomasning@banff-cyxtera-s70-4.ctr.dcgpu>
* revert ckprofiler change
* temp save
* Add test and test pass
* test pass
* Fix bug inside rotating buffer when tensor is not packed
* bug fix
* clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Use dictionary to config all the functions
* Add init codegen logic for fmha fwd appendkv
* Call HIP_CHECK_ERROR() macro to get real source info
* Setup meaningfull arguments
* Sync kernel name with the codegen
* Add knew/vnew tensors to the kernel argument
* Fix wrong K values after appending
* Fix vnew append errro
* Extract common logics
* Fix Vnew tile dstr for row major case
* Conditionally add fwd_splitkv API in fmha_fwd example
* Conditionally add call to fmha_fwd_splitkv()
* Remove "EXAMPLE_" prefix of cmake variables
* Regsiter API handlers automatically
* Early return if 0 < s_k_new is not supported
* Show message if we are ignoring option
* Unify CMakeLists.txt coding style
* Set num_splits=1 if split-kv is not supported
* Add length/stride getters for HostTensor
* Add RoPE example utilities
* Add reference_rotary_position_embedding() (not implemented)
* Finish reference_rotary_position_embedding() impl
* Fix typo of HostTensor<>::get_length()
* Fix compilation errors
* Fix wrong answer when interleaved=false
* Fix wrong answer when interleaved=true
* Append K/V in the host verification code
* Simplify K appending logics
* Simplify v_host_ref definition
* Reduce input/output dimensions
* Rename function: add "batched" prefix
* Apply RoPE on host side
* Rename RoPE utility function
* Fix wrong tensor size
* Avoid invoking deprecated method 'find_module'
* Pass RoPE kernel args
* Create Rotary Cos/Sin tile windows in kernel
* Add compute data type alias for RoPE
* Randomly generate seqlen_knew if needed
* Fix seqlen_knew enabling check logic
* Add minimum seqlen_k to generate compliance kvcache
* Fix compilation error in debug mode
* Fix wrong boundaries
* Fix wrong seqlen_k for kvcache
* Rename variables used in distributio encoding
* Fix rotary cos/sin tensor/tile size
* Add constraint to the rotary_dim option
* Remove unused inner namespace
* Add dram distribution for rotary_cos/rotary_sin (interleaved)
* Only apply interleaved RoPE on Knew for now
* Fix wrong thread starting offset
* Instantiate multiple kernels for RoPE approaches
* Clean-up pipeline
* Fix error in RoPE host reference
* Handle RoPE half-rotated logics
* Support 8x rotary_dim under half-rotated RoPE
* Add comment
* Apply elementwise function to the loaded tiles
* Unify parameter/variable naming style
* Remove constness from q_ptr
* Add code blocks for q_tile
* Apply RoPE to q_tile
* Remove debug print code in kernel
* Fix wrong knew/vnew appending positions
* Use better naming for tile indices
* Add make_tile_window() for adding distribution only
* Skip code if # of block is more than needed
* Move thread locating logics into policy
* Remove always true static_assert()
* Rename header
* Rename RotaryEmbeddingEnum
* Extract rotary embedding logic out
* Re-order parameters
* Align naming of some tile size constants
* Rename more tile size constants
* Fix wrong grid size
* Fix wrong shape of knew_host/vnew_host
* Fix wrong index into knew_host/vnew_host
* Fix wrong rotary_cos/rotary_sin memory size for Q
* Extract Q/Knew vector size to helper methods
* Use different rotary_cos/rotary_sin distr for Q/Knew
* Update host/device specifiers
* Fix wrong data type for Q rotary_cos/rotary_sin
* Remove RoPEComputeDataType type alias
* Shift rotary_cos/rotary_sin by cache_seqlen_k
* Add comment for why I just 't' for all padding flags
* Align commit message to the real comment
* Fix wrong pipeline
* Rename utility function
* Disable host verification if API not exist
* Fix wrong rope key for fp8 pipeline
* Allow only apply RoPE on Q (without append KV)
* Add append-kv smoke tests
* Remove debug statements
* Remove more debug statements
* Re-arrange the 'set +x' command
* Remove no-longer used method in pipeline
* Add missing init code
* Refine pipeline padding settings
* Enlarge rotary_dim limit (8 -> 16)
* Enlarge KPerThread for rotary_interleaved=false
* Update rotary_dim range in smoke_test_fwd.sh
* Add template argument 'kIsPagedKV' for splitkv kernels
* Launch splitkv kernel if given page_block_size
* Fix wrong kernel name
* Fix seqlen_k_min for pre-fill case (1 -> 0)
* Add copy_const<> type trait
* Add another make_tile_window()
* Introduce 'TileWindowNavigator' types
* Simplify TileWindowNavigator interfaces
* Fix tile window navigation bugs
* Disable calling fmha_fwd()
* Remove ununnecessary data members
* Simplify more make_tile_window() overloads
* Move V tile through TileWindowNavigator
* Fix uneven split checking logic
* Move code after decide seqlen_q/seqlen_k
* Make sure we always start reading complete tile
* Use 128 as minimus page_block_size
* Fix wrong origin for bias
* Add batch_stride_k/batch_stride_v in group mode
* Unify origin
* Add missing kernel arguments for group mode
* Add paged-kv codegen logic for appendkv kernels
* Add block_table kernel args for appendkv kernel
* Add tile navigators to the appendkv kernel
* Fix wrong tensor descriptor lengths
* Pass re-created tile window to pipeline
* Fix wrong strides for appendkv kernel
* Allow transit tile_window to another page-block
* Handle cross-page-block write
* Donot perform write again if already in last page-block
* Always add fmha_fwd() api
* Add missing group mode argument
* Remove debug macro usages
* Rename option s_k_new to s_knew
* Separate splitkv/non-splitkv args/traits
* Remove fmha_fwd_dispatch()
* Fix compilation errors
* Remove dropout code in splitkv kernel
* Allow problem types without define kHasDropout attr
* Use generic lambda to init traits objects
* Separate more non-splitkv & splitkv traits/args
* Display more info for specific kernels
* Show more detailed warning message
* Rename 'max_num_blocks' to 'max_num_page_blocks'
* Remove no-longer used pipeline files
* Wrap code by #if directives
* Move functors to the begining of validation code
* Use generic lambda to init all the api traits/args
* Fix wrong seqlen for kvcache
* Add missing comment
* Rename TileWindowNavigator to PageBlockNavigator
* Only expose necessary methods (not attributes)
* Re-order pipeline paremeters
* Refine smoke_test_fwd.sh
* Fix wrong arugment count
* Make tile window directly via PageBlockNavigator
* Remove unused template paremeter
* Remove group mode from appendkv kernel
* Fix skcheck logic
* Fix wrong syntax in skcheck expr
* Use meaningful options in smoke test
* Remove options
* Fix formatting
* Fix more format
* Re-organize bash functions
* Pass cache_batch_idx to kernels
* Support cache_batch_idx in example
* Fix compilation error
* Add more appendkv test
* Add more case for appendkv
* Fix unexisted attribute
* Remove 0 < seqlen_knew constraint
* Clarify the case in warning message
* Remove macro checking
* Force batch mode when invoking appendkv & splitkv apis
* Fix mode overriding logics
* Fix wrong parameter name
* Randomize seqlen_k if use kvcache
* Use randomized seqlen_k for kvcache
* Avoid using too small rotary_cos & rotary_sin
* Rename parameter
* Add seqlen_q & seqlen_k rules
* Add comment
* Add more comments
* Fix compilation errors
* Fix typo in comment
* Remove type argument
* Avoid seqlen_k=0 for kvcache
* Revert "Avoid seqlen_k=0 for kvcache"
This reverts commit 21c4df89e4.
* Fix wrong uneven split checking logics
* Only randomize kvcache seqlen_k if 1 < batch
* Return earlier if split is empty
* Revert "Only randomize kvcache seqlen_k if 1 < batch"
This reverts commit b9a4ab0d7e.
* Re-order seqlen_k_start adjustment logics
* Fix compilation errors
* Re-format script
* Find executable from folder automatically
* Fix kvcache seqlen_k generating logic
* Make comment more clear
* Fix wrong knew/vew appending logic on host
* Add s_barrier to sync threads
* Revert "Add s_barrier to sync threads"
This reverts commit d3f550f30c.
* Support only using 1 row of rotary_cos/rotary_sin
* Rotate Q in different way
* Unify tensor view creation logics
* Fix wrong argument
* Add mask to switch how we use the rotary_cos/sin
* Move attr from traits to problem
* Move has_mask to fmha_fwd_appendkv_args
* Support use uint32_t as SAD operand in Alibi<>
* Use sad_u32() in splitkv kernels
* Store tensor views in PageBlockNavigator
* Use stored tensor view to update tile windows
* Enlarge tensor view size
* Remove debug code
* Fix wrong tensor view size
* Wrap tensor view into PageBlockNavigator
* Add DataType member to PageBlockNavigator
* Remove unnecessary member functions
* Refind macro use
* Fix typo
* Add blank line between directives and actual code
* Re-format files
* Remove type in comment
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* Enable CMakePresets build
* Verify Convolution, Scaling and ReLU algorithms.
* Add tensor element-wise scale and type cast operation.
* Reduction implemented but does not work.
* Exploration of Reduction functionality.
* Completed example for Convolution scaled with ReLu activation and AMAX reduction.
* WIP: Add required instances for convolution.
* WIP: Create client example. Implement convolution stage.
* Add elementwise instances.
* Add elementwise scale + convert example.
* Add reduction instances.
* WIP: Client example for AMAX reduction.
* WIP: Add instances for multistage reduction.
* WIP: Implementation of multistage reduction.
* Refactoring.
* Clean up.
* Add CMakePresets.json
* Guard off FP8 instances when the data type is not available.
* Add example for Scaled FP8 Convolution with AMAX reduction.
* Refactor CombConvScaleRelu instances.
* Add CombConvScale instances.
* Add client example for Scaled FP8 Convolution with AMAX reduction.
* Cleanup.
* Set RNE fp8 conversion as a default
* Update f8 tests
* Disable failing test on gfx11
* Update bf8 tests
* Add a flag
* Fix the flag
* Raise flag for gfx10 as well
* Temp commit for tolerance testing
* Update tolerances
* run ck_tile benchmarks after the smoke tests and store logs
* change the path of fmha benchmark logs
* change the way of stashig ck_tile fmha logs
* prevent the errors in stages where no logs are generated
* fix the ck_tile fmha log names and headers
* generate the fmha performance logs in the root folder
* change jenkins scrip arguments format
* use exact file names for stashing
* modify scripts to process FMHA performance results
* unstash FMHA logs before parsing them
* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove not needed instances
* examples fixes
* Remove not need ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
* add ab_scale init support
* enabled interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
---------
Co-authored-by: root <jizhan@amd.com>
* init for reduce_threadwise multi_d
* add reduce_threadwise_multi_d
* add reduce_multi_d
* clean
* start add an other splitk device op
* add reduce template parameter to SplitKBatchOffset
* add reduce c matrix
* clean up code
* change example data type to bf16
* add bf16Ai8B example
* remove reduce template parameter
* add splitk atomic status to v4
* example add multi d parameters
* device op add multi-d parameters
* add multi-d to reduce
* fix kbach=1 bug
* change B layout to col in bf16Ai8B example
* remove float adding struct
* change multi-d interface
* change file and class name
* remove multi-d of bf16Ai8B example
* change IsReduce function to IsReduceAdd
* change example layout to RRR from RCR
* according layout to set ds stride
* reset parameter layout
* add gemm universal reduce instance
* add reduce factory
* add profile_gemm_universal_reduce
* add reduce to profiler
* fix reduce instance
* fix profiler reduce compiling bug
* format
* format library instance code
* add mem instance for reduce library
* fix call instance names
* add workspace for reduce in ckProfiler
* format
* add mnpading to reduce library instance
* add fp16 instance to reduce of profiler
* change copyright time
* restore profiler cmake file
* add reduce text to instances
* add DsLayout and DsDataType to instances template parameter
* fixed gemm_reduce_multi_d
* add an example without multi_d
* Update common.hpp
* Update gtest.cmake
* Update gemm_xdl_splitk_reduce_bf16.cpp
* clean
* Update gtest.cmake
* format
* fixe api
* format
* default parameter change to RRR
* add vector_len for multi_d
* format
* Update gtest.cmake
* fix bf16A iBB elementwiseop
* add ReduceDataType
* move ReduceDataType to end position
* format
* remove googletest git method address
* fix copyright time
* update init data
---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add CMakePresets configurations.
* Add ConvScale+ReLU Functor and an Example
* Account for ReLU FLOPs.
* Add instances of 3D convolutions with ConvscaleRelu operation.
* Implement Client Example
* Cleanup
* add ck_tile tests to CI
* build and run ck_tile tests on gfx90a and gfx942 in parallel
* fix groovy syntax
* turn ck_tile tests OFF by default
* skip creating the build folder
* build ck_tile examples with 64 threads
* build ck_tile examples with cmake-ck-dev.sh script
* add video group to docker on mi300
* do not retry to rebuild the early CI stages
* help prevent jenkins false failure
* restore cron trigger
* wa prec, remove sgpr offset for inline asm
* macro for set tile
* ignore unused param if no kernel instances in host API
* fix more prec issue
* cache buffer resource
* fix
* support pre-nop
* clear tile by vector type members
* add workaround to reduce scratch memory
* conditionally enable workaround code
* enable workaround start from certain build version
* fallback set_tile() implementation from certain build version
* undo template argument changes
* put dummy asm in load_raw()
* fix comments, refactor s_nop inside buffer_load
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile).
* Update README.md
* fixing clang-format issues
* removed conflicts in struct members between streamk and universal streamk
* corrected arg parsing for streamk and universal streamk
* added stream-k policies for 3 tile and 4 tile
* fixed argument type issue with parsing cmd args
* changes suggested in PR review are made- removing comments and correcting copyright
* file permissions updated
* added default value support for grid_size and streamk-policy selection set to -1
* print messages for arguments
* print messages for arguments
* print messages for arguments1
* Add layernorm2d forward
* Refind file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation.
This commit handles bf8@bf8->fp8 data types combination.
* Included an example.
* Added instances.
* Added a client example.
---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>