* Add cshuffle epilogue test
* add the poc implementation to the epilogue and tests
* refactor cshuffle epilogue
* WIP: adding tensor/tile usage to scale_tile
* fix usage of tile_elementwise_inout
* add gemm_quant_kernel for generalizing gemm quant kernel
* Add problem specific to different quants, add QuantType to Traits
* Add quant_type to quant_kernel template parameters
* Create aq/bq_block_windows and views depending on QuantType
* Use tile windows as inputs in cshuffle epilogue
* Fix some issues in epilogue
* initial new example code for new general gemm quant kernel test
* Fix issues in kernel
* Add verification check for rowcol Quantmode
* use AccDataType instead of AQ in pipeline
* fix aquant preshuffle
* fix formatting
* some cleanup
* remove gemm_aquant_basic.cpp
* remove gemm_aquant_kernel.hpp
* fix tests for the renamed quant kernel
* fix formatting
* clean example files
* fix some merge conflicts
* fix preshufflequant rename issue
* fix some templates after merging with develop
* fix test preshuffle parameter
* fix formatting
* Unify bquant kernel to the common quant kernel
* remove bquant kernel also from common header
* fix formatting
* clean up commented code
* fix formatting config hpp
* fix merge mistake
* Non-const for movable windows
* fix formatting
* Fix grammar in README
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Remove #include<bit> and clean up example
* fix strides
* Add some descriptions for move_windows
---------
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Fix a typo
* Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations
* Use a unified run_gemm_example_prec_type for basic gemm and universal gemm
* Factor out run_gemm_example_prec_type
* Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp
* Parse arguments outside of create_args
* Move the gemm operators to separate structs to facilitate their reuse
* Move the invokers to separate files to facilitate their reuse
* Rename the invoker files for consistency with the examples that use them
* Add fp32 support to the elementwise examples, and produce an error message for unsupported types
* Get rid of four unused variables
* Make two variables const
* Add support for different input-output type combinations in elementwise examples
* Test support for different input and output types in elementwise examples
* Add support for different operations in the elementwise unary tests
* Add support for UnaryConvert in the elementwise unary tests
* Add support for bf16 in elementwise examples, excluding unsupported type combinations
* Make some operator parameters const in ElementWiseKernel
* Remove some unnecessary include statements
* Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel
* Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm
* Fix formatting issues reported by clang
* Add back CK_TILE_USE_WMMA related changes
* Use the right prec type for bf16 in the universal GEMM and two stage split K examples
* Add some brackets
* Add some brackets
* Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example
* Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example
* Fix formatting
* No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the kernel preprocessing
* Add fp16 data type to splitk two stage example
* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as its a third-party library
* Saparating function and defination.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
* Solve merge conflict
* [CK TILE] Update CHANGELOG.md
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
* feat(gemm_wp): add two new configs for wp
* delete the unnecessary files
* fix the config error
* update the config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com>
* General 2D Reduction Kernel
* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as
per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D refernce reduce
* Added support for N-D input with transform to 2D
* Added padding to support various input sized tensors
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel
* comments resolution
* clang-format
* comments resolution
* clang-format
* clang-format
* comments resolution
* clang-format
* enable the persistent kernel for CompV4
* polish the example and clang format
* fix the non-persistent kernel error
---------
Co-authored-by: ThomasNing <thomasning@amd.com>
* something khushbu can help with
* v1 v2 works with flatmm develop
* v0 v1 v2 numerical error gone
* Fixing numerical error, and interchange preshuffle configs to match with flatmm
* Refactor GEMM pipeline configurations and integrate preshuffle support
- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments
* clang format
* fix vectorloadsize bug
* add the Preshuffle3
* update kwarp calculation in gemm utils
* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch
* fix: add CK_GFX950_SUPPORT macro for gfx950 detection
* default disable rotating buffer
* docs(CHANGELOG): update changelog for rocm 7.0
* Revert "docs(CHANGELOG): update changelog for rocm 7.0"
This reverts commit 2bc16fff84.
* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.
* revert example/ck_tile/flatmm to its original state
* remove comment added by second author
* switch to xor ALDSDescriptor
* modify the MakeALdsDescriptor()
* temporary profiling script
* getting rid of line marker compiler error
* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy
* add a minor fix for the config
* typo fix
* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2
* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp
* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp
* reenable the GemmSpatiallyLocalTilePartitioner
* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline
* remove hardcoded true for preshuffle bool template argument
* rename script
* remove gemm_profilie.sh script
* merge conflict resolve
* clang formatted
* typo fix
* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp
* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy
* Fix missing newline at end of file in run_gemm_example.inc
* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1
* addressing review comments
* removing debug code
* addressing review comments
* Revert "addressing review comments"
This reverts commit 29c45192ba.
* updating tile_engine code
* addressing review comments
---------
Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>