* Add cshuffle epilogue test
* add the poc implementation to the epilogue and tests
* refactor cshuffle epilogue
* WIP: adding tensor/tile usage to scale_tile
* fix usage of tile_elementwise_inout
* add gemm_quant_kernel for generalizing gemm quant kernel
* Add problem specific to different quants, add QuantType to Traits
* Add quant_type to quant_kernel template parameters
* Create aq/bq_block_windows and views depending on QuantType
* Use tile windows as inputs in cshuffle epilogue
* Fix some issues in epilogue
* initial new example code for new general gemm quant kernel test
* Fix issues in kernel
* Add verification check for rowcol Quantmode
* use AccDataType instead of AQ in pipeline
* fix aquant preshuffle
* fix formatting
* some cleanup
* remove gemm_aquant_basic.cpp
* remove gemm_aquant_kernel.hpp
* fix tests for the renamed quant kernel
* fix formatting
* clean example files
* fix some merge conflicts
* fix preshufflequant rename issue
* fix some templates after merging with develop
* fix test preshuffle parameter
* fix formatting
* Unify bquant kernel to the common quant kernel
* remove bquant kernel also from common header
* fix formatting
* clean up commented code
* fix formatting config hpp
* fix merge mistake
* Non-const for movable windows
* fix formatting
* Fix grammar in README
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Remove #include<bit> and clean up example
* fix strides
* Add some descriptions for move_windows
---------
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>
* Create new copies of existing device struct and gridwise struct for batched_gemm_softmax_gemm and disable the softmax part. Still based on old wmma pipelines. Also copy the example and remove the softmax part from the reference calculation. Works and results match reference except for tiny float errors in problem 2.
* Turn DeviceBatchedGemmGemm_Wmma_CShuffleV3 into a proper DeviceBatchedGemmGemm derived class, with the right argument and invoker functions. Update example to use new definitions.
* Remove unused cross-attention and self-attention kernels, arguments, and invokers. Also remove other unused Argument types.
* Remove masking related code, test unusual sizes in example.
* Remove remaining softmax related code from GridwiseBatchedGemmGemm_wmma_cshuffle_v3 and example.
* Remove code related to numDims, bias, and TensorSpec from Device struct and example.
* Add layout template parameters to device struct
* Move (NPerBlock, LTilePerBlock) device struct template arguments up by two places to match XDL template argument ordering.
* Merge accumulation data types into one type to match XDL device struct.
* Remove NPerWmma template parameter from device struct and just set it equal to LPerWmma. Now device struct template params exactly match those for XDL batched gemm gemm.
* Add support for RCCR layout and test this in example
* Add batched_gemm_gemm_wmma to instance library + profiler, and add gtest just like for xdl.
* Add RCCR instance and additional RCRR instance to library.
* Remove unused permute and alpha related code. Time all tests. Fix B1 strides in argument verification.
* Remove references to G0, G1 in favor of batch, reduce dimensionality of length and stride arrays.
* Managed to replace old wmma gridwise pipeline and blockwise struct with new wmma blockwise pipeline. Some cleanup required but all tests pass.
* Make TransposeC a proper template parameter that gets passed all the way from BlockGemmPipeline_Selector to WmmaGemm so we can use the correct settings for bacthed gemm gemm as well as regular gemm. Gemm universal tests now pass again.
* Replace old LoopSched and PipelineVer params with BlockwiseGemm pipeline equivalents, and use these in instance factory. The v3 pipeline does not work yet, but v1 works for intrawave and interwave.
* Adapt the A wave descriptor to deal with RDNA4 wmma. This fixes batched gemm gemm functionality on RDNA4.
* Fixed two aspects of the v3 pipeline that were incorrect: First of all the blockwise copy operator was invoked once too many in all cases (RunRead and move window), which broke batched gemm gemm when the blockwise pipeline was used multiple times. Furthermore we should be using the mainloop (hotloop) for num_k_loop >=2 instead of num_k_loop >=3. Now we can use support any K dimension.
* Remove num prefetch parameter from gridwise struct since we don't use it and it doesn't do anything,
* Remove unused non-lds paths.
* Test and update the IsSupportedArgument() and CheckValidity() functions for all layouts + padding modes and various problem sizes.
* Add a lot of instances to the profiler with various blocksizes and pipelines, all verified.
* Add support for BF16: instance library, tests, and examples.
* Add examples for int8 and fp8, had to add type_convert_sp template specializations for the latter.
* Template the library instance lists and add default padding instances.
* Move memory calculations from the kernel to the Argument contructor. Also actually parse and use the user-provided batch strides.
* Actually parse and use user-provided regular strides.
* More refactor: remove references to multiple dims per dims, and g0 / g1. Also move xdl specific test utils out of generic test util header.
* Small post-rebase-on-develop fix due to bscale-related pipeline changes. All tests rerun + tested bscale and regular gemm.
* Introduce the correct GetCThreadDescriptor function in the blockwise gemm pipelines for the TransposeC=true case. It turns out to be identical for our batched gemm gemm (gemm0) usecases, but could theoretically be different for wmma_gemm instances with smaller-than-4-byte output data size.
* Remove unused NumPrefetch template parameter, we don't need to match the XDL template params one-to-one.
* Implement proper TailNum and HasMainLoop template parameters for the v3 pipeline. Now the Run() function knows at compile time whether there are 1, 2, or more loops in total, and adds or removes sections accordingly. It still uses the blockwise copy operators the correct amount of times.
* Add print lambda with env check and file and func to device and gridwise level compatibility error messages. Also respect compatibility in example script.
* RDNA3 does not support fp8
* Fix a typo
* Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations
* Use a unified run_gemm_example_prec_type for basic gemm and universal gemm
* Factor out run_gemm_example_prec_type
* Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp
* Parse arguments outside of create_args
* Move the gemm operators to separate structs to facilitate their reuse
* Move the invokers to separate files to facilitate their reuse
* Rename the invoker files for consistency with the examples that use them
* Add fp32 support to the elementwise examples, and produce an error message for unsupported types
* Get rid of four unused variables
* Make two variables const
* Add support for different input-output type combinations in elementwise examples
* Test support for different input and output types in elementwise examples
* Add support for different operations in the elementwise unary tests
* Add support for UnaryConvert in the elementwise unary tests
* Add support for bf16 in elementwise examples, excluding unsupported type combinations
* Make some operator parameters const in ElementWiseKernel
* Remove some unnecessary include statements
* Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel
* Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm
* Fix formatting issues reported by clang
* Add back CK_TILE_USE_WMMA related changes
* Use the right prec type for bf16 in the universal GEMM and two stage split K examples
* Add some brackets
* Add some brackets
* Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example
* Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example
* Fix formatting
* No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the kernel preprocessing
* Add fp16 data type to splitk two stage example
* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as its a third-party library
* Saparating function and defination.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
* Solve merge conflict
* [CK TILE] Update CHANGELOG.md
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
* feat(gemm_wp): add two new configs for wp
* delete the unnecessary files
* fix the config error
* update the config
---------
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com>
* General 2D Reduction Kernel
* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as
per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D refernce reduce
* Added support for N-D input with transform to 2D
* Added padding to support various input sized tensors
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel
* comments resolution
* clang-format
* comments resolution
* clang-format
* clang-format
* comments resolution
* clang-format