* initial commit for skeleton code
* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code
* fixed up code to match CK Tile convention: data type changes, naming changes, etc.
* change for num_sk_blocks data type
* formatting fix
* minor fixes
* moved reduction argument to template
* resolved comments from PR review: standardizing naming, pruning unneeded code
* resolve errors from merge of device op PR: moved enum to common file
* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems
* unsigned type fix
* add const qualifier
* added documentation for template parameters
* documentation edit
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as its a third-party library
* Saparating function and defination.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16
* Adding support for offload target gfx9-4-generic
* This duplication here isn't ideal
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
* Solve merge conflict
* [CK TILE] Update CHANGELOG.md
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
* CK Tile Stream K Device Ops
Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The
StreamKKernel class injects Universal Gemm and includes functions to
facilitate kernel preparation for the GPU.
* Stream K Device Ops Fixes
- Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to
ensure we get size needed for accumulation buffers and semaphores.
- Pass in num_sk_blocks into TilePartitioner constructor
- Update documentation
* Add WarpTile dimensions to GetName function in StreamKKernel class
* Fix typos in StreamKHostArgs class description.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Apply clang format on updated comment for StreamKHostArgs
* Explicitly specify type for StreamKReductionStrategy enum
* Remove unecessary scopes
* Unify the commenting style to inline comments
* Add explicit casts for occupancy and num_cu in MakeKernelArgs function
Both the static functions Occupancy and NumCU in the StreamKKernel class
use functions from the HIP API that result in the returned occupancy and
num_cu types being type int. The TilePartitioner interface for stream K will
have occupancy and num_cu being type ck_tile::index_t which is int32_t.
Thus, to be safe, this change ensures that both occupancy and num_cu are
cast to int32_t.
* Fix use of kentry due to interface update
PR #2594 updated the interface for the kentry function in
include/ck_tile/host/kernel_launch.hpp. As a result, the static function
Occupancy was updated to work correctly with the new interface.
PR #2594 also changed UniversalGemmKernel's KernelBlockSize static
variable to kBlockSize, so the StreamKKernel class was updated to
reflect this change.
* Switch type of num_sk_blocks from uint32_t to int32_t
This change switches the type of num_sk_blocks to type ck_tile::index_t
which is int32_t. This was done because parallel work for the CK Tile
StreamK TilePartitioner's constructor will have num_sk_blocks as
ck_tile::index_t. Thus, this change will help unify the interfaces to
avoid any type conversion errors.
---------
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Start adding other layouts for gemm_ab_scale
* Add some instances
* Create tensor descriptors for A/B scales depending on A/B layout
* Fix formatting
* Revert some comments
* Revert commented instances in CMakeLists.txt
* Add some more instances for col-row gemm
* enable more row,row instances
* Use occupancy=1 for col,row layout to avoid spills
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com>