* Adding fix for the gfx908 to the GEMM MFMA implementaitons of WarpGemmMfmaBf16Bf16F32M4N64K16 WarpGemmMfmaBf16Bf16F32M64N4K16
* Adding support for offload target gfx9-4-generic
* This duplication here isn't ideal
* This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
* Solve merge conflict
* [CK TILE] Update CHANGELOG.md
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
The performance of Aquant has increased after enabling transposed C.
Do not need to exchange AQ elements among lanes after enabling
transposed C as one thread only holds data from one row.
* feat(check_err): add a variable to adjust number of incorrect values to print
* feat(host_tensor): add printing capability for fp8 bf8 int8 int4
* fix(gemm_utils): update acceptable data type
* fix(host_tensor): print both 4 bit ints in pk_int4_t
* refactor(HostTensor): define pk_int4_t_to_int8x2_t and fix typo in vector_type.hpp
* feat(host_tensor): add print first n elements functions
* [CK TILE] Fix bugs in AQuant preshuffle
- Make Aquant works with block Mx64x256. `M` could be 16, 32, 64
- Make Aquant works with warp 16x16x32 and 32x32x16.
* [CK TILE] Rename Preshuffle to PreshuffleQuant
The new name, PreshuffleQuant, explicitly states the function's purpose:
to preshuffle the quantization matrix.
* [CK TILE Block Scale] Use GemmConfig to save tile properties
- Remove specialization of GemmQuantTypeConfig
- Pass GemmConfig around which contains tile properties. Stop using hard
coded tile properties in `gemm_calc_aquant()`
* [CK TILE Block Scale] Rename GemmConfig used in block scale
- Remove unused GemmConfig
- Rename GemmConfig used in block scale
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Downstream libraries aren't migrated to c++20 yet, so replace a use of c++20 concept with equivalent SFINAE logic. The template checks for both the existence and the truthiness of the static member variable.
* base working version for single groupped conv bwd data
* Fix 2d descriptor
* fix groups
* Add 3d support
* fixes
* fixes
* fixes
---------
Co-authored-by: Jakub Piasecki <jakpia21@gmail.com>
* CK Tile Stream K Device Ops
Implementation of CK Tile StreamKHostArgs and StreamKKernel classes. The
StreamKKernel class injects Universal Gemm and includes functions to
facilitate kernel preparation for the GPU.
* Stream K Device Ops Fixes
- Update GetWorkSpaceSize to call TilePartitioner's GetWorkSpaceSize to
ensure we get size needed for accumulation buffers and semaphores.
- Pass in num_sk_blocks into TilePartitioner constructor
- Update documentation
* Add WarpTile dimensions to GetName function in StreamKKernel class
* Fix typos in StreamKHostArgs class description.
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Apply clang format on updated comment for StreamKHostArgs
* Explicitly specify type for StreamKReductionStrategy enum
* Remove unecessary scopes
* Unify the commenting style to inline comments
* Add explicit casts for occupancy and num_cu in MakeKernelArgs function
Both the static functions Occupancy and NumCU in the StreamKKernel class
use functions from the HIP API that result in the returned occupancy and
num_cu types being type int. The TilePartitioner interface for stream K will
have occupancy and num_cu being type ck_tile::index_t which is int32_t.
Thus, to be safe, this change ensures that both occupancy and num_cu are
cast to int32_t.
* Fix use of kentry due to interface update
PR #2594 updated the interface for the kentry function in
include/ck_tile/host/kernel_launch.hpp. As a result, the static function
Occupancy was updated to work correctly with the new interface.
PR #2594 also changed UniversalGemmKernel's KernelBlockSize static
variable to kBlockSize, so the StreamKKernel class was updated to
reflect this change.
* Switch type of num_sk_blocks from uint32_t to int32_t
This change switches the type of num_sk_blocks to type ck_tile::index_t
which is int32_t. This was done because parallel work for the CK Tile
StreamK TilePartitioner's constructor will have num_sk_blocks as
ck_tile::index_t. Thus, this change will help unify the interfaces to
avoid any type conversion errors.
---------
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Rename vector to ThreadTile
* more notes on tile encoding
* remove number<> from tuple of make_tile_window
* add script to stress test the copy example
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp
* Simplify two templated operator calls by having the templated types deduced automatically
* Simplify two GemmPipeline calls
* Fix GemmPipelineAgBgCrCompV4::GetName
* Refactor use of ArgParser in CK tile GEMM examples
* Update args in README.md to match the implementation in create_args
* Remove some unnecessary include statements
* Rename two variables
* Factor out common code
* Factor out do_verify
* Add and use type aliases for memory operation integral constants
* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Preshuffle AQ matrix in block scale gemm
* turns the output to fp16. Increase the repetition time.
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com>
* add failing tests
* swap out and reference
* add constraint assert to transpose input distribution
* test both pipelines with rectangular block tile
* print mismatched indices
* add a smaller failing test for old pipeline
* print grid and block
* fill output before operating on it
* swap m/n tile sizes and make one test pass
* add device syncs
* add one more flipped test case
* flip block tile at host arg init
* fix tiles for lds pipeline
* clang-format
* rename tests
* roll back error check
* remove device syncs
* reduce large test case's size
* Add more printing to core cktile
* Revert other changes in static encoding pattern
* Refactor to using a free print() function
* Remove loops and print just the containers
* Print tuple with better formatting, fix sequence compilation
* Add some tests for print utility
* Add print utility header
* Print for static_encoding_pattern
* add buffer_view printing
* Align vector_traits
* Fix formatting
* Lower-case enum strings
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
* Remove empty comment lines
* Fix test with lower-case too
* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y
* Add test_print_common.hpp
* add print.hpp in core.hpp
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* General 2D Reduction Kernel
* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as
per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D refernce reduce
* Added support for N-D input with transform to 2D
* Added padding to support various input sized tensors
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel
* comments resolution
* clang-format
* comments resolution
* clang-format
* clang-format
* comments resolution
* clang-format
* enable the persistent kernel for CompV4
* polish the example and clang format
* fix the non-persistent kernel error
---------
Co-authored-by: ThomasNing <thomasning@amd.com>