* HasHotLoop is a constexpr
* Remove an unused function
* Remove some unused include statements
* Add implementation and tests for fp8 x bf8 weight preshuffle GEMM
* Add implementation and tests for fp8 x bf8 in CK Tile basic and universal GEMMs
* Remove two barrier calls that HotLoopScheduler already calls
* No need to suppress a variable that hasn't been declared
* Replace six arg_parser arguments with constexpr literals
* Simplify run_gemm_test_prec_type
* The strides don't need to be passed via arg_parser as we use their default values
* The layouts don't need to be passed as arguments twice
* Pass M N and K as regular arguments, not using the argument parser
* We can now remove the argument parser
* Add a common file for precision types to be used in testing
* Convert basic and universal GEMM tests to use gtest
* Make GemmConfig a test parameter, and form test cases as the cartesian product GemmConfigs x PrecTypes
* Add GemmConfigComputeV4 to the GEMM configs to run the universal tests on
* Added a changelog entry
* Add missing copyright statements
* ifndef-define-endif is not needed with pragma once
* Fix a comment
* Add F8 x BF8 tests for CompV4 in test_gemm_pipeline_kernel_types.hpp
* Disable the unreliable test MoeSortingCase4
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
* Implement argument passing to element-wise functions for fwd convolution
* Add files for fwd + bias + clamp example
* Implement Bias
* Implement Clamp
* Elementwise function composition
* Composition unit test
* Implement fwd + bias + clamp example
* Simplify argument passing and composition
* elfunc -> bias_and_clamp
* Rename function to specify example
* Move element-wise function instantiation to kernel
* Make bias a runtime tensor
* No ugly namespace aliasing
* Initialize element-wise function on host
* Remove function initialization helper, simplify Compose initialization
* Remove unintended LSP compatibility patch
* Clean up includes and unused code
* Switch names in cshuffle epilogue
* Move CDElementwise to conv traits
* Re-add required include
* Initialize bias in same way as other tensors
* Better type specification for ds pointer
* Disable 1D convolution
* Add warning for non-group-constant bias
* Add name member to unary elementwise ops.
* Update elementwise_op_name to check for name attribute.
* Require that the layout is derived from BaseTensorLayout struct.
* Change the return type of run_gemm_combinations in the basic tests
* Change the return type of run_gemm_combinations in the universal tests
* Add universal GEMM tests for bf16 x pk_i4 and fp16 x pk_i4
* Add universal GEMM test for fp8 x pk_i4
* Add basic GEMM tests for bf16 x pk_i4, fp16 x pk_i4 and fp8 x pk_i4.
* Add missing GemmTypeConfig<ck_tile::fp8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Add missing GemmTypeConfig<ck_tile::bf16_t, ck_tile::pk_int4_t, ck_tile::bf16_t>
* No need for utility in test_ck_tile_elementwise_1d
* Fix conversion from pk_int4x4_t to bf16x8_t in PassThroughPack8
* Avoid union-based type punning in float_to_bf16_truc_raw to make it constexpr compliant
* For consistency also make float_to_bf16_truc_nan_raw constexpr compliant by removing the union
* Use a static_cast to bfloat16_t only when CK_TILE_USE_LLVM_BUILTIN_BF16 is enforced
* Convert from float to bf16 during compilation rather than using magic values
* Fix conversion from pk_int4x4_t to fp8x8_t in PassThroughPack8
* Comment out the basic test for fp16 x pk_i4 as it does not pass
* Add missing GemmTypeConfig<ck_tile::bf8_t, ck_tile::pk_int4_t, ck_tile::half_t>
* Fix conversion from pk_int4x4_t to bf8x8_t in PassThroughPack8
* Add basic and universal GEMM tests for bf8 x pk_i4
* Switch back to amd_assembly_i4_to_fp8x8 in PassThroughPack8 as it works now
* Switch back to amd_assembly_i4_to_bf8x8 in PassThroughPack8 as it works now
* Remove the inefficient fallbacks for fp8 and bf8 in elementwise/unary_element_wise_operation.hpp
* Use explicit macros for enabling and disabling the the constexpr lookup based converters
* Fix two failing tests
* Avoid union-based type punning in float_to_bf16_rtn_raw to make it constexpr compliant
* Use float_to_bf16_rtn_raw instead of float_to_bf16 to create the bf16 lookup table for use in conversions from pk_int4 to bf16
* On ROCm 7.0.1 we need an explicit cast to from uint16_t to bf16_t
1. Refine Reduce2dShape to support both wave32 and wave64
2. Fix example reduce, permute and elementwise on gfx11 and gfx12
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
- Add support for tensor A/B in both fp16+pk_int4_t and fp8+pk_int4_t formats
- Implement A(bf8) B(i4) support in universal GEMM
- Use new implementation for i4 to fp8 conversion in Block Scale
* Fix a typo
* Use std::variant to call run_gemm_example_with_layouts with the available layout variant combinations
* Use a unified run_gemm_example_prec_type for basic gemm and universal gemm
* Factor out run_gemm_example_prec_type
* Refactor argument parsing in gemm_splitk_two_stage_reduce.cpp
* Parse arguments outside of create_args
* Move the gemm operators to separate structs to facilitate their reuse
* Move the invokers to separate files to facilitate their reuse
* Rename the invoker files for consistency with the examples that use them
* Add fp32 support to the elementwise examples, and produce an error message for unsupported types
* Get rid of four unused variables
* Make two variables const
* Add support for different input-output type combinations in elementwise examples
* Test support for different input and output types in elementwise examples
* Add support for different operations in the elementwise unary tests
* Add support for UnaryConvert in the elementwise unary tests
* Add support for bf16 in elementwise examples, excluding unsupported type combinations
* Make some operator parameters const in ElementWiseKernel
* Remove some unnecessary include statements
* Implement a two-stage GEMM that does a type conversion in the second stage using the elementwise kernel
* Clear workspace instead of output when flushing the cache in SplitKTwoStageInvoker::gemm
* Fix formatting issues reported by clang
* Add back CK_TILE_USE_WMMA related changes
* Use the right prec type for bf16 in the universal GEMM and two stage split K examples
* Add some brackets
* Add some brackets
* Separate the clearing of the GEMM output memory from the cache flushing in the universal GEMM example
* Separate the clearing of the GEMM output memory from the cache flushing in the split K two stage example
* Fix formatting
* No need to call SetZero on ws_m_n_dev_buf here, as clear_gemm_output now does this as part of the kernel preprocessing
* Add fp16 data type to splitk two stage example
* Add preprocessing with optional cache flushing and clearing of output for k_batch > 1 to the basic GEMM example
* Readme for GEMM Multi D
* GEMM Multi D partial Progress
* GEMM Multi D partial Progress!
* CK Tile Engine GEMM Multi D : All Python files generated
* Partial Progress
* Partial Progress
* Partial Progress
* Partial Progress : Incorrect Result
* Partial Progress : Debugging
* Partial Progress : Correct Results
* Partial Progress - Incorrect Results
* Partial Progress - Commenting Passthrough bypass logic
* Changing Passthrough to MultiplyMultiply
* Correct Results!
* Fix and debug the pass through feature
* Sample commit
* Correct Results : MultiplyMultiply
* Code Cleanup
* Removing Failed Instances
* Working code before Unary element support
* Custom Elementwise Function support and working implementation for Mul and Add
* Updating README
* Working for Passthrough
* Review Comments : Minor Fixes
* Review Comments : Minor Fixes
* Readme Updated
* Partial Changes after Rebase
* Working Code : Changes after Rebase
* Updating Jenkins file
* Removing default value changed while testing
* Configuration changes in config files
* Tile Handler changes in GEMM Multi D Tile Engine
* Tile Handler changes in GEMM Multi D Example
* Change log for Gemm Multi D in CK Tile Engine
* Configuration changes in config files
---------
Co-authored-by: ThomasNing <thomasning@amd.com>
* Elementwise kernel implementation
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>
* Elementwise with generalized nDims
* Adding the n-ary input tensor feature
* Generalize dimensions on top of inputs
* Add TFLOPS + remove std usage for tuples
* 1D basecase optimization
* Cleanup code + refactoring to a common interface
* Generalize to unary and add an example
* Cleanup, refactoring and commenting
* Suggestions for LWPCK-3170: elementwise kernel improvements
* Clang-format: remod.py
* Replace InputTensorType with XDataType as the type of input_tensors
* Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs
* Move examples to folder 19_elementwise
* Add missing copyright headers and fix some existing ones
* Replace an assert with throw std::runtime_error in elementwise example
* Avoid reading the output by using make_static_distributed_tensor for y_tile
* Removed two unused includes
* No need to move windows to the next block when each workgroup processes a single tile
* Only copy input tensors to the device
* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example
* Adding output strides to the kernel, transposition example and update the other examples
* Changes made by remod.py
* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view
* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp
* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference
* Fix comments in elementwise_example.cpp
- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors
* Simplify CMakeLists.txt and remove the unused variables this uncovers
* Rename a file and fix some copyright statements
* Changes made by script/clang-format-overwrite.sh
* Add basic unit test for ElementWiseKernel
* Remove left-over uninformative comment in apply unit test
* Changes made by clang-format-overwrite.sh
* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view
* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp
* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array
* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally
* Move examples to folder 20_elementwise
* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK
* Fix CLang formating
* Bump up the elementwise example folder number
* Elementwise: add padding + minor cleanup
* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view
* Add isSupportedArg to Elementwise kernel + addapt example and unit tests
* Fix clang-format on the unit test file
---------
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* ck_tile kernel for gemm with groupwise quantized A or B tensor.
This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.
Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS.
Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.
1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32
Group size can go down to as low as K length of underlying WarpGemm primitive.
For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers.
* [Block Scale Gemm] Only run gemm quant examples on __gfx94__
- Only run gemm quant examples on __gfx94__ for usage of
`v_cvt_pk_fp8_f32`
- Format the code
* [Block Scale Gemm] Remove Bquant Gemm BlockScale
This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.
* [Block Scale Gemm] Format code with clang-format-12
The latest clang-format (v19) in ROCm 7.0 generate different result than
clang-format-12 which is used in CK CI.
Format code with clang-format-12 for consistency.
* [Block Scale Gemm] Split the k direction loop
- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
to make the logic clearer.
- Disable C transposition.
* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm
* [Block Scale Gemm] Update copyright
* test
* Add TailHandler
* Move TileDistributionEncodingPatternAQ
* Refactor
* refactor
* fix bug
* fix bug
* help solve the PR comment
* Format the code
* [Block Scale Gemm] Add unit tests
* [Block Scale Gemm] Add support to 16x16x32 MFMA
- Add support to 16x16x32 MFMA
- Fix a bug when exchange data crossing lanes
---------
Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* Multiple d, initial commit
* Check Ds Layout
* Readme and clang format
* Update branch & conflicts
* Multiple D - fix clang-formatter
* Rename elemetwise_op
* Fix CI
* Code review part1
* Remove printf
* Remove unnecessary comment
* Add new tests with Col layout
* Review part 2
* Added support for Multiple D GEMM
* Update comment
* Remove maybe_unused
* Clang-format
* Review part 3
* Add comment to function
* Add comment to function: another
* Take number of params for a refrence function
* Remove additional d param for 0 tensor
* Change name of function
* Fix CI fails