* feat: grouped gemm tile loop support for RDNA4
* fix: removed extra parameter from grouped gemm example instance
* fix: FP8 check incorrectly enabling FP8 on RDNA3
* Implement grouped gemm fastgelu for RDNA4
* chore: some cleanup and minor inconsistencies in grouped gemm profiler
* chore: clarified logic and reporting of supported instance warnings
* wip: grouped_gemm implementation based on wmma kernel + example for fp16
* chore: clean up grouped_gem_wmma_splitk_fp16 example
* chore: add cmake options to fully disable XDL or WMMA kernels
* feat: add tests for grouped gemma wmma instances for f16 and bf16 (all layouts)
* chore: add grouped gemm wmma bf16 example
* refactor: reuse more code between instance factory functions
* chore: turn test failure if not all batch sizes are supported into a warning
* chore: made failing of test on unsupported instances conditional to not break old tests
* chore: add log message to failure case where AK1/BK1/KBatch is too high for K value
* fix: issue with new overloads of GridwiseGemm_wmma_cshuffle_v3::Run()
* fix: stray comma after parameter list
* fix: compilation issues on RDNA3 and tests failing due to unsupported problems still being ran
* chore: update copyright in header comments
* nit: minor feebdack
* refactor: unified XDL / wma tests
* fix: properly disable FP8 instances when ONLY targeting gfx11
* refactor: add v3 suffix to grouped_gemm device struct name
* fix: small typos in example code
* fix: fully exclude xdl/wmma instances when using the corresponding cmake flags
* chore: remove unused destructor and added pipeline support checks to remove unnecessary paths
* fix: make sure to not add instance library to group if library was skipped
* fix: make sure xdl grouped gemm doesnt fail the new test
* fix: explicitly exclude test if no xdl/wmma support, as pattern matching fails in this case
* fix: examples not working since dependent types and functions were moved to ck namespace in develop
* fix: tests failing when compiling for just gfx11 due to trying to run unsupported instances
* chore: replace/add copyright headers with new format
* [CK] Add command option instance_index and param_mask to run partial ck test
Many CK test are instance test. it will loop all instance in the instance library. It causes test often out-of-time if we run test on simulator/emulator.
This PR add option instance_index and param_mask to reduce the workload of instance test
instance_index: only run test 1 available instance with specified index.
param_mask: filter the embedded parameter with specified mask
* fix CI error
* fix clang format
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Few small fixes.
* New GroupedGemm instances (BF16)
* Unify and refactor GroupedGEMM device API.
* Adapt changes to new API.
* Adapt grouped gemm profiler.
* Accept multiple kbatches for grouped gemm profiler.
- delete obsolete two stage as it is now covered by grouped gemm
* Update unit test for grouped gemm.
* Fix thresholds for BF16 and F8. Unblock tests.
* Fix few instances.
* Multiple small fixes.
* Adapt to new API, check dynamic casting.
* Uncomment few data types in grouped gemm profiler.
* Fix call to SetDeviceArgs.
* Fix profile grouped gemm multiply tile loop.
* Fix grouped gemm tile loop kernel args in client examples.
* Review comments.
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only linnk test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* sinmplify the main cmake file
* add conv_fwd_bf8 instance declaration
* Introduce LocalBlockToCTileMap.
* Change the signature of CalculateBottomIndex() function which now does
not accept any argument. The B2C map which is already passed as an
argument to the kernel Run function is calculating block's local id
already outside at kernel entry point __global__ function.
The LocalB2C map stores as members local block ID.
* Use LocalBlockToCTile map in device ops.
* First draft of tile loop work distribution.
* Fix typo.
* Simplify kernel arguments.
Calculate descriptors & B2C maps on the device.
* Use looping kernel.
* Fix B2C constructor.
* Fix Navi21 errors.
* Calculate tile start/end in device kernel.
* Change Run API to accept user provided workspace buffer.
* Add new line at EOF.
* Move Gemm KernelArguments to device op interface.
* Remove unused code.
* Update API.
* Launch grid size which is min of occupancy vs tile count
* Get back to use constant memory for gemm descriptors.
* Remove unused code.
* Add default virtual method implementation.
* Update comments to conform with doxygen style.
* Fix doc style and unused parameters.
* Add thread cluster lengths to kernel name.
* Remove old splitk impl and replace it with tile looping one.
* Modify instances.
* set KPerBlock to 64
* maximize wherever possible vector load size.
* Fix instances cluster lengths.
* Change comment style.
* Use 128b store where possible in instances.
* Update test cases, since KPerBlock has doubled.
* Update output stream operator for Sequence.
* Add pipeline version to GroupedGEMM device op type string.
* Fix pipeline version type logging.
* Fix input tensors type after merge.
* Fix compiler error.
* Fix output stream operator for Pipeline version.
* Store using 128b.
* Set of instances with kpb 32/64
* Limit number of instances
* Remove commented out instances.
* Fix function name.
* Limit the number of instances.
Add pipline version to the regular instances
* Change thr cluster layout for reading B tensor.
* disabled failed instances
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
* enable gfx941/942 targets
* fix clang format
* fix the cmake logic for multiple targets
* fix cmake syntax for looping over targets
* add gfx941/942 support for gemm_xdl instances
* Add license header.
* Reduce number of logged output. Add constant initialization.
* Add functional tests for grouped_gemm with different kbatch value.
* Add debug log informations + remove unused code.
* Don't pass kbatch to CalculateKPadded.
* Turn on logging in grouped gemm and gemm splitk profiler
* Debug: limit number of test cases to run;
* Log more information and initialize with constant value.
* Turn on DEBUG_LOG
* Add more debug log informations.
* Limit the number of instances to compile.
* Use GridwiseGemmPipeline
* Use KBatch to calculate K0
* Multiple DebugLog messages.
* Unit tests for multiple KBatch values.
* Refactoring
* Disable logging
* extract out of if statement KBatch update.
* Uncomment instances.
* Disable DebugLog.
* Use Kbatch when calculate KPadded.
* Fix CGridDesc padding.
* Use available helper functions.
* Uncomment code commented for debuggin.
* Remove unnecessary debug log messages.
* Uncomment previously commented code for debug purposes.
* Add KBatch info to profiler output summary log.
* Add gtests for gemm splitk using ckProfiler API.
* Add more test-cases for different data layout.
* Add more test cases for gemm splitk
* Remove old test.
* Unit tests for MKNK ggemm interface.
* Fix and add more unit-tests.
* Constepxr everything!
* Increase error threshold for fp16 and splitk.
Since we're using fp16 atomic add for splitk there's a
known precision loss.
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* enable dl kernels on navi3
* do not build xdl tests and examples on Navi
* run tests before building everything on jenkins
* disable gemm_bilinear on gfx1030
* add gpu targets to installer on Navi
* put tests in the same order as before
* reduce the number of navi targets in CI
* build CI installed for gfx940 as well
* only build for MI300 during QA runs
* Re-structure ckProfiler source files
* Rename profiler.cpp to main.cpp
* Modularize ckProfiler operations
* Add description for profiler operations
* Use longer name to avoid name collision
* Use macro to delay expansion
* Use std::move() to avoid object copying
* Prohibit users from calling dtor
* Use macro to eliminate redundant code
* Make friend function hidden
* Add missing include directive <iostream>
* Fix wrong include directives
* Remove int8 from batchnorm-forward instances since it is not needed for forward training and could fail test
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* moved gemm_descs_args into const buff
* use CK_CONSTANT_ADDRESS_SPACE instead of global constant
* clean
* moved hipMemAlloc outside of deviceOp
* add SetWorkSpacePointer
* fix ignore
* add intrin_mfma_f64_16x16x4f64
* add example
* gemm reference add double data type
* chang init data
* fix M N PerXdlops
* fix ifdef
* add comparsion config
* add conv fwd example
* format log out
* change rc matrix egister layout
* reorganize example
* reorganize example 2
* format,because merge develop
* fix call impl adding acc data type
* lost ;
* add compiler warning
* change example tunning parameters
* add test for fp64
* add instance
* add test/gemm/gemm_fp64.cpp
* fix get name issue
* remove some tunning parameter
* fix conflict
* format
* use integer value for GEMM test
* add acc data type
* remove typeid because fp16
* fix streamconfig etc bug from merging develop
* format
* remove test_gemm_xdl_fp64
* add AccDataType
* AccDataType problem
Co-authored-by: qinletao <letaoqin@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* Turning compare warnings on
* Cleaning part I
* Cleaning part II
* Explicit static_cast to ck::type_convert
* Resolving large tensor size issue.
* format
* revert change to tensor descriptor; promote lementSpaceSize to 64bit
* use integer value for GEMM test
* Review remarks
* Review remarks + issues with (un)signed arithmetic
* Format fix
* Format
* Clang-format.
* fix 2gb limit issue
Co-authored-by: Chao Liu <chao.liu2@amd.com>
Co-authored-by: Adam Osewski <aosewski@amd.com>
* Convolution ND
* Code unification across dimensions for generating tensor descriptors.
* Example
* Instances
* Move convnd f32 instance file to comply with repo structure.
* Conv 1D tensor layouts.
* Formatting and use ReferenceConv
* Reference ConvFwd supporting 1D and 2D convolution.
* Debug printing TensorLayout name.
* Conv fwd 1D instance f32
* Refactor conv ND example.
Needed to support various conv dimensio.
Needed to support various conv dimensions
* Rename conv nd example director to prevent conflicts.
* Refactor some common utility to single file.
Plus some tests.
* Refactor GetHostTensorDescriptor + UT.
* Add 1D test case.
* Test reference convolution 1d/2d
* Remove some leftovers.
* Fix convolution example error for 1D
* Refactor test check errors utility function.
* Test Conv2D Fwd XDL
* More UT for 1D case.
* Parameterize input & weight initializers.
* Rename example to prevent conflicts.
* Split convnd instance into separate files for 1d/2d
* Address review comments.
* Fix data type for flops/gbytes calculations.
* Assign example number 11.
* 3D cases for convolution utility functions.
* 3D reference convolution.
* Add support for 3D convolution.
* Check for inputs bigger than 2GB.
* Formatting
* Support for bf16/f16/f32/i8 - conv instances + UT.
* Use check_err from test_util.hpp.
* Split convnd test into separate files for each dim.
* Fix data generation and use proper instances.
* Formatting
* Skip tensor initialization if not necessary.
* Fix CMakefiles.
* Remove redundant conv2d_fwd test.
* Lower problem size for conv3D UT.
* 3D case for convnd example.
* Remove leftovers after merge.
* Add Conv Specialization string to GetTypeString
* Skip instance causing numerical errors.
* Small fixes.
* Remove redundant includes.
* Fix namespace name error.
* Script for automatic testing and logging convolution fwd UTs
* Comment out numactl cmd.
* Refine weights initalization and relax rtol for fp16
* Move test_util.hpp to check_err.hpp
* Refine weights initalization and relax rtol for fp16
* Refactor common part of test conv utils.
* Move utility function to single common place.
* Add additional common functions to utility.
* Refactor convnd_fwd_xdl examples.
* Remove redundant files.
* Unify structure.
* Add constructor to ConvParams.
* And add input parameters validation.
* Modify conv examples to use single utility file.
* Remove check_error from host_tensor.hpp
* Get rid of check_indices function.
* Remove bf16_to_f32 function overload for scalars.
* Fix namespace.
* Add half_float::half for check_err.
* Fix conv params size in UT.
* Fix weights initialization for int8.
* Fix weights initialization for int8.
* Add type_convert when store output in ref conv 1D.
* Get back old conv2d_fwd_xdl operation.
* Silence conv debug print.
* format
* clean
* clean
* Fix merge.
* Fix namespace for check_err
* Formatting.
* Fix merge artifacts.
* Remove deleted header.
* Fix some includes and use ck::utils::check_err.
* Remove unused check_indices restored by previous merge.
* Fix namespaces after merge.
* Fix compilation error.
* Small fixes.
* Use common functions.
* Fix filename
* Fix namespaces.
* Fix merge artifact - retrieve removed by accident fun.
* Fix ConvForwardSpecialization.
* Adhere to coding style rules.
* Fix merge artifacts.
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: Chao Liu <chao.liu2@amd.com>
* init of grouped_gemm
* 2 gemm test
* perf test
* clean
* wrap desc into a struct
* test cast static_arr to pointer
* add ptr to GemmDesc
* add grouped gemm profiler
* fixed mem issue with unique_ptr
* clean
* clean
* finished ckprofiler
* Update README.md
* readme
* fixed readme
* add example
* improve code
* fixed comments: reserve, seperate ptr and gemm_shapes
* merge group and non-group
* fixed comments: replace push_back with emplace_back to avoid copy constructor
* fixed comments: unified blk2ctile; add test
* ci fix
* fixed ci
* fixed ci
* fixed ci