* Add support for mixed precision in contraction scale and bilinear (#936)
* Extract common functionality to separate files
* Reference contraction: Remove incorrect consts from type_converts
* Reference contraction: Add missing type_convert for dst value
* Reference contraction: Fix incorrect order of B matrix dimensions
* Add support for mixed precision in contraction scale and bilinear
* Move using statements from instances to a common file
* Move using statements from examples to a common file
* Fix the order of B matrix dimensions across examples and profiler
* Fix the computation of error threshold
* Make ComputeDataType an optional argument
* Include possible DataType -> ComputeDataType casting error in the threshold
* Remove commented code
* Make the ComputeDataType an optional argument in instance
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Fix instances dtype check
* Fix source dtypes seletor for examples and tests
* Sync with new cmakefile changes
* Remove not needed ifdefs
* Remove not needed ifdefs
* save mean and inverse std in normalization
* Save mean and inverse std in splitK
* Vector save mean and inv std
* Modify instance for save mean and std
* simplify the layernorm example
* Save mean and std in groupnorm example
* Save mean and inv std in ckProfiler and test
* Remove compute data type from base class
* Save mean and inv std in client example
* Add changelog
* clang format
* Fix compile error
* Refine naming
* Avoid error in bf16
* revert changelog
* Introduce LocalBlockToCTileMap.
* Change the signature of CalculateBottomIndex() function which now does
not accept any argument. The B2C map which is already passed as an
argument to the kernel Run function is calculating block's local id
already outside at kernel entry point __global__ function.
The LocalB2C map stores as members local block ID.
* Use LocalBlockToCTile map in device ops.
* First draft of tile loop work distribution.
* Fix typo.
* Simplify kernel arguments.
Calculate descriptors & B2C maps on the device.
* Use looping kernel.
* Fix B2C constructor.
* Fix Navi21 errors.
* Calculate tile start/end in device kernel.
* Change Run API to accept user provided workspace buffer.
* Add new line at EOF.
* Move Gemm KernelArguments to device op interface.
* Remove unused code.
* Update API.
* Launch grid size which is min of occupancy vs tile count
* Get back to use constant memory for gemm descriptors.
* Remove unused code.
* Add default virtual method implementation.
* Update comments to conform with doxygen style.
* Fix doc style and unused parameters.
* Add thread cluster lengths to kernel name.
* Remove old splitk impl and replace it with tile looping one.
* Modify instances.
* set KPerBlock to 64
* maximize wherever possible vector load size.
* Fix instances cluster lengths.
* Change comment style.
* Use 128b store where possible in instances.
* Update test cases, since KPerBlock has doubled.
* Update output stream operator for Sequence.
* Add pipeline version to GroupedGEMM device op type string.
* Fix pipeline version type logging.
* Fix input tensors type after merge.
* Fix compiler error.
* Fix output stream operator for Pipeline version.
* Store using 128b.
* Set of instances with kpb 32/64
* Limit number of instances
* Remove commented out instances.
* Fix function name.
* Limit the number of instances.
Add pipline version to the regular instances
* Change thr cluster layout for reading B tensor.
* disabled failed instances
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
* Extract common functionality to separate files
* Reference contraction: Remove incorrect consts from type_converts
* Reference contraction: Add missing type_convert for dst value
* Reference contraction: Fix incorrect order of B matrix dimensions
* Add support for mixed precision in contraction scale and bilinear
* Move using statements from instances to a common file
* Move using statements from examples to a common file
* Fix the order of B matrix dimensions across examples and profiler
* Fix the computation of error threshold
* Make ComputeDataType an optional argument
* Include possible DataType -> ComputeDataType casting error in the threshold
* Remove commented code
* Added error check after kernel launch (#919)
Co-authored-by: Xiaodong Wang <xdwang@meta.com>
Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
* remove M=0 test cases for test_gemm_splitk
---------
Co-authored-by: Xiaodong Wang <xdwang@meta.com>
Co-authored-by: Xiaodong Wang <xw285@cornell.edu>
* Add column to image kernel
* Minor fixes for dtypes and client examples
* Disable tests for disabled dtypes
* Disable add instances functions for disabled data types
* Minor stylistic fixes
* Revert "Disable add instances functions for disabled data types"
This reverts commit 728b869563.
* Instances reduction
* Add comments in device_column_to_image_impl
* Update changelog and Copyrights
* Improve changelog
* split the types in gemm_bilinear instances, add condition to cmake policy
* fix syntax
* split the data types in batchnorm examples
* fix the batchnorm_bwd test
* fix types in the batchnorm_bwd test
* refactor cmake files for the tests
* refactor cmake files for examples
* fix cmake for gemm example
* fix the cmake file for all examples
* add splitting by data types in gemm_splitk instance header
* rename test to reflect only dl instances are used
* clean up CI workspace, update cmake for instances
* change the jenkinsfile syntax
* build all instances except DL on gfx11
* move workspace cleanup after stages
* clean up workspace after every stage
* isolate data types in grouped_conv_fwd header
* isolate dl instances for grouped_conv2d_fwd
* fix syntax
* fix cmake and batchnorm instances
* fix typo
* fix reduction instances
* fix grouped_conv headers
* fix syntax
* replace parsing logic for instances, replace bfp16 with bf16
* fix the client examples build
* clean up DTYPES from instances cmake files
* update the parsing logic in cmake files
* make an exception for reduction kernels
* update few remaining cmake files to handle DTYPES
* fix syntax
* fix cmake conflicts
* replace f8 with fp8 test name
* resolve conflicts for dpp instances
* Add image to column kernel
* Add instances, tests, profiler, example
* Add client example
* Several fixes of image to column
* Fix variable name in device_image_to_column_impl
* Several fixes of image to column profiler
* Fix num_btype calculation
* Make new mesaurements for correct bytes calculation
* Add maxpool instances
* Rename index pool to max pool.
* Add maxpool bwd bf16 instances
* Add avg pool bwd instances
* Rename avgpool and maxpool to avg_pool3d and max_pool
* Add bf16 pool fwd instances
* Add max pool bwd to ckProfiler
* Add avg pool3d bwd to ckProfiler
* Add avg pool bwd test
* Fix bug of reference pool fwd (dilation)
* Fix bug of max pool bwd (dilation and initZero)
* Support bf16 compute data type
* Force compute type be f32. Because atomicAdd only support f32
* Add max pool bwd test
* Rename folder
* Rename pool
* Add max pool bwd client example
* Add avg pool bwd client example
* Add missing workspace
* clang format
* Rename macro
* remove useless header
* remove useless layout
* experiment with config file
* experiment with version.h config
* add more info to version.h
* minor updates
* minor updates
* fix case where DTYPE is not used
* large amount of files but minor changes
* remove white space
* minor changes to add more MACROs
* fix cmakedefine01
* fix issue with CK internal conflict
* fix define and define value
* fix clang-format
* fix formatting issue
* experiment with cmake
* clang format v12 to be consistent with miopen
* avoid clang-format for config file
* Do not hardcode stride
* devicePool2DFwd Inherit devicePool3DFwd
* Move instance declaration out of common
* Add dilation
* use the pool3d rank, because pool2d inherit pooo3d
* calculate Do Ho Wo for the dilation
* Fix header name
* Modify ckProfiler
* Remove pool2d instance
* Remove pool2d in profiler
* Remove pool2d and add dilation
* In to client example, this commit revise following:
1. Add dilation.
2. Use pool3d to implement pool2d
* Refine naming and IsSupportedArgument()
* Add dilation to maxpool bwd example
* clang format
* 1. Remove useless header
2. Fix copyright
3. Refine naming
* Add layout parameter to pool fwd
* clang format
* Fix merge error
* Fix compile error
* Remove layout parameter in derived class
* Refine changlog
* Fix compile error
* Fix compiler error
* Add layout to external api and profiler
* Enable grouped conv with small K or C
* Add missing instances
* Refactor grouped conv fwd instances
* Fix fp16 instances since it supports src_per_vec %2 = 0
* Add generic instances
* properly split conv_nd_bwd_data instances
* split conv2d_fwd instance data types
* split the gemm, conv2d_fwd and batched_gemm_softamx_gemm
* split the tests by data types where possible
* filter examples by DTYPES
* split few remaining examples by DTYPES
* filter most instances by DTYPES
* add new lines at end of headers, fix grouped_gemm profiler
* fix syntax
* split the ckprofiler instances by DTYPES
* split the conv2d and quantization DL and XDL instances
* fix the splitting of conv2d DL instances
* split softmax and pool_fwd tests for fp16 and fp32 types
* fix syntax
* fix the dl_int8 quantization instances isolation
* Add wei_strides to grouped conv3d wei to keep consistency
* Fix strides in client examples
* Unify backward weight api with forward
* Fix for example
* Fixes for examples
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add s_nops after v_dot to avoid hazard
* Fix builtin for inner_produxt fp16
* Skip inline version to builtin
* Add comments regarding isa
* Fix comment regarding s_nop
* initial stream-k implementation with example
* fix unexpected change in err
* improve a little bit performance by reorganize pipeline.
* improve perf a little bit by swizzle block idx
* add profiler
* update example
* fix spelling
* shrink karg for streamk
* support dynamic buffer using memory coherence glc_slc bit from template
* control memory coherence while construct dynamic buffer
* update reduction for streamk(not ready yet)
* Add template parameter to make_dynamic_buffer to support amd_buffer coherence setting
* fix build issue
* fix several bug
* now result is correct, everything works (but has scratch)
* remove scratch by manually reset coordinate
* update device code
* fix a bug in final reduce
* fix something in example
* update async memset
* fix enum as camel case
* modify coherence enum name
* clean code and use atomic streamk by default
* remove unused var
* throw exception if have empty pointer
* fix format
* fix CI warning
* fix type in init
* modify CI error
* filter out on gfx10+
* restore changed example code
---------
Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
* allow building CK for specific data types
* add CI build and test stage on Naiv3x without some int8 instances
* add missing gemm fp16 instances
* add the changes to the missed cmake file
* add empty lines at end of source files
* Do not build quantization client example on navi3 in CI
* disable batched_gemm_multi_d_int8 instances with DTYPES
* disable device_conv2d_bwd_data_instance with DTYPES
* fix ckprofiler for conv_bwd_data for int8
* properly isolate the conv_bwd_data int8 instances
* remove empty line
* Add basic fp8 definitions and prn-generator
* Format
* Add fp8<->fp32 type_convert
* Format
* Split type_convert and cast_to/from_f8
* Format
* Minor fix
* Minor fix
* Move fp8 utils to a separate header
* Add elementwise ops
* Add fp8_convert_sr
* Format
* Add element op
* Eliminate magic numbers
* Split f8_convert_sr in host and device
* Format
* Add some constexpr
* Add a datatype test
* Format
* Another format
* Add fp8<->fp16 tests
* Update type_converts
* Format
* Add fp16 casting functions
* Format
* Use seed as a runtime arg
* Use element location for PRNG
* Format
* Add fp8<->fp16 to PassThrough element op
* Clean up
* Merge host and device implementations
* Add comments on rounding modes
* Remove leftover code
* Put type_converts into a separate header
* Put random number gen to a separate header
* Rearrange f8_utils' namespaces
* Refactor type_convert.hpp
* Move f8_t definition
* Add NumReduceDim template parameter to DeviceSoftmax and Softmax client API to simplify instances collecting
* Move the generic kernel instance to be the first of the instance list for elementwise op of normalization
* Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax
* Add testing of GetGenericInstance() in client_example of Softmax
* Revert "Add testing of GetGenericInstance() in client_example of Softmax"
This reverts commit f629cd9a93.
* Revert "Add GetGenericInstance() interface for DeviceOperationInstanceFactory class of DeviceSoftmax"
This reverts commit a9f0d000eb.
* Support generic kernel instance to be the first instance returned by GetInstances() for GroupNorm
* Move generic kernel instance to separate tuple for elementwise op of normalization
* Remove un-used files for softmax instance
* Store generic kernel instance to separate tuple for softmax
* Add IsSupported checking for generic instance to client example of softmax
* Replace the get_device_normalize_from_mean_meansquare_instances() by the DeviceOperationInstanceFactory class for elementwise-normalization
* clang-format fix
* Remove int8 from softmax instances
---------
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* enable gfx941/942 targets
* fix clang format
* fix the cmake logic for multiple targets
* fix cmake syntax for looping over targets
* add gfx941/942 support for gemm_xdl instances
* Add license header.
* Reduce number of logged output. Add constant initialization.
* Add functional tests for grouped_gemm with different kbatch value.
* Add debug log informations + remove unused code.
* Don't pass kbatch to CalculateKPadded.
* Turn on logging in grouped gemm and gemm splitk profiler
* Debug: limit number of test cases to run;
* Log more information and initialize with constant value.
* Turn on DEBUG_LOG
* Add more debug log informations.
* Limit the number of instances to compile.
* Use GridwiseGemmPipeline
* Use KBatch to calculate K0
* Multiple DebugLog messages.
* Unit tests for multiple KBatch values.
* Refactoring
* Disable logging
* extract out of if statement KBatch update.
* Uncomment instances.
* Disable DebugLog.
* Use Kbatch when calculate KPadded.
* Fix CGridDesc padding.
* Use available helper functions.
* Uncomment code commented for debuggin.
* Remove unnecessary debug log messages.
* Uncomment previously commented code for debug purposes.
* Add KBatch info to profiler output summary log.
* Add gtests for gemm splitk using ckProfiler API.
* Add more test-cases for different data layout.
* Add more test cases for gemm splitk
* Remove old test.
* Unit tests for MKNK ggemm interface.
* Fix and add more unit-tests.
* Constepxr everything!
* Increase error threshold for fp16 and splitk.
Since we're using fp16 atomic add for splitk there's a
known precision loss.
---------
Co-authored-by: Adam Osewski <aosewski@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Expand the base class of pool2d, prepare to share base class with pool3d
* Add pool3d device op
* Add pool3d f16 example
* Refactor the base class. implement generic pooling in the future
* clang format
* get original index in max pooling
* Add outputindex to base class
* Fix dimension
* Add pooling instance
* Use indexType instead
* Remove useless header
* Extract IndexDataType to template
* Extract pooling reference code
* clang format
* clang format
* Fix typo
* Add tensor stride
* Add missing header
* Add index stride and output stride
* Refine naming
* Add type to base class
* Rename file
* Use proper size
* Fix typo
* Refine naming
* Modify the argument into vector.
* Add max pool profiler
* Refine naming
* Support f32 pool
* Fix typo
* Add avg pool2d fwd in profiler
* clang format
* Rename AccDatatype to ComputeDatatype
* Fix init
* test pool
* Extract variable
* Add client example
* Check the pooling dim
* clang format
* Connect argv and arg_parser
* Add found check
* Remove useless header
* Refine naming
* Adjust the order of device_pool_fwd
* enable dl kernels on navi3
* do not build xdl tests and examples on Navi
* run tests before building everything on jenkins
* disable gemm_bilinear on gfx1030
* add gpu targets to installer on Navi
* put tests in the same order as before
* reduce the number of navi targets in CI
* build CI installed for gfx940 as well
* only build for MI300 during QA runs
* Add contraction profiler and tests
* Build and style fixes
* Allow to use any elementwise operator for ref_contraction
* Introduce profile_contraction_scale and profile_contraction_bilinear
* Make ref_contraction generic and extend interface tests
* Stylistic minor fixes
* Extend test_contraction_interface