* add ab_scale init support
* enabled interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
---------
Co-authored-by: root <jizhan@amd.com>
* init for reduce_threadwise multi_d
* add reduce_threadwise_multi_d
* add reduce_multi_d
* clean
* start add an other splitk device op
* add reduce template parameter to SplitKBatchOffset
* add reduce c matrix
* clean up code
* change example data type to bf16
* add bf16Ai8B example
* remove reduce template parameter
* add splitk atomic status to v4
* example add multi d parameters
* device op add multi-d parameters
* add multi-d to reduce
* fix kbach=1 bug
* change B layout to col in bf16Ai8B example
* remove float adding struct
* change multi-d interface
* change file and class name
* remove multi-d of bf16Ai8B example
* change IsReduce function to IsReduceAdd
* change example layout to RRR from RCR
* according layout to set ds stride
* reset parameter layout
* add gemm universal reduce instance
* add reduce factory
* add profile_gemm_universal_reduce
* add reduce to profiler
* fix reduce instance
* fix profiler reduce compiling bug
* format
* format library instance code
* add mem instance for reduce library
* fix call instance names
* add workspace for reduce in ckProfiler
* format
* add mnpading to reduce library instance
* add fp16 instance to reduce of profiler
* change copyright time
* restore profiler cmake file
* add reduce text to instances
* add DsLayout and DsDataType to instances template parameter
* fixed gemm_reduce_multi_d
* add an example without multi_d
* Update common.hpp
* Update gtest.cmake
* Update gemm_xdl_splitk_reduce_bf16.cpp
* clean
* Update gtest.cmake
* format
* fixe api
* format
* default parameter change to RRR
* add vector_len for multi_d
* format
* Update gtest.cmake
* fix bf16A iBB elementwiseop
* add ReduceDataType
* move ReduceDataType to end position
* format
* remove googletest git method address
* fix copyright time
* update init data
---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* Add CMakePresets configurations.
* Add ConvScale+ReLU Functor and an Example
* Account for ReLU FLOPs.
* Add instances of 3D convolutions with ConvscaleRelu operation.
* Implement Client Example
* Cleanup
* add ck_tile tests to CI
* build and run ck_tile tests on gfx90a and gfx942 in parallel
* fix groovy syntax
* turn ck_tile tests OFF by default
* skip creating the build folder
* build ck_tile examples with 64 threads
* build ck_tile examples with cmake-ck-dev.sh script
* add video group to docker on mi300
* do not retry to rebuild the early CI stages
* help prevent jenkins false failure
* restore cron trigger
* wa prec, remove sgpr offset for inline asm
* macro for set tile
* ignore unused param if no kernel instances in host API
* fix more prec issue
* cache buffer resource
* fix
* support pre-nop
* clear tile by vector type members
* add workaround to reduce scratch memory
* conditionally enable workaround code
* enable workaround start from certain build version
* fallback set_tile() implementation from certain build version
* undo template argument changes
* put dummy asm in load_raw()
* fix comments, refactor s_nop inside buffer_load
---------
Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile).
* Update README.md
* fixing clang-format issues
* removed conflicts in struct members between streamk and universal streamk
* corrected arg parsing for streamk and universal streamk
* added stream-k policies for 3 tile and 4 tile
* fixed argument type issue with parsing cmd args
* changes suggested in PR review are made- removing comments and correcting copyright
* file permissions updated
* added default value support for grid_size and streamk-policy selection set to -1
* print messages for arguments
* print messages for arguments
* print messages for arguments1
* Add layernorm2d forward
* Refind file path
* clang format
* Exclude ck_tile op from all
* use add_executable instead
* refactor layernorm2d_fwd example
---------
Co-authored-by: carlushuang <carlus.huang@amd.com>
We are adding more instances of grouped convolution 3d forward with a ConvScale element-wise operation.
This commit handles bf8@bf8->fp8 data types combination.
* Included an example.
* Added instances.
* Added a client example.
---------
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Update the element op
* Add an example
* Add instances
* Add a client example
* make sure new instances only build on gfx9
* Update element op and its handling
* Format
* Update instances to take element op as an argument
* Update examples to use random scale values
* Format
* Update client example with random scales
* Format
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
* Add a scale op
* Update the element op
* Add instances
* Add an example
* Add a client example
* Add a flag check
* Revert flag check addition
* Fix flag check
* Update d strides in example
* Update d strides in client example
* Apply suggestions from code review
Update copyright header
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Move the example
* Move the client example
* Update element op
* Update example with the new element op
* Add scalar layout
* Update example
* Update kernel for scalar Ds
* Revert kernel changes
* Update element op
* Update example to use scales' pointers
* Format
* Update instances
* Update client example
* Move element op to unary elements
* Update element op to work with values instead of pointers
* Update instances to take element op as an argument
* Update examples to use random scale values
---------
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
* post-merge fix
* fix
* reduce init range
* set individual gpu targets for instances, examples, tests
* fix path to hip compiler
* fix path to hip compiler once more
* aggregate device macros in ck_tile config header
* fix the cmake logic for instances
* fix clang format
* add gfx900 and gfx906 to default set of targets
* add alibi support
* fix code
* update code based on comment
* Support more hdim
* fix fp8 bias
* support seqlen_k=0 case
* remove unused printf
* fix format
---------
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* Add bf16 instances
* Add bf16 gemm universal example
* tempsave
* Add guard to navi compilation
* workground on a specific mixed gemm instance ( bring back it when compiler fix upload)
* fix formatting condition statement issue
* solve conflict
---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
* Overload output stream operator for LoopScheduler and PiplineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in exmaple.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity untill we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfers paramters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
* Refactor elementwise kernels
* Instances fixes
* Fix cmake
* Fix max pool bwd test
* Update two stage gemm split k
* Restore elementwise scale for hiptensor backward compatiblity
* Fix Acc data type check in conv fwd multiple abd
* Disable conv fp64 fwd example
* Update grouped conv weight multi d
* enable gfx940
* switch between intrinsic mfma routines on mi100/200 and mi300
* fix mfma_int8 on MI300
* disable 2 int8 examples on MI300
* Update cmake-ck-dev.sh
* restore gitignore file
* modify Jenkinsfile to the internal repo
* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)
---
updated-dependencies:
- dependency-name: rocm-docs-core
dependency-type: direct:production
update-type: version-update:semver-minor
...
Signed-off-by: dependabot[bot] <support@github.com>
* initial enablement of gfx950
* fix clang format
* disable examples 31 and 41 int8 on gfx950
* add code
* fix build wip
* fix xx
* now can build
* naming
* minor fix
* wip fix
* fix macro for exp2; fix warpgemm a/b in transposedC
* unify as tuple_array
* Update the required Python version to 3.9
* Update executable name in test scripts
* re-structure tuple/array to avoid spill
* Merge function templates
* Fix format
* Add constraint to array<> ctor
* Re-use function
* Some minor changes
* remove wrong code in store_raw()
* fix compile issue in transpose
* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'
* let more integral_constant->constant, and formating
* make sure thread_buffer can be tuple/array
* temp fix buffer_store spill
* not using custom data type by default, now we can have ISA-level same code as opt_padding
* fix compile error, fp8 not ready now
* fix fp8 duplicated move/shift/and/or problem
* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode
* fix scratch in fp8 kernel
* update some readme
* fix merge from upstream
* sync with upstream
* sync upstream again
* sync 22
* remove unused
* fix clang-format
* update README of ck_tile example
* fix several issue
* let python version to be 3.8 as minimal
* remove ck_tile example from default cmake target like all/install/check
* remove mistake
* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg
* fix some bug in group-mode masking and codegen. update README
* F8 quantization for FMHA forward (#1224)
* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline
* Add element function to fmha api
* Adjust P elementwise function
* Fix bug of elementwise op, our elementwise op is not inout
* Add some elementwise op, prepare to quantization
* Let generate.py can generate different elementwise function
* To prevent compiler issue, remove the elementwise function we have not used.
* Remove f8 pipeline, we should share the same pipeline even in f8
* Remove remove_cvref_t
* Avoid warning
* Fix wrong fp8 QK/KV block gemm setting
* Check fp8 rounding error in check_err()
* Set fp8 rounding error for check_err()
* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode
* 1. codgen the f8 api and kernel
2. f8 host code
* prevent warning in filter mode
* Remove not-in-use elementwise function kargs
* Remove more not-in-use elementwise function kargs
* Small refinements in C++ source files
* Use conditional_t<> to simplify code
* Support heterogeneous argument for binary function types
* Re-use already-existing scales<> functor template
* Fix wrong value produced by saturating
* Generalize the composes<> template
* Unify saturates<> implementation
* Fix type errors in composes<>
* Extend less_equal<>
* Reuse the existing template less_equal<> in check_err()
* Add equal<float> & equal<double>
* Rename check_err() parameter
* Rename check_err() parameter
* Add FIXME comment for adding new macro in future
* Remove unnecessary cast to void
* Eliminate duplicated code
* Avoid dividing api pool into more than 2 groups
* Use more clear variable names
* Use affirmative condition in if stmt
* Remove blank lines
* Donot perfect forwarding in composes<>
* To fix compile error, revert generate.py back to 4439cc107d
* Fix bug of p element function
* Add compute element op to host softmax
* Remove element function in api interface
* Extract user parameter
* Rename pscale and oscale variable
* rename f8 to fp8
* rename more f8 to fp8
* Add pipeline::operator() without element_functor
* 1. Remove deprecated pipeline enum
2. Refine host code parameter
* Use quantization range as input
* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3.Add init description
* Refine description
* prevent early return
* unify _squant kernel name in cpp, update README
* Adjust the default range.
* Refine error message and bias range
* Add fp8 benchmark and smoke test
* fix fp8 swizzle_factor=4 case
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
---------
Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
* Support A/B/C elementwise ops.
* First part of GGEMM multiD splitk two stage.
* WIP - changes for debuggin.
* tmp save
* working version
* added bf16@int8 version
* fixes
* add reviewers sugestions
* pre-commited missing files
* switched to ifs from elseifs
---------
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Update device op api to support BComputeType
* Add example
* Add instances
* Add profiler mode
* Add client example
* Update copyright year
* Add BComputeType check
* Fix compute types
* parse examples inside the add_example_executable function
* fix the example 64 cmake file
* add xdl flag to the gemm_bias_softmax_gemm_permute example
* add filtering of tests based on architecture type
* enable test_grouped_gemm for gfx9 only
* enable test_transpose only for gfx9
* only linnk test_transpose if it gets built
* split the gemm instances by architectures
* split gemm_bilinear,grouped_conv_bwd_weight instances by targets
* split instances by architecture
* split grouped_conv instances by architecture
* fix clang format
* fix the if-else logic in group_conv headers
* small fix for grouped convolution instances
* fix the grouped conv bwd weight dl instances
* fix client examples
* only enable client examples 3 and 4 on gfx9
* set the gfx9 macro
* make sure the architecture macros are set by cmake
* use separate set of xdl/wmma flags for host code
* sinmplify the main cmake file
* add conv_fwd_bf8 instance declaration