1269 Commits

Author SHA1 Message Date
Joseph Macaranas
548ddd0673 Enable external CI pipeline triggers (#1310)
[ROCm/composable_kernel commit: 02fa2c298b]
2024-05-23 18:21:34 -04:00
Illia Silin
9bdae6116c Split the gemm_multi_abd instances. (#1306)
* split the gemm_multi_abd instances

* update the dates

[ROCm/composable_kernel commit: ec2bae27ff]
2024-05-23 09:17:02 -07:00
dependabot[bot]
e96d09f6b3 Bump rocm-docs-core from 1.1.2 to 1.1.3 in /docs/sphinx (#1308)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.1.2 to 1.1.3.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.1.2...v1.1.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 06a9b72caf]
2024-05-23 07:45:53 -07:00
Max Podkorytov
564de0adc0 Make the library which generates CK instances for pytorch2 inductor's CK backend usage
Also bundle the CK library and include files with the pip package.

The package is pip-installable with
`pip install git+https://github.com/tenpercent/composable_kernel@enable-pip`

(substitute the repo path and branch if necessary)

Testing:

`myenv/bin/python3 -m ck4inductor.universal_gemm.gen_instances`

(prints a list of instances)

`tree myenv/lib/python3.12/site-packages/ck4inductor`

(observe the list of sources along the installed package)


[ROCm/composable_kernel commit: 29e58d5b28]
2024-05-22 13:44:22 -07:00
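
The testing steps in the commit above can be scripted. The following is a minimal sketch, assuming the package has already been installed into the active environment with the `pip install` command quoted in the commit message. Only the package name `ck4inductor` and the module path `ck4inductor.universal_gemm.gen_instances` come from the commit. The helper names `package_available` and `list_instances` are hypothetical, and the format of the printed instance list is not specified in the commit, so the sketch just forwards whatever the module prints.

```python
import importlib.util
import subprocess
import sys

# Names taken from the commit message above.
PACKAGE = "ck4inductor"
GEN_MODULE = "ck4inductor.universal_gemm.gen_instances"


def package_available(name: str) -> bool:
    """Return True if a top-level package can be found in this environment."""
    return importlib.util.find_spec(name) is not None


def list_instances() -> str:
    """Run the generator module the same way the commit does (python -m ...)
    and return its standard output unchanged."""
    result = subprocess.run(
        [sys.executable, "-m", GEN_MODULE],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    if package_available(PACKAGE):
        print(list_instances())
    else:
        print(f"{PACKAGE} is not installed in this environment")
```

Checking for the package first avoids a less readable traceback from `python -m` when the install step was skipped.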
Bartłomiej Kocot
b4b436d29a Optimize grouped conv bwd weight for small M and N (#1303)
* Optimize grouped conv bwd weight for small M and N

* Fixes

[ROCm/composable_kernel commit: fd72380aeb]
2024-05-22 21:01:01 +02:00
Illia Silin
beb7927f52 Select appropriate GPU targets for instances, tests, and examples. (#1304)
* set individual gpu targets for instances, examples, tests

* fix path to hip compiler

* fix path to hip compiler once more

* aggregate device macros in ck_tile config header

* fix the cmake logic for instances

* fix clang format

* add gfx900 and gfx906 to default set of targets

[ROCm/composable_kernel commit: 7b027d5643]
2024-05-22 11:45:27 -07:00
Rostyslav Geyyer
f6bd300ecb Move grouped conv fwd client examples (#1299)
* Move grouped conv fwd client examples

* Update existing examples

* Format

[ROCm/composable_kernel commit: 204da9c522]
2024-05-21 09:52:41 -05:00
Illia Silin
ca0015bf39 aggregate device macros in ck_tile config header (#1297)
[ROCm/composable_kernel commit: 06b891c5c2]
2024-05-20 08:34:45 -07:00
Illia Silin
0003dce849 replace the ENV macro with CK_ENV (#1296)
[ROCm/composable_kernel commit: 1274861a9d]
2024-05-17 10:42:51 -07:00
dependabot[bot]
3bd09036a3 Bump rocm-docs-core from 1.1.1 to 1.1.2 in /docs/sphinx (#1293)
Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 1.1.1 to 1.1.2.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v1.1.1...v1.1.2)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 6637a810d0]
2024-05-17 07:44:48 -07:00
rocking
54dc094a2c Fix compile error (#1292)
error: no viable conversion from returned value of type '__half' to function return type 'fp16_hip_t' (aka '_Float16')

Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: aaa8dfdae9]
2024-05-17 17:19:17 +08:00
Illia Silin
ca31c8515e remove wrong use of nonexistent class members (#1290)
[ROCm/composable_kernel commit: c44137838e]
2024-05-15 08:08:17 -07:00
carlushuang
96b7e7336a remove operator-deref (#1291)
[ROCm/composable_kernel commit: dd0dd13d4e]
2024-05-15 08:06:50 -07:00
jakpiase
290ac20e62 Add unit tests for grouped gemm two stage (#1256)
* add unit tests for grouped gemm two stage

* add reviewers suggestions

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 3e3471d5d2]
2024-05-15 10:03:39 +02:00
Illia Silin
2640cb1551 re-enable convnd_fwd_xdl_fp64 testing (#1289)
[ROCm/composable_kernel commit: 7843a8a7fb]
2024-05-10 22:48:28 -07:00
Illia Silin
254758813f Code clean-up (#1285)
* code clean-up

* remove the profiling output samples

[ROCm/composable_kernel commit: 566b6480a2]
2024-05-10 09:41:39 -07:00
carlushuang
9ca5aca74d [CK_TILE] fix some rand number init (#1287)
* add random norm

* normalized default to 0/3

* change squant->auto

[ROCm/composable_kernel commit: fcba889ef4]
2024-05-10 09:03:39 -07:00
Bartłomiej Kocot
70f51bb03f Change output gemm type to AccDataType in two stage conv bwd wei (#1283)
[ROCm/composable_kernel commit: 8346af9c68]
2024-05-10 10:57:42 +02:00
Adam Osewski
675a16e3b8 Fix MakeArgument (#1284)
[ROCm/composable_kernel commit: a0ae1c6133]
2024-05-09 09:42:41 -07:00
Adam Osewski
3ede1f58e6 Add vector instruction coherency bits for gfx94 targets. (#1268)
[ROCm/composable_kernel commit: 3c043cd10b]
2024-05-09 07:30:17 -07:00
Illia Silin
ffe52d2d30 fix the output formatting (#1282)
[ROCm/composable_kernel commit: fdbf8ccbd7]
2024-05-08 16:11:54 -07:00
Bartłomiej Kocot
b6a17bc3e2 Add two stage grouped conv bwd weight kernel (#1280)
[ROCm/composable_kernel commit: 0b6b5d1785]
2024-05-08 09:53:24 +02:00
Illia Silin
e88d576926 Enable logging in CK with environment variable. (#1278)
* enable logging using environment variable

* update ck.hpp header

* fix typo

* fix clang format

* Update include/ck/utility/env.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: bf42097646]
2024-05-07 16:26:43 -07:00
carlushuang
9be788a9aa [CK_TILE] support alibi (#1269)
* add alibi support

* fix code

* update code based on comment

* Support more hdim

* fix fp8 bias

* support seqlen_k=0 case

* remove unused printf

* fix format

---------

Co-authored-by: rocking <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: 851c3ed157]
2024-05-07 22:32:54 +08:00
Sam Wu
a908d37952 Add ROCm Doc team as codeowners for RTD yaml (#1277)
Also add component owners as codeowners for header directory

[ROCm/composable_kernel commit: 6d073d31bb]
2024-05-06 10:07:39 -06:00
Illia Silin
ba9ffb86c7 add missing vector header (#1275)
[ROCm/composable_kernel commit: 08d51d9bc4]
2024-05-02 11:27:59 -07:00
Illia Silin
9e8d625b46 Downgrade minimum required python version to 3.6 (#1274)
[ROCm/composable_kernel commit: 7797f7c7a1]
2024-05-01 15:34:56 -07:00
Illia Silin
c02ecc0cd1 [CI] Focus CI stages on MI200 nodes for resource optimization (#1273)
[ROCm/composable_kernel commit: f0bf1e3125]
2024-05-01 10:07:14 -07:00
Rostyslav Geyyer
38d64386cd Add an ignore (#1270)
[ROCm/composable_kernel commit: a2d0bdd5a9]
2024-04-30 20:45:22 -07:00
Sam Wu
56f8a6e8da Update documentation requirements and configurations (#1272)
* Update documentation requirements

Set rocm-docs-core to v1.1.1

* Update RTD config

Set Python 3.10 for rocm-docs-core >= v1.0.0

[ROCm/composable_kernel commit: 43579900a9]
2024-04-30 20:44:59 -07:00
Illia Silin
8e9c95c520 [CI][Tests] Add a daily cron job to build CK instances for gfx9;gfx10;gfx11. (#1271)
* add a daily build for instances for gfx9;gfx10;gfx11

* fix jenkins logic for instances only build

* fix the path for instance_only build

* reduce the number of build threads to 32

[ROCm/composable_kernel commit: f6b3f4715d]
2024-04-30 14:44:30 -07:00
Adam Osewski
1adeb99f3f Fix example CMakeLists.txt (#1267)
Add proper dependency target.

[ROCm/composable_kernel commit: 0f7e8ec485]
2024-04-30 08:28:19 -07:00
Rostyslav Geyyer
e0669ecf6c Mark unneeded instances as "getting deprecated" (#1265)
* Add a flag

* Add flag check and messages

---------

Co-authored-by: root <root@aus-g7-rogeyyer.amd.com>

[ROCm/composable_kernel commit: 6ced3c12ff]
2024-04-29 12:00:55 -07:00
Haocong WANG
1a34c500a6 [GEMM] UniversalGemm update (#1262)
* Add bf16 instances

* Add bf16 gemm universal example

* tempsave

* Add guard to navi compilation

* workaround for a specific mixed gemm instance (bring it back when the compiler fix is uploaded)

* fix formatting condition statement issue

* solve conflict

---------

Co-authored-by: Jun Liu <Liu.Jun@amd.com>

[ROCm/composable_kernel commit: 764164b488]
2024-04-26 12:56:07 -05:00
Rostyslav Geyyer
2d642d2737 Add element op (#1259)
[ROCm/composable_kernel commit: f044ff71fb]
2024-04-26 12:55:45 -05:00
zjing14
58142fae83 ggemm tile_loop multD bf16 int8 (#1258)
* Overload output stream operator for LoopScheduler and PipelineVersion

* Add Run overload accepting grid descriptors MK.

* Add __device__ keyword for CalculateGridSize

* Create device op GroupedGemmMultipleD

* Add GroupedGemm MultipleD Tile Loop implementation.

* Add an example for GroupedGemm MultipleD tile loop.

* Device Op GroupedGEMMTileLoop.

* Bunch of small changes in example.

* CkProfiler

* Remove unused tparam.

* changed the copy function to v7r2

* adding multi_abd

* in-progress

* add post-load oob check

* Fix include statement.

* Fix output stream overloads.

* Do not make descriptors and check validity until we find group.

* Fix gemm desc initialization.

* debugging

* adjust instances

* add run_lds

* add elementwise_op

* replace multi_abd_device with v3

* clean up

* clean

* clean

* Revert device op

* Fix compilation for DTYPES=FP16

* Validate tensor transfers parameters.

* Added LDSType

* profiling

* adjust oobcheck

* add missing file

* Validate on host only NK dims if M is not known.

* add

* clean

* refactor

* clean

* add examples

* add fuse

* add fusion and client example

* Fix bug.

* A convenient debug func for selecting threads.

* Fix has main k block loop bug.

* Make sure that b2c has up to date tile offset.

* Output stream operator for Sequence type.

* Cmake file formatting.

* clean

---------

Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>

[ROCm/composable_kernel commit: 5ae893c0d3]
2024-04-26 10:37:49 -05:00
zjing14
ce67c185b4 bf16A_Int8B with fastgelu/bias (#1264)
* changed the copy function to v7r2

* adding multi_abd

* in-progress

* add post-load oob check

* debugging

* adjust instances

* add run_lds

* add elementwise_op

* replace multi_abd_device with v3

* clean up

* clean

* clean

* Added LDSType

* profiling

* adjust oobcheck

* add missing file

* refactor

* clean

* add examples

[ROCm/composable_kernel commit: 0d0150db20]
2024-04-26 07:26:30 -05:00
Adam Osewski
2b452ad135 Grouped GEMM Multiple D tile loop. (#1247)
* Overload output stream operator for LoopScheduler and PipelineVersion

* Add Run overload accepting grid descriptors MK.

* Add __device__ keyword for CalculateGridSize

* Create device op GroupedGemmMultipleD

* Add GroupedGemm MultipleD Tile Loop implementation.

* Add an example for GroupedGemm MultipleD tile loop.

* Device Op GroupedGEMMTileLoop.

* Bunch of small changes in example.

* CkProfiler

* Remove unused tparam.

* Fix include statement.

* Fix output stream overloads.

* Do not make descriptors and check validity until we find group.

* Fix gemm desc initialization.

* Revert device op

* Fix compilation for DTYPES=FP16

* Validate tensor transfers parameters.

* Validate on host only NK dims if M is not known.

* Fix bug.

* A convenient debug func for selecting threads.

* Fix has main k block loop bug.

* Make sure that b2c has up to date tile offset.

* Output stream operator for Sequence type.

* Cmake file formatting.

[ROCm/composable_kernel commit: b4032629e5]
2024-04-25 15:12:53 -05:00
ltqin
b4f3b8e693 Universal gemm flush cache (#1251)
* add flush cache to device op

* add flush cache parameter to ckProfiler

* change calculate size a and b method

* change evaluation time method from AVERAGE to MEDIAN

* format code

* adjust some code

* fix core dumped

* remove loop call flush icache in kernel

* remove loop(outer) call flush icache

---------

Co-authored-by: letaoqin <letaoqin@amd.com>

[ROCm/composable_kernel commit: f448d179b7]
2024-04-25 15:07:14 -05:00
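
The commit above switches the timing statistic from average to median. The following is a minimal sketch of why that choice can matter for benchmarks. It is not the ckProfiler implementation, the sample timings are invented for illustration, and the helper name `summarize` is hypothetical. A single slow run (for example a cold-cache iteration) pulls the average up noticeably, while the median stays at the typical value.

```python
from statistics import mean, median


def summarize(times_ms):
    """Return (average, median) for a list of per-iteration kernel times."""
    return mean(times_ms), median(times_ms)


# Illustrative timings in milliseconds: three typical runs and one outlier.
samples = [1.0, 1.0, 1.0, 9.0]
avg, med = summarize(samples)

# prints "average=3.00 ms, median=1.00 ms"
print(f"average={avg:.2f} ms, median={med:.2f} ms")
```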
Bartłomiej Kocot
ac08f8a3a1 Fix contraction IsSupported checks (#1257)
[ROCm/composable_kernel commit: b1f8ae379b]
2024-04-23 22:59:39 +02:00
rocking
bb7bc263a3 Small refactor (#1246)
* Remove kIsFp8

* Extract alias

* Fix K, V and corresponding acc type

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 43879b89e4]
2024-04-22 20:28:49 +08:00
Bartłomiej Kocot
6578635cb3 Refactor elementwise kernels (#1222)
* Refactor elementwise kernels

* Instances fixes

* Fix cmake

* Fix max pool bwd test

* Update two stage gemm split k

* Restore elementwise scale for hiptensor backward compatibility

* Fix Acc data type check in conv fwd multiple abd

* Disable conv fp64 fwd example

* Update grouped conv weight multi d

[ROCm/composable_kernel commit: ad1597c499]
2024-04-19 13:31:17 +02:00
jakpiase
e7d121a6f0 Add bf16 and bf16@int8 mk_nk_mn instances for grouped gemm two stage (#1228)
* added bf16 and bf16@int8 mk_nk_mn instances

* fix preprocessor guards

[ROCm/composable_kernel commit: e0f3f918f1]
2024-04-19 13:16:10 +02:00
Bartłomiej Kocot
d001bea12f Add grouped conv bwd weight multi d kernel (#1237)
* Add grouped conv bwd weight multi d kernel

* Reference fix

* Fix cmake files

* bwd weight scale only xdl

* Fixes

* Fix client conv fwd example

[ROCm/composable_kernel commit: fd923b6d86]
2024-04-18 23:35:04 +02:00
Illia Silin
433f7857d7 Make daily cron jobs use the rocm6.1 compiler. (#1253)
* add rocm6.1 docker and make it default for CI

* fix typo

* move the rocm6.1 image into public dockerhub repo

* upgrade daily cron jobs to use rocm6.1

[ROCm/composable_kernel commit: 930f889c34]
2024-04-18 09:40:21 -07:00
Illia Silin
8986d38014 Upgrade to ROCm6.1 and turn on the -enable-post-misched=0 compiler flag. (#1250)
* add rocm6.1 docker and make it default for CI

* fix typo

* move the rocm6.1 image into public dockerhub repo

[ROCm/composable_kernel commit: caae537d8e]
2024-04-18 11:10:23 -05:00
peter
aedae92778 docs: fix broken contributing link (#1244)
[ROCm/composable_kernel commit: 501a6b68eb]
2024-04-16 11:32:14 -04:00
zjing14
4ddb546fe5 Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978)
* added an example grouped_gemm_multi_abd

* fixed ci

* add setElementwiseOp

* changed API

* clean code: add multiA into example

* fixed v7r2 copy

* add transpose

* clean

* fixed vector_load check

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* add reduce

* testing

* add example_b16_i8

* refactor example

* clean

* add mpadding

* disable reduce for kbatch = 1

* separate reduce device op

* add reduce op

* add guard for workspace_size

* add instances

* format

* fixed

* add client example

* add a colmajor

* add instances

* Update cmake-ck-dev.sh

* Update profile_gemm_splitk.cpp

* Update gridwise_gemm_xdlops_v2r4r2.hpp

* format

* Update profile_gemm_splitk.cpp

* fixed

* fixed

* adjust test

* adjust precision loss

* adjust test

* fixed

* add bf16_i8 scale bias

* fixed scale

* fixed scale elementwise_op

* revert contraction deviceop changes

* fixed

* Add AddFastGelu

* Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"

This reverts commit 3b5d001efd, reversing
changes made to 943199a991.

* add Scales into elementwise

* add gemm_multi_abd client example

* add client examples

* add rcr and crr

* add grouped gemm client example

* add grouped gemm client example

* add instance for rcr crr

* format

* fixed

* fixed cmake

* fixed

* fixed client_example

* format

* fixed contraction isSupport

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update device_reduce_threadwise.hpp

* clean

* Fixes

* Fix example

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 12865fbf28]
2024-04-15 21:09:45 -05:00
carlushuang
bb5dbf2f06 introducing ck_tile! (#1216)
* enable gfx940

* switch between intrinsic mfma routines on mi100/200 and mi300

* fix mfma_int8 on MI300

* disable 2 int8 examples on MI300

* Update cmake-ck-dev.sh

* restore gitignore file

* modify Jenkinsfile to the internal repo

* Bump rocm-docs-core from 0.24.0 to 0.29.0 in /docs/sphinx

Bumps [rocm-docs-core](https://github.com/RadeonOpenCompute/rocm-docs-core) from 0.24.0 to 0.29.0.
- [Release notes](https://github.com/RadeonOpenCompute/rocm-docs-core/releases)
- [Changelog](https://github.com/RadeonOpenCompute/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/RadeonOpenCompute/rocm-docs-core/compare/v0.24.0...v0.29.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>

* initial enablement of gfx950

* fix clang format

* disable examples 31 and 41 int8 on gfx950

* add code

* fix build wip

* fix xx

* now can build

* naming

* minor fix

* wip fix

* fix macro for exp2; fix warpgemm a/b in transposedC

* unify as tuple_array

* Update the required Python version to 3.9

* Update executable name in test scripts

* re-structure tuple/array to avoid spill

* Merge function templates

* Fix format

* Add constraint to array<> ctor

* Re-use function

* Some minor changes

* remove wrong code in store_raw()

* fix compile issue in transpose

* Rename enum
Rename 'cood_transform_enum' to 'coord_transform_enum'

* let more integral_constant->constant, and formatting

* make sure thread_buffer can be tuple/array

* temp fix buffer_store spill

* not using custom data type by default, now we can have ISA-level same code as opt_padding

* fix compile error, fp8 not ready now

* fix fp8 duplicated move/shift/and/or problem

* Default use CK_TILE_FLOAT_TO_FP8_STOCHASTIC rounding mode

* fix scratch in fp8 kernel

* update some readme

* fix merge from upstream

* sync with upstream

* sync upstream again

* sync 22

* remove unused

* fix clang-format

* update README of ck_tile example

* fix several issues

* set the minimum python version to 3.8

* remove ck_tile example from default cmake target like all/install/check

* remove mistake

* 1).support receipe in generate.py 2).use simplified mask type 3).change left/right to pass into karg

* fix some bugs in group-mode masking and codegen. update README

* F8 quantization for FMHA forward (#1224)

* Add SAccElementFunction, PComputeElementFunction, OAccElementFunction in pipeline

* Add element function to fmha api

* Adjust P elementwise function

* Fix bug of elementwise op, our elementwise op is not inout

* Add some elementwise op, prepare to quantization

* Let generate.py generate different elementwise functions

* To prevent compiler issue, remove the elementwise function we have not used.

* Remove f8 pipeline, we should share the same pipeline even in f8

* Remove remove_cvref_t

* Avoid warning

* Fix wrong fp8 QK/KV block gemm setting

* Check fp8 rounding error in check_err()

* Set fp8 rounding error for check_err()

* Use CK_TILE_FLOAT_TO_FP8_STANDARD as default fp8 rounding mode

* 1. codegen the f8 api and kernel
2. f8 host code

* prevent warning in filter mode

* Remove not-in-use elementwise function kargs

* Remove more not-in-use elementwise function kargs

* Small refinements in C++ source files

* Use conditional_t<> to simplify code

* Support heterogeneous argument for binary function types

* Re-use already-existing scales<> functor template

* Fix wrong value produced by saturating

* Generalize the composes<> template

* Unify saturates<> implementation

* Fix type errors in composes<>

* Extend less_equal<>

* Reuse the existing template less_equal<> in check_err()

* Add equal<float> & equal<double>

* Rename check_err() parameter

* Rename check_err() parameter

* Add FIXME comment for adding new macro in future

* Remove unnecessary cast to void

* Eliminate duplicated code

* Avoid dividing api pool into more than 2 groups

* Use more clear variable names

* Use affirmative condition in if stmt

* Remove blank lines

* Do not use perfect forwarding in composes<>

* To fix compile error, revert generate.py back to 4439cc107d

* Fix bug of p element function

* Add compute element op to host softmax

* Remove element function in api interface

* Extract user parameter

* Rename pscale and oscale variable

* rename f8 to fp8

* rename more f8 to fp8

* Add pipeline::operator() without element_functor

* 1. Remove deprecated pipeline enum
2. Refine host code parameter

* Use quantization range as input

* 1. Rename max_dtype to dtype_max.
2. Rename scale to scale_s
3. Add init description

* Refine description

* prevent early return

* unify _squant kernel name in cpp, update README

* Adjust the default range.

* Refine error message and bias range

* Add fp8 benchmark and smoke test

* fix fp8 swizzle_factor=4 case

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

---------

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
Co-authored-by: Po-Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>

[ROCm/composable_kernel commit: db376dd8a4]
2024-04-15 19:27:12 -05:00
Illia Silin
d76663b1c2 add CK_USE_XDL/WMMA for client examples (#1238)
[ROCm/composable_kernel commit: dd34ab6e64]
2024-04-15 10:01:22 -05:00