Commit Graph

310 Commits

Adam Osewski
46b120168b Polished Grouped GEMM APIs and new BF16 instances (#1600)
* Few small fixes.

* New GroupedGemm instances (BF16)

* Unify and refactor GroupedGEMM device API.

* Adapt changes to new API.

* Adapt grouped gemm profiler.

* Accept multiple kbatches for grouped gemm profiler.

- delete obsolete two stage as it is now covered by grouped gemm

* Update unit test for grouped gemm.

* Fix thresholds for BF16 and F8. Unblock tests.

* Fix few instances.

* Multiple small fixes.

* Adapt to new API, check dynamic casting.

* Uncomment few data types in grouped gemm profiler.

* Fix call to SetDeviceArgs.

* Fix profile grouped gemm multiply tile loop.

* Fix grouped gemm tile loop kernel args in client examples.

* Review comments.

[ROCm/composable_kernel commit: 061ac0649c]
2024-11-27 13:02:44 +01:00
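
The unified grouped GEMM device API described above can be pictured, at the host level, as a single launch over a list of independent GEMM problems. A minimal NumPy sketch with hypothetical names; the real device API takes per-problem argument descriptors and a kbatch split factor rather than a Python list:

```python
import numpy as np

def grouped_gemm(problems):
    """Sketch: one 'grouped' launch computing a list of independent GEMMs.

    Each problem may have its own M/N/K; a real kernel walks a device-side
    argument array instead of a Python list. Names here are hypothetical.
    """
    return [a @ b for (a, b) in problems]
```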
Harisankar Sadasivan
0d34db594d universal streamk fp8 changes (#1665)
* universal streamk fp8 changes & ckprofiler instances

* revert strides to -1 and verification options

* fp8 exclusion on pre-gfx94 for universal_streamk

* PR review revisions: permissions reverted, removed HIP error checks


---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: d6d4c2788b]
2024-11-21 08:21:37 -08:00
Haocong WANG
15ce93a491 fix bug (#1680)
[ROCm/composable_kernel commit: 81ec5eff4a]
2024-11-20 07:03:56 -08:00
Illia Silin
87ed997f4b add more fp32 dl gemm instances (#1675)
* add more fp32 dl gemm instances

* update the dates

[ROCm/composable_kernel commit: da0c21f661]
2024-11-19 10:00:17 -08:00
Illia Silin
dc227604bc Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671)
* add bf16 gemms for gfx11/gfx12

* reduce the input values in test_gemm

* add int8 wmma gemm instances for gfx11/gfx12

* add example gemm_wmma_int8

* fix bug in gemm_wmma_int8 test

* increase bf16 gemm test tolerance

* update the dates and clean-up commented-out instances

[ROCm/composable_kernel commit: 8aba2724cc]
2024-11-18 14:07:04 -08:00
Bartłomiej Kocot
b89a44ea33 Batched GEMM Multiple D based on Universal GEMM (#1655)
* Batched GEMM Multiple D based on Universal GEMM

Co-authored-by: Jing Zhang <jizhan@fb.com>

* CI fixes

Co-authored-by: Jing Zhang <jizhan@fb.com>

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>

[ROCm/composable_kernel commit: 754adc70e3]
2024-11-18 14:03:45 +01:00
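
The "Multiple D" pattern named in the commit above fuses extra input tensors into the GEMM epilogue. A hedged NumPy sketch (shapes and the epilogue op are illustrative, not the CK API):

```python
import numpy as np

def batched_gemm_multiple_d(a, b, ds, cde_op):
    # Batched GEMM over a leading group dimension g, followed by a fused
    # elementwise epilogue that combines C with the auxiliary D tensors.
    c = np.einsum("gmk,gkn->gmn", a, b)
    return cde_op(c, *ds)

# e.g. a bias-add epilogue: cde_op = lambda c, d0: c + d0
```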
Bartłomiej Kocot
421819d720 Add generic instances for two stage conv bwd wei (#1643)
* Add generic instances for two stage conv bwd wei

* Update layout prefix

[ROCm/composable_kernel commit: ea3640fdea]
2024-11-08 10:04:33 +01:00
Illia Silin
ddfcce82ab enable compilation for generic navi targets (#1645)
[ROCm/composable_kernel commit: 75c5bfa364]
2024-11-07 14:14:42 -08:00
Illia Silin
827e5ed06d Make sure cmake can handle the xnack+/xnack- targets. (#1633)
* make sure cmake can handle xnack targets

* don't build xdl instances for gfx906:xnack-

* don't build xdl tests for gfx906:xnack-

[ROCm/composable_kernel commit: b6e74be1aa]
2024-11-05 08:53:10 -08:00
Juan Manuel Martinez Caamaño
6e74da9b87 [generate.py] Override blob list if it already exists (#1635)
Before, generate.py appended the list at the end of the output file.
When running the cmake configuration steps multiple times on the
examples, the blob list (such as fwd_blob_list.txt) would grow at every
configuration.
`library/src/tensor_operation_instance/gpu/mha/CMakeLists.txt` worked around
this issue by removing the output file if it exists.

Now, generate.py overrides the content of the output file.
There is no need for the workaround in the CMakeLists.txt;
and the issue is solved for the example projects too.

[ROCm/composable_kernel commit: 464abd235e]
2024-11-05 10:09:52 +01:00
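
The fix described above amounts to opening the blob list for writing instead of appending. A minimal sketch (file and function names hypothetical):

```python
def write_blob_list(path, blobs):
    # Mode "w" truncates any previous content, so re-running the CMake
    # configuration step rewrites the list instead of growing it; the old
    # behavior corresponded to mode "a" (append).
    with open(path, "w") as f:
        f.write("\n".join(blobs) + "\n")
```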
Lin Sun
40df6ce241 Linsun/convint8 fwd instances (#1626)
Add instances for int8 grouped conv2d fwd
---------

Co-authored-by: root <root@dell300x-pla-t28-03.pla.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 0c9012fb70]
2024-11-04 16:33:20 -08:00
Bartłomiej Kocot
fadab8013c Temporarily disable part of dynamic op conv instances (#1630)
* Temporarily disable part of dynamic op conv instances

* fix

[ROCm/composable_kernel commit: 4f1fdbb6e3]
2024-11-04 13:34:17 -08:00
Illia Silin
f29c4cebf5 Reduce build time. (#1621)
* disable fp8 gemm_universal on gfx90a and gfx908 by default

* fix cmake syntax

* fix clang format

* add ifdefs in amd_xdlops

* disable fp8 gemm instances on gfx90a by default

* update readme

[ROCm/composable_kernel commit: 03c6448ba3]
2024-11-01 13:52:23 +08:00
Bartłomiej Kocot
930195c384 Add dynamic elementwise op (#1426)
* Add dynamic elementwise op

Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>

* CI issues fix

* Custom parameter value for dynamic functions - Comments addressed

---------

Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>

[ROCm/composable_kernel commit: 31bf253aeb]
2024-10-26 15:22:37 +02:00
valarLip
85cf31cf40 add int8 gemm multiply multiply a8w8 (#1591)
* add int8 gemm multiply multiply a8w8

* uncomment

* clang-format-12

* Add example_gemm_multiply_multiply_xdl_int8

* Remove shell scripts

* update preprocess number for mi308; bring back printout in ckprofiler

* format

---------

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: 37f7afed1e]
2024-10-26 16:39:34 +08:00
Bartłomiej Kocot
47c726d0ba [POST MERGE PR] Enable grouped conv bwd wei bf16 NGCHW (#1594)
[ROCm/composable_kernel commit: cedccd59c9]
2024-10-23 12:02:33 +02:00
Bartłomiej Kocot
5f8fefae24 Enable grouped conv bwd wei bf16 NGCHW (#1589)
* Enable grouped conv bwd wei bf16 NGCHW

* fixes

* fixes

* Fixes

* fixes

* fixes

* Fixes

[ROCm/composable_kernel commit: 82fc53835a]
2024-10-22 16:18:28 +02:00
Thomas Ning
a415fb6906 Ck profiler instance support (#1575)
* Draft of the ckProfiler instance addition

* support the ck profiler instance with same data types

* add a small feature on the M and N variable switch.

* Partially solve the incorrect result problem

* fix based on ci cd

[ROCm/composable_kernel commit: 560917b161]
2024-10-21 22:47:48 +08:00
Haocong WANG
3f37d10e43 disable bad instance detected on MI308CPX (#1584)
[ROCm/composable_kernel commit: a285d6f9b5]
2024-10-18 08:46:11 -07:00
Illia Silin
ee93500dad Fix build logic using GPU_ARCHS. (#1536)
* update build logic with GPU_ARCHS

* fix the GPU_ARCHS build for codegen

* unset GPU_TARGETS when GPU_ARCHS are set

[ROCm/composable_kernel commit: 7d8ea5f08b]
2024-10-07 08:18:23 -07:00
BrianHarrisonAMD
6d2f70f9aa Add generating mha static library for gfx90a (#1540)
* Add generating mha static library for gfx90a

* Update comment to reflect changes

[ROCm/composable_kernel commit: 294cb82314]
2024-10-02 09:26:11 -07:00
BrianHarrisonAMD
308d3698c0 Add additional instances to device_mha_instance (#1522)
* Add additional instances to device_mha_instance

* Add comment to describe what receipt 3 option filters

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 3528a523ff]
2024-09-24 10:15:30 -06:00
Bartłomiej Kocot
9d56280a0a Add support for NGCHW in grouped conv fwd (#1499)
* Support NGCHW in grouped conv fwd

* Remove not needed variable

* Fixes

[ROCm/composable_kernel commit: 4ba52b35dc]
2024-09-20 10:45:46 +02:00
aledudek
f504e98b5d Extend pool3d fwd avg, max operations by f8_t, int8_t types (#1483)
* Extend pool3d fwd avg, max operations by f8_t, int8_t types

* Pack MaxPool3dFwd params together

* Fix MaxPool3dFwd AVG instances

* Decrease verification precision for bf16

* Adjust tests + review changes

* Adjust threshold for F8

* Adjusted compute types for MAX op instances

* Fix ComputeDataType mismatch in tests and profiler for AVG

* Fix naming from max_pool3d_fwd to pool3d_fwd

* Adjust CMakeLists

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: a793afc961]
2024-09-17 15:57:10 +02:00
Mateusz Ozga
d7326fb525 This commit implements max pool2d for the f8 type (#1506)
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 6834e5ee74]
2024-09-16 10:15:06 +02:00
bibek
50ec07c3e3 Fix duplicate CMake tidy-target issue (#1513)
[ROCm/composable_kernel commit: 49e012dee1]
2024-09-13 21:15:04 -07:00
jakpiase
4940f07a4b Add pool2d int8 and fp8 instances (#1508)
* add pool2d fp8 and int8

* minor fixes

* add formatting

* add reviewer suggestions

* add reviewer suggestions

[ROCm/composable_kernel commit: 8f8a2ce396]
2024-09-13 10:18:21 -07:00
Jun Liu
3739cf9f74 Customize filesystem in CK for legacy systems (#1509)
* Legacy support: customized filesystem

* Update cmakefile for python alternative path

* fix build issues

* CK has no boost dependency

* More fixes to issues found on legacy systems

* fix clang format issue

* Check if blob is correctly generated in cmake

* fix the python issues

* add a compiler flag for codegen when using alternative python

* use target_link_options instead of target_compile_options

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 81bc1496b2]
2024-09-13 07:51:07 -07:00
Mateusz Ozga
92d1b386b2 Pool2d max/avg kernel in the BWD version (#1494)
* Add pool2d instance BWD AVG

* Add pool2d instance BWD MAX

* Fix: avg review

* Fix review: part2

* Fix - enable test when type is compiled

* Fix review part3

[ROCm/composable_kernel commit: 448c0f56d8]
2024-09-12 11:47:52 +02:00
jakpiase
cb4975cf70 Rewrite pool2d fwd (#1462)
* added pool2d fwd

* add tests

* add reviewers changes

* Revert "Merge remote-tracking branch 'origin/develop' into jakpiase/pool2d_fwd_new"

This reverts commit 6b2ba7ff89, reversing
changes made to 22c82bea0c.

* Revert "add reviewers changes"

This reverts commit 22c82bea0c.

* added reviewers comments

* revert some old files

* add reviewers requests

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: e8d2887cb2]
2024-09-11 15:21:00 +02:00
Haocong WANG
4e4514caa8 Add gemm universal bf16 instances (#1484)
* revert ckprofiler change

* temp save

* Add test and test pass

* test pass

* Fix bug inside rotating buffer when tensor is not packed

* bug fix

* clang format

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 5b10dae6a4]
2024-09-04 20:58:54 -07:00
Illia Silin
132c89b29d copy all fmha headers when building library (#1497)
* copy all fmha headers when building library

* fix the rocm_install call for mha headers

[ROCm/composable_kernel commit: 8b95d9ad52]
2024-09-04 07:36:41 -07:00
Illia Silin
234bc58d2d Add an option to select an alternative python version during build. (#1496)
* locate a newer version of python when the -DRHEL=ON flag is set

* allow setting python version on cmake command line

[ROCm/composable_kernel commit: 841009c5ee]
2024-09-04 07:36:27 -07:00
Bartłomiej Kocot
950165c6fb Add support for NGCHW in grouped conv bwd wei (#1491)
* Add support for NGCHW in grouped conv bwd wei

* Comments fixes

* navi fixes

* Update function names

[ROCm/composable_kernel commit: 73b67f290f]
2024-09-03 10:52:03 +02:00
Bartłomiej Kocot
9974926658 Revert "Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)" (#1490)
This reverts commit 725dd433cdc6435d481e806b5442a07b0097c94a.

[ROCm/composable_kernel commit: a9b170b541]
2024-09-02 10:39:49 +02:00
Andriy Roshchenko
f6c6819b47 Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473)
* Enable CMakePresets build

* Verify Convolution, Scaling and ReLU algorithms.

* Add tensor element-wise scale and type cast operation.

* Reduction implemented but does not work.

* Exploration of Reduction functionality.

* Completed example for Convolution scaled with ReLu activation and AMAX reduction.

* WIP: Add required instances for convolution.

* WIP: Create client example. Implement convolution stage.

* Add elementwise instances.

* Add elementwise scale + convert example.

* Add reduction instances.

* WIP: Client example for AMAX reduction.

* WIP: Add instances for multistage reduction.

* WIP: Implementation of multistage reduction.

* Refactoring.

* Clean up.

* Add CMakePresets.json

* Guard off FP8 instances when the data type is not available.

* Add example for Scaled FP8 Convolution with AMAX reduction.

* Refactor CombConvScaleRelu instances.

* Add CombConvScale instances.

* Add client example for Scaled FP8 Convolution with AMAX reduction.

* Cleanup.

[ROCm/composable_kernel commit: c3515f277c]
2024-08-21 15:22:41 -07:00
Andriy Roshchenko
10edb0c70e Adding Instances and Examples for FP8-based Scaled Convolution with ReLU Activation and AMAX Reduction. (#1469)
* Enable CMakePresets build

* Verify Convolution, Scaling and ReLU algorithms.

* Add tensor element-wise scale and type cast operation.

* Reduction implemented but does not work.

* Exploration of Reduction functionality.

* Completed example for Convolution scaled with ReLu activation and AMAX reduction.

* WIP: Add required instances for convolution.

* WIP: Create client example. Implement convolution stage.

* Add elementwise instances.

* Add elementwise scale + convert example.

* Add reduction instances.

* WIP: Client example for AMAX reduction.

* WIP: Add instances for multistage reduction.

* WIP: Implementation of multistage reduction.

* Refactoring.

* Clean up.

* Guard off FP8 instances when the data type is not available.

* Improve output readability.

* Addressing reviewer's comments.

[ROCm/composable_kernel commit: a94113a941]
2024-08-20 10:30:56 -05:00
Illia Silin
ad65d8d5b0 Re-enable fp8 types for all architectures. (#1470)
* re-enable fp8 and bf8 for all targets

* restore the fp8 gemm instances

* re-enable conv_3d fp8 on all architectures

* disable several fp8 gemm instances on all architectures except gfx94

* clang format fix

[ROCm/composable_kernel commit: c8b6b64240]
2024-08-16 16:07:52 -06:00
Haocong WANG
65d6442b4c [GEMM] gemm_universal related optimization (#1453)
* replace buffer_atomic with global_atomic

* fixed global_atomic_add

* added bf16 atomic_add

* format

* clang-format-12

* clean

* clean

* add guards

* Update gtest.cmake

* enabled splitk_gemm_multi_d

* format

* add ckProfiler

* format

* fixed naming

* format

* clean

* clean

* add guards

* fix clang format

* format

* add kbatch printout

* clean

* Add rocm6.2 related gemm optimization

* Limit bf16 atomic usage

* remove redundant RCR gemm_universal instance

* Add RRR fp8 gemm universal instance

* Bug fix

* Add GPU_TARGET guard to FP8/BF8 target

* bug fix

* update cmake

* remove all fp8/bf8 example if arch not support

* Enable fp8 RRR support in ckProfiler

* limit greedy-reverse flag to gemm_universal in ckProfiler

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 3049b5467c]
2024-08-14 10:42:30 +08:00
Illia Silin
92df7893df Disable inapplicable xdl and mha instances for gfx12 (#1464)
[ROCm/composable_kernel commit: cbb6f2ab8c]
2024-08-12 15:11:58 -07:00
Jun Liu
254a7dadb6 Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)" (#1455)
This reverts commit 0c367d5912486f4fcbae1dbb861a1fb8176ca308.

[ROCm/composable_kernel commit: 5ff8eeebf9]
2024-08-08 19:09:33 -07:00
bibek
c8c3293b0b adding mha as static lib (#1366)
* adding mha as static lib

* add fmha fwd compile options

* typo

* fix python version

* python version to 3

* increase path length

* add max path flag in mha cmake

* fix long path issue

* mha currently only runs in gfx94x

* only build mha on mi300

* populate gpu_list

* add mha compile flags

* avoid building mha on GPUs other than gfx94x

* some comments and include ck_tile in rocm

* use rocm_install

* place ck_tile in include

* correct ck_tile path

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>

[ROCm/composable_kernel commit: 840c5397bb]
2024-08-06 11:17:10 -05:00
Bartłomiej Kocot
69a6b563f9 Add Grouped Conv Fwd Large Tensor kernel (#1432)
* Support 64 bit indexing

* Add new grouped conv fwd kernel for large tensors

* Add instances large tensor

* Fixes for transform conv to gemm

* Fixes

* fixes

* Remove not needed instances

* examples fixes

* Remove not need ds arrays

* Fix tests

* Add 2GB check in gridwise dl

* Fixes

[ROCm/composable_kernel commit: 4ec5c52a0c]
2024-08-06 10:06:10 +02:00
Illia Silin
8f71de4707 add --offload-compress compiler flag (#1433)
* add --offload-compress compiler flag

* only apply the --offload-compress flag to the ckProfiler

* move the --offload-compress flag back to main cmake file

* add offload-compress to target compile option of ckProfiler

---------

Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: 7f57b2e02c]
2024-08-05 23:26:01 +08:00
Bartłomiej Kocot
1567614d80 Revert Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) (#1415)
[ROCm/composable_kernel commit: 33b399cc15]
2024-07-30 18:36:04 +02:00
Andriy Roshchenko
e3b469a493 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)
* Add CMakePresets configurations.

* Add binary elementwise ConvScaleAdd and an example.

* Numerical verification of results.

Observed significant irregularities in F8 to F32 type conversions:
```log
ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
```

* Implemented ConvScaleAdd + Example.

* Add ConvScale+Bias Instances

* Add Client Example for ConvScale+Bias

* Fix number of bytes in an example.

* Cleanup.

[ROCm/composable_kernel commit: 4a8a1befd5]
2024-07-24 15:49:55 -05:00
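
The conversion irregularities logged in the commit above are easier to judge against ideal round-to-nearest behavior for the E4M3 format. A rough sketch of that ideal (3 mantissa bits; subnormals, saturation, and NaN ignored):

```python
import math

def quantize_e4m3(x):
    # Ideal round-to-nearest quantization to FP8 E4M3: with 3 mantissa bits,
    # values in [2^e, 2^(e+1)) are spaced 2^(e-3) apart.
    if x == 0.0:
        return 0.0
    e = math.floor(math.log2(abs(x)))
    step = 2.0 ** (e - 3)
    return round(x / step) * step
```

Under this model 145.0 rounds to 144.0, 97.0 to 96.0, and 65.0 to 64.0, which helps separate genuine conversion bugs from ordinary FP8 rounding error in logs like the one quoted.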
Haocong WANG
c69df380b9 disable bad instance (#1410)
[ROCm/composable_kernel commit: d22713a719]
2024-07-23 09:05:03 -07:00
Bartłomiej Kocot
b23a3fcf77 Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406)
[ROCm/composable_kernel commit: 5d8c3d8190]
2024-07-22 14:21:24 +02:00
Haocong WANG
a0e0f3cdcc [GEMM] F8 GEMM, performance optimized. (#1384)
* add ab_scale init support

* enabled interwave

* add scale type; update isSupport

* adjust example

* clean

* enable f8 pure gemm rcr ckprofiler

* Add gemm_multiply_multiply instances

* clang format

* Optimize for ScaleBlockMNK=128

* enable abscale f8 gemm ck profiler

* Add pure f8 gemm test suite

* Reverting to the state of project at f60fd77

* update copyright

* clang format

* update copyright

---------

Co-authored-by: root <jizhan@amd.com>

[ROCm/composable_kernel commit: 8c90f25be3]
2024-07-19 22:06:52 +08:00
ltqin
50c6703b31 Universal gemm splitk using reduce (with multi-d) (#1341)
* init for reduce_threadwise multi_d

* add reduce_threadwise_multi_d

* add reduce_multi_d

* clean

* start add an other splitk device op

* add reduce template parameter to SplitKBatchOffset

* add reduce c matrix

* clean up code

* change example data type to bf16

* add bf16Ai8B example

* remove reduce template parameter

* add splitk atomic status to v4

* example add multi d parameters

* device op add multi-d parameters

* add multi-d to reduce

* fix kbatch=1 bug

* change B layout to col in bf16Ai8B example

* remove float adding struct

* change  multi-d interface

* change file and class name

* remove multi-d of bf16Ai8B example

* change IsReduce function to IsReduceAdd

* change example layout to RRR from RCR

* set ds stride according to layout

* reset parameter layout

* add gemm universal reduce instance

* add reduce factory

* add profile_gemm_universal_reduce

* add reduce to profiler

* fix reduce instance

* fix profiler reduce compiling bug

* format

* format library instance code

* add mem instance for reduce library

* fix call instance names

* add workspace for reduce in ckProfiler

* format

* add mnpading to reduce library instance

* add fp16 instance to reduce of profiler

* change copyright time

* restore profiler cmake file

* add reduce text to instances

* add DsLayout and DsDataType to instances template parameter

* fixed gemm_reduce_multi_d

* add an example without multi_d

* Update common.hpp

* Update gtest.cmake

* Update gemm_xdl_splitk_reduce_bf16.cpp

* clean

* Update gtest.cmake

* format

* fix api

* format

* default parameter change to RRR

* add vector_len for multi_d

* format

* Update gtest.cmake

* fix bf16Ai8B elementwise op

* add ReduceDataType

* move ReduceDataType to end position

* format

* remove googletest git method  address

* fix copyright time

* update init data

---------

Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

[ROCm/composable_kernel commit: c544eb4da0]
2024-07-19 22:01:22 +08:00
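
The split-K-with-reduce scheme built up in the commits above can be sketched in NumPy: the K dimension is cut into kbatch chunks, each chunk produces a partial C, and a separate step sums the partials (the device version uses atomic adds or a dedicated reduction kernel). A sketch, not the CK implementation:

```python
import numpy as np

def splitk_gemm(a, b, kbatch):
    # Split K into kbatch chunks; each chunk yields a partial C matrix,
    # and the reduce step sums them back into the full product.
    partials = [a[:, ks] @ b[ks, :]
                for ks in np.array_split(np.arange(a.shape[1]), kbatch)]
    return np.sum(partials, axis=0)
```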