122 Commits

Author SHA1 Message Date
ApoorvaKalyani
70d71b1514 Test fix for gemm_b_scale_xdl_v3. (#3674) 2026-01-30 10:34:54 -07:00
Kiefer van Teutem
2377a62837 Adding remaining conv, dynamic_op, and scaleadd_scaleadd_relu flavors for grouped conv fwd (#3529)
* Adding remaining flavors for grouped conv fwd

As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu

* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.

* Do not build f8 / bf8 only flavor tests on RDNA3

* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.

* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.

* adding int8 and fp16 overloads to the elementwise operations

* fixed copilot nits

* Addressing review comments:

- removed unnecessary examples for dynamic op
- removed unnecessary conv specalizations for all the flavors
- removed spurious bilinear and scale source files

* clang-format

* reduced no of tests

---------

Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
2026-01-30 17:02:14 +01:00
Johannes Graner
c190d8d61f [CK tests] Extend conv GPU reference (#3539)
* test_convnd_fwd

* test_convnd_bwd_data

* test_conv_bwd_data_scale

* test_grouped_convnd_fwd_clamp

* test_grouped_convnd_fwd_scale

* multiple A/B tensors and D tensor for fwd GPU ref

* test_grouped_convnd_fwd_scaleadd_ab

* test_grouped_convnd_fwd_bias_clamp

* test_grouped_convnd_fwd_bilinear

* test_grouped_convnd_fwd_gk_bias_clamp

* Extend GPU reference to enable batchnorm epilogue

* test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp

* test_grouped_conv_bwd_data_bilinear

* test_grouped_convnd_bwd_weight_bilinear

* Add missing template instantiation

* Perform operations in float in reference

* Slightly increase tolerance for batchnorm profiler

* Revert "Slightly increase tolerance for batchnorm profiler"

This reverts commit a3b2475229.

* Revert "test_grouped_convnd_fwd{,_gk}_bias_bnorm_clamp"

This reverts commit 6da4576060.

* Revert "Extend GPU reference to enable batchnorm epilogue"

This reverts commit e2f75fa10e.

* Clarify variable names

* Refactor elementwise ops into helper functions

* Make helpers C++17-compatible
2026-01-27 09:49:42 +01:00
Erwin Terpstra
eb041079a3 Implement grouped gemm tile loop for RDNA4 (#3304)
* feat: grouped gemm tile loop support for RDNA4

* fix: removed extra parameter from grouped gemm example instance

* fix: FP8 check incorrectly enabling FP8 on RDNA3
2026-01-13 07:14:23 +01:00
Kiefer van Teutem
88ae445580 Replace grouped conv bwd wei wmmaV3 bilin/scale bf16f32bf16 support with bf16bf16bf16 (#3470)
* Replace grouped convolution bwd weight wmma v3 bilinear and scale bf16f32bf16 support with bf16bf16bf16 support. Update tests.

* Tentative fix for bwd weight bilinear bf16bf16bf16, seems like the bilinear elementwise overload for this case (bf16, f32 accu, bf16) was wrong.
2025-12-29 12:58:29 +01:00
Aviral Goel
de6466481f chore(copyright): update copyright header for include directory (#3293) 2025-11-26 11:00:05 -07:00
Illia Silin
930423ab3b Initialize new variable to prevent c++17 compiler error (#3156)
* initialize new variable to prevent c++17 compiler error

* build for gfx90a using -std=c++17 flag
2025-11-04 18:54:14 -08:00
Enrico Degregori
4ebc48a3cd WMMA gemm_add_relu_add_layernorm (#2989)
* Summary:

 - Refactor epilogue (with CShuffle) to support fused operations:
    - EpilogueCShuffleBase holds common parts
    - EpilogueCShuffle: runs CShuffle and write out
    - EpilogueWelfordCShuffle: holds Welford specific arguments, runs CShuffle, write out, Welford first part and Welford write out

 - Extend thread transfer v7r3:
    - Support for intermediate data type different from src and dst type
    - New functionality to write to dst buffer and keep data (to be able to use them for additional operations)

* Adress review comments
2025-10-31 11:19:26 -07:00
John Shumway
5ed2046bee Add the last two forward instance traits. (#3134)
* Add InstanceTraits for DeviceGroupedConvFwdMultipleD_Wmma_CShuffle

* Add InstanceTraits for kernel_grouped_conv_fwd_dl_multiple_d

* A few small changes to fix broken instance traits.
2025-10-31 07:52:42 -07:00
Ville Pietilä
1c17bae816 Add name member to CK elementwise operations. (#3102) 2025-10-27 22:19:29 -07:00
John Shumway
54746e9329 [CK_BUILDER] Test and fix instance traits utils. (#3096)
* Refactor instance_traits_util and add unit tests tests

* Address reviewer comments.

Just adds some TODOs to indicate deprecated layouts in our reflection. Our strategy is to leave the reflection code broad (covering deprecated features), but keep the builder concepts narrow. Once we've removed deprecated features from all instances, we can remove them from reflection.

Also add a comment to the cmake to explain the unit test target test_conv_builder.

* Addressed more reviewer comments.

* Remove duplicate PassThrough::name

Accidentally added this field to the end of the struct, too. The `name` field should be a the start of the struct for consistency.
2025-10-27 22:14:08 -07:00
Ville Pietilä
6c2ca1211a [CK_BUILDER] First fwd convolution builder implementation (#3070)
* Add experimental builder infrastructure for composable_kernel

- Add experimental/builder directory with README documentation.
- Create initial test infrastructure with CMakeLists.txt and placeholder test.
- Update root CMakeLists.txt to support CK_EXPERIMENTAL_BUILDER option.
- Update .gitignore to not treat `experimental/builder` as a CMake build directory.

This establishes the directory structure  for a high-level builder pattern that will provide a semantically-clear interface for constructing CK operations, with initial focus on convolution kernels for MIOpen integration.

* Fix clang formatting.

* Fix CMake build infrastructure for experimental builder

- Add experimental/builder CMakeLists.txt with proper subdirectory structure
- Add placeholder include/ck_tile/builder CMakeLists.txt for header installation
- Fix gtest.cmake to use include_guard to prevent multiple inclusions
- Update root CMakeLists.txt to include full builder directory instead of just tests

* Scope C++20 settingto the test code

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Remove redundant GTest::gtest linkage

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Introduce basic types, and convolution algorithm concepts and limits.

* Add convolution signature concepts.

* Add convolution factory.

* Finalize conv factory implementation for fwd convolutions.

* Add type definitions for testing.

* Add placeholder test.

* Add convolution builder definition.

* Fully functional fwd conv builder.

* Test improvements.

* Clean-up include headers.

* Enable the limit checks for the convolution algorithm parameters.

* Remove dead code.

* clang formatting.

* Add more tests and missing conv specialization argument.

* clang formatting.

* Add explicit handling of the tensor layouts.

* Add complete 2D/3D layout support to CK Builder

  - Add missing 2D layouts: GNHWC_GKYXC_GNHWK, NGCHW_GKCYX_NGKHW
  - Add missing 3D layout: GNDHWC_GKZYXC_GNDHWK
  - Add 1D layouts (NWGC, NGCW, GNWC, NGCW_GKCX) for future support
  - Add 3 tests for new 2D/3D layouts
  - All tests pass (5/5)

* Add tests for remaining 2D/3D layouts

  - Add test for 2D NGCHW_GKYXC_NGKHW (channels-first) with Filter1x1Stride1Pad0
  - Add test for 3D NDHWGC_GKZYXC_NDHWGK (channels-last)
  - All 7 tests pass (complete coverage for all 2D/3D forward layouts)

* Change enum converters to consteval.

* 7 tests with pipeline and specialization| Test # | Dim | Type | Layout               | Pipeline | Specialization          |
  |--------|-----|------|----------------------|----------|-------------------------|
  | 1      | 2D  | BF16 | NHWGC_GKYXC_NHWGK    | V1       | DEFAULT                 |
  | 2      | 2D  | FP16 | GNHWC_GKYXC_GNHWK    | V3       | FILTER_1X1_PAD0         |
  | 3      | 2D  | FP32 | NGCHW_GKCYX_NGKHW    | V4       | FILTER_1X1_STRIDE1_PAD0 |
  | 4      | 2D  | BF16 | NHWGC_GKYXC_NHWGK    | V5       | FILTER_3x3              |
  | 5      | 3D  | FP32 | NGCDHW_GKCZYX_NGKDHW | V1       | FILTER_1X1_PAD0         |
  | 6      | 3D  | BF16 | GNDHWC_GKZYXC_GNDHWK | V3       | DEFAULT                 |
  | 7      | 3D  | FP16 | NDHWGC_GKZYXC_NDHWGK | V4       | FILTER_1X1_PAD0         |

* Add missing convolution layouts and provide better compile-time error in instance traits.

* Fix clang formatting.

* Changed I8 -> S8.

* Fix signature.

* Rename concepts and corresponding members.

* Rename LDS related parameters.

* Remove ODD_C specialization. Add V2 pipeline.

* Add missing types.

* Add elementwise operation to the conv signature.

* Improve compile-time error message for unsupported elementwise ops.

* Separate different fwd conv builder tests into separate compilation units.

* Fix layout to string and add name to old CK PassThrough elementwise op.

* Enable both CK and CK Tile tensor layouts in instance traits.

* Fix clang-format.

---------

Co-authored-by: John Shumway <jshumway@amd.com>
Co-authored-by: John Shumway <john.shumwayjr@gmail.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: JH-Leon-KIM-AMD <jeonghyun.kim@amd.com>
2025-10-27 20:09:24 +02:00
Enrico Degregori
b740380906 Wmma support for multiple Ds based GEMMs (#2613)
* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"

* Fixed cmake build errors related to test_fp8

* Updates to support mixed precision


(cherry picked from commit e65d71180393e7b66169c56565a6bac740427de6)

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

* Adding support for RRR, F8xF16xF16 gemm_universal_wmma - wip


(cherry picked from commit f8c06322df0abcbd5945a56cdf5bffe56480f9f0)

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

* Added support for F8xF16xF16 to gemm_wmma_universal


(cherry picked from commit 15c851de6daa513a12c2e3af299bab0176175fb5)

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

* Added support for F16xF8xF16 to gemm_wmma_universal

* Added support for BF16xI4xBF16 to gemm_wmma_universal


(cherry picked from commit c6a4a69d2d43d59bae8bdabfae80d648646f217e)

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

* Added support for F16xI4xF16 to gemm_wmma_universal

* Fixed IsSupportedArgument to check ComputeTypeA, ComputeTypeB instead of ADataType, BDataType

* Added missing test class for FP16_KM_NK

* Pre-commit hooks fixes

* Added padding instances for f16xf16xf16

* Fixed cmake errors related to  gemm_bilinear. Previously, if the above flags are set, cmake build fails: GPU_TARGETS="gfx1100;gfx1201" -D DTYPES="fp16;bf16;fp8"


(cherry picked from commit 5bdc993dbf)

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

* Fixed cmake build errors related to test_fp8


(cherry picked from commit 12176616b6)

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>

* Ammending changes for adding support for padding instances for f16xf16xf16

* Fixes for padding instances for f16xf16xf16

* Added padding instances for bf16xbf16, f8xf8

* Added packed instances for bf16xi4xbf16

* Added padding instances for f8xf16xf16

* Added padding instances for f16xf8xf16, f16xi4xf16

* Fixed typos for bf16xbf16xbf16 padding instances

* Fixed typos for padded instances

* Added tests for fp16, KM_KN and KM_NK

* Padding not supported for when BDataType is pk_i4_t. Added fix for correct check and removed padding instances.

* Fixed typos

* Updated the set of tests for FP16

* Updated the set of tests for FP16

* Fix typo

* Moved f16xi4 test under the correct data layout group

* example for gemm_universal_bf16

* Adding examples for gemm_wmma instances

* Added the  missing parameters

* Fixed review comments and added executable to cmakeLists

* Fixing clang format

* Fixing build erros

* Fixed compilation failure.

* Modified some code as per gemm_universal_examples

* Fixed the gemm specialization error

* Fixed the build errors.

* Fix strides of a/b_thread_desc

The descriptors are larger than needed (even though the compiler don't alloc registers for unused values).

* Load in M/NRepeat dims with thread copy's slice instead of a loop

* Clone BlockwiseGemmXdlops_pipeline_v1 for WMMA implementation

* Implement Intrawave and Interwave variants of pipeline v1

* Add instances for Interwave and Intrawave v1

* Add instances with ABlockLdsExtraM and BBlockLdsExtraN = 0

* Remove instances that are too slow (mostly because of register spilling)

* Add a workaround for fp8/bf8->f32 packed conversion issue

* Add instances for Interwave and Intrawave v1

* Enable profiling of mixed precision with f8 and int4 on WMMA

* Fix segfault in profiler when B is pk_i4_t

b_device_buf's size in bytes is larger than b_k_n_permute so b_device_buf.ToDevice reads out-of-bounds.

* Remove instances that are too slow (mostly because of register spilling)

* Add missing add_device_gemm_wmma_universal_f8_f8_bf16 declarations

* Add test case for bf16_i4

* Add missing Regular tests

* Add test_gemm_universal_xdl/wmma_fp16 to REGRESSION_TESTS

They take more than 30 seconds

* Fix a bug that fp16_i4 validation passes only with PermuteB

A permutation required by conversion from pk_i4_t to half_t does not
depend on PermuteB, they can be used independently.

* Use PermuteB with f16_i4 in most instances (as xdl)

Some instances use PermuteB = false for checking correctness.
See also the previous commit.

* Fix cache flushing for pk_i4

* Add mixed precision examples

* Disable all tests and instances with f8 on gfx11

Even though f8_f16 and f16_f8 don't require f8 WMMA instructions,
gfx11 still lacks hardware instructions for fast f8->f32 conversion.

* Add FP16 KM_NK and KM_KN test suites for XDL

These tests were added to common .inc for better testing of WMMA instances

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* removed unnecessary ck parts from compilation

* initial gemm_add_multiply instance implementations

* fixed profiler help message for gemm_add_multiply

* improved multiply_add profiler layout help

* fixed template arguments for test instances

* added test for gemm_add_multiply

* Support multiple D in GridwiseGemm_wmma_cshuffle_v3

DeviceGemm_Wmma_CShuffleV3 is changed for new template parameters.

* Use ThreadGroupTensorSliceTransfer_v7r3

* Clone for device_gemm_wmma_cshuffle_v3.hpp for future Multiple D support

* Clone example/65_gemm_multiply_multiply/gemm_add_add_xdl_fp16.cpp for wmma

* Implement DeviceGemmMultipleD_Wmma_CShuffleV3

* Make gemm_add_add_wmma to work with DeviceGemmMultipleD_Wmma_CShuffleV3

* Prepare gemma_add tests for adding wmma

* Add gemm_add_fastgelu instances and test

* Add a special wrapper to use DeviceGemmMultipleD_Wmma_CShuffleV3 with old API

ckProfiler uses DeviceGemmMultipleD (tests also call its functions), the wrapper allows to use
DeviceGemmMultipleDSplitK instances there.

* switched to splitK interface

* log print added to splitk benchmarks

* revert main cmake comments

* newline change reverted

* added add_fastgelu instances

* revert unintended change in xdl add_fastgelu

* created gemm_add_add_fastgelu instances

* created fastegelu instances

* added tests for all splitk fastgelus

* Added tests.

* multiply_add instances created

* updates to add_multiply splitk instances

* splitk xdl test fixes

* added wmma multiply_multiply instances

* fixed ONLY_XDL_AND_WMMA_KERNELS tag

* Added gemm_add examples for wmma v1 and v3

* fixed / workarounded i8 instances

* Modified the v3 code to added one fp16 bxdl instance.

* added bf16 xdl instance.

* adding gemm_add wmma_cshuffle and other support


(cherry picked from commit ec447e7f564095ea969eddc39ec77b843aa52976)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* add instances into camkelists


(cherry picked from commit 23bf2d2771c939ea3ca7f493433c55255bffd08e)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* This is work in progress, edited the template parameters in order to build

(cherry picked from commit b4fde8a3314cb44659c4bbda35f1a0133c63dc41)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* temp work saved, changed the BDataType to f16 or bf16 since wmma currently not support non-equal A and B datatype


(cherry picked from commit 22fbd68f1db458ab50780a394ee2544c7a1484d1)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* added datatype and use clang-format-12


(cherry picked from commit ae4e853682ef1bb27784b2f965b4a66b3751ceec)

Co-authored-by: Cenxuan <cenxuan@streamhpc.com>

* Fixing build errors

* Added instances for v3

* Adding instances and executables

* Code update of template parameters modified.

* Renamed file.

* Added tests.

* resolved error tests.

* Fixing build errors

* Updated comments

* removed the changes as per the MR review comment.

* Updated tests.

* fp8 instances - not tested

* Restored the Cmake file that was reverted by mistake during rebase.

* fixed wmma_op test

* Updated comments.

* Updated the template parameter description

* fixed rdna4 instances

* fixed back compatibility on gfx11

* cleanups

* fix ckProfiler

* one more cmake fix

* added fp8 instances

* Updated tests to ad BF16 instances as per review comment

* Added include file and cleaned up(as per review comment)

* Updated and optimized the example code for all types.

* Fixed clang format

* Resolve "Implement `device_gemm_bilinear` for RDNA4"

* test generalization to handle FP16 shuffle better

* added missing changes

* Added bf16 wmma instance for add_relu

* Added f16 wmma instance and corrected bf16 instance errors.

* Added instances to Cmake

* Modified the template parameters to make the instances work.

* Fixed typo in profiler

* Added v3 instances for gemm_add_relu

* addressed core review comments

* Added test for gemm_add_relu wmma instance

* Cleaned up the code.

* Added examples for gemm_add_relu

* Fixing typo to resolve build errors.

* Fixes applied to fix  the precision loss.

* fix billinear test after merge

* Removed the old wmma instances.

* Added wrapper and renamed the wmma_v3 instances

* Updated copyrights and added wrappers.

* Fixes applied according to review comments

* Apply 1 suggestion(s) to 1 file(s)

Co-authored-by: Robin Voetter <robin@streamhpc.com>

* Removed the old wmma instances.

* Updated wrapper for the v3 instances

* removed the old wmma examples

* Renamed the v3 instances

* Deleted the  gtest file added by mistake.

* Updated thge profiler with wrapper

* Fixed test errors.

* Fixed the review comments

* Fixed the if condition MACROS.

* REVERTED THE PROFILER CHANGES

* Revert "REVERTED THE PROFILER CHANGES"

This reverts commit 21cb98546c.

* Revert "Fixed test errors."

This reverts commit 13efcc6fe1.

* Revert "Updated thge profiler with wrapper"

This reverts commit 536f86661d.

* Added missing wrapper instances

* Updated copyrights.

* Fixed typo.

* Fixed copyrights.

* Updated copyrights.

* updated copyrights.

* comments on the atomics workaround

* fixed cmake comment

* Fix bug from merge

* clang-format-18

* Fix compilation error

* Fix linking error

* Fix bug in add and add_relu examples

* Fix error including file (typo)

* Quick fix to compile examples for different targets

* Fix for multi target

* implemented f16 and bf16 instances for gemm_silu

* addressed review comments

* addressed review comments

* Fix clang format

* Fix clang format

---------

Co-authored-by: Anca Hamuraru <anca@streamhpc.com>
Co-authored-by: apoorva <apoorva@streamhpc.com>
Co-authored-by: Anton Gorenko <anton@streamhpc.com>
Co-authored-by: Zoltan Lakatos <zoltan.lakatos@streamhpc.com>
Co-authored-by: Cenxuan <cenxuan@streamhpc.com>
Co-authored-by: Robin Voetter <robin@streamhpc.com>
Co-authored-by: Kiefer van Teutem <kiefer.van.teutem@streamhpc.com>
Co-authored-by: Kevin Abraham <kevin.abraham@streamhpc.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-09-05 16:31:08 +02:00
linqunAMD
e2d28a92af Extend XDL kernel to Support RDNA3/4 - Part 2 (#2722)
Update Blockwise and Gridwise files to support both wave32 & wave64.

1. Calculate WaveSize from template parameter, instead of hard code it to 64, some "64" is also replace with WaveSize
2. Move BN0Shuffled and BK0Shuffled to device side. we can't get correct mfma inst info in host side.
3. Update b_thread_offset_n and b_thread_offset_k in gridwise_gemm_xdl_cshuffle_v3_b_scale.hpp for gfx11. in gfx11, input data is duplicated for each 16 threads, it is different with all of others.
4. Modify a1_threadwise_copy in gridwise_batched_*gemm*gemm for gfx11.  for gfx11, we need duplicate input and swizzle A if transposeC isn't enabled.
2025-09-04 08:33:40 +08:00
Bartłomiej Kocot
5328b232b2 Grouped Convolution Forward Infer Bias Bnorm Activ (#2621)
* Grouped Convolution Forward Infer Bias Bnorm Activ

* 3d
2025-08-07 08:36:47 +02:00
Illia Silin
504b101da3 upgrade from clang-format-12 to clang-format-18 (#2568)
* upgrade to clang-format-18

* update to clang-format-18 in pre-commit-config
2025-07-28 11:34:07 -07:00
Adam Osewski
c8eb2f995c Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.

* Less casting in AddClamp

* Conv+bias+clamp instances & profiler BF16

* Fix 3D instances & run just 1x for verification.

* :Run just once for verification conv fwd.

* ckProfiler conv fwd clampwq

* Remove exec bit & formatting

* Add support for MultiD for grouped conv fwd v3.

* Enable 2Lds.

* clean

* align instances

* align instances

* profiler fixes

* Fixes

* fix

* fix

---------

Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2025-07-25 10:34:31 +02:00
Bartłomiej Kocot
f6c2ff9dce Grouped convolution forward with clamp (#2334)
* Grouped convolution forward with clamp

* Optimize clamp

* unary fixes

* test gk bias

* Revert "test gk bias"

This reverts commit 8e42e29d7b.

* Revert "Revert "test gk bias""

This reverts commit e73c0550ce.

* workaround comment
2025-06-16 15:36:53 +02:00
Andriy Roshchenko
00247e3c29 Optimized GEMMs for MX FP4/8 (#2294)
Adds V3 GEMM pipeline for MX FP4 and MX FP8 
Adds V3 GEMM pipeline for MX FP4 with preshuffling
Adds MXFP4 GEMM tests (#2275)
Adds MXFP4 GEMM examples
Adds MXFP4 GEMMs to ckProfiler




Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: lalala-sh <Jiaxing.Wen@amd.com>
Co-authored-by: OscarXu <huaiguxu@amd.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
Co-authored-by: feifei14119 <feiw@amd.com>
Co-authored-by: Lin, Qun <qlin@amd.com>
Co-authored-by: joye <joye@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
2025-06-05 13:54:15 -06:00
Bartłomiej Kocot
6e5acee0f9 Add Clamp/Relu bf16/fp16 cast fixes (#2279)
* Add Clamp/Relu bf16/fp16 fixes

* fix
2025-06-03 18:31:46 +02:00
Bartłomiej Kocot
e7906dd644 Change relu to clamp for grouped conv fwd instances (#2249) 2025-05-29 00:51:25 +02:00
Bartłomiej Kocot
6fddb5708c Add grouped conv fwd bias relu instances (#2179)
* Add grouped conv fwd bias relu instances

* fixes

* fix
2025-05-09 22:52:34 +02:00
valarLip
2c563fecf7 add passthrough for int32->float32 (#2062) 2025-04-08 15:16:30 -07:00
Mingtao Gu
0db7c8f0b2 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not check.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-10 11:16:44 +08:00
Illia Silin
68a08c872e Rebase the PR #1520 to ROCm repo. (#1574)
* Implement hiprtc for codegen tests

* Introduce gemm_softmax_gemm to codegen.

* Fix codegen build issues.

* Address PR comments.

* Separate ck_host lib and gemm_softmax_gemm into different PR.

* Fix cmake.

* Replace ENV variable with CMake option for toggling hipRTC in codegen
tests.

* Address PR comments.

* fix clang format

* Add missing header in magic_division.hpp

* - Workaround for hipRTC content wrapper
- Move descriptor for gemm_softmax_gemm to different branch

* Fix formatting.

* Revert "Fix formatting."

This reverts commit b5209eaef4.

* formatting fix

* fixed header guard issues

* updated header guards

* updated data_type for new types

* fixed redefinition error

* Add codegen test for batched_gemm_softmax_gemm.

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>

* formatting fix

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Dino Musić <dino.music@htecgroup.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@htecgroup.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
2025-02-20 18:58:14 -08:00
Haocong WANG
76425a673f [A8W8 GEMM] Optimized weight-preshuffled implementation & add quantization datatype for CK TILE rms_norm (#1862)
* tempsave

* temp save

* tempsave

* tempsave, epilogue optimization for universal gemm done. TODO: mulitpleD epilogue optimization

* temp save

* tempsave

* temp save

* update bf16 instance list

* clang format

* bug fix

* temp save

* tempsave

* revert exp changes.

* add blank line

* add int8 gemm multiply multiply a8w8

* uncomment

* clang-format-12

* Add example_gemm_multiply_multiply_xdl_int8

* Remove shell scripts

* update preprocess number for mi308; bring back printout in ckprofiler

* tempsave

* update ck_a8w8 library, update flush cache timing api

* remove the change in ckprofiler src

* clean the flush_cache api

* reduce prefetch stage in blockwisepipev4

* update tile size for fp8 rowwise

* fix bug in enable f8 gemm inside ckProfiler

* update instance and lds layout strategy

* delete use less files

* fix cmake bug

* update instances

* add configs to fix tunning cases

* port tiles from a8w8

* rm debug used files

* add instances

* remove all non gemm in cmake

* fix build

* sanity bug fix

* add bypass logic and build

* can run

* add double buffer scratch

* remove agpr usage when vgpr usage <256

* add configs to fix tunning cases

* fix build

* fix performance regression on blockgemm v3 pipe

* using develop branch timer

* impl fp16 in ckprofiler

* add cpu shuffle

* fix tail

* use empty hipstream in ckprofiler

* fix missed files and fix clang format

* fix fp16 build

* fix cmake rm compile options

* fix brepeat, kloop and lds two buffer; works ok now

* use new pipeline for b preshuffle, run ok; revert olds to fix ckprofiler

* auto calculate hard code params

* fix warnings and revert cmake and fix clang format

* tempsave

* sanity pass, most tile size enabled. TODO: NWave!=4

* disable N, K  Padding, splitk enabled

* add fp16 instances

* use bpreshuffle as independent example

* refine weight preshuffle format.

* tempsave

* optimize software pipeline

* refine blockgemm pipeline version as base struct.

* fp8 add_rmsnorm_dynamic_dequant

* add save_x=true instance

* tempsave

* Add compute-friendly pipeline for bpreshuffle case; remove enable-post-misched=0 flag.

* fix Odd Mrepeat number pipelinev3; Add v3 instances to ckProfiler

* clean the code

* Merge from internal (#1857)

* enable batched_gemm_softmax_gemm_perm_wmma for gfx12

* disable instances with blocksize=256 in attention examples

* debuggging

* debug

* fixed lds_enabled

* debugging

* Fix and add limit to skiplds feature

* Enable skipLds feature and fix compilation bugs

* add ck_tile definitions for gfx12

* fix clang format and test/wmma_op

* updage instances cmake for gfx12

* disable the test_wmma_op on gfx12

* fix the builds for gfx950

* add gfx12 and gfx950 to default target list

* clean-up cmake file

* Initial introduction of OFP8 data types.

* Renamed FP8 and BF8 tests into FP8_FNUZ and BF8_FNUZ.

* Implementation of ConvertFP32Nearest in test_fp8_ocp.

* Remove dependence on possibly undeclared alias.

* Implement FP8OCP test for stochastic rounding mode.

* Implement FP8OCP tests for half_t type conversions.

* enable bf16 atomic add on gfx950

* Implement ConvertFP32Nearest test.

* Implement ConvertFP32Stochastic test.

* Implement ConvertFP16Nearest and ConvertFP16Stochastic tests.

* Refactoring. Move FP8 definitions into a separate header file.

* Enable easy switching between architectures.

* Fix compilation error for gfx942 architecture.

* Add fp4 type with constants

* only builf gfx950 branch for gfx950 target by default

* Enable OCP build of example_gemm_xdl_fp8.

* Fix formatting.

* fix the build logic for gfx950

* Improve GEMM example verbosity.

* Add constexpr where applicable.

* fix the logic of enabling XDL and WMMA instances

* Improve GEMM example verbosity.

* Enable build of example_gemm_xdl_fp8_bf8 test.

* Fix tests for gfx1101 architecture.

* Build DPP examples only on gfx103 and gfx11 architectures.

* Optionaly run either CPU or GPU verifications with GEMM examples.

* Extend GeneratorTensor_Sequential to produce values of prescribed data types.

* Add missing constructor.

* Add scale type and mxfp conversions

* Update conversions

* Add conversion tests

* Fix typo

* Improve infrastructure for OFP8 data type support.

* BUGFIX. Should not use FP8 as Compute/Accum data type.

* Add custom target for grouped_convnd_bwd_weight tests.

* Can build `tests` target on gfx950.

* Bugfixes on gfx1101 architecture.

* Fix dependencies.

* Add stochastic rounding tests

* Provide single point of truth for FP8 INF and NAN checks

* Prevent instantiation of operators that are not supported by FP8 data types

* Add FP8 type selection into client_axample CMakeLists.txt

* Prevent sccache server from shutting down during build

* Fix test success reporting logic

* Change default verification method to CPU.

GPU verification takes too much time to complete on the emulator.

* Add scale <-> float conversions

* Add scaled conversions with tests

* Add device conversions

* Make sure all tests and examples are built for gfx950

* Facilitate testing of FP8 data types on the emulator

* Introduce two new tensor generators

* Enable instances built for gfx94 to be built on gfx950

* Verify 35_splitk_gemm on floating point numbers.

splitk gemm appears to be losing precision VS reference implementation when FP numbers are involved.

* Format

* Verify 04_gemm_add_add_fastgelu on floating point numbers

* Verify 20_grouped_conv_bwd_weight on floating point numbers

* Verify 38_grouped_conv_bwd_data_multiple_d on floating point numbers

* Verify more tests on floating point data

* Fix data types and improve testing verbocity.

* Add fp4 vectors

* Add debug tests

* Upgrade to NPI 573 build docker.

* Skip on gemm_universal tests.

The tests take too long to complete on the emulator.
Need to see if it is possible to reduce the scope of the testing to just FP8 data types.

* Add new mfma instructions and examples

* Add preprocessor directives for gfx950 specific code

* Fix gfx1101 build

* Document test availability

* Re-enable fp8 gemms for gfx94/95

* Cherry-pick GEMM Universal tests for FP8 data types

* Cleanup

* Add vector types and tests

* Add check_err function

* Add tensor generators

* CK_USE_GFX94 has already been set on this branch

* Fix

* Address formatting issues and leftovers

* Make fail/pass logic consistent within 01_gemm folder

Removed multiple negations in fail/pass logic to propagate `true` as the success indicator.

* Fix GPU verification reporting logic.

* Update year in copyright notice.

* Cleanup

* Use `enum class` instead of `enum`

* Remove set_property for FP8 tests

* Add vector conversions

* Fix

* Fix linker errror

* Clean up

* Fix gfx950 conversions

* Clean up

* Fix more gfx950 conversions

* Fix even more gfx950 conversions

* Narrowing the scope of PR to OCP FP8 enablement only

* Add tests for OCP FP8 vector_type storage

* Fix client examples build

* Fix typo

* Update e8m0 casting

* Rename E8M0 type

* Update unpack method

* Cleanup merge artifacts

* Enable gemm kernel on all gfx9 architectures (#227)

* clean-up

* Implement `non_native_vector_base` with `ext_vector_type` array. (#232)

* Enable support of 1, 2, 4, and 8-byte custom types in CK.

* Fix pool tests for OCP FP8 data type

* Fix build

* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on MI350

* fix clang format

* Add new mfma instructions and examples

* Add preprocessor directives for gfx950 specific code

* Add ckProfiler gemm instances for new mfma instructions and fix ckProfiler build on MI350

* fix clang format

* Fix clang format for the newly merged files

* Use the existing example instances for fp16 bf16 and int8

* Remove comment on new mfma instructions in MfmaInstr

* Update include/ck/tensor_operation/gpu/grid/gridwise_batched_gemm_gemm_xdl_cshuffle_v1.hpp

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* merge from public repo

* Fix ck build

* Fix ck build

* Use double for max_abs_in_val

* Move scaled_type_convert functions to a separate header (#251)

* re-enable building mha lib and gemm_universal_f8 instances for gfx950

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* fix typo for CK_USE_OCP_FP8

* fix typo for CK_USE_OCP_FP8

* Add FP6 and BF6 types (#261)

* Add a rounding flag

* Add FP6 and BF6

* Add tests

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* Clean up

---------

Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>

* fix one more typo

* Refactor E8M0 scale implementation (#262)

* Refactor E8M0 scale implementation

* Add MXFP6 and MXBF6 conversion methods (#270)

* Add conversions

* Add tests

* Add docstrings

* Add scaled conversions

* Add fp6/bf6 tests

* Remove misleading fp4 test case

* Add docstrings

* Clean up

* Address comments

* Set stricter tolerances for RNE tests

* Add missing tests

* Add native conversions to float

* Revert "Add native conversions to float"

This reverts commit 09467111f73b753c8cc3d597533b187940353dab.

* Update copyright years

* replace the fp6 with bf6 convert calls in test_bf6

* fix test_bf6

* enable smfmac test

* [MX FP8] Add Scaled Type Convert Functions for OCP FP8/BF8 data types (#271)

* Move scaled_type_convert functions to a separate header

* Introduce MX data tests

* Build MX tests only on relevant architectures

* Refactor E8M0 scale implementation

* Fix `config.h` typo

* Cleanup deprecated symbols

* Refactor `amd_ck_fp8.hpp`

* `scaled_type_convert` for `f8_ocp_t`

* Implement test for MX FP8 scaled type convert

* Implement test for MX BF8 scaled type convert

* Scaled type convert for vectors of 2 FP8 elements

* Scaled type convert for vectors of 16 FP8 elements

* Implementation of scaled conversion from F32 to F8

* Add tests for scaled conversions from FP32 to FP8

* Add documentation to the test functions

* Implementation of scaled conversion from F32x2 to F8x2

* Implementation of scaled conversion from F32x16 to F8x16

* Implementation of scaled conversion from F32x32 to F8x32

* Implementation of scaled conversion from F8x32 to F32x32

* Verified on the emulator

* MX FP GEMM - Example Template (#277)

Temporarily uses `DeviceGemmMultiD_ABScale_Xdl_CShuffle_V3` kernel and 128x128 scaling matrices.
Must be modified to use MX-native GEMM kernell with 16 or 32 component vectors per scale.

Verified on the emulator.

* Add vector support

* Add tests

* Add missing type aliases

* Fix test naming

* only build mx example for gfx950

* disable CK_USE_AMD_MFMA_GFX950 by default

* fic build for multiple archs

* fix typo

* fix typo

* Update unpack signature

* Fix merge

* Add size checks in pack function

* Add a flag

* Add conversions

* Fix build logic

* Update pack/unpack methods

* Remove unneeded AsType accessors

* Add docstrings

* Add a flag to config file

* Test the functionality of V_MFMA_F32_16X16X128_F8F6F4 and  V_MFMA_F32_32X32X64_F8F6F4 instructions. (#293)

* Introduced MFMA tests

* Verified f8f6f4 MFMA Instructions

* Move flag logic to scaled_type_convert header

* Use pointers instead of array indices

* Fix a typo

* Update tests and pack functions

* Fix gemm gemm on gfx950

* Fix clang format

* restore the default gput target lists

* fix the jenkinsfile

* add missing ifdef

---------

Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>

* clang format

* fix errors

* fix errors

* remove compile flags in example

* fix error

* restore cron trigger (#1863)

* recover enable-post-misched=0 for sanity issue

* add vectorloads on non-k dim for memory pipelines (#1856)

* Support for dtypes (fp8, bf8, bf16 and fp16) for the ck_tile/03_gemm example. (#1845)

* Support bf16/fb8/bf8 datatypes for ck_tile/gemm

* remove commented out code.

* Addressing code review comments and enabling universal_gemm for all the supported data types.

* Merge conflict resolution.

* Solve the memory pipeline compilation error. Merge with the new change of CShuffle

* finish the feature, pass the tests

* Fix the pipeline and add the benchmark script for other data types

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

* revert blockwisegemm modification

* revert blkgemm pipe v2 changes.

* CK Tile - small fix to hotloop scheduler & KPack value. (#1867)

* Use SmemPack in HotLoop scheduler

* Additional debug print information

* Change KPack value.

Hardcode for now, as without AK1/BK1 there's no good way to determine
its value.

* Fix HotLoopScheduler MFMA instr parameters.

* Add a host mx gemm reference kernel (#1864)

* Add mx gemm reference kernel

* Update copyright year

* Update mx gemm example

* Use element-wise ops in the reference gemm

* External CI: enable amd-develop branch trigger (#1859)

* Apply suggestions from code review

Co-authored-by: John Afaganis <john.afaganis@amd.com>

* hotfix for ckprofiler operator

* add the 16x16 mfma instances

---------

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Jing Zhang <jizhan@amd.com>
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
Co-authored-by: Andriy Roshchenko <andriy.roshchenko@amd.com>
Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>
Co-authored-by: Rostyslav Geyyer <46627076+geyyer@users.noreply.github.com>
Co-authored-by: root <root@banff-cyxtera-s83-2.ctr.dcgpu>
Co-authored-by: Andriy Roshchenko <107577548+andriy-ca@users.noreply.github.com>
Co-authored-by: jefyang1 <146495389+jefyang1@users.noreply.github.com>
Co-authored-by: jefyang1 <Jeffreyj.Yang@amd.com>
Co-authored-by: jakpiase <jakub.piasecki@amd.com>
Co-authored-by: kylasa <sudhir.kylasa@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Daniel Su <danielsu@amd.com>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
2025-02-20 14:00:27 -08:00
Bartłomiej Kocot
9ee69dd297 Fix pk_int4 cast and add pk_int4 dtype in ck tile (#1854)
* Fix pk_int4 cast and add pk_int4 dtype in ck tile

* fixes

* Improvements

* fix typo
2025-02-04 10:32:07 +01:00
arai713
2e3183af4f Codegen hipRTC compilation (#1579)
* updating codegen build for MIOpen access: adding .cmake for codegen component

* updating CMake

* adding in header guards for some headers due to issues with hiprtc compilation in MIOpen

* some more header guards

* putting env file in header guard

* cleaning up some includes

* updated types file for hiprtc purposes

* fixed types file: bit-wise/memcpy issue

* updating multiple utility files to deal with standard header inclusion for hiprtc

* added some more header guards in the utility files, replacing some standard header functionality

* added some more header guards

* fixing some conflicts in utility files, another round of header guards

* fixing errors in data type file

* resolved conflict errors in a few utility files

* added header guards/replicated functionality in device files

* resolved issues with standard headers in device files: device_base and device_grouped_conv_fwd_multiple_abd

* resolved issues with standard headers in device files: device_base.hpp, device_grouped_conv_fwd_multiple_abd.hpp, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle.hpp

* added header guards for gridwise gemm files: gridwise_gemm_multiple_abd_xdl_cshuffle.hpp and gridwise_gemm_multiple_d_xdl_cshuffle.hpp

* fixed issue with numerics header, removed from transform_conv_fwd_to_gemm and added to device_column_to_image_impl, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle, device_grouped_conv_fwd_multiple_abd_xdl_cshuffle_v3, device_image_to_column_impl

* replaced standard header usage and added header guards in block to ctile map and gridwise_gemm_pipeline_selector

* resolved errors in device_gemm_xdl_splitk_c_shuffle files in regards to replacement of standard headers in previous commit

* added replicated functionality for standard header methods in utility files

* replaced standard header functionality in threadwise tensor slice transfer files and added header guards in element_wise_operation.hpp

* temp fix for namespace error in MIOpen

* remove standard header usage in codegen device op

* removed standard header usage in elementwise files, resolved namespace errors

* formatting fix

* changed codegen argument to ON for testing

* temporarily removing codegen compiler flag for testing purposes

* added codegen flag again, set default to ON

* set codegen flag default back to OFF

* replaced enable_if_t standard header usage in data_type.hpp

* added some debug prints to pinpoint issues in MIOpen

* added print outs to debug in MIOpen

* removed debug print outs from device op

* resolved stdexcept include error

* formatting fix

* adding includes to new fp8 file to resolve ck::enable_if_t errors

* made changes to amd_wave_read_first_lane

* updated functionality in type utility file

* fixed end of file issue

* resovled errors in type utility file, added functionality to array utility file

* fixed standard header usage replication in data_type file, resolves error with failing examples on navi3x

* formatting fix

* replaced standard header usage in amd_ck_fp8 file

* added include to random_gen file

* removed and replicated standard header usage from data_type and type_convert files for fp8 changes

* replicated standard unsigned integer types in random_gen

* resolved comments from review: put calls to reinterpret_cast for size_t in header guards

* updated/added copyright headers

* removed duplicate header

* fixed typo in header guard

* updated copyright headers

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-01-31 09:48:39 -08:00
Bartłomiej Kocot
1519ce91a3 Fix and optimize dynamic unary elementwise (#1818)
* Fix and optimize dynamic unary elementwise

* fix
2025-01-16 13:48:39 -08:00
Mingtao Gu
4f62f6e9b7 Implement the fp16xint4 scale weight only kernel for Ali (#1786)
* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formated

* formated

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-03 18:35:21 +08:00
Adam Osewski
1d8e4ec2ce Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
* add a prototype of int4

* clean

* debug

* clean

* clean

* move packed into dynamic_buffer

* fixed coord reset

* add fast pki4 to half conversion

* fix

* fixed reference and host_tensor

* fixed tensor init

* format

* debug i4_to_f16_convert

* format

* fixed splitk

* weight permute

* add b tile permute

* clean

* weight permute with splitki

* format

* improve weight layout

* add and_or_b32

* fixed splitk crush

* add permute switch as a template

* recover v3r1

* clean

* failure with intrawave v2

* fixed

* fixed

* add ckProfiler

* add bfp16 support

* add bf16 example

* fixed int4 to bhalf_t conversion

* format

* fixed int4 to bf16 conversion

* clean

* add instances for mem

* clean

* fixed host tensor size

* fixed

* debug

* fixed

* add pk_i4_t as a struct

* fix

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* revert

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed comments

* revert

* clean

* revert

* revert

* fixed

* Update CMakeLists.txt

* Update script/cmake-ck-dev.sh

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update CMakeLists.txt

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed

* fixed

* fixed

* revert

* revert

* add comments

* format

* fixed assert

* fixed

* Fix I4 define in ckProfiler

* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
Bartłomiej Kocot
9a8a52130d Remove virtual destructors from unary ops (#1610)
* Remove virtual destructors from unary ops

* Fixes

* Fixes

* clang format fixes
2024-10-30 17:42:50 +01:00
Bartłomiej Kocot
31bf253aeb Add dynamic elementwise op (#1426)
* Add dynamic elementwise op

Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>

* CI issues fix

* Custom parameter value for dynamic functions - Comments addressed

---------

Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2024-10-26 15:22:37 +02:00
valarLip
37f7afed1e add int8 gemm multiply multiply a8w8 (#1591)
* add int8 gemm multiply multiply a8w8

* uncomment

* clang-format-12

* Add example_gemm_multiply_multiply_xdl_int8

* Remove shell scripts

* update preprocess number for mi308; bring back printout in ckprofiler

* format

---------

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-26 16:39:34 +08:00
Bartłomiej Kocot
f21cda2536 Add transpose scale amax example (#1547)
* Add transpose scale amax example

* fixes

* Tune reduce instance
2024-10-14 17:39:38 +02:00
Mateusz Ozga
448c0f56d8 Pool2d max/avg kernel in the BWD version (#1494)
* Add pool2d instance BWD AVG

* Add pool2d instance BWD MAX

* Fix: avg review

* Fix review: part2

* Fix - enable test when type is compiled

* Fix review part3
2024-09-12 11:47:52 +02:00
Andriy Roshchenko
c3515f277c Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. (#1473)
* Enable CMakePresets build

* Verify Convolution, Scaling and ReLU algorithms.

* Add tensor element-wise scale and type cast operation.

* Reduction implemented but does not work.

* Exploration of Reduction functionality.

* Completed example for Convolution scaled with ReLu activation and AMAX reduction.

* WIP: Add required instances for convolution.

* WIP: Create client example. Implement convolution stage.

* Add elementwise instances.

* Add elementwise scale + convert example.

* Add reduction instances.

* WIP: Client example for AMAX reduction.

* WIP: Add instances for multistage reduction.

* WIP: Implementation of multistage reduction.

* Refactoring.

* Clean up.

* Add CMakePresets.json

* Guard off FP8 instances when the data type is not available.

* Add example for Scaled FP8 Convolution with AMAX reduction.

* Refactor CombConvScaleRelu instances.

* Add CombConvScale instances.

* Add client example for Scaled FP8 Convolution with AMAX reduction.

* Cleanup.
2024-08-21 15:22:41 -07:00
Andriy Roshchenko
4a8a1befd5 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)
* Add CMakePresets configurations.

* Add binary elementwise ConvScaleAdd and an example.

* Numerical verification of results.

Observed significant irregularities in F8 to F32 type conversions:
```log
ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
```

* Implemented ConvScaleAdd + Example.

* Add ConvScale+Bias Instances

* Add Client Example for ConvScale+Bias

* Fix number of bytes in an example..

* Cleanup.
2024-07-24 15:49:55 -05:00
Haocong WANG
8c90f25be3 [GEMM] F8 GEMM, performance optimized. (#1384)
* add ab_scale init support

* enabled interwave

* add scale type; update isSupport

* adjust example

* clean

* enable f8 pure gemm rcr ckprofiler

* Add gemm_multiply_multiply instances

* clang format

* Optimize for ScaleBlockMNK=128

* enable abscale f8 gemm ck profiler

* Add pure f8 gemm test suite

* Reverting to the state of project at f60fd77

* update copyright

* clang format

* update copyright

---------

Co-authored-by: root <jizhan@amd.com>
2024-07-19 22:06:52 +08:00
Qianfeng
ee768148f0 Replace the using of __expf by __ocml_exp_f32 to work-around the test_softmax_rank4 failure (#1394) 2024-07-17 09:15:05 -07:00
Andriy Roshchenko
802a8a1df1 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386)
* Add CMakePresets configurations.

* Add ConvScale+ReLU Functor and an Example

* Account for ReLU FLOPs.

* Add instances of 3D convolutions with ConvscaleRelu operation.

* Implement Client Example

* Cleanup
2024-07-16 08:51:49 -07:00
ThruptiRajLakshmanaGowda
0162a5f6ba Adding Missed Activation Functions for Grouped 2D/3D Convolutions (#1348)
* Initial Push

* First Push

* Fixed Clang format

* Resolve merge conflict

* Addressed review comments

* Addressed review comments

* Addressed review comments
2024-06-20 09:24:54 -05:00
Rostyslav Geyyer
ce66277a76 Add a convinvscale op, related instances and examples (#1307)
* Update the element op

* Add an example

* Add instances

* Add a client example

* make sure new instances only build on gfx9

* Update element op and its handling

* Format

* Update instances to take element op as an argument

* Update examples to use random scale values

* Format

* Update client example with random scales

* Format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-06-10 14:48:49 -05:00
Rostyslav Geyyer
cb0645bedc Add a scale op, related instances and examples (#1242)
* Add a scale op

* Update the element op

* Add instances

* Add an example

* Add a client example

* Add a flag check

* Revert flag check addition

* Fix flag check

* Update d strides in example

* Update d strides in client example

* Apply suggestions from code review

Update copyright header

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Move the example

* Move the client example

* Update element op

* Update example with the new element op

* Add scalar layout

* Update example

* Update kernel for scalar Ds

* Revert kernel changes

* Update element op

* Update example to use scales' pointers

* Format

* Update instances

* Update client example

* Move element op to unary elements

* Update element op to work with values instead of pointers

* Update instances to take element op as an argument

* Update examples to use random scale values

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-06-04 19:28:15 -05:00
Rostyslav Geyyer
f044ff71fb Add element op (#1259) 2024-04-26 12:55:45 -05:00
zjing14
0d0150db20 bf16A_Int8B with fastgelu/bias (#1264)
* changed the copy function to v7r2

* adding multi_abd

* in-progress

* add post-load oob check

* debugging

* adjust instances

* add run_lds

* add elemntwise_op

* replace multi_abd_device with v3

* clean up

* clean

* clean

* Added LDSType

* profiling

* adjust oobcheck

* add missing file

* refactor

* clean

* add examples
2024-04-26 07:26:30 -05:00
zjing14
12865fbf28 Added Multi_ABD support into Gemm and GroupedGemmFixedNK (#978)
* added an example grouped_gemm_multi_abd

* fixed ci

* add setElementwiseOp

* changed API

* clean code: add multiA into example

* fixed v7r2 copy

* add transpose

* clean

* fixed vector_load check

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update example/15_grouped_gemm/grouped_gemm_multi_abd_xdl_fixed_nk_bias_fp16.cpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/grid/gridwise_gemm_multiple_abd_xdl_cshuffle.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* add reduce

* testing

* add example_b16_i8

* refactor example

* clean

* add mpading

* disable reduce for kbatch = 1

* seperate reduce device op

* add reduce op

* add guard for workspace_size

* add instances

* format

* fixed

* add client example

* add a colmajor

* add instances

* Update cmake-ck-dev.sh

* Update profile_gemm_splitk.cpp

* Update gridwise_gemm_xdlops_v2r4r2.hpp

* format

* Update profile_gemm_splitk.cpp

* fixed

* fixed

* adjust test

* adjust precision loss

* adjust test

* fixed

* add bf16_i8 scale bias

* fixed scale

* fixed scale elementwise_op

* revert contraction deviceop changes

* fixed

* Add AddFastGelu

* Revert "Merge branch 'jizhan/gemm_splitk_reduce' into grouped_gemm_multi_abd_fixed_nk_example"

This reverts commit 3b5d001efd, reversing
changes made to 943199a991.

* add Scales into elementwise

* add gemm_multi_abd client example

* add client examples

* add rcr and crr

* add grouped gemm client example

* add grouped gemm client example

* add instance for rcr crr

* format

* fixed

* fixed cmake

* fixed

* fixed client_example

* format

* fixed contraction isSupport

* Update include/ck/tensor_operation/gpu/device/device_grouped_gemm_multi_abd_fixed_nk.hpp

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

* Update device_reduce_threadwise.hpp

* clean

* Fixes

* Fix example

---------

Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
2024-04-15 21:09:45 -05:00
Haocong WANG
f83e9701e9 [GEMM] Gemm universal device operation (#1154)
* Optimize GEMM on MI200/300:
1. Add new blockwise gemm pipeline
2. Add irregular splitk intances

* clang format + typo fix

* Fix a bug

* initial commit

* Add more instances to irregular splitk

* blkgemm pipeline v1~4 prototype

* Sanity Checked. Known issue:
1. Poor performance of splitk
2. Register spill on blkgemmpipeline v3

* Sanity and Performance fix:
1. fix a bug related to sanity in grouped b2c mapping
2. fix a bug related to sanity and performance in splitk offset

* Sanity and API update:
1. Remove prefetch stage
2. Fix valid check bug
3, Add first gemm_universal instance into ckProfiler

* Add NN instances for gemm universal

* 1. Add NT instances for gemm_universal
2. Fix a bug about Kpadding in gemm_universal

* Fix a bug regarding padding Odd K number

* remove kernel print

* Fix KPadding bug...

* Update safety check

* another try to fix kpadding..

* Sanity checked

* new instances..

* clang format+typo fix

* remove clang format script's change

* Add non-hotloop compile option

* 1. Add fp16xfp8 example
2. pull packed convert f8 from pr1150

* Some miscs.. opt and fix

* Add pipeline description docs

* Split universal gemm instance library to cut profiler compiling time

* uncomment cmakefile

* Fix a bug caused by blockwise_gemm_pipe_v2

* reduce default splitk to 1

* Add 224x256x64 tile size

* update, including:
1. Experiment pipeline 5~7
2. Optimization for pipeline 4
3. Organized instance library

* temp save

* temp save

* Permuted lds layout, sanity and function checked

* clang format

* Move OOB check from RunRead to RunWrite, for better software pipeline.
TODO: agpr spill when NN layout

* clangformat

* A/B splitpipe scheduler for v3

* Fix two bugs

* bug fix

* fix a bug in oob check

* Example for mixed fp16_fp8 gemm

* Clean experimental code blocks

* Add mixed precision gemm into profiler

* tempsave

* optimize m/n major lds layout

* Add RRR GEMM  mixed precision instances

* Optimize f8 matrix transpose

* Add test_gemm_universal

* A/B spilt schedule for blkpip v5

* Take ds_read2 into iglp scheduling scheme

* format

* fixed cmake

* Add llvm-option into CI cmake flag

---------

Co-authored-by: Jing Zhang <jizhan@amd.com>
2024-04-13 21:03:18 -05:00
Bartłomiej Kocot
9a194837af Introduce combined elementwise ops (#1217)
* Introduce combined elementwise ops

* Introduce refrence elementwise
2024-04-02 17:23:49 -05:00
Bartłomiej Kocot
285251768e Add conv fwd/bwd data scale instances, extend bilinear instances (#1178)
* Add conv fwd/bwd data scale instances

* Fix cmake client example file

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-03-13 23:09:08 +01:00