Commit Graph

1754 Commits

Author SHA1 Message Date
Bartłomiej Kocot
fdaff5603e Add grouped conv bwd wei merged grouped instance for larger filter (#1984)
* Add grouped conv bwd wei merged grouped instance for larger filter

* Update readme
2025-03-18 16:16:24 +01:00
Illia Silin
1342ecf7fb Add a daily CI build on gfx908. (#1987)
* add one daily ci build on gfx908

* add redis invocation tag for gfx908

* make ci build for gfx908 conditional

* fix groovy logic

* add option to run perf tests for gfx908

* disable a few tests on mi100
2025-03-17 18:08:53 -07:00
Illia Silin
07f25186b2 disable ck_tile basic gemm (#1986) 2025-03-17 15:26:43 -07:00
aledudek
5095906975 Async grouped gemm v3 (#1940)
* Fully async grouped gemm

* Remove commented code

* Remove maybe_unused

* host kernel args

* Checkpoint segfault debugging...

* Working part1

* Working part2

* Remove comments...

* Use void ptr for gemm kernel host args

* Fix device_grouped_gemm_multiple_d_dl build issue

* Fix device_grouped_gemm_xdl build issue
2025-03-17 16:42:43 +01:00
Bartłomiej Kocot
c2e4898b4b Grouped conv bwd data NGCHW (#1967)
* Grouped conv bwd data NGCHW

* fixes

* fix

* Improvements

* Fix

* Fix

* add client example
2025-03-17 13:32:00 +01:00
valarLip
52b1cd7780 hotfix fmoe build issue (#1976) 2025-03-13 15:11:59 +08:00
dependabot[bot]
de7a745ca6 Bump rocm-docs-core from 1.17.1 to 1.18.1 in /docs/sphinx (#1977)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.1 to 1.18.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.1...v1.18.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-12 23:36:36 -07:00
carlushuang
3e81279d26 Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" … (#1971)
* Reapply "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)

This reverts commit 8cbcd3e0d0.

* fix codegen problem

* Update config.hpp

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-13 11:41:39 +08:00
Illia Silin
d4a6d69643 disable tests that take too long to build for gfx90a (#1975) 2025-03-12 17:54:03 -07:00
feli
251afab3b7 ck_moe: fix useless code and remove useless oob (#1972)
* fix useless code and remove useless oob

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-12 09:22:42 -07:00
Illia Silin
4c97cc511e use old intrinsics with staging compiler (#1970) 2025-03-12 07:29:09 -07:00
Illia Silin
8cbcd3e0d0 Revert "[CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)" (#1969)
This reverts commit 7a93b16ff6.
2025-03-11 10:40:18 -07:00
Haocong WANG
cbd74c2d12 [Block Scale GEMM] Optimized block scale gemm (#1950)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

* split weight preshuffle library targets

* bring back enable-post-misched=0

* fix build issues for gemm_multiply_multiply_fp8 instances

* fix clang format

* add verbose build flag when building for all targets

* reduce path names for new instances

* fix paths in cmake

* refactor gemm_multiply_multiply library target

* fix a bug in example

* fix example 65 cmake

* reduce the number of threads when building libs for all targets to 50

* use ninja to build for all targets

* reduce the number of threads when building for all targets

* reduce the number of threads to 32 when building libs for all targets to 50

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-11 10:11:21 -07:00
Haocong WANG
ba209b9dab reduce test size to avoid timeout on specific silicon (#1966) 2025-03-11 09:15:26 -07:00
Illia Silin
aa42c3db06 disable example_moe_gemm2_xdl_pk_i4 on gfx950 (#1968) 2025-03-11 08:34:47 -07:00
carlushuang
7a93b16ff6 [CK_TILE] support hdim=192/128 pair for deepseekv3 (#1961)
* support hdim=192/128 pair

* remove useless print

* update
2025-03-11 21:07:40 +08:00
Mingtao Gu
0db7c8f0b2 Ck int4 moe develop (#1949)
* Add Gemm fp8xint4 example and kernel, function pass.

* Init Gemm_fp8xint4 Bpreshuffle

* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet

* General fix.

* fp8xint4 bpreshuffle function pass

* fix.

* init b preshuffle dequant in VGPR.

* fix bug, function pass.

* move b thread dequant copy to blockwise.

* fix bug, function now passes.

* modified the tile size to 256, 128x128x128.

* fixed a bug.

* Initial int4 moe, compile pass, function not check.

* fix bug in moe_gemm1.cpp, now function pass.

* test expert = 8 and function pass.

* Added moe_pk_i4_gemm2, function pass.

* Added b preshuffle pipeline v3 support.

* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.

* Split the blockwise pipeline for fp8xint4.

* commit missing files

* opt gemm2 to 2x2 wave

* fix swizzle = false

* update int4 moe with latest input changes.

* update tile size.

* enable pipeline v3.

* fix nswizzle = true

* commit a version for compiler debug.

* Updated transfer_v3r1_gather to support pk_i4_t type.

* for int4 moe2 for type_convert support.

* remove some values between mfma instructions.

* fix int4 moe

* Updated transfer_v3r1_gather to support pk_i4_t type.

* i4 support lds multiple shuffle

* fixed int4 moe tflops calculation.

* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle

* updated gemm2.

* change int4 moe example names

* fix and format code.

* format.

* format codes.

* update fp8xint4 example tile size.

* add <unordered_map> header

* fixed.

* format.

* Added conditional compilation for int4 -> fp8 conversion kernels

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-10 11:16:44 +08:00
Thomas Ning
c954bd0cfa Add the instance of MBlock=144 for GemmMultiplyMultiply (#1955)
* tempsave, not selected

* finish the feature and merge with develop

---------

Co-authored-by: aska-0096 <haocwang@amd.com>
2025-03-07 13:44:06 -08:00
Thomas Ning
9d51d17dd0 Fix on the error (#1956) 2025-03-07 13:43:52 -08:00
Illia Silin
0e8e711ec8 add missing headers (#1959) 2025-03-07 11:11:30 -08:00
Max Podkorytov
9e132eb77c refactor ck-tile kernel launch (#1925) 2025-03-07 08:29:40 -08:00
Qianfeng
4f54fa3058 Ck tile/complete k prefetch (#1941)
* Re-implement qr_ks_vs_async pipeline by using kLoadOnce

* Remove last block_sync_lds() in the loop

* Tiny adjustment in qr_ks_vs_async pipeline for better performance

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution for QLoadOnce pipeline

* Use LDS as intermediary stop when loading Q from global memory for qr_ks_vs_async pipeline

* Use un-rolled gemm for Gemm-0

* Use k0_loops small tile load/store to replace the big tile load/store for K

* Remove the commented lines in qx_ks_vs_custom_policy.hpp

* Tune the prefetching of V in qr_ks_vs_async pipeline

* Move the codes for storing the first v_lds tile some later

* Let BlockDropout reuse LDS with V

* Switch to separate code blocks according to iteration index

* Interleave code blocks for better performance

* Move clear_tile(s_acc) for better interleaving

* Move code interleaving

* Use MakeQDramTileDistribution for q_dram_window

* Roll-back to load Q directly from global memory instead of using LDS as intermediary stop

* Let V reuse the LDS of K

* Use array of tiles to represent Q in vgprs

* Use QLoadOnce == false for qr_ks_vs_async pipeline

* Special treatment for hdim-96 to save vgprs in qr_ks_vs_async pipeline

* Define statically indexed array k_lds_windows[] to reduce the using of get_slice_tile()

* Move the definition of v_tiles out from the loop

* Define statically indexed array v_lds_windows[] to reduce using of get_slice_tile()

* Remove using KLoadOnce in qx_ks_vs_custom_policy

* Remove un-used get_slice_tile() call

* Move the code line of clear_tile(s_acc)

* Tune the lines of codes to make them more tidy

* Re-arrange the codes before the main-loop

* Add comments

* Unify the alignment to be 8 for Q/K/V Lds descriptors

* Tuning to K pre-loading

* Tune K Lds and V Lds reuse for kPreloadWholeNextIterationK == false

* Adjust the pipeline codes

* Use NumPrefetchV to separate from NumVLdsBuffers

* Tune the location of a scheduler barrier code line

* Prefetch first v_tile at earlier time for both kPreloadNextWholeIterationK true/false paths

* Adjust the using of kPadSeqLenQ and kPadSeqLenK in the kernel

* Use __builtin_amdgcn_sched_barrier(0x7f) in the pipeline

* Move the location for store_tile() of first v_tile

* Rename the qr_ks_vs_async pipeline to qr_ks_vs_whole_k_prefetch pipeline

* Re-add NumPrefetchK as template for BlockFmhaPipelineQXKSVSCustomPolicy<>

* Try to fix old bugs in qx_ks_vs_custom_policy

* Remove K_LDS_LOAD_USE_OFFSET_TRANSFORM code-path to make qr_ks_vs_async and qx_ks_vs_custom_policy simpler

* Fix in MakeKDramTileDistribution() in qx_ks_vs_custom_policy

* Update to LdsBufferSequence and introduce NumKVLdsBuffers for max(NumPrefetchK, NumPrefetchV)

* Tiny Fix (#1888)

* Ck tile/paged attention workaround (#1894)

* Correction in GetRangeAlongX()

* Work-around to solve the failures in test_paged_attention_ck in xformers

* Tiny code adjustment in the qr_ks_vs_whole_k_prefetch pipeline

* Remove one call of move_tile_window for q_dram_window

* Refine the codes in GetNumPrefetchV()/GetNumKLdsBuffers()

* Tiny fix in qr_ks_vs_whole_k_prefetch pipeline

* Adjust the location of codes for storing the first V tile to LDS

* Tiny fix and add comments

* Change GetSmemKPackK size to improve performance

* Move the codes related to K-Lds to the pipeline default policy due to some override on the generic custom_policy

* Update MakeKDramTileDistribution() and MakeKLdsDescriptor() to completely remove bank conflicts for K-Lds access

* Adjustment in intermediate iteration codes for tiny performance improvement

* Reduce the number of VLds buffers to 2 for whole_k_prefetch situation

* Use IsFirstKLdsBufferOverlapLastVLdsBuffer() to avoid potential Lds issue

* Adjust the code location for calling IsFirstKLdsBufferOverlapLastVLdsBuffer()

* Remove useless AsyncopyV

* Rename MakeQDramTileDistribution to MakeQRegTileDistribution when LDS is not used

* Keep qx_ks_vs_custom_policy work for other pipelines and move whole_k_prefetch specific codes to whole_k_prefetch default policy

* Recover the qr_ks_vs_async pipeline

* Recover qr_ks_vs_async in fmha.hpp and tiny fix in qr_ks_vs pipeline

* Revert "Try to fix old bugs in qx_ks_vs_custom_policy"

This reverts commit 39b82ca194.

* Tiny fix with regard to whole_k_prefetch pipeline compiling

* Update kPadSeqLenK setting in fmha_fwd_kernel

* Use q_element_func and k_element_func

* Use single q_tile rather than multiple sliced q_tiles

* Codes refine according to the comments

* Re-format one file

* Mark qr_ks_vs_whole_k_prefetch as QLoadOnce == true
2025-03-07 14:19:51 +08:00
Illia Silin
43c90b5234 RE-enable DL and DPP instances by default. (#1954)
* enable DL and DPP instances by default

* fix cmake logic
2025-03-06 21:45:31 -08:00
Max Podkorytov
7a4a5d6c08 Update CODEOWNERS (#1953)
Add @tenpercent
2025-03-06 17:38:29 -08:00
Juan Manuel Martinez Caamaño
0f62dd9928 Fix typo: v_offset used in initialization of v_offset (#1951)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-06 17:37:29 -08:00
kylasa
66c5f5b0b6 Addressing (Post Merge) code review comments for PR 1845 (#1883)
* Addressing code review comments.

* Addressing code review comments.

* Reorganized code for better readability.

* add ck_tile gemms for new types in CI

* fix jenkins syntax

* fix script syntax

* Add the test cases back

* Address the review comments

* Address review comments

* clang format

* Solve the merging issues

* Addressed the comments

* clang format

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-03-06 11:40:30 -08:00
carlushuang
c12fb0a624 [CK_TILE][HOTFIX] WA for address space by disable it completely (#1947)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

* WA for address space by disable it completely

* hot fix moe gemm2

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: feli <felix.li@amd.com>
2025-03-06 12:01:25 +08:00
Illia Silin
a88bf76ecc Replace buffer load/store intrinsics with builtins (#1876)
* replace buffer load/store intrinsics with builtins

* fix clang format

* replace buffer load/store intrinsics with built-ins in ck_tile

* fix clang format

* add switch between buffer intrinsics and built-ins

* change the builtins threshold to clang20

* fix clang format

* fix some compilation errors

* revert changes in ck_tile

* revert changes in ck_tile

* delete all root files and folders when CI completes

* try changing the username in CI

* fix groovy syntax

* add user and group id info to ci dockers

* change ownership of all files in CI to jenkins at the end

* update changelog
2025-03-05 14:33:28 -08:00
Adam Osewski
4814db3905 [CK TILE] Fix KIterPerInnerLoop for block gemm. (#1934)
* Fix KIterPerInnerLoop

* Fix Kpack and KPerInnerLoop for block universal gemm.

* Fix overlooked spelling bugs.
2025-03-05 14:17:44 -08:00
Illia Silin
9b51c08bf7 remove support for gfx940 and gfx941 targets (#1944)
* remove support for gfx940 and gfx941 targets

* update changelog
2025-03-05 11:07:33 -08:00
asleepzzz
d378233924 Update CODEOWNERS (#1945)
* Update CODEOWNERS

Add new code owners

* Update CODEOWNERS

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2025-03-05 10:29:51 -08:00
feli
3786e16375 ck moe gemm implement (#1936)
* port all moe changes from ck_moe_gemm branch

* refine codes in the pr

* fix tail odd

* fix clang format

* fix clang format2

* make hot loop scheduler compatible with 16x16 and 32x32

* clang format

* fix per token quant

* rename moe example

* clang format

---------

Co-authored-by: coderfeli <coderfeli@163.com>
2025-03-05 15:56:55 +08:00
jefyang1
c95bda93ba Remove CK_USE_AMD_MFMA_GFX950 (#1935)
* Add runtime check in example_gemm_xdl_streamk for gfx950

* Add runtime check in grouped conv fwd examples for gfx950

* Disable CK_USE_AMD_MFMA_GFX950

* Add new instances for gfx950

* Fix test_gemm_universal on gfx950
2025-03-04 10:32:25 -08:00
dependabot[bot]
540a6da40b Bump rocm-docs-core from 1.17.0 to 1.17.1 in /docs/sphinx (#1937)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.17.0 to 1.17.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.17.0...v1.17.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-03-03 22:37:30 -08:00
Juan Manuel Martinez Caamaño
57bb0e96a4 Missing _ in __HIPCC__ (#1930) 2025-03-03 08:19:47 -08:00
arai713
fd06ed926c MIGraphX hipRTC fix (#1923)
* fixed hiprtc compilation issues from new additions, removed clashing mixed precision functionality from codegen(ignore the whole file)

* fixed device op error: misplaced header guard

* restrict virtual function use in device_gemm_multiple_d file for codegen hiprtc compilation

* add CK_CODE_GEN_RTC flag for compilation, since this flag has wider coverage for hiprtc compilation

* fixed conditional error in amd_ck_fp8.hpp

* Add MaskOutUpperTriangle as a problem parameter to
BatchedGemmSoftmaxGemm and disable tests with
MaskOutUpperTriangle==True.

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
2025-03-03 07:55:05 -08:00
asleepzzz
ef16010273 Revert "[BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)" (#1933)
This reverts commit 020148d0f7.
2025-03-03 07:17:39 -08:00
Bartłomiej Kocot
1bf29478cd [CK TILE] Fix double lds in ck tile gemm (#1924) 2025-02-28 08:07:53 -08:00
rocking
faa2235dad explicit show no feature in kernel name (#1920) 2025-02-28 14:23:30 +08:00
slippedJim
a9bcd3c98d make fmha bwd api template for v2 & v3 (#1918)
* use template fmha_bwd function

* update

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-02-27 19:26:19 +08:00
Bartłomiej Kocot
0356ee069e [CK TILE] Gemm pk_int4_t permute B (#1907)
* [CK TILE] Gemm pk_int4_t permute B

* Fixes
2025-02-27 11:01:14 +01:00
Bartłomiej Kocot
bf1e17007e [CK TILE] Block universal gemm lds<->vgpr optimizations (#1906)
* [CK TILE] Block universal gemm lds<->vgpr optimizations

* Rebase

* Fixes
2025-02-27 10:36:28 +01:00
rocking
e9ee568681 Apply filter to every kernel in the codegen of FMHA (#1911)
* add receipt for fwd

* Add receipt for bwd

* Use kernel name to avoid more receipt

* apply filter to every kernel
2025-02-26 20:20:29 +08:00
aledudek
c9bcfd755e [CK_TILE] Add EnvLogging and missing gemm args checks (#1896)
* [CK_TILE] Add EnvLogging - refactor IsSupported error messages

* [CK_TILE] Add EnvLogging - wrap gemm kernel error messages

* [CK_TILE] Add EnvLogging - Add missing k_batch args check

* [CK_TILE] Add EnvLogging - remove debug log

* Add one check

* [CK_TILE] EnvLogging - add CK_TILE_ERROR logs

* [CK_TILE] EnvLogging quotes fix

* [CK_TILE] EnvLogging use function instead of macro for err logs

* [CK_TILE] EnvLogging - refactor checking env var
2025-02-25 11:48:38 +01:00
carlushuang
353a612b44 [CK_TILE] add moe-sorting MP kernel (#1910)
* moe sorting ex

* fix bug for race condition

* fix bug and optimize large expert

* fix

* optimize with sub_token_oneshot

* support skip empty tokens for expert sorting

* update moe_sorting

* tidy code

* support mp kernel

* hint mp

* remove useless code

* porting to example 15

---------

Co-authored-by: valarLip <340077269@qq.com>
2025-02-25 17:56:55 +08:00
Haocong WANG
020148d0f7 [BlockScale GEMM] FP8 Blockscale GEMM optimization and ckProfiler (#1913)
* Added two kernel for M=32 problem

* Comment the first one

* Enable multiply_multiply for Scale_Block_M = 1 for deepseek

* Modify the a_thread offset since the A data load is different from B.

* edit fp8 ab scale for Scale_Block_M=1

* edit GemmSpec to MNKPadding

* enable blockwise pipeline v1 and v2. v1 works for small K.

* add instance for gemm_ab_scale

* fix cmakelist of ckProfiler

* optimize blockscale gemm. todo: reduce vgpr usage

* fix a correctness bug

* sanity checked

* revert ckprofiler cmake changes

* clang format

* revert unnecessary changes.

* remove commented codes.

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
2025-02-25 15:42:20 +08:00
coconutruben
fcd4a6f3d1 device_prop.hpp - replace map with compile time hash and switch (#1898)
* device_prop.hpp - replace map with compile time hash and switch

Summary:

We replace a static const map with a compile time hash function
and a switch statement to achieve the same goal: translate names
to architectures. Most of these are very old, however the function
needs to continue to work.

Why? Because the static map can cause issues when compiled into
libraries that get dynamically loaded/unloaded, leading to memory
corruption.

Test Plan:

Running pytorch `torch.compile()` with CK enabled, and seeing it not
segfault on the 2nd kernel (1st reload of the library)

Reviewers:

Subscribers:

Tasks:

Tags:

* clang-format
2025-02-24 09:57:55 -08:00
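The technique summarized in fcd4a6f3d1 (replacing a static const map with a compile-time hash function and a switch statement) can be sketched roughly as follows. This is a minimal illustration, not the code from device_prop.hpp: the FNV-1a hash is a swapped-in choice, and the names `fnv1a`, `arch_from_name`, `Arch`, `"device-a"` and `"device-b"` are hypothetical placeholders.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical architecture tags; the real header translates to other values.
enum class Arch
{
    ArchA,
    ArchB,
    Unknown
};

// 32-bit FNV-1a hash over a NUL-terminated string. It is constexpr, so it
// can be evaluated at compile time for case labels and at run time for the
// lookup key. (Assumption: any constexpr string hash would do here.)
constexpr std::uint32_t fnv1a(const char* s)
{
    std::uint32_t h = 2166136261u; // FNV offset basis
    while(*s != '\0')
    {
        h ^= static_cast<unsigned char>(*s++);
        h *= 16777619u; // FNV prime; unsigned arithmetic wraps modulo 2^32
    }
    return h;
}

// Name-to-architecture translation with no static container behind it:
// nothing is constructed at library load or destroyed at unload.
// Two known names hashing to the same value would produce duplicate case
// labels and fail to compile. An unknown name that happens to collide with
// a known hash would still be mis-mapped; a string comparison inside each
// case would close that gap.
inline Arch arch_from_name(const char* name)
{
    switch(fnv1a(name))
    {
    case fnv1a("device-a"): return Arch::ArchA;
    case fnv1a("device-b"): return Arch::ArchB;
    default: return Arch::Unknown;
    }
}

// The hash of the empty string is the offset basis, checked at compile time.
static_assert(fnv1a("") == 2166136261u, "fnv1a must be usable at compile time");
```

Because the case labels are constant expressions, the lookup needs no global object with a constructor or destructor, which is the property the commit message relies on for libraries that are loaded and unloaded dynamically.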
Andriy Roshchenko
ffa13455a2 MX FP GEMM - Test MX FP8 MFMA Instructions (#1902)
* Refactored `load_A_row_major` to follow scale mapping

* Refactored `load_A_col_major` to follow scale mapping

* Refactored `load_B_col_major` to follow scale mapping

* Verified non-scaled test

* Verified scaled tests

* Used ReferenceMXGemm for verification

* Updated license headers
2025-02-21 13:35:54 -07:00
Mirza Halilčević
3ace125c30 Remove PRIVATE from rocm_install_targets. (#1909)
Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
2025-02-21 10:55:20 -08:00
Illia Silin
68a08c872e Rebase the PR #1520 to ROCm repo. (#1574)
* Implement hiprtc for codegen tests

* Introduce gemm_softmax_gemm to codegen.

* Fix codegen build issues.

* Address PR comments.

* Separate ck_host lib and gemm_softmax_gemm into different PR.

* Fix cmake.

* Replace ENV variable with CMake option for toggling hipRTC in codegen
tests.

* Address PR comments.

* fix clang format

* Add missing header in magic_division.hpp

* - Workaround for hipRTC content wrapper
- Move descriptor for gemm_softmax_gemm to different branch

* Fix formatting.

* Revert "Fix formatting."

This reverts commit b5209eaef4.

* formatting fix

* fixed header guard issues

* updated header guards

* updated data_type for new types

* fixed redefinition error

* Add codegen test for batched_gemm_softmax_gemm.

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>

* formatting fix

---------

Signed-off-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
Co-authored-by: Dino Musić <dino.music@htecgroup.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@htecgroup.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: arai713 <67439843+arai713@users.noreply.github.com>
Co-authored-by: Astha Rai <astha.rai713@gmail.com>
Co-authored-by: Mirza Halilcevic <mirza.halilcevic@amd.com>
2025-02-20 18:58:14 -08:00