2196 Commits

Author SHA1 Message Date
Enrico Degregori
aa7950b72d Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2675)
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: a6f4029276]
2025-08-14 00:21:09 +02:00
JH-Leon-KIM-AMD
4bc6c568bd CSV-driven convolution test pipeline (#2581)
* Add CSV-driven convolution test pipeline

- Add test_grouped_convnd_fwd_dataset_xdl.cpp with CSV reader functionality
- Add complete dataset generation toolchain in test_data/
- Add Jenkins integration with RUN_CONV_COMPREHENSIVE_DATASET parameter
- Ready for comprehensive convolution testing with scalable datasets

* Update convolution test dataset generation pipeline

* add 2d, 3d dataset csv files

* Remove CSV test dataset files from repository

* Update generate_test_dataset.sh

* Fix channel division for MIOpen to CK conversion

* Remove unnecessary test files

* Fix clang-format-18 formatting issues

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: b963478759]
2025-08-13 16:24:34 +02:00
Haocong WANG
1d13107873 fix for aiter consume (#2677)
[ROCm/composable_kernel commit: 3142562c22]
2025-08-13 19:06:22 +08:00
SamiAario-AMD
a2e850acbc Cleanups (#2631)
* Remove some duplicate code in fmha_fwd_appendkv_kernel.hpp

* Simplify two templated operator calls by having the templated types deduced automatically

* Simplify two GemmPipeline calls

* Fix GemmPipelineAgBgCrCompV4::GetName

* Refactor use of ArgParser in CK tile GEMM examples

* Update args in README.md to match the implementation in create_args

* Remove some unnecessary include statements

* Rename two variables

* Factor out common code

* Factor out do_verify

* Add and use type aliases for memory operation integral constants

* In gemm_basic.cpp, use kPadM, kPadN, kPadK, and kBlockPerCu from GemmConfig

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: 28a97865f5]
2025-08-13 10:12:08 +02:00
Haocong WANG
cfd8e1b303 Re-enable optimization for gfx950 fmha fwd (#2671)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

* refactor blockgemm change, isolate to v2;

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: asleepzzz <hanwen.chang@amd.com>

[ROCm/composable_kernel commit: 05a6e92705]
2025-08-13 14:57:43 +08:00
Cong Ma
7a49dd7d7c Preshuffle AQ matrix in block scale gemm (#2624)
* Preshuffle AQ matrix in block scale gemm

* turns the output to fp16. Increase the repetition time.

---------

Co-authored-by: ThomasNing <thomas.ning@amd.com>

[ROCm/composable_kernel commit: 452791a3ba]
2025-08-12 21:32:51 -07:00
Thomas Ning
651512cbc2 Finish the grouped gemm restructure with fp8 data type (#2655)
* Finish the grouped gemm restructure with data type

* restore gemm_utils.hpp

* Update example/ck_tile/17_grouped_gemm/run_grouped_gemm_example.inc

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Comment Addressed

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

[ROCm/composable_kernel commit: 0f42a92fc1]
2025-08-12 18:23:34 -07:00
Thrupti Raj Lakshmana Gowda
1d1d6717d2 GEMM Multi D for CK Tile Engine (#2660)
* Readme for GEMM Multi D

* GEMM Multi D partial Progress

* GEMM Multi D partial Progress!

* CK Tile Engine GEMM Multi D : All Python files generated

* Partial Progress

* Partial Progress

* Partial Progress

* Partial Progress : Incorrect Result

* Partial Progress : Debugging

* Partial Progress : Correct Results

* Partial Progress - Incorrect Results

* Partial Progress - Commenting Passthrough bypass logic

* Changing Passthrough to MultiplyMultiply

* Correct Results!

* Fix and debug the pass through feature

* Sample commit

* Correct Results : MultiplyMultiply

* Code Cleanup

* Removing Failed Instances

* Working code before Unary element support

* Custom Elementwise Function support and working implementation for Mul and Add

* Updating README

* Working for Passthrough

* Review Comments : Minor Fixes

* Review Comments : Minor Fixes

* Readme Updated

* Partial Changes after Rebase

* Working Code : Changes after Rebase

* Updating Jenkins file

* Removing default value changed while testing

* Configuration changes in config files

* Tile Handler changes in GEMM Multi D Tile Engine

* Tile Handler changes in GEMM Multi D Example

* Change log for Gemm Multi D in CK Tile Engine

* Configuration changes in config files

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: 3f57ec3d2d]
2025-08-12 16:05:05 -07:00
Geo Min
0d5324f88d [TheRock CI] Adding TheRock CI gate check (#2648)
* Adding initial TheRock CI

* Adding composable kernel link

* Adding correct repo for rocm-libraries

* Adding entire rocm-libraries checkout

* Adding correct flag

* Adding correct flag for fetch sources

* Fixing git health

* Removing patch

* Removing patching

* Removing manual check

* PR comments

* testing without dist

* Removing test branch

* PR comments

* PR comments

* PR comment

* Adding test_runs_on

[ROCm/composable_kernel commit: 30dafe8281]
2025-08-12 14:13:01 -07:00
joyeamd
5b318ecee5 [CK_TILE]fix ck_tile's moe_sorting example in gfx11 (#2667)
* fix ck_tile's moe_sorting example in gfx11

* fix clang format

---------

Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: 0856b3f4a2]
2025-08-12 12:33:56 -07:00
Illia Silin
5b5ae5f81d fix builds with mainline/staging compilers (#2674)
[ROCm/composable_kernel commit: bbf41b27f2]
2025-08-12 10:23:08 -07:00
slippedJim
8a12cdd385 remove bad pipeline codegen (#2673)
[ROCm/composable_kernel commit: 20288caa2f]
2025-08-13 00:23:40 +08:00
asleepzzz
9161cb590d Revert "Optimize fmha fwd decode & prefill for gfx950 (#2641)" (#2670)
This reverts commit 327bf408dd05b4e4bfb7b72f63f8710f35efa9a4.

[ROCm/composable_kernel commit: 5b39de4bb6]
2025-08-12 20:27:10 +08:00
Haocong WANG
8a20b06f54 Optimize fmha fwd decode & prefill for gfx950 (#2641)
* Fix for fwd/bwd kernel build filter

* fix bwd code

* save an example for __bf16 type

* temp save, waiting for debug

* tempsave, fmha_decode

* temp save, change all instance to 1wave

* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* temp save

* tempsave

* compile pass

* tempsave, trload+asyncload done

* tempsave. asynccopy+trload sanity checked

* remove unnecessary features

* fix the lds alignment caused performance regression

* enable prefill overload operator().

* remove all lds bankconflict with xor layouts

* enable larger tile size; upgrade xor pattern

* upgrade prefill pipeline; simple iglp; consistent data produce and consume order

* small refactor

* Load Q through lds, implement xor;

* add vmcnt guard before load ktile

* Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA

* Add XOR fold strategy for hdim<128, but perf dropped; disable it by default; wait further perf debug

* add __restrict__ to tr load

* merge fa_decode pipeline into fmha_fwd api

* remove unnecessary files; rename some files

* Remove unnecessary changes

* bug fix, clang format;

* remove non-necessary change

* fix clangformat with 18.1.3

* fix bugs

* fix bug

* fix bug on non-gfx950

* fix bugs in gemm

* fix bug in pki4

* tempsave, update the blocksync functions

* change the warp setting for hdim32 fmha fwd

* clang format

* fix conflict. disable all v-col instance for fmha fwd

* Fix the bug

* clang format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>

[ROCm/composable_kernel commit: b7322a521a]
2025-08-12 19:43:14 +08:00
Mateusz Ozga
eab0dc96f6 fix (#2668)
[ROCm/composable_kernel commit: c0c2ded566]
2025-08-12 13:02:10 +02:00
Yi DING
0afd7af89c [CK_TILE] FMHA BWD Decode Pipeline (#2643)
* Fix distr

* Duplicate block_fmha_bwd_dq_dk_dv_pipeline_trload_kr_ktr_vr

* decode 16x16 o2

[ROCm/composable_kernel commit: 8e1eb0c1ee]
2025-08-12 17:02:52 +08:00
Cameron Shinn
b48fe39bf5 Fix num_byte calculations to use nhead_k for K & V size (#2653)
Simple fix just to calculate the number of bytes correctly for what's reported in the output. I was getting 6200 GB/s which is past the SoL of MI300.

Before:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.173 ms, 6.20 TFlops, 6202.95 GB/s
```

After:
```
./bin/tile_example_fmha_fwd -prec=bf16 -b=2 -s=1 -s_k=32768 -h=32 -h_k=8 -d=128 -page_block_size=128 -num_splits=8 -iperm=0 -operm=0 -v=0 -kname=1
[bf16|batch|bshd] b:2, h:32/8, s:1/32768, d:128/128, scale_s:0.0883883, bias:n, p_drop:0, lse:0, squant:0, mask:n, v:r, num_splits:8, page_block_size:128, fmha_fwd_splitkv_d128_bf16_batch_b16x64x64x128x64x128_r1x4x1_r1x4x1_w16x16x16_w16x16x16_qr_nwarp_sshuffle_vr_ps_nlogits_nbias_nmask_lse_nsquant_pagedkv, fmha_fwd_splitkv_combine_d128_bf16_batch_b32_unused_ps_nlse_nsquant, 0.163 ms, 6.58 TFlops, 1644.53 GB/s
```
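
The two bandwidth figures above can be roughly reproduced by hand. The following is a minimal sketch, not the example's actual code: the function names are hypothetical, and it assumes that only the Q, K, V and O tensors are counted, that the V head dimension equals `d`, and that bf16 takes 2 bytes per element. The only difference between the two calls is whether K and V are sized with `h` (before) or `h_k` (after).

```python
def fmha_fwd_num_bytes(b, h, h_k, s, s_k, d, bytes_per_elem, use_nhead_k=True):
    """Approximate bytes moved by one forward pass (Q, K, V, O only; an assumption)."""
    kv_heads = h_k if use_nhead_k else h  # the fix: K and V have h_k heads, not h
    q_elems = b * h * s * d
    k_elems = b * kv_heads * s_k * d
    v_elems = b * kv_heads * s_k * d      # assumes hdim_v == d
    o_elems = b * h * s * d
    return (q_elems + k_elems + v_elems + o_elems) * bytes_per_elem


def gb_per_s(num_bytes, elapsed_ms):
    """Convert a byte count and a time in milliseconds to GB/s (1 GB = 1e9 bytes)."""
    return num_bytes / (elapsed_ms * 1e-3) / 1e9


# Parameters taken from the command line quoted in the commit message.
args = dict(b=2, h=32, h_k=8, s=1, s_k=32768, d=128, bytes_per_elem=2)

bytes_before = fmha_fwd_num_bytes(**args, use_nhead_k=False)
bytes_after = fmha_fwd_num_bytes(**args, use_nhead_k=True)

print(bytes_before, gb_per_s(bytes_before, 0.173))  # K/V sized with h
print(bytes_after, gb_per_s(bytes_after, 0.163))    # K/V sized with h_k
```

With these assumptions the results land within about 1% of the reported 6202.95 GB/s and 1644.53 GB/s; the residual gap is consistent with the elapsed times being printed to only three decimal places.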

[ROCm/composable_kernel commit: 352f87e684]
2025-08-12 13:44:01 +08:00
Yi DING
ef32abc4cc [CK_TILE] FMHA BWD Optimization For GFX950 (#2628)
* simplify fmha_bwd_kernel MakeKargs & dq_dram_window

* simply duplicate

* trload pipeline

* Try two-stage

* add prefetch

* optimize & iglp

[ROCm/composable_kernel commit: 4fde1646e5]
2025-08-12 11:11:55 +08:00
Aviral Goel
4d263f468e feat(copy_kernel): add basic copy kernel example with beginner friendly documentation (#2582)
* feat(copy_kernel): add basic copy kernel example with documentation

* docs(CHANGELOG): Updated changelog

* chore: performed clang format

* Update example/ck_tile/39_copy/copy_basic.cpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* Update example/ck_tile/39_copy/README.md

Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

* fix(terminology): follow amd terms

* extract elementwise copy to a new kernel

* fix(copy_kernel): bug in verification

* add comments about vgpr usage

* lint and nits

* add notes and comments

* print hostTensor via stream

* print hostTensor via stream

---------

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: spolifroni-amd <Sandra.Polifroni@amd.com>

[ROCm/composable_kernel commit: a7badc6ec5]
2025-08-11 10:54:37 -07:00
Illia Silin
3c0626e2c1 enable aiter test_mha in daily CI (#2659)
[ROCm/composable_kernel commit: 6bfef63414]
2025-08-11 09:50:33 -07:00
Yashvardhan Agarwal
78662c8ad5 Fixes to "General 2D Reduction Kernel" (#2535) (#2656)
* fix reduce2d

- revert the combine_partial_results() changes
- remove auto from function def

* clang-format

[ROCm/composable_kernel commit: 191c62967b]
2025-08-11 15:01:33 +02:00
geozhai
ffaa6ba3b3 update CK build instruction step 4 (#2563)
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 1e1ee758fa]
2025-08-11 00:26:13 -04:00
Illia Silin
f5b69dbdc2 remove ck_tile transpose and gemm stages from CI (#2646)
[ROCm/composable_kernel commit: 8613aa1e40]
2025-08-08 10:48:44 -07:00
Illia Silin
45758022ee Add daily AITER tests on gfx942. (#2639)
* add option to select aiter branch, add tests on gfx942

[ROCm/composable_kernel commit: 7ac850ac72]
2025-08-08 09:30:46 -07:00
Max Podkorytov
c8ff442803 [CK-tile] add more tests for batched transpose testing the rectangular block tile sizes (#2634)
* add failing tests

* swap out and reference

* add constraint assert to transpose input distribution

* test both pipelines with rectangular block tile

* print mismatched indices

* add a smaller failing test for old pipeline

* print grid and block

* fill output before operating on it

* swap m/n tile sizes and make one test pass

* add device syncs

* add one more flipped test case

* flip block tile at host arg init

* fix tiles for lds pipeline

* clang-format

* rename tests

* roll back error check

* remove device syncs

* reduce large test case's size

[ROCm/composable_kernel commit: ab26026835]
2025-08-07 16:51:53 -07:00
Sami Remes
a180ebf3e7 [CK_TILE] Enable persistent kernel and tail handler in tile_engine (#2300)
* Enable persistent kernel in tile_engine and use tail handler

* Fix formatting

* Add persistent to default_config.json

* Remove extra newlines and add persistent also to user config

* Reduce instances from default_config.json

* add persistent to benchmark.json and custom_ci_config.json

* changed the config file to have few instances

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: 3c9400471d]
2025-08-07 16:03:49 -07:00
Gino Lu
f21e432ea1 Add e8m0 scaled convert into CK_TILE (#2617)
* first commit

* remove redundant code

* modify according to comments.

* fix type_convert error with scaled_type_convert

[ROCm/composable_kernel commit: 5d6d236b25]
2025-08-07 21:37:28 +08:00
Yi DING
84a17e2f94 [CK_TILE] FMHA BWD Remove Unnecessary Padding (#2550)
* Remove unnecessary pssk

* Add BlockFmhaBwdDQDKDVPipeline wrapper

* Resolve copilot comments & Remove kpad & fix

* Remove spad

[ROCm/composable_kernel commit: b0a97498b0]
2025-08-07 21:24:43 +08:00
Sami Remes
6474014ee7 [CK_TILE] Enable printing more structures in CK-Tile (#2443)
* Add more printing to core cktile

* Revert other changes in static encoding pattern

* Refactor to using a free print() function

* Remove loops and print just the containers

* Print tuple with better formatting, fix sequence compilation

* Add some tests for print utility

* Add print utility header

* Print for static_encoding_pattern

* add buffer_view printing

* Align vector_traits

* Fix formatting

* Lower-case enum strings

Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>

* Remove empty comment lines

* Fix test with lower-case too

* Reduce repeated code in print tests, move helper function closer to type definition, test X&Y

* Add test_print_common.hpp

* add print.hpp in core.hpp

---------

Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: Christopher Millette <63608002+cgmillette@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

[ROCm/composable_kernel commit: ffdee5e774]
2025-08-07 15:45:27 +03:00
Enrico Degregori
9e5f065191 Revert "Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)" (#2637)
This reverts commit 5c58e5593464484b0573ec1b77aa06d36be04d54.

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 21e9983913]
2025-08-07 12:30:08 +02:00
Bartłomiej Kocot
7c838b7fbe Fix clang format after conv changes (#2636)
[ROCm/composable_kernel commit: 54c7e08a2f]
2025-08-07 10:00:09 +02:00
Bartłomiej Kocot
b12eb7db4f Grouped Convolution Forward Infer Bias Bnorm Activ (#2621)
* Grouped Convolution Forward Infer Bias Bnorm Activ

* 3d

[ROCm/composable_kernel commit: 5328b232b2]
2025-08-07 08:36:47 +02:00
Max Podkorytov
40a8c5c855 modernize scripts for running cmake and clang-format (#2503)
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 1824d65758]
2025-08-06 10:15:44 -07:00
Yashvardhan Agarwal
b6f0e98da6 General 2D Reduction Kernel (#2535)
* General 2D Reduction Kernel

* Move the reduction kernel from the example
* Split the code and add the necessary policy, problem, shape files as per ck_tile convention
* Add/modify the headers
* Modified the example to work with the 'new' kernel
* Added tests for the kernel
* N-D reference reduce
* Added support for N-D input with transform to 2D
* Added padding to support various input sized tensors
* Bug fix in the thread buffer constructor
* Some comments to explain the reduce2d block kernel

* comments resolution

* clang-format

* comments resolution

* clang-format

* clang-format

* comments resolution

* clang-format

[ROCm/composable_kernel commit: 4750b293fe]
2025-08-06 15:36:59 +02:00
Adam Osewski
b7659e284a Remove unused lds direct load instruction. (#2573)
This functionality is replaced by amd_async_buffer_load

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>

[ROCm/composable_kernel commit: 2622ff06cb]
2025-08-06 15:16:12 +02:00
Yi DING
65e18fa783 [CK_TILE] Fix FMHA qr_async causing errors in FA (#2627)
[ROCm/composable_kernel commit: 15e8b6ccf7]
2025-08-06 20:04:23 +08:00
Thomas Ning
f466e04554 delete all slp compilation flag in CK Tile (#2625)
[ROCm/composable_kernel commit: 07469142cb]
2025-08-06 00:34:39 -07:00
Illia Silin
69cfe33716 Revert "Reduce build time tile engine (#2579)" (#2623)
This reverts commit 19caeff665a8d9c499e28ff4e1703d1c87602162.

[ROCm/composable_kernel commit: 833ae1d051]
2025-08-05 09:27:55 -07:00
Enrico Degregori
41f3e48f9c Add padding to 1x1Stride1Pad0 conv specialization (grouped conv bwd weight) (#2610)
* Add padding 1x1Stride1Pad0 conv specialization

* Add gridwise checks for conv cshufflev3

* Merge padding with previous transforms

* Apply transform changes for padding to default specialization as well

---------

Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>

[ROCm/composable_kernel commit: 2203b0ddfe]
2025-08-05 15:23:19 +02:00
Thomas Ning
7c9d0d0435 Persistent grouped gemm CompV4 Enablement & Polish (#2605)
* enable the persistent kernel for CompV4

* polish the example and clang format

* fix the non-persistent kernel error

---------

Co-authored-by: ThomasNing <thomasning@amd.com>

[ROCm/composable_kernel commit: cbfecf8d7a]
2025-08-04 23:43:01 -07:00
Max Podkorytov
beff75a8ff fix build for test_ck_tile_fp8 on rhel8 (#2615)
[ROCm/composable_kernel commit: 2a78da4708]
2025-08-04 17:43:15 -07:00
Illia Silin
7e796b7861 fix test_mx_mfma errors (#2614)
[ROCm/composable_kernel commit: fb96b49666]
2025-08-04 11:43:47 -07:00
rahjain-amd
0bc4627b73 Fix Debug Build for ckProfiler (#2609)
Problem
=======
relocation R_X86_64_32 out of range: 5405348154 is not in [0, 4294967295]

Solution
========
The problem is caused by a limitation of the 32-bit offsets
used in the original DWARF standard.
Switching to 64-bit offsets for the libraries frees
us from the 4 GB size boundary.

add -gdwarf64 and -Og to avoid this limit.
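
As a quick sanity check on the numbers in the error message, the sketch below (not part of the commit; the variable names are illustrative) confirms that the reported value exceeds the largest unsigned 32-bit offset and would fit comfortably in a 64-bit one.

```python
# Value and upper bound copied from the linker error quoted above.
reloc_value = 5405348154
max_u32 = 4294967295

# The upper bound is the largest unsigned 32-bit integer.
assert max_u32 == 2**32 - 1

fits_in_32_bits = reloc_value <= max_u32
fits_in_64_bits = reloc_value <= 2**64 - 1
overflow = reloc_value - max_u32  # how far past the 32-bit limit the value is

print(fits_in_32_bits, fits_in_64_bits, overflow)
```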

[ROCm/composable_kernel commit: 59245df46d]
2025-08-04 11:28:09 -07:00
Jinchao Xu
d2525548c6 Add -gsplit-dwarf flag to reduce debug section size and fix ckProfiler link errors (#2611)
Resolves R_X86_64_32 relocation out of range errors in grouped conv2d instances
by splitting debug information into separate .dwo files.

Add explicit cast to avoid signed/unsigned comparison warning.

[ROCm/composable_kernel commit: 15eb493152]
2025-08-04 11:26:08 -07:00
Bartłomiej Kocot
c903850b65 Mark non-grouped convolutions instances as deprecated (#2595)
* Mark non-grouped convolutions instances as deprecated

* Update CHANGELOG.md

Co-authored-by: John Afaganis <john.afaganis@amd.com>

* Update library/src/tensor_operation_instance/gpu/conv1d_bwd_data/device_conv1d_bwd_data_xdl_nwc_kxc_nwk_bf16_instance.cpp

Co-authored-by: John Afaganis <john.afaganis@amd.com>

---------

Co-authored-by: John Afaganis <john.afaganis@amd.com>

[ROCm/composable_kernel commit: 8655ba989c]
2025-08-04 16:49:55 +02:00
Max Podkorytov
b13f01345d remove std::format (#2604)
[ROCm/composable_kernel commit: 0d9439760f]
2025-08-01 19:22:07 -07:00
Illia Silin
8703165b46 remove std=c++17 compiler flag (#2603)
[ROCm/composable_kernel commit: b786d12e56]
2025-08-01 16:18:16 -07:00
Max Podkorytov
dd4a259904 [CK-tile] remove old ck-tile transpose test (#2591)
* remove old ck-tile transpose test

* rename test exe for consistency

* replace batched transpose regression test

[ROCm/composable_kernel commit: f36cb5b2aa]
2025-08-01 14:50:09 -07:00
Thomas Ning
aa2f9b4c73 Reduce build time tile engine (#2579)
* Modify CMakeLists to allow for splitting.

* Modify CMakeLists for data and layout logic.

* Run tests and get build artifact.

* Test new Cmakelists for speedup.

* Further improvements for speedup.

* turn off the FMHA

* turn off the automatic tile engine gemm

* minor fix

* disable the transpose test first

* Address the comment

* Jenkinsfile

* change the make thread to 64

* change the compile thread to 32

* Try to use with less OS memory space

* Have the Unity build batch size to 2

* reduce the chunk size

---------

Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>

[ROCm/composable_kernel commit: e5b79b26fa]
2025-08-01 14:42:33 -07:00
Illia Silin
6fc0a709dc update the switch condition for buffer built-ins (#2602)
[ROCm/composable_kernel commit: 788e8a878e]
2025-08-01 14:30:07 -07:00