395 Commits

Author SHA1 Message Date
Ville Pietilä
99fe3df99a Fix tensor descriptors. 2025-10-03 14:23:04 +00:00
Ville Pietilä
9510171377 WIP: Put back the generic tensor descriptors for convolutions. 2025-10-02 15:06:30 +00:00
Ville Pietilä
c3f0c1a866 Add additional check for non-supported c > 1 case. 2025-09-30 07:46:24 +00:00
Ville Pietilä
db835e065c Make MPerGroup and NPerGroup template parameters. 2025-09-30 07:14:28 +00:00
Ville Pietilä
1a6f602c65 Remove debug code. 2025-09-30 05:53:28 +00:00
Ville Pietilä
193907fd85 Fix case k > 1 and c=1. 2025-09-29 16:02:00 +00:00
Ville Pietilä
558054eadb WIP: Simplify conv to gemm transformations and handle K > 1 and C > 1 cases. 2025-09-26 13:38:24 +00:00
Ville Pietilä
8babf7195a Fix strides in 1D conv to gemm transformation. 2025-09-26 09:38:11 +00:00
Ville Pietilä
354dd5039c Add compile check for assumed row-major layout. 2025-09-26 08:39:39 +00:00
Ville Pietilä
1764c77fb2 Enable running multiple GEMM batches of merged conv groups. 2025-09-26 07:51:29 +00:00
Ville Pietilä
b864c077ed Code clean-up for bwd tensor transformations. 2025-09-25 15:09:08 +00:00
Ville Pietilä
0ea3268d5d Remove debug and other dead code. 2025-09-25 09:41:33 +00:00
Ville Pietilä
cc7433efc6 Add more comments, disable debug code. 2025-09-25 09:37:15 +00:00
Ville Pietilä
97f842f2c6 Fully functional LDS to global mem transfer using tensor descriptor and tile distribution encoding. 2025-09-25 09:30:50 +00:00
Ville Pietilä
625a78b17b WIP: LDS to global mem transfer using CK tile tensor descriptor and tile distribution encoding. 2025-09-24 15:08:01 +00:00
Ville Pietilä
7280df1bc3 Add one more unit test for tensor view. 2025-09-24 12:10:26 +00:00
Ville Pietilä
8048d6ff73 Fix build. 2025-09-23 11:17:08 +00:00
Ville Pietilä
e6f6c4a6a3 Working baseline for depthwise convolution with merged conv groups. 2025-09-23 11:14:10 +00:00
Ville Pietilä
29e3112b9b Epilogue fixes. 2025-09-22 15:38:02 +00:00
Ville Pietilä
d7da3d5089 Offset fixes. 2025-09-22 15:37:46 +00:00
Ville Pietilä
7dfbac5d0b WIP: Separate epilogue for merged conv groups. 2025-09-19 13:52:33 +00:00
Ville Pietilä
af6838e5dc Integration test for CShuffle epilogue. 2025-09-19 12:09:08 +00:00
Ville Pietilä
7f52f84167 Fix tile window size for c block. 2025-09-19 08:08:19 +00:00
Ville Pietilä
6bcdb0947e LDS to global memory copy. 2025-09-18 14:59:32 +00:00
Ville Pietilä
0e09504057 WIP: merged conv groups GEMM epilogue changes. 2025-09-17 14:25:02 +00:00
Ville Pietilä
27a2ceb4f7 Increase the max number of reported errors. 2025-09-17 12:29:12 +00:00
Ville Pietilä
4ec81cb95c Add more logging. 2025-09-17 12:27:51 +00:00
Ville Pietilä
6d318ab481 Enable running multiple conv groups per batch. 2025-09-12 14:03:04 +00:00
Ville Pietilä
0d5c1b9638 WIP: Merged conv groups epilogue. 2025-09-11 15:24:36 +00:00
Ville Pietilä
970b40aa6c WIP: Merged conv groups offset calculation. 2025-09-09 11:33:31 +00:00
Ville Pietilä
d9f0a9cdd0 Fully working conv group merging for TransformConvBwdWeightToGemm. 2025-09-09 09:58:43 +00:00
Ville Pietilä
8845b23254 WIP: Tensor transformations. 2025-09-08 15:41:54 +00:00
Ville Pietilä
61b3c96273 Add number of groups to merge to ck tile grouped gemm example. 2025-09-04 14:24:23 +00:00
Ville Pietilä
2b1908a375 Fix compilation of the grouped conv examples. 2025-09-04 12:01:49 +00:00
arai713
0282d98412 [CK TILE] Stream-K tile partitioner (#2708)
* initial commit for skeleton code

* replaced skeleton code with old streamk b2c map functions from old CK, still need to clean up the code

* fixed up code to match CK Tile convention: data type changes, naming changes, etc.

* change for num_sk_blocks data type

* formatting fix

* minor fixes

* moved reduction argument to template

* resolved comments from PR review: standardizing naming, pruning unneeded code

* resolve errors from merge of device op PR: moved enum to common file

* switching to uint32_t due to implementation constraints: divmod only takes uint32_t and mixing signed and unsigned types causes problems

* unsigned type fix

* add const qualifier

* added documentation for template parameters

* documentation edit
2025-09-03 13:38:17 -07:00
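The note in 0282d98412 about mixing signed and unsigned types can be illustrated with a small standalone sketch. It is not taken from the Stream-K partitioner: `less_mixed` and `less_safe` are hypothetical helpers, and the sketch assumes a platform where `int` is 32 bits wide, so that comparing `std::int32_t` with `std::uint32_t` converts the signed operand to unsigned.

```cpp
#include <cstdint>

// Hypothetical helper (not from CK): what a mixed signed/unsigned comparison
// effectively does. The usual arithmetic conversions turn `a` into uint32_t,
// which is written out explicitly here.
constexpr bool less_mixed(std::int32_t a, std::uint32_t b)
{
    return static_cast<std::uint32_t>(a) < b;
}

// Hypothetical helper (not from CK): a comparison that keeps the mathematical
// meaning by handling negative values before converting.
constexpr bool less_safe(std::int32_t a, std::uint32_t b)
{
    return a < 0 || static_cast<std::uint32_t>(a) < b;
}

// -1 wraps to 4294967295 when converted, so the mixed comparison is false.
static_assert(!less_mixed(-1, 1u), "mixed comparison treats -1 as a huge value");
static_assert(less_safe(-1, 1u), "safe comparison keeps -1 < 1");

// Division shows the same effect: -8 becomes 4294967288 before dividing.
static_assert(static_cast<std::uint32_t>(-8) / 2u == 2147483644u,
              "unsigned division of a wrapped negative value");
```

Keeping every index and count in one unsigned type, as the commit does, avoids this class of surprise without per-call checks.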
msaffari-amd
47d020a993 refactor: use snake_case naming in ck_tile/core components (#2766) 2025-09-03 09:34:11 +02:00
Cong Ma
e1ab460d2d [CK TILE GEMM] Fix building issues (#2772)
- Add `WarpGemmMfma_f32_16x16x128_[fp8|bf8]_[fp8|bf8]_CTransposed`
- Replace `__gfx950__` with `CK_GFX950_SUPPORT`
2025-09-02 22:40:18 -07:00
Po Yen Chen
9f35cde374 [CK_TILE] Fix fmha_fwd_v3() Default2DEpilogue usage (#2765)
* Fix Default2DEpilogue usage

* Fix Default2DEpilogue usage for batch_prefill
2025-09-02 09:51:56 -07:00
Sami Remes
4419fc34a2 Fix formatting problem (#2768) 2025-09-02 14:14:10 +03:00
Michael Mcminn
022f369deb Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGem… (#2751)
* Adding fix for the gfx908 to the GEMM MFMA implementations of WarpGemmMfmaBf16Bf16F32M4N64K16 and WarpGemmMfmaBf16Bf16F32M64N4K16

* Adding support for offload target gfx9-4-generic

* This duplication here isn't ideal
2025-09-02 10:35:07 +02:00
Haocong WANG
33418b201f Fix naming issue (#2762) 2025-09-02 11:18:53 +08:00
Po Yen Chen
d876e87fe4 [CK_TILE] Add FAv3 fwd pipeline (#2731)
* Add FAv3 fwd pipeline

* Unpack v_pk_mul to hide v_mov

* Avoid compiler moving l compute across phase

* Sync sched_group_barrier() setting for masking cases
2025-09-01 09:16:45 +08:00
Aviral Goel
fcff0043ae chore(gemm): clang format to pass CI (#2758) 2025-08-29 00:38:46 -07:00
Vijay Krish
4208e28988 ck_tile kernel for gemm with groupwise quantized B tensor. (#2663)
* This change introduces new pipelines with the Intrawave scheduler and block GEMM primitives that load the scale tensor into registers to perform dequantization after MFMA on the C tensor in registers.

Scale tensor data, BQ, is spliced across threads in registers and is not stored in LDS.

Current support covers the following combinations, but it should be fairly straightforward to extend support to more formats.

fp8, fp8 -> f32
bf8, bf8 -> f32
fp8, i4 -> f32
bf8, i4 -> f32
Group size can be as small as the K length of the underlying WarpGemm primitive.

* Solve merge conflict

* [CK TILE] Update CHANGELOG.md

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@fb.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Cong Ma <congma13@amd.com>
2025-08-28 23:43:02 -07:00
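The idea in 4208e28988 of applying the B scales to the accumulator after the multiply, rather than dequantizing B first, can be sketched on the host in plain C++. This is only an illustration under stated assumptions, not the ck_tile kernel: the function names are hypothetical, matrices are row-major `float` arrays, the quantized B values are assumed to be already converted to `float`, there is one scale per K-group and output column, and K is assumed to be divisible by the group size.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: scale each group's partial sum after accumulation.
// A is M x K, Bq is K x N, scale is (K / group_size) x N, all row-major.
std::vector<float> gemm_bquant_post_scale(const std::vector<float>& A,
                                          const std::vector<float>& Bq,
                                          const std::vector<float>& scale,
                                          std::size_t M, std::size_t N,
                                          std::size_t K, std::size_t group_size)
{
    std::vector<float> C(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t g = 0; g < K / group_size; ++g)
            {
                // Accumulate one K-group on the unscaled (quantized) values.
                float partial = 0.0f;
                for(std::size_t k = g * group_size; k < (g + 1) * group_size; ++k)
                    partial += A[m * K + k] * Bq[k * N + n];
                // Apply the group's scale once, to the partial result.
                C[m * N + n] += scale[g * N + n] * partial;
            }
    return C;
}

// Hypothetical reference: dequantize every B element first, then multiply.
std::vector<float> gemm_bquant_reference(const std::vector<float>& A,
                                         const std::vector<float>& Bq,
                                         const std::vector<float>& scale,
                                         std::size_t M, std::size_t N,
                                         std::size_t K, std::size_t group_size)
{
    std::vector<float> C(M * N, 0.0f);
    for(std::size_t m = 0; m < M; ++m)
        for(std::size_t n = 0; n < N; ++n)
            for(std::size_t k = 0; k < K; ++k)
            {
                const float b = Bq[k * N + n] * scale[(k / group_size) * N + n];
                C[m * N + n] += A[m * K + k] * b;
            }
    return C;
}
```

Both functions compute the same product because the scale is constant within a K-group and can be factored out of that group's sum. The post-scale form does one scale multiply per group instead of one per element.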
Cong Ma
428090f749 Support transposed C tile in Aquant (#2679)
The performance of Aquant has increased after enabling transposed C.

AQ elements no longer need to be exchanged among lanes once transposed C
is enabled, because each thread holds data from only one row.
2025-08-28 13:28:09 -07:00
Mateusz Ozga
0758883fa4 [CK-TILE] Default2DEpilogue, example and adding nullptr_t type for D (#2752)
* Init commit

* Quick fix, CI fails

* Remove CDElementWise

* Add CDEElementWise

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-08-28 12:45:50 -07:00
asleepzzz
038ea82315 Revert "[CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)" (#2757)
This reverts commit ead4447b20.
2025-08-28 22:50:42 +08:00
linqunAMD
4a49dac7c6 [Regression] Fix CK_TILE build error in grouped_convolution, copy_basic and fused_moegemm_kernel (#2728)
* fix copy basic build error

* fix other ck tile test build error
2025-08-28 20:30:30 +08:00
Yi DING
ead4447b20 [CK_TILE] FMHA BWD Enable Tile 16x192 (#2741)
* 16x192

* Use buffer_load_lds for lse/d

* Dispatch & cleanup

* Avoid zeroing dq & fix

* fix
2025-08-28 18:54:18 +08:00
Linjun-AMD
bf7b458e6e use iglp to improve dim256 fmha fwd in qr_ks_vs pipeline (#2711)
* add k_lds padding and iglp to improve dim256 fmha fwd

* Update include/ck_tile/ops/fmha/pipeline/block_fmha_pipeline_qr_ks_vs.hpp

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

* update block_fmha_pipeline_qr_ks_vs.hpp

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* Update block_fmha_pipeline_qx_ks_vs_custom_policy.hpp

* clang format

Signed-off-by: JL-underdog <Jun.Lin@amd.com>

* use same naming style

---------

Signed-off-by: JL-underdog <Jun.Lin@amd.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
2025-08-28 11:39:39 +08:00