Commit Graph

2134 Commits

Author SHA1 Message Date
aska-0096
414cad667b Add XOR fold strategy for hdim<128, but perf dropped; disabled by default pending further perf debugging 2025-08-05 07:23:51 +00:00
aska-0096
0d12fc944f Add v_permlaneb32 for block_reduce. Disable it as it will cause un-coexecutable packed math in FA 2025-08-04 10:27:42 +00:00
aska-0096
4f31847de1 add vmcnt guard before load ktile 2025-08-04 10:02:17 +00:00
aska-0096
746f4ccb99 Load Q through lds, implement xor; 2025-08-04 06:49:01 +00:00
aska-0096
2d4e73d2b4 small refactor 2025-08-01 10:44:54 +00:00
aska-0096
a28b6e67fe upgrade prefill pipeline; simple iglp; consistent data produce and consume order 2025-07-31 10:25:37 +00:00
aska-0096
75cba48682 enable larger tile size; upgrade xor pattern 2025-07-31 05:13:27 +00:00
aska-0096
69890afc98 remove all lds bank conflicts with xor layouts 2025-07-30 12:25:33 +00:00
aska-0096
8dacc35c4c enable prefill overload operator(). 2025-07-30 03:51:06 +00:00
aska-0096
13bcc913de fix the performance regression caused by lds alignment 2025-07-25 07:10:01 +00:00
aska-0096
af28123cec remove unnecessary features 2025-07-23 09:05:57 +00:00
aska-0096
14e0ab70c6 tempsave. asynccopy+trload sanity checked 2025-07-22 08:04:05 +00:00
aska-0096
1b468bac0b tempsave, trload+asyncload done 2025-07-21 05:55:55 +00:00
aska-0096
afd96d8180 compile pass 2025-07-18 10:04:34 +00:00
aska-0096
5616551115 Merge branch 'develop' of https://github.com/ROCm/composable_kernel into wip-async-tr-fa 2025-07-18 05:17:27 +00:00
aska-0096
ae39c84f55 tempsave 2025-07-18 05:16:39 +00:00
Linjun-AMD
095393276a h_dim256 fmha use async_qr pipeline (#2510) 2025-07-18 09:59:38 +08:00
Thrupti Raj Lakshmana Gowda
0f3083ab5c [CKTILE] Layout Support for CK Tile engine (#2482)
* Updating runtime log message for CK TILE ENGINE

* CKTile layout from config

* CKTile custom config for CI

* Documentation for Layout Changes

* CKTile Layout changes to Jenkins

* Fixing Clang Format

* Changes to Jenkins file to fix error

* fix(cmake-ck-dev): no longer sets invalid values as gpu arch

* style(py files): ruff formatting

* fix(cmake-ck-release): no longer sets invalid values as gpu arch

* chore(cmake-tile_engine): add reminder to uncomment user config json

* Changes to Jenkins file to address more cases

* Changes to Jenkins to fix Error

* Changes to Jenkins file for fixing an error

* Update Jenkinsfile (#2517)

* Update Jenkinsfile

---------

Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-17 12:19:41 -07:00
Emily Martins
c08986b026 Tests for CK Tile Batched Transpose and Smoothquant (#2453)
* Create tests for ck tile batched transpose using example

* Create ck tile tests for smoothquant using examples

* fix precision input strings and convert batched transpose to regression tests

* Code cleanup and fix asserts

* add missing licenses

* update copyright and licensing in files

* Update smoothquant tests to use example's smoothquant.cpp

* Add custom target for batched transpose tests

* Add missing new lines at end of files for CMakelists

* fix typo in batched transpose CMakeList target_compile_options

---------

Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
2025-07-17 09:53:34 -06:00
Mateusz Ozga
7fc000d7b3 Fix CI clang-format (#2521) 2025-07-17 14:41:29 +02:00
aska-0096
94b6430489 temp save 2025-07-17 10:06:09 +00:00
aska-0096
7e330553dc Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into fa_decode_pipeline 2025-07-17 07:24:32 +00:00
slippedJim
05b65d0c7c update (#2519) 2025-07-17 15:24:19 +08:00
Haocong WANG
28072adc3a fix mfma32x32 dispatch (#2490) 2025-07-17 15:24:12 +08:00
Yi DING
f1d8ad2818 [CK_TILE] Use read_tr in universal gemm (#2436)
* Use read_tr in universal gemm

* Enable all instances back

* Revert example37 changes

* Resolve comments

* resolve comments 2

* Fix assertion msg

* fix the gemm basic

* change index_t to bool for preshuffle variable

* Solve the comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
2025-07-16 23:56:22 -07:00
Khushbu Agarwal
579bd73435 Fixing numerical error, and interchange preshuffle configs to match with flatmm (#2515) 2025-07-16 22:33:03 -07:00
aska-0096
804f77dce5 move test_copy into test 2025-07-17 03:10:46 +00:00
aska-0096
21627d7ca7 remove unnecessary output 2025-07-17 02:41:31 +00:00
aska-0096
287792c44a Merge branch 'test_copy_fix' of https://github.com/ROCm/composable_kernel into test_copy_fix 2025-07-17 02:26:13 +00:00
aska-0096
a4221db304 add input validation and bug fix 2025-07-17 02:26:10 +00:00
Po Yen Chen
722c22fb15 Revert "Eliminate warning caused by failed to meet occupancy requirement (#2389)" (#2514)
This reverts commit b2dea90116.
2025-07-17 10:09:01 +08:00
linqunAMD
fbd9f32abe [CK][CONV] Support NCHW in class DeviceGroupedConvBwdDataMultipleD_Xdl_CShuffle_v1 (#2459)
1. Port NCHW support from ConvFwd (#2375) to conv bwd data
2. Add new instance device_grouped_conv_bwd_data_xdl_f16_nchw_instances for nchw

Co-authored-by: azhuang <anzhong.huang@amd.com>
2025-07-17 08:19:57 +08:00
Max Podkorytov
21fd7e9538 Merge branch 'develop' into test_copy_fix 2025-07-16 11:23:57 -07:00
linqunAMD
6e76b82059 Fix build errors on windows (#2456)
* Fix build errors on windows

* correct clang format

---------

Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com>
2025-07-16 07:58:23 -07:00
Illia Silin
a4bf78ac0e replace obsolete warpSize system variable with the new one (#2496) 2025-07-16 07:39:15 -07:00
Illia Silin
f5d1e3fa48 Use a clang20 compiler for gfx950 builds. (#2504)
* update docker tag for gfx950 ci build

* update compiler path for gfx950 ci build

* suppress compiler path override for gfx950

* clean up
2025-07-16 07:37:53 -07:00
aska-0096
d6df7bf851 fix vmcnt shift 2025-07-16 08:55:50 +00:00
aska-0096
40e039e4e4 Improve s_waitcnt_imm calculation 2025-07-16 08:37:07 +00:00
huaiguxu
c1badfd30c Handle moe_fp8 no-mainloop cases. Suppress no-mainloop check (#2438)
Co-authored-by: felix <felix.li@amd.com>
2025-07-16 15:44:34 +08:00
MHYangAMD
3499fe67ff [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1. Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
aska-0096
c30f8b709b fix the s_waitcnt_imm calculation 2025-07-16 05:39:50 +00:00
aska-0096
ec0a45b29f Merge branch 'develop' of https://github.com/ROCm/composable_kernel into test_copy_fix 2025-07-16 03:57:57 +00:00
aska-0096
e5cc4af808 Add block_sync_lds_direct_load utility 2025-07-16 03:54:33 +00:00
rahjain-amd
6b09f0823e add missing condition for bf16 (#2502)
Without this change, DataType is reported as unknown:
```sh
Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s,
```

After this change:
```sh
Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s,
```
2025-07-15 21:25:56 +05:30
aska-0096
eea58629cf fix async copytest bug 2025-07-15 09:39:03 +00:00
carlushuang
cfe211cc60 [CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need to use local tokens to compute

* support extra chain local_token

* update

* update

* refine some main

* update

* support dispatch_policy

* fix 15 example
2025-07-15 09:42:18 +08:00
Gino Lu
141bf2d54d [CK_TILE] Add pk_fp4 data type (#2422)
* [draft] Add pk_fp4 and test

* Add hw conversion for fp4

* Refine test code and pk_fp4 constructor.

* fix test indent

* modify according to comment.

* fix clang-format

* modify according to comments.

---------

Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-14 20:35:06 +08:00
Andriy Roshchenko
25b359d630 MX GEMM - Add FP6 GEMM Test (#2488)
* Add F6 GEMM MX Test

* Add BF6 GEMM MX Test
2025-07-11 15:32:12 -06:00
Andriy Roshchenko
518dc21ae8 MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example

* Fix BF6 type_convert

* Add type_convert for bf16x6

* Add compare operator to f4x2_pk_t

* Update README for 67_gemm_microscaling

* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Khushbu Agarwal
d239b91fd5 Merge flatmm Operator with universal gemm (#2434)
* Initial commit

* Adding new tile partitioner to flatmm

* intermediate changes

* debugging kernels

* Updating flatmm example to universal gemm example

* updated flatmm kernel to run via gemmKernel

* update universal gemm to incorporate flatmm

* debug

* Fix flatmm call

* Fixing other kernels and tests for API changes

* clang formatted

* fixing gemm tests

* added test for flatmm and simplify kernel arguments

* adding flatmm test

* fix test for flatmm

* simplify gemm kernel with flatmm

* remove flatmm related files

* addressing review comments and code clean up

* resolving empty file

* resolving empty file

* clang formatted

* addressing review comments

* enable persistent kernel for flatmm

* reverted the removed files for flatmm

* reverted the removed files for flatmm

* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example

* some more renames

* clang formatted
2025-07-11 08:27:55 -07:00