242 Commits

Author SHA1 Message Date
lalala-sh
10a288c3a2 opt moe sorting (#2822)
* opt moe storing for 2k

* rm duplicated clear

---------

Co-authored-by: root <root@hjbog-srdc-39.amd.com>
2025-09-18 08:57:20 +08:00
lalala-sh
72d13f3f55 merge felix/sorting 2025-09-10 09:31:50 +00:00
coderfeli
41bc59f5b1 merge nkpad 2025-09-09 10:59:14 +00:00
Feng Shijie
5c484a5672 Add bias for f16xf4 moe_flatmm 2025-08-28 08:02:50 +00:00
Feng Shijie
dd6539f366 update case construction 2025-08-27 13:39:47 +00:00
Feng Shijie
65b702454c support swiglu activaion and use rcpf to accelerate silu 2025-08-26 12:32:29 +00:00
root
65989e940c enable hotloop 2025-08-21 09:46:52 -05:00
Feng Shijie
85976b0b87 use int64_t as expert stride to avoid overflow 2025-08-21 06:58:55 +00:00
Feng Shijie
be55c0f9cb add fp16xf4 moe 2025-08-18 17:28:11 +00:00
Feng Shijie
599e1f5b32 rename example 2025-08-17 17:51:18 +00:00
Feng Shijie
53e8c0c533 Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm 2025-08-13 16:51:49 +00:00
Feng Shijie
5de6208952 update f16xMXF4 2025-08-13 16:16:48 +00:00
Feng Shijie
732ebdee8b update scale-preshuffle for MXF4 2025-08-13 10:48:53 +00:00
Feng Shijie
edb58d0680 update 2025-08-11 11:24:34 +00:00
Feng Shijie
200a11afc8 update scale for mxfp4 2025-08-11 07:59:47 +00:00
Feng Shijie
87aed564dc update case construction 2025-08-11 07:56:14 +00:00
Feng Shijie
8b85fa6cf2 update granularity control 2025-08-11 06:03:06 +00:00
Feng Shijie
1b8c7097b8 fix TileConfig 2025-08-11 03:42:46 +00:00
Feng Shijie
f788d3d629 add mixed_prec fp16xfp4 2025-08-08 20:19:16 +00:00
Feng Shijie
3dea10a277 debug mixed_prec flatmm 2025-08-07 09:22:04 +00:00
Feng Shijie
6d3cbc7c0e add moe_flatmm 2025-08-06 08:33:33 +00:00
Feng Shijie
ac5908c0bb fix wrong config for fp8 scaling 2025-08-01 07:28:38 +00:00
Feng Shijie
3f43b841d4 prune debug message 2025-07-30 06:37:26 +00:00
Feng Shijie
c117a1986a Add persistent option on flatmm for tuning 2025-07-29 15:42:58 +00:00
AMD-dteng
a587701117 update pipeline v1: add atomic IGLP schedule 2025-07-29 14:59:32 +00:00
lalala-sh
f9e48148d2 fix error log throwing 2025-07-29 14:59:18 +00:00
Feng Shijie
1b6d7cf407 crz idea 2025-07-28 08:24:51 +00:00
Feng Shijie
5473f06461 Add permuteN optimzization when NRepeat % 2 == 0 on flatmm 2025-07-27 11:57:38 +00:00
sjfeng
bfb9f4002f try to remove c_shuffle_lds 2025-07-27 17:24:08 +08:00
Mateusz Ozga
b507d889c1 [CK_TILE] Introduces a new GEMM API that splits the existing basic GEMM class into multiple specialized classes. (#2520)
* Init commit new API

* apply clang-format

* PreShuffle preapring

* Apply Preshuffle condition to universal_gemm

* Fix: convert size_t to index_t

* Review changes

* Mode 100755 -> 100644

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2025-07-24 20:39:56 +02:00
Yashvardhan Agarwal
606b0cc947 [CK_TILE] Support for elementwise kernel (#2246)
* Elementwise kernel implementation

Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>

* Elementwise with generalized nDims

* Adding the n-ary input tensor feature

* Generalize dimensions on top of inputs

* Add TFLOPS + remove std usage for tuples

* 1D basecase optimization

* Cleanup code + refactoring to a common interface

* Generalize to unary and add an example

* Cleanup, refactoring and commenting

* Suggestions for LWPCK-3170: elementwise kernel improvements

* Clang-format: remod.py

* Replace InputTensorType with XDataType as the type of input_tensors

* Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs

* Move examples to folder 19_elementwise

* Add missing copyright headers and fix some existing ones

* Replace an assert with throw std::runtime_error in elementwise example

* Avoid reading the output by using make_static_distributed_tensor for y_tile

* Removed two unused includes

* No need to move windows to the next block when each workgroup processes a single tile

* Only copy input tensors to the device

* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example

* Adding output strides to the kernel, transposition example and update the other examples

* Changes made by remod.py

* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp

* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference

* Fix comments in elementwise_example.cpp

- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors

* Simplify CMakeLists.txt and remove the unused variables this uncovers

* Rename a file and fix some copyright statements

* Changes made by script/clang-format-overwrite.sh

* Add basic unit test for ElementWiseKernel

* Remove left-over uninformative comment in apply unit test

* Changes made by clang-format-overwrite.sh

* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view

* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp

* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array

* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally

* Move examples to folder 20_elementwise

* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK

* Fix CLang formating

* Bump up the elementwise example folder number

* Elementwise: add padding + minor cleanup

* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view

* Add isSupportedArg to Elementwise kernel + addapt example and unit tests

* Fix clang-format on the unit test file

---------

Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
2025-07-24 11:21:45 +02:00
lalala-sh
1239d8a546 merge flatmm -scale 2025-07-24 08:46:51 +00:00
jakpiase
6681593864 [CK_TILE] Grouped Convolution Backward Weight Kernel (#2357)
* [CK TILE] Grouped Convolution Forward Kernel

* custom vector size

* fixes

* refactor

* resolved conflicts

* rebase fixes

* fixes

* tmp

* add working support for splitk

* minor fix

* fixes

* fixes

* minor fix

* small fix

* Split K and preprocessing fixes

---------

Co-authored-by: Bartlomiej Kocot <barkocot@amd.com>
2025-07-24 10:41:35 +02:00
lalala-sh
4066454483 revert delete of inc file 2025-07-24 16:19:58 +08:00
solin
68390988c9 reorg flatmm code 2025-07-24 04:38:16 +00:00
Feng Shijie
5a1183ebbd support flatmm scaling 2025-07-23 19:04:22 +00:00
Illia Silin
1b6f024836 refactor fmha_bwd.py (#2546) 2025-07-23 09:09:56 -07:00
valarLip
89fa639207 merge flatmm pipe v0 from dteng_flatmm_opt 2025-07-23 09:50:33 +00:00
lalala-sh
3f7d848dd3 build pass 2025-07-23 15:38:12 +08:00
Haocong WANG
a5fdc663c8 fix async copytest bug (#2509)
* fix async copytest bug

* Add block_sync_lds_direct_load utility

* fix the s_waitcnt_imm calculation

* Improve s_waitcnt_imm calculation

* fix vmcnt shift

* add input validation and bug fix

* remove unnecessary output

* move test_copy into test

* change bit width check

* refactor macros into constexpr functions

which still get inlined

* wrap s_waitcnt api

* parameterize test

* cleanup

* cleanup fp8 stub

* add fp8 test cases; todo which input parameters are valid?

* replace n for fp8 in test cases

* add large shapes; fp8 fails again

* change input init

* test sync/async

* time the test

* clang-format test

* use float instead of bfloat to cover a 4-byte type

* fix logic - arg sections should be 'or'd

* make block_sync_lds_direct_load interface similar to old ck

* fix a few comment typos

* name common shapes

* revert the example to original logic of not waiting lds

* clang-format

---------

Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
2025-07-23 00:14:02 -07:00
Cong Ma
e62710e461 ck_tile kernel for gemm with groupwise quantized A tensor (#2473)
* ck_tile kernel for gemm with groupwise quantized A or B tensor.

This change introduces new pipelines with Intrawave scheduler and block gemm primitives that loads the scale tensor to registers to perform dequantization post MFMA on C tensor in registers.

Scale tensor data, AQ/BQ is spliced across threads in registers and not stored in LDS.

Current support is for the following combinations, but it should be fairly straightforward to extend support to more formats.

1. fp8, fp8 -> f32
2. bf8, bf8 -> f32
3. i4, fp8 -> f32
4. i4, bf8 -> f32

Group size can go down to as low as K length of underlying WarpGemm primitive.

For Gemm problems with quantized B tensor, this change also introduces preliminary support for flatmm pipeline which loads B tensor directly into registers.

* [Block Scale Gemm] Only run gemm quant examples on __gfx94__

- Only run gemm quant examples on __gfx94__ for usage of
  `v_cvt_pk_fp8_f32`
- Format the code

* [Block Scale Gemm] Remove Bquant Gemm BlockScale

This cleanup is in preparation for future development of bquant. By
isolating Aquant-related code, we can streamline the codebase and make
it easier to add and maintain bquant functionality in subsequent
updates.

* [Block Scale Gemm] Format code with clang-format-12

The latest clang-format (v19) in ROCm 7.0 generate different result than
clang-format-12 which is used in CK CI.

Format code with clang-format-12 for consistency.

* [Block Scale Gemm] Split the k direction loop

- Split the k direction loop in block_universal_gemm_as_quant_bs_cr.hpp
   to make the logic clearer.
- Disable C transposition.

* [Block Scale Gemm] Move block scale gemm example to 38_block_scale_gemm

* [Block Scale Gemm] Update copyright

* test

* Add TailHandler

* Move TileDistributionEncodingPatternAQ

* Refactor

* refactor

* fix bug

* fix bug

* help solve the PR comment

* Format the code

* [Block Scale Gemm] Add unit tests

* [Block Scale Gemm] Add support to 16x16x32 MFMA

- Add support to 16x16x32 MFMA
- Fix a bug when exchange data crossing lanes

---------

Co-authored-by: Vijay Krishnamoorthy <vjkrish@meta.com>
Co-authored-by: Cong MA <congma13@ctr2-alola-ctrl-01.amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
2025-07-23 00:10:16 -07:00
lalala-sh
7e1bd4b839 sync 2025-07-23 15:01:53 +08:00
Linjun-AMD
095393276a h_dim256 fmha use async_qr pipeline (#2510) 2025-07-18 09:59:38 +08:00
lalala-sh
9aa3396a79 fix tail handler bug 2025-07-17 08:40:35 +00:00
slippedJim
05b65d0c7c update (#2519) 2025-07-17 15:24:19 +08:00
Yi DING
f1d8ad2818 [CK_TILE] Use read_tr in universal gemm (#2436)
* Use read_tr in universal gemm

* Enable all instances back

* Revert example37 changes

* Resolve comments

* resolve comments 2

* Fix assertion msg

* fix the gemm basic

* change index_t to bool for preshuffle variable

* Solve the comment

---------

Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
2025-07-16 23:56:22 -07:00
Khushbu Agarwal
579bd73435 Fixing numerical error, and interchange preshuffle configs to match with flatmm (#2515) 2025-07-16 22:33:03 -07:00
lalala-sh
fb76450e63 merge from dteng_flatmm_opt 2025-07-16 10:12:19 +00:00
MHYangAMD
3499fe67ff [CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass

* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass

1.  Add BlockReduce2dTreeCrossWarpSync

* Add Rmsnorm2dFusedModelSensitiveEnum

* Update patch

1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce

* Refactor RMSNorm model enum and introduce T5-like option

* Update the n stage for cross warp reduce

* Add new cmdline option in RMSNorm for new pipeline testing

---------

Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
rahjain-amd
6b09f0823e add missing condition for bf16 (#2502)
Without this DataType = unknown -
``` sh
Run Flatmm kernel with DataType = unknown M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.228837 ms, 187.687 TFlops, 341.374 GB/s,
```

after this change
```sh
Run Flatmm kernel with DataType = bf16 M =1280 N =16384 K =1024 StrideA =1024 StrideB =1024 StrideC =16384 : 0.227029 ms, 189.181 TFlops, 344.092 GB/s,
```
2025-07-15 21:25:56 +05:30