Feng Shijie
f7f0306eea
fix gate-up when GU_NRepeat > 1
2025-08-18 18:43:37 +00:00
Feng Shijie
be55c0f9cb
add fp16xf4 moe
2025-08-18 17:28:11 +00:00
Feng Shijie
599e1f5b32
rename example
2025-08-17 17:51:18 +00:00
Feng Shijie
7899fb4a8d
remove additional check when e8m0->float
2025-08-15 06:20:46 +00:00
Feng Shijie
714b341797
eliminate repeat dequant
2025-08-14 09:34:12 +00:00
Feng Shijie
53e8c0c533
Merge remote-tracking branch 'origin/moe_flatmm' into feat-mixed_input_flatmm
2025-08-13 16:51:49 +00:00
Feng Shijie
5de6208952
update f16xMXF4
2025-08-13 16:16:48 +00:00
Feng Shijie
732ebdee8b
update scale-preshuffle for MXF4
2025-08-13 10:48:53 +00:00
Feng Shijie
edb58d0680
update
2025-08-11 11:24:34 +00:00
Feng Shijie
cc9c7b9e58
optimize gemm2 atomic_add pattern
2025-08-11 08:38:23 +00:00
Feng Shijie
200a11afc8
update scale for mxfp4
2025-08-11 07:59:47 +00:00
Gino Lu
8ba1c708dc
Add e8m0 scaled convert into CK_TILE (#2617)
* first commit
* remove redundant code
* modify according to comments.
* fix type_convert error with scaled_type_convert
2025-08-10 07:42:42 +00:00
Feng Shijie
f788d3d629
add mixed_prec fp16xfp4
2025-08-08 20:19:16 +00:00
Feng Shijie
3dea10a277
debug mixed_prec flatmm
2025-08-07 09:22:04 +00:00
Feng Shijie
6d3cbc7c0e
add moe_flatmm
2025-08-06 08:33:33 +00:00
coderfeli
c0cb4d036d
fix split k
2025-08-06 02:45:31 +00:00
Feng Shijie
90e910f3a7
fix flatmm with scaling when WarpTileM == 32
2025-08-04 07:16:36 +00:00
Feng Shijie
aa5e008fa5
optimize scaling epilogue
2025-08-01 11:01:23 +00:00
Feng Shijie
3f43b841d4
prune debug message
2025-07-30 06:37:26 +00:00
Feng Shijie
2e5d4c74cd
fix compile error
2025-07-30 04:52:08 +00:00
Feng Shijie
c117a1986a
Add persistent option on flatmm for tuning
2025-07-29 15:42:58 +00:00
AMD-dteng
a587701117
update pipeline v1: add atomic IGLP schedule
2025-07-29 14:59:32 +00:00
Feng Shijie
1b6d7cf407
crz idea
2025-07-28 08:24:51 +00:00
Feng Shijie
5473f06461
Add permuteN optimization when NRepeat % 2 == 0 on flatmm
2025-07-27 11:57:38 +00:00
sjfeng
bfb9f4002f
try to remove c_shuffle_lds
2025-07-27 17:24:08 +08:00
Feng Shijie
1264f4d2ab
fix loop-dim mismatch and improve c_shuffle alu parallelism
2025-07-25 07:41:48 +00:00
lalala-sh
1239d8a546
merge flatmm -scale
2025-07-24 08:46:51 +00:00
Feng Shijie
b908f5e803
fix flatmm syntax error on gfx950
2025-07-23 19:12:31 +00:00
Feng Shijie
5a1183ebbd
support flatmm scaling
2025-07-23 19:04:22 +00:00
valarLip
89fa639207
merge flatmm pipe v0 from dteng_flatmm_opt
2025-07-23 09:50:33 +00:00
lalala-sh
3f7d848dd3
build pass
2025-07-23 15:38:12 +08:00
lalala-sh
6dacf833da
fix bug
2025-07-23 07:20:26 +00:00
lalala-sh
7e1bd4b839
sync
2025-07-23 15:01:53 +08:00
valarLip
46a538e39e
adaptive scheduler instead of Macro definition
2025-07-23 13:40:43 +08:00
lalala-sh
9aa3396a79
fix tail handler bug
2025-07-17 08:40:35 +00:00
lalala-sh
fb76450e63
merge from dteng_flatmm_opt
2025-07-16 10:12:19 +00:00
MHYangAMD
3499fe67ff
[CK_TILE] Enhance RMSNorm Accuracy: New Pipeline Pass for Selectable Implementation (#2409)
* Add Rmsnorm2dFwdPipelineModelSensitiveT5Pass
* Update rmsnorm2d_fwd_pipeline_model_sensitive_pass
1. Add BlockReduce2dTreeCrossWarpSync
* Add Rmsnorm2dFusedModelSensitiveEnum
* Update patch
1. Reverse generate.py
2. Remove comment in generate.py
3. Update tree cross warp reduce
* Refactor RMSNorm model enum and introduce T5-like option
* Update the n stage for cross warp reduce
* Add new cmdline option in RMSNorm for new pipeline testing
---------
Co-authored-by: Clement Lin <clement.lin@amd.com>
Co-authored-by: ClementLinCF <162283536+ClementLinCF@users.noreply.github.com>
2025-07-16 14:05:26 +08:00
carlushuang
cfe211cc60
[CK_TILE] moe sorting optimize local_token (#2469)
* fix bug in loops that need use local tokens to compute
* support extra chain local_token
* update
* update
* refine some main
* update
* support dispatch_policy
* fix 15 example
2025-07-15 09:42:18 +08:00
Gino Lu
141bf2d54d
[CK_TILE] Add pk_fp4 data type (#2422)
* [draft] Add pk_fp4 and test
* Add hw conversion for fp4
* Refine test code and pk_fp4 constructor.
* fix test indent
* modify according to comment.
* fix clang-format
* modify according to comments.
---------
Co-authored-by: asleepzzz <hanwen.chang@amd.com>
2025-07-14 20:35:06 +08:00
Andriy Roshchenko
518dc21ae8
MX GEMM - FP6 Support in GEMM MX v3 Pipeline (#2481)
* Add GEMM MX BF6 example
* Fix BF6 type_convert
* Add type_convert for bf16x6
* Add compare operator to f4x2_pk_t
* Update README for 67_gemm_microscaling
* Fix host tensor initialization with integer values for FP8
2025-07-11 13:07:05 -06:00
Khushbu Agarwal
d239b91fd5
Merge flatmm Operator with universal gemm (#2434)
* Initial commit
* Adding new tile partitioner to flatmm
* intermediate changes
* debugging kernels
* Updating flatmm example to universal gemm example
* updated flatmm kernel to run via gemmKernel
* update universal gemm to incorporate flatmm
* debug
* Fix flatmm call
* Fixing other kernels and tests for API changes
* clang formatted
* fixing gemm tests
* added test for flatmm and simplify kernel arguments
* adding flatmm test
* fix test for flatmm
* simplify gemm kernel with flatmm
* remove flatmm related files
* addressing review comments and code clean up
* resolving empty file
* resolving empty file
* clang formatted
* addressing review comments
* enable persistent kernel for flatmm
* reverted the removed files for flatmm
* reverted the removed files for flatmm
* changed flatmm to weightPReshuffle; removed the _1 added in the flatmm example
* some more renames
* clang formatted
2025-07-11 08:27:55 -07:00
Qianfeng
45904b8fd7
Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) (#2487)
* Add separate mask checking for scope [aligned_physical_seqlen_k_start, physical_seqlen_k_end) in pagedkv pipeline
* i_nhead_ conversion type to prevent overflow
---------
Co-authored-by: ltqin <letaoqin@amd.com>
2025-07-11 18:14:47 +08:00
Illia Silin
1b66f3f4a3
Add declarations for atomic add for fp16 and unsigned short. (#2483)
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
* revert back to atomic add using casting
2025-07-10 07:18:56 -07:00
shay-li77
d814fefe18
support y-direction step length greater than 1 for SimplifiedGenericAttentionMask (#2338)
* mask support ratio for y axis
* format code
* add notes for param y_ratio
* fix comments error
* support template and mdiv for ratio mask
* refactor y-ratio mask constructor
* optimize coordinate calculation
* add SimplifiedRatioAttentionMask
2025-07-09 23:18:55 +08:00
Yi DING
032ca60015
[CK_TILE] Avoid compile kernel in host pass (#2475)
2025-07-09 22:27:54 +08:00
Illia Silin
93420ecf89
Revert "Add templates for fp16 and unsigned short atomic add to fix FBGEMM bu…" (#2474)
This reverts commit 112b47e885.
2025-07-08 19:01:26 -07:00
Illia Silin
112b47e885
Add templates for fp16 and unsigned short atomic add to fix FBGEMM builds. (#2471)
* add template for fp16 atomic add
* add template for unsigned short atomic add
* use atomicCAS in atomic add for fp16 and unsigned short
2025-07-08 18:09:30 -04:00
Haocong WANG
5557eadce6
[CK TILE] Fix FA build filter (#2369)
* Fix for fwd/bwd kernel build filter
* fix bwd code
* cmake depends & bwd filter order fix
* revert unexpected reformat
* Avoid change fmha bwd filter order for downstream compatibility
* Revert unexpected changes
---------
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: Ding, Yi <yi.ding@amd.com>
2025-07-08 10:42:07 +08:00
Illia Silin
e033a1b4bf
fix compilation errors with clang20 (#2464)
2025-07-07 19:40:30 -07:00
Po Yen Chen
b2dea90116
Eliminate warning caused by failure to meet occupancy requirement (#2389)
Co-authored-by: felix <felix.li@amd.com>
2025-07-08 09:17:25 +08:00