Commit Graph

1516 Commits

Author SHA1 Message Date
rocking
9fa8b4c170 Fix bug of welford when number of m warp > 1 2024-10-29 10:51:33 +00:00
rocking
8beda9d98d Move reduce2d into reduce folder 2024-10-29 10:02:09 +00:00
rocking
1654e6cd97 Merge branch 'develop' into ck_tile/rmsnorm 2024-10-29 14:42:48 +08:00
Illia Silin
922e42a039 fix compilation errors for gfx12 with clang20 (#1606) 2024-10-28 19:02:48 -07:00
rocking
356b045fd7 Add README 2024-10-28 19:55:57 +00:00
rocking
6a54faae25 Add save_x to trait 2024-10-28 19:55:44 +00:00
rocking
b683de6b32 Fix bug of x verification 2024-10-28 19:49:08 +00:00
rocking
88d3079065 Add test script 2024-10-28 19:39:37 +00:00
rocking
b83f8d242a Add instance library 2024-10-28 19:34:51 +00:00
rocking
9a22805e92 Fix bug of kSaveX == false 2024-10-27 11:42:58 +00:00
rocking
0f9969a894 Rename two pass to three pass 2024-10-26 20:29:55 +00:00
rocking
697558d856 Add two pass pipeline 2024-10-26 20:21:18 +00:00
carlushuang
b098b71b05 topk_softmax (#1592)
* topk_softmax

* remove some file

* fix atomic linear_offset

* address various comment, and change sfc get_index api to static(tuple)
2024-10-26 23:52:49 +08:00
Bartłomiej Kocot
31bf253aeb Add dynamic elementwise op (#1426)
* Add dynamic elementwise op

Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>

* CI issues fix

* Custom parameter value for dynamic functions - Comments addressed

---------

Co-authored-by: ThruptiRajLakshmanaGowda <thruptiraj.lakshmanagowda@amd.com>
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
2024-10-26 15:22:37 +02:00
Po Yen Chen
54f0e6f4bb [CK_TILE] More fmha splitkv optimizations (#1588)
* Use pre-defined constants for readability

* Use vector write for o_acc tensor

* Remove no-longer used policy method

* Deprecate no-longer used policy/pipeline

* Specify gemm0/gemm1 block warps separately in codegen

* Fix wrong ps_idx creation logic

* Add single-warp block gemm

* Support single-warp gemm0

* Make MakeCBlockTile() a static method

* Use MakeCBlockTile() to get underlying tile distribution

* Use kNumGemm1Warps to compute # threads for gemm1

* Put normal case in the if clause

* Refine fmha splitkv block mapping

* Refine & fix the lse_acc/o_acc layout

* Fix wrong LDS size for K tile

* Use kK0=64 for hdim=128,256 fmha splitkv kernels

* Use kK1=64 for hdim=32,64,128 fmha splitkv kernels

* Undo kK0/kK1 changes

* Use more reasonable GetAlignmentV() computation

* Using store_tile() in fmha splitkv kernel epilogue
2024-10-26 18:35:45 +08:00
rocking
2d4480a123 Refine tile size 2024-10-26 10:23:20 +00:00
rocking
1c1f1e35b5 Fix bug of one pass pipeline 2024-10-26 10:22:50 +00:00
rocking
27d96b4031 host verification 2024-10-26 10:22:09 +00:00
valarLip
37f7afed1e add int8 gemm multiply multiply a8w8 (#1591)
* add int8 gemm multiply multiply a8w8

* uncomment

* clang-format-12

* Add example_gemm_multiply_multiply_xdl_int8

* Remove shell scripts

* update preprocess number for mi308; bring back printout in ckprofiler

* format

---------

Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-26 16:39:34 +08:00
rocking
826ee18a11 Add reduce op 2024-10-25 22:51:15 +00:00
rocking
1e0c9fde51 Add add_rmsnorm2d_rdquant kernel 2024-10-25 20:50:48 +00:00
Max Podkorytov
eda5938386 add parsing grouped conv fwd instances 2024-10-25 08:25:53 -07:00
Rostyslav Geyyer
7d576f1748 Update GPU verification (#1596)
* Update inits

* Update static_cast to type_convert

* Add verification option selection
2024-10-25 08:13:46 -07:00
aledudek
9385caa306 Generic threshold calculation (#1546)
* Calculate generic relative threshold pool3dfwd

* Calculate absolute error threshold pool3d fwd

* Generic threshold calculation take max input for relative error pool3dfwd

* Remove max possible value for error calculation at runtime

* Remove debug print in pool3dfwd

* Pool3d fwd adjusted types in generic threshold calculation

* Generic threshold calculation take into account number of accumulations and accdatatype

* Generic threshold fix final error formula

* Generic threshold calculation - num of accs fix

* Generic threshold calculation - adjust absolute error

* Generic threshold calculation - OutDataType in absolute error
2024-10-25 12:46:24 +02:00
dummycoderfe
9183ce69ca hot_fix epsilon pos (#1597)
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-10-25 11:17:45 +08:00
rocking
871af334d1 Refine pipeline name 2024-10-24 20:42:40 +00:00
rocking
c89d8ca95f clang format 2024-10-24 17:05:36 +00:00
rocking
1684d71a3f Fix cmake 2024-10-24 11:44:55 +00:00
rocking
1e6814a6bd Refine naming 2024-10-24 11:44:40 +00:00
rocking
d79715ba53 Fix bug of rmsnorm 2024-10-24 11:43:45 +00:00
rocking
e4a169dd47 refine example of rmsnorm 2024-10-24 11:43:15 +00:00
rocking
a50ec83d03 refine naming 2024-10-24 08:48:34 +00:00
rocking
df976ff6a1 Add missing cmake change 2024-10-24 06:13:03 +00:00
rocking
3d2e3be652 Add script to test performance and correctness 2024-10-24 06:12:42 +00:00
rocking
5b3108a62f Remove static assert to prevent compile fail 2024-10-24 06:09:23 +00:00
Illia Silin
8e22e1ae31 fix the logic of enabling XDL and WMMA instances (#1595) 2024-10-23 15:55:39 -07:00
rocking
a5986c70dc Add rmsnorm small example 2024-10-23 19:31:05 +00:00
rocking
382a2af212 Add rmsnorm2d 2024-10-23 19:23:51 +00:00
Bartłomiej Kocot
cedccd59c9 [POST MERGE PR] Enable grouped conv bwd wei bf16 NGCHW (#1594) 2024-10-23 12:02:33 +02:00
rocking
dfb4bf9488 Fix bug of std calculation 2024-10-22 20:36:25 +00:00
rocking
26f16dd20b Prevent users from using cross-warp reduction 2024-10-22 19:29:46 +00:00
Jatin Chaudhary
4d5248e2d1 Explicit cast values to half (#1593)
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-10-22 11:17:32 -07:00
rocking
9e7fcc0b37 Add reduce2d new api 2024-10-22 14:52:10 +00:00
Bartłomiej Kocot
82fc53835a Enable grouped conv bwd wei bf16 NGCHW (#1589)
* Enable grouped conv bwd wei bf16 NGCHW

* fixes

* fixes

* Fixes

* fixes

* fixes

* Fixes
2024-10-22 16:18:28 +02:00
ltqin
0394f8a713 update layernorm (#1570)
* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 022365802b.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine naming

* Add reduce example

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-22 09:26:18 +08:00
Rostyslav Geyyer
3f710930f6 Update default stride (#1576)
* Update default stride value to -1

* Fix format

* Revert "Fix format"

This reverts commit ae0c3649ec.

---------

Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>
2024-10-21 08:45:22 -07:00
spolifroni-amd
794f2d64a8 added link to documentation (#1578) 2024-10-21 08:35:57 -07:00
dependabot[bot]
d0565e33d6 Bump rocm-docs-core from 1.8.2 to 1.8.3 in /docs/sphinx (#1587)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.8.2 to 1.8.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.8.2...v1.8.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-21 08:34:53 -07:00
Thomas Ning
560917b161 Ck profiler instance support (#1575)
* The draft on ckProfiler instance add

* support the ck profiler instance with same data types

* add a small feature on the M and N variable switch.

* Partially solve the incorrect result problem

* fix based on ci cd
2024-10-21 22:47:48 +08:00
Po Yen Chen
95e722a3b3 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged
2024-10-21 10:52:11 +08:00