1494 Commits

Author SHA1 Message Date
rocking
0f9969a894 Rename two pass to three pass 2024-10-26 20:29:55 +00:00
rocking
697558d856 Add two pass pipeline 2024-10-26 20:21:18 +00:00
rocking
2d4480a123 Refine tile size 2024-10-26 10:23:20 +00:00
rocking
1c1f1e35b5 Fix bug of one pass pipeline 2024-10-26 10:22:50 +00:00
rocking
27d96b4031 host verification 2024-10-26 10:22:09 +00:00
rocking
826ee18a11 Add reduce op 2024-10-25 22:51:15 +00:00
rocking
1e0c9fde51 Add add_rmsnorm2d_rdquant kernel 2024-10-25 20:50:48 +00:00
rocking
871af334d1 Refine pipeline name 2024-10-24 20:42:40 +00:00
rocking
c89d8ca95f clang format 2024-10-24 17:05:36 +00:00
rocking
1684d71a3f Fix cmake 2024-10-24 11:44:55 +00:00
rocking
1e6814a6bd Refine naming 2024-10-24 11:44:40 +00:00
rocking
d79715ba53 Fix bug of rmsnorm 2024-10-24 11:43:45 +00:00
rocking
e4a169dd47 refine example of rmsnorm 2024-10-24 11:43:15 +00:00
rocking
a50ec83d03 refine naming 2024-10-24 08:48:34 +00:00
rocking
df976ff6a1 Add missing cmake change 2024-10-24 06:13:03 +00:00
rocking
3d2e3be652 Add script to test performance and correctness 2024-10-24 06:12:42 +00:00
rocking
5b3108a62f Remove static assert to prevent compile fail 2024-10-24 06:09:23 +00:00
rocking
a5986c70dc Add rmsnorm small example 2024-10-23 19:31:05 +00:00
rocking
382a2af212 Add rmsnorm2d 2024-10-23 19:23:51 +00:00
rocking
dfb4bf9488 Fix bug of std calculation 2024-10-22 20:36:25 +00:00
rocking
26f16dd20b Prevent user use cross warp reduction 2024-10-22 19:29:46 +00:00
rocking
9e7fcc0b37 Add reduce2d new api 2024-10-22 14:52:10 +00:00
ltqin
0394f8a713 update layernorm (#1570)
* port layernorm

* change warp_welford.hpp

* Update warpshuffle

* 1. Add save mean and save std back
2. Move construction of tensor_view and tile_window to operator()

* refine welford max count calculation

* unify layernorm api

* Rename file

* Remove save mean and inv std

* Revert "refine welford max count calculation"

This reverts commit 022365802b.

* Fix order of parameter

* refine welford max count calculation again

* Remove fp32 instances

* Fix bug of padding

* refactor api

* Support bf16

* Extract common function

* Refine arg of operator()

* Add kMThreadPerBlock to template parameter

* clang format

* Refine variable name

* Refine file name

* remove redundant line

* refactor layernorm2d pipeline and add block-per-block utility

* fix name

* rename more

* add more block-per-tile instance

* remove duplicated define

* update instance for 2048, 1024 case

* support up to 2048 now

* opt loading

* add n1536

* Add two pass pipeline

* format

* Fix incorrect type

* parallel compilation

* Use smaller N

* fix 2p pass

* Support Repeat_M in distribution

* Refine naming

* Add reduce example

---------

Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: aska-0096 <haocwang@amd.com>
Co-authored-by: rocking <ChunYu.Lai@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-10-22 09:26:18 +08:00
Rostyslav Geyyer
3f710930f6 Update default stride (#1576)
* Update default stride value to -1

* Fix format

* Revert "Fix format"

This reverts commit ae0c3649ec.

---------

Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>
2024-10-21 08:45:22 -07:00
spolifroni-amd
794f2d64a8 added link to documentation (#1578) 2024-10-21 08:35:57 -07:00
dependabot[bot]
d0565e33d6 Bump rocm-docs-core from 1.8.2 to 1.8.3 in /docs/sphinx (#1587)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.8.2 to 1.8.3.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.8.2...v1.8.3)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-21 08:34:53 -07:00
Thomas Ning
560917b161 Ck profiler instance support (#1575)
* The draft on ckProfiler instance add

* support the ck profiler instance with same data types

* add a small feature on the M and N variable switch.

* Partially solve the incorrect result problem

* fix based on ci cd
2024-10-21 22:47:48 +08:00
Po Yen Chen
95e722a3b3 [CK_TILE] Optimize fmha splitkv & splitkv combine kernels (#1577)
* Use smaller width for lse_accum dist tensor

* Update pipeline comment

* Fix wrong distribution for lse_accum

* Remove duplicate dim in lse_accum dist encoding

* Decide fmha splitkv combine kernel kBlockSize by kM0

* Remove assumption of MPerThread=1

* Add log<4> & log<8> specialization

* Enlarge occupancy array

* Fix vector size for small tile

* Add support for kMaxSplits=8

* Re-format gemm.hpp

* Use 16x16x16 warp gemm for fwd_splitkv

* Centralize policy code changes

* Leave fp8/bf8 tile settings unchanged
2024-10-21 10:52:11 +08:00
Haocong WANG
a285d6f9b5 disable bad instance detected on MI308CPX (#1584) 2024-10-18 08:46:11 -07:00
Illia Silin
88e6fa7fdb add the lsr-drop-solution=1 compiler flag (#1582) 2024-10-18 08:25:54 -07:00
Qianfeng
14c3cfb1c6 [CK_TILE] Improve headdim96 performance for fmha-bwd (#1573)
* Add kQKHeaddimForGemmN and kVHeaddimForGemmN in order to support headdim 96

* Remove the using of MakeKRegBlockDescriptor and MakeVRegBlockDescriptor

* Fix in bwd_pipeline_default_policy

* Remove kQKHeaddim and rename kQKHeaddimForGemmN to kQKHeaddim in the bwd kernel and pipelines

* Replace kVHeaddimForGemmN by kVHeaddim and kDoDvHeaddim

* Update to hd96 tile settings

* Add smoke test scripts for fmha-bwd hd96

* Revert "Add smoke test scripts for fmha-bwd hd96"

This reverts commit 7ca7e1a93d.

* Remove hd96 tile settings in fmha_bwd codegen to save compiling

* Fix lost code line in bwd_pipeline_default_policy

* Merge kDoDvHeaddim/kPadHeadDimDoDv to kVHeaddim/kPadHeadDimV and remove TileFmhaBwdTraits

* Rename KRegSliceBlockDescriptor/VRegSliceBlockDescriptor to KRegBlockDescriptor/VRegBlockDescriptor

* tiny adjustments

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
Co-authored-by: danyao12 <Dan.Yao@amd.com>
2024-10-16 18:14:32 +08:00
Paul Fultz II
10158b0ffd Build codegen as standalone (#1556)
* Build codegen as standalone

* Add exception for device tests

* Use local filesystem header

* add a codegen test CI stage and daily build

---------

Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-10-15 13:20:42 -07:00
Bartłomiej Kocot
d02a92cc0d [CK_TILE] Add block universal gemm pipeline policy (#1557)
* [CK_TILE] Add block universal gemm pipeline policy

* Fixes

* fixes2

* Fixes3

* fixeS
2024-10-15 13:53:41 +02:00
Po Yen Chen
9868fd0245 Apply ROCm 6.2 WA to ROCm 6.3 and later (#1563) 2024-10-15 18:02:41 +08:00
Rostyslav Geyyer
4cf70b36c1 Add custom type vector support (#1333)
* Add non_native_vector_type

* Add a test

* Add non-native vector type

* Fix CTOR

* Fix non-native vector type of 1

* Fix CTORs

* Use vector_type to cover non-native implementation as well

* Update the test

* Format

* Format

* Fix copyright years

* Remove BoolVecT so far

* Add AsType test cases

* Update assert error message

* Remove redundant type

* Update naming

* Add complex half type with tests

* Add tests for vector reshaping

* Add missing alignas

* Update test/data_type/test_custom_type.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Compare custom types to built-in types

* Add default constructor test

* Add an alignment test

---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-14 11:56:45 -05:00
Bartłomiej Kocot
f21cda2536 Add transpose scale amax example (#1547)
* Add transpose scale amax example

* fixes

* Tune reduce instance
2024-10-14 17:39:38 +02:00
Thomas Ning
35c1777d59 decouple the calling from gemm_pipeline (#1571)
* decouple the calling from gemm_pipeline

* clang format
2024-10-14 13:59:26 +08:00
Adam Osewski
29d384d0b2 Implement GetWorkSpaceSize from BaseOperator. (#1564) 2024-10-12 14:05:11 +08:00
Illia Silin
11444e4cf2 [CI] remove the --rm docker container flags (#1568) 2024-10-11 14:29:46 -07:00
Illia Silin
f46a9eee9d only build tests and examples if user sets GPU_TARGETS (#1565) 2024-10-10 15:31:56 -07:00
spolifroni-amd
14c52befda removed API usage header (#1566) 2024-10-10 13:57:23 -07:00
Rostyslav Geyyer
d18fc0797f Fix default stride value (#1559) 2024-10-10 07:37:09 -07:00
Thomas Ning
6f27bc9872 Ck tile gemm cshuffle & CK Tile GEMM restructure (#1535)
* Make the cshuffle compilable

* modify the reference on gpu and cpu. Correct access of cshuffle

* fix the cpu reference code

* Complete the in tile shuffle logic

* restructure the kernel template input

* change the naming pattern of ck_tile gemm pipeline

* Re-format files using remod.py

* Solve the fmha conflict with gemm

* Comment Addressed from Carlus

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-10-10 18:02:22 +08:00
Illia Silin
2e1165c1a7 fix the target selection logic (#1561) 2024-10-09 15:21:57 -07:00
Illia Silin
cfac9497e2 remove gfx12 targets from daily builds with rocm6.2 (#1560) 2024-10-09 10:18:05 -07:00
Christopher Millette
ceaed8e097 Fixes small memory leak from missing hipEventDestroy (#1554) 2024-10-09 09:41:35 +02:00
Rostyslav Geyyer
aa932445ea Add a gpu gemm reference kernel (#1528)
* Add a gpu gemm reference kernel

* Switch to gpu reference in gemm examples

* Remove redundant arguments

* Update all related examples

* Update more examples

* Try less threads per block

* Try even less threads per block

* Add support for all matrix layouts

* Increase block size

* Clean up

* Remove hardcoded strides

* Clean up

* Try a column-major case

* Revert back to row-major

* Run both CPU and GPU verification

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-08 11:05:28 -05:00
Po Yen Chen
0c094daa7e [CK_TILE] Update example README files & fix script compatibility issue (#1548)
* Fix text alignment of ArgParser::print()

* Update example README files

* Clarify make-ck-dev.sh <arch> usage

* Only keep some of the argument from '-?' output

* Undo command line output changes in README

* Only keep existing argument on doc and update description

* Fix text alignment

* Make cmake-ck-*.sh compatible with 'sh' command
2024-10-08 10:45:12 +08:00
Qianfeng
74d68e3b99 [CK_TILE] Simplify the codes in splitkv_combine pipeline (#1549)
* Simplify the codes in splitkv_combine pipeline

* Always set kPadSeqLenK=true for fmha splitkv kernels

* Change in Oacc Alignment and TileDistribution to be more adaptable to tile sizes

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-10-08 10:44:34 +08:00
Illia Silin
7733ae167b add a CK_USE_CODEGEN build argument to enable codegen (#1552)
* add a CK_USE_CODEGEN build argument to enable codegen

* fix cmake codegen logic
2024-10-07 15:45:19 -07:00