Commit Graph

593 Commits

Author SHA1 Message Date
so
8af425fc0a fix v0 and wave id calc 2025-01-11 09:56:48 +00:00
shengnxu
339a674b59 current status: single WG, memory out of bound 2025-01-11 11:10:17 +08:00
shengnxu
5d00b37e6b fix loop cnt and half d buffer size 2025-01-07 07:42:05 +00:00
shengnxu
2a66e08059 fix some issue, next step, res recalc, 2025-01-06 14:54:07 +00:00
shengnxu
7cc808f28d fix some codes 2025-01-05 16:08:50 +00:00
shengnxu
6f7d127267 changed all the scale outside except for uq 2025-01-05 10:31:17 +00:00
shengnxu
9a46c0e78e move a scale out inline 2025-01-04 10:20:01 +00:00
shengnxu
26d84960c4 change some inline parameter style 2025-01-03 14:14:02 +00:00
shengnxu
d0c80b12da fix more issues, current status, inline asm using more register than available 2025-01-03 09:17:08 +00:00
shengnxu
a759277dc1 fix some error 2025-01-02 19:14:01 +08:00
shengnxu
f549173bc0 simple gemm2 for gemm1 debugging 2025-01-01 22:41:19 +08:00
shengnxu
811b75d3a0 staging code for backup 2024-12-31 18:34:02 +08:00
shengnxu
1c9a5ff42b formatter 2024-12-12 10:52:47 +00:00
shengnxu
7703ceec24 remove redundant code 2024-12-12 09:58:52 +00:00
root
e2be4b9e91 added moe interleaving pipeline 2024-12-03 03:46:22 +00:00
Bartłomiej Kocot
f49b595dc0 [CK TILE] Add gemm compute pipeline v3 (#1661)
* [CK TILE] Add gemm compute pipeline v3

* Enable universal gemm compute pipeline.

* Rename example and add compute pipeline.

* Introduce ag bg cr pipeline impl base.

* Refactor to reuse code.

* Cleaning

* Formatting.

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-11-28 17:51:49 +01:00
jakpiase
e7b6286441 Add interwave scheduler for gemm mem pipeline (#1647)
* add interwave scheduler for gemm mem pipeline

* Fix merge artifacts.

* Refactor unit tests.

* Switch to interwave scheduler for mem example

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
2024-11-27 18:25:07 +01:00
Illia Silin
fe6b185b97 move utility headers from library/include to include path (#1697) 2024-11-27 06:12:56 -08:00
Adam Osewski
061ac0649c Polished Grouped GEMM APIs and new BF16 instances (#1600)
* Few small fixes.

* New GroupedGemm instances (BF16)

* Unify and refactor GroupedGEMM device API.

* Adapt changes to new API.

* Adapt grouped gemm profiler.

* Accept multiple kbatches for grouped gemm profiler.

- delete obsolete two stage as it is now covered by grouped gemm

* Update unit test for grouped gemm.

* Fix thresholds for BF16 and F8. Unblock tests.

* Fix few instances.

* Multiple small fixes.

* Adapt to new API, check dynamic casting.

* Uncomment few data types in grouped gemm profiler.

* Fix call to SetDeviceArgs.

* Fix profile grouped gemm multiply tile loop.

* Fix grouped gemm tile loop kernel args in client examples.

* Review comments.
2024-11-27 13:02:44 +01:00
rocking
abae2afc72 support max3 in smoothquant and add+ rmsnorm + rdquant (#1654)
* Fix cmake example build

* Support max3 in smoothquant one pass

* support max3 in two pass

* support max3 in add_rmsnorm_rdquant
2024-11-27 05:01:15 +08:00
Adam Osewski
bfe983a151 Change block gemm pipeline local prefill loop order. (#1692)
* Fix loop order.

* Fix loop order in pipeline v4
2024-11-26 17:36:53 +01:00
jakpiase
b70f367f80 Add check for bf16 splitk support for grouped gemm splitk (#1673)
* add check for bf16 splitk support for grouped gemm splitk

* Update if condition

---------

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-11-26 13:56:32 +01:00
Po Yen Chen
cf2d635ea2 [CK_TILE] Fix incorrect computation of group mode PagedAttention (#1688)
* Allow getting batch size from splitkv tile partitioner

* Fix wrong paged-kvcache impl for group mode

* Fix wrong example code for page-kvcache

* Undo changes in fmha_fwd.cpp

* Always use 2D block table

* Add is_gappy kernel argument for paged-kvcache

The is_gappy argument is used for differentiating seqstart_k_ptr usage
in flash-attention & xformers

* Remove out-of-date comments

* Remove no-longer used method

* Fix wrong # page-block calculation

* Fix wrong comment

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2024-11-26 20:37:54 +08:00
Adam Osewski
b6bcd76d88 CK-Tile first draft of universal block gemm with interwave & intrawave scheduler (#1676)
* Block universal gemm.

* Universal block gemm with interwave scheduler - draft.

* Refactoring

* Move a/b_warp_tiles into BlockGemmImpl
* set BlockGemmImpl as a class member

* Change tile size for more suitable to memory bound cases.

* Introduce kKPerThread to WarpGemm

* Add documentation comment.

* Fix Interwave scheduler block gemm.

* Add compute/memory friendly tile configuration.

* Clean

* New tile configurations in gemm mem example.

* Add more static checks and fix loop order in block gemm.

* Add more static checks and use warp gemm mfma dispatcher.

* Add default scheduler block gemm.

* Remove logging in example.
2024-11-26 08:45:14 +01:00
carlushuang
440e28b08f [CK_TILE] fused-moe first version (#1634)
* moe pipeline

* update code

* compile OK

* update

* update cpu reference

* update pipeline_gemm0

* compiler ok

* update pipeline

* rename to ex pipeline

* block-asm

* update

* update

* update first gemm ok

* compute correct

* update file structure

* update README

* update

* update

* update code

* update API

* return unsupported case

* add comment

* update readme

* update

* uncomment

* update

* fix build err

---------

Co-authored-by: valarLip <340077269@qq.com>
2024-11-26 11:14:56 +08:00
Po Yen Chen
645fe812f6 [CK_TILE] Fix fMHA fwd MakeKargs() compilation errors (#1689)
* Fix mis-matched tuple<> elem types

* Rename MakeKargs() as MakeKargsImpl()

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2024-11-25 15:30:35 +08:00
carlushuang
36c7ce4e0e [CK_TILE]Moe update index (#1672)
* update MOCK_ID for moe-sorting

* add moe-smoothquant

* update a comment

* fix format

* hot fix

* update topk in overflow case

* update comments

* update bf16 cvt

---------

Co-authored-by: valarLip <340077269@qq.com>
2024-11-25 13:12:35 +08:00
Qianfeng
ce2bdf42a9 Change in fwd-splitkv kernel to support num_splits=1 case (#1690)
* Change in fwd-splitkv kernel to support num_splits=1 case

* Update in codegen fwd-splitkv to make num_splits > 1 cases pass

* Specify instance traits in dispatch

* Fix link error for fp8 kernels

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-11-25 12:31:38 +08:00
schung-amd
ff92222f93 [CK_TILE] MakeKargs overloads for backward compatibility (#1681)
* Add overloads for MakeKargs

Overload MakeKargs to accept std::tuple<uint64_t, uint64_t> and std::tuple<void*, void*> to preserve functionality of code currently passing in list initializers or tuples.

* Re-format files using ck_tile remod.py

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-11-23 06:51:35 +08:00
Harisankar Sadasivan
d6d4c2788b universal streamk fp8 changes (#1665)
* universal streamk fp8 changes & ckprofiler instances

* revert strides to -1 and verification options

* fp8 exclusion on pre-gfx94 for universal_streamk

* PR review based revisions: permissions reverted, removed hip err checks


---------

Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-11-21 08:21:37 -08:00
Po Yen Chen
fb1ccfa9df [CK_TILE] Add paged-kvcache support in group mode fmha fwd splitkv kernels (#1678)
* Generate group mode paged-attn kernel

* Enable paged-kvcache + group mode support

* Add missing header: fused_moe.hpp

* Add comment to explain kernel arg usage

* Make error message more clear

* Add comment for confusing data member names

* Add more comment for confusing variable names

* Fix typo in option description
2024-11-21 14:53:10 +08:00
Illia Silin
8aba2724cc Add bf16 and int8 wmma gemms for Navi3x and Navi4x. (#1671)
* add bf16 gemms for gfx11/gfx12

* reduce the input values in test_gemm

* add int8 wmma gemm instances for gfx11/gfx12

* add example gemm_wmma_int8

* fix bug in gemm_wmma_int8 test

* increase bf16 gemm test tolerance

* update the dates and clean-up commented-out instances
2024-11-18 14:07:04 -08:00
Bartłomiej Kocot
754adc70e3 Batched GEMM Multiple D based on Universal GEMM (#1655)
* Batched GEMM Multiple D based on Universal GEMM

Co-authored-by: Jing Zhang <jizhan@fb.com>

* CI fixes

Co-authored-by: Jing Zhang <jizhan@fb.com>

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
2024-11-18 14:03:45 +01:00
feli
c1f8d53ce8 [Ck_tile] hot fix, fix rpcf param setting err (#1657)
Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-11-14 14:06:36 +08:00
Illia Silin
efd9261545 fix clang format (#1662) 2024-11-13 09:20:18 -08:00
Taylor Ding
73f02a1083 Move checks for compatibility from Argument() to IsSupportedArgument() (#1653) 2024-11-13 11:20:38 -05:00
Bartłomiej Kocot
d20735691c [CK TILE] Update gemm universal pipeline (#1644)
* [CK TILE] Update gemm universal pipeline

* Fixes

* fix

* Rebase
2024-11-13 11:46:18 +01:00
Thomas Ning
2b6458ddf2 [CK Tile] Improve the Layout, Padding, and Alignment features of CK Tile GEMM (#1651)
* Finished the feature

* Modified the test file

* Test case update

* address comment

* Addressed the review comment

* Fixed the CI error
2024-11-11 18:08:25 -08:00
valarLip
8ef8a994e7 [CK_TILE] add more stride for layernorm to support un-continuous Tensor (#1650)
* [CK_TILE] add more stride for layernorm to support un-continuous Tensor

* align CK coding style

* extend strides to layernorm example

* clang-format...
2024-11-11 16:02:28 +08:00
Po Yen Chen
13332998a4 Return nullptr when block index is invalid (#1649) 2024-11-11 09:28:32 +08:00
dummycoderfe
bec6fbc65f Ck tile/moe sorting (#1624)
* add moe_sorting & check ok

* fix comments & typo

* Run remod.py under include/ck_tile & example/ck_tile directories

* format codes

* fix output ci check bug

* fix moe sorting readme and error commit file

* use magic div to accelerate compute

* add a loop unroll for moe lds ops

* add extblocksnel to set zeros for moebufs

* [Ck_tile] moe set zero run ok, add size check and fix ref check

* [Ck_tile]fix moe_sorting fuse set_zero remod

* [Ck_tile] change name style, fix zero buffer size err, change folder

* [Ck_tile] moe_sorting: fix name style

* [Ck_tile] moe_sorting, remove useless params in traits

* [Ck_tile] change outputtile cnt * unit_size; change output buf alloc

---------

Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2024-11-09 17:57:27 +08:00
dummycoderfe
686a58a912 [Ck tile] layernorm2d fwd optimize (#1637)
* optimize small N case using vec io and using rcp div

* [Ck_tile] layernorm, add param to control fastdiv; change generate codes and test pass

* [Ck_tile] fix blockSize compute in Generic2dBlockShape

* [Ck_tile]fix kfastfdiv template style

* [Ck_tile] layernorm, fix style in review

---------

Co-authored-by: dummycoderfe <noplydummmycoder@163.com>
2024-11-08 12:28:23 +08:00
Illia Silin
75c5bfa364 enable compilation for generic navi targets (#1645) 2024-11-07 14:14:42 -08:00
darren-amd
d0e3a70a2e Statically Cast Pointer Offset (#1631)
* explicit cast ptr offset

* formatting change
2024-11-05 09:59:08 -08:00
carlushuang
cb6c5d39dc [CK_TILE] layernorm have more accurate residual (#1623)
* more accurate residual

* modify comment

* Fix literal case in README.md

---------

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-11-02 13:30:16 +08:00
rocking
fbd654545a [Ck_tile] smoothquant (#1617)
* fix compile error

* fix typo of padding

* Add smoothquant op

* Add smoothquant instance library

* refine type

* add test script

* Re-generate smoothquant.hpp

* Always use 'current year' in copyright

* use Generic2dBlockShape instead

* Add vector = 8 instance back

* Find exe path automatically

* Simplify the api condition

* Remove debugging code

* update year

* Add blank line between function declaration

* explicitly cast return value to dim3

* refine return value

* Fix default warmup and repeat value

* Add comment

* refactor smoothquant cmake

* Add README

* Fix typo

---------

Co-authored-by: Po Yen, Chen <PoYen.Chen@amd.com>
2024-11-01 13:51:56 +08:00
carlushuang
550248deec [layernorm] hot fix (#1620)
* hot fix ln

* some rename
2024-11-01 11:52:50 +08:00
carlushuang
c3a4800c5f [CK_TILE] layernorm support fused-quant/fused-add (#1604)
* add prenorm/postnorm support, refactor using generate.py

* update README

* update README

* fix format

* update some description and fix format

* update format

* format

* use non-raw for loading

* format and update n4096

* dynamic-quant ready

* update readme

* support fused dynamic-quant

* update fused-quant, with smooth

* update README

* update args

* update some based on comment
2024-10-31 14:54:53 +08:00
Bartłomiej Kocot
9a8a52130d Remove virtual destructors from unary ops (#1610)
* Remove virtual destructors from unary ops

* Fixes

* Fixes

* clang format fixes
2024-10-30 17:42:50 +01:00
rocking
7d9111545f clang-format (#1612) 2024-10-30 08:13:30 -07:00