Commit Graph

1631 Commits

Author SHA1 Message Date
Max Podkorytov
ad697c78ac mark unused args 2025-01-08 10:09:54 -08:00
Max Podkorytov
a2e6ad62e2 run clang-format -style=file 2025-01-08 10:09:54 -08:00
Max Podkorytov
aa59ecaa22 run clang-format==12 2025-01-08 10:09:54 -08:00
Max Podkorytov
82fb3f84fb update comment in the policy 2025-01-08 10:09:54 -08:00
Max Podkorytov
4daa82b451 update qsksvs comment 2025-01-08 10:09:54 -08:00
Max Podkorytov
66c5b715c9 remove dead code 2025-01-08 10:09:54 -08:00
Max Podkorytov
edb78a4729 clang-format and remove dead code 2025-01-08 10:09:54 -08:00
Max Podkorytov
60113859fa roll back splitkv 2025-01-08 10:09:54 -08:00
Max Podkorytov
bfc997a7e6 update qsksvs pipeline 2025-01-08 10:09:54 -08:00
Max Podkorytov
f7942b993c qsksvs pipeline changes to mirror qrksvs 2025-01-08 10:09:54 -08:00
AMD-dteng
d5c8a334ca enable bias feature that add bias before adding residual (for rtpllm project) (#1741)
* 1. enable bias feature that add bias before adding residual; 2. change block size from 128->64 when m<64 in fp16

* delete comment

* 1.remove fmha change 2.change buffer name from bias to xbias

* Now bias can be used independently from fadd

* change kbias to kxbias

---------

Co-authored-by: feli <felix.li@amd.com>
2025-01-08 17:51:06 +08:00
spolifroni-amd
a6b761c39a Update LICENSE to 2025 (#1797) 2025-01-07 08:29:40 -08:00
dependabot[bot]
9f6bf9ab26 Bump rocm-docs-core from 1.12.1 to 1.13.0 in /docs/sphinx (#1798)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.1 to 1.13.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.1...v1.13.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-07 08:26:31 -08:00
Po Yen Chen
24b12d04af [CK_TILE] fmha fwd splitkv optimization for decode (seqlen_q=1) (#1789)
* Update license year

* Add initial code to override decode problem

* Fix splitkv traits/args overriding error

* Reshape and transpose lse for decode

* Remove debug code

* Prettify example code

* Use better function name

* Add kMergeNumHeadGroupsSeqLenQ flag

Kernel user can use this switch to turn on/off optimization for
some problem sizes

* Add missing flag declarations

* Default turn off kMergeNumHeadGroupsSeqLenQ in codegen

* Group similar statements together

* Remove assumption of seqlen_q=1

* Remove kMergeNumHeadGroupsSeqLenQ from splitkv combine kernel

* Support kMergeNumHeadGroupsSeqLenQ=true in fmha splitkv kernel

* Run kMergeNumHeadGroupsSeqLenQ=true kernels when need

* Fix group mode block skip logics

* Undo changes of normal fwd kernel

* Update in GridSize() and using GridSize() for splitkv kernel (#1799)

---------

Co-authored-by: Qianfeng <qianfeng.zhang@amd.com>
2025-01-07 18:49:24 +08:00
Bartłomiej Kocot
888317e698 Fix universal gemm profiler for pk_i4_t (#1790)
* Fix universal gemm profiler for pk_i4_t

* fix
2025-01-04 14:01:33 +01:00
dependabot[bot]
37b3514648 Bump rocm-docs-core from 1.12.0 to 1.12.1 in /docs/sphinx (#1788)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.12.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.12.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2025-01-03 17:47:48 -08:00
Illia Silin
8ea375bb58 terminology clean-up (#1792) 2025-01-03 16:38:22 -08:00
carlushuang
6df5fe2ad8 [CK_TILE]naive attn support FP8 KVCache quant (#1747)
* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2025-01-03 18:43:07 +08:00
Mingtao Gu
4f62f6e9b7 Implement the fp16xint4 scale weight only kernel for Ali (#1786)
* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formated

* formated

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-03 18:35:21 +08:00
feli
4bc610416a Ck tile/layernorm: implement naive reduce, opt performance (#1784)
* add no welford

* enable output raw

* raw of int8

* fix build

* fix smoke test err

* [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate

* [cktile]layernorm, fix err commit files and remove uselss

* fix quant 8192 err & change norm_reduce class and file name

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>
2025-01-03 14:28:59 +08:00
John Afaganis
17e8efb573 Add afagaj to CODEOWNERS (#1787) 2025-01-02 20:50:07 -06:00
Muhammed Emin Ozturk
9e95d54cd2 BF16 GEMM Stream-K (#1541)
* initial

* Cmake file

* successfull compilation but validation failed

* Cmake

* update

* gpu validation

* gemm universal

* gemm universal sk update

* sk bf16 universal instance

* gemm_universal_streamk.hpp

* only build for gfx94

* Cmakelist

* profiler update, bf16 sk only works at gfx42

* clang

* clang

* clang all

* no need flags

* cmake script

* delete comment

* gemm universal sk fix

* clang

* profiler fix

* clang

* update

* update

* delete comment

* code formatting

* cmake

* fix instance

* clang

* argument supported

* argument supported and clang

* update

* fix

* removing unnecessary comments

* clang formatting

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt

Co-authored-by: afagaj <john.afaganis@gmail.com>

* CopyRight Comment 2025

* clang reformatting

* copy right 2025

---------

Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>
2025-01-02 10:30:04 -08:00
Adam Osewski
1d8e4ec2ce Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
* add a prototype of int4

* clean

* debug

* clean

* clean

* move packed into dynamic_buffer

* fixed coord reset

* add fast pki4 to half conversion

* fix

* fixed reference and host_tensor

* fixed tensor init

* format

* debug i4_to_f16_convert

* format

* fixed splitk

* weight permute

* add b tile permute

* clean

* weight permute with splitki

* format

* improve weight layout

* add and_or_b32

* fixed splitk crush

* add permute switch as a template

* recover v3r1

* clean

* failure with intrawave v2

* fixed

* fixed

* add ckProfiler

* add bfp16 support

* add bf16 example

* fixed int4 to bhalf_t conversion

* format

* fixed int4 to bf16 conversion

* clean

* add instances for mem

* clean

* fixed host tensor size

* fixed

* debug

* fixed

* add pk_i4_t as a struct

* fix

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* revert

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed comments

* revert

* clean

* revert

* revert

* fixed

* Update CMakeLists.txt

* Update script/cmake-ck-dev.sh

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update CMakeLists.txt

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed

* fixed

* fixed

* revert

* revert

* add comments

* format

* fixed assert

* fixed

* Fix I4 define in ckProfiler

* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>
2025-01-02 11:48:06 +08:00
Bartłomiej Kocot
159fa31946 Add NGCHW bf16 grouped conv fwd instances (#1783)
* Add NGCHW bf16 grouped conv fwd instances

* add missed cmake
2025-01-01 18:00:06 +01:00
Qianfeng
4e076909b6 Remove using partitioner for all fmha kernels (#1778)
* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex
2024-12-29 14:29:56 +08:00
Bartłomiej Kocot
af66494880 [CK TILE] GEMM and Batched GEMM SplitK support (#1724)
* [CK TILE] Add split K support in GEMM

* Updates

* Fixes

* rebase

* fix

* Fix

* fixes

* support for batched gemm
2024-12-28 14:40:17 +01:00
Po Yen Chen
4c2eff023a Correct the dtype checking logics (#1775) 2024-12-25 23:57:28 +08:00
carlushuang
3d15f364b3 [CK_TILE] optimize moe-sorting kernel (#1771)
* opt moe sorting

* remove commented code
2024-12-23 10:59:02 +08:00
Illia Silin
07339c7383 fix typo for CK_USE_OCP_FP8 (#1769) 2024-12-20 07:52:24 -08:00
carlushuang
1c45ca35dd hot-fix (#1768) 2024-12-20 16:40:45 +08:00
Po Yen Chen
37cdbf4f0e [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed piepline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stands out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow usig different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warning developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly form block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistant kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification
F#

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated experssions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes form develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix complilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debuging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove necessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-squre warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-12-20 14:41:01 +08:00
Illia Silin
2944c50894 fix profiler_grouped_gemm (#1766) 2024-12-19 17:24:05 -08:00
Mateusz Ozga
e758d006a5 Apply Ck-tile argument parser for vectors [I/O] (#1758)
* Parser for a vector was added. Additionaly we valid correctnes of numbers

* Remove unnecessary comments

* Review part 1

* Review part 2

* Add const to variadic lambda

* Rename C->K
2024-12-19 17:55:35 +01:00
aledudek
453ca37347 [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730)
* Gemm Kernel Refactor part1

* Gemm Kernel Refactor common gemm pipeline part2

* [CK TILE] Refactor batched gemm to reuse GemmKernel

* [CK TILE] Refactor GemmKernel - review changes part1

* [CK TILE] Refactor GemmKernel - references fix

* [CK TILE] Refactor GemmKernel - naming changes, add problem

* [CK_TILE] Refactor GemmKernel - update tests

* [CK_TILE] Refactor GemmKernel - review changes

* [CK_TILE] Refactor GemmKernel - update test

* [CK_TILE] Refactor GemmKernel - constness fixes

* [CK_TILE] Refactor GemmKernel - update tests
2024-12-18 17:52:46 +01:00
Xiaodong Wang
1c1b336371 Disambiguate bit_cast (#1749)
Adding namespace to disambiguate with std::bit_cast

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-12-18 18:32:38 +08:00
aledudek
f6c4d614e3 [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743)
* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix
2024-12-18 09:45:58 +01:00
Harisankar Sadasivan
d9e37c6874 updated fp16 instances to be on parity with universal gemm instances (#1754)
* updated fp16 instances to be on parity with universal gemm instances

* corrected instance name to streamk instance
2024-12-17 10:31:21 -08:00
Illia Silin
689a5ae45b Pass build flags to config.h (#1760)
* pass the build flags to config.h

* fix clang format
2024-12-17 10:17:29 -08:00
Max Podkorytov
6ef8d3c295 refactor conditional usage; fix build on rocm6.1 where the reference didn't exist 2024-12-17 08:40:18 -08:00
dependabot[bot]
0e54d7ae5a Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#1753)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-17 06:57:55 -08:00
jakpiase
627a27bda3 Added unit tests for CK Tile compute bound gemm pipeline (#1728) 2024-12-17 14:25:22 +01:00
Adam Osewski
d46196f291 Enhance printing functionality (#1751)
* Added object print with all template parameters

* fix clang format

---------

Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-12-17 09:19:44 +01:00
Max Podkorytov
0fd6978d2a clarify release notes bullet point 2024-12-16 10:46:19 -08:00
Max Podkorytov
1b75c77da4 add contributing placeholder 2024-12-16 10:46:19 -08:00
Max Podkorytov
30a37cac0e add pull request template placeholder 2024-12-16 10:46:19 -08:00
Max Podkorytov
a8ad7fcce9 add template placeholders 2024-12-16 10:46:19 -08:00
Illia Silin
fdfe210230 upgrade sqlalchemy version (#1748)
* upgrade sqlalchemy version

* replace the connection with engine in to_sql call

* change the hipTes=nsor ctest syntax
2024-12-15 16:25:21 -08:00
Xu, Shengnan
f57d720c67 added moe interleaving pipeline (#1712)
* added moe interleaving pipeline

* remove redundant code

* formater

---------

Co-authored-by: root <root@hjbog-srdc-14.amd.com>
2024-12-15 20:13:10 +08:00
Illia Silin
d68974a5c6 upgrade pandas package (#1746) 2024-12-13 16:30:39 -08:00
Illia Silin
41ebf117a5 Add zstd lib for building hipTensor. (#1745)
* add zstd library to CI docker

* fix the libzstd name
2024-12-13 16:30:22 -08:00