Commit Graph

1617 Commits

Author SHA1 Message Date
coderfeli
db84352941 fix warnings and revert cmake and fix clang format 2024-12-30 08:24:50 +00:00
coderfeli
5765ba51ce auto calculate hard code params 2024-12-30 07:59:47 +00:00
coderfeli
3f9dbcac63 use new pipeline for b preshuffle, run ok; revert olds to fix ckprofiler 2024-12-30 06:52:10 +00:00
coderfeli
54f44e6232 fix brepeat, kloop and lds two buffer; works ok now 2024-12-30 00:25:46 +00:00
coderfeli
2c056624af fix tail 2024-12-27 08:30:03 +00:00
coderfeli
174b46b04a add cpu shuffle 2024-12-27 07:31:14 +00:00
coderfeli
e6f5a78b14 add double buffer scratch 2024-12-26 15:02:04 +00:00
coderfeli
3784329b68 can run 2024-12-26 13:01:07 +00:00
coderfeli
4a1ec81595 add bypass logic and build 2024-12-26 10:05:25 +00:00
coderfeli
19b7c1312c remove all non gemm in cmake 2024-12-24 07:44:12 +00:00
coderfeli
5e5e1a50f9 add instances 2024-12-24 07:43:09 +00:00
coderfeli
9ba219c875 rm debug used files 2024-12-24 03:59:12 +00:00
coderfeli
3f50b99e7b port tiles from a8w8 2024-12-23 14:13:16 +00:00
carlushuang
3d15f364b3 [CK_TILE] optimize moe-sorting kernel (#1771)
* opt moe sorting

* remove commented code
2024-12-23 10:59:02 +08:00
Illia Silin
07339c7383 fix typo for CK_USE_OCP_FP8 (#1769) 2024-12-20 07:52:24 -08:00
carlushuang
1c45ca35dd hot-fix (#1768) 2024-12-20 16:40:45 +08:00
Po Yen Chen
37cdbf4f0e [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed piepline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stands out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow usig different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warning developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly form block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistant kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification
F#

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated experssions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes form develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix complilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debuging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove necessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling which removes the needs of cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-squre warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>
2024-12-20 14:41:01 +08:00
Illia Silin
2944c50894 fix profiler_grouped_gemm (#1766) 2024-12-19 17:24:05 -08:00
Mateusz Ozga
e758d006a5 Apply Ck-tile argument parser for vectors [I/O] (#1758)
* Parser for a vector was added. Additionaly we valid correctnes of numbers

* Remove unnecessary comments

* Review part 1

* Review part 2

* Add const to variadic lambda

* Rename C->K
2024-12-19 17:55:35 +01:00
aledudek
453ca37347 [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730)
* Gemm Kernel Refactor part1

* Gemm Kernel Refactor common gemm pipeline part2

* [CK TILE] Refactor batched gemm to reuse GemmKernel

* [CK TILE] Refactor GemmKernel - review changes part1

* [CK TILE] Refactor GemmKernel - references fix

* [CK TILE] Refactor GemmKernel - naming changes, add problem

* [CK_TILE] Refactor GemmKernel - update tests

* [CK_TILE] Refactor GemmKernel - review changes

* [CK_TILE] Refactor GemmKernel - update test

* [CK_TILE] Refactor GemmKernel - constness fixes

* [CK_TILE] Refactor GemmKernel - update tests
2024-12-18 17:52:46 +01:00
Xiaodong Wang
1c1b336371 Disambiguate bit_cast (#1749)
Adding namespace to disambiguate with std::bit_cast

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>
2024-12-18 18:32:38 +08:00
aledudek
f6c4d614e3 [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743)
* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix
2024-12-18 09:45:58 +01:00
Harisankar Sadasivan
d9e37c6874 updated fp16 instances to be on parity with universal gemm instances (#1754)
* updated fp16 instances to be on parity with universal gemm instances

* corrected instance name to streamk instance
2024-12-17 10:31:21 -08:00
Illia Silin
689a5ae45b Pass build flags to config.h (#1760)
* pass the build flags to config.h

* fix clang format
2024-12-17 10:17:29 -08:00
Max Podkorytov
6ef8d3c295 refactor conditional usage; fix build on rocm6.1 where the reference didn't exist 2024-12-17 08:40:18 -08:00
dependabot[bot]
0e54d7ae5a Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#1753)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-12-17 06:57:55 -08:00
jakpiase
627a27bda3 Added unit tests for CK Tile compute bound gemm pipeline (#1728) 2024-12-17 14:25:22 +01:00
Adam Osewski
d46196f291 Enhance printing functionality (#1751)
* Added object print with all template parameters

* fix clang format

---------

Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-12-17 09:19:44 +01:00
Max Podkorytov
0fd6978d2a clarify release notes bullet point 2024-12-16 10:46:19 -08:00
Max Podkorytov
1b75c77da4 add contributing placeholder 2024-12-16 10:46:19 -08:00
Max Podkorytov
30a37cac0e add pull request template placeholder 2024-12-16 10:46:19 -08:00
Max Podkorytov
a8ad7fcce9 add template placeholders 2024-12-16 10:46:19 -08:00
Illia Silin
fdfe210230 upgrade sqlalchemy version (#1748)
* upgrade sqlalchemy version

* replace the connection with engine in to_sql call

* change the hipTes=nsor ctest syntax
2024-12-15 16:25:21 -08:00
Xu, Shengnan
f57d720c67 added moe interleaving pipeline (#1712)
* added moe interleaving pipeline

* remove redundant code

* formater

---------

Co-authored-by: root <root@hjbog-srdc-14.amd.com>
2024-12-15 20:13:10 +08:00
Illia Silin
d68974a5c6 upgrade pandas package (#1746) 2024-12-13 16:30:39 -08:00
Illia Silin
41ebf117a5 Add zstd lib for building hipTensor. (#1745)
* add zstd library to CI docker

* fix the libzstd name
2024-12-13 16:30:22 -08:00
Bartłomiej Kocot
4d8fce33dd Add SplitK support into Batched GEMM V3 (#1729)
* add bmm api

* add bf16 multi_d

* add ckProfiler for bf16

* add ckProfiler files

* add more instance; fixed 64bit index issue

* fixed naming

* enabled batched Ds

* use long_index for ds offsets

* clean

* add bmm fp8 ckProfiler

* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/src/profile_gemm_universal_batched.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* clean

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* refactor batch offset func

* add splitk suppport into bmm_v3

* clean

* clean

* format

* fixed

* fix

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2024-12-13 21:08:35 +01:00
chenjun
4e73177684 Ck tile/smoothquant out stride (#1742)
* add ck_tile/smoothquant out stride parameter

* Remove the default stride value

---------

Co-authored-by: so <a.com>
2024-12-13 11:53:52 +08:00
carlushuang
77a38e0211 [CK_TILE] naive attn (#1708)
* add reference attention fwd

* refactor addresser

* update

* paged, and i8 reflect-quant

* lets call it forward-quant

* fix error in decode variation

* update naive-attn

* fix page table

* fix build err
2024-12-12 11:54:03 +08:00
Illia Silin
357a0b1c57 add missing stdexcept header (#1740) 2024-12-10 15:16:03 -08:00
Illia Silin
90d8410d56 Upgrade to Ubuntu22.04 as default OS. (#1738)
* upgrade to ubuntu 22.04

* try adding -u roof docker options for ubuntu 22
2024-12-10 08:48:51 -08:00
Jatin Chaudhary
67497a044d Make sure we call __hneg with half to remove ambigios error (#1736) 2024-12-10 08:47:36 -08:00
rocking
94ae7113bd [CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731)
* Add data type config, Prepare to add mix precision in the future

* Fix compile error
2024-12-10 11:36:18 +08:00
Illia Silin
23cf2026b4 build CI for gfx12 by default (#1734) 2024-12-09 14:11:20 -08:00
Illia Silin
2f088b8707 update CI timeout limits (#1733) 2024-12-09 09:32:14 -08:00
Illia Silin
c773cc25a2 remove unnecessary file (#1732) 2024-12-09 08:50:36 -08:00
Illia Silin
355893cdd8 Refactor CI performance tests. (#1726)
* merge the build and performance tests CI stages together

* add gemm performance test on gfx11/gfx12

* add suffices to distinguish gemm performance logs from different archs

* use smaller gemm set in CI for gfx10/gfx11/gfx12

* disable performance tests on gfx1030

* fix the shashing logic

* fix finding python3 for mha instances
2024-12-06 13:04:25 -08:00
Rostyslav Geyyer
5e6bd75a72 Add copy assignment op test (#1718)
* Add copy assignment op test

* Add a deep copy testing
2024-12-06 09:56:27 -06:00
Bartłomiej Kocot
261f1759de Support large batch tensors in grouped conv bwd data (#1711)
* Support large batch tensors in grouped conv bwd data

* Fix multiD

* fixes

* fixes

* fixes
2024-12-06 10:55:23 +01:00
Po Yen Chen
58e7f37fc8 Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725) 2024-12-06 12:59:58 +08:00