Commit Graph

1608 Commits

Author SHA1 Message Date
Bartłomiej Kocot
a064792e96 Fix universal gemm profiler for pk_i4_t (#1790)
* Fix universal gemm profiler for pk_i4_t

* fix

[ROCm/composable_kernel commit: 888317e698]
2025-01-04 14:01:33 +01:00
dependabot[bot]
19603c0e45 Bump rocm-docs-core from 1.12.0 to 1.12.1 in /docs/sphinx (#1788)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.12.0 to 1.12.1.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.12.0...v1.12.1)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 37b3514648]
2025-01-03 17:47:48 -08:00
Illia Silin
68c7f53cb1 terminology clean-up (#1792)
[ROCm/composable_kernel commit: 8ea375bb58]
2025-01-03 16:38:22 -08:00
carlushuang
60e814a3ba [CK_TILE]naive attn support FP8 KVCache quant (#1747)
* quant

* fix bug

* simple smoothquant after softmax

* update kv-quant

* update stride

* fix fp8-pertoken-kvcache

* update int8/fp8 quant support

---------

Co-authored-by: so <a.com>
Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 6df5fe2ad8]
2025-01-03 18:43:07 +08:00
Mingtao Gu
d4a8c6c2ed Implement the fp16xint4 scale weight only kernel for Ali (#1786)
* enable int4 scale (weight only) kernel

* format some files

* Add unit test for int4 weight only

* fixed and formatted code

* fixed

* formatted

* formatted

* fixed

* fixed a bug in the ckProfiler, and formatted the code

---------

Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 4f62f6e9b7]
2025-01-03 18:35:21 +08:00
feli
5ce28a1d13 Ck tile/layernorm: implement naive reduce, opt performance (#1784)
* add no welford

* enable output raw

* raw of int8

* fix build

* fix smoke test err

* [ck_tile]layernorm: fix welford ok, set int8 and bf16 small N as default and others open by generate

* [cktile]layernorm, fix err commit files and remove useless

* fix quant 8192 err & change norm_reduce class and file name

---------

Co-authored-by: coderfeli <coderfeli@163.com>
Co-authored-by: carlushuang <carlus.huang@amd.com>

[ROCm/composable_kernel commit: 4bc610416a]
2025-01-03 14:28:59 +08:00
John Afaganis
de674980aa Add afagaj to CODEOWNERS (#1787)
[ROCm/composable_kernel commit: 17e8efb573]
2025-01-02 20:50:07 -06:00
Muhammed Emin Ozturk
222b1d6b48 BF16 GEMM Stream-K (#1541)
* initial

* Cmake file

* successful compilation but validation failed

* Cmake

* update

* gpu validation

* gemm universal

* gemm universal sk update

* sk bf16 universal instance

* gemm_universal_streamk.hpp

* only build for gfx94

* Cmakelist

* profiler update, bf16 sk only works at gfx42

* clang

* clang

* clang all

* no need flags

* cmake script

* delete comment

* gemm universal sk fix

* clang

* profiler fix

* clang

* update

* update

* delete comment

* code formatting

* cmake

* fix instance

* clang

* argument supported

* argument supported and clang

* update

* fix

* removing unnecessary comments

* clang formatting

* Update library/src/tensor_operation_instance/gpu/CMakeLists.txt

Co-authored-by: afagaj <john.afaganis@gmail.com>

* CopyRight Comment 2025

* clang reformatting

* copy right 2025

---------

Co-authored-by: Emin Ozturk <ozturk.27@osu.edu>
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-008.hpcfund>
Co-authored-by: root <root@splinter-126-wr-d3.amd.com>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t006-001.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@login1.hpcfund>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t004-004.hpcfund>
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Muhammed Emin Ozturk <meozturk@t008-001.hpcfund>
Co-authored-by: afagaj <john.afaganis@gmail.com>

[ROCm/composable_kernel commit: 9e95d54cd2]
2025-01-02 10:30:04 -08:00
Adam Osewski
ac74520ff6 Jing's contribution: prototype of mixed precision gemm FP16/BF16xint4 GEMM (#1762)
* add a prototype of int4

* clean

* debug

* clean

* clean

* move packed into dynamic_buffer

* fixed coord reset

* add fast pki4 to half conversion

* fix

* fixed reference and host_tensor

* fixed tensor init

* format

* debug i4_to_f16_convert

* format

* fixed splitk

* weight permute

* add b tile permute

* clean

* weight permute with splitki

* format

* improve weight layout

* add and_or_b32

* fixed splitk crush

* add permute switch as a template

* recover v3r1

* clean

* failure with intrawave v2

* fixed

* fixed

* add ckProfiler

* add bfp16 support

* add bf16 example

* fixed int4 to bhalf_t conversion

* format

* fixed int4 to bf16 conversion

* clean

* add instances for mem

* clean

* fixed host tensor size

* fixed

* debug

* fixed

* add pk_i4_t as a struct

* fix

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* revert

* Update example/01_gemm/gemm_xdl_bf16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update example/01_gemm/gemm_xdl_fp16_pk_i4_v3.cpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed comments

* revert

* clean

* revert

* revert

* fixed

* Update CMakeLists.txt

* Update script/cmake-ck-dev.sh

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update include/ck/tensor_operation/gpu/element/unary_element_wise_operation.hpp

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* Update CMakeLists.txt

Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>

* fixed

* fixed

* fixed

* revert

* revert

* add comments

* format

* fixed assert

* fixed

* Fix I4 define in ckProfiler

* Fixed example_gemm_xdl_bf16_pk_i4_v3 test failed issue

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: mtgu0705 <mtgu@amd.com>

[ROCm/composable_kernel commit: 1d8e4ec2ce]
2025-01-02 11:48:06 +08:00
Bartłomiej Kocot
a860c20099 Add NGCHW bf16 grouped conv fwd instances (#1783)
* Add NGCHW bf16 grouped conv fwd instances

* add missed cmake

[ROCm/composable_kernel commit: 159fa31946]
2025-01-01 18:00:06 +01:00
Qianfeng
8c1883a424 Remove using partitioner for all fmha kernels (#1778)
* Remove using tile partitioner for fmha_fwd_kernel

* Remove using tile partitioner for fmha_fwd_splitkv and splitkv-combine kernels

* Remove using tile partitioner for fmha_fwd_appendkv kernel

* Unify the format of GetTileIndex

[ROCm/composable_kernel commit: 4e076909b6]
2024-12-29 14:29:56 +08:00
Bartłomiej Kocot
7fbc8a9ac1 [CK TILE] GEMM and Batched GEMM SplitK support (#1724)
* [CK TILE] Add split K support in GEMM

* Updates

* Fixes

* rebase

* fix

* Fix

* fixes

* support for batched gemm

[ROCm/composable_kernel commit: af66494880]
2024-12-28 14:40:17 +01:00
Po Yen Chen
1e65b3ab35 Correct the dtype checking logics (#1775)
[ROCm/composable_kernel commit: 4c2eff023a]
2024-12-25 23:57:28 +08:00
carlushuang
4c4be7b14f [CK_TILE] optimize moe-sorting kernel (#1771)
* opt moe sorting

* remove commented code

[ROCm/composable_kernel commit: 3d15f364b3]
2024-12-23 10:59:02 +08:00
Illia Silin
c369965615 fix typo for CK_USE_OCP_FP8 (#1769)
[ROCm/composable_kernel commit: 07339c7383]
2024-12-20 07:52:24 -08:00
carlushuang
0d16f9b5c7 hot-fix (#1768)
[ROCm/composable_kernel commit: 1c45ca35dd]
2024-12-20 16:40:45 +08:00
Po Yen Chen
f5c4569acd [CK_TILE] Add fmha fwd N-Warp S-Shuffle pipeline (fmha fwd splitkv pipeline variant) (#1705)
* Add check for zero values

* Add static assertions

* Remove invalid option '-e' in smoke_test.sh

* Use correct path of smoke_test.sh

* Avoid zero-sized shared memory array

* Add warning comment

* Replace expr by integer_divide_ceil() call

* Use more readable constant names

* Write down assumption as static assertion

* Add more diagnostic error messages

* Fix wrong BlockWarps when using default pipeline policy

* Add more static assertions for A LDS desc

* Allow using vector size < 8 for data type fp16/bf16

* Align vector size between DRAM dist & LDS desc

* Remove no-longer used func decl

* Fix wrong displayed pipeline name

* Undo policy template changes for tile_example_gemm_basic

* Add missing space and make error message stand out

* Unify print precision

* Add missing include directive <iomanip>

* Replace constant 64 by get_warp_size() call

* Replace constant 128 by named variable: BankLength

* Add kAMBlock/kBNBlock attributes

* Allow using different A/B warp dist for multiple blocks

* Add helper function to get warp dist encodings

* Add 4x64x4 fp16 warp gemm attribute impl

* Complete the A/B warp dist encoding logic

* Fix wrong thread mapping for C matrix

* Use smaller vector size for small tile

* Add static assert to block unsupported warp gemm impl

* Extract common code out as helper method

* Add 4x64x16 fp16 warp gemm type alias

* Add comment to warn developers

* Undo WarpGemmAtrributeMfma<> changes

* Use more clear static assertion error message

* Add trivial wrapper to get warp dstr encodings

* Only transpose warp gemm result if it's square

* Fix compilation error

* Support multi-block warp gemm (on N direction)

* Remove duplicated code

* Fix output encoding of warp gemm

* Fix wrong shape of WarpGemmAtrributeMfmaIterateK<>

* Remove unused code

* Fix wrong shape of WarpGemmAttributeMfmaImplF16F16F32M4N64K4

* Add type config for bf16_t

* Add 4x64x16 bf16 warp gemm

* Update WarpGemmAtrributeMfmaIterateKAndTransposedCDistribution

* Add 64x4x4 fp16/bf16 warp gemm impl

* Add 64x4x16 fp16/bf16 warp gemm

* Add static assertion for better error diagnostic

* Get Q dram dstr directly from block gemm

* Add missing header: fused_moe.hpp

* Allow specifying different warp-gemm for gemm0 & gemm1

* Store P matrix into LDS before gemm1

* Fix inconsistent kernel name

* Remove constraint on gemm0 & gemm1 block warps

* Remove unsupported vector size from checking list

* Allow using 4x64x16 warp gemm for gemm0

* Finish policy customization

* Finish pipeline modification

* Use block warps in codegen

* Fix wrong rank of m_lds_window origin

* Use better distributed tensor

* Make P-store earlier

* Remove duplicated expressions

* Remove unnecessary tile window

* Create new files for new splitkv pipeline

* Separate old/new pipeline codegen logic

* Sync changes from develop

* Undo gemm kernel/pipeline changes

* Undo gemm example changes

* Remove blank lines

* Fix typo

* Use new warp gemm interface

* Fix link error

* Fix wrong pipeline tag

* Fix more link error

* Avoid unnecessary padding

* Always use vector load for K

* Padding on fastest dimension when necessary

* Force padding Q on hdim_q

* Set high dimension padding flag to false

* Re-format headers

* Use warps=<1, 4, 1> for both gemm0 & gemm1

* Fix compilation errors

* Remove m/l shuffle logics

* Ignore duplicate data when write lse_acc

* Use gemm0 block warps as lds tile width

* Remove hard-coded numbers

* Fix wrong distribution width

* Remove unnecessary code

* Add s_barrier before writing to LDS

* Store Q into LDS before gemm0

* Fix wrong Q tile size

* Use simple Q lds descriptor for debugging

* Use more realistic Q lds descriptor

* Add comment & use better variable name

* Make Q lds space not overlapped with others

* Remove unnecessary block_tile_reduce_sync() call

* Move Q load statements

* Move block_sync_lds() right before use

* Re-order instructions

* Remove necessary lambda expression

* Use 8 threads on kMaxSplits direction while doing reduction

* Tiny correction for using 8 threads on kMaxSplits direction for combine kernel

* Padding num_split direction of o_acc tile window to 4x

* Update splitkv combine pipeline design

* Add kN1 back to splitkv combine pipeline problem

* Fix compilation errors

* Add missing template parameter

* Fix wrong splitkv combine kernel name

* Fix wrong origin

* Fix wrong LDS descriptor shape

* Fix sync & reduction logics

* Remove unnecessary static assertions

* Extract tile size computation logics

* Make sure we can reuse padding flags in combine kernels

* Rename variables

* Use OaccDataType in BlockFmhaSplitKVCombinePipelineTileSizes<>

* Remove unnecessary static assertion

* Fix function name typo

* Add constraint on kN1 template parameter

* Hide K tile loading latency in earlier iteration

* Fix wrong splitkv kernel name

* Use s_shuffling to replace p_shuffling, which removes the need for cross-warp reduction

* Rename pipeline

* Fix wrong pipeline name attribute

* Add GetAlignmentQ() for NWarpSShuffle pipeline

* Separate Q tile into dram tile & register tile concepts

* Remove non-square warp gemm transpose c type alias

* Fallback tile size changes for fmha fwd splitkv

* Remove redundant change

* Refine naming for the S tile

* Use better naming of the S tile dstr (read from lds)

* Share Q lds with K lds

* Tiny change

* Fix with using static_for for passing CI checking

---------

Co-authored-by: Qianfeng Zhang <Qianfeng.Zhang@amd.com>

[ROCm/composable_kernel commit: 37cdbf4f0e]
2024-12-20 14:41:01 +08:00
Illia Silin
17e73266d5 fix profiler_grouped_gemm (#1766)
[ROCm/composable_kernel commit: 2944c50894]
2024-12-19 17:24:05 -08:00
Mateusz Ozga
e08b9f7cce Apply Ck-tile argument parser for vectors [I/O] (#1758)
* Parser for a vector was added. Additionally, we validate the correctness of numbers

* Remove unnecessary comments

* Review part 1

* Review part 2

* Add const to variadic lambda

* Rename C->K

[ROCm/composable_kernel commit: e758d006a5]
2024-12-19 17:55:35 +01:00
aledudek
d9025d054d [CK TILE] Refactor GemmKernel to be reused by other GEMM related operators (#1730)
* Gemm Kernel Refactor part1

* Gemm Kernel Refactor common gemm pipeline part2

* [CK TILE] Refactor batched gemm to reuse GemmKernel

* [CK TILE] Refactor GemmKernel - review changes part1

* [CK TILE] Refactor GemmKernel - references fix

* [CK TILE] Refactor GemmKernel - naming changes, add problem

* [CK_TILE] Refactor GemmKernel - update tests

* [CK_TILE] Refactor GemmKernel - review changes

* [CK_TILE] Refactor GemmKernel - update test

* [CK_TILE] Refactor GemmKernel - constness fixes

* [CK_TILE] Refactor GemmKernel - update tests

[ROCm/composable_kernel commit: 453ca37347]
2024-12-18 17:52:46 +01:00
Xiaodong Wang
eb09d3a572 Disambiguate bit_cast (#1749)
Adding namespace to disambiguate with std::bit_cast

Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>

[ROCm/composable_kernel commit: 1c1b336371]
2024-12-18 18:32:38 +08:00
aledudek
ce345cc50e [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm (#1743)
* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review changes

* [CK_TILE] Move hipmalloc/memcpy calls out of gpu reference gemm - review fix

[ROCm/composable_kernel commit: f6c4d614e3]
2024-12-18 09:45:58 +01:00
Harisankar Sadasivan
4f35cc87fe updated fp16 instances to be on parity with universal gemm instances (#1754)
* updated fp16 instances to be on parity with universal gemm instances

* corrected instance name to streamk instance

[ROCm/composable_kernel commit: d9e37c6874]
2024-12-17 10:31:21 -08:00
Illia Silin
57d3525983 Pass build flags to config.h (#1760)
* pass the build flags to config.h

* fix clang format

[ROCm/composable_kernel commit: 689a5ae45b]
2024-12-17 10:17:29 -08:00
Max Podkorytov
477e028e58 refactor conditional usage; fix build on rocm6.1 where the reference didn't exist
[ROCm/composable_kernel commit: 6ef8d3c295]
2024-12-17 08:40:18 -08:00
dependabot[bot]
b0cd5fea51 Bump rocm-docs-core from 1.11.0 to 1.12.0 in /docs/sphinx (#1753)
Bumps [rocm-docs-core](https://github.com/ROCm/rocm-docs-core) from 1.11.0 to 1.12.0.
- [Release notes](https://github.com/ROCm/rocm-docs-core/releases)
- [Changelog](https://github.com/ROCm/rocm-docs-core/blob/develop/CHANGELOG.md)
- [Commits](https://github.com/ROCm/rocm-docs-core/compare/v1.11.0...v1.12.0)

---
updated-dependencies:
- dependency-name: rocm-docs-core
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>

[ROCm/composable_kernel commit: 0e54d7ae5a]
2024-12-17 06:57:55 -08:00
jakpiase
430eafe07d Added unit tests for CK Tile compute bound gemm pipeline (#1728)
[ROCm/composable_kernel commit: 627a27bda3]
2024-12-17 14:25:22 +01:00
Adam Osewski
481555748d Enhance printing functionality (#1751)
* Added object print with all template parameters

* fix clang format

---------

Co-authored-by: ravil-mobile <ravil.aviva.com@gmail.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>

[ROCm/composable_kernel commit: d46196f291]
2024-12-17 09:19:44 +01:00
Max Podkorytov
4fb6f1c199 clarify release notes bullet point
[ROCm/composable_kernel commit: 0fd6978d2a]
2024-12-16 10:46:19 -08:00
Max Podkorytov
084e264f0a add contributing placeholder
[ROCm/composable_kernel commit: 1b75c77da4]
2024-12-16 10:46:19 -08:00
Max Podkorytov
6107f3ab89 add pull request template placeholder
[ROCm/composable_kernel commit: 30a37cac0e]
2024-12-16 10:46:19 -08:00
Max Podkorytov
16166b49a8 add template placeholders
[ROCm/composable_kernel commit: a8ad7fcce9]
2024-12-16 10:46:19 -08:00
Illia Silin
b75aad4943 upgrade sqlalchemy version (#1748)
* upgrade sqlalchemy version

* replace the connection with engine in to_sql call

* change the hipTensor ctest syntax

[ROCm/composable_kernel commit: fdfe210230]
2024-12-15 16:25:21 -08:00
Xu, Shengnan
e63c346bf2 added moe interleaving pipeline (#1712)
* added moe interleaving pipeline

* remove redundant code

* formater

---------

Co-authored-by: root <root@hjbog-srdc-14.amd.com>

[ROCm/composable_kernel commit: f57d720c67]
2024-12-15 20:13:10 +08:00
Illia Silin
6f4b8ba3fd upgrade pandas package (#1746)
[ROCm/composable_kernel commit: d68974a5c6]
2024-12-13 16:30:39 -08:00
Illia Silin
61dab707af Add zstd lib for building hipTensor. (#1745)
* add zstd library to CI docker

* fix the libzstd name

[ROCm/composable_kernel commit: 41ebf117a5]
2024-12-13 16:30:22 -08:00
Bartłomiej Kocot
4111d2fbfd Add SplitK support into Batched GEMM V3 (#1729)
* add bmm api

* add bf16 multi_d

* add ckProfiler for bf16

* add ckProfiler files

* add more instance; fixed 64bit index issue

* fixed naming

* enabled batched Ds

* use long_index for ds offsets

* clean

* add bmm fp8 ckProfiler

* Update example/24_batched_gemm/batched_gemm_xdl_bf16_v3.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/batched_gemm_xdl_fp8_rowwise_v3.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update example/24_batched_gemm/run_batched_gemm_example_rowwise.inc

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn.hpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v1_default_instance.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_mem_v2_default_instance.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/src/profile_gemm_universal_batched.cpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* Update profiler/include/profiler/profile_gemm_universal_batched_impl.hpp

Co-authored-by: Bartłomiej Kocot <bartlomiejkocot98@gmail.com>

* clean

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update library/src/tensor_operation_instance/gpu/gemm_universal_batched/device_batched_gemm_xdl_universal_bf16_bf16_bf16/device_batched_gemm_xdl_universal_bf16_bf16_bf16_mk_nk_mn_comp_default_instance.cpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* Update include/ck/tensor_operation/gpu/device/impl/device_batched_gemm_multiple_d_xdl_cshuffle_v3.hpp

* refactor batch offset func

* add splitk support into bmm_v3

* clean

* clean

* format

* fixed

* fix

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>

[ROCm/composable_kernel commit: 4d8fce33dd]
2024-12-13 21:08:35 +01:00
chenjun
26839ac17b Ck tile/smoothquant out stride (#1742)
* add ck_tile/smoothquant out stride parameter

* Remove the default stride value

---------

Co-authored-by: so <a.com>

[ROCm/composable_kernel commit: 4e73177684]
2024-12-13 11:53:52 +08:00
carlushuang
b675a4d4ba [CK_TILE] naive attn (#1708)
* add reference attention fwd

* refactor addresser

* update

* paged, and i8 reflect-quant

* lets call it forward-quant

* fix error in decode variation

* update naive-attn

* fix page table

* fix build err

[ROCm/composable_kernel commit: 77a38e0211]
2024-12-12 11:54:03 +08:00
Illia Silin
b0cb070311 add missing stdexcept header (#1740)
[ROCm/composable_kernel commit: 357a0b1c57]
2024-12-10 15:16:03 -08:00
Illia Silin
21ae5f75c3 Upgrade to Ubuntu22.04 as default OS. (#1738)
* upgrade to ubuntu 22.04

* try adding -u roof docker options for ubuntu 22

[ROCm/composable_kernel commit: 90d8410d56]
2024-12-10 08:48:51 -08:00
Jatin Chaudhary
09598a1337 Make sure we call __hneg with half to remove ambiguous error (#1736)
[ROCm/composable_kernel commit: 67497a044d]
2024-12-10 08:47:36 -08:00
rocking
6e778cc529 [CK TILE] Use config name instead of data type in FmhaFwdTypeConfig<config> (#1731)
* Add data type config, Prepare to add mix precision in the future

* Fix compile error

[ROCm/composable_kernel commit: 94ae7113bd]
2024-12-10 11:36:18 +08:00
Illia Silin
d65f9559ff build CI for gfx12 by default (#1734)
[ROCm/composable_kernel commit: 23cf2026b4]
2024-12-09 14:11:20 -08:00
Illia Silin
f9d1334a29 update CI timeout limits (#1733)
[ROCm/composable_kernel commit: 2f088b8707]
2024-12-09 09:32:14 -08:00
Illia Silin
d34b5e54a9 remove unnecessary file (#1732)
[ROCm/composable_kernel commit: c773cc25a2]
2024-12-09 08:50:36 -08:00
Illia Silin
f76071125c Refactor CI performance tests. (#1726)
* merge the build and performance tests CI stages together

* add gemm performance test on gfx11/gfx12

* add suffixes to distinguish gemm performance logs from different archs

* use smaller gemm set in CI for gfx10/gfx11/gfx12

* disable performance tests on gfx1030

* fix the shashing logic

* fix finding python3 for mha instances

[ROCm/composable_kernel commit: 355893cdd8]
2024-12-06 13:04:25 -08:00
Rostyslav Geyyer
99040fae57 Add copy assignment op test (#1718)
* Add copy assignment op test

* Add a deep copy testing

[ROCm/composable_kernel commit: 5e6bd75a72]
2024-12-06 09:56:27 -06:00
Bartłomiej Kocot
8df7da97a2 Support large batch tensors in grouped conv bwd data (#1711)
* Support large batch tensors in grouped conv bwd data

* Fix multiD

* fixes

* fixes

* fixes

[ROCm/composable_kernel commit: 261f1759de]
2024-12-06 10:55:23 +01:00
Po Yen Chen
44b0100283 Undo padding-flag changes in fmha_fwd_kernel.hpp (#1725)
[ROCm/composable_kernel commit: 58e7f37fc8]
2024-12-06 12:59:58 +08:00