509 Commits

Author SHA1 Message Date
danyao12
70514fd899 bwd rtn 2024-08-14 12:32:12 +00:00
danyao12
17c97f5814 solve lds read&write conflicts 2024-08-12 13:13:55 +00:00
danyao12
4cc514f803 fix unpadded lse issue in fwd splitkv 2024-08-07 11:00:33 +00:00
rocking
e148767a9b Support unpad lse layout for splitkv 2024-08-06 14:57:58 +00:00
rocking
e6c489df49 Support unpad layout for group lse 2024-08-06 14:25:22 +00:00
danyao12
3efdb593b8 unpadded lse&d for group mode 2024-08-06 14:47:55 +00:00
danyao12
3d5b0755ef non-iglp pipeline for headdim padding cases 2024-08-02 10:59:52 +00:00
danyao12
5d2a5a1131 more strides for fa integration 2024-07-30 10:57:22 +00:00
danyao12
ad3e94bbaa fwd dropout revert 2024-07-28 17:51:48 +08:00
danyao12
7e9d2390cc dq_acc stride stuff 2024-07-27 16:23:16 +08:00
danyao12
99ed2c1ae3 code sync up 2024-07-26 18:41:18 +08:00
danyao12
3552041a70 Merge branch 'develop' into ck_tile/fa_bwd_opt 2024-07-26 18:05:24 +08:00
Dan Yao
e892711040 Merge pull request #1419 from ROCm/ck_tile/fa_bwd_opt_clean
Remove duplicated codes for creating WarpGemm
2024-07-26 12:20:48 +08:00
danyao12
ed8ef7e58f dropout patch for mrepeat 16*16 2024-07-26 12:10:43 +08:00
Qianfeng Zhang
5a561b5e1d Remove duplicated WarpGemm definitions in the policy file 2024-07-25 16:36:37 +00:00
zjing14
105bd708c7 Add rotating buff for gemm_multi_d (#1411)
* add rotating_buff for gemm_multi_d

* format

* Update flush_cache.hpp

* Update gtest.cmake

---------

Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Haocong WANG <haocwang@amd.com>
2024-07-25 23:21:21 +08:00
danyao12
94c957b3db revert 2024-07-25 20:51:02 +08:00
danyao12
dcc3593fe4 fix hd32 error and boost performance 2024-07-25 16:16:30 +08:00
Andriy Roshchenko
4a8a1befd5 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale+Bias element-wise operation. (#1412)
* Add CMakePresets configurations.

* Add binary elementwise ConvScaleAdd and an example.

* Numerical verification of results.

Observed significant irregularities in F8 to F32 type conversions:
```log
ConvScaleAdd: float=145.000000   f8_t=160.000000    e=144.000000
ConvScaleAdd: float=97.000000   f8_t=96.000000    e=104.000000
ConvScaleAdd: float=65.000000   f8_t=64.000000    e=72.000000
```

* Implemented ConvScaleAdd + Example.

* Add ConvScale+Bias Instances

* Add Client Example for ConvScale+Bias

* Fix number of bytes in an example..

* Cleanup.
2024-07-24 15:49:55 -05:00
Bartłomiej Kocot
ffabd70a15 Add support for half_t and bfloat to reduction operations (#1395)
* Add support for half_t and bfloat to reduction operations

* Fix bhalf convert

* Next fix bf16
2024-07-24 12:12:37 -05:00
danyao12
b2510c0541 fix dq alignment 2024-07-24 12:28:41 +08:00
danyao12
da2dce1866 group convert_dq opt 2024-07-23 14:16:21 +08:00
danyao12
b69499b933 fix fwd dropout 2024-07-23 10:20:12 +08:00
Bartłomiej Kocot
5d8c3d8190 Revert Support access per groups and filter2x3 in grouped conv fwd (#1382) (#1406) 2024-07-22 14:21:24 +02:00
danyao12
260ace4b78 code cleanup 2024-07-22 11:35:34 +08:00
danyao12
9b4b4622c1 code cleanup 2024-07-22 11:06:04 +08:00
danyao12
06f575a3b7 refactor dropout 2024-07-21 17:34:40 +08:00
danyao12
99436cd4c6 save clear_tile 2024-07-20 17:21:12 +08:00
danyao12
b3100b6f43 remove FmhaBwdTilePartitioner 2024-07-20 16:09:14 +08:00
danyao12
9d78a6c506 comments 2024-07-20 15:00:02 +08:00
danyao12
42a7240a19 do{}while() -> while(){} 2024-07-20 14:52:35 +08:00
Haocong WANG
8c90f25be3 [GEMM] F8 GEMM, performance optimized. (#1384)
* add ab_scale init support

* enabled interwave

* add scale type; update isSupport

* adjust example

* clean

* enable f8 pure gemm rcr ckprofiler

* Add gemm_multiply_multiply instances

* clang format

* Optimize for ScaleBlockMNK=128

* enable abscale f8 gemm ck profiler

* Add pure f8 gemm test suite

* Reverting to the state of project at f60fd77

* update copyright

* clang format

* update copyright

---------

Co-authored-by: root <jizhan@amd.com>
2024-07-19 22:06:52 +08:00
ltqin
c544eb4da0 Universal gemm splitk using reduce (with multi-d) (#1341)
* init for reduce_threadwise multi_d

* add reduce_threadwise_multi_d

* add reduce_multi_d

* clean

* start add an other splitk device op

* add reduce template parameter to SplitKBatchOffset

* add reduce c matrix

* clean up code

* change example data type to bf16

* add bf16Ai8B example

* remove reduce template parameter

* add splitk atomic status to v4

* example add multi d parameters

* device op add multi-d parameters

* add multi-d to reduce

* fix kbach=1 bug

* change B layout to col in  bf16Ai8B example

* remove float adding struct

* change  multi-d interface

* change file and class name

* remove multi-d of bf16Ai8B example

* change IsReduce function to IsReduceAdd

* change example layout to RRR from RCR

* according layout to set ds stride

* reset parameter layout

* add gemm universal reduce instance

* add reduce factory

* add profile_gemm_universal_reduce

* add reduce to profiler

* fix reduce instance

* fix profiler reduce compiling bug

* format

* format library instance code

* add mem instance for reduce library

* fix call instance names

* add workspace for reduce in ckProfiler

* format

* add mnpading to reduce library instance

* add fp16 instance to reduce of profiler

* change copyright time

* restore profiler cmake file

* add reduce text to instances

* add DsLayout and DsDataType to instances template parameter

* fixed gemm_reduce_multi_d

* add an example without multi_d

* Update common.hpp

* Update gtest.cmake

* Update gemm_xdl_splitk_reduce_bf16.cpp

* clean

* Update gtest.cmake

* format

* fixe api

* format

* default parameter change to RRR

* add vector_len for multi_d

* format

* Update gtest.cmake

* fix bf16A iBB elementwiseop

* add ReduceDataType

* move ReduceDataType to end position

* format

* remove googletest git method  address

* fix copyright time

* update init data

---------

Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
2024-07-19 22:01:22 +08:00
Bartłomiej Kocot
70a814f163 Refactor transform conv to gemm fwd (#1391)
* Refactor transform conv to gemm fwd

* fixes codegen

* wmma fixes

* fix wmma

* Fix copyright
2024-07-19 09:29:25 +02:00
danyao12
1d7099b6c4 fix hd256 dropout scratch 2024-07-19 15:08:45 +08:00
danyao12
a67bdd6349 simplify convert dq 2024-07-18 09:55:52 +08:00
Qianfeng
ee768148f0 Replace the using of __expf by __ocml_exp_f32 to work-around the test_softmax_rank4 failure (#1394) 2024-07-17 09:15:05 -07:00
Andriy Roshchenko
802a8a1df1 Adding more instances of grouped convolution 3d forward for FP8 with ConvScale element-wise operation and ReLU activation. (#1386)
* Add CMakePresets configurations.

* Add ConvScale+ReLU Functor and an Example

* Account for ReLU FLOPs.

* Add instances of 3D convolutions with ConvscaleRelu operation.

* Implement Client Example

* Cleanup
2024-07-16 08:51:49 -07:00
danyao12
fb26ec5d01 hd256 bias support 2024-07-16 13:48:56 +08:00
danyao12
237c93c85b bias support 2024-07-15 12:23:27 +08:00
Bartłomiej Kocot
82e8a78a3f Support access per groups and filter3x3 in grouped conv fwd (#1382)
* Support access per groups and filter3x3 in grouped conv fwd

* Fixes for large cases

* Fixes for large tensors
2024-07-12 11:08:42 -07:00
danyao12
ca4a9f008b reorder files 2024-07-12 18:10:20 +08:00
danyao12
39fc3d4b2e fix group deterministic bugs 2024-07-11 17:34:59 +08:00
danyao12
8c967d76d1 fix batch deterministic bugs 2024-07-11 16:33:34 +08:00
danyao12
74f1516c5b tmp save 2024-07-10 18:05:30 +08:00
carlushuang
8182976c37 [CK_TILE] wa prec, remove sgpr offset for inline asm (#1356)
* wa prec, remove sgpr offset for inline asm

* macro for set tile

* ignore unused param if no kernel instances in host API

* fix more prec issue

* cache buffer resource

* fix

* support pre-nop

* clear tile by vector type members

* add workaround to reduce scratch memory

* conditionally enable workaround code

* enable workaround start from certain build version

* fallback set_tile() implementation from certain build version

* undo template argument changes

* put dummy asm in load_raw()

* fix comments, refactor s_nop inside buffer_load

---------

Co-authored-by: PoYen, Chen <PoYen.Chen@amd.com>
2024-07-08 11:09:55 -07:00
Harisankar Sadasivan
75e622f02f Universal streamk with atomics (#1360)
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile).

* Update README.md

* fixing clang-format issues

* removed conflicts in struct members between streamk and universal streamk

* corrected arg parsing for streamk and universal streamk

* added stream-k policies for 3 tile and 4 tile

* fixed argument type issue with parsing cmd args

* changes suggested in PR review are made- removing comments and correcting copyright

* file permissions updated

* added default value support for grid_size and streamk-policy selection set to -1

* print messages for arguments

* print messages for arguments

* print messages for arguments1
2024-07-05 21:40:30 -07:00
jakpiase
eaa870a1ab Add structural sparsity xdlops (#1363)
* Implemented smfmac xdlops

* add reviewer comments
2024-07-04 12:00:14 +02:00
Jun Liu
959073842c Fix issue with multiple targets and remove smfmac tests from unsupported test targets (#1372) 2024-07-03 23:34:38 -07:00
jakpiase
ed21948bcd Add structural sparsity gemm instruction tests (#1309)
* first version of smfmac test

* add reviewer comments

* add reviewer suggestions
2024-06-27 11:30:32 +02:00