composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-11 08:50:17 +00:00

Author	SHA1	Message	Date
Shaojie WANG	561ec12f4a	example for convnd bwd weight bf16 splitk (#265 ) * add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight * add bwd weight for bf16: init * remove redundant compute * use datatype and split k to check whether a workspace is used * remove unused computation for work space size * add some code for bfp16 * add device/grid unary op * add unary type convert to bwd-weight example * support bf16 splitk kernel for convnd bwd weight * 1. remove comments. 2. add checkvalidity. 3. add gridsize computation * add workspace size check * fix format * change function name	2022-06-16 14:16:01 -05:00
Illia Silin	fb9b6b1e33	Use new github credentials (#278 ) * use pre-built docker instead of building a new one * try docker.image.pull * change syntax in docker.image() * add 30 min timeout * increase timeout to 3 hours * move performance tests to first stage for testing * set image variable to the new container name * update image name * check available images * check available images in both places * try different image name * use image ID to refer to image * run performance on gfx90a * fix the gpu_arch labeling, add parameter * move env vars out of stages * add stand-alone performance script, MI200 tests, CU numbers * dos2unix for run_perf_tests.sh * try the new git credentials * use env var for git credentials	2022-06-15 21:26:48 -05:00
Illia Silin	1ced00a577	Add performance tests on MI200 in CI, reporting number of CUs, add stand-alone perf test. (#277 ) * use pre-built docker instead of building a new one * try docker.image.pull * change syntax in docker.image() * add 30 min timeout * increase timeout to 3 hours * move performance tests to first stage for testing * set image variable to the new container name * update image name * check available images * check available images in both places * try different image name * use image ID to refer to image * run performance on gfx90a * fix the gpu_arch labeling, add parameter * move env vars out of stages * add stand-alone performance script, MI200 tests, CU numbers	2022-06-10 14:43:43 -05:00
Illia Silin	1677cf705e	Adding Resnet50 test to Performance tests (#268 ) * add resnet50 test to performance tests * add blanks before gpu_arch in log files * add resnet50 test with N=4 and process its results * add ROCM and HIP versions to test tables * uncomment the sql queries * fix script syntax in jenkinsfile	2022-06-02 18:16:59 -05:00
Shaojie WANG	1c5d06f270	use old ctile to avoid conv2d fwd bias relu add compute error (#271 )	2022-06-02 14:06:42 -05:00
Qianfeng	86185bd7ce	Unify the naming of the math functions used by the host and kernel (#262 ) * Use the unified naming for math functions on host and HIP kernel * Corresponding change/simplification in reduction host/profiler/examples due to unified math functions renaming * Renaming GetReductionZeroVal() to GetIdentityValue() * Tiny renaming in profile_reduce_impl.hpp * More renaming in profile_reduce_impl.hpp * Replace zeroVal by identiyVal * Remove ck_ prefix in the naming of ck::math provided functions	2022-06-01 21:49:53 -05:00
zjing14	b6eaf3eb7e	Pass gemm_descs for grouped gemm via __constant__ buff (#232 ) * moved gemm_descs_args into const buff * use CK_CONSTANT_ADDRESS_SPACE instead of global constant * clean * moved hipMemAlloc outside of deviceOp * add SetWorkSpacePointer * fix ignore	2022-05-31 17:00:43 -05:00
myamlak	7b1e2c379e	Multi-kernel CGEMM (#230 ) * Reference CGEMM + test stub * Format. * Incomplete simple implementation * Library instances * Sketch of tests * Test fixes. * Example added * Cosmetics * Add elementwise operation kernel and example * Add comment * Add template argument of dim . Prepare to support multiple dimension * Rename example * Support 1 dimension * Add static assert * Add comment * Second auxiliary buffer added * Extract pad * Remove redundant argument * Support any dimension for elementwise operation * Remove line * Let it be the multiple number of CU * Move thread per block to the parameter of constructor * Consuming binary ops to do A+B / A-B * Fix + cosmetics + bf16 test commented out temporarily * Format * Enabling bf16 test * Revert "Enabling bf16 test" This reverts commit `f497e2ba44`. * Fix + test reenabled * fix build * Revert "fix build" This reverts commit `d73102384b`. * post PR #235 merge fix * amend * Single workspace for cgemm + helper * Perf calc fix * Review remarks: static_cast * Review remarks: binary ops templated * Cleaning * Removal of instances and their tests * Review remarks from aosew addressed * Review remark: unnecessary attribute * Post-merge fixes * Restrict 4gemm to PassThrough + bug fix * Review remarks * update licence * change cgemm example to fp16 Co-authored-by: rocking <chunylai@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Anthony Chang <ac.chang@outlook.com>	2022-05-31 10:20:55 -05:00
Chao Liu	85fc91c321	Minor fix for recent PR (#260 ) * fix example * update IsSupportedArgument * fix * disable fp64 conv example as test	2022-05-30 19:57:49 -05:00
rocking5566	d32a67a9b6	gemm + layernorm (#261 ) * Implement reduction meand and reduction square mean * Refine file name * Add reduce mean and square mean * Fix parameter name * Add normalize device op (not implement invoker::run()) * Remove epislon * Refine deviceop * Add 5ary elementwise for normalization * Add layernorm example * layerNorm verication * Fix compiler error due to merge from develop * Fix typo * Fix compile error * Refine naming * [What] Suport non pointer for invoker and argument [Why] Snyc coding style with gemm * Refine folder name * Refine class name * Evaluate perf of the kernel * Fix compile error * [What] Refine perf evaluation in example of gemm + reduction [Why] evaluation of gemm + reduction may cause verification fail. Because evaluation will not initial global memory * clang-format	2022-05-30 16:36:55 -05:00
Chao Liu	91d8b7d67a	Fixing conv bug (#258 ) * debugging conv * fix oversight where ctile map is constructed before initializing c desc * example program should returns error code * clean up * changed Block2CTileMap in conv2d and convnd * clean up * clean up * cleanup Co-authored-by: Anthony Chang <ac.chang@outlook.com>	2022-05-27 09:29:37 -05:00
ltqin	3e6c2610ae	Add FP64 XDL GEMM built-in function (#199 ) * add intrin_mfma_f64_16x16x4f64 * add example * gemm reference add double data type * chang init data * fix M N PerXdlops * fix ifdef * add comparsion config * add conv fwd example * format log out * change rc matrix egister layout * reorganize example * reorganize example 2 * format,because merge develop * fix call impl adding acc data type * lost ; * add compiler warning * change example tunning parameters * add test for fp64 * add instance * add test/gemm/gemm_fp64.cpp * fix get name issue * remove some tunning parameter * fix conflict * format * use integer value for GEMM test * add acc data type * remove typeid because fp16 * fix streamconfig etc bug from merging develop * format * remove test_gemm_xdl_fp64 * add AccDataType * AccDataType problem Co-authored-by: qinletao <letaoqin@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-26 14:48:57 -05:00
Qianfeng	97c4d486f4	Add pooling example (#257 ) * Add example for computing LayerNorm mean and meansquare * Refactor the pool2d_fwd example and add example for float type testing * Revert "Add example for computing LayerNorm mean and meansquare" This reverts commit `df52e6f9d8`. * Tiny fix in pool2d_fwd_common.hpp	2022-05-26 10:01:12 -05:00
rocking5566	82d7d9938f	Hotfix binary elementwise (for broadcast on fastest axis) (#254 ) * Support different length of ScalarPerVector * Add example of broadcast on fastest axis * Typo * Refine fastest example * Add dimension check * Modify fastest broadcast example to 3d * Enforce users give scalarPerVector explicitely * 1. Add CscalarPerVedctor 2. Not only broadcast on fastest need to set scalarPerVector to 1 * Rename var * Move IsScalarPerVectorValid() inside IsSupportedArgument() * Separate GridDesc_M0 into A, B and C * rename var * Rename var of length Co-authored-by: rocking <chunylai@amd.com>	2022-05-25 11:17:27 -05:00
Anthony Chang	e579c9e5c6	Tensile-style block to C tile map (#239 ) * fix build * Revert "fix build" This reverts commit `d73102384b`. * post PR #235 merge fix * amend * adds tensile-stype c-tile map * make it dynamic version * add k-split flavor tile map * apply tensile-style tile map to all xdl gridwise gemms * remove dead code Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-24 21:55:22 -05:00
Chao Liu	61851ae2b9	minor fix for recent PR (#255 ) * minor fix * clean	2022-05-24 21:51:34 -05:00
Jianfeng Yan	40b59a63cc	Navi21 gemm (#197 ) * start adding navi21 GEMM * navi_gemm_km_kn_mn_fp32 compiles and passes one test. * rename variables and functions in gridwise_gemm_dlops_v1r3 * add other 3 layouts; format instance * adding more tuning parameters add tuning parameters for other 3 layouts * add gemm_dlops_f16 * tmp * add dependence of DeviceGemm::IsSupportedArg() on arch * minor changes * minor changes * minor changes * minor changes * minor changes * minor changes * minor changes * push gemm_dlops into profiler * minor changes * if using xdl or dlops is moved into profiler_gemm_impl * minor changes * minor changes * remove is_xdl from profile_gemm_impl * make IsSupportedArg dependent on arch for other device_gemm * minor changes * minor changes * fix a bug in f_generate_tensor_value * add 64x64x64 for gemm_dlops_int8 * add 64x64x64 for gemm_dlops_int8 * comment out 3 layouts in gemm_dlops_int8; add 32x32x32 for gemm_dlops_int8; init A values to 1 * fix * start fixing tuning parameters * monir * minor changes * minor changes * minor changes * fixing * adding example * adding example * adding example * add gemm fp32 example * clean up * use 128x128x16 as MNK tile in navi21 gemm example * bug fix * fix test * use new block c tile * clean * fix build Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: shaojiewang <wsjmessi@163.com>	2022-05-24 12:19:27 -05:00
Qianfeng	63eee2d999	Overhaul to Reducton and its dependants (#237 ) * Tiny fix in dynamic_buffer.hpp to support vectorized AtomicAdd for double type * Update to host layer and host reduction * Merge and remove reduction kernels * Merge and remove reduction device interfaces and update pooling device interface * Merge and remove useless reduction device instances * Update to reduction profiler and reduction ctests * Update to reduction and pooling examples and add one reduction example * Change to reduction examples to let them testable by ctest * Add explicit pass checking for reduction and pooling examples * Explicit assignment of tensor shapes in example reduce_blockwise_two_call * Use atomic_add to repace atomicAdd and add atomic_add for double type * Add reduce ctest support for double data type * Replace to_int_vector() by using c++ std::vector::assign() * Keep DeviceReduceThreadWise separated from DeviceReduceBlockWise * Merge DeviceReduceBlockWise and DeviceReduceMultiBlockAtomicAdd into DeviceReduceMultiBlock * Add GetAtomicOperationZeroValue() support for AtomicMax * Tiny change to reduce example README.md * Fix some tiny issues due to branch merging * Revoke previous change in dynamic_buffer.hpp and add atomic_add for double2_t * Add reduce multiblock_atomic_add instances for fp64 to verify vectorized atomic_add on fp64 * Renaming * Clean the header includings in device_reduce instances header files	2022-05-24 12:19:12 -05:00
Illia Silin	1085794df3	Add performance tests as a stage of CI. (#247 ) * modify ckProfiler_gemm output * fix syntax * change ckProfiler output and return 0 * fix syntax * output datatype * fix syntax * output datatype in another way * fix syntax * fix syntax * test return values of ckProfiler * add layout info and tests, make sure ckprofiler returns 0 * fix syntax * change layout output * fix syntax * fix syntax again * update script to process perf results * rearrange jenkins stages * fix typo * add python packages to Docker file * adding setuptools-rust package * modify parsing for new test parameters * test db credentials on jenkins * fix syntax * update python script to handle incomplete lines * ungrade python to 3.8 and write the gemm_params table * add sqlalchemy package to docker * move perf data processing to master node * move the master node inside a steps region * add new stage for result processing * move results processing to separate stage * reduce number of tests to speedup debugging * pass config to processPerfResults stage * run script on master in a docker container * replace show_node_info * try loading docker on master node again * use ansible node instead of master * get rid of pymysql package * try ssh connection using paramiko * put back pymysql * put the perf data processing back on the gpu node * put back artifact definition * archive the perf_log before parsing * clean up jenkinsfile, fix parsing * fix typo * enable all perf tests * put all stages in original order, finalize script * fix gpu_arch version * update parsing script * remove obsolete file causing merge conflict	2022-05-24 11:14:50 -05:00
Shaojie WANG	0d08cf1893	add GetWorkSpaceSize to base arg (#253 ) * add GetWorkSpaceSize to base arg and make an example on convnd_bwd_weight * remove redundant compute * use datatype and split k to check whether a workspace is used * remove unused computation for work space size	2022-05-24 11:13:00 -05:00
Chao Liu	ba58a93f60	fix build (#246 ) * fix build * Revert "fix build" This reverts commit `d73102384b`. * post PR #235 merge fix * amend Co-authored-by: Anthony Chang <ac.chang@outlook.com>	2022-05-23 12:10:22 -05:00
Shaojie WANG	ac543313bf	example of conv bwd weight 1d/2d/3d fp32/fp16/bf16 xdl (#244 ) * enable example of conv 1d/3d for bwd weight * make bf16 kernel do not use atomic add * using new gridwise gemm for bwd weight on convnd bwd weight Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-20 17:20:10 -05:00
Chao Liu	44943e0e21	remove options.hpp.in (#240 )	2022-05-20 14:40:12 -05:00
Anthony Chang	a054f7d604	Refactor block to C tile map (#235 ) * refactor block-to-ctile-map * gridwise gemm block2ctile generic validity check * format * amend split-k gemm block2ctile map refactor * add test * format * amend * revert to calculating batch index in kernel instead of passing as block_id_z * move file * add valid ctile index check to gridwise v2r4	2022-05-20 12:40:51 -05:00
Shaojie WANG	070619fbf1	[conv bwd-weight]Binding gemm k1 to conv n (#202 ) * add some instance to develop * avoid bank conflicts for wrw for all instance * add small K1 test * delete some unused instance * binding gemm k1 to conv n * try using half_4 to do ds_read * reset buffer load oob and ds memcpy to default option * remove useless instances * remove redandunt space * remove printf code * clang-format-10 change * use fastest config * fix clang format for the other files * remove gemmk0 pad for output * add gemmk padding macro * add bank length computation * add template to distinguish the instance that need lds padding for wrw * use rocm5.1 as docker * use integer value for GEMM test * add Right padding macro * add 2 test asm code * using 256x256x32 tile size * 1. move dedicated transform into gridwisegemm's head file. 2. make lds tensor params a struct templete. 3. remove useless code * using small vec * 256128 kernel size for example remove asm files * use a new gridwise gemm header for bwd-weight * revert gridwise gemm v2r4r2 * change foramt * reset gridwise gemm v2r4r2 * remove unused code * revert instance file * revert example instance * format file * remove macros * resolve compile error * rename wrw kernel invoker * use gridwisegemm pipeline struct instead of implement run fucntion in the same header Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-20 12:36:25 -05:00
Shaojie WANG	b31b588dd2	remove unused conv bwd data profiler header and cpp (#245 )	2022-05-20 12:34:23 -05:00
Shaojie WANG	b9b9c3b814	[Perf][Bwd-weights]Lds re-layout to avoid ds read/write bank conflict and balance ds ops with address calculations (#190 ) * add some instance to develop * avoid bank conflicts for wrw for all instance * add small K1 test * delete some unused instance * reset buffer load oob and ds memcpy to default option * remove useless instances * remove redandunt space * remove printf code * clang-format-10 change * fix clang format for the other files * add bank length computation * add template to distinguish the instance that need lds padding for wrw * use rocm5.1 as docker * use integer value for GEMM test * 1. move dedicated transform into gridwisegemm's head file. 2. make lds tensor params a struct templete. 3. remove useless code * use a new gridwise gemm header for bwd-weight * revert gridwise gemm v2r4r2 * change foramt * rename kernel invoker Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-20 00:43:10 -05:00
rocking5566	bb4b82a95a	Hotfix eltiwseop (#242 ) * Use vector constructor instead * Fix typo * Move blockSize to the MakeArgumentPointer * Fix naming * Fix clang format * remove blockSize from DeviceBinaryElementwise::Argument() Co-authored-by: rocking <chunylai@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-19 22:02:06 -05:00
rocking5566	0ffe956ab1	Gemm reduce max (#209 ) * [What] Rename the example [Why] Prepare to add unary reduction * Add global oparation to the parameter * Add atomicmax * Fix compile error * Support atomicMax (hip library) * Rename the reduction example * Fix target name * use p_d1_grid as the indicator directly * Prevent performance issue. Let passthrough handle it. * Implement the function template the specialize the float2 * No need to separate into two lines * Remove empty line * add comment * Fix compile error due to merge from develop * make the implementation of atomic_max / atomic_add explicit for each datatype * Refine typo * For future CI test * Fix compiler error in ckProfiler * Merge commit 'de2769e3a6695b38a20529261273ddc5cdaab2fe' * simply use remove_pointer * Rename type and var * Refine example * Modify reducemax example * Fix bug in reduction * Change initialize range * Implement F64 version of atomicMax * Move reduction code together * Add buffer atomic_max * Fix coding style by clang-format * Integrate new api of DeviceGemmReduce_Xdl_CShuffle * Integrate Batch gemm reduction * Fix example * fix example * clean up * Fix batch gemm tensor operation * Fix coding style * Fix template augument * Fix clang format * Keep flexible of different stride for each D tensor * Fix compile error for ckProfiler * Fix typo * [What] Fix naming [Why] Prepare to add out elementop * Add DoutElementOp Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: rocking <chunylai@amd.com>	2022-05-19 21:56:56 -05:00
rocking5566	aafc3ac27a	elementwise op (#238 ) * Add elementwise operation kernel and example * Add comment * Add template argument of dim . Prepare to support multiple dimension * Rename example * Support 1 dimension * Add static assert * Add comment * Extract pad * Remove redundant argument * Support any dimension for elementwise operation * Remove line * Let it be the multiple number of CU * Move thread per block to the parameter of constructor * rename threadPerBlock with blockSize * Support double * rename kernel function name * remove redundant include header * Refine type * Need to the final dimension * Refine variable name * Refine type * Use index_t instead of int in API Co-authored-by: rocking <chunylai@amd.com>	2022-05-18 23:34:35 -05:00
Anthony Chang	9f71ff48e2	Validate examples in CI (#233 ) * validate examples in ctest runs * format * fix usage of check_err * amend * add example codes to custom target 'check' Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-13 16:54:44 -05:00
JD	cec69bc3bc	Add host API (#220 ) * Add host API * manually rebase on develop * clean * manually rebase on develop * exclude tests from all target * address review comments * update client app name * fix missing lib name * clang-format update * refactor * refactor * refactor * refactor * refactor * fix test issue * refactor * refactor * refactor * upate cmake and readme Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-12 09:21:01 -05:00
ltqin	0f912e205e	enable convnd bwd data test (#234 )	2022-05-12 09:18:59 -05:00
Anthony Chang	76764d8c92	Manual control of MAC cluster for improved interwave performance (#184 ) * manual control of MAC cluster for improved 2-wave performance ensure setprio's order; ensure inner loop size >= local read size synchronize when single mac cluster * format * use value field from ck::integral_constant * roll out inter-wave loop scheduler to c-shuffle gemm variants will gradually roll out to other applicable device ops when occasional reg spill is resolved * additional comments * format * fix mismatch between inter-wave pipeline and interwave blockwise gemm * address review feedback * amend	2022-05-10 19:19:22 -05:00
Adam Osewski	712e464c4e	Post PR183 review fixes. (#224 ) * Suppress additional warnings for googltest. * Rename file conv_fwd_util to conv_util. * Update includes and ConvParams member access. * Formatting. * Change conv_fwd_util target to conv_util * Fix compiler errors. * Fix leftovers. Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-05-10 15:41:29 -05:00
myamlak	f03a1738d9	Resolution of issue #153 : Add compiler warning on comparing int and size_t (#212 ) * Turning compare warnings on * Cleaning part I * Cleaning part II * Explicit static_cast to ck::type_convert * Resolving large tensor size issue. * format * revert change to tensor descriptor; promote lementSpaceSize to 64bit * use integer value for GEMM test * Review remarks * Review remarks + issues with (un)signed arithmetic * Format fix * Format * Clang-format. * fix 2gb limit issue Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Adam Osewski <aosewski@amd.com>	2022-05-09 15:06:49 -05:00
Wen-Heng (Jack) Chung	968bd93285	Update README.md (#228 )	2022-05-09 15:00:04 -05:00
Chao Liu	ec7c2e912e	Code refactor (#175 ) * format * improving pipeline * fix typo * format * adding thread group * adding thread group * adding thread group * adding gemm pipeline * tweak * refactor * refactor * add missing type convert * refactor * refactor * refactor * clean * fix build * refactor * format * clean up * use remove_cvref_t * clean * clean up * clean up * clean up	2022-05-09 14:57:59 -05:00
Illia Silin	a3c910ac6c	Add Benchmark test into CI (#226 ) * add performance test to jenkins pipeline * fix typo * fix the syntax in conv_fwd_util.cpp * fix the error message syntax spacing * fix the error message syntax spacing again * run profile_gemm and archive results * fix typo * try to figure out the paths * try to figure out the paths one more time * skip the copying step * build ckProfiler release only once * change directory using dir * fix dir syntax * change the gemm parameters * do not pipe script output to file * try running ckProfiler directly * fix typo * use set +e * run profile_gemm.sh \|\| true * run multiple gemms and parse results * fix typo in jenkinsfile * fix syntax * add new gemm sizes, update scripts * put all jenkins steps in original order Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Chao Liu <lc.roy86@gmail.com>	2022-05-08 02:44:18 -05:00
Adam Osewski	8eca05a633	Introduce GoogleTest framework. (#204 ) * Use googletest for tests. Add conv2d_fwd UT. * Add conv1D/3D to gtest UT. * Fix: not duplicate test with CTest. * Convert more tests to googltests. * Fix: GIT_SHALLOW is not allowed for git commit hash. * Clang-format * use integer value for GEMM test Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com> Co-authored-by: Chao Liu <lc.roy86@gmail.com>	2022-04-30 08:50:16 -05:00
Chao Liu	8a2c69eeee	use integer value for GEMM test (#219 )	2022-04-30 08:44:20 -05:00
Qianfeng	c77ae65d40	Update to gemm_reduce and batched_gemm_reduce (#213 ) * [Experimental] Change to gemm+reduce and batched-gemm+reduce * Use threadwise-reduce function to improve the gridwise_gemm_reduce_xdl_cshuffle kernel * Tiny fix in device_batched_gemm_xdl.hpp * clang-format library/src/utility/conv_fwd_util.cpp	2022-04-29 11:35:25 -05:00
JD	97d8c5045e	Add gfx90a CI stage for tests (#208 ) * Add gfx90a CI stage * upgrade to ROCm 5.1 and fix formatting	2022-04-29 10:36:19 -05:00
Anthony Chang	95e93430de	Hotfix for gemm test (#214 ) * pass by ref to avoid throwing away initialization results * EOL CRLF -> LF	2022-04-29 19:03:34 +08:00
Jianfeng Yan	3956085d8e	add comments to batched_gemm (#186 ) * add comments to batched_gemm * formatting * fix a typo in batched_gemm_documentation * fix naming	2022-04-25 14:32:59 -05:00
Anthony Chang	7c0b149811	profiler: fix fp32 c-shuffle gemm tuning parameter (#194 )	2022-04-22 15:48:51 -05:00
Adam Osewski	31d869adc6	Clang-format only modified files. (#181 )	2022-04-22 15:48:08 -05:00
Anthony Chang	08a979f188	use inline asm for 4x4 int8 transposition (#187 )	2022-04-22 15:47:31 -05:00
Adam Osewski	1a0cd5d160	Convolution FWD profiler refactor. (#183 ) * Convolution ND * Code unification across dimensions for generating tensor descriptors. * Example * Instances * Move convnd f32 instance file to comply with repo structure. * Conv 1D tensor layouts. * Formatting and use ReferenceConv * Reference ConvFwd supporting 1D and 2D convolution. * Debug printing TensorLayout name. * Conv fwd 1D instance f32 * Refactor conv ND example. Needed to support various conv dimensio. Needed to support various conv dimensions * Rename conv nd example director to prevent conflicts. * Refactor some common utility to single file. Plus some tests. * Refactor GetHostTensorDescriptor + UT. * Add 1D test case. * Test reference convolution 1d/2d * Remove some leftovers. * Fix convolution example error for 1D * Refactor test check errors utility function. * Test Conv2D Fwd XDL * More UT for 1D case. * Parameterize input & weight initializers. * Rename example to prevent conflicts. * Split convnd instance into separate files for 1d/2d * Address review comments. * Fix data type for flops/gbytes calculations. * Assign example number 11. * 3D cases for convolution utility functions. * 3D reference convolution. * Add support for 3D convolution. * Check for inputs bigger than 2GB. * Formatting * Support for bf16/f16/f32/i8 - conv instances + UT. * Use check_err from test_util.hpp. * Split convnd test into separate files for each dim. * Fix data generation and use proper instances. * Formatting * Skip tensor initialization if not necessary. * Fix CMakefiles. * Remove redundant conv2d_fwd test. * Lower problem size for conv3D UT. * 3D case for convnd example. * Remove leftovers after merge. * Add Conv Specialization string to GetTypeString * Skip instance causing numerical errors. * Small fixes. * Remove redundant includes. * Fix namespace name error. * Script for automatic testing and logging convolution fwd UTs * Comment out numactl cmd. 
* Refine weights initalization and relax rtol for fp16 * Move test_util.hpp to check_err.hpp * Refine weights initalization and relax rtol for fp16 * Refactor common part of test conv utils. * Move utility function to single common place. * Add additional common functions to utility. * Refactor convnd_fwd_xdl examples. * Remove redundant files. * Unify structure. * Add constructor to ConvParams. * And add input parameters validation. * Modify conv examples to use single utility file. * Remove check_error from host_tensor.hpp * Get rid of check_indices function. * Remove bf16_to_f32 function overload for scalars. * Fix namespace. * Add half_float::half for check_err. * Fix conv params size in UT. * Fix weights initialization for int8. * Fix weights initialization for int8. * Add type_convert when store output in ref conv 1D. * Get back old conv2d_fwd_xdl operation. * Silence conv debug print. * format * clean * clean * Fix merge. * Fix namespace for check_err * Formatting. * Fix merge artifacts. * Remove deleted header. * Fix some includes and use ck::utils::check_err. * Remove unused check_indices restored by previous merge. * Fix namespaces after merge. * Fix compilation error. * Small fixes. * Use common functions. * Fix filename * Fix namespaces. * Fix merge artifact - retrieve removed by accident fun. * Fix ConvForwardSpecialization. * Working example of OpInstanceRunEngine for conv2dfwd UT. * Adhere to coding style rules. * Formatting and adhere to coding style rules. * Fix merge artifacts. * Utility for collecting conv fwd instances. + Plus commmon part for parsing cmdline params. * Refactor FillUniform because of segfault for int8_t. * Naming convention. * Elegant version of device mem allocation. * Use OpInstanceRunEngine in conv fwd nd tests. * Multiple refinements. * conditional init * don't run reference op if not provided. * Use OpInstanceRunEngine for ckProfiler conv_fwd * Refactor common tensor fill function to separate file. 
* Clean up unused functions. * Support different init methods. * Create CMake target for conv_fwd_util. * Add header for profile_convnd_fwd.cpp * Fix CMakefiles to link with conv_fwd_util where needed. * Fix some clutter. Co-authored-by: Adam Osewski <aosewski@amd.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2022-04-21 17:39:39 -05:00
JD	7353ec0c25	Fix `clang-format` (#189 ) * Fix clang-format filepath * update docker and fix format	2022-04-21 17:02:15 -05:00

605 Commits