composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-05-03 05:01:25 +00:00

Author	SHA1	Message	Date
arai713	a2969aa8b6	Disable transpose device op for MI300 (#1050 ) * added working example for 5D input using 1D kernel * example with 5D input tensor and 2d kernel - not working: issues with arguments * added updated version of 3d device op - changed descriptors/dims * added example file to check kernel * fixed descriptor and isSupportedArgument stride problem * added and modified kernel for 3d - updated tids/loop * adding some more 5d example files * fixed some issues * changes made for testing * working version: fixed error in stride for A, still a bit inefficient * cleaned up formatting/comments * updating formatting * more formatting fixes * fixing cmake, adding back gpu targets in cmake script * adding client example * added instances for client example * fixed errors in client example * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp * removed extra files * minor formatting and naming fixes * adding test files and profiler * fixing minor error * minor fix * removed unneccesary comments, renamed files * updated instance list for client example, added different layout example * removing instances * fixed error in instance generation * remove comments * update profiler and client example tensor layouts * fixed errors in test/profiler * updated vector dim access to enable vector load * updated test/profiler files * updated example with 1d kernel * updating profiler * renamed files * disabled device op for MI300 * skip elementwise_permute_2d on gfx94x * Update CMakeLists.txt * fixing CMake - disabling some GPU targets --------- Co-authored-by: Jing Zhang <jizha@amd.com> Co-authored-by: Jing Zhang <jizhan@amd.com> Co-authored-by: zjing14 <zhangjing14@gmail.com>	2023-11-29 11:36:40 -06:00
Illia Silin	7965d66a81	Split the static library into several files. (#1044) * split the static library into several * update lib paths and fix client example * do not use device_mha_operarions for client examples * use appropriate libs to link to client examples * remove the gpu/transpose path from the list * try fixing client examples 3,4,9 * add necessary libs for client examples * fix the layernorm client example * fix the client examples 23 and 24 * fix typo * add interface library and refresh clang format	2023-11-28 11:17:37 -08:00
Bartlomiej Wroblewski	bfecc19352	Fix cluster length arrange order in fp16 GEMM example (#1055)	2023-11-27 11:31:14 +01:00
Bartlomiej Wroblewski	627054b941	Add basic support for direct loads from global to LDS (#999 ) * Add basic support for direct loads from global to LDS * Clean the code and comments * Add support for fp16 * Add comments * Add check for thread cluster lengths * Align non-direct-load fp16 example * Small fixes * Extend IsSupported to check for supported GPU gens * Build examples only on the supported HW * Do not throw when instance not supported in 04 example * Review: Apply review suggestions * Review: small fix * Review: small fix	2023-11-25 13:35:22 +01:00
zjing14	e8cddfdc3b	Improve 4k gemm perf (#1047) * improve 4k gemm perf * add f8 instances * format --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-11-17 07:06:24 -06:00
Bartłomiej Kocot	f2398f612d	Introduce multiABD api and deprecate multiD (#1035) * Introduce multiABD api and deprecate multiD * Replace multiD with multiABD * Mark structures as deprecated * Change doxygen deprecated to note to avoid warnings	2023-11-14 17:00:40 +01:00
arai713	454cf7bd1f	Hip tensor permute (#1002 ) * adding files for F32 example * adding functioning implementation with scalar multiplication and unary operator support * added fp 16 type check in unary square * updating scalar multiplication as an operator * functioning version with scalar operator * changing strides for col major * updated column major implementation * working column major implementation * cleaned up comments, rearranged/renamed files	2023-11-13 11:15:48 -06:00
Bartłomiej Kocot	49e52bb357	Support multi AB for grouped conv fwd xdl (#1027 ) * Support multi AB for grouped conv fwd xdl * Add instances * Add client example * Add example * Add interface test * Minor fixes Minor fixes Minor fixes * Comment fixes * Fixes * Reference fix * Test xdl fixes * Improve multi_ab interface test	2023-11-10 15:54:44 +01:00
rocking	1db7560365	Backward of gamma and beta for layernorm and groupnorm (#1013) * Add layernorm backward reference code * Add groupnorm backward reference code * Add example * clang format * Fix bug of reference layernorm and groupnorm * Fix naming * Refine naming * Add device op for normalization bwd gamma and beta * Refine template parameter * Add bwd gamma & beta of kernel * 1. Add groupnorm example 2. Refine layernorm naming * Narrow down the static check for performance * Refine variable name	2023-11-10 18:02:03 +08:00
arai713	3af8c81a72	Transpose 3d (#984 ) * added working example for 5D input using 1D kernel * example with 5D input tensor and 2d kernel - not working: issues with arguments * added updated version of 3d device op - changed descriptors/dims * added example file to check kernel * fixed descriptor and isSupportedArgument stride problem * added and modified kernel for 3d - updated tids/loop * adding some more 5d example files * fixed some issues * changes made for testing * working version: fixed error in stride for A, still a bit inefficient * cleaned up formatting/comments * updating formatting * more formatting fixes * fixing cmake, adding back gpu targets in cmake script * adding client example * added instances for client example * fixed errors in client example * implemented client ex with device_elementwise.hpp and device_elementwise_3d_impl.hpp * removed extra files * minor formatting and naming fixes * adding test files and profiler * fixing minor error * minor fix * removed unneccesary comments, renamed files * updated instance list for client example, added different layout example * removing instances * fixed error in instance generation * remove comments * update profiler and client example tensor layouts * fixed errors in test/profiler * updated vector dim access to enable vector load * updated test/profiler files * updated example with 1d kernel * updating profiler * renamed files --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-11-08 19:45:07 -06:00
rocking	a3d9a2cd42	Layernorm4d (#1022 ) * Rename folder * Add layernorm 4d fwd example * Rename original layernorm example * Add layernorm 4d f16 test * Add layernorm4d_fwd client example * Support layernorm4D in ckProfiler * Rename groupnorm to groupnorm fwd in example * Rename layernorm and group fwd in test * Rename normalization to normalization_fwd (instances) * Add fwd to DeviceNormalization * Rename external api header * Rename folder, because we can also add bwd in this folder * Add fwd in layernorm and groupnorm (profiler * Fix compile error --------- Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2023-11-09 08:34:51 +08:00
Bartlomiej Wroblewski	16eb824c90	Add missing ComputeDatatype in contraction_multi_ABD_xdl_fp16 (#1024)	2023-11-03 08:22:11 -07:00
Bartlomiej Wroblewski	4ef704d8a6	Add support for mixed precision in contraction scale and bilinear (#973 ) * Add support for mixed precision in contraction scale and bilinear (#936) * Extract common functionality to separate files * Reference contraction: Remove incorrect consts from type_converts * Reference contraction: Add missing type_convert for dst value * Reference contraction: Fix incorrect order of B matrix dimensions * Add support for mixed precision in contraction scale and bilinear * Move using statements from instances to a common file * Move using statements from examples to a common file * Fix the order of B matrix dimensions across examples and profiler * Fix the computation of error threshold * Make ComputeDataType an optional argument * Include possible DataType -> ComputeDataType casting error in the threshold * Remove commented code * Make the ComputeDataType an optional argument in instance --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2023-11-02 14:26:33 -07:00
Bartłomiej Kocot	f27ea94ecb	Add ScaleAddScaleAddRelu post op for conv fwd (#1006) * Add ScaleAddScaleAddRelu post op for conv fwd * Fixes * Fix instance file name * Minor fix	2023-11-01 18:31:30 -05:00
Bartłomiej Kocot	2e824c6d46	Add support for groups in Img2Col/Col2Img (#1007) * Add support for groups in Img2Col/Col2Img * Fix interface test * Fix interface test G to N * Improve performance * Change gemm layout to 3d * Fixes	2023-10-31 10:46:32 +01:00
Illia Silin	f46a6ffad8	Fix the fp8 gemm for large tensors on MI300. (#1011 ) * Fix the fp8 conversion * Try clipping value before conversion * Fix return * Simplify with a const * reduce the gemm input tensor values to reduce round-off error * replace if-else with lambda * fix syntax --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com>	2023-10-27 21:10:47 -07:00
Bartłomiej Kocot	ac0e006766	Fix cmake dtype check (#989) * Fix instances dtype check * Fix source dtypes selector for examples and tests * Sync with new cmakefile changes * Remove not needed ifdefs * Remove not needed ifdefs	2023-10-21 22:19:43 +02:00
Rostyslav Geyyer	1fd27d520f	Fix bf8 conversion issues (#1003) * Fix the conversion * Add bf8 functionality * Enable example on MI200 as well	2023-10-20 08:00:45 -05:00
Bartłomiej Kocot	82f3a835d5	Extend available elementwise operations with conv examples (#995) * Extend available elementwise operations with conv examples * Fixes * Remove not needed convert * Update CMakeFile and dir name	2023-10-19 17:23:19 +02:00
Bartlomiej Wroblewski	0abc0f87db	Change 1d,2d,... to 1D,2D,... (#997)	2023-10-19 16:53:18 +02:00
rocking	3696fe1c76	Layernorm and groupnorm support to save mean and inverse std in forward (#929 ) * save mean and inverse std in normalization * Save mean and inverse std in splitK * Vector save mean and inv std * Modify instance for save mean and std * simplify the layernorm example * Save mean and std in groupnorm example * Save mean and inv std in ckProfiler and test * Remove compute data type from base class * Save mean and inv std in client example * Add changelog * clang format * Fix compile error * Refine naming * Avoid error in bf16 * revert changelog	2023-10-19 07:36:29 +08:00
zjing14	bf435140dc	Clean DTYPES conditions in CMake (#974 ) * Add a condition to build fp8 instances * simplified buffer_load/store * add bfp8/fp8 * fixed * remove all f8/bf8 condition include folder * fixed cmake conditions * fixed DTYPES=fp16/bfp16 * fix * fixed buffer_load * fixed buffer_store * fix * clean example cmake files * fixed ci * fixed cit --------- Co-authored-by: Rostyslav Geyyer <rosty.geyyer@amd.com> Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-18 11:14:14 -05:00
zjing14	1cc36ba5fb	Add contraction_multi_abd (#972 ) * add gridwise_multi_abd * move element_op into RunRead * merge element_wise op with data read * add multiABD example * allow packed elementwise_op * changed example * clean * clean * add is_detected * fix * minor fix * add scaleAdd_vec4 example * init commit for contraction_multi_ABD * add examples * add examples of multiA and broadcast * update example * fixed comments * Update cmake-ck-dev.sh * Update cmake-ck-dev.sh * Add comments into the example * Update CMakeLists.txt --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-17 20:17:58 -05:00
Bartłomiej Kocot	16d7c4d2f7	Add grouped conv bwd weight wmma (#985) * Add grouped conv bwd weight wmma * Update README, changelog, profiler * Minor fixes * Fix grouped conv bwd wei dl kernel * Minor fixes * Minor stylistic fixes	2023-10-17 10:32:26 +02:00
zjing14	2ce9b56c64	add vector_type support into thread_copy_v3r1 (#969 ) * add vector_type support into thread_copy_v3r1 * remove unncessary type_convert * fixed datatype * fixed dataType * changed API with is_packx_invocable * changed example * add missing cmake file * fixed ci * fixed cmake --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-13 15:11:43 -05:00
zjing14	ac9595a9f1	Fixed f8_gemm NaN (#975) * workaround nan problem by changing output to fp16 * enable f8/bf8 gemm tests on MI200 * workaround f16 to f8 conversion --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-10 10:30:26 -05:00
Lauren Wrubleski	5913609168	Replace CMake `return` from later CMake (#970)	2023-10-05 14:58:58 -07:00
Illia Silin	4daedf8ca5	Revert "Add support for mixed precision in contraction scale and bilinear" (#967 ) * Revert "Add support for mixed precision in contraction scale and bilinear (#936)" This reverts commit `f07485060e`. * revert commits #957 and #960	2023-10-05 14:58:23 -07:00
zjing14	570ff3ddbe	remove example 60 (#963) Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-05 09:41:01 -07:00
Rostyslav Geyyer	42facfc6b7	Add conv bwd weight fp16 comp bf8 fp8 op, instances and example (#945 ) * Add f8 bf8 gemm example * Add element-wise ops * Add intrinsics * Update reference calculation * Add an additional type option for xdlops gemm * Fix build process * Add bf8 to buffer addressing * Update blockwise op, split typeA and typeB * Update for compatibility * Uppdate naming to f8->fp8 * Update naming * Format * Update naming (#937) * Add a client example * Add computetypes to device and gridwise ops * Add instances, update instance factory * Format * Fix a flag * Add ckProfiler mode * Fix typos * Add an example * Add bf8 generator * add bf8 mfma; fixed type_convert for bf8 * move verfication ahead of timing * Update reference calculation * Fix reference * Narrow down float init range * Fix bf8 bf8 mfma * Add bf8 @ fp8 mfma * Update example * Update instances * Update profiler api * Update for compatibility * Format * Remove extra example * Clean up * workaround convert --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-04 08:19:08 -05:00
zjing14	5311d1b325	changed test for grouped_gemm to be random (#959) Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-03 09:32:58 -05:00
zjing14	aa46039f2d	Fixed contraction issues (#960) * add missing ComputeType * fixed * Update cmake-ck-dev.sh --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-03 09:32:44 -05:00
Rostyslav Geyyer	bd09b5c538	Add fp8 @ bf8 gemm support and example (#933 ) * Add f8 bf8 gemm example * Add element-wise ops * Add intrinsics * Update reference calculation * Add an additional type option for xdlops gemm * Fix build process * Add bf8 to buffer addressing * Update blockwise op, split typeA and typeB * Update for compatibility * Uppdate naming to f8->fp8 * Update naming * Format	2023-10-02 16:39:03 -05:00
Illia Silin	59dbb01fd1	get rid of gfx900/906, set rocm5.7 as default (#958)	2023-10-02 12:01:11 -07:00
zjing14	9d58c42103	Contraction multi abd (#957 ) * add gridwise_multi_abd * move element_op into RunRead * merge element_wise op with data read * add multiABD example * allow packed elementwise_op * changed example * clean * clean * add is_detected * fix * minor fix * add scaleAdd_vec4 example * init commit for contraction_multi_ABD * add examples * add examples of multiA and broadcast * update example * fixed comments * Update cmake-ck-dev.sh * Update cmake-ck-dev.sh * Add comments into the example --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-10-02 09:18:36 -05:00
Bartlomiej Wroblewski	f07485060e	Add support for mixed precision in contraction scale and bilinear (#936 ) * Extract common functionality to separate files * Reference contraction: Remove incorrect consts from type_converts * Reference contraction: Add missing type_convert for dst value * Reference contraction: Fix incorrect order of B matrix dimensions * Add support for mixed precision in contraction scale and bilinear * Move using statements from instances to a common file * Move using statements from examples to a common file * Fix the order of B matrix dimensions across examples and profiler * Fix the computation of error threshold * Make ComputeDataType an optional argument * Include possible DataType -> ComputeDataType casting error in the threshold * Remove commented code	2023-09-29 10:54:31 -05:00
Bartłomiej Kocot	cb53874002	Add grouped conv bwd data wmma (#950) * Add grouped conv bwd data wmma * Fix copyrights * Add instances with smaller NPerBlock * Update interface test * Minor stylistic fixes * Minor stylistic fixes	2023-09-28 23:10:18 +02:00
Bartłomiej Kocot	e2243a4d1e	Add column to image kernel (#930 ) * Add column to image kernel * Minor fixes for dtypes and client examples * Disable tests for disabled dtypes * Disable add instances functions for disabled data types * Minor stylistic fixes * Revert "Disable add instances functions for disabled data types" This reverts commit `728b869563`. * Instances reduction * Add comments in device_column_to_image_impl * Update changelog and Copyrights * Improve changelog	2023-09-27 17:19:06 +02:00
zjing14	11676c7e49	Add multiple A/B support (#906 ) * add gridwise_multi_abd * move element_op into RunRead * merge element_wise op with data read * add multiABD example * allow packed elementwise_op * changed example * clean * clean * add is_detected * fix * minor fix * add scaleAdd_vec4 example --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-09-26 21:16:23 -05:00
Illia Silin	bba085d2b5	Refactoring cmake files to build data types separately. (#932 ) * refactor cmake files for the tests * refactor cmake files for examples * fix cmake for gemm example * fix the cmake file for all examples * add splitting by data types in gemm_splitk instance header * rename test to reflect only dl instances are used * clean up CI workspace, update cmake for instances * change the jenkinsfile syntax * build all instances except DL on gfx11 * move workspace cleanup after stages * clean up workspace after every stage * isolate data types in grouped_conv_fwd header * isolate dl instances for grouped_conv2d_fwd * fix syntax * fix cmake and batchnorm instances * fix typo * fix reduction instances * fix grouped_conv headers * fix syntax * replace parsing logic for instances, replace bfp16 with bf16 * fix the client examples build * clean up DTYPES from instances cmake files * update the parsing logic in cmake files * make an exception for reduction kernels * update few remaining cmake files to handle DTYPES * fix syntax * fix cmake conflicts * replace f8 with fp8 test name * resolve conflicts for dpp instances	2023-09-20 22:15:56 -07:00
zjing14	f9d0eddb90	Add fp16/fp8 support into Grouped gemm FixedNK (#874 ) * move all arguments into device * add b2c_tile_map * add examples * add SetDeviceKernelArgs * dedicated fixed_nk solution * init client api * add grouped_gemm_bias example * add a instance * add instances * formatting * fixed cmake * Update EnableCompilerWarnings.cmake * Update cmake-ck-dev.sh * clean; fixed comments * fixed comment * add instances for fp32 output * add instances for fp32 output * add fp32 out client example * fixed CI * init commit for kbatch * add splitk gridwise * format * fixed * clean deviceop * clean code * finish splitk * fixed instances * change m_loops to tile_loops * add setkbatch * clean code * add splitK+bias * add instances * opt mk_nk instances * clean examples * fixed CI * remove zero * finished non-zero * clean * clean code * optimized global_barrier * fixed ci * fixed CI * instance and client * removed AddBias * format * fixed CI * fixed CI * move 20_grouped_gemm to 21_grouped_gemm * clean * formatting * clean * clean * fixed computeType --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-09-14 21:04:10 -05:00
Bartłomiej Kocot	475188ca2e	Add grouped conv bwd weight dl instances and new layout (#897) * Add grouped conv bwd weight dl instances and new layout * Add M and N padding * Remove todo comment * Enable grouped conv fwd dl k,c=1 generic instance * Comment fixes	2023-09-13 10:14:31 -05:00
zjing14	a66d14edf2	fixed fp8 issues (#894 ) * fixed fp8 init; and reference gemm * Update host_tensor_generator.hpp * fixed convert * fixed reference gemm * fixed comments * fixed comments * fixed ci * fixed computeType --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-09-12 22:17:56 -05:00
Rostyslav Geyyer	62d4af7449	Refactor f8_t, add bf8_t (#792 ) * Refactor f8_t to add bf8_t * Add check_err impl for f8_t * Update fp8 test * Format * Revert the fix * Update vector_type implementation * Add bf8 test * Add bf8, use BitInt types * Add bf8 conversion methods * Update type_convert for fp8/bf8 * Add check_err fp8/bf8 support * Add subnorm fp8 tests * Add subnorm bf8 tests * Fix conversion * Add bf8 cmake bindings * Add macros to enable build with disabled fp8/bf8 * Remove is_native method * Update flag combination for mixed precision instances * Add more flag checks * Add another flag to a client example * Add type traits, decouple f8/bf8 casting * Clean up * Decouple fp8 and bf8 flags * Remove more redundant flags * Remove leftover comments	2023-09-12 17:04:27 -05:00
Haocong WANG	562b4cec48	[Navi3x] Add fp16/int8 wmma conv forward instances (#746 ) * fix wmma gemm int8; add grouped conv int8 example * Add int8 gemm-bilinear instances * compile sanity check unknown * Sanity pass + clang-format * add int8 conv profiler instances * solve merge conflict --------- Co-authored-by: zjing14 <zhangjing14@gmail.com> Co-authored-by: Chao Liu <chao.liu2@amd.com>	2023-09-07 21:59:26 -05:00
Bartlomiej Wroblewski	37a8c1f756	Redesign the DPP8 GEMM kernel to use warp-wise component (#863 ) * Redesign the DPP8 GEMM kernel to use warp-wise component * Review: Improve error messages * Review: Remove unnecessary empty lines * Review: Fix M, N per thread names * Review: Rename mfma_input_type to dpp_input_type * Review: Fix tensor adaptor; remove unnecessary element * Review: Remove calls to dpp_gemm's MakeCDescriptor * Review: Add blockwise doc, change function names to include dimension names * Review: Remove duplicated code; Move Block2CtileMap alias to the top of the file * Review: Add __restrict__ keywords * Review: Use MatrixPadder for padding A, B, C matrices * Review: Remove hardcoded datatypes * Review: Change names from FloatX to XDataType * Review: Introduce AK0 and BK0 instead of a single K0 * Review: Remove construction of dpp_datatypes object * Review: Rename DppInstrRunner to DppLanegroupGemm	2023-09-06 11:44:09 -05:00
Bartłomiej Kocot	0077eeb3be	Add image to column kernel (#867) * Add image to column kernel * Add instances, tests, profiler, example * Add client example * Several fixes of image to column * Fix variable name in device_image_to_column_impl * Several fixes of image to column profiler * Fix num_btype calculation * Make new measurements for correct bytes calculation	2023-09-05 10:11:40 -05:00
zjing14	f5ec04f091	Grouped Gemm with Fixed K and N with SplitK (#818 ) * move all arguments into device * add b2c_tile_map * add examples * add SetDeviceKernelArgs * dedicated fixed_nk solution * init client api * add grouped_gemm_bias example * add a instance * add instances * formatting * fixed cmake * Update EnableCompilerWarnings.cmake * Update cmake-ck-dev.sh * clean; fixed comments * fixed comment * add instances for fp32 output * add instances for fp32 output * add fp32 out client example * fixed CI * init commit for kbatch * add splitk gridwise * format * fixed * clean deviceop * clean code * finish splitk * fixed instances * change m_loops to tile_loops * add setkbatch * clean code * add splitK+bias * add instances * opt mk_nk instances * clean examples * fixed CI * remove zero * finished non-zero * clean * clean code * optimized global_barrier * fixed ci * fixed CI * removed AddBias * format * fixed CI * fixed CI * move 20_grouped_gemm to 21_grouped_gemm --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-08-31 09:22:12 -05:00
rocking	866377de18	MaxPool & AvgPool bwd instances, test, ckProfiler, client example (#861 ) * Add maxpool instances * Rename index pool to max pool. * Add maxpool bwd bf16 instances * Add avg pool bwd instances * Rename avgpool and maxpool to avg_pool3d and max_pool * Add bf16 pool fwd instances * Add max pool bwd to ckProfiler * Add avg pool3d bwd to ckProfiler * Add avg pool bwd test * Fix bug of reference pool fwd (dilation) * Fix bug of max pool bwd (dilation and initZero) * Support bf16 compute data type * Force compute type be f32. Because atomicAdd only support f32 * Add max pool bwd test * Rename folder * Rename pool * Add max pool bwd client example * Add avg pool bwd client example * Add missing workspace * clang format * Rename macro * remove useless header * remove useless layout	2023-08-31 21:01:50 +08:00
zjing14	38ada109ea	add an example of customized type convert - bfp16_rtn (#869) * add an example of customized bfp16_rtn * fixed threadwise_copy --------- Co-authored-by: Jing Zhang <jizha@amd.com>	2023-08-29 12:31:24 -05:00

966 Commits