mtgu0705
45c6c584a0
Modify the kernel to 128x128x64, and use mfma_32x32x4
...
Add int4+scale based on Jing Zhang's pk_i4. Compilation passes, functional test passes.
2024-11-04 15:01:00 +08:00
Jing Zhang
f03dda4826
add ckProfiler
2024-10-27 07:27:23 -07:00
Jing Zhang
e463256fc7
fixed
2024-10-24 08:10:16 -07:00
Jing Zhang
9e15aa34a8
fixed
2024-10-24 08:09:43 -07:00
Jing Zhang
e9b7f26799
clean
2024-10-23 14:00:36 -07:00
Jing Zhang
7cb3d6fd58
recover v3r1
2024-10-23 13:57:52 -07:00
Jing Zhang
786a0faaac
add permute switch as a template
2024-10-23 13:42:00 -07:00
Jing Zhang
6a2521ea5d
fixed splitk crash
2024-10-23 10:23:14 -07:00
Jing Zhang
af2c016631
add and_or_b32
2024-10-22 18:30:35 -07:00
Jing Zhang
6d0e78bdee
improve weight layout
2024-10-22 18:09:30 -07:00
Jing Zhang
9fed0adea8
weight permute with splitk
2024-10-22 14:29:01 -07:00
Jing Zhang
be98313d80
add b tile permute
2024-10-22 10:25:53 -07:00
Jing Zhang
e053e94764
weight permute
2024-10-21 21:18:07 -07:00
Jing Zhang
82bb8dde6e
fixed splitk
2024-10-21 12:42:15 -07:00
Jing Zhang
65cfb2a15c
format
2024-10-21 12:26:13 -07:00
Jing Zhang
398f8851c5
debug i4_to_f16_convert
2024-10-21 12:25:39 -07:00
Jing Zhang
222e968893
format
2024-10-20 09:59:32 -07:00
Jing Zhang
05ab9105f5
fixed reference and host_tensor
2024-10-19 19:53:17 -07:00
Jing Zhang
205e0365e3
fix
2024-10-18 10:12:37 -07:00
Jing Zhang
c13366af6d
add fast pki4 to half conversion
2024-10-18 10:10:40 -07:00
Jing Zhang
24e18ae830
fixed coord reset
2024-10-15 19:48:46 -07:00
Jing Zhang
c3a4652a68
move packed into dynamic_buffer
2024-10-15 11:30:09 -07:00
Jing Zhang
77ad000e8a
clean
2024-10-15 10:08:44 -07:00
Jing Zhang
40d038e90d
clean
2024-10-14 22:10:44 -07:00
Jing Zhang
c3d05c0cf2
debug
2024-10-13 22:17:30 -07:00
Jing Zhang
3ef4d2c2c9
clean
2024-10-13 15:36:43 -07:00
Jing Zhang
0f3b88bf57
add a prototype of int4
2024-10-11 15:07:47 -07:00
Christopher Millette
ceaed8e097
Fixes small memory leak from missing hipEventDestroy ( #1554 )
2024-10-09 09:41:35 +02:00
Illia Silin
7d8ea5f08b
Fix build logic using GPU_ARCHS. ( #1536 )
...
* update build logic with GPU_ARCHS
* fix the GPU_ARCHS build for codegen
* unset GPU_TARGETS when GPU_ARCHS are set
2024-10-07 08:18:23 -07:00
Bartłomiej Kocot
6b54d2faf8
Fix grouped gemm check to avoid overflow ( #1545 )
2024-10-04 17:32:43 +02:00
macurtis-amd
aeb7c91f48
Fix compilation errors generated by forthcoming Clang changes ( #1544 )
...
Without this change, the following diagnostic is generated:
a template argument list is expected after a name prefixed by the template
keyword [-Wmissing-template-arg-list-after-template-kw]
See C++17 spec [temp.names] p5.
2024-10-02 13:56:22 -07:00
Illia Silin
42e6dceacc
Fix compilation errors with Clang20.0. ( #1533 )
...
* fix clang20 compilation errors for gfx90a
* fix clang20 compilation errors for gfx11 targets
2024-09-25 13:45:38 -07:00
Bartłomiej Kocot
4ba52b35dc
Add support for NGCHW in grouped conv fwd ( #1499 )
...
* Support NGCHW in grouped conv fwd
* Remove unneeded variable
* Fixes
2024-09-20 10:45:46 +02:00
Adam Osewski
0c39954da9
Remove unsupported (fp8) type from Add memory operation. ( #1521 )
...
The dynamic buffer does not support fp8 in the `Update` operation, so fp8 does not support `InMemoryDataOperation::Add`.
2024-09-20 09:40:45 +02:00
Jun Liu
81bc1496b2
Customize filesystem in CK for legacy systems ( #1509 )
...
* Legacy support: customized filesystem
* Update cmakefile for python alternative path
* fix build issues
* CK has no boost dependency
* More fixes to issues found on legacy systems
* fix clang format issue
* Check if blob is correctly generated in cmake
* fix the python issues
* add a compiler flag for codegen when using alternative python
* use target_link_options instead of target_compile_options
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-09-13 07:51:07 -07:00
Mateusz Ozga
448c0f56d8
Pool2d max/avg kernel in the BWD version ( #1494 )
...
* Add pool2d instance BWD AVG
* Add pool2d instance BWD MAX
* Fix: avg review
* Fix review: part2
* Fix - enable test when type is compiled
* Fix review part3
2024-09-12 11:47:52 +02:00
jakpiase
e8d2887cb2
Rewrite pool2d fwd ( #1462 )
...
* added pool2d fwd
* add tests
* add reviewers changes
* Revert "Merge remote-tracking branch 'origin/develop' into jakpiase/pool2d_fwd_new"
This reverts commit 6b2ba7ff89, reversing
changes made to 22c82bea0c.
* Revert "add reviewers changes"
This reverts commit 22c82bea0c.
* added reviewers comments
* revert some old files
* add reviewers requests
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-09-11 15:21:00 +02:00
jakpiase
2a261afcdf
Added structural sparsity blockwise gemm ( #1435 )
...
* Implemented smfmac xdlops
* Added smfmac blockwise xdlops
* fixes
* add reviewers suggestions
---------
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
2024-09-11 15:19:42 +02:00
M.Emin Ozturk
8378855361
Modification to fix this issue "threadwise_tensor_slice_transfer_v5r1 issue #1279" ( #1492 )
...
* issue fix, one line changed for tmp
* clang
---------
Co-authored-by: Emin Ozturk <emin.ozturk@utah.edu>
Co-authored-by: Harisankar Sadasivan <135730918+hsadasiv@users.noreply.github.com>
2024-09-04 21:52:55 -07:00
Haocong WANG
5b10dae6a4
Add gemm universal bf16 instances ( #1484 )
...
* revert ckprofiler change
* temp save
* Add test and test pass
* test pass
* Fix bug inside rotating buffer when tensor is not packed
* bug fix
* clang format
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
2024-09-04 20:58:54 -07:00
Bartłomiej Kocot
73b67f290f
Add support for NGCHW in grouped conv bwd wei ( #1491 )
...
* Add support for NGCHW in grouped conv bwd wei
* Comments fixes
* navi fixes
* Update function names
2024-09-03 10:52:03 +02:00
Bartłomiej Kocot
a9b170b541
Revert "Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd ( #1382 ) ( #1406 ) ( #1415 )" ( #1455 )" ( #1490 )
...
This reverts commit 5ff8eeebf9.
2024-09-02 10:39:49 +02:00
Andriy Roshchenko
c3515f277c
Adding Instances and Examples for FP8-based Scaled Convolution and AMAX Reduction. ( #1473 )
...
* Enable CMakePresets build
* Verify Convolution, Scaling and ReLU algorithms.
* Add tensor element-wise scale and type cast operation.
* Reduction implemented but does not work.
* Exploration of Reduction functionality.
* Completed example for Convolution scaled with ReLU activation and AMAX reduction.
* WIP: Add required instances for convolution.
* WIP: Create client example. Implement convolution stage.
* Add elementwise instances.
* Add elementwise scale + convert example.
* Add reduction instances.
* WIP: Client example for AMAX reduction.
* WIP: Add instances for multistage reduction.
* WIP: Implementation of multistage reduction.
* Refactoring.
* Clean up.
* Add CMakePresets.json
* Guard off FP8 instances when the data type is not available.
* Add example for Scaled FP8 Convolution with AMAX reduction.
* Refactor CombConvScaleRelu instances.
* Add CombConvScale instances.
* Add client example for Scaled FP8 Convolution with AMAX reduction.
* Cleanup.
2024-08-21 15:22:41 -07:00
Rostyslav Geyyer
e20f20efbf
Set RNE fp8 conversion as a default ( #1458 )
...
* Set RNE fp8 conversion as a default
* Update f8 tests
* Disable failing test on gfx11
* Update bf8 tests
* Add a flag
* Fix the flag
* Raise flag for gfx10 as well
* Temp commit for tolerance testing
* Update tolerances
2024-08-21 09:09:48 -07:00
Haocong WANG
3049b5467c
[GEMM] gemm_universal related optimization ( #1453 )
...
* replace buffer_atomic with global_atomic
* fixed global_atomic_add
* added bf16 atomic_add
* format
* clang-format-12
* clean
* clean
* add guards
* Update gtest.cmake
* enabled splitk_gemm_multi_d
* format
* add ckProfiler
* format
* fixed naming
* format
* clean
* clean
* add guards
* fix clang format
* format
* add kbatch printout
* clean
* Add rocm6.2 related gemm optimization
* Limit bf16 atomic usage
* remove redundant RCR gemm_universal instance
* Add RRR fp8 gemm universal instance
* Bug fix
* Add GPU_TARGET guard to FP8/BF8 target
* bug fix
* update cmake
* remove all fp8/bf8 examples if the arch does not support them
* Enable fp8 RRR support in ckProfiler
* limit greedy-reverse flag to gemm_universal in ckProfiler
---------
Co-authored-by: Jing Zhang <jizhan@fb.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-08-14 10:42:30 +08:00
Mateusz Ozga
0606e5498e
Support large: 12d tensor size for reduction kernel ( #1465 )
2024-08-13 16:15:47 +02:00
Bartłomiej Kocot
4a870942e6
Fix bug with n block id calculation in DeviceGroupedConvXdlCShuffle ( #1457 )
...
* Fix typo in TransformConvFwdToGemm
* Fix bug in n offset calculation
2024-08-10 13:12:05 +02:00
Jun Liu
5ff8eeebf9
Revert "Revert Revert Support access per groups and filter2x3 in grouped conv fwd ( #1382 ) ( #1406 ) ( #1415 )" ( #1455 )
...
This reverts commit 33b399cc15.
2024-08-08 19:09:33 -07:00
Juan Manuel Martinez Caamaño
901e5f1540
Remove reinterpret_cast uses that result in undefined behaviour. ( #1445 )
...
* Remove reinterpret_cast uses that result in undefined behaviour. Use a bitcast instead.
See https://en.cppreference.com/w/cpp/language/reinterpret_cast#Type_accessibility
Closes #1439
* fix clang format
---------
Co-authored-by: illsilin <Illia.Silin@amd.com>
2024-08-07 11:49:02 -07:00
Juan Manuel Martinez Caamaño
fd9ef4e678
Add missing constexpr to if conditions ( #1444 )
2024-08-06 11:40:34 -07:00