* add failing tests
* swap out and reference
* add constraint assert to transpose input distribution
* test both pipelines with rectangular block tile
* print mismatched indices
* add a smaller failing test for old pipeline
* print grid and block
* fill output before operating on it
* swap m/n tile sizes and make one test pass
* add device syncs
* add one more flipped test case
* flip block tile at host arg init
* fix tiles for lds pipeline
* clang-format
* rename tests
* roll back error check
* remove device syncs
* reduce large test case's size
* add a dummy test file
* add kernel launch logic to the test
* transfer all test cases into gtest params
* factor kernel out into test config
* add load transpose pipeline tests
* add padded tests and skip invalid kernels at runtime
* enum class for pipeline type
* add multiwarp test cases
* fix type
* try to solve the problem
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
* unify pipeline signature with existing example
* iwyu
* move stuff around in load-tile-transpose
* cleanups in batched transpose pipeline
* comments
* use same inputs size
* cleaner printf
* print host args
* use 64 block sides in the 37_transpose example
* roll back grid dimension size adjustment for 37_transpose example
* transpose grid for 37_transpose to unify with 35_batched_transpose
* unify grid computation logic
* make policy methods device only (since they are used only on device from the pipeline)
* more host/device attribute cleanups
* copy over problem
* move over pipeline and policy
* add switch to batched transpose api
* make the lds problem more similar to original problem
* factor out logic into traits
* factor out conditional compilation into trait parameter
* propagate pipeline to args
* unhardcode pipeline dispatch parameter
* refactor vector size
* put warp tile out of dispatch
* rename template parameter for trait
* rewrite vector size in terms of problem
* mark policy-internal struct variable as device
* factor out input distribution and thread access pattern from policies
* reword vector size
* use datatype across batched transpose pipelines, problems and kernel
* remove transpose traits from lds pipeline
* add padding to the lds pipeline *interface*
* add comment
* remove ck_tile example #37
* update cmakelists
* add test for new pipeline
* update batched transpose test
* roll back load_tile_transpose changes
* remove comments
* pack dispatch parameters into a config
* padM can be enabled
* adjust lds vector size to enable padding along N
* update test
* clean up logic
* swap m/n input vector size
* adjust perf test script
* sweep over C/W in perf test
* count both read and written bytes into bandwidth (x2 the number)
* clang-format
* widen size range for perf test
* remove 64k x 64k case; it's too large for index
* remove thread tile from dispatch
* Solve merge conflict
* fix compile
* modify the transpose
* solve the test error and clang format
* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.
* Less casting in AddClamp
* Conv+bias+clamp instances & profiler BF16
* Fix 3D instances & run just 1x for verification.
* :Run just once for verification conv fwd.
* ckProfiler conv fwd clampwq
* Remove exec bit & formatting
* Add support for MultiD for grouped conv fwd v3.
* Enable 2Lds.
* clean
* align instances
* align instances
* profiler fixes
* Fixes
* fix
* fix
---------
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Fixing 0ms and inf GB/s issue in img2col (#2565)
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```
solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message
`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
* merge with develop
* solve clang format
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
* Create tests for ck tile batched transpose using example
* Create ck tile tests for smoothquant using examples
* fix precision input strings and convert batched transpose to regression tests
* Code cleanup and fix asserts
* add missing licenses
* update copyright and licensing in files
* Update smoothquant tests to use example's smoothquant.cpp
* Add custom target for batched transpose tests
* Add missing new lines at end of files for CMakelists
* fix typo in batched transpose CMakeList target_compile_options
---------
Co-authored-by: root <root@ctr-ubbsmc16.amd.com>