* uncomment all the headdim, use optdim to control
* change default back to -1
* uncomment splitkv instance
* Fix typo in receipt 4 for appendkv
* support optdim for bwd, splitkv and appendkv
* Fix 192 key error
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
Co-authored-by: Andy Lugo <Andy.LugoReyes@amd.com>
* unify pipeline signature with existing example
* iwyu
* move stuff around in load-tile-transpose
* cleanups in batched transpose pipeline
* comments
* use same inputs size
* cleaner printf
* print host args
* use 64 block sides in the 37_transpose example
* roll back grid dimension size adjustment for 37_transpose example
* transpose grid for 37_transpose to unify with 35_batched_transpose
* unify grid computation logic
* make policy methods device only (since they are used only on device from the pipeline)
* more host/device attribute cleanups
* copy over problem
* move over pipeline and policy
* add switch to batched transpose api
* make the lds problem more similar to original problem
* factor out logic into traits
* factor out conditional compilation into trait parameter
* propagate pipeline to args
* unhardcode pipeline dispatch parameter
* refactor vector size
* put warp tile out of dispatch
* rename template parameter for trait
* rewrite vector size in terms of problem
* mark policy-internal struct variable as device
* factor out input distribution and thread access pattern from policies
* reword vector size
* use datatype across batched transpose pipelines, problems and kernel
* remove transpose traits from lds pipeline
* add padding to the lds pipeline *interface*
* add comment
* remove ck_tile example #37
* update cmakelists
* add test for new pipeline
* update batched transpose test
* roll back load_tile_transpose changes
* remove comments
* pack dispatch parameters into a config
* padM can be enabled
* adjust lds vector size to enable padding along N
* update test
* clean up logic
* swap m/n input vector size
* adjust perf test script
* sweep over C/W in perf test
* count both read and written bytes into bandwidth (x2 the number)
* clang-format
* widen size range for perf test
* remove 64k x 64k case; it's too large for index
* remove thread tile from dispatch
* Solve merge conflict
* fix compile
* modify the transpose
* solve the test error and clang format
* Add v3 support for Groupd fwd conv+bias+clamp & ckProfiler (#2463)
* Add logging to IsSupported.
* Less casting in AddClamp
* Conv+bias+clamp instances & profiler BF16
* Fix 3D instances & run just 1x for verification.
* :Run just once for verification conv fwd.
* ckProfiler conv fwd clampwq
* Remove exec bit & formatting
* Add support for MultiD for grouped conv fwd v3.
* Enable 2Lds.
* clean
* align instances
* align instances
* profiler fixes
* Fixes
* fix
* fix
---------
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
* Fixing 0ms and inf GB/s issue in img2col (#2565)
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```
solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message
`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
* merge with develop
* solve clang format
---------
Co-authored-by: ThomasNing <thomas.ning@amd.com>
Co-authored-by: Adam Osewski <19374865+aosewski@users.noreply.github.com>
Co-authored-by: Adam Osewski <root@quanta-ccs-aus-f01-19.cs-aus.dcgpu>
Co-authored-by: Bartłomiej Kocot <barkocot@amd.com>
Co-authored-by: rahjain-amd <Rahul.Jain@amd.com>
issue :
====
``` sh
$ bin/tile_example_img2col
Perf: 0 ms, inf GB/s
```
solution :
======
Problem occured because config.time_kernel is false by default.
if false, then no need to calculate perf, just print proper message
`image_to_coloumn: pass, No Perf generated due to config.time_kernel=0`
* Elementwise kernel implementation
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: yashagar <yashagar@amd.com>
* Elementwise with generalized nDims
* Adding the n-ary input tensor feature
* Generalize dimensions on top of inputs
* Add TFLOPS + remove std usage for tuples
* 1D basecase optimization
* Cleanup code + refactoring to a common interface
* Generalize to unary and add an example
* Cleanup, refactoring and commenting
* Suggestions for LWPCK-3170: elementwise kernel improvements
* Clang-format: remod.py
* Replace InputTensorType with XDataType as the type of input_tensors
* Add Tuple::apply and use it in ElementWiseKernel::operator to call operation with the exact number of arguments in xs
* Move examples to folder 19_elementwise
* Add missing copyright headers and fix some existing ones
* Replace an assert with throw std::runtime_error in elementwise example
* Avoid reading the output by using make_static_distributed_tensor for y_tile
* Removed two unused includes
* No need to move windows to the next block when each workgroup processes a single tile
* Only copy input tensors to the device
* Use get_warp_size to obtain warp size, and use ceiling division for grid size also for the unary example
* Adding output strides to the kernel, transposition example and update the other examples
* Changes made by remod.py
* Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view
* Move binary operations to include/ck_tile/ops/elementwise/binary_elementwise_operation.hpp
* Reuse generic reference binary/unary operation in examples + refactoring the transpose reference
* Fix comments in elementwise_example.cpp
- Refer to AMD terminology except when suggesting NVIDIA alternatives in parentheses
- ElementWiseTraits was renamed to ElementWiseShape
- Adopt suggestions made by Copilot when prompted to check for factual or typographical errors
* Simplify CMakeLists.txt and remove the unused variables this uncovers
* Rename a file and fix some copyright statements
* Changes made by script/clang-format-overwrite.sh
* Add basic unit test for ElementWiseKernel
* Remove left-over uninformative comment in apply unit test
* Changes made by clang-format-overwrite.sh
* fixup! Use default template parameter values for memory operation and coherence in a call to make_naive_tensor_view
* Clean up test_tuple_apply.cpp and test_elementwise_1d.cpp
* Use make_uniform_array_with_factory to define h_xs and d_xs_mems_owner as type std::array
* Use a DeviceMem constructor that calls get_element_space_size_in_bytes internally
* Move examples to folder 20_elementwise
* Reduced register pressure on the CK tile elementwise kernel + add 4d input example to be able benchmark against old CK
* Fix CLang formating
* Bump up the elementwise example folder number
* Elementwise: add padding + minor cleanup
* Add Vector Size inference + fix issue with wrong vectorization due to missing GuaranteedLastDimensionVectorStride setting in make_naive_tensor_view
* Add isSupportedArg to Elementwise kernel + addapt example and unit tests
* Fix clang-format on the unit test file
---------
Co-authored-by: Damien Lejeune <damien.lejeune@amd.com>
Co-authored-by: Sami Aario <samaario@amd.com>
Co-authored-by: Mohsen Saffari <mohsen.saffari@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>