* Support 64 bit indexing
* Add new grouped conv fwd kernel for large tensors
* Add instances large tensor
* Fixes for transform conv to gemm
* Fixes
* fixes
* Remove not needed instances
* examples fixes
* Remove not need ds arrays
* Fix tests
* Add 2GB check in gridwise dl
* Fixes
* add ab_scale init support
* enabled interwave
* add scale type; update isSupport
* adjust example
* clean
* enable f8 pure gemm rcr ckprofiler
* Add gemm_multiply_multiply instances
* clang format
* Optimize for ScaleBlockMNK=128
* enable abscale f8 gemm ck profiler
* Add pure f8 gemm test suite
* Reverting to the state of project at f60fd77
* update copyright
* clang format
* update copyright
---------
Co-authored-by: root <jizhan@amd.com>
* init for reduce_threadwise multi_d
* add reduce_threadwise_multi_d
* add reduce_multi_d
* clean
* start add an other splitk device op
* add reduce template parameter to SplitKBatchOffset
* add reduce c matrix
* clean up code
* change example data type to bf16
* add bf16Ai8B example
* remove reduce template parameter
* add splitk atomic status to v4
* example add multi d parameters
* device op add multi-d parameters
* add multi-d to reduce
* fix kbach=1 bug
* change B layout to col in bf16Ai8B example
* remove float adding struct
* change multi-d interface
* change file and class name
* remove multi-d of bf16Ai8B example
* change IsReduce function to IsReduceAdd
* change example layout to RRR from RCR
* according layout to set ds stride
* reset parameter layout
* add gemm universal reduce instance
* add reduce factory
* add profile_gemm_universal_reduce
* add reduce to profiler
* fix reduce instance
* fix profiler reduce compiling bug
* format
* format library instance code
* add mem instance for reduce library
* fix call instance names
* add workspace for reduce in ckProfiler
* format
* add mnpading to reduce library instance
* add fp16 instance to reduce of profiler
* change copyright time
* restore profiler cmake file
* add reduce text to instances
* add DsLayout and DsDataType to instances template parameter
* fixed gemm_reduce_multi_d
* add an example without multi_d
* Update common.hpp
* Update gtest.cmake
* Update gemm_xdl_splitk_reduce_bf16.cpp
* clean
* Update gtest.cmake
* format
* fixe api
* format
* default parameter change to RRR
* add vector_len for multi_d
* format
* Update gtest.cmake
* fix bf16A iBB elementwiseop
* add ReduceDataType
* move ReduceDataType to end position
* format
* remove googletest git method address
* fix copyright time
* update init data
---------
Co-authored-by: root <jizhan@amd.com>
Co-authored-by: letaoqin <letaoqin@amd.com>
Co-authored-by: Jing Zhang <jizhan@meta.com>
Co-authored-by: zjing14 <zhangjing14@gmail.com>
* universal streamk with atomics with ckprofiler support. grid_size and streamk strategy are tunable. grid_size of -1 leads to #WGs = maximum occupancy X num_CUs. implementation supports many different streamk policies: 1-tile, 2-tile, 3-tile and 4-tile. streamk strategy of -1 leads to default streamk policy (4-tile).
* Update README.md
* fixing clang-format issues
* removed conflicts in struct members between streamk and universal streamk
* corrected arg parsing for streamk and universal streamk
* added stream-k policies for 3 tile and 4 tile
* fixed argument type issue with parsing cmd args
* changes suggested in PR review are made- removing comments and correcting copyright
* file permissions updated
* added default value support for grid_size and streamk-policy selection set to -1
* print messages for arguments
* print messages for arguments
* print messages for arguments1
* Format
* Format
* Format
* Remove const
* Use the right template
* Format
* Format
* add row/col instances
* Add missing file
* fixed
* fixing block to etile error
* Format
* Updates
* Format
* fixed rrr layout
* generating a sample JSON file: currently contains includes, prologue/epilogue and instances
* version where the json is passed into the instances to generate a key
* updated run function to just launch kernel
* updated run function: only contains kernel object, json file is updated but still needs to be cleaned up, added front-end API to parse JSON into character buffer
* adding in testing files
* cleaned up comments, still need to work on including header files
* removed unneeded files
* removed/commented out JSON implementation
* added fusion(prologue/epilogue) into instance generation
* working on instance selection
* added instance selection, need to fix instance validation
* removed block2etile map validity check for testing purposes
* test running: failing due to incorrect files/input
* all grid descs/ptrs completed, but device file not found
* Update test and embed modules
* Restore older version
* added convolution operation, written test, debugging generated code for compilation
* attempting to include CK in host directory: _Float16 error
* CK header file issues
* slight fix
* don't crash when hip can't report total memory
* dump generated code to a file
* changing sizes
* creating tensor descriptors using CK methods: set up grid desc manually, also trying to set up an argument pointer - this needs to be fixed
* some fixes to call the device code
* separating test files for conv and gemm
* completed arg ptr, now have linking errors
* clang format fix
* resolved linker issues in conv test
* remove dependency on libutility from ck
* resolved num dim error
* properly passing arg ptr, errors with passing typenames: redefinition/redeclaration
* undo the commenting of device function
* hand created kernel code to find rtc issues
* dump the full src to file
* resolved redeclaration errors, cleaned up errors for Amber's kernel code
* debugging purposes: redeclaration error
* config files
* resolved errors for NumTensor and redeclaration, formatted version.h
* resolved most errors in manually added kernel and my own. error with calling kernel object: overloaded function type
* WIP: close to getting kernel compiled
* WIP: fixing rtc errors
* fixed sequence errors, formatting, still one error with run fcn
* yay: kernel compiles and runs
* updated templated/generated version to run and compile
* minor fixes
* working generated example, resolved memory access error due to padding
* adding in reference kernel, validation failing against reference
* debugging: printing kernel argsz
* reduced error in results
* debugged reference kernel and output errors, added to generated version, currently debugging prologue function issues
* working validation (using reference convolution) with prologue function for both hard-coded and generated version
* WIP: create an alt version that creates Argument on the device
* wip: added new duplicate files, fixed fusion templating errors from working example, setting up kernel arguments
* wip: making necessary methods device code
* added grid descs, working on grid pointers, errors with stl numerics
* wip: updating kernel args - issue, replacing some std functions
* replaced std::accumulate call with temp hardcoded version
* wip: args causing memory issue
* Construct Argument object inside the kernel and use it to call convolution device function. Code runs and verification passes
* adding object file dump
* temporary hardcoding of grid size, can remove device op inst + arg ptr
* minor fix for grid size
* added modified example where arg ptr is created on the device for generated version as well
* removed device op instance and arg ptr from modified examples
* moving device op file for testing purposes and to properly build CK
* commenting out print-outs
* adjust compiler args to produce a valid ELF file
* temporary removal of validation
* reverting compiler args back for working example
* retrieve necessary arguments from generated template parameters in correct format
* calculating grid size on host-side, still need to clean up process, pass parameters to host functions properly
* scaled up factory functions/wrapper structs to implement host-side launch parameter calculations using CK host side functions - in hard-coded example
* temporary change to generate ELF format binary object file
* removed unecessary code, added comments
* formatting fix
* cleaned up code, added new tests, restructured library: move helper into CK
* refactored launch parameter calculation to be more concise
* renamed files and variables for more clarity/uniformity
* more code cleaning, removed debug statements
* moved majority of my files into codegen directory, running properly
* updated Embed.cmake(string_view) in codegen directory
* updated host directory to match Embed.cmake as well
* added old tests in
* updated instance generation methods to be more concise
* removed layout from launch parameter calculation
* working test
* fixed issue with verification, all instances working
* updated verification in other tests
* removed duplicate matrix padder file, removed code dumps
* removed old hard-coded tests
* removed old host directory, all files in codegen directory now
* fixed copyright in files
* commenting out validation
* renamed files
* made changes for review: fixed copyright, renamed files for clarity, removed comments, refactored code
* updated headers
* removing duplicate file for fwd conv to gemm, merging with original file
* fix building codegen with clang++ directly
* resolving build error from conv_fwd_to_gemm
* fix for previous error
* renaming tests
* created common test file
* cleaned up code, added comments
* renamed device op
* fixed typos in comments
* removed extra space
* code cleanup: resolving Amber's comments
* removed wrapper struct for matrix padder, fixed template
* cleaned up if statements for better readability
---------
Co-authored-by: Paul <pfultz2@yahoo.com>
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: M. Amber Hassaan <amber_474@yahoo.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* add f8 gemm with multiD for both row/col wise
* change compute_type to fp8
* changed tuning parameters in the example
* add rcr example
* post-merge fix
* fix
* reduce init range
* Add bf16 instances
* Add bf16 gemm universal example
* tempsave
* Add guard to navi compilation
* workground on a specific mixed gemm instance ( bring back it when compiler fix upload)
* fix formatting condition statement issue
* solve conflict
---------
Co-authored-by: Jun Liu <Liu.Jun@amd.com>
* Overload output stream operator for LoopScheduler and PiplineVersion
* Add Run overload accepting grid descriptors MK.
* Add __device__ keyword for CalculateGridSize
* Create device op GroupedGemmMultipleD
* Add GroupedGemm MultipleD Tile Loop implementation.
* Add an example for GroupedGemm MultipleD tile loop.
* Device Op GroupedGEMMTileLoop.
* Bunch of small changes in exmaple.
* CkProfiler
* Remove unused tparam.
* Fix include statement.
* Fix output stream overloads.
* Do not make descriptors and check validity untill we find group.
* Fix gemm desc initialization.
* Revert device op
* Fix compilation for DTYPES=FP16
* Validate tensor transfers paramters.
* Validate on host only NK dims if M is not known.
* Fix bug.
* A convenient debug func for selecting threads.
* Fix has main k block loop bug.
* Make sure that b2c has up to date tile offset.
* Output stream operator for Sequence type.
* Cmake file formatting.
* add flush cache to device op
* add flush cache parameter to ckProfiler
* change calculate size a and b method
* chang evaluation time method foro AVERAGE to MEDIAN
* format code
* adjust some code
* fix core dumped
* remove loop call flush icache in kernel
* remove loop(outer) call flush icache
---------
Co-authored-by: letaoqin <letaoqin@amd.com>
* Refactor elementwise kernels
* Instances fixes
* Fix cmake
* Fix max pool bwd test
* Update two stage gemm split k
* Restore elementwise scale for hiptensor backward compatiblity
* Fix Acc data type check in conv fwd multiple abd
* Disable conv fp64 fwd example
* Update grouped conv weight multi d
* Support A/B/C elementwise ops.
* First part of GGEMM multiD splitk two stage.
* WIP - changes for debuggin.
* tmp save
* working version
* added bf16@int8 version
* fixes
* add reviewers sugestions
* pre-commited missing files
* switched to ifs from elseifs
---------
Co-authored-by: Adam Osewski <Adam.Osewski@amd.com>
* Update device op api to support BComputeType
* Add example
* Add instances
* Add profiler mode
* Add client example
* Update copyright year
* Add BComputeType check
* Fix compute types
* Format
* Format
* Format
* Remove const
* Use the right template
* Format
* Format
* add row/col instances
* Add missing file
* fixed
* Format
* Updates
* Format
* fixed rrr layout
* Format
* Update test and embed modules
* Restore older version
* Update year
* Set -fPIC
* Format
* Use double for isnan
* rename host folder to codegen + minor fix
* add codegen CI test
* add option to build components without building CK
* fix the groovy syntax
* fix typo
* use the correct function for the codegen stage
---------
Co-authored-by: Jing Zhang <jizha@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>