* Docs: Add precision support reference page
* edit of the precision type content
* added more description on scalars
---------
Co-authored-by: spolifroni-amd <sandra.polifroni@amd.com>
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* creation of install doc and refactor of doc in general
* updates based on review comments
* updated based on review comments
* updated readme and contributors markdown
* added extra note to not use -j on its own
* added note about smoke tests and regression tests
* made changes as per Illia's feedback
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
* Add conversion tests
* Fix ctor
* Fix nan logic
* Fix conversion logic
* Permute packed f4_t values
* Fix conversion to float, repack vector elements
* Fix device tests
* Permute elements in a vector
* Add a repro test
* Add a conversion for a repro test
* Update test vectors
* Update conversion
* Fix the test
* Update test vector generator
* Fix vector sr conversion
* Permute conversion args
* Update conversion
* Test
* Fix packing
* Simplify conversion function
* Pack conversion in a loop
* Pack conversion in a loop
* Pack another conversion in a loop
* Pack one more conversion in a loop
* Pack the last conversion in a loop
* Clean up
* Add printf to fix intrinsic
* Add a sw-based workaround
* only build gemm_fp8_pk_i4 examples for gfx942/950
* fix cmake logic
* moved the architecture check to IsSupported function
* Revert "moved the architecture check to IsSupported function"
This reverts commit 056d2a08b3.
* disable all pk_i4 tests for targets other than gfx942/950
* fix cmake logic
* enable ClangBuildAnalyzer when doing ninja traces
* add branch and date to clang build log name
* fix jenkins syntax
* fix jenkins syntax once more
* fix jenkins syntax once more
* simplify the clang_build log name
* simplify the clang_build log name further
* 50ms -> 28ms
* Fix bug in non fuse_add_store cases
* Fine tuned setting for 2 pass pipeline
* adjust workload
* remove unnecessary change
* add layernorm
* Adding output quant and unquant results at the same time.
* fix test
* fix format
* tune for cases 128x640 and 128x1024
* bug fix
* Fix compile error on Windows (call to 'amd_wave_read_first_lane' is ambiguous)
* Fix compile error (no matching function for call to 'cast_to_f32_from_f8')
* return value with macro and revert the return value
* [CK-TILE] no-macro launch api solution (#1992)
* no-macro solution
* address -Wcomma
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* add one daily ci build on gfx908
* add redis invocation tag for gfx908
* make ci build for gfx908 conditional
* fix groovy logic
* add option to run perf tests for gfx908
* disable a few tests on mi100
* Added two kernels for the M=32 problem
* Comment the first one
* Enable multiply_multiply for Scale_Block_M = 1 for deepseek
* Modify the a_thread offset since the A data load is different from B.
* edit fp8 ab scale for Scale_Block_M=1
* edit GemmSpec to MNKPadding
* enable blockwise pipeline v1 and v2; v1 works for small K.
* add instance for gemm_ab_scale
* fix cmakelist of ckProfiler
* optimize blockscale gemm. todo: reduce vgpr usage
* fix a correctness bug
* sanity checked
* revert ckprofiler cmake changes
* clang format
* revert unnecessary changes.
* remove commented-out code.
* split weight preshuffle library targets
* bring back enable-post-misched=0
* fix build issues for gemm_multiply_multiply_fp8 instances
* fix clang format
* add verbose build flag when building for all targets
* reduce path names for new instances
* fix paths in cmake
* refactor gemm_multiply_multiply library target
* fix a bug in example
* fix example 65 cmake
* reduce the number of threads when building libs for all targets to 50
* use ninja to build for all targets
* reduce the number of threads when building for all targets
* reduce the number of threads to 32 when building libs for all targets
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: chenjun <junchen2@amd.com>
Co-authored-by: illsilin <Illia.Silin@amd.com>
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
* Add Gemm fp8xint4 example and kernel, function pass.
* Init Gemm_fp8xint4 Bpreshuffle
* Added gemm_fp8xint4_Bpreshuffle files, function not checked yet
* General fix.
* fp8xint4 bpreshuffle function pass
* fix.
* init b preshuffle dequant in VGPR.
* fix bug, function pass.
* move b thread dequant copy to blockwise.
* fix bug, function now passes.
* modified the tile size to 256, 128x128x128.
* fixed a bug.
* Initial int4 moe; compiles, function not checked yet.
* fix bug in moe_gemm1.cpp, function now passes.
* test expert = 8 and function pass.
* Added moe_pk_i4_gemm2, function pass.
* Added b preshuffle pipeline v3 support.
* fixed merge issue. fp8xint4 and fp8xint4_bpreshuffle function pass.
* Split the blockwise pipeline for fp8xint4.
* commit missing files
* opt gemm2 to 2x2 wave
* fix swizzle = false
* update int4 moe with latest input changes.
* update tile size.
* enable pipeline v3.
* fix nswizzle = true
* commit a version for compiler debug.
* Updated transfer_v3r1_gather to support pk_i4_t type.
* fix int4 moe2 for type_convert support.
* remove some values between mfma instructions.
* fix int4 moe
* Updated transfer_v3r1_gather to support pk_i4_t type.
* i4 support lds multiple shuffle
* fixed int4 moe tflops calculation.
* Modified CshuffleCShuffleMXdlPerWavePerShuffle to 1 to suit C multiple shuffle
* updated gemm2.
* change int4 moe example names
* fix and format code.
* format.
* format code.
* update fp8xint4 example tile size.
* add <unordered_map> header
* fixed.
* format.
* Added conditional compilation for int4 -> fp8 conversion kernels
---------
Co-authored-by: mtgu0705 <mtgu@amd.com>
Co-authored-by: coderfeli <coderfeli@163.com>