composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-06-11 00:39:02 +00:00

Author	SHA1	Message	Date
Illia Silin	c24e528481	[rocm-libraries] ROCm/rocm-libraries#7760 (commit a61bc76) [CK] suppress compiler warnings while building pytorch. (#7760) ## Motivation Recently added compiler flags that are required to suppress false warnings by latest staging compiler are not recognized by older compiler versions and are triggering an avalanche of warnings. Previous attempt to suppress them by using -Wno-unknown-warning-option flag didn't help, because that flag wasn't recognized either and just added more warnings. I've verified that current approach by checking the clang version actually works as intended and makes the warnings go away. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-27 06:56:58 -07:00
Illia Silin	ac18460782	[rocm-libraries] ROCm/rocm-libraries#7384 (commit 10e9d70) [CK] Suppress new staging compiler errors (#7384) ## Motivation This should make new builds with staging compiler pass. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-14 12:51:08 -07:00
Illia Silin	22b9feb40f	[rocm-libraries] ROCm/rocm-libraries#7111 (commit 651947f) [CK] Fix latest batch of staging compiler warnings (#7111) ## Motivation Suppress the new batch of clang lifetimebound and invalidation warnings with the latest staging compiler. ## Technical Details <!-- Explain the changes along with any relevant GitHub links. --> ## Test Plan <!-- Explain any relevant testing done to verify this PR. --> ## Test Result <!-- Briefly summarize test outcomes. --> ## Submission Checklist - [ ] Look over the contributing guidelines at https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.	2026-05-08 07:14:14 -07:00
Aviral Goel	4dcc3e59c1	chore: update copyright header for misc files (#3402 ) * chore: update copyright header for misc files * fix: typo in kernel resulting in ci failure	2025-12-11 08:25:29 -08:00
JH-Leon-KIM-AMD	4baa4c9fae	[CK, CK_TILE] Add GPU Reference Implementations for Grouped Convolution (#3216 ) * LWPCK-4043: Add GPU reference implementations for CK Tile convolution This commit implements GPU-based reference kernels for CK Tile convolution operations to enable faster verification of optimized kernels, especially for large tensors (>2GB). Changes: - Add naive_grouped_conv_fwd.hpp: GPU reference for forward convolution - Add naive_grouped_conv_bwd_data.hpp: GPU reference for backward data - Add naive_grouped_conv_bwd_weight.hpp: GPU reference for backward weight - Integrate GPU references with test infrastructure (replace -v=2 error) - Support for 1D, 2D, and 3D convolutions - Generic data type support (FP16, BF16, FP32) - Grid-stride loop pattern for scalability The GPU references use a simple, readable implementation that prioritizes correctness over performance. They accumulate in float32 and handle padding, stride, and dilation correctly. * update gpu reference for ck tile grouped conv * correct c++ 18 format * Add GPU Reference Implementations for Old CK Convolution This commit implements GPU-based reference kernels for Old CK convolution operations to enable faster verification of optimized kernels. Changes: - Fixed old CK forward GPU reference (naive_conv_fwd.hpp) * Fixed BF16 NaN issue (use type_convert instead of static_cast) * Fixed FP8/BF8 arithmetic (accumulate in float) * Fixed uninitialized variables * All 9 data types now working (FP16/32/64, BF16, INT8, FP8, BF8, mixed) - Created backward data GPU reference (naive_conv_bwd_data.hpp) * Implements input gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Created backward weight GPU reference (naive_conv_bwd_weight.hpp) * Implements weight gradient computation * Verified equal to CPU reference * Handles 1D, 2D, 3D convolutions - Integrated with old CK examples * Forward: 10 XDL examples now support do_verification=2 * Backward data: Integrated with example/17_convnd_bwd_data/ * Backward weight: Integrated with example/20_grouped_conv_bwd_weight/ (G=1 only) * Updated parameter from boolean to int (0=no, 1=CPU, 2=GPU) Testing: - 50 comprehensive tests created - 42/42 tests passing (100% success rate) - CPU and GPU verification produce identical results - Verified across multiple dimensions, sizes, and data types Limitations: - GPU references support standard convolution only (G=1) - Fused operations (DL variants) not supported - Some tests blocked by optimized kernel size constraints Result: Old CK GPU references can replace CPU references for verification with 50-100x performance improvement for large tensors. * Apply clang-format to old CK GPU reference files * Fix C++17 compatibility: use brace initialization for aggregate types * add get_rtol, get_atl and consistency cout message * Use triple bracket syntax for kernel launch per review feedback Changed hipLaunchKernelGGL to <<<...>>> syntax as suggested by @aosewski. This is more idiomatic HIP/CUDA style and equally correct. All tests still passing after this change. * Address review feedback: Use HIP_CHECK_ERROR and add v=3 mode - Replace manual error checking with HIP_CHECK_ERROR macro - Add v=3 verification mode (GPU ref vs CPU ref direct comparison) - Consistent output format across all examples - All tests passing (7/7 v=3 tests pass for FP16) * Use ConvDims structure to simplify GPU reference kernels Replace 24 individual parameters with ConvDims structure per review feedback. - Add conv_common.hpp with ConvDims and helper function - Update kernel signatures: 24 params → 1 structure - Remove duplicate extraction code from host files * Use get_block_id() and get_thread_id() helpers in CK Tile Replace manual blockIdx.x/threadIdx.x arithmetic with helper functions. Updated 3 CK Tile GPU reference kernels per review feedback. * Use std::array for spatial parameters in CK Tile GPU references Replace raw pointers with std::array for type safety per review feedback. - Add conv_common.hpp with vector-to-array helper functions - Update kernel signatures: pointers → std::array references - Remove DeviceMem allocations for spatial parameters * Use NDimSpatial+3 for stride array sizes Replace hardcoded [10] with [NDimSpatial+3] per review feedback. Array sizes now correctly reflect actual dimensions needed. * Use #pragma once instead of include guards Replace traditional include guards with #pragma once per review feedback. Updated 3 Old CK GPU reference headers. * Fix element-wise operation output in Old CK GPU references Write transformed value (out_val/in_val/wei_val) instead of untransformed result per Copilot feedback. This ensures element-wise operations are correctly applied to output. * Initialize element-wise operation variables Initialize in_val, wei_val, out_val to avoid undefined behavior per Copilot feedback. Updated backward data and backward weight kernels. * Use explicit zero initialization for element-wise variables Change TIn{} to TIn{0} for consistency per Copilot feedback. All 3 kernels now use consistent zero initialization. * Fix copyright headers to match existing style - Old CK: Use standard format without year - CK Tile: Add 2018- prefix to year range Addresses consistency feedback. * Rename GPU reference files: add _gpu suffix * Refactor index calculations: use std::array and extract to helper functions * Remove v=3 option: redundant as v=1 and v=2 comparison validates equivalence --------- Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>	2025-12-03 21:14:21 +02:00
Aviral Goel	de6466481f	chore(copyright): update copyright header for include directory (#3293 )	2025-11-26 11:00:05 -07:00
linqunAMD	9fcc1ee9fd	Support Wave32 in CK_TILE - Part 1 (#2594 ) * Support wave32/wave64 in CK_TILE - Part 1 * remove blocksize in kernel launch * fix build error * fix clang format * fix clang format 2 * fix clang format 3 * fix fmha build error * fix fmha build 2 * fix fmha build 3 * fix build error 4 * address review comment * update change log * replace KernelBlockSize with kBlockSize * fix CI fail * fix clang format * address review comment and rebase code. * fix universal test fail --------- Co-authored-by: Lin, Qun <Quentin.Lin+amdeng@amd.com> Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>	2025-08-18 10:08:31 -07:00
Illia Silin	504b101da3	upgrade from clang-format-12 to clang-format-18 (#2568 ) * upgrade to clang-format-18 * update to clang-format-18 in pre-commit-config	2025-07-28 11:34:07 -07:00
carlushuang	6df5fe2ad8	[CK_TILE]naive attn support FP8 KVCache quant (#1747 ) * quant * fix bug * simple smoothquant after softmax * update kv-quant * update stride * fix fp8-pertoken-kvcache * update int8/fp8 quant support --------- Co-authored-by: so <a.com> Co-authored-by: Po Yen Chen <PoYen.Chen@amd.com>	2025-01-03 18:43:07 +08:00
carlushuang	77a38e0211	[CK_TILE] naive attn (#1708 ) * add reference attention fwd * refactor addresser * update * paged, and i8 reflect-quant * lets call it forward-quant * fix error in decode variation * update naive-attn * fix page table * fix build err	2024-12-12 11:54:03 +08:00

10 Commits