## Motivation
Fixing some of the CK CI issues
## Technical Details
- Fix paths to Dockerfiles and scripts.
- Move the codegen tests to a separate stage. They collide with the main build
because both call cmake from the same folder with different options.
- Fix a couple of clang compilation issues with the staging compiler.
## Motivation
- git diff needs access to reference repo
## Technical Details
- mount reference repo path into docker for selective_test_filter.py to
access
## Test Plan
- tested in MICI
## Test Result
- launch_tests.sh ran successfully
## Motivation
Pipelines were failing on Math CI status check.
## Technical Details
For the success case, we just changed the config in Jenkins to use a
proper app token and no code changes were required. However, the failure
case would not have worked as coded, so we needed to move that outside
of the `rocmnode()` block.
## Test Plan
I removed all of the CI stages in one of the commits to test quickly, and then
added them back. Both the "success" message and the "failure" message were
produced.
## Motivation
- Corrects path to script due to superrepo migration
- Forces all tests to run by default
## Technical Details
- now in /projects/composablekernel
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
## Motivation
- Maintain a reference repo on slave nodes that speeds up any
clone/checkout operations
## Technical Details
- clone a ref repo if it does not exist
- update ref repo if it does exist
- checkout after ref repo is updated
- eliminates double clone
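
The steps above can be sketched as a small command planner. This is only an illustration under stated assumptions: the function name, paths, and URL are hypothetical, and the actual pipeline may drive git through Jenkins rather than through these exact commands. It relies only on standard git options (`clone --mirror`, `fetch --all --prune`, `clone --reference`).

```python
def checkout_commands(url, ref_repo, workdir, ref_exists):
    """Return the git commands for a checkout backed by a local reference repo.

    All arguments are hypothetical inputs for this sketch; nothing is executed.
    """
    cmds = []
    if ref_exists:
        # Reference repo is already on the node: refresh it in place.
        cmds.append(["git", "-C", ref_repo, "fetch", "--all", "--prune"])
    else:
        # First run on this node: create the reference repo once.
        cmds.append(["git", "clone", "--mirror", url, ref_repo])
    # Check out only after the reference repo is up to date, borrowing its
    # objects so the working clone does not download everything again.
    cmds.append(["git", "clone", "--reference", ref_repo, url, workdir])
    return cmds
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` in order.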
## Test Result
- Initial checkouts succeeded
Implement per-page K/V quantization for paged attention:
- Add KV_BLOCKSCALE enum to BlockAttentionQuantScaleEnum
- Use exp2 shift trick to eliminate explicit P scaling overhead
- Prefetch physical page offsets for the KV cache so the loads overlap with
computation
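
One possible reading of the exp2 shift trick, shown as a scalar sketch rather than the kernel code: when softmax probabilities are computed with a base-2 exponential, a positive scale on P can be folded into the exponent as an additive `log2(scale)` shift, so no separate multiply is needed. The function names and the scale value used below are assumptions for illustration only.

```python
import math


def p_explicit(score, row_max, p_scale):
    """Compute 2**(score - row_max), then scale P with an explicit multiply."""
    return (2.0 ** (score - row_max)) * p_scale


def p_shifted(score, row_max, p_scale):
    """Fold the scale into the exponent: 2**(x + log2(c)) == c * 2**x."""
    shift = math.log2(p_scale)  # can be computed once and reused for many scores
    return 2.0 ** (score - row_max + shift)
```

Both functions agree up to floating-point rounding for any positive `p_scale`.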
---
🔁 Imported from
[ROCm/composable_kernel#3696](https://github.com/ROCm/composable_kernel/pull/3696)
🧑‍💻 Originally authored by @Jeff-Huang
---------
Co-authored-by: Jeff Huang <chiachi.huang@amd.com>
Co-authored-by: Illia Silin <Illia.Silin@amd.com>
## Motivation
Enable the CK CI after migration from standalone repo.
## Technical Details
Modify the jenkinsfile in projects/composablekernel to update the CI
workflow.
## Test Plan
This is for CK internal testing only.
## Test Result
Set up new CK CI pipeline/dashboard.
## Submission Checklist
- [x] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.
---------
Co-authored-by: Andrew Clark <andrew.clark@amd.com>
* Added two new failure patterns to detect. Including test function to verify if the patterns are detected
* Modifying pattern match to detect docker login failure. Removed passing tests.
* Removing passing tests. Modifying docker pattern to detect failure
* Removed passing tests
* Removing test logging function
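
A minimal sketch of this kind of log scanning, assuming regex-based matching. The pattern names and regular expressions below are hypothetical and are not the patterns added in this commit.

```python
import re

# Hypothetical failure patterns for illustration only.
FAILURE_PATTERNS = {
    "docker_login": re.compile(r"docker login.*(failed|denied|unauthorized)", re.IGNORECASE),
    "no_space": re.compile(r"no space left on device", re.IGNORECASE),
}


def detect_failures(log_text):
    """Return the names of all patterns found in the log, in first-seen order."""
    found = []
    for line in log_text.splitlines():
        for name, pattern in FAILURE_PATTERNS.items():
            if name not in found and pattern.search(line):
                found.append(name)
    return found
```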
[ROCm/composable_kernel commit: 421b714f13]
* Stream-K smoke test config file generation
This change converts the Stream-K smoke tests to use tile engine. Since
the m, n, and k values depend on the CU count of a device, the
configs are generated during the Configuration Phase.
* Compute GEMM reference on GPU
* Remove redundant Stream-K tests
Removing redundant tests that are now run via tile engine.
* Fix relative and absolute tolerance calculation
This change updates the Stream-K tile engine interface to ensure that
num_wgs_per_tile is propagated and passed into the compare_results
function to calculate the rel and abs tolerance. Before, split-k was
used, which is incorrect for Stream-K since the split-k value is
always 1.
* Cleanup imports, types, and other misc items
This commit makes the following changes:
- Uses Typing module for nested type hints
- Uses quotes around cu_count_arg argument in generate_configs.cmake in
if statements
- Adds explicit include for tuple in test_gemm_streamk_simple.cpp
- Adds a type for the tiles argument in argparser to check argument
validity
* Use CU count as return value for better parsing
* Add reduction tests for bf16, fp8, and bf8
[ROCm/composable_kernel commit: 8cbd09c84a]
Replace specific benchmark numbers with qualitative descriptions since
measurements vary across environments and may become outdated.
Co-authored-by: Claude <noreply@anthropic.com>
[ROCm/composable_kernel commit: 3f04d27b68]
* [Compiler] Addressing new compiler warnings
Clang enables new lifetime warnings in production and we see build
errors due to this with the staging compiler.
The attributes added in this PR are suggested by the compiler. However,
I'm not very familiar with the code base, so the changes may be
incorrect.
* Update some more instances
* Adds file-level ignores via clang diagnostic pragma
The number of instances was large, so I decided to use file-level scope
to disable the warning via pragma clang diagnostic ignored.
It also showed this warning coming from the gtest dependency. For that,
I did add the respective command line flag to the CMake variables. I
don't know if this is acceptable or not.
* This adds the remaining instances
For a build on gfx90a.
* fix clang format
* Adding a couple more instances from gfx1200 build
* Fixed another few instances
---------
Co-authored-by: Illia Silin <98187287+illsilin@users.noreply.github.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: 069500464d]
* Change call to the intrinsics
* fix clang format
* Undo changes under include/ck/utility
* Use named variable as vector size
---------
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
[ROCm/composable_kernel commit: 8c1788757a]
* Adding remaining flavors for grouped conv fwd
As titled. Following variants are added:
- grouped_conv2d_fwd_dynamic_op
- grouped_conv3d_fwd_dynamic_op
- grouped_conv3d_fwd_bilinear
- grouped_conv3d_fwd_convscale
- grouped_conv3d_fwd_convinvscale
- grouped_conv3d_fwd_convscale_add
- grouped_conv3d_fwd_convscale_relu
- grouped_conv3d_fwd_scale
- grouped_conv3d_fwd_combconvscale
- grouped_conv3d_fwd_scaleadd_scaleadd_relu
* Fix incomplete parsing of types from source names in add_instance_library() cmakelists function so we don't build f8 on RDNA3.
* Do not build f8 / bf8 only flavor tests on RDNA3
* Make sure we have proper generic instances for all instance lists related to the post-ces extra flavors, with scalarPerVector = 1. Then disable all but one generic instance per instance list to reduce compile time.
* Post rebase fix: Template parameters for Grouped Conv Fwd Device Impl got tweaked upstream.
* adding int8 and fp16 overloads to the elementwise operations
* fixed copilot nits
* Addressing review comments:
- removed unnecessary examples for dynamic op
- removed unnecessary conv specializations for all the flavors
- removed spurious bilinear and scale source files
* clang-format
* reduced number of tests
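
To illustrate the idea of deriving data types from instance source names, here is a sketch in Python rather than CMake. It is not the logic of add_instance_library(): the type list, helper names, and file names are assumptions. Matching whole underscore-separated tokens means a name containing `bf8` is not also reported as `f8`.

```python
# Assumed set of type tags that may appear in instance file names.
KNOWN_TYPES = ("bf16", "bf8", "f64", "f32", "f16", "f8", "int8")


def types_in_source_name(filename):
    """Collect every known type tag that appears as a whole token."""
    stem = filename.rsplit(".", 1)[0]
    return {token for token in stem.split("_") if token in KNOWN_TYPES}


def should_build(filename, unsupported_types):
    """Skip a source file if any of its types is unsupported on the target."""
    return not (types_in_source_name(filename) & set(unsupported_types))
```

For example, with `unsupported_types={"f8", "bf8"}` a hypothetical `..._bf8_f8_instance.cpp` file is skipped while an `..._f16_instance.cpp` file is kept.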
---------
Co-authored-by: Wojciech Laskowski <wojciech.laskowski@streamhpc.com>
[ROCm/composable_kernel commit: 2377a62837]
* chore: split block scale example instances in more separate files to speed up compile times
* wip: fp4 scaffolding for abquant
* feat: add fp4 decoding-while-loading to abquant pipeline
* feat: add support for fp4 CPU verification in abquant
* chore: add time tracking to reference calculation
* feat: add a4w4 test for blockscale gemm
* feat: optimize reference calculation by preconverting values to AccType
* feat: add fp4 to fp8 look-up table
* fix: reference to wrong ComputeDataType field in QuantProblem
* feat: type utilities for determining MFMA compute types
* feat: packed fp4 for abquant weight preshuffle
* feat: add separate tests for a4w4 base case, padding and preshuffleB
* fix: fp4 conversion on gfx950 attempting to use non-supported method
* fix: test case was using quant group sizes which don't work on gfx950 due to larger mfma tile size
* chore: add fp4 preshuffleb mode to block scale example
* chore: sanity check for packed types being 1 byte
* chore: clarify tensor dimension indices with constants
* chore: replace traits check with specialized check for packed types
* style: some minor refactoring and cleanup
* fix: correct conversion table for FNUZ fp8
* chore: add fp4 instances to main abquant instances again
* chore: use same initialization branch for int4 and fp4
* chore: add missing initialization for fp4 in block scale gemm example
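
For readers unfamiliar with the fp4 to fp8 look-up table idea, the sketch below builds one from first principles. It assumes fp4 is E2M1 and fp8 is E4M3, with exponent bias 7 for the OCP variant and bias 8 for the FNUZ variant, where the bit pattern 0x80 is NaN, so negative zero must map to 0x00. The names are hypothetical and this is not the table used in the pipeline.

```python
import math

# Magnitudes of the eight E2M1 codes (sign bit cleared).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_e2m1(code):
    """Decode a 4-bit E2M1 code (bit 3 = sign, bits 0-2 = magnitude index)."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def encode_e4m3(value, bias, signed_zero):
    """Encode a value that is exactly representable as a normal E4M3 number."""
    sign = 0x80 if math.copysign(1.0, value) < 0 else 0x00
    if value == 0.0:
        return sign if signed_zero else 0x00
    frac, exp = math.frexp(abs(value))       # abs(value) == frac * 2**exp, 0.5 <= frac < 1
    exp_field = (exp - 1) + bias             # biased exponent of the 1.m form
    mantissa = int((2.0 * frac - 1.0) * 8)   # three explicit mantissa bits
    return sign | (exp_field << 3) | mantissa


def build_lut(bias, signed_zero):
    """Map each of the 16 fp4 codes to its fp8 bit pattern."""
    return [encode_e4m3(decode_e2m1(code), bias, signed_zero) for code in range(16)]


LUT_OCP = build_lut(bias=7, signed_zero=True)
LUT_FNUZ = build_lut(bias=8, signed_zero=False)
```

The two tables differ in every nonzero entry because of the bias, which is why a table built for one fp8 variant cannot be reused for the other.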
---------
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
[ROCm/composable_kernel commit: 6a6177a246]
* Decouple configure/build/test tools from Docker
Create a two-layer tool architecture:
- Core tools (ck-configure, ck-build, ck-test): Environment-agnostic,
work on any system with ROCm - no Docker dependency
- Container tools (ck-docker): Manage Docker containers and delegate
to core tools via docker exec
Changes:
- Add ck-configure: New CMake configuration tool with preset support,
native GPU detection, and flexible options
- Refactor ck-build: Remove Docker dependency, add --configure and
--list options, call ninja directly
- Refactor ck-test: Remove Docker dependency, add CTest integration
with --smoke/--regression/--all options
- Enhance common.sh: Add native GPU detection, build directory utils,
and output helpers
- Update ck-docker: Add configure/build/test/exec commands that
delegate to core tools inside container
This enables:
- Native development on ROCm hosts without Docker
- Simpler CI/CD integration
- Consistent behavior inside and outside containers
Co-Authored-By: Claude <noreply@anthropic.com>
* Add ck-rocprof: GPU profiling tool for rocprof-compute
Adds a command-line profiling tool to simplify GPU performance
analysis workflow using AMD rocprof-compute.
Features:
- Easy setup with automatic Python venv configuration
- Simple CLI: setup, run, analyze, compare, list
- Automatic GPU architecture detection
- Focus on LDS metrics (Block 12) for bank conflict analysis
- Comprehensive documentation with examples and troubleshooting
Usage:
    ck-rocprof setup                      # One-time environment setup
    ck-rocprof run <name> <executable>    # Profile executable
    ck-rocprof analyze <name> [block]     # Analyze metrics
    ck-rocprof compare <name1> <name2>    # Compare two runs
    ck-rocprof list                       # List available runs
* Make ck-rocprof documentation concise and improve Docker integration
- Streamlined documentation from 416 to 157 lines (62% reduction)
- Focused on essential commands, metrics, and workflows
- Enhanced script to run all operations inside Docker containers
- Fixed workload directory path and improved container management
- Added automatic rocprofiler-compute installation and dependency handling
* Add --no-roof flag to ck-rocprof profile command
Skip roofline analysis by default to speed up profiling. Roofline
analysis can add significant time to profiling runs but is not
needed for most LDS bank conflict analysis workflows.
* Make ck-rocprof work independently of Docker
Add native execution mode that runs rocprof-compute directly on the host
system when available, falling back to Docker mode when not.
Key changes:
- Auto-detect native mode when rocprof-compute is in PATH or common locations
- Add execution mode wrappers (exec_cmd, file_exists, dir_exists, etc.)
- Native mode stores venv at .ck-rocprof-venv in project root
- Native mode stores workloads at build/workloads/
- Support user-installed rocprofiler-compute (e.g., ~/.local/rocprofiler-compute)
- Add CK_FORCE_DOCKER env var to force Docker mode
- Update help message to show current execution mode
- Maintain full backward compatibility with existing Docker workflow
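
The mode selection described above could look roughly like the following. The tool itself is a shell script, so this Python version is only a sketch: the candidate paths and function names are assumptions, and the `which` and `is_executable` parameters exist so the logic can be exercised without a ROCm install.

```python
import os
import shutil

# Assumed candidate locations; the script's real list may differ.
ROCPROF_CANDIDATES = (
    "/opt/rocm/bin/rocprof-compute",
    "~/.local/rocprofiler-compute/bin/rocprof-compute",
)


def find_rocprof_compute(which=shutil.which, is_executable=None,
                         candidates=ROCPROF_CANDIDATES):
    """Look in PATH first, then in a list of common install locations."""
    if is_executable is None:
        is_executable = lambda p: os.path.isfile(p) and os.access(p, os.X_OK)
    found = which("rocprof-compute")
    if found:
        return found
    for candidate in candidates:
        path = os.path.expanduser(candidate)
        if is_executable(path):
            return path
    return None


def execution_mode(env=os.environ, **kwargs):
    """Return "docker" if forced or if no native binary is found, else "native"."""
    if env.get("CK_FORCE_DOCKER"):
        return "docker"
    return "native" if find_rocprof_compute(**kwargs) else "docker"
```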
Tested successfully with rocprofiler-compute 3.4.0 installed from source
on MI300X GPU in native mode.
Co-Authored-By: Claude <noreply@anthropic.com>
* Add clean/status commands and improve ck-rocprof robustness
- Add 'clean' command to remove profiling runs (supports --all)
- Add 'status' command to show configuration and environment info
- Add workload name validation to prevent path traversal attacks
- Fix uv installation to use pip instead of curl for reliability
- Add cross-platform stat support for macOS compatibility
- Consolidate ROCPROF_CANDIDATES to avoid code duplication
- Expand help documentation with all profiling block descriptions
- Fix Docker wrapper script escaping issues
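
A minimal sketch of workload name validation against path traversal, assuming an allow-list approach. The accepted character set and the function name are hypothetical; the script's own rules may be stricter or looser.

```python
import re

# Assumed rule: start with a letter or digit, then letters, digits, '.', '_' or '-'.
_VALID_NAME = re.compile(r"[A-Za-z0-9][A-Za-z0-9._-]*")


def is_valid_workload_name(name):
    """Reject empty names, path separators, leading dots, and '..' sequences."""
    if not _VALID_NAME.fullmatch(name):
        return False
    return ".." not in name
```

Because `/` and `\` are outside the allowed set, a name such as `../etc` can never escape the workloads directory.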
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix analyze command to use correct workload path
rocprof-compute stores results directly in the workload directory
(pmc_perf.csv) rather than in a GPU architecture subdirectory.
Updated find_workload_path to detect this correctly.
Co-Authored-By: Claude <noreply@anthropic.com>
* Address PR review security and robustness issues
Security fixes:
- Escape executable path in cmd_run to prevent shell injection
- Add workload name validation to cmd_analyze and cmd_compare
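
The escaping fix can be illustrated with Python's standard `shlex.quote`. The tool itself is a shell script, so this only demonstrates the principle, and the helper name is hypothetical.

```python
import shlex


def to_shell_command(argv):
    """Join arguments into one shell-safe string by quoting each element."""
    return " ".join(shlex.quote(arg) for arg in argv)
```

A path containing spaces or `;` is wrapped in single quotes, so the shell treats it as one argument instead of running a second command.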
Robustness improvements:
- Add error checking for uv package manager installation
- Use consistent project root detection (find_project_root || get_project_root)
- Use /opt/rocm instead of hardcoded /opt/rocm-7.0.1 in Docker mode
- Derive ROCM_REQUIREMENTS path from ROCPROF_BIN for flexibility
- Use gfx950 as fallback GPU consistent with common.sh
Documentation updates:
- Fix env var name GPU_TARGET -> CK_GPU_TARGET
- Update storage layout to reflect current structure (workloads/<name>/)
- Document clean and status commands
- Clarify native vs Docker default paths
Co-Authored-By: Claude <noreply@anthropic.com>
* Simplify ck-rocprof to native-only mode
Remove Docker mode from ck-rocprof. Docker users should run the tool
via `ck-docker exec ck-rocprof ...` instead.
This simplification:
- Removes ~210 lines of Docker-specific code
- Eliminates mode detection complexity
- Makes the script easier to maintain
- Provides clearer error messages when rocprof-compute is not found
The setup command now lists all searched locations when rocprof-compute
is not found, helping users understand how to install it.
Co-Authored-By: Claude <noreply@anthropic.com>
* Add rocprofiler-compute source installation fallback
When rocprof-compute is not found in system locations, automatically
install rocprofiler-compute 3.4.0 from source as a fallback. This
eliminates the hard dependency on system ROCm packages.
Implementation details:
- Clone rocprofiler-compute from GitHub to ~/.local/
- Install dependencies via requirements.txt (not editable install)
- Create wrapper that sets PYTHONPATH to source directory
- Execute source script directly rather than importing as module
This approach matches the project's development workflow and works
around the incomplete pyproject.toml that prevents editable installs.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
[ROCm/composable_kernel commit: 83b6155354]
* create a filter to build only libs required by hiptensor
* allow building libs for miopen and hiptensor at the same time
* tweak the lib filtering logic one more time
[ROCm/composable_kernel commit: 05ef93a69d]