* Decouple configure/build/test tools from Docker
Create a two-layer tool architecture:
- Core tools (ck-configure, ck-build, ck-test): Environment-agnostic,
work on any system with ROCm - no Docker dependency
- Container tools (ck-docker): Manage Docker containers and delegate
to core tools via docker exec
Changes:
- Add ck-configure: New CMake configuration tool with preset support,
native GPU detection, and flexible options
- Refactor ck-build: Remove Docker dependency, add --configure and
--list options, call ninja directly
- Refactor ck-test: Remove Docker dependency, add CTest integration
with --smoke/--regression/--all options
- Enhance common.sh: Add native GPU detection, build directory utils,
and output helpers
- Update ck-docker: Add configure/build/test/exec commands that
delegate to core tools inside container
This enables:
- Native development on ROCm hosts without Docker
- Simpler CI/CD integration
- Consistent behavior inside and outside containers
Co-Authored-By: Claude <noreply@anthropic.com>
* Add ck-rocprof: GPU profiling tool for rocprof-compute
Adds a command-line profiling tool to simplify GPU performance
analysis workflow using AMD rocprof-compute.
Features:
- Easy setup with automatic Python venv configuration
- Simple CLI: setup, run, analyze, compare, list
- Automatic GPU architecture detection
- Focus on LDS metrics (Block 12) for bank conflict analysis
- Comprehensive documentation with examples and troubleshooting
Usage:
ck-rocprof setup # One-time environment setup
ck-rocprof run <name> <executable> # Profile executable
ck-rocprof analyze <name> [block] # Analyze metrics
ck-rocprof compare <name1> <name2> # Compare two runs
ck-rocprof list # List available runs
* Make ck-rocprof documentation concise and improve Docker integration
- Streamlined documentation from 416 to 157 lines (62% reduction)
- Focused on essential commands, metrics, and workflows
- Enhanced script to run all operations inside Docker containers
- Fixed workload directory path and improved container management
- Added automatic rocprofiler-compute installation and dependency handling
* Add --no-roof flag to ck-rocprof profile command
Skip roofline analysis by default to speed up profiling. Roofline
analysis can add significant time to profiling runs but is not
needed for most LDS bank conflict analysis workflows.
* Make ck-rocprof work independently of Docker
Add native execution mode that runs rocprof-compute directly on the host
system when available, falling back to Docker mode when not.
Key changes:
- Auto-detect native mode when rocprof-compute is in PATH or common locations
- Add execution mode wrappers (exec_cmd, file_exists, dir_exists, etc.)
- Native mode stores venv at .ck-rocprof-venv in project root
- Native mode stores workloads at build/workloads/
- Support user-installed rocprofiler-compute (e.g., ~/.local/rocprofiler-compute)
- Add CK_FORCE_DOCKER env var to force Docker mode
- Update help message to show current execution mode
- Maintain full backward compatibility with existing Docker workflow
Tested successfully with rocprofiler-compute 3.4.0 installed from source
on MI300X GPU in native mode.
Co-Authored-By: Claude <noreply@anthropic.com>
* Add clean/status commands and improve ck-rocprof robustness
- Add 'clean' command to remove profiling runs (supports --all)
- Add 'status' command to show configuration and environment info
- Add workload name validation to prevent path traversal attacks
- Fix uv installation to use pip instead of curl for reliability
- Add cross-platform stat support for macOS compatibility
- Consolidate ROCPROF_CANDIDATES to avoid code duplication
- Expand help documentation with all profiling block descriptions
- Fix Docker wrapper script escaping issues
Co-Authored-By: Claude <noreply@anthropic.com>
* Fix analyze command to use correct workload path
rocprof-compute stores results directly in the workload directory
(pmc_perf.csv) rather than in a GPU architecture subdirectory.
Updated find_workload_path to detect this correctly.
Co-Authored-By: Claude <noreply@anthropic.com>
* Address PR review security and robustness issues
Security fixes:
- Escape executable path in cmd_run to prevent shell injection
- Add workload name validation to cmd_analyze and cmd_compare
Robustness improvements:
- Add error checking for uv package manager installation
- Use consistent project root detection (find_project_root || get_project_root)
- Use /opt/rocm instead of hardcoded /opt/rocm-7.0.1 in Docker mode
- Derive ROCM_REQUIREMENTS path from ROCPROF_BIN for flexibility
- Use gfx950 as fallback GPU consistent with common.sh
Documentation updates:
- Fix env var name GPU_TARGET -> CK_GPU_TARGET
- Update storage layout to reflect current structure (workloads/<name>/)
- Document clean and status commands
- Clarify native vs Docker default paths
Co-Authored-By: Claude <noreply@anthropic.com>
* Simplify ck-rocprof to native-only mode
Remove Docker mode from ck-rocprof. Docker users should run the tool
via `ck-docker exec ck-rocprof ...` instead.
This simplification:
- Removes ~210 lines of Docker-specific code
- Eliminates mode detection complexity
- Makes the script easier to maintain
- Provides clearer error messages when rocprof-compute is not found
The setup command now lists all searched locations when rocprof-compute
is not found, helping users understand how to install it.
Co-Authored-By: Claude <noreply@anthropic.com>
* Add rocprofiler-compute source installation fallback
When rocprof-compute is not found in system locations, automatically
install rocprofiler-compute 3.4.0 from source as a fallback. This
eliminates the hard dependency on system ROCm packages.
Implementation details:
- Clone rocprofiler-compute from GitHub to ~/.local/
- Install dependencies via requirements.txt (not editable install)
- Create wrapper that sets PYTHONPATH to source directory
- Execute source script directly rather than importing as module
This approach matches the project's development workflow and works
around the incomplete pyproject.toml that prevents editable installs.
Co-Authored-By: Claude <noreply@anthropic.com>
---------
Co-authored-by: Claude <noreply@anthropic.com>
This PR introduces a Python toolkit for analyzing Clang's `-ftime-trace` build performance data. This is the foundation for our systematic effort to reduce CK and CK-Tile build times (#3575).
The toolkit provides fast parsing of trace JSON files into pandas DataFrames using orjson, with specialized functions for analyzing template instantiation costs and compilation phase breakdowns. It includes a core library (`trace_analysis/`), example scripts for quick analysis, a comprehensive README with usage documentation, and an interactive Jupyter notebook demonstration.
Key features include memory-efficient DataFrame schemas with optimized dtypes, recursive hierarchical phase analysis, automatic metadata extraction (source file, compilation timing), and template instantiation filtering. The design supports both standalone scripts and interactive Jupyter notebook workflows.
This single-file analysis capability lays the groundwork for future multi-file analysis across thousands of compilation units, enabling data-driven optimization and build time regression detection.
* CMakeLists.txt hack for Windows.
* Add Windows build instructions.
* Fix type issue with variadic min function.
* Use std::common_type to fix the variadic min/max functions.
* Enable CPU guard compilation on Windows.
* Suppress warnings related to std::getenv on Windows platform.
* Git ignore the output directory on Windows platform.
* Powershell script for running tests and generating reports.
* Improve test logging.
* Disable non-conv tests.
* Fix Debug build on Windows.
* More debug build changes.
* Update Windows build instructions.
* Enable all tests.
* Test fixes.
* Suppress not found linker options warning.
* Update unsigned long literals and format specifiers to work correctly in Windows
* Fix conv 3D bwd weight bilinear tests on Windows.
* Revert changes on .gitignore.
* Clean-up CMake project file for Windows builds.
* clang-format
* Fix definition of CMAKE_PREFIX_PATH on both Linux and Windows platforms.
* Fix building examples on Windows.
* Update Readme.
* Remove the suppression of the deprecated warnings.
* Remove Windows specific min/max implementations from CK Tile math core.
* Remove unnecessary no-op on Windows.
---------
Co-authored-by: User <user@example.com>
Co-authored-by: Ville Pietilä <none>
Co-authored-by: John Afaganis <john.afaganis@amd.com>
Co-authored-by: Ville Pietilä <>
* Removing hard-coded trace filename
* Including stage name in notification
* Simplifying capture setup and tagging file names with arch
* Removed test property from notification message
* Fixing regex to get arch name
* Fixing error in notification and modified regex
* generate and visualize build traces for all archs
* generate build traces in all cases
* fix jenkins logic
* fix typo
* use more threads for parsing dependency map
* add script to parse ninja traces and issue warnings
* fix python script syntax and header
* fix python syntax one more time
* fix python syntax
* Add README.md for testing
* Add tensor_memory_manager.
* ck-builder: tensor memory manager rebase fixes
This fixes some issues caused by the API being changed recently.
Also, this streamlines the ckt namespace to always be ck_tile::builder::test,
as this is already being used by other tests
Really, this commit should be squashed into the previous,
but I'm keeping it separate for brevity.
* ck-builder: test arguments initial prototype
* ck-builder: test system initial prototype
* ck-builder: fix non-standardized copyright comments
* ck-builder: new prototype
* ck-builder: group testing inputs/outputs into a separate structure
This is basically the return of the tensor memory manager after all,
except that the design is more closely tied to the actual operation.
Using a struct allows us to add additional input/output tensors
without breaking code (by defaulting those new parameters). Note
that the tensors are split into a separate inputs/outputs because we
usually want to allocate the output _twice_: once for the real
computation and once for the reference computation.
* ck-builder: simplify prototype naming; start docs
* ck-builder: update testing readme
* ck-builder: testing documentation
* ck-builder: HipStatusMatcher
This matcher can be used to check HIP status codes and provide
nice and readable error messages.
* ck-builder: tensor_buffer.hpp tests
* ck-builder: conv_fwd.hpp tests
* ck-builder: add example end-to-end test in conv fwd 2d fp16
* ck-builder: simplify extent usage
* ck-builder: update testing doc
* ck-builder: skip end to end test on non-gfx9
* fix check_copyright_year interpreter
/bin/bash is not guaranteed to exist on Linux. Signed,
a NixOS user
* ck-builder: fix copyrights
* ck-builder: reduce conv fwd testing size
This test allocated 24GB of memory, too much for 16GB cards.
---------
Co-authored-by: John Shumway <jshumway@amd.com>
This pull requests adds some initial "factory tests" - these check that the instances which are used in MIOpen are actually present in CK. The main reason for this is documentation and sanity checking. Its likely that these tests get outdated fast, so we'll have to maintain them, but fortunately this is quite straight forward and shouldn't take a lot of time once they are in place.
* Run ctest with --output-on-failure
* Fix synchronization issues in bwd pipelines
The bwd kernel reuses the same area of LDS for ds (SGrad), bias and
dbias (BiasGrad). This means that there must be block_sync_lds between
loading one tensor and storing another to the same area.
Heavy instructions like MFMA/WMMA and global loads are executed between
reuses of the same memory so in MOST cases loading is finished by all
warps before storing is started. However, sometimes warps progress at
different speeds.
Running the tests multiple times and, preferably, with multiple
processes on the same GPU helps to trigger this issue:
bin/test_ck_tile_fmha_bwd_bf16 --gtest_repeat=-1 --gtest_shuffle --gtest_throw_on_failure
* initial work on adding support of gfx12 in tile_engine for GEMM benchmarking
* add stage("Run TILE_ENGINE_GEMM Tests on gfx1201") to Jenkins config
* make tile_[m/n/k] validation arch dependent
* docs(README): update readme with new build instructions
* feat(grouped_gemm): add support back for non persistent kernel
* refactor(grouped_gemm): simplify tensor creation
* refactor(grouped_gemm): Persistance is now GemmConfig value for easier management
* chore(grouped_gemm): add print statements to ease debugging
* WIP(grouped_gemm): add grouped_gemm_preshuffle example and update CMake configuration
* fix(tile_gemm_traits): change default value of Preshuffle_ from 0 to false for clarity
* WIP(grouped_gemm): add dummy variables to compile the preshuffle pipelines
* chore(grouped_gemm): add print statements and variables to debug numerical error with preshuffle
* style: clang format work so far
* BUG!(grouped_gemm_kernel.hpp): figured out a potential bug in for numerical errors in preshuffle pipeline
* fix(grouped_gemm_kernel): add function in the kernel code to dynamically calculate tail_number resolving numerical errors
* refactor(gemm_presuffle): make preshuffle pipeline v2 compatible with operator () calls from grouped gemm
* chore(grouped_gemm): add/remove debug comments and debug print statements
* feat(grouped_gemm): integrate preshuffle pipeline v2 into grouped gemm for all supported shapes
* chore(gemm_profile): add new argument combinations
* fix: branch cleanup, formatting, refactoring
* fix: branch cleanup, formatting, refactoring
* chore(changelog): update changelog to reflect new featuer
* address review comments & nit
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding RapidJson Library
* Adding Json Dumps in all CK_Tile Examples
Not verified yet
* Adding json to cktile Batched Transpose
* adding json dumps to layernorm2d_fwd
* Adding json dump to flatmm_basic
* Adding json in 03_gemm
* Add json dump to 16_batched_gemm
* Add json dump to gemm_multi_d_fp16
* Add json dump to grouped_gemm
* fix fmha_bwd/fwd
* Fix clang-format errors
exclude include/rapidjson in jenkins as its a third-party library
* Saparating function and defination.
* Update Documentation of 03_gemm
* Refactoring as per code review
* Disable fp8 instances on unsupported targets (#2592)
* Restrict building of gemm_universal_preshuffle_f8 instances to specific targets in CMakeLists.txt
* Add condition to skip gemm_xdl_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
* Add conditions to skip unsupported targets for gemm_universal_preshuffle_f8 and gemm_xdl_universal_preshuffle_f8 instances in CMakeLists.txt
* Refine conditions to exclude gemm_universal_preshuffle_f8 instances for unsupported targets in CMakeLists.txt
---------
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
* fix clang format
* remove duplicate lines of code from library/src/tensor_operation_instance/gpu/CMakeLists.txt
* Fixing Readme and unifying jsondumps
* adding moe_smoothquant
* adding fused_moe
* Fixing Readme for batched_gemm
* Fixing Readme for grouped_gemm
* adding flatmm
* adding gemm_multi_d_fp16
* adding elementwise
* adding File name when json is dumped
* Fixing Reduce after merge
* adding batched_transpose
* Adding Warptile in Gemm
* Fixing Clang Format
---------
Co-authored-by: Aviral Goel <aviral.goel@amd.com>
Co-authored-by: AviralGoelAMD <aviralgoel@amd.com>
Co-authored-by: illsilin_amdeng <Illia.Silin@amd.com>
* Making edits to identify individual compilation issues.
* Minor fix for blob txt files not being created.
* Fixing compilation issues.
* Fixing ordering bug.
* Adding python profiling functionality.
* Setting individual build as default.
* Setting gpu target filtering for tile engine to gfx90a, gfx942 and gfx950.
* update the default running parameters and settings
* Fixing bug with benchmarking, shifting file generation to build instead of config.
* Updating fixes.
* Fixing json output and parsing.
* Disable ccache for tile engine gemm ops because we dont need it.
* Removing duplicate type definition.
* Improving json printing.
* Add the flexibility of different layout and more warp tile support
* Fix extra flag in name of individual kernels.
* Fixing bug with booleans.
* Solve the first patch of the post merge conflict
* Compilation fixes, and cosmetic improvements.
* Yet again compilation fixes after latest changes from develop.
* Fixing python benchmarking script.
---------
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>
Co-authored-by: Vidyasagar Ananthan <vanantha@amd.com>
* update the performance test logic
* fix unstash perf logs logic
* untangle unstashing fmha logs for different archs
* run process stage after running fmha tests
* fix the processing of perf logs
* fix arguments for run_performance scripts
* script for generating list of not referenced files in tests, list is in json format
* script comment added
* added empty line at the end of the script
* format changes
* something khushbu can help with
* v1 v2 works with flatmm develop
* v0 v1 v2 numerical error gone
* Fixing numerical error, and interchange preshuffle configs to match with flatmm
* Refactor GEMM pipeline configurations and integrate preshuffle support
- Updated preshuffle pipeline definitions to include multiple versions (V1, V2, V3).
- Changed the pipeline constant from CK_TILE_PIPELINE_PRESHUFFLE to CK_TILE_PIPELINE_PRESHUFFLE_V3 in relevant configurations.
- Removed obsolete code and comments
* clang format
* fix vectorloadsize bug
* add the Preshuffle3
* update kwarp calculation in gemm utils
* update vector size A and B correctly in V2 pipeline; Added few more changes to align with dteng's branch
* fix: add CK_GFX950_SUPPORT macro for gfx950 detection
* default disable rotating buffer
* docs(CHANGELOG): update changelog for rocm 7.0
* Revert "docs(CHANGELOG): update changelog for rocm 7.0"
This reverts commit 2bc16fff84.
* Remove unused Preshuffle V3 pipeline and related code; update gemm function to use Preshuffle V2; clean up comments and formatting in various files.
* revert example/ck_tile/flatmm to its original state
* remove comment added by second author
* switch to xor ALDSDescriptor
* modify the MakeALdsDescriptor()
* temporary profiling script
* getting rid of line marker compiler error
* UniversalWeightPreshufflePipelineAgBgCrPolicy now derives from UniversalGemmBasePolicy
* add a minor fix for the config
* typo fix
* Fix formatting in lambda function for WeightPreshufflePipelineAGmemBGmemCRegV2
* revert change in include/ck_tile/ops/flatmm/pipeline/flatmm_pipeline_agmem_bgmem_creg_v1.hpp
* revert change in include/ck_tile/core/arch/amd_buffer_addressing.hpp
* reenable the GemmSpatiallyLocalTilePartitioner
* make GemmConfigPreshuffle_1 for v1 pipeline, GemmConfigPreshuffle_2 for v2 pipeline
* remove hardcoded true for preshuffle bool template argument
* rename script
* remove gemm_profilie.sh script
* merge conflict resolve
* clang formatted
* typo fix
* Remove duplicate include of block_gemm_areg_bsmem_creg_v2r1.hpp in gemm.hpp
* Remove commented-out code in UniversalWeightPreshufflePipelineAgBgCrPolicy
* Fix missing newline at end of file in run_gemm_example.inc
* Remove unused barrier call in BlockWeightPreshuffleASmemBSmemCRegV1
* addressing review comments
* removing debug code
* addressing review comments
* Revert "addressing review comments"
This reverts commit 29c45192ba.
* updating tile_engine code
* addressing review comments
---------
Co-authored-by: amd-khushbu <khuagarw@amd.com>
Co-authored-by: ThomasNing <thomas.ning@amd.com>
All our platforms support C++20 now, so update to C++20 standard
for language features such as concepts, designated initializers,
range-based for initializers, and consteval. This PR only switches
the compiler flags to C++20, no other changes.
* Updating runtime log message for CK TILE ENGINE
* CKTile layout from config
* CKTile custom config for CI
* Documentation for Layout Changes
* CKTile Layout changes to Jenkins
* Fixing Clang Format
* Changes to Jenkins file to fix error
* fix(cmake-ck-dev): no longer sets invalid values as gpu arch
* style(py files): ruff formatting
* fix(cmake-ck-release): no longer sets invalid values as gpu arch
* chore(cmake-tile_engine): add reminder to uncomment user config json
* Changes to jenkin file to address more cases
* Changes to Jenkins to fix Error
* Changes to Jenkins file for fixing an error
* Update Jenkinsfile (#2517)
* Update Jenkinsfile
---------
Co-authored-by: ThruptiRajLakshmanaGowda <tlakshma@amd.com>
Co-authored-by: AviralGoelAMD <aviral.goel@amd.com>
Co-authored-by: Thomas Ning <Thomas.Ning@amd.com>
* Adding ninja log json convertion utility
* Updating to match old ninjatracing
* Updating Jenkins to use new ninjatracing
* Ensuring v7 works
* Removing old ninjatracing from dockerfile
* fix(precommit_install): script now installs packages in virtual env
* fix(precommit_install): installs packages in virtual env
* feat(precommit): added ruff for python linting and formatting
* feat(precommit): added ruff for python linting and formatting
* feat(precommit): run ruff when py files are commited
* feat(precommit): remod.py is run when ck_tile modified
* add empty line at the end
* style(precommit.yaml): remove empty line
---------
Co-authored-by: Max Podkorytov <4273004+tenpercent@users.noreply.github.com>
* Selective test filter initial commit.
* Expanded folder paths for parsing ninja dependencies.
* Fixing default branch name in the test evaluation script.
* Fixing paths for robustness and adding ctest command to the launch script.
* change jenkins file and few tests to upgrade CI
* Setting ninja build path.
* Fixing typo in Jenkinsfile, and wrong paths.
* Fixing typo in launch script.
* add few more tests to check CI logic
* Fixing header for shell script.
* turn off performance test by default, add option to run all unit tests
* revert dummy changes in source code to trigger tests
* make sure develop branch runs all unit tests
---------
Co-authored-by: Vidyasagar Ananthan <vidyasagar.ananthan@amd.com>