CK ROCProf Tool
GPU performance profiling for Composable Kernel applications using AMD rocprof-compute.
Note: This is a native-only tool. For Docker usage, run via ck-docker exec ck-rocprof ...
Quick Start
# One-time setup (requires rocprofiler-compute installed)
./script/tools/ck-rocprof setup
# Profile executable
cd build
../script/tools/ck-rocprof run baseline ./bin/tile_example_gemm_universal
# Analyze LDS metrics
../script/tools/ck-rocprof analyze baseline
# Compare optimizations
../script/tools/ck-rocprof run optimized ./bin/tile_example_gemm_universal
../script/tools/ck-rocprof compare baseline optimized
Commands
setup
One-time setup: creates Python venv, installs dependencies, configures rocprof-compute.
run <name> <executable> [args]
Profile executable and save results.
# Basic profiling
ck-rocprof run baseline ./bin/gemm_example
# With arguments
ck-rocprof run large_matrix ./bin/gemm_example -m 8192 -n 8192 -k 4096
# Test filtering
ck-rocprof run unit_test ./bin/test_gemm --gtest_filter="*Fp16*"
analyze <name> [block]
Display profiling metrics (default: Block 12 - LDS).
ck-rocprof analyze baseline # LDS metrics
ck-rocprof analyze baseline 2 # L2 Cache
ck-rocprof analyze baseline 7 # Instruction Mix
compare <name1> <name2>
Side-by-side comparison of two runs.
list
List all profiling runs with size and date.
clean <name> / clean --all
Remove profiling runs. Use --all to remove all runs.
status
Show current configuration: paths and setup status.
Key LDS Metrics (Block 12)
Target Values:
- Bank Conflicts/Access: <0.01 (under 1% conflict rate)
- Bank Conflict Rate: >90% of peak bandwidth
Critical Metrics:
- 12.2.9 Bank Conflicts/Access: Direct conflict measure
- Baseline (naive): ~0.04 (4% conflicts)
- Optimized: <0.005 (<0.5% conflicts)
- 12.2.12 Bank Conflict Cycles: Wasted cycles per kernel
- 12.2.17 LDS Data FIFO Full: Memory system pressure
Optimization Workflow
# 1. Baseline
ck-rocprof run baseline ./bin/my_kernel
# 2. Check conflicts
ck-rocprof analyze baseline
# Look for Bank Conflicts/Access > 0.02
# 3. Optimize code (XOR transforms, padding, etc.)
# ... edit source ...
# 4. Test optimization
ninja my_kernel
ck-rocprof run optimized ./bin/my_kernel
# 5. Verify improvement
ck-rocprof compare baseline optimized
# Target: 8-10x reduction in conflicts
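The reduction target in step 5 is simple arithmetic on the two Bank Conflicts/Access values. As a minimal sketch, the hypothetical helper below (`conflict_reduction` is not part of ck-rocprof) divides the baseline value by the optimized one, using the example figures from the Key LDS Metrics section:

```shell
# Hypothetical helper, not part of ck-rocprof: reduction factor between
# two Bank Conflicts/Access values read off the analyze output.
conflict_reduction() {
    # awk handles the floating-point division; plain sh arithmetic is integer-only.
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (after <= 0) { print "inf"; exit }
        printf "%.1fx\n", before / after
    }'
}

# Example values from this README: baseline ~0.04, optimized 0.005.
conflict_reduction 0.04 0.005   # prints "8.0x"
```

A result of 8x or more would meet the stated target.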
Environment Variables
- CK_PROFILE_VENV: Python venv path (default: $PROJECT/.ck-rocprof-venv)
- CK_ROCPROF_BIN: rocprof-compute binary path (auto-detected from PATH or /opt/rocm)
- CK_ROCM_REQUIREMENTS: Path to rocprofiler-compute requirements.txt (auto-detected)
- CK_WORKLOAD_DIR: Results directory (default: $PROJECT/build/workloads)
- CK_GPU_TARGET: Override GPU detection (e.g., gfx950, MI300X)
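The documented defaults follow the usual "use the exported value, else fall back" pattern. The sketch below shows one way such resolution could look in a wrapper script; `resolve_defaults` and its `project` argument are illustrative names, and this is not a copy of how ck-rocprof itself does it:

```shell
# Illustrative sketch only: resolve the two documented path defaults.
# "project" stands in for the repository root ($PROJECT in this README).
resolve_defaults() {
    project="$1"
    # ${VAR:-default} keeps any value the user has already exported.
    echo "CK_PROFILE_VENV=${CK_PROFILE_VENV:-$project/.ck-rocprof-venv}"
    echo "CK_WORKLOAD_DIR=${CK_WORKLOAD_DIR:-$project/build/workloads}"
}

# Show what would be used for the current directory.
resolve_defaults "$PWD"
```

Exporting `CK_WORKLOAD_DIR=/some/path` before calling the function changes only that line of the output.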
Interpreting Results
Good Performance:
Bank Conflicts/Access: <0.01
Bank Conflict Rate: >90% of peak
LDS Data FIFO Full: Minimal cycles
Needs Optimization:
Bank Conflicts/Access: >0.02
Bank Conflict Cycles: High MAX values
LDS Data FIFO Full: High memory pressure
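The two Bank Conflicts/Access thresholds above can be turned into a quick check. This is a minimal sketch with a hypothetical helper name; the "borderline" label for values between 0.01 and 0.02 is an assumption, since the lists above do not classify that range:

```shell
# Hypothetical helper: classify a Bank Conflicts/Access value using the
# thresholds listed above (<0.01 good, >0.02 needs optimization).
classify_bank_conflicts() {
    # awk performs the floating-point comparison.
    awk -v v="$1" 'BEGIN {
        if (v < 0.01)      print "good"
        else if (v > 0.02) print "needs-optimization"
        else               print "borderline"   # assumed label for 0.01..0.02
    }'
}

classify_bank_conflicts 0.04    # prints "needs-optimization"
classify_bank_conflicts 0.005   # prints "good"
```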
Troubleshooting
"Profiling environment not set up"
ck-rocprof setup
"rocprof-compute not found"
export CK_ROCPROF_BIN=/custom/path/rocprof-compute
ck-rocprof setup
"Profiling results not found"
ck-rocprof list # Check available runs
rocminfo | grep gfx # Verify GPU arch
export CK_GPU_TARGET=gfx950 # Override if needed
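To avoid copying the architecture by hand, the `rocminfo | grep gfx` step can be narrowed to just the target name. A minimal sketch, assuming the agent name appears as a `gfx...` token in the rocminfo output; `first_gfx_target` is a hypothetical helper, and the sample input line below is illustrative rather than captured output:

```shell
# Hypothetical helper: read rocminfo-style text on stdin and print the
# first gfx target token found (nothing if there is none).
first_gfx_target() {
    grep -o 'gfx[0-9a-f][0-9a-f]*' | head -n 1
}

# Demo on an illustrative line; real use would be: rocminfo | first_gfx_target
printf '  Name:                    gfx942\n' | first_gfx_target   # prints "gfx942"
```

The result could then be exported, for example `export CK_GPU_TARGET="$(rocminfo | first_gfx_target)"`.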
Storage Layout
Results stored in workloads/<name>/:
- pmc_perf.csv: Performance counters (primary data file)
- perfmon/: Input metric files
- out/: Raw output data from profiler runs
- log.txt: Profiling log
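Because pmc_perf.csv is the primary data file, its presence is a reasonable quick test for whether a run finished writing results. A minimal sketch; `has_results` is a hypothetical helper, and the demo uses a throwaway temporary directory rather than a real workload:

```shell
# Hypothetical helper: succeed if a workload directory contains pmc_perf.csv.
has_results() {
    [ -f "$1/pmc_perf.csv" ]
}

# Demo on a temporary directory standing in for workloads/<name>/.
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/baseline"
: > "$demo_dir/baseline/pmc_perf.csv"   # create an empty placeholder file

if has_results "$demo_dir/baseline"; then
    echo "baseline: results present"
fi
rm -rf "$demo_dir"
```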
Technical Details
- Setup: Creates isolated Python venv, installs dependencies
- Profiling: Runs rocprof-compute profile --name <name> -- <executable>
- Analysis: Runs rocprof-compute analyze --path <path> --block <block>
- GPU Support: MI300/MI350 series, auto-detects architecture
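To see what the underlying invocations look like for a given run, the two command templates can be expanded without executing anything. This is a dry-run sketch with hypothetical helper names; it only prints strings and assumes block 12 as the default, matching the `analyze` command description:

```shell
# Dry-run sketch: print (do not execute) the rocprof-compute commands.
print_profile_cmd() {
    name="$1"; shift
    # Everything after the run name is the executable and its arguments.
    echo "rocprof-compute profile --name $name -- $*"
}

print_analyze_cmd() {
    # $1 = workload path, $2 = block number (defaults to 12, LDS).
    echo "rocprof-compute analyze --path $1 --block ${2:-12}"
}

print_profile_cmd baseline ./bin/gemm_example -m 8192
print_analyze_cmd build/workloads/baseline
```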
Related Tools
- ck-docker: Container management
- rocprof-compute: AMD GPU profiler v2
- rocm-smi: System monitoring
License
Copyright (c) Advanced Micro Devices, Inc.
SPDX-License-Identifier: MIT