mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-05-18 20:09:25 +00:00
168 lines
4.3 KiB
Markdown
168 lines
4.3 KiB
Markdown
# CK ROCProf Tool
|
|
|
|
GPU performance profiling for Composable Kernel applications using AMD rocprof-compute.
|
|
|
|
**Note:** This is a native-only tool. For Docker usage, run via `ck-docker exec ck-rocprof ...`
|
|
|
|
## Quick Start
|
|
|
|
```bash
|
|
# One-time setup (requires rocprofiler-compute installed)
|
|
./script/tools/ck-rocprof setup
|
|
|
|
# Profile executable
|
|
cd build
|
|
../script/tools/ck-rocprof run baseline ./bin/tile_example_gemm_universal
|
|
|
|
# Analyze LDS metrics
|
|
../script/tools/ck-rocprof analyze baseline
|
|
|
|
# Compare optimizations
|
|
../script/tools/ck-rocprof run optimized ./bin/tile_example_gemm_universal
|
|
../script/tools/ck-rocprof compare baseline optimized
|
|
```
|
|
|
|
## Commands
|
|
|
|
### `setup`
|
|
One-time setup: creates Python venv, installs dependencies, configures rocprof-compute.
|
|
|
|
### `run <name> <executable> [args]`
|
|
Profile executable and save results.
|
|
|
|
```bash
|
|
# Basic profiling
|
|
ck-rocprof run baseline ./bin/gemm_example
|
|
|
|
# With arguments
|
|
ck-rocprof run large_matrix ./bin/gemm_example -m 8192 -n 8192 -k 4096
|
|
|
|
# Test filtering
|
|
ck-rocprof run unit_test ./bin/test_gemm --gtest_filter="*Fp16*"
|
|
```
|
|
|
|
### `analyze <name> [block]`
|
|
Display profiling metrics (default: Block 12 - LDS).
|
|
|
|
```bash
|
|
ck-rocprof analyze baseline # LDS metrics
|
|
ck-rocprof analyze baseline 2 # L2 Cache
|
|
ck-rocprof analyze baseline 7 # Instruction Mix
|
|
```
|
|
|
|
### `compare <name1> <name2>`
|
|
Side-by-side comparison of two runs.
|
|
|
|
### `list`
|
|
List all profiling runs with size and date.
|
|
|
|
### `clean <name>` / `clean --all`
|
|
Remove profiling runs. Use `--all` to remove all runs.
|
|
|
|
### `status`
|
|
Show current configuration: mode (native/Docker), paths, setup status.
|
|
|
|
## Key LDS Metrics (Block 12)
|
|
|
|
**Target Values:**
|
|
- Bank Conflicts/Access: <0.01 (1% conflict rate)
|
|
- Bank Conflict Rate: >90% of peak bandwidth
|
|
|
|
**Critical Metrics:**
|
|
- **12.2.9 Bank Conflicts/Access**: Direct conflict measure
|
|
- Baseline (naive): ~0.04 (4% conflicts)
|
|
- Optimized: <0.005 (<0.5% conflicts)
|
|
- **12.2.12 Bank Conflict Cycles**: Wasted cycles per kernel
|
|
- **12.2.17 LDS Data FIFO Full**: Memory system pressure
|
|
|
|
## Optimization Workflow
|
|
|
|
```bash
|
|
# 1. Baseline
|
|
ck-rocprof run baseline ./bin/my_kernel
|
|
|
|
# 2. Check conflicts
|
|
ck-rocprof analyze baseline
|
|
# Look for Bank Conflicts/Access > 0.02
|
|
|
|
# 3. Optimize code (XOR transforms, padding, etc.)
|
|
# ... edit source ...
|
|
|
|
# 4. Test optimization
|
|
ninja my_kernel
|
|
ck-rocprof run optimized ./bin/my_kernel
|
|
|
|
# 5. Verify improvement
|
|
ck-rocprof compare baseline optimized
|
|
# Target: 8-10x reduction in conflicts
|
|
```
|
|
|
|
## Environment Variables
|
|
|
|
- `CK_PROFILE_VENV`: Python venv path (default: `$PROJECT/.ck-rocprof-venv`)
|
|
- `CK_ROCPROF_BIN`: rocprof-compute binary path (auto-detected from PATH or /opt/rocm)
|
|
- `CK_ROCM_REQUIREMENTS`: Path to rocprofiler-compute requirements.txt (auto-detected)
|
|
- `CK_WORKLOAD_DIR`: Results directory (default: `$PROJECT/build/workloads`)
|
|
- `CK_GPU_TARGET`: Override GPU detection (e.g., `gfx950`, `MI300X`)
|
|
|
|
## Interpreting Results
|
|
|
|
**Good Performance:**
|
|
```
|
|
Bank Conflicts/Access: <0.01
|
|
Bank Conflict Rate: >90% of peak
|
|
LDS Data FIFO Full: Minimal cycles
|
|
```
|
|
|
|
**Needs Optimization:**
|
|
```
|
|
Bank Conflicts/Access: >0.02
|
|
Bank Conflict Cycles: High MAX values
|
|
LDS Data FIFO Full: High memory pressure
|
|
```
|
|
|
|
## Troubleshooting
|
|
|
|
**"Profiling environment not set up"**
|
|
```bash
|
|
ck-rocprof setup
|
|
```
|
|
|
|
**"rocprof-compute not found"**
|
|
```bash
|
|
export CK_ROCPROF_BIN=/custom/path/rocprof-compute
|
|
ck-rocprof setup
|
|
```
|
|
|
|
**"Profiling results not found"**
|
|
```bash
|
|
ck-rocprof list # Check available runs
|
|
rocminfo | grep gfx # Verify GPU arch
|
|
export CK_GPU_TARGET=gfx950 # Override if needed
|
|
```
|
|
|
|
## Storage Layout
|
|
|
|
Results stored in `workloads/<name>/`:
|
|
- `pmc_perf.csv`: Performance counters (primary data file)
|
|
- `perfmon/`: Input metric files
|
|
- `out/`: Raw output data from profiler runs
|
|
- `log.txt`: Profiling log
|
|
|
|
## Technical Details
|
|
|
|
- **Setup**: Creates isolated Python venv, installs dependencies
|
|
- **Profiling**: Runs `rocprof-compute profile --name <name> -- <executable>`
|
|
- **Analysis**: Runs `rocprof-compute analyze --path <path> --block <block>`
|
|
- **GPU Support**: MI300/MI350 series, auto-detects architecture
|
|
|
|
## Related Tools
|
|
|
|
- `ck-docker`: Container management
|
|
- `rocprof-compute`: AMD GPU profiler v2
|
|
- `rocm-smi`: System monitoring
|
|
|
|
## License
|
|
|
|
Copyright (c) Advanced Micro Devices, Inc. SPDX-License-Identifier: MIT
|