diff --git a/docs/profiling/images/roofline_8_group_merged.jpg b/docs/profiling/images/roofline_8_group_merged.jpg
new file mode 100644
index 0000000000..e71e554d27
Binary files /dev/null and b/docs/profiling/images/roofline_8_group_merged.jpg differ
diff --git a/docs/profiling/images/roofline_no_group_merging.jpg b/docs/profiling/images/roofline_no_group_merging.jpg
new file mode 100644
index 0000000000..19eda68889
Binary files /dev/null and b/docs/profiling/images/roofline_no_group_merging.jpg differ
diff --git a/docs/profiling/rocprof-compute.md b/docs/profiling/rocprof-compute.md
new file mode 100644
index 0000000000..9cdf0f519b
--- /dev/null
+++ b/docs/profiling/rocprof-compute.md
@@ -0,0 +1,139 @@
+````
+ _ __ ___ ___ _ __ _ __ ___ / _| ___ ___ _ __ ___ _ __ _ _| |_ ___
+| '__/ _ \ / __| '_ \| '__/ _ \| |_ _____ / __/ _ \| '_ ` _ \| '_ \| | | | __/ _ \
+| | | (_) | (__| |_) | | | (_) | _|_____| (_| (_) | | | | | | |_) | |_| | || __/
+|_| \___/ \___| .__/|_| \___/|_| \___\___/|_| |_| |_| .__/ \__,_|\__\___|
+ |_| |_|
+````
+
+# Installation
+
+Either install using the Ubuntu package manager as described in https://rocm.docs.amd.com/projects/rocprofiler-sdk/en/latest/install/installation.html or
+build from source following the instructions at https://github.com/ROCm/rocm-systems/tree/develop/projects/rocprofiler-compute
+
+Building `rocprof-compute` locally might be preferable to avoid version conflicts between the local ROCm stack and
+what `rocprof-compute` expects.
+
+## Local build
+
+If `rocprof-compute` is built locally following the README instructions, the binary executable (`rocprof-compute.bin`) is located under `/rocm-systems/projects/rocprofiler-compute/build`.
+You can create an alias to the local `rocprof-compute` build by running
+
+````bash
+echo "alias rocprof-compute='~/git/rocm-systems/projects/rocprofiler-compute/build/rocprof-compute.bin'" >> ~/.bashrc
+source ~/.bashrc
+````
+
+You can verify that `rocprof-compute` points to your local build by running
+
+````bash
+rocprof-compute --version
+````
+
+# Running the profiler step
+
+Run the profiler:
+
+````bash
+rocprof-compute profile -n <run-name> -- <command>
+````
+
+For grouped convs, I usually use the CK examples to create an entry point to a single kernel. This corresponds to profiling with the command
+
+````bash
+rocprof-compute profile -n grouped_conv_fwd --roofline-data-type FP16 -- ./bin/example_grouped_conv_fwd_xdl_fp16 0 1 0 2 32 32 4 4 3 3 200 200 1 1 1 1 1 1 1 1
+````
+
+If you are running, e.g., multiple versions of the same kernel, make sure that the run name (specified by `-n`)
+is unique. This allows comparison of the runs later in the analysis stage.
+
+It is important to set the first and third args of the example to zero. These disable result verification and kernel timing.
+If timing is enabled, the kernel will be run multiple times during a single example run.
+`rocprof-compute` internally takes care of running the application multiple times to collect reliable statistics.
+
+By default, the profiling runs are saved in the directory where the profiler was executed. The parent directory is
+`workloads`, which contains all profiler runs in subdirectories corresponding to the run names, e.g.,
+the example run is located at `workloads/grouped_conv_fwd`.
+
+## Roofline plots
+
+For the roofline plots, we can specify the data type with the flag `--roofline-data-type`.
+This flag takes the values FP4, FP6, FP8, FP16, BF16, FP32, FP64, I8, I32, I64.
+If the roofline data type is not specified, it defaults to FP32.
+
+The profiling step can produce a roofline chart that gives an idea of what the bottleneck is for a given kernel and convolution shape.
+The roofline analysis takes into account the various cache levels (L1, L2) and the global memory bandwidth.
+The roofline plots alone can tell something about the relative performance of kernels.
+
+**Use case**: Consider a grouped fwd conv problem with a small number of channels per group and a large image.
+This is typically a memory-bound problem, as we can see from this roofline chart
+
+![Roofline Chart](./images/roofline_no_group_merging.jpg)
+
+When we combine multiple conv groups into a single GEMM batch, we can do more meaningful compute per tile,
+which moves the problem further into the compute-bound regime and improves performance. We can see this directly from the
+profiler roofline charts.
+
+![Roofline Chart](./images/roofline_8_group_merged.jpg)
+
+# Running the analysis step
+
+Running the analysis step requires that the data has been collected with the profiler step.
+To run the full analysis, use
+
+````bash
+rocprof-compute analyze -p <workload-path> --output-format txt --output-name <output-name>
+````
+
+For example,
+
+````bash
+rocprof-compute analyze -p workloads/grouped_conv_fwd/MI350/ --output-format txt --output-name full-analysis
+````
+
+This will create a fairly large file that contains all metrics available in `rocprof-compute`. In practice, we
+will want to limit the analysis to some specific metrics.
+
+## Analyzing specific metrics
+
+To get a list of available metrics for a given architecture, use the `--list-metrics` flag
+
+````bash
+rocprof-compute --list-metrics gfx950
+````
+
+This command lists all blocks in the analysis, and one can limit the analysis to specific blocks for brevity.
+To limit the analysis to L1 cache metrics, we run the analysis with only block 13, which corresponds to the
+instruction cache
+
+````bash
+rocprof-compute analyze -p workloads/grouped_conv_fwd/MI350/ --output-format txt --output-name L1-cache-analysis -b 13
+````
+
+## Analyzing two different implementations
+
+One of the most useful features of `rocprof-compute` is the ability to analyze two different implementations
+side-by-side. Assuming we have two implementations that we have profiled and we want to know how they compare
+against each other, we can run the analysis as follows
+
+````bash
+rocprof-compute analyze -p <baseline-workload-path> -p <new-workload-path>
+````
+
+This lets us compare the baseline implementation against a new implementation.
+Running the full analysis shows which metrics changed and in which direction.
+The profiler reports absolute differences against the baseline (first argument), which can be
+used to determine where the two implementations differ the most.
+
+
+To compare our two different conv implementations in terms of L1 cache efficiency, we can run
+
+````bash
+rocprof-compute analyze -p workloads/no_group_merge/MI350/ -p workloads/8_group_merged/MI350/ -b 13 --output-format txt --output-name L1-cache-comparison
+````
+
+For LDS utilization and statistics, we can use
+
+````bash
+rocprof-compute analyze -p workloads/no_group_merge/MI350/ -p workloads/8_group_merged/MI350/ -b 12 --output-format txt --output-name LDS-comparison
+````
\ No newline at end of file
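
A possible follow-up, not part of this patch: the two comparison commands at the end of the document differ only in the block ID and the output name, so they could be generated by a small helper. The sketch below is hypothetical. The function name `print_comparisons`, the `ARCH_DIR` variable, and the block-to-name pairing are assumptions made for illustration; the `rocprof-compute analyze` flags are only the ones the document already uses. The helper prints the commands rather than running them, so it can be checked on a machine without a GPU or the profiler installed.

````bash
#!/usr/bin/env bash
# Hypothetical helper (not part of rocprof-compute): print the side-by-side
# analysis commands for a baseline run and a candidate run.
# ARCH_DIR is an assumption: the per-architecture subdirectory found under
# workloads/<run name>/ (MI350 in the document's examples).
ARCH_DIR="${ARCH_DIR:-MI350}"

print_comparisons() {
    local baseline="$1" candidate="$2"
    local entry block name
    # "block:output-name" pairs copied from the document's two examples.
    for entry in "13:L1-cache-comparison" "12:LDS-comparison"; do
        block="${entry%%:*}"   # text before the first colon
        name="${entry#*:}"     # text after the first colon
        # The baseline must come first: differences are reported against it.
        printf 'rocprof-compute analyze -p %s -p %s -b %s --output-format txt --output-name %s\n' \
            "workloads/${baseline}/${ARCH_DIR}/" \
            "workloads/${candidate}/${ARCH_DIR}/" \
            "${block}" "${name}"
    done
}
````

Calling `print_comparisons no_group_merge 8_group_merged` prints one command per block; the output can be reviewed first and then piped to `bash` to run the comparisons. Both run names must already exist under `workloads/` from earlier `rocprof-compute profile -n` runs.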