cutlass/python/cutlass_api/examples/003_host_latency_best_practices.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "f97e61c9",
   "metadata": {},
   "source": [
    "# Best practices for reducing host-side latency"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a7a9c63c",
   "metadata": {},
   "source": [
    "Overall performance depends on both device performance (i.e., that of the kernel) and host performance (i.e., that of the runtime).\n",
    "This notebook focuses on the latter: techniques to minimize overheads incurred by the CUTLASS API and the underlying\n",
    "DSL runtimes.\n",
    "\n",
    "This notebook does not discuss techniques for improving device-side performance. A future notebook may cover this topic.\n",
    "\n",
    "**Note**: Latency measurements can vary from system to system. You may see different results on your system than shown\n",
    "in the pre-populated fields of this notebook."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3ca9e40",
   "metadata": {},
   "outputs": [],
   "source": [
    "import time\n",
    "import torch\n",
    "import cutlass_api"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "efaac09c",
   "metadata": {},
   "outputs": [],
   "source": [
    "if not (status := cutlass_api.utils.is_device_cc_supported({80, 89, 90, 100, 103})):\n",
    "    print(\n",
    "        f\"This notebook requires a GPU with compute capability 80, 89, 90, 100, or 103.\\n{status.error}\"\n",
    "    )\n",
    "    import sys\n",
    "\n",
    "    sys.exit(0)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "40de11ce",
   "metadata": {},
   "source": [
    "We start with boilerplate initial setup to create tensors and pick a kernel.\n",
    "\n",
    "For the purposes of this notebook, we use a very small GEMM size of M=N=K=128\n",
    "and L=1. This small size is chosen to magnify the impact of host latency on\n",
    "end-to-end performance so as to better illustrate the effect of the techniques\n",
    "described below."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b8c44947",
   "metadata": {},
   "outputs": [],
   "source": [
    "warmup_iterations = 10\n",
    "profiling_iterations = 100\n",
    "total_iterations = warmup_iterations + profiling_iterations\n",
    "\n",
    "# Use a small problem size to showcase host overheads\n",
    "L, M, N, K = 1, 128, 128, 128\n",
    "\n",
    "# We use different operands in each iteration. Though not particularly relevant for\n",
    "# host latency, this is a best practice when benchmarking GPU kernels to avoid\n",
    "# unrealistic caching effects.\n",
    "As = [\n",
    "    torch.randint(-1, 2, (M, K), device=\"cuda\", dtype=torch.float16)\n",
    "    for _ in range(total_iterations)\n",
    "]\n",
    "Bs = [\n",
    "    torch.randint(-1, 2, (K, N), device=\"cuda\", dtype=torch.float16)\n",
    "    for _ in range(total_iterations)\n",
    "]\n",
    "outs = [\n",
    "    torch.empty((M, N), device=\"cuda\", dtype=torch.float16)\n",
    "    for _ in range(total_iterations)\n",
    "]\n",
    "\n",
    "# Construct arguments outside of the benchmarking loop. We will later also consider\n",
    "# cases in which they are constructed inside the benchmarking loop.\n",
    "args = [\n",
    "    cutlass_api.arguments.GemmArguments(\n",
    "        A=As[i], B=Bs[i], out=outs[i], accumulator_type=torch.float32\n",
    "    )\n",
    "    for i in range(total_iterations)\n",
    "]\n",
    "\n",
    "references = [(As[i] @ Bs[i]).to(outs[i].dtype) for i in range(total_iterations)]\n",
    "\n",
    "cc = cutlass_api.utils.device_cc()\n",
    "kernels = cutlass_api.get_kernels(args[0], cc=cc)\n",
    "assert len(kernels) > 0\n",
    "\n",
    "kernel = kernels[0]"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f2e7eece",
   "metadata": {},
   "source": [
    "We next set up a basic benchmarking routine."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2472eafa",
   "metadata": {},
   "outputs": [],
   "source": [
    "def benchmark(\n",
    "    label, code, warmup_it=warmup_iterations, profiling_it=profiling_iterations\n",
    "):\n",
    "    total_it = warmup_it + profiling_it\n",
    "    assert total_it <= total_iterations, (\n",
    "        f\"Benchmark-local iteration count must be less than or equal to total iterations: {total_it} > {total_iterations}\"\n",
    "    )\n",
    "    # warmup\n",
    "    rets = [None] * total_it\n",
    "    for i in range(warmup_it):\n",
    "        rets[i] = code(i)\n",
    "    torch.cuda.synchronize()\n",
    "\n",
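    "    # perf_counter is a monotonic, high-resolution clock suited to timing short intervals\n",
    "    start = time.perf_counter()\n",
    "    for i in range(profiling_it):\n",
    "        idx = warmup_it + i\n",
    "        rets[idx] = code(idx)\n",
    "    torch.cuda.synchronize()\n",
    "    end = time.perf_counter()\n",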
    "\n",
    "    avg_time = (end - start) / profiling_it\n",
    "    print(f\"[{label:<30}] avg of {profiling_it} iterations: {avg_time:1.3e} seconds\")\n",
    "    return avg_time, rets"
   ]
  },
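  {
   "cell_type": "markdown",
   "id": "7c3d9e1a",
   "metadata": {},
   "source": [
    "The routine above reports end-to-end time per iteration, which includes device execution. As a minimal sketch (not part of the CUTLASS API), the helper below also records the time at which the launch loop returns. That time approximates host-side cost, assuming launches are asynchronous and the launch queue does not fill up. The names `time_host_and_total`, `launch`, and `synchronize` are hypothetical, and CPU-only stand-ins are used so the sketch runs without a GPU.\n",
    "\n",
    "```python\n",
    "import time\n",
    "\n",
    "\n",
    "def time_host_and_total(launch, synchronize, iterations):\n",
    "    # launch(i) enqueues work; synchronize() blocks until all work has finished\n",
    "    start = time.perf_counter()\n",
    "    for i in range(iterations):\n",
    "        launch(i)\n",
    "    host_done = time.perf_counter()  # the launch loop has returned to the host\n",
    "    synchronize()\n",
    "    end = time.perf_counter()\n",
    "    host_avg = (host_done - start) / iterations\n",
    "    total_avg = (end - start) / iterations\n",
    "    return host_avg, total_avg\n",
    "\n",
    "\n",
    "# CPU-only stand-ins (assumptions for illustration): appending to a list plays\n",
    "# the role of a launch, and a short sleep plays the role of waiting on the device\n",
    "pending = []\n",
    "host_avg, total_avg = time_host_and_total(\n",
    "    launch=pending.append,\n",
    "    synchronize=lambda: time.sleep(0.01),\n",
    "    iterations=10,\n",
    ")\n",
    "print(f'host: {host_avg:1.3e} s, total: {total_avg:1.3e} s')\n",
    "```\n",
    "\n",
    "In this notebook's setting, `launch` would wrap a call to `kernel.run` and `synchronize` would be `torch.cuda.synchronize`."
   ]
  },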
  {
   "cell_type": "markdown",
   "id": "4909a76b",
   "metadata": {},
   "source": [
    "We now describe techniques for reducing host latency:\n",
    "* Compile once, run many times\n",
    "* Bypassing checks for argument-kernel compatibility\n",
    "* Using [CUDA Graphs](https://developer.nvidia.com/blog/cuda-graphs/)\n",
    "* Using [TVM FFI](https://tvm.apache.org/ffi/)\n",
    "\n",
    "These techniques are complementary and should be combined wherever they apply."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06495033",
   "metadata": {},
   "source": [
    "### Compile once, run many times\n",
    "The `kernel.run` method takes an optional `compiled_artifact` argument of type\n",
    "`cutlass_api.artifact.CompiledArtifact`. When this argument is set, `kernel.run`\n",
    "directly uses the precompiled function within `compiled_artifact`. When\n",
    "it is not set, every call to `kernel.run` JIT compiles the kernel again.\n",
    "\n",
    "Precompiling the kernel is critical to achieving good performance: in the pre-populated\n",
    "output below, each call without a compiled artifact takes over a second, while each call\n",
    "with one takes about 10 microseconds."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6de11f56",
   "metadata": {},
   "outputs": [],
   "source": [
    "stream = torch.cuda.current_stream()\n",
    "\n",
    "\n",
    "def no_compiled_artifact(i: int):\n",
    "    return kernel.run(args[i], stream=stream)\n",
    "\n",
    "\n",
    "# Compile the kernel once and reuse it for each iteration\n",
    "compiled_artifact = kernel.compile(args[0])\n",
    "\n",
    "\n",
    "def with_compiled_artifact(i: int):\n",
    "    return kernel.run(args[i], stream=stream, compiled_artifact=compiled_artifact)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "350c9bd6",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[Without compiled artifact     ] avg of 5 iterations: 1.376e+00 seconds\n",
      "[With compiled artifact        ] avg of 5 iterations: 1.016e-05 seconds\n"
     ]
    }
   ],
   "source": [
    "time_no_artifact, _ = benchmark(\n",
    "    \"Without compiled artifact\", no_compiled_artifact, warmup_it=2, profiling_it=5\n",
    ")\n",
    "time_w_artifact, _ = benchmark(\n",
    "    \"With compiled artifact\", with_compiled_artifact, warmup_it=2, profiling_it=5\n",
    ")"
   ]
  },
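  {
   "cell_type": "markdown",
   "id": "2f8a6b4d",
   "metadata": {},
   "source": [
    "When an application launches more than one problem configuration, the compile-once idea can be kept by caching artifacts. The following is a minimal sketch under stated assumptions: `ArtifactCache`, `fake_compile`, and the key layout are hypothetical names introduced here, and the sketch assumes an artifact is only reused for arguments with the same signature as those it was compiled for (as with the identically shaped operands in this notebook).\n",
    "\n",
    "```python\n",
    "from typing import Any, Callable, Dict, Hashable\n",
    "\n",
    "\n",
    "class ArtifactCache:\n",
    "    # Hypothetical helper: compile at most once per key, then reuse the result\n",
    "    def __init__(self, compile_fn: Callable[[Any], Any]):\n",
    "        self._compile_fn = compile_fn\n",
    "        self._artifacts: Dict[Hashable, Any] = {}\n",
    "        self.num_compilations = 0\n",
    "\n",
    "    def get(self, key: Hashable, args: Any) -> Any:\n",
    "        # Only the first request for a key pays the compilation cost\n",
    "        if key not in self._artifacts:\n",
    "            self._artifacts[key] = self._compile_fn(args)\n",
    "            self.num_compilations += 1\n",
    "        return self._artifacts[key]\n",
    "\n",
    "\n",
    "def fake_compile(args):\n",
    "    # Stand-in for a real compile call so the sketch runs without a GPU\n",
    "    return ('artifact', args)\n",
    "\n",
    "\n",
    "cache = ArtifactCache(fake_compile)\n",
    "key = ('gemm', 128, 128, 128, 'float16')  # assumed signature layout\n",
    "first = cache.get(key, 'args0')\n",
    "second = cache.get(key, 'args1')\n",
    "print(cache.num_compilations, first is second)\n",
    "```\n",
    "\n",
    "In this notebook's setting, `compile_fn` would be `kernel.compile`, and the object returned by `cache.get` would be passed to `kernel.run` as `compiled_artifact`."
   ]
  },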
  {
   "cell_type": "markdown",
   "id": "5cfbc2d2",
   "metadata": {},
   "source": [
    "### Bypassing checks for argument-kernel compatibility\n",
    "By default, the call to `kernel.run` will check if the kernel supports the provided arguments.\n",
    "Under the hood, this invokes `kernel.supports(args)`.\n",
    "\n",
    "While these checks are helpful for catching incompatible arguments, they are performed\n",
    "in Python, and thus can add to host overhead.\n",
    "\n",
    "When confident that arguments will be compatible with a kernel, one should bypass\n",
    "the `supports` check in `kernel.run` by setting the optional `assume_supported_args`\n",
    "argument to `True`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5b93dfae",
   "metadata": {},
   "outputs": [],
   "source": [
    "def with_supports_check(i: int):\n",
    "    return kernel.run(\n",
    "        args[i],\n",
    "        compiled_artifact=compiled_artifact,\n",
    "        stream=stream,\n",
    "        assume_supported_args=False,\n",
    "    )\n",
    "\n",
    "\n",
    "def without_supports_check(i: int):\n",
    "    return kernel.run(\n",
    "        args[i],\n",
    "        compiled_artifact=compiled_artifact,\n",
    "        stream=stream,\n",
    "        assume_supported_args=True,\n",
    "    )"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "b282f437",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[With supports check           ] avg of 100 iterations: 1.463e-05 seconds\n",
      "[Bypass supports check         ] avg of 100 iterations: 6.239e-06 seconds\n",
      "Speedup with skip supports: 2.34x\n"
     ]
    }
   ],
   "source": [
    "time_w_supports, _ = benchmark(\"With supports check\", with_supports_check)\n",
    "time_wo_supports, _ = benchmark(\"Bypass supports check\", without_supports_check)\n",
    "print(f\"Speedup with skip supports: {time_w_supports / time_wo_supports:.2f}x\")"
   ]
  },
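  {
   "cell_type": "markdown",
   "id": "b05e7d93",
   "metadata": {},
   "source": [
    "One way to stay confident while skipping the per-call check is to validate a representative argument set once, outside the hot loop. The sketch below illustrates the pattern with a hypothetical `FakeKernel` stand-in so that it runs without a GPU. It assumes every argument set shares the properties the check inspects, and the boolean returned by `supports` here is an assumption, since the return type of the real `kernel.supports` is not shown in this notebook.\n",
    "\n",
    "```python\n",
    "class FakeKernel:\n",
    "    # Hypothetical stand-in that counts how often the compatibility check runs\n",
    "    def __init__(self):\n",
    "        self.num_checks = 0\n",
    "\n",
    "    def supports(self, args):\n",
    "        self.num_checks += 1\n",
    "        return True\n",
    "\n",
    "    def run(self, args, assume_supported_args=False):\n",
    "        # Mirror the behavior described above: check unless told to skip\n",
    "        if not assume_supported_args and not self.supports(args):\n",
    "            raise ValueError('unsupported arguments')\n",
    "        return args\n",
    "\n",
    "\n",
    "fake_kernel = FakeKernel()\n",
    "all_args = ['args0', 'args1', 'args2']\n",
    "\n",
    "# Validate a representative argument set once, outside the hot loop\n",
    "if not fake_kernel.supports(all_args[0]):\n",
    "    raise ValueError('unsupported arguments')\n",
    "\n",
    "# Hot loop: skip the per-call check\n",
    "results = [fake_kernel.run(a, assume_supported_args=True) for a in all_args]\n",
    "print(fake_kernel.num_checks)\n",
    "```\n",
    "\n",
    "The check runs once rather than once per launch, so incompatible arguments are still caught early as long as the representative set really is representative."
   ]
  },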
  {
   "cell_type": "markdown",
   "id": "d74cb3e7",
   "metadata": {},
   "source": [
    "### CUDA Graphs"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "656d5e2c",
   "metadata": {},
   "source": [
    "[CUDA Graphs](https://developer.nvidia.com/blog/cuda-graphs/) allow a sequence of GPU operations to be defined as a dependency graph and then launched as a single unit, significantly reducing CPU launch overhead and enabling whole-graph optimizations.\n",
    "\n",
    "CUTLASS API kernels can be captured in CUDA Graphs using the standard PyTorch workflow.\n",
    "\n",
    "Kernel compilation must happen outside of graph capture. With a precompiled kernel in hand, we capture a graph using the usual PyTorch idioms, launching the kernel several times on the graph's stream."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e614509f",
   "metadata": {},
   "outputs": [],
   "source": [
    "num_launches = 20\n",
    "\n",
    "# Create a CUDA Graph that runs our compiled kernel num_launches times\n",
    "g = torch.cuda.CUDAGraph()\n",
    "with torch.cuda.graph(g):\n",
    "\n",
    "    # NOTE: Kernel compilation must happen outside of graph capture, so\n",
    "    # kernel.compile(args) is not called here; we reuse compiled_artifact from above.\n",
    "\n",
    "    # Run num_launches iterations of our compiled kernel on the current stream\n",
    "    for i in range(num_launches):\n",
    "        kernel.run(\n",
    "            args[i],\n",
    "            compiled_artifact=compiled_artifact,\n",
    "            stream=torch.cuda.current_stream(),\n",
    "            assume_supported_args=True,\n",
    "        )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8fc69c6e",
   "metadata": {},
   "source": [
    "This captures all kernel launches issued to the CUDA stream associated with the graph `g`, without actually executing them.\n",
    "Once captured, we can replay the graph.\n",
    "\n",
    "Note that graph replay only replays the kernel launches placed on the graph's stream:\n",
    "* During graph capture, we must be careful to capture to the correct stream (`torch.cuda.current_stream()` under the graph context).\n",
    "* Any other preparatory work on the host and any arguments passed in from Python are fixed at capture time. Changing them requires re-capturing the graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "d9c5d5c5",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Replay captured graph and check first result\n",
    "g.replay()\n",
    "\n",
    "torch.testing.assert_close(outs[0], references[0])"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "388c8e02",
   "metadata": {},
   "source": [
    "Let's compare the timing:"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "45d4e739",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[20 launches without CUDA Graph] avg of 1 iterations: 4.699e-04 seconds\n",
      "[20 launches with CUDA Graph   ] avg of 1 iterations: 9.084e-05 seconds\n",
      "Speedup with CUDA Graph: 5.17x\n"
     ]
    }
   ],
   "source": [
    "def without_cuda_graph(x: int):\n",
    "    for i in range(num_launches):\n",
    "        kernel.run(\n",
    "            args[i],\n",
    "            compiled_artifact=compiled_artifact,\n",
    "            stream=torch.cuda.current_stream(),\n",
    "            assume_supported_args=True,\n",
    "        )\n",
    "\n",
    "\n",
    "def with_cuda_graph(x: int):\n",
    "    g.replay()\n",
    "\n",
    "\n",
    "time_wo_cuda_graph, _ = benchmark(\n",
    "    f\"{num_launches} launches without CUDA Graph\",\n",
    "    without_cuda_graph,\n",
    "    warmup_it=0,\n",
    "    profiling_it=1,\n",
    ")\n",
    "time_w_cuda_graph, _ = benchmark(\n",
    "    f\"{num_launches} launches with CUDA Graph\",\n",
    "    with_cuda_graph,\n",
    "    warmup_it=0,\n",
    "    profiling_it=1,\n",
    ")\n",
    "\n",
    "print(f\"Speedup with CUDA Graph: {time_wo_cuda_graph / time_w_cuda_graph:.2f}x\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fe5c3168",
   "metadata": {},
   "source": [
    "### TVM FFI"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "ee7f9fd2",
   "metadata": {},
   "source": [
    "[Apache TVM FFI](https://tvm.apache.org/ffi/) is an open ABI and FFI for machine learning systems.\n",
    "When available, CUTLASS API uses Apache TVM-FFI under the hood as its interface for invoking compiled DSL kernels from Python.\n",
    "\n",
    "TVM FFI is enabled by default in CUTLASS API, and is recommended for best performance."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "1690bbed",
   "metadata": {},
   "source": [
    "`cutlass_api.config.GlobalOptions().use_tvm_ffi` controls whether or not TVM-FFI will be used by CUTLASS API."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "993c60ae",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "True\n"
     ]
    }
   ],
   "source": [
    "print(cutlass_api.config.GlobalOptions().use_tvm_ffi)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "00ed9a40",
   "metadata": {},
   "source": [
    "If for some reason you do not wish to use TVM-FFI, you can set this option to `False`; no other change is needed. The code below compares performance with and without TVM-FFI."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e8f56be3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[TVM-FFI ON ] Create args     ] avg of 100 iterations: 8.367e-05 seconds\n",
      "[[TVM-FFI ON ] Compile kernel  ] avg of 5 iterations: 1.352e+00 seconds\n",
      "[[TVM-FFI ON ] Run kernel      ] avg of 100 iterations: 6.509e-06 seconds\n"
     ]
    }
   ],
   "source": [
    "original_use_tvm_ffi = cutlass_api.config.GlobalOptions().use_tvm_ffi\n",
    "\n",
    "cutlass_api.config.GlobalOptions().use_tvm_ffi = True\n",
    "\n",
    "\n",
    "def run_iteration(i):\n",
    "    # Use the same accumulator type that was used to select the kernel above\n",
    "    args = cutlass_api.arguments.GemmArguments(\n",
    "        A=As[i], B=Bs[i], out=outs[i], accumulator_type=torch.float32\n",
    "    )\n",
    "    return kernel.run(\n",
    "        args,\n",
    "        compiled_artifact=compiled_artifact,\n",
    "        stream=torch.cuda.current_stream(),\n",
    "        assume_supported_args=True,\n",
    "    )\n",
    "\n",
    "\n",
    "def create_arguments(i: int):\n",
    "    # Use the same accumulator type that was used to select the kernel above\n",
    "    return cutlass_api.arguments.GemmArguments(\n",
    "        A=As[i], B=Bs[i], out=outs[i], accumulator_type=torch.float32\n",
    "    )\n",
    "\n",
    "\n",
    "args_creation_on, args = benchmark(\"[TVM-FFI ON ] Create args\", create_arguments)\n",
    "compilation_on, compiled = benchmark(\n",
    "    \"[TVM-FFI ON ] Compile kernel\",\n",
    "    lambda i: kernel.compile(args[i]),\n",
    "    warmup_it=2,\n",
    "    profiling_it=5,\n",
    ")\n",
    "compiled_artifact = compiled[0]\n",
    "run_on, _ = benchmark(\n",
    "    \"[TVM-FFI ON ] Run kernel\",\n",
    "    lambda i: kernel.run(\n",
    "        args[i],\n",
    "        compiled_artifact=compiled_artifact,\n",
    "        assume_supported_args=True,\n",
    "        stream=stream,\n",
    "    ),\n",
    ")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5a4c2db4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[[TVM-FFI OFF ] Create args    ] avg of 100 iterations: 1.255e-04 seconds\n",
      "[[TVM-FFI OFF ] Compile kernel ] avg of 5 iterations: 1.278e+00 seconds\n",
      "[[TVM-FFI OFF ] Run kernel     ] avg of 100 iterations: 4.519e-05 seconds\n"
     ]
    }
   ],
   "source": [
    "cutlass_api.config.GlobalOptions().use_tvm_ffi = False\n",
    "args_creation_off, args = benchmark(\"[TVM-FFI OFF ] Create args\", create_arguments)\n",
    "compilation_off, compiled = benchmark(\n",
    "    \"[TVM-FFI OFF ] Compile kernel\",\n",
    "    lambda i: kernel.compile(args[i]),\n",
    "    warmup_it=2,\n",
    "    profiling_it=5,\n",
    ")\n",
    "compiled_artifact = compiled[0]\n",
    "run_off, _ = benchmark(\n",
    "    \"[TVM-FFI OFF ] Run kernel\",\n",
    "    lambda i: kernel.run(\n",
    "        args[i],\n",
    "        compiled_artifact=compiled_artifact,\n",
    "        assume_supported_args=True,\n",
    "        stream=stream,\n",
    "    ),\n",
    ")\n",
    "\n",
    "# Restore original setting\n",
    "cutlass_api.config.GlobalOptions().use_tvm_ffi = original_use_tvm_ffi"
   ]
  },
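  {
   "cell_type": "markdown",
   "id": "e41c8f26",
   "metadata": {},
   "source": [
    "The cells above save and restore `use_tvm_ffi` by hand, which leaves the option changed if a benchmark raises partway through. As a hedged sketch, a small context manager can restore the value even on error. `temporary_option` is a hypothetical helper, not part of CUTLASS API, and a `SimpleNamespace` stands in for the options object so the sketch runs on its own.\n",
    "\n",
    "```python\n",
    "from contextlib import contextmanager\n",
    "from types import SimpleNamespace\n",
    "\n",
    "\n",
    "@contextmanager\n",
    "def temporary_option(options, name, value):\n",
    "    # Set options.<name> to value for the duration of the with-block\n",
    "    original = getattr(options, name)\n",
    "    setattr(options, name, value)\n",
    "    try:\n",
    "        yield options\n",
    "    finally:\n",
    "        # Restore the original value even if the block raised\n",
    "        setattr(options, name, original)\n",
    "\n",
    "\n",
    "# Stand-in options object (assumption for illustration)\n",
    "options = SimpleNamespace(use_tvm_ffi=True)\n",
    "with temporary_option(options, 'use_tvm_ffi', False):\n",
    "    inside = options.use_tvm_ffi\n",
    "after = options.use_tvm_ffi\n",
    "print(inside, after)\n",
    "```\n",
    "\n",
    "With the real library, one would presumably pass `cutlass_api.config.GlobalOptions()` as `options`. This assumes that attribute assignment on that object behaves as it does in the cells above."
   ]
  },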
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "17b43718",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Speedups with TVM-FFI: \n",
      "Arg creation: 1.50x\n",
      "Compilation: 0.95x\n",
      "Run: 6.94x\n"
     ]
    }
   ],
   "source": [
    "print(\"Speedups with TVM-FFI: \")\n",
    "print(f\"Arg creation: {args_creation_off / args_creation_on:.2f}x\")\n",
    "print(f\"Compilation: {compilation_off / compilation_on:.2f}x\")\n",
    "print(f\"Run: {run_off / run_on:.2f}x\")"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.12.5"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}