mirror of
https://github.com/ROCm/composable_kernel.git
synced 2026-04-20 06:49:15 +00:00
[rocm-libraries] ROCm/rocm-libraries#4471 (commit 10fa702)
[CK] Optimize vector type build times **Supersedes https://github.com/ROCm/rocm-libraries/pull/4281 due to CI issues on import** ## Proposed changes Build times can be affected by many different things and are largely attributable to the way we write and use the code. Two critical areas of the builds are **frontend parsing** and **backend codegen and compilation**. ### Frontend Parsing The length of the code, the include header tree and macro expansions all affect the front-end parsing time. This PR seeks to reduce the parsing time of the dtype_vector.hpp vector_type class by reducing redundant code through generalization. * Partial specializations of vector_type for native and non-native datatypes have been generalized to one single class, consolidating all of the data initialization and AsType casting requirements into one place. * The nnvb_data_t_selector (i.e., non-native vector base dataT selector) class has been removed and replaced with scalar_type instantiations as they have the same purpose. The scalar_type class's purpose is already to map generalized datatypes to native types compatible with ext_vector_t. ### Backend Codegen Template instantiation behavior can also affect build times. Recursive instantiations are very slow versus concrete instantiations. The compiler must make multiple passes to expand template instantiations so we need to be careful about how they are used. * Previous vector_type classes declared a union storage class, which aliases StaticallyIndexedArray<T,N>. ``` template <typename T> struct vector_type<T, 4, typename ck::enable_if_t<is_native_type<T>()>> { using d1_t = T; typedef T d2_t __attribute__((ext_vector_type(2))); typedef T d4_t __attribute__((ext_vector_type(4))); using type = d4_t; union { d4_t d4_; StaticallyIndexedArray<d1_t, 4> d1x4_; StaticallyIndexedArray<d2_t, 2> d2x2_; StaticallyIndexedArray<d4_t, 1> d4x1_; } data_; ... 
}; ``` * Upon further inspection, StaticallyIndexedArray is built on top of a recursive Tuple concatenation. ``` template <typename T, index_t N> struct StaticallyIndexedArrayImpl { using type = typename tuple_concat<typename StaticallyIndexedArrayImpl<T, N / 2>::type, typename StaticallyIndexedArrayImpl<T, N - N / 2>::type>::type; }; ``` This union storage has been removed from the vector_type storage class. * Further references to StaticallyIndexedArray have been replaced with StaticallyIndexedArray_v2, which is a concrete implementation using C-style arrays. ``` template <typename T, index_t N> struct StaticallyIndexedArray_v2 { ... T data_[N]; }; ``` ### Fixes * Using the bool datatype with vector_type was previously error prone. Bool, as a native datatype, would be stored into bool ext_vector_type(N) for storage, which is a packed datatype. This means that, for example, sizeof(bool ext_vector_type(4)) == 1, which does not equal sizeof(StaticallyIndexedArray<bool ext_vector_type(1), 4>) == 4. The union of these datatypes has incorrect data slicing, meaning that the bit locations of the packed bool values do not match those of the StaticallyIndexedArray member. As such, vector_type will use C-style array storage for bool type instead of ext_vector_type. ``` template <typename T, index_t Rank> using NativeVectorT = T __attribute__((ext_vector_type(Rank))); sizeof(NativeVectorT<bool, 4>) == 1 (1 byte per 4 bool - packed) element0 = bit 0 of byte 0 element1 = bit 1 of byte 0 element2 = bit 2 of byte 0 element3 = bit 3 of byte 0 sizeof(StaticallyIndexedArray[NativeVectorT<bool, 1>, 4]) == 4 (1 byte per bool) element0 = bit 0 of byte 0 element1 = bit 0 of byte 1 element2 = bit 0 of byte 2 element3 = bit 0 of byte 3 union{ NativeVectorT<bool, 4> d1_t; ... StaticallyIndexedArray[NativeVectorT<bool,1>, 4] d4x1; }; // union size == 4 which means invalid slicing! 
``` * Math utilities such as next_power_of_two have been fixed for invalid cases of X < 2 * Removed redundant implementation of next_pow2 ### Additions * Added integer_log2_floor to math.hpp * Added is_power_of_two_integer to math.hpp ### Build Time Analysis Machine: banff-cyxtera-s78-2 Target: gfx942 | Build Target | Threads | Frontend Parse Time (s) | Backend Codegen Time (s) | Total Time (s) | commitId | |---------------|---------|-------------------------|--------------------------|---------------|
This commit is contained in:
committed by
assistant-librarian[bot]
parent
2dd2f114b3
commit
04eddbc5ce
@@ -34,9 +34,48 @@ using f4_t = unsigned _BitInt(4);
|
||||
using f6_t = _BitInt(6); // e2m3 format
|
||||
using bf6_t = unsigned _BitInt(6); // e3m2 format
|
||||
|
||||
// scalar_type
|
||||
template <typename TV>
|
||||
struct scalar_type;
|
||||
// native types: double, float, _Float16, ushort, int32_t, int8_t, uint8_t, f8_fnuz_t, bf8_fnuz_t,
|
||||
// native types: bool
|
||||
template <typename T>
|
||||
inline constexpr bool is_native_type()
|
||||
{
|
||||
return is_same_v<T, double> || is_same_v<T, float> || is_same_v<T, half_t> ||
|
||||
is_same_v<T, bhalf_t> || is_same_v<T, int32_t> || is_same_v<T, uint32_t> ||
|
||||
is_same_v<T, int8_t> || is_same_v<T, uint8_t> || is_same_v<T, _BitInt(8)> ||
|
||||
is_same_v<T, unsigned _BitInt(8)> || is_same_v<T, bool>;
|
||||
}
|
||||
|
||||
/**
|
||||
* @brief Wrapper for native vector type
|
||||
* @tparam T The element type of the vector
|
||||
* @tparam Rank The number of elements in the vector
|
||||
*/
|
||||
template <typename T, index_t Rank>
|
||||
using NativeVectorT = T __attribute__((ext_vector_type(Rank)));
|
||||
|
||||
/**
|
||||
* @brief Mapping of incoming type to local native vector storage type and vector size
|
||||
* @tparam T Incoming data type
|
||||
*/
|
||||
template <typename T>
|
||||
struct scalar_type
|
||||
{
|
||||
// Basic data type mapping to unsigned _BitInt of appropriate size
|
||||
using type = unsigned _BitInt(8 * sizeof(T));
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
/**
|
||||
* @brief scalar_type trait override for NativeVectorT
|
||||
* @tparam T The vector type
|
||||
* @tparam Rank The number of elements in the vector
|
||||
*/
|
||||
template <typename T, index_t Rank>
|
||||
struct scalar_type<NativeVectorT<T, Rank>>
|
||||
{
|
||||
using type = T;
|
||||
static constexpr index_t vector_size = Rank;
|
||||
};
|
||||
|
||||
struct f4x2_pk_t
|
||||
{
|
||||
@@ -74,6 +113,39 @@ struct f4x2_pk_t
|
||||
}
|
||||
};
|
||||
|
||||
// TODO: Unfortunately, we cannot partially specialize scalar_type for vectors written
|
||||
// in the following way:
|
||||
// template<typename T, index_t Rank>
|
||||
// struct scalar_type<T __attribute__((__vector_size__(sizeof(T) * Rank)))>
|
||||
// {
|
||||
// using type = T;
|
||||
// static constexpr index_t vector_size = Rank;
|
||||
// };
|
||||
// The compiler errors out with "partial specialization is not allowed for this type",
|
||||
// claiming that the Rank is not a deducible parameter. This might be a compiler bug.
|
||||
// Note the above type is classified differently from the NativeVectorT<T, Rank> alias,
|
||||
// even though they are functionally equivalent and are trivially constructible from each other.
|
||||
// This is unfortunate, but we have to work around it because some LLVM builtins for some
|
||||
// operations (e.g., mma) may return the former type.
|
||||
// For now we have to explicitly specialize for each vector size we need. These are used
|
||||
// in f6_pk_t below.
|
||||
|
||||
/// @brief scalar_type trait override for uint32_t vector of size 3
|
||||
template <>
|
||||
struct scalar_type<uint32_t __attribute__((__vector_size__(sizeof(uint32_t) * 3)))>
|
||||
{
|
||||
using type = uint32_t;
|
||||
static constexpr index_t vector_size = 3;
|
||||
};
|
||||
|
||||
/// @brief scalar_type trait override for uint32_t vector of size 6
|
||||
template <>
|
||||
struct scalar_type<uint32_t __attribute__((__vector_size__(sizeof(uint32_t) * 6)))>
|
||||
{
|
||||
using type = uint32_t;
|
||||
static constexpr index_t vector_size = 6;
|
||||
};
|
||||
|
||||
template <typename BitType, index_t pk_size>
|
||||
struct f6_pk_t
|
||||
{
|
||||
@@ -89,28 +161,48 @@ struct f6_pk_t
|
||||
static constexpr index_t vector_size =
|
||||
(packed_size * num_bits_elem) / num_bits_vec_elem; // 3 or 6 element_type units
|
||||
|
||||
using storage_type = element_type __attribute__((ext_vector_type(vector_size)));
|
||||
using storage_type = NativeVectorT<element_type, vector_size>;
|
||||
storage_type data_{storage_type(0)}; // packed data
|
||||
|
||||
using type = f6_pk_t<BitType, packed_size>;
|
||||
|
||||
/** This class may trivially constructed by the following vector type alias
|
||||
* for example from a result of an mma operation. This is primarily for internal use.
|
||||
* @note f6x16_pk_t and f6x32_pk_t storage types, may be trivially constructed from
|
||||
* uint32_t vectors of size 3 and 6 respectively for example from mma operation results.
|
||||
* Unfortunately, unsigned int __attribute__((ext_vector_type(6))) a.k.a
|
||||
* NativeVectorT<uint32_t, 6> is NOT the same as __attribute__((__vector_size__(6 *
|
||||
* sizeof(unsigned int)))) unsigned int which is returned from the mma ops despite being
|
||||
* functionally equivalent. This class may be trivially constructed from both, so we can steer
|
||||
* the templated ctor below to only consider incoming vectors types other than our two storage
|
||||
* types of interest.
|
||||
*/
|
||||
using storage_type_alias =
|
||||
element_type __attribute__((__vector_size__(sizeof(element_type) * vector_size)));
|
||||
|
||||
__host__ __device__ constexpr f6_pk_t() {}
|
||||
__host__ __device__ constexpr f6_pk_t(const storage_type& init) : data_{init}
|
||||
{
|
||||
// TODO: consider removing initialization similar to vector_type<T, 256>
|
||||
}
|
||||
|
||||
// Initialize from a vector type with the same size as packed_size
|
||||
template <typename T, typename = enable_if_t<scalar_type<T>::vector_size == packed_size>>
|
||||
// Initialize from a vector type with the same size as packed_size.
|
||||
// Exclude storage_type and storage_type_alias because these are trivially constructible.
|
||||
template <
|
||||
typename T,
|
||||
typename = enable_if_t<!is_same_v<T, storage_type> && !is_same_v<T, storage_type_alias> &&
|
||||
scalar_type<T>::vector_size == packed_size>>
|
||||
__host__ __device__ f6_pk_t(const T& v)
|
||||
{
|
||||
static_assert(scalar_type<T>::vector_size == packed_size,
|
||||
"Input vector size must match packed_size.");
|
||||
static_for<0, packed_size, 1>{}(
|
||||
[&](auto i) { pack(v[static_cast<index_t>(i)], static_cast<index_t>(i)); });
|
||||
}
|
||||
|
||||
// Broadcast single initialization value to all packed elements
|
||||
__host__ __device__ f6_pk_t(const int8_t v)
|
||||
: f6_pk_t(static_cast<int8_t __attribute__((ext_vector_type(packed_size)))>(v))
|
||||
: f6_pk_t(static_cast<NativeVectorT<int8_t, packed_size>>(v))
|
||||
{
|
||||
// TODO: consider removing initialization similar to vector_type<T, 256>
|
||||
}
|
||||
@@ -191,27 +283,6 @@ struct pk_i4_t
|
||||
__host__ __device__ constexpr pk_i4_t(type init) : data{init} {}
|
||||
};
|
||||
|
||||
inline constexpr auto next_pow2(uint32_t x)
|
||||
{
|
||||
// Precondition: x > 1.
|
||||
return x > 1u ? (1u << (32u - __builtin_clz(x - 1u))) : x;
|
||||
}
|
||||
|
||||
// native types: double, float, _Float16, ushort, int32_t, int8_t, uint8_t, f8_fnuz_t, bf8_fnuz_t,
|
||||
// native types: bool
|
||||
template <typename T>
|
||||
inline constexpr bool is_native_type()
|
||||
{
|
||||
return is_same<T, double>::value || is_same<T, float>::value || is_same<T, half_t>::value ||
|
||||
is_same<T, bhalf_t>::value || is_same<T, int32_t>::value ||
|
||||
is_same<T, uint32_t>::value || is_same<T, int8_t>::value || is_same<T, uint8_t>::value ||
|
||||
is_same_v<T, _BitInt(8)> || is_same_v<T, unsigned _BitInt(8)> || is_same<T, bool>::value;
|
||||
}
|
||||
|
||||
// scalar_type
|
||||
template <typename TV>
|
||||
struct scalar_type;
|
||||
|
||||
// is_scalar_type
|
||||
template <typename TV>
|
||||
struct is_scalar_type
|
||||
@@ -224,14 +295,13 @@ template <typename X, typename Y>
|
||||
using has_same_scalar_type = is_same<typename scalar_type<remove_cvref_t<X>>::type,
|
||||
typename scalar_type<remove_cvref_t<Y>>::type>;
|
||||
|
||||
template <typename T, index_t N>
|
||||
struct scalar_type<T __attribute__((ext_vector_type(N)))>
|
||||
template <>
|
||||
struct scalar_type<bool>
|
||||
{
|
||||
using type = T;
|
||||
static constexpr index_t vector_size = N;
|
||||
using type = bool;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
//
|
||||
template <>
|
||||
struct scalar_type<double>
|
||||
{
|
||||
@@ -293,35 +363,35 @@ struct scalar_type<int4_t>
|
||||
template <>
|
||||
struct scalar_type<pk_i4_t>
|
||||
{
|
||||
using type = pk_i4_t;
|
||||
using type = typename pk_i4_t::type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<f8_fnuz_t>
|
||||
{
|
||||
using type = f8_fnuz_t::data_type;
|
||||
using type = typename f8_fnuz_t::data_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<bf8_fnuz_t>
|
||||
{
|
||||
using type = bf8_fnuz_t::data_type;
|
||||
using type = typename bf8_fnuz_t::data_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<f8_ocp_t>
|
||||
{
|
||||
using type = f8_ocp_t::data_type;
|
||||
using type = typename f8_ocp_t::data_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<bf8_ocp_t>
|
||||
{
|
||||
using type = bf8_ocp_t::data_type;
|
||||
using type = typename bf8_ocp_t::data_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
@@ -329,7 +399,7 @@ struct scalar_type<bf8_ocp_t>
|
||||
template <>
|
||||
struct scalar_type<e8m0_bexp_t>
|
||||
{
|
||||
using type = e8m0_bexp_t::type;
|
||||
using type = typename e8m0_bexp_t::type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
#endif
|
||||
@@ -337,42 +407,35 @@ struct scalar_type<e8m0_bexp_t>
|
||||
template <>
|
||||
struct scalar_type<f4x2_pk_t>
|
||||
{
|
||||
using type = f4x2_pk_t::type;
|
||||
using type = typename f4x2_pk_t::type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<f6x32_pk_t>
|
||||
{
|
||||
using type = f6x32_pk_t::storage_type;
|
||||
using type = typename f6x32_pk_t::storage_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<bf6x32_pk_t>
|
||||
{
|
||||
using type = bf6x32_pk_t::storage_type;
|
||||
using type = typename bf6x32_pk_t::storage_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<f6x16_pk_t>
|
||||
{
|
||||
using type = f6x16_pk_t::storage_type;
|
||||
using type = typename f6x16_pk_t::storage_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<bf6x16_pk_t>
|
||||
{
|
||||
using type = bf6x16_pk_t::storage_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
template <>
|
||||
struct scalar_type<bool>
|
||||
{
|
||||
using type = bool;
|
||||
using type = typename bf6x16_pk_t::storage_type;
|
||||
static constexpr index_t vector_size = 1;
|
||||
};
|
||||
|
||||
|
||||
File diff suppressed because it is too large
Load Diff
@@ -260,7 +260,10 @@ struct DynamicBuffer
|
||||
x, p_data_, i, is_valid_element, element_space_size_ / PackedSize);
|
||||
}
|
||||
else if constexpr(GetAddressSpace() == AddressSpaceEnum::Lds &&
|
||||
is_same<typename scalar_type<remove_cvref_t<T>>::type, int8_t>::value &&
|
||||
is_same_v<typename scalar_type<remove_cvref_t<T>>::type, int8_t> &&
|
||||
!is_same_v<remove_cvref_t<T>,
|
||||
pk_i4_t> && // TODO: This needs to be fixed for pk_i4_t which
|
||||
// cannot be handled below, but is stored as int8_t
|
||||
workaround_int8_ds_write_issue)
|
||||
{
|
||||
if(is_valid_element)
|
||||
|
||||
@@ -222,16 +222,28 @@ template <index_t X>
|
||||
__host__ __device__ constexpr auto next_power_of_two()
|
||||
{
|
||||
// TODO: X need to be 2 ~ 0x7fffffff. 0, 1, or larger than 0x7fffffff will compile fail
|
||||
constexpr index_t Y = 1 << (32 - __builtin_clz(X - 1));
|
||||
constexpr index_t Y = X > 1 ? (1 << (32 - __builtin_clz(X - 1))) : X;
|
||||
return Y;
|
||||
}
|
||||
|
||||
template <index_t X>
|
||||
__host__ __device__ constexpr auto next_power_of_two(Number<X> x)
|
||||
__host__ __device__ constexpr auto next_power_of_two(Number<X>)
|
||||
{
|
||||
// TODO: X need to be 2 ~ 0x7fffffff. 0, 1, or larger than 0x7fffffff will compile fail
|
||||
constexpr index_t Y = 1 << (32 - __builtin_clz(x.value - 1));
|
||||
return Number<Y>{};
|
||||
return Number<next_power_of_two<X>()>{};
|
||||
}
|
||||
|
||||
__host__ __device__ constexpr int32_t integer_log2_floor(int32_t x)
|
||||
{
|
||||
// x valid for 1 ~ 0x7fffffff
|
||||
// __builtin_clz will produce unexpected result if x is 0;
|
||||
return (x > 0) ? (31 - __builtin_clz(x)) : -1;
|
||||
}
|
||||
|
||||
__host__ __device__ constexpr bool is_power_of_two_integer(int32_t x)
|
||||
{
|
||||
// x valid for 1 ~ 0x7fffffff
|
||||
// Powers of 2 always positive
|
||||
return (x > 0) ? !(x & (x - 1)) : false;
|
||||
}
|
||||
|
||||
} // namespace math
|
||||
|
||||
@@ -1841,7 +1841,7 @@ inline __host__ __device__ f6x32_t f6_convert_rne(float32_t x, float scale = 1.0
|
||||
float float_array[32];
|
||||
} in{x};
|
||||
|
||||
using array_type = uint8_t __attribute__((ext_vector_type(32)));
|
||||
using array_type = NativeVectorT<uint8_t, 32>;
|
||||
array_type uint8_array;
|
||||
|
||||
// collect the 6-bit values into an array
|
||||
@@ -2178,7 +2178,7 @@ inline __host__ __device__ bf6x32_t bf6_convert_rne(float32_t x, float scale = 1
|
||||
float float_array[32];
|
||||
} in{x};
|
||||
|
||||
using array_type = uint8_t __attribute__((ext_vector_type(32)));
|
||||
using array_type = NativeVectorT<uint8_t, 32>;
|
||||
array_type uint8_array;
|
||||
|
||||
// collect the 6-bit values into an array
|
||||
|
||||
Reference in New Issue
Block a user