composable_kernel

mirror of https://github.com/ROCm/composable_kernel.git synced 2026-07-02 21:27:45 +00:00

Files

Matthias Gehre d43c474532 [CK] AIESW-32282: thread BElementwiseOperation dequant op down to ThreadwiseTensorSliceTransfer_v4 + bf16 truncate variant

Previously the wmma_cshuffle_v3 b_scale device-op's BElementwiseOperation
template parameter was carried as a struct member through the gridwise and
blockwise pipelines, but the per-nibble dequant call site in
ThreadwiseTensorSliceTransfer_v4::Run() hardcoded a local DequantPack8{} /
DequantPack8WithZp{} instance and ignored the template-supplied op.

This commit:

* Adds a new dedicated BDequantOp template parameter (defaulted to void
  for upstream compatibility) to:
    - device/impl/device_gemm_wmma_cshuffle_v3_b_scale.hpp
    - grid/gridwise_gemm_wmma_cshuffle_v3_ab_scale.hpp
    - grid/gridwise_gemm_wmma_cshuffle_v3_common.hpp
    - block/blockwise_gemm_pipeline_wmma_selector.hpp
    - block/blockwise_gemm_pipeline_wmmaops_v1.hpp
  The new slot is separate from BElementwiseOperation because that one is
  also consumed at the B global->LDS copy via
  ThreadwiseTensorSliceTransfer_v3r1, which expects a 2-arg operator()
  while DequantPack8WithZp has 3/4-arg overloads only.

* Adds a DequantPolicyFor<> trait in
  element/unary_element_wise_operation.hpp that maps a "B dequant carrier"
  type to the (sym, asym) pair the v1 Interwave pipeline must compile with.
  Defaults to (void, void) so any non-dequant carrier (PassThrough
  included) lowers to existing behaviour.

* Updates ThreadwiseTensorSliceTransfer_v4::Run() (sym + asym overloads)
  to take a templated BElementOp / BElementOpAsym and instantiate it
  locally. Default arguments preserve the prior DequantPack8{} /
  DequantPack8WithZp{} behaviour bit-identically — existing CK callers
  (no BDequantOp passed) link unchanged.

* Adds bf16 truncate-via-bit-cast variants in
  element/unary_element_wise_operation.hpp:
    - fp32_to_bhalf_truncate(float)
    - i4_to_bhalf4_scale_truncate(int, bhalf2_t)
    - i4_to_bhalf4_zp_scale_truncate(int, bhalf2_t, bhalf2_t)
    - DequantPack8Truncate / DequantPack8WithZpTruncate element-ops
  These skip the IEEE round-to-nearest-even chain that
  type_convert<bhalf_t>(float) lowers to (v_add3_u32 +0x7fff bias +
  v_cmp_o_f32 + v_cndmask_b16 0x7fc0 NaN-quietening), about 1150 of
  the 3988 lines in the bf16 asym ISA dump. Worst-case error vs RTE
  is 0.5 ULP of bf16 = ~4e-3 relative, well inside the W4A16 op-test
  TOL_REL=5e-3. fp16 overloads of the truncate variants delegate to the
  non-truncate path (the fp16 bit-trick is already optimal, no rounding
  chain to remove).

  CK analog of the Triton-side optimization in vLLM PR ROCm/vllm#953.
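
The difference between the two fp32 to bf16 conversions can be sketched on the host as follows. This is a minimal illustration under stated assumptions, not the CK code: the function names bf16_bits_truncate and bf16_bits_rte are hypothetical, bf16 is modelled as a raw 16-bit pattern, and the RTE variant only mirrors the +0x7fff bias and 0x7fc0 NaN handling described above.

```cpp
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical host-side helpers; bf16 is modelled as its raw 16-bit pattern.
static std::uint32_t float_bits(float x)
{
    std::uint32_t u;
    std::memcpy(&u, &x, sizeof(u)); // bit-cast without undefined behaviour
    return u;
}

// Truncate: keep the upper 16 bits of the fp32 pattern and drop the rest.
std::uint16_t bf16_bits_truncate(float x)
{
    return static_cast<std::uint16_t>(float_bits(x) >> 16);
}

// Round-to-nearest-even: add 0x7fff plus the lowest kept bit, then shift.
// NaN inputs are mapped to the quiet NaN pattern 0x7fc0.
std::uint16_t bf16_bits_rte(float x)
{
    if(std::isnan(x))
        return 0x7fc0;
    const std::uint32_t u    = float_bits(x);
    const std::uint32_t bias = 0x7fffu + ((u >> 16) & 1u);
    return static_cast<std::uint16_t>((u + bias) >> 16);
}
```

The two functions agree whenever the dropped 16 bits are zero and differ by at most one bf16 step otherwise; the truncate path is a single shift, while the RTE path needs the NaN check, the bias add and the select.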

End-to-end measurement on RedHatAI/Qwen3-8B-quantized.w4a16 (bf16, native
dtype, --num-prompts 10 --output-len 1 --input-len 3968):

  Triton baseline:  3030 ms TTFT
  CK with RTE:      3278 ms TTFT (+8.2% LOSS vs Triton)
  CK with TRUNC:    2796 ms TTFT (-7.7% WIN vs Triton)

The truncate variant closes the bf16 gap and overtakes Triton; RTE is a
regression on this hardware (gfx1151 / Strix Halo) because RDNA3 lacks
packed bf16 FMA and the rounding chain dominates the dequant pipeline.

Smoke test: all 16 (sym/asym x fp16/bf16 x G=32/G=128 x RTE/TRUNC)
combinations pass at TOL_REL=5e-3 in op_tests/test_gemm_w4a16.py.

Backward compatibility: with the new template arg left at its void
default, existing CK callers produce bit-identical code to before — the
threadwise transfer's defaulted template arg resolves to DequantPack8 /
DequantPack8WithZp via std::conditional_t.
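
The void-default mechanism can be sketched as below. This is a simplified model under assumptions, not the actual CK headers: ResolveDequantOp, ThreadwiseTransferSketch and DequantPolicyForSketch are hypothetical names, the element-op structs are empty stand-ins that only borrow their names from this commit, and the mapping from the truncate carrier to its (sym, asym) pair is assumed for illustration.

```cpp
#include <type_traits>

// Empty stand-ins for the element-ops named in the commit message.
struct PassThrough {};
struct DequantPack8 { static constexpr int id = 1; };
struct DequantPack8Truncate { static constexpr int id = 2; };
struct DequantPack8WithZpTruncate {};

// Hypothetical helper: void selects the pre-existing op,
// any other type is used as given.
template <typename Requested, typename Fallback>
using ResolveDequantOp =
    std::conditional_t<std::is_void_v<Requested>, Fallback, Requested>;

// Simplified model of a transfer that instantiates the op locally.
template <typename BDequantOp = void>
struct ThreadwiseTransferSketch
{
    using Op = ResolveDequantOp<BDequantOp, DequantPack8>;
    static constexpr int Run() { return Op::id; }
};

// Hypothetical shape of a policy trait: maps a carrier type to a
// (sym, asym) pair and defaults to (void, void) for non-dequant carriers.
template <typename Carrier>
struct DequantPolicyForSketch
{
    using Sym  = void;
    using Asym = void;
};

// Assumed specialization for a truncate carrier.
template <>
struct DequantPolicyForSketch<DequantPack8Truncate>
{
    using Sym  = DequantPack8Truncate;
    using Asym = DequantPack8WithZpTruncate;
};

// A caller that passes nothing gets the original op type.
static_assert(std::is_same_v<ThreadwiseTransferSketch<>::Op, DequantPack8>);
static_assert(std::is_void_v<DequantPolicyForSketch<PassThrough>::Sym>);
```

Because the fallback is chosen at the type level, a caller that never names BDequantOp instantiates the same op type as before, which is what allows the generated code to stay unchanged.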

2026-05-21 11:50:24 +02:00

impl

[CK] AIESW-32282: thread BElementwiseOperation dequant op down to ThreadwiseTensorSliceTransfer_v4 + bf16 truncate variant

2026-05-21 11:50:24 +02:00

conv_tensor_rearrange_op.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

convolution_backward_data_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

convolution_backward_weight_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

convolution_forward_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_avgpool_bwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_base.hpp

Improve XDL to WMMA porting for grouped conv fwd (#3456)

2025-12-19 15:58:51 -07:00

device_batched_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_e_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_multi_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_multiple_d_gemm_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_softmax_gemm_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm_softmax_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batched_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batchnorm_backward.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batchnorm_forward.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_batchnorm_infer.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_cgemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_contraction_multiple_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_bwd_data.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_fwd_bias_activation_add.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_fwd_bias_activation.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_conv_tensor_rearrange.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_elementwise_normalization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_elementwise_scale.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_elementwise.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_bias_e_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_dequantB.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_d_ab_scale.hpp

Wmma support for gemm_ab_scale (#3314)

2025-12-11 09:06:20 +01:00

device_gemm_multiple_d_layernorm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_d_multiple_r.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_mx.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_reduce.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_splitk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_streamk_v2.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_streamk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm_v2.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_bwd_data_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_bwd_weight_multiple_d.hpp

[Conv] Enable bwd weight splitk autodeduction with cap (#3656)

2026-01-29 17:40:28 +00:00

device_grouped_conv_bwd_weight.hpp

[Conv] Enable bwd weight splitk autodeduction with cap (#3656)

2026-01-29 17:40:28 +00:00

device_grouped_conv_fwd_multiple_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_fwd_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_conv_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_fixed_nk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_multi_abd_fixed_nk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_multi_abd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_softmax_gemm_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_splitk.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_grouped_gemm_tile_loop.hpp

Implement grouped gemm tile loop for RDNA4 (#3304)

2026-01-13 07:14:23 +01:00

device_grouped_gemm.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_max_pool_bwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_multiple_reduce.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_normalization_bwd_data.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_normalization_bwd_gamma_beta.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_normalization_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_permute.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_pool_fwd.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_put_element.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_reduce_multi_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_reduce.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_softmax.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

device_splitk_contraction_multiple_d.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

gemm_specialization.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

helper.hpp

chore(copyright): update copyright header for include directory (#3224)

2025-11-18 10:17:18 -08:00

masking_specialization.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

matrix_padder.hpp

[rocm-libraries] ROCm/rocm-libraries#4828 (commit 7de19bb)

2026-02-28 20:11:11 +00:00

reduction_operator_mapping.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

tensor_layout.hpp

[Compiler] Addressing new compiler warnings (#3640)

2026-02-02 09:39:48 -08:00

tensor_specialization.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00

welford_helper.hpp

chore(copyright): update copyright header for include directory (#3293)

2025-11-26 11:00:05 -07:00