vllm.model_executor.layers.quantization.utils.quant_utils ¶
This file is also used by /tests and /benchmarks
GroupShape ¶
Bases: _GroupShape
This class describes the quantization group shape. It includes static members for common shapes (per-tensor, per-token).
Source code in vllm/model_executor/layers/quantization/utils/quant_utils.py
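As an illustration, here is a minimal stand-in for such a group-shape type. The `PER_TENSOR`/`PER_TOKEN` names and the `-1` "span the whole dimension" convention are assumptions made for this sketch, not guaranteed to match the vLLM implementation:

```python
from typing import NamedTuple

class GroupShape(NamedTuple):
    """Quantization group shape as (rows, cols); -1 means the full dimension."""
    row: int
    col: int

# Static members for the common cases (names assumed for illustration)
PER_TENSOR = GroupShape(-1, -1)  # one scale for the whole tensor
PER_TOKEN = GroupShape(1, -1)    # one scale per row (token)

def scales_shape(x_shape: tuple, g: GroupShape) -> tuple:
    """Number of scales needed for a 2-D tensor under group shape g."""
    m, n = x_shape
    rows = m if g.row == -1 else g.row
    cols = n if g.col == -1 else g.col
    return (m // rows, n // cols)

print(scales_shape((4, 128), PER_TENSOR))         # (1, 1)
print(scales_shape((4, 128), PER_TOKEN))          # (4, 1)
print(scales_shape((4, 128), GroupShape(1, 64)))  # (4, 2)
```

Per-tensor quantization thus needs a single scalar scale, while a (1, 64) group shape needs one scale per 64-element row slice.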
QuantKey dataclass ¶
Class for identifying the type of quantization.

- `dtype`: quantized data type
- `scale`: scale descriptor
- `scale2`: second-level scale descriptor
- `symmetric`: symmetric if True, asymmetric if False
ScaleDesc dataclass ¶
Class for describing a single quantization scaling factor.

- `dtype`: data type of the scale
- `static`: static scale if True, dynamic if False
- `group_shape`: group shape of the scale
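A minimal sketch of how the two dataclasses above might compose, using the field names from their docstrings; plain strings and tuples stand in for `torch.dtype` and `GroupShape`, so this is illustrative rather than the actual vLLM definitions:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ScaleDesc:
    """One scaling factor: its dtype, static vs. dynamic, and group shape."""
    dtype: str          # stand-in for torch.dtype, e.g. "float32"
    static: bool        # True: scale precomputed offline; False: computed per input
    group_shape: tuple  # stand-in for GroupShape, e.g. (-1, -1) for per-tensor

@dataclass(frozen=True)
class QuantKey:
    """Identifies a quantization scheme: quantized dtype plus its scale levels."""
    dtype: str
    scale: ScaleDesc
    scale2: Optional[ScaleDesc] = None  # second-level scale, if the scheme has one
    symmetric: bool = True

# Example: static, per-tensor, symmetric FP8 quantization
fp8_static = QuantKey("float8_e4m3fn", ScaleDesc("float32", True, (-1, -1)))
print(fp8_static)
```

Freezing the dataclasses makes keys hashable, so they can be used to dispatch to kernels keyed on the quantization scheme.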
convert_bf16_scales_to_fp8 ¶
Convert a BF16 scale tensor into the pair of (fp8_scales, channel_scales) expected by W4A8 GEMM kernels.
convert_packed_uint4b8_to_signed_int4_inplace ¶
Convert int4b8 values (packed into int32) to signed int4 in place.
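Assuming "int4b8" denotes 4-bit values stored with a +8 bias (value = stored − 8), the conversion to two's-complement int4 reduces to flipping the high bit of every nibble, because (stored − 8) mod 16 == stored XOR 8. A sketch on plain Python ints (the real function operates in place on int32 tensors):

```python
def unpack_nibbles(word: int) -> list:
    """Unpack eight unsigned 4-bit fields from a 32-bit word (low nibble first)."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]

def signed_nibble(n: int) -> int:
    """Interpret an unsigned nibble as a two's-complement int4."""
    return n - 16 if n >= 8 else n

def int4b8_to_signed_int4(word: int) -> int:
    """Convert eight packed bias-8 nibbles to two's-complement int4 nibbles.

    value = stored - 8, and (stored - 8) % 16 == stored ^ 8, so the whole
    32-bit word converts with a single XOR against 0x88888888.
    """
    return word ^ 0x88888888

# Stored nibbles [0, 8, 15, 9, 8, 8, 8, 8] encode values [-8, 0, 7, 1, 0, 0, 0, 0]
packed = sum(b << (4 * i) for i, b in enumerate([0, 8, 15, 9, 8, 8, 8, 8]))
decoded = [signed_nibble(n) for n in unpack_nibbles(int4b8_to_signed_int4(packed))]
print(decoded)  # [-8, 0, 7, 1, 0, 0, 0, 0]
```

The single-XOR trick is what makes the in-place tensor conversion cheap: no unpacking is needed.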
get_and_maybe_dequant_weights ¶
get_and_maybe_dequant_weights(
layer: LinearBase, out_dtype: dtype = float32
)
Return the layer's unquantized weights in [out, in] layout.
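Conceptually, dequantization multiplies each quantized weight by its scale and returns the result in [out, in] (output-channel-major) layout. A toy sketch assuming simple per-output-channel scales, not the actual vLLM implementation:

```python
def dequant_weights(q_weight: list, channel_scales: list) -> list:
    """Recover float weights in [out, in] layout: w[o][i] = q[o][i] * scale[o]."""
    return [
        [q * s for q in row]
        for row, s in zip(q_weight, channel_scales)
    ]

# Two output channels, three input features
w = dequant_weights([[1, -2, 3], [4, 0, -1]], [0.5, 2.0])
print(w)  # [[0.5, -1.0, 1.5], [8.0, 0.0, -2.0]]
```

The [out, in] convention matches how linear-layer weights are stored (one row per output feature), which is why tests use this helper to compare quantized layers against a float reference.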
get_fp8_min_max ¶
Get the min and max values for FP8 quantization.
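For `float8_e4m3fn` (the FP8 variant commonly used for quantized weights and activations), the range is symmetric and can be derived from the format itself; with PyTorch available, `torch.finfo(torch.float8_e4m3fn).max` gives the same number. A dependency-free sketch:

```python
def fp8_e4m3fn_max() -> float:
    """Largest finite float8_e4m3fn value (format has no inf; S.1111.111 is NaN).

    The max finite encoding is exponent field 15, mantissa 110:
    2**(15 - bias) * (1 + 6/8) with bias = 7, i.e. 256 * 1.75 = 448.0.
    """
    exp_bias = 7
    return 2.0 ** (15 - exp_bias) * (1 + 6 / 8)

def get_fp8_min_max() -> tuple:
    """FP8 quantization clamps to the symmetric range [-max, +max]."""
    fmax = fp8_e4m3fn_max()
    return (-fmax, fmax)

print(get_fp8_min_max())  # (-448.0, 448.0)
```

Clamping to this range before the dtype cast avoids producing NaNs, since e4m3fn has no infinity to absorb overflow.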
prep_scale_for_group_broadcast ¶
prep_scale_for_group_broadcast(
scale: Tensor, x: Tensor, group_shape: GroupShape | None
) -> Tensor
Prepare the input quantization scale for group broadcasting.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `scale` | `Tensor` | The scale tensor (scalar or 1D). | *required* |
| `x` | `Tensor` | Target tensor whose shape determines broadcast dimensions. | *required* |
| `group_shape` | `GroupShape \| None` | GroupShape to broadcast over. | *required* |
Returns:
| Type | Description |
|---|---|
| `Tensor` | scale reshaped for correct broadcasting. |
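Conceptually, the scale must be expanded so that element (i, j) of `x` sees the scale of its own group. The real function reshapes a `Tensor` so that broadcasting happens implicitly; this sketch materializes the expansion on nested lists just to show the index mapping:

```python
def broadcast_scale_2d(scale: list, x_shape: tuple, group_shape: tuple) -> list:
    """Expand a per-group scale to the full shape of x.

    scale[i][j] covers the block x[i*gr:(i+1)*gr, j*gc:(j+1)*gc];
    a group dim of -1 spans the whole corresponding dim of x.
    """
    m, n = x_shape
    gr, gc = group_shape
    gr = m if gr == -1 else gr
    gc = n if gc == -1 else gc
    return [[scale[i // gr][j // gc] for j in range(n)] for i in range(m)]

# Per-token scales for a (2, 4) tensor: group shape (1, -1)
full = broadcast_scale_2d([[0.5], [2.0]], (2, 4), (1, -1))
print(full)  # [[0.5, 0.5, 0.5, 0.5], [2.0, 2.0, 2.0, 2.0]]
```

In the tensor version, inserting singleton dimensions (e.g. reshaping a per-token scale to shape `(M, 1)`) lets the framework's broadcasting rules do this expansion without copying data.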
scaled_quantize ¶
scaled_quantize(
    x: Tensor,
    group_shape: GroupShape,
    quant_dtype: dtype,
    compute_dtype: dtype | None = None,
) -> tuple[Tensor, Tensor]
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `x` | `Tensor` | Input tensor to quantize | *required* |
| `group_shape` | `GroupShape` | Shape of quantization groups | *required* |
| `quant_dtype` | `dtype` | Target quantized dtype (e.g., `torch.float8_e4m3fn`) | *required* |
| `compute_dtype` | `dtype \| None` | Optional dtype for intermediate computations. If None, uses input dtype. Use `torch.float32` for higher precision. | `None` |
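The real function quantizes a `Tensor` to dtypes such as `torch.float8_e4m3fn` and returns a `(quantized, scales)` pair. A dependency-free sketch of the same idea — symmetric per-group max-abs scaling — using int8 (qmax = 127) on a flat list:

```python
def scaled_quantize_1d(x: list, group_size: int, qmax: int = 127) -> tuple:
    """Symmetric per-group quantization of a flat list (int8 stand-in for FP8).

    For each group: scale = max(|x|) / qmax, q = clamp(round(x / scale)).
    Returns (quantized values, one scale per group).
    """
    q, scales = [], []
    for start in range(0, len(x), group_size):
        group = x[start:start + group_size]
        amax = max(abs(v) for v in group) or 1.0  # guard all-zero groups
        scale = amax / qmax
        scales.append(scale)
        q.extend(max(-qmax, min(qmax, round(v / scale))) for v in group)
    return q, scales

q, s = scaled_quantize_1d([1.0, 0.25, -4.0, 3.0], group_size=2)
print(q)  # [127, 32, -127, 95]
```

Dequantizing (`q[i] * scale`) recovers each value to within half a quantization step; smaller groups track local dynamic range better at the cost of storing more scales.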