vllm.model_executor.layers.fused_moe.flashinfer_trtllm_moe ¶
_supports_activation ¶
_supports_parallel_config ¶
_supports_parallel_config(
moe_parallel_config: FusedMoEParallelConfig,
) -> bool
The TRTLLM kernel does not support EPLB.
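A minimal sketch of what this check amounts to, assuming FusedMoEParallelConfig exposes an EPLB flag; the enable_eplb attribute name and the import path are assumptions for illustration, not the actual implementation.

```python
# Sketch only: the `enable_eplb` attribute name and the import path are
# assumptions; only the rule "TRTLLM rejects EPLB" comes from the docstring.
from vllm.model_executor.layers.fused_moe.config import FusedMoEParallelConfig


def _supports_parallel_config_sketch(
    moe_parallel_config: FusedMoEParallelConfig,
) -> bool:
    # The TRTLLM kernel cannot run with expert parallel load balancing (EPLB),
    # so any parallel config that enables it is rejected.
    return not getattr(moe_parallel_config, "enable_eplb", False)
```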
_supports_quant_scheme ¶
Supports FP8 per-tensor and FP8 block quantization.
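A hedged sketch of the weight-dtype side of this check. The parameter names mirror _supports_routing_method above and the QuantKey dtype attribute is assumed; the per-tensor versus block scale distinction is intentionally omitted.

```python
# Sketch only: parameter names and the QuantKey `dtype` attribute are
# assumptions; per-tensor vs. block scale handling is omitted here.
import torch


def _supports_quant_scheme_sketch(weight_key, activation_key) -> bool:
    # The TRTLLM FP8 path requires FP8 (e4m3) weights; unquantized BF16
    # weights are handled by the separate BF16 helper documented below.
    return weight_key is not None and weight_key.dtype == torch.float8_e4m3fn
```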
_supports_router_logits_dtype ¶
_supports_router_logits_dtype(
router_logits_dtype: dtype | None,
routing_method: RoutingMethodType,
) -> bool
The FlashInfer TRTLLM FP8 kernel expects bfloat16 router_logits by default. Only DeepSeekV3 routing supports float32 router_logits (which is converted internally in the kernel).
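A sketch of the dtype rule described above. The flashinfer import path and the treatment of None as "use the bfloat16 default" are assumptions; the bfloat16/float32 rule itself comes from the docstring.

```python
# Sketch of the dtype rule above. The RoutingMethodType import path and the
# handling of None (treated as "use the bfloat16 default") are assumptions.
import torch
from flashinfer.fused_moe import RoutingMethodType


def _supports_router_logits_dtype_sketch(
    router_logits_dtype: torch.dtype | None,
    routing_method: RoutingMethodType,
) -> bool:
    # bfloat16 router logits are what the FP8 kernel expects by default.
    if router_logits_dtype is None or router_logits_dtype == torch.bfloat16:
        return True
    # float32 router logits are only accepted for DeepSeekV3 routing, where
    # the kernel converts them internally.
    return (
        router_logits_dtype == torch.float32
        and routing_method == RoutingMethodType.DeepSeekV3
    )
```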
_supports_routing_method ¶
_supports_routing_method(
weight_key: QuantKey | None,
activation_key: QuantKey | None,
routing_method: RoutingMethodType,
) -> bool
Monolithic kernels need to express which routing methods they support.
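One way a monolithic kernel can express this is a whitelist of routing methods; the specific set listed below is illustrative only and is not the kernel's real support matrix.

```python
# Illustrative whitelist only: the routing methods listed here are an
# assumption and do not reflect the kernel's actual support matrix.
from flashinfer.fused_moe import RoutingMethodType


def _supports_routing_method_sketch(
    weight_key, activation_key, routing_method: RoutingMethodType
) -> bool:
    # A monolithic kernel implements routing internally, so it must declare
    # which routing methods it can handle and reject everything else.
    supported = {
        RoutingMethodType.Renormalize,
        RoutingMethodType.DeepSeekV3,
    }
    return routing_method in supported
```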
is_supported_config_trtllm_bf16 ¶
is_supported_config_trtllm_bf16(
moe_config: FusedMoEConfig,
activation_format: FusedMoEActivationFormat,
) -> tuple[bool, str | None]
This method mirrors mk.FusedMoEPermuteExpertsUnpermute.is_supported_config for BF16 unquantized kernels.
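Because the helper returns a (supported, reason) tuple, a caller can surface why the TRTLLM BF16 path was rejected. The dispatcher below is a hypothetical usage example; only the function name and return convention come from the documented signature.

```python
# Hypothetical caller: only the (bool, str | None) return convention and the
# function name come from the documented signature above.
from vllm.model_executor.layers.fused_moe.flashinfer_trtllm_moe import (
    is_supported_config_trtllm_bf16,
)


def try_select_trtllm_bf16(moe_config, activation_format) -> bool:
    supported, reason = is_supported_config_trtllm_bf16(
        moe_config, activation_format
    )
    if not supported:
        print(f"Skipping TRTLLM BF16 MoE kernel: {reason}")
    return supported
```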
is_supported_config_trtllm_fp8 ¶
is_supported_config_trtllm_fp8(
moe_config: FusedMoEConfig,
weight_key: QuantKey | None,
activation_key: QuantKey | None,
activation_format: FusedMoEActivationFormat,
) -> tuple[bool, str | None]
This method mirrors mk.FusedMoEPermuteExpertsUnpermute.is_supported_config for FP8 quantized kernels.
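A sketch of how the FP8 variant can back a kernel class's is_supported_config hook, mirroring mk.FusedMoEPermuteExpertsUnpermute. The wrapper class and its attributes are hypothetical; only the helper's argument order comes from the documented signature.

```python
# Hypothetical wrapper: the class and attributes are assumptions; only the
# helper's argument order comes from the documented signature above.
from vllm.model_executor.layers.fused_moe.flashinfer_trtllm_moe import (
    is_supported_config_trtllm_fp8,
)


class TrtllmFp8ExpertsSketch:
    weight_key = None       # would hold the kernel's weight QuantKey
    activation_key = None   # would hold the kernel's activation QuantKey

    @classmethod
    def is_supported_config(cls, moe_config, activation_format):
        # Delegate the support decision to the module-level FP8 check.
        return is_supported_config_trtllm_fp8(
            moe_config, cls.weight_key, cls.activation_key, activation_format
        )
```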