vllm.model_executor.layers.fused_moe.moe_permute_unpermute ¶
moe_permute ¶
moe_permute(
hidden_states: Tensor,
a1q_scale: Tensor | None,
topk_ids: Tensor,
n_expert: int,
n_local_expert: int = -1,
expert_map: Tensor | None = None,
permuted_hidden_states: Tensor | None = None,
) -> tuple[Tensor, Tensor | None, Tensor, Tensor, Tensor]
This function expands and permutes the activations so that the tokens routed to each expert, which are non-contiguous in the input, are gathered into contiguous per-expert groups.

Parameters:
- hidden_states (torch.Tensor): The input tensor to the MoE layer.
- a1q_scale (Optional[torch.Tensor]): Quantization scale for hidden_states.
- topk_ids (torch.Tensor): Top-k expert route ids for each token.
- n_expert (int): The total number of experts.
- n_local_expert (int): The number of experts on the current EP rank.
- expert_map (Optional[torch.Tensor]): A tensor mapping expert indices from the global expert space to the local expert space of the expert parallel shard.
- permuted_hidden_states (Optional[torch.Tensor]): Optional output tensor. If None, the output tensor is allocated inside this function.

Returns:
- permuted_hidden_states (torch.Tensor): The permuted activations.
- a1q_scale (Optional[torch.Tensor]): The permuted quantization scale for hidden_states, if the original scale is not per-tensor.
- expert_first_token_offset (torch.Tensor): Offset of the first token of each expert, for standard grouped GEMM.
- inv_permuted_idx (torch.Tensor): Index map for moe_unpermute.
- permuted_idx (torch.Tensor): Index map from hidden_states to permuted_hidden_states.
Source code in vllm/model_executor/layers/fused_moe/moe_permute_unpermute.py
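As a rough illustration (not taken from the vLLM docs), here is a minimal sketch of calling moe_permute on a single rank. It assumes a CUDA build of vLLM with the permute/unpermute custom ops available; the shapes, dtypes, and sizes are illustrative assumptions.

```python
import torch

from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
    moe_permute,
)

n_token, hidden, n_expert, topk = 16, 128, 8, 2
hidden_states = torch.randn(n_token, hidden, dtype=torch.float16, device="cuda")
# Top-k expert ids per token, as a router would produce them
# (int32 here; the exact dtype requirement may vary by build).
topk_ids = torch.randint(0, n_expert, (n_token, topk),
                         dtype=torch.int32, device="cuda")

(permuted_hidden_states,     # tokens grouped contiguously per expert
 a1q_scale,                  # None here: no quantization scale passed in
 expert_first_token_offset,  # per-expert slice boundaries for grouped GEMM
 inv_permuted_idx,           # index map consumed later by moe_unpermute
 permuted_idx) = moe_permute(
    hidden_states,
    a1q_scale=None,
    topk_ids=topk_ids,
    n_expert=n_expert,  # single rank: no expert_map / n_local_expert needed
)
```

After this call, all tokens routed to a given expert occupy one contiguous slice of permuted_hidden_states, delimited by consecutive entries of expert_first_token_offset, which is the layout a standard grouped GEMM expects.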
moe_unpermute ¶
moe_unpermute(
out: Tensor,
permuted_hidden_states: Tensor,
topk_weights: Tensor,
inv_permuted_idx: Tensor,
expert_first_token_offset: Tensor | None = None,
) -> None
This function performs the inverse of moe_permute: it unpermutes the per-expert activations back to the original token order and reduces each token's top-k expert outputs into a single activation, weighted by topk_weights.

Parameters:
- out (torch.Tensor): The output tensor.
- permuted_hidden_states (torch.Tensor): The permuted activations.
- topk_weights (torch.Tensor): Top-k expert route weights for each token.
- inv_permuted_idx (torch.Tensor): Row index map for moe_unpermute, as returned by moe_permute.
- expert_first_token_offset (Optional[torch.Tensor]): Offset of the first token of each expert, for grouped GEMM.

Returns:
- None. The reduced and unpermuted activation tensor is written in place into out.
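Continuing the sketch above and reusing its tensors, the reverse step might look like the following; expert_out is a hypothetical stand-in for the grouped-GEMM output over permuted_hidden_states, and the result is written in place into out, matching the None return type.

```python
from vllm.model_executor.layers.fused_moe.moe_permute_unpermute import (
    moe_unpermute,
)

# Per-token routing weights matching topk_ids, normalized across the top-k.
topk_weights = torch.softmax(torch.randn(n_token, topk, device="cuda"), dim=-1)

# In a real MoE layer this would be the per-expert grouped-GEMM result;
# reusing the permuted activations keeps the sketch self-contained.
expert_out = permuted_hidden_states

out = torch.empty_like(hidden_states)
moe_unpermute(
    out,
    expert_out,
    topk_weights,
    inv_permuted_idx,
    expert_first_token_offset,
)
# out now holds, for each token, the weighted sum of its top-k expert
# outputs, restored to the original token order.
```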