vllm.v1.engine

Modules:

Name Description
async_llm
coordinator
core
core_client
detokenizer
exceptions
input_processor
llm_engine
logprobs
output_processor
parallel_sampling
utils

EngineCoreEvent

Bases: Struct

A timestamped engine core event associated with a request.

The timestamp is a monotonic timestamp used by the engine frontend to calculate intervals between engine core events. These timestamps should not be compared with timestamps from other processes.

Source code in vllm/v1/engine/__init__.py
class EngineCoreEvent(msgspec.Struct):
    """A timestamped engine core event associated with a request.

    The timestamp is a monotonic timestamp used by the engine
    frontend to calculate intervals between engine core events. These
    timestamps should not be compared with timestamps from other processes.
    """

    type: EngineCoreEventType
    timestamp: float

    @classmethod
    def new_event(
        cls, event_type: EngineCoreEventType, timestamp: float | None = None
    ) -> "EngineCoreEvent":
        timestamp = time.monotonic() if timestamp is None else timestamp
        return cls(event_type, timestamp)
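The pattern above can be illustrated with a minimal, stdlib-only sketch (a `dataclass` stands in for `msgspec.Struct`): events default to `time.monotonic()` timestamps, and the frontend subtracts them to measure how long a request spent in each state.

```python
import enum
import time
from dataclasses import dataclass
from typing import Optional


class EngineCoreEventType(enum.IntEnum):
    QUEUED = 1
    SCHEDULED = 2
    PREEMPTED = 3


@dataclass
class EngineCoreEvent:
    type: EngineCoreEventType
    timestamp: float

    @classmethod
    def new_event(
        cls,
        event_type: EngineCoreEventType,
        timestamp: Optional[float] = None,
    ) -> "EngineCoreEvent":
        # Default to a monotonic clock; only meaningful within one process.
        timestamp = time.monotonic() if timestamp is None else timestamp
        return cls(event_type, timestamp)


# A request is queued, then later scheduled; the interval between the two
# monotonic timestamps is the time spent waiting in the queue.
queued = EngineCoreEvent.new_event(EngineCoreEventType.QUEUED)
scheduled = EngineCoreEvent.new_event(EngineCoreEventType.SCHEDULED)
queue_interval = scheduled.timestamp - queued.timestamp
assert queue_interval >= 0.0
```

Note this is an illustrative stand-in, not the vllm class itself; the real struct is a `msgspec.Struct` so it serializes efficiently between the engine core process and the frontend.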

EngineCoreEventType

Bases: IntEnum

The type of engine core request event.

Source code in vllm/v1/engine/__init__.py
class EngineCoreEventType(enum.IntEnum):
    """The type of engine core request event."""

    QUEUED = 1
    SCHEDULED = 2
    PREEMPTED = 3

EngineCoreRequest

Bases: Struct

Source code in vllm/v1/engine/__init__.py
class EngineCoreRequest(
    msgspec.Struct,
    array_like=True,  # type: ignore[call-arg]
    omit_defaults=True,  # type: ignore[call-arg]
    gc=False,
):  # type: ignore[call-arg]
    request_id: str
    prompt_token_ids: list[int] | None
    mm_features: list[MultiModalFeatureSpec] | None
    sampling_params: SamplingParams | None
    pooling_params: PoolingParams | None
    eos_token_id: int | None
    arrival_time: float
    lora_request: LoRARequest | None
    cache_salt: str | None
    data_parallel_rank: int | None
    prompt_embeds: torch.Tensor | None = None

    # Index of the client, used to ensure outputs are sent back to the same
    # client for this request when scaling out the front-end.
    client_index: int = 0

    # Used in DP case to indicate which wave of requests this is expected to
    # belong to, to cover a race condition where the request is sent before
    # a wave finished notification is received.
    current_wave: int = 0
    priority: int = 0

    trace_headers: Mapping[str, str] | None = None
    resumable: bool = False

    # The user-provided request ID. This field is set internally,
    # copied from the provided request_id that's originally assigned
    # to the request_id field, see InputProcessor.assign_request_id().
    # Used in outputs and to support abort(req_id, internal=False).
    external_req_id: str | None = None

    reasoning_ended: bool | None = None

    @property
    def params(self) -> SamplingParams | PoolingParams:
        """Return the processed params (sampling or pooling)."""
        if self.sampling_params is not None:
            return self.sampling_params
        assert self.pooling_params is not None
        return self.pooling_params

params property

Return the processed params (sampling or pooling).
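A request carries either sampling params (generation) or pooling params (embedding-style workloads), and `params` returns whichever is set. A minimal sketch of that either/or accessor, with hypothetical stand-in classes in place of vllm's `SamplingParams` and `PoolingParams`:

```python
from dataclasses import dataclass
from typing import Optional, Union


@dataclass
class SamplingParams:  # stand-in for vllm's SamplingParams
    temperature: float = 1.0


@dataclass
class PoolingParams:  # stand-in for vllm's PoolingParams
    dimensions: Optional[int] = None


@dataclass
class Request:  # illustrative stand-in for EngineCoreRequest
    sampling_params: Optional[SamplingParams] = None
    pooling_params: Optional[PoolingParams] = None

    @property
    def params(self) -> Union[SamplingParams, PoolingParams]:
        # Exactly one of the two is expected to be set per request.
        if self.sampling_params is not None:
            return self.sampling_params
        assert self.pooling_params is not None
        return self.pooling_params


gen = Request(sampling_params=SamplingParams(temperature=0.7))
emb = Request(pooling_params=PoolingParams(dimensions=256))
assert isinstance(gen.params, SamplingParams)
assert isinstance(emb.params, PoolingParams)
```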

EngineCoreRequestType

Bases: Enum

Request types defined as single byte strings, so they can be sent over sockets without a separate encoding step.

Source code in vllm/v1/engine/__init__.py
class EngineCoreRequestType(enum.Enum):
    """
    Request types defined as single byte strings, so they can be sent over
    sockets without a separate encoding step.
    
    """

    ADD = b"\x00"
    ABORT = b"\x01"
    START_DP_WAVE = b"\x02"
    UTILITY = b"\x03"
    # Sentinel used within EngineCoreProc.
    EXECUTOR_FAILED = b"\x04"
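Because each member's value is already a one-byte string, it can be used directly as a frame prefix on the wire. A sketch of that round trip (the JSON payload here is a placeholder; vllm actually sends msgspec-encoded bodies):

```python
import enum


class EngineCoreRequestType(enum.Enum):
    ADD = b"\x00"
    ABORT = b"\x01"
    START_DP_WAVE = b"\x02"
    UTILITY = b"\x03"
    EXECUTOR_FAILED = b"\x04"


# Sender side: the byte value prefixes the message body directly,
# with no separate encoding step for the type tag.
payload = b'{"request_id": "abc"}'  # placeholder body
frame = EngineCoreRequestType.ADD.value + payload

# Receiver side: the first byte identifies the request type.
req_type = EngineCoreRequestType(frame[:1])
assert req_type is EngineCoreRequestType.ADD
assert frame[1:] == payload
```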

FinishReason

Bases: IntEnum

Reason a request finished - stop, length, abort, or error.

Int rather than Str for more compact serialization.

stop - a stop string was emitted
length - max_tokens was consumed, or max_model_len was reached
abort - aborted by client
error - retryable request-level internal error (e.g., KV load failure). Invariant: always converted to 500 Internal Server Error.

Source code in vllm/v1/engine/__init__.py
class FinishReason(enum.IntEnum):
    """
    Reason a request finished - stop, length, abort, or error.

    Int rather than Str for more compact serialization.

    stop - a stop string was emitted
    length - max_tokens was consumed, or max_model_len was reached
    abort - aborted by client
    error - retryable request-level internal error (e.g., KV load failure).
            Invariant: always converted to 500 Internal Server Error.

    """

    STOP = 0
    LENGTH = 1
    ABORT = 2
    ERROR = 3

    def __str__(self):
        return FINISH_REASON_STRINGS[self.value]
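The int/string duality can be sketched as follows. `FINISH_REASON_STRINGS` lives alongside `FinishReason` in `vllm/v1/engine/__init__.py`; its exact contents are assumed here from the member names and docstring:

```python
import enum

# Assumed string table, indexed by the enum's int value.
FINISH_REASON_STRINGS = ("stop", "length", "abort", "error")


class FinishReason(enum.IntEnum):
    STOP = 0
    LENGTH = 1
    ABORT = 2
    ERROR = 3

    def __str__(self):
        # Compact int on the wire, readable string in API responses.
        return FINISH_REASON_STRINGS[self.value]


assert str(FinishReason.STOP) == "stop"
assert int(FinishReason.LENGTH) == 1
```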

ReconfigureRankType

Bases: IntEnum

Rank type for reconfiguring distributed request.

Source code in vllm/v1/engine/__init__.py
class ReconfigureRankType(enum.IntEnum):
    """
    Rank type for reconfiguring distributed request.
    """

    KEEP_CURRENT_RANK = -1
    SHUTDOWN_CURRENT_RANK = -2
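Negative values leave non-negative integers free to mean a concrete target rank. A hypothetical helper (`resolve_rank` is not part of vllm) showing how the two sentinels might be interpreted during a reconfiguration:

```python
import enum
from typing import Optional


class ReconfigureRankType(enum.IntEnum):
    KEEP_CURRENT_RANK = -1
    SHUTDOWN_CURRENT_RANK = -2


def resolve_rank(requested: int, current: int) -> Optional[int]:
    """Hypothetical helper: interpret sentinel ranks in a reconfigure request.

    Returns the effective rank, or None if this rank should shut down.
    """
    if requested == ReconfigureRankType.KEEP_CURRENT_RANK:
        return current
    if requested == ReconfigureRankType.SHUTDOWN_CURRENT_RANK:
        return None
    return requested  # a non-negative value is an explicit target rank


assert resolve_rank(ReconfigureRankType.KEEP_CURRENT_RANK, 3) == 3
assert resolve_rank(ReconfigureRankType.SHUTDOWN_CURRENT_RANK, 3) is None
assert resolve_rank(5, 3) == 5
```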