vllm.v1.attention.ops.triton_unified_attention ¶
_get_tile_size ¶
Select tile size with Gemma3-specific optimization.
For Gemma3, use a tile size of 32 for both prefill and decode to better utilize the larger head dimensions (128/256). For other models, use the default vLLM behavior.
Source code in vllm/v1/attention/ops/triton_unified_attention.py
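A minimal sketch of the selection logic described above. The parameter names, the `is_prefill` flag, and the non-Gemma3 default tile sizes are assumptions for illustration; the actual signature and defaults in `triton_unified_attention.py` may differ.

```python
def _get_tile_size(head_size: int, sliding_window: int, is_prefill: bool) -> int:
    """Select tile size, with a Gemma3-specific override (sketch)."""
    # Gemma3 heuristic (see _is_gemma3_attention below): sliding_window=1024
    # paired with head_size 128 or 256.
    if sliding_window == 1024 and head_size in (128, 256):
        # Gemma3: 32 for both prefill and decode better utilizes the
        # larger head dimensions (128/256).
        return 32
    # Other models: fall through to the default vLLM behavior.
    # The values below are placeholders, not the actual defaults.
    return 64 if is_prefill else 16
```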
_is_gemma3_attention ¶
Detect Gemma3 models via their unique (head_size, sliding_window) signature.
Gemma3 models are the only ones that use sliding_window=1024 with head_size 128 (27B) or 256 (1B, 4B, 12B). Other sliding-window attention (SWA) models use different window sizes (Mistral=4096, Phi-3=2047).
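A minimal sketch of this detection heuristic, assuming the function receives `head_size` and `sliding_window` directly; the real function may take them from a config or kernel metadata instead.

```python
def _is_gemma3_attention(head_size: int, sliding_window: int) -> bool:
    """Heuristic Gemma3 check (sketch)."""
    # Gemma3 is the only family pairing sliding_window=1024 with
    # head_size 128 (27B) or 256 (1B/4B/12B); other SWA models use
    # different window sizes (Mistral=4096, Phi-3=2047).
    return sliding_window == 1024 and head_size in (128, 256)
```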