vllm.transformers_utils.configs.radio ¶
Radio vision model configuration
RadioConfig ¶
Bases: PretrainedConfig
This is the configuration class to store the configuration of a Radio vision model. It is used to instantiate a Radio model according to the specified arguments, defining the model architecture.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_name | str | Name of the vision transformer model (e.g., "vit_base_patch16_224"), used to determine the architecture dimensions. | required |
| image_size | int | The size (resolution) of each image. | 224 |
| patch_size | int | The size (resolution) of each patch. | 16 |
| qkv_bias | bool | Whether to add a bias to the queries, keys and values. | True |
| qk_normalization | bool | Whether to apply normalization to queries and keys. | False |
| norm_type | str | The normalization type to use. | 'layer_norm' |
| layer_norm_eps | float | The epsilon used by the layer normalization layers. | 1e-06 |
| initializer_factor | float | A factor for initializing all weight matrices. | 1.0 |
| hidden_act | str | The non-linear activation function in the encoder. | 'gelu' |
| cpe_max_size | int | Maximum image size for position embeddings. | 2048 |
| norm_mean | tuple[float, float, float] \| list | Mean values for image normalization (RGB channels). Defaults to (0.48145466, 0.4578275, 0.40821073). | OPENAI_CLIP_MEAN |
| norm_std | tuple[float, float, float] \| list | Standard deviation values for image normalization (RGB channels). Defaults to (0.26862954, 0.26130258, 0.27577711). | OPENAI_CLIP_STD |
| register_multiple | int \| None | Number of register tokens to use. | None |
| teachers | list[dict[str, Any]] \| None | A list of teacher model configurations. Each teacher configuration is a dict with keys such as "name" and, optionally, "use_summary". | None |
| cls_token_per_teacher | bool | Whether to use a separate CLS token for each teacher. | False |
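The parameters above can be illustrated with a minimal sketch. This is not vLLM's implementation: it builds a plain dict mirroring the documented defaults and shows how image_size and patch_size determine the patch-token grid (the quantity the position embeddings, bounded by cpe_max_size, must cover).

```python
# Illustrative sketch only (not vLLM code): the documented RadioConfig
# parameters with their default values, as a plain dict.
radio_config = {
    "model_name": "vit_base_patch16_224",  # example value from the docs
    "image_size": 224,
    "patch_size": 16,
    "qkv_bias": True,
    "qk_normalization": False,
    "norm_type": "layer_norm",
    "layer_norm_eps": 1e-06,
    "initializer_factor": 1.0,
    "hidden_act": "gelu",
    "cpe_max_size": 2048,
    "norm_mean": (0.48145466, 0.4578275, 0.40821073),  # OPENAI_CLIP_MEAN
    "norm_std": (0.26862954, 0.26130258, 0.27577711),  # OPENAI_CLIP_STD
    "register_multiple": None,
    "teachers": None,
    "cls_token_per_teacher": False,
}

# A 224x224 image split into 16x16 patches yields a 14x14 grid,
# i.e. 196 patch tokens per image.
patches_per_side = radio_config["image_size"] // radio_config["patch_size"]
num_patches = patches_per_side ** 2
print(patches_per_side, num_patches)  # 14 196
```

With the defaults, position embeddings sized for cpe_max_size=2048 cover any input up to a 128x128 patch grid, well beyond the 14x14 grid of a 224x224 image.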