vllm.model_executor.layers.quantization.turboquant ¶
TurboQuant: Near-optimal KV-cache quantization for vLLM.
PolarQuant compression: random rotation + per-coordinate Lloyd-Max scalar quantization for keys, uniform quantization for values.
Reference: "TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate" (ICLR 2026), Zandieh et al.
Modules:
| Name | Description |
|---|---|
centroids | Lloyd-Max optimal scalar quantizer for TurboQuant. |
config | TurboQuant configuration. |
quantizer | TurboQuant quantizer utilities. |
TurboQuantConfig dataclass ¶
Configuration for TurboQuant KV-cache quantization.
Uses PolarQuant (WHT rotation + Lloyd-Max scalar quantization) for keys and uniform quantization for values. QJL is intentionally omitted: community consensus (5+ independent groups) found that it hurts attention quality by amplifying variance through softmax.
Named presets (use via --kv-cache-dtype):

| Preset | Scheme | Compression | PPL impact |
|---|---|---|---|
| turboquant_k8v4 | FP8 keys + 4-bit values | 2.6x | +1.17% |
| turboquant_4bit_nc | 4-bit MSE keys + 4-bit values + NC | 3.8x | +2.71% |
| turboquant_k3v4_nc | 3-bit MSE keys + 4-bit values + NC | ~3.5x | +10.63% |
| turboquant_3bit_nc | 3-bit MSE keys + 3-bit values + NC | 4.9x | +20.59% |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
head_dim | int | Attention head dimension (e.g. 64, 96, 128). | 128 |
key_quant_bits | int | Bits for key quantization. 8 = FP8 keys (no rotation/MSE). 3-4 = Lloyd-Max MSE quantized keys. | 3 |
value_quant_bits | int | Bits per value dimension for uniform quantization. 3 = 8 levels, 4 = 16 levels (default). | 4 |
seed | int | Base seed for deterministic random matrix generation. Actual seed per layer = seed + layer_idx * 1337. | 42 |
norm_correction | bool | Re-normalize centroid vectors to unit norm before inverse rotation during dequant. Fixes quantization-induced norm distortion, improving PPL by ~0.8% at 4-bit. | False |
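The per-layer seed derivation described in the table above can be sketched as follows (an illustrative snippet, not the actual vLLM implementation):

```python
# Illustrative sketch of the documented per-layer seed rule:
# actual seed per layer = seed + layer_idx * 1337.
def layer_seed(base_seed: int, layer_idx: int) -> int:
    """Derive the deterministic random-matrix seed for one layer."""
    return base_seed + layer_idx * 1337

# With the default base seed of 42:
print(layer_seed(42, 0))  # 42
print(layer_seed(42, 2))  # 2716
```

Because the seed is a pure function of the base seed and the layer index, the rotation matrices are reproducible across runs without storing any per-layer state.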
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
effective_value_quant_bits property ¶
effective_value_quant_bits: int
Actual bits used for value storage.
key_mse_bits property ¶
key_mse_bits: int
MSE bits actually used for key quantization (0 if FP8 keys).
key_packed_size property ¶
key_packed_size: int
Packed bytes for a single KEY vector.
FP8 mode (key_quant_bits=8): head_dim bytes (1 byte per element, no overhead).
TQ mode (key_quant_bits = 3 or 4):
- MSE indices: ceil(head_dim * key_mse_bits / 8) bytes
- vec_norm: 2 bytes (float16)
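The two layouts above reduce to simple arithmetic. A minimal sketch of the documented formula (a standalone re-derivation, not the real `TurboQuantConfig` property):

```python
import math

# Sketch of the key_packed_size formula documented above.
def key_packed_size(head_dim: int, key_quant_bits: int) -> int:
    if key_quant_bits == 8:
        # FP8 mode: 1 byte per element, no metadata overhead.
        return head_dim
    # TQ mode: packed Lloyd-Max MSE indices + a float16 vector norm.
    mse_bytes = math.ceil(head_dim * key_quant_bits / 8)
    vec_norm_bytes = 2  # float16
    return mse_bytes + vec_norm_bytes

print(key_packed_size(128, 8))  # FP8: 128 bytes
print(key_packed_size(128, 4))  # 4-bit TQ: 64 + 2 = 66 bytes
print(key_packed_size(128, 3))  # 3-bit TQ: 48 + 2 = 50 bytes
```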
mse_bits property ¶
mse_bits: int
MSE quantizer bit-width (determines centroid count: 2^mse_bits).
For MSE key modes, equals key_quant_bits. For FP8 key mode, falls back to value_quant_bits (centroids are still needed for continuation-prefill dequant and decode kernel params).
slot_size property ¶
slot_size: int
Total packed bytes per head per position (key + value combined).
Layout: [key_packed | value_packed]
slot_size_aligned property ¶
slot_size_aligned: int
Slot size rounded up to next even number.
An even size is required so that effective_head_size = slot_size_aligned // 2 is an integer.
value_packed_size property ¶
value_packed_size: int
Packed bytes for a single VALUE vector.
Uniform quantization: ceil(head_dim * bits / 8) + 4 bytes (scale + zero fp16).
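Putting the value formula together with slot_size and slot_size_aligned above, the per-slot byte budget can be sketched as (an illustrative re-derivation of the documented formulas, not the real properties):

```python
import math

# Uniform value quantization: packed indices + 4 bytes of fp16 scale/zero.
def value_packed_size(head_dim: int, value_quant_bits: int) -> int:
    return math.ceil(head_dim * value_quant_bits / 8) + 4

# Per head per position: [key_packed | value_packed].
def slot_size(key_bytes: int, value_bytes: int) -> int:
    return key_bytes + value_bytes

# Round up to the next even number so effective_head_size = size // 2
# stays integral.
def slot_size_aligned(size: int) -> int:
    return size + (size % 2)

# head_dim=128, 3-bit TQ keys (48 + 2 = 50 B), 4-bit values:
v = value_packed_size(128, 4)        # 64 + 4 = 68
s = slot_size(50, v)                 # 118
print(v, s, slot_size_aligned(s))    # 68 118 118
print(slot_size_aligned(117))        # an odd size rounds up: 118
```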
from_cache_dtype staticmethod ¶
from_cache_dtype(
cache_dtype: str, head_dim: int
) -> TurboQuantConfig
Create config from a named preset.
Valid presets: turboquant_k8v4, turboquant_4bit_nc, etc.
Source code in vllm/model_executor/layers/quantization/turboquant/config.py
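The preset-to-config mapping implied by the table at the top of this page can be sketched as follows. This is a hypothetical standalone helper, not vLLM's actual parsing code, and it assumes the `_nc` suffix corresponds to norm_correction=True:

```python
# Hypothetical mapping of the documented preset names to the config
# fields they imply: (key_quant_bits, value_quant_bits, norm_correction).
PRESETS = {
    "turboquant_k8v4":    (8, 4, False),  # FP8 keys + 4-bit values
    "turboquant_4bit_nc": (4, 4, True),   # 4-bit MSE keys + 4-bit values + NC
    "turboquant_k3v4_nc": (3, 4, True),   # 3-bit MSE keys + 4-bit values + NC
    "turboquant_3bit_nc": (3, 3, True),   # 3-bit MSE keys + 3-bit values + NC
}

def parse_preset(cache_dtype: str) -> tuple[int, int, bool]:
    """Resolve a named preset to its quantization settings."""
    if cache_dtype not in PRESETS:
        raise ValueError(f"unknown TurboQuant preset: {cache_dtype!r}")
    return PRESETS[cache_dtype]

print(parse_preset("turboquant_k8v4"))  # (8, 4, False)
```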
get_boundary_skip_layers staticmethod ¶
Get layer indices to skip TQ compression (boundary protection).
Returns first N and last N layer indices as strings, suitable for kv_cache_dtype_skip_layers.
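The behavior described above can be sketched as follows; the parameter names and the default boundary width here are assumptions for illustration, not the method's real signature:

```python
# Hedged sketch of boundary protection: skip TQ compression on the first
# and last N layers, returning indices as strings for
# kv_cache_dtype_skip_layers. `n` and its default are assumptions.
def get_boundary_skip_layers(num_layers: int, n: int = 2) -> list[str]:
    skip: list[int] = []
    for idx in list(range(n)) + list(range(num_layers - n, num_layers)):
        if 0 <= idx < num_layers and idx not in skip:
            skip.append(idx)  # dedupe handles very shallow models
    return [str(i) for i in skip]

print(get_boundary_skip_layers(32, 2))  # ['0', '1', '30', '31']
```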