vllm.config.kv_transfer
KVTransferConfig
Configuration for distributed KV cache transfer.
Source code in vllm/config/kv_transfer.py
enable_permute_local_kv (class-attribute, instance-attribute)
enable_permute_local_kv: bool = False
Experimental feature flag to enable HND-to-NHD KV transfer.
engine_id (class-attribute, instance-attribute)
engine_id: str | None = None
The engine id for KV transfers.
kv_buffer_device (class-attribute, instance-attribute)
kv_buffer_device: str = 'cuda'
The device used by the KV connector to buffer the KV cache. Choices are 'cuda' and 'cpu'.
kv_buffer_size (class-attribute, instance-attribute)
kv_buffer_size: float = 1000000000.0
The buffer size for TorchDistributedConnector, measured in bytes. Recommended value: 1e9 (about 1 GB).
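As a quick check on the unit, the short calculation below converts the default value into decimal gigabytes and binary gibibytes. The constant names are illustrative only and are not part of vLLM.

```python
# Default documented above: kv_buffer_size = 1e9 (bytes).
kv_buffer_size = 1e9

BYTES_PER_GB = 10**9   # decimal gigabyte
BYTES_PER_GIB = 2**30  # binary gibibyte

size_gb = kv_buffer_size / BYTES_PER_GB
size_gib = kv_buffer_size / BYTES_PER_GIB

print(f"{size_gb:.2f} GB, {size_gib:.3f} GiB")  # prints "1.00 GB, 0.931 GiB"
```

The default is therefore exactly 1 GB in decimal units and slightly less than 1 GiB in binary units.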
kv_connector (class-attribute, instance-attribute)
kv_connector: str | None = None
The KV connector for vLLM to transmit KV caches between vLLM instances.
kv_connector_extra_config (class-attribute, instance-attribute)
Any extra config that the connector may need.
kv_connector_module_path (class-attribute, instance-attribute)
kv_connector_module_path: str | None = None
The Python module path to dynamically load the KV connector from. Only supported in V1.
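Loading a class from a module path at runtime generally comes down to `importlib.import_module` plus an attribute lookup. The sketch below illustrates that general mechanism only; it is not vLLM's loader. The helper `load_class` is hypothetical, and how vLLM resolves the class name inside the module is an assumption not covered here.

```python
import importlib


def load_class(module_path: str, class_name: str) -> type:
    """Import `module_path` and return the class named `class_name`.

    Hypothetical helper for illustration; not part of vLLM.
    """
    module = importlib.import_module(module_path)
    obj = getattr(module, class_name)
    if not isinstance(obj, type):
        raise TypeError(f"{module_path}.{class_name} is not a class")
    return obj


# A standard-library module stands in for a custom connector module here.
cls = load_class("collections", "OrderedDict")
print(cls.__name__)  # prints "OrderedDict"
```

With a real connector, the module path would need to be importable from the environment in which the vLLM process runs.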
kv_ip (class-attribute, instance-attribute)
kv_ip: str = '127.0.0.1'
The KV connector IP address, used to establish the distributed connection.
kv_parallel_size (class-attribute, instance-attribute)
kv_parallel_size: int = 1
The number of parallel instances for KV cache transfer. For P2pNcclConnector, this should be 2.
kv_port (class-attribute, instance-attribute)
kv_port: int = 14579
The KV connector port, used to establish the distributed connection.
kv_rank (class-attribute, instance-attribute)
kv_rank: int | None = None
The rank of this vLLM instance in the KV cache transfer. Typical value: 0 for the prefill instance, 1 for the decode instance. Currently only 1P1D (one prefill instance, one decode instance) is supported.
kv_role (class-attribute, instance-attribute)
kv_role: KVRole | None = None
Whether this vLLM instance produces KV cache, consumes it, or does both. Choices are 'kv_producer', 'kv_consumer', and 'kv_both'.
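To show how the role, rank, and parallel-size fields fit together in a 1P1D setup, the sketch below builds one plain dictionary per instance using only the field names and defaults documented on this page. The helper `make_kv_transfer_config` is hypothetical, the choice of `P2pNcclConnector` is just an example taken from the `kv_parallel_size` description, and how these values are handed to vLLM is not shown and should be checked against the vLLM version in use.

```python
import json

ALLOWED_ROLES = {"kv_producer", "kv_consumer", "kv_both"}


def make_kv_transfer_config(role: str, rank: int,
                            connector: str = "P2pNcclConnector") -> dict:
    """Build a dict of KV transfer settings (hypothetical helper)."""
    if role not in ALLOWED_ROLES:
        raise ValueError(f"unknown kv_role: {role!r}")
    return {
        "kv_connector": connector,
        "kv_role": role,
        "kv_rank": rank,
        "kv_parallel_size": 2,   # documented value for P2pNcclConnector
        "kv_ip": "127.0.0.1",    # documented default
        "kv_port": 14579,        # documented default
    }


# Prefill instance: produces KV cache, rank 0.
prefill = make_kv_transfer_config("kv_producer", rank=0)
# Decode instance: consumes KV cache, rank 1.
decode = make_kv_transfer_config("kv_consumer", rank=1)

print(json.dumps(prefill))
print(json.dumps(decode))
```

The two dictionaries differ only in `kv_role` and `kv_rank`, which mirrors the typical values described for those fields.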
__post_init__
Source code in vllm/config/kv_transfer.py
compute_hash
compute_hash() -> str
WARNING: Whenever a new field is added to this config, ensure that it is included in the factors list if it affects the computation graph.
Provide a hash that uniquely identifies all the configs that affect the structure of the computation graph from input ids/embeddings to the final hidden states, excluding anything before input ids/embeddings and after the final hidden states.
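The "factors list" pattern described above can be sketched with a small stand-alone dataclass. This is an illustration of the idea, not the vLLM implementation: `ExampleConfig`, its fields, and the use of SHA-256 are all assumptions, and nothing here indicates which fields `KVTransferConfig` actually includes in its factors.

```python
import hashlib
from dataclasses import dataclass
from typing import Any


@dataclass
class ExampleConfig:
    """Hypothetical config used only to illustrate the factors pattern."""
    affects_graph: int = 1       # assumed to change the computation graph
    transport_port: int = 14579  # assumed NOT to change the graph

    def compute_hash(self) -> str:
        # Only fields that affect the computation graph go into `factors`.
        factors: list[Any] = [self.affects_graph]
        return hashlib.sha256(str(factors).encode()).hexdigest()


a = ExampleConfig(affects_graph=1, transport_port=14579)
b = ExampleConfig(affects_graph=1, transport_port=20000)
c = ExampleConfig(affects_graph=2, transport_port=14579)

print(a.compute_hash() == b.compute_hash())  # same factors, same hash
print(a.compute_hash() == c.compute_hash())  # different factors, different hash
```

In this sketch, forgetting to add a new graph-affecting field to `factors` would make two structurally different configs hash identically, which is the failure the warning guards against.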