VLLM KV Connector解析

本文最后更新于 2026年4月15日 晚上

初始化

vllmConfig.__post_init__()会初始化KVTransferConfig,然后在scheduler/worker侧根据kv_connector类型实例化对应的connector(KVConnectorFactory.create_connector)

对于offloading connector,首先初始化spec(OffloadingSpecFactory.create_spec),spec会包含gpu_block_size、offloaded_block_size、OfflaodingManager信息,Manager又包含替换策略(lru/arc)等信息,然后用spec去初始化OffloadingConnectorScheduler/OffloadingConnectorWorker

流程

  1. scheduler调度部分,首先调度running队列,然后调度waiting队列。对于num_computed_tokens==0的请求,进行prefix cache hit,如果hit了外部kv cache,则触发异步加载(load_kv_async=True),状态置为WAITING_FOR_REMOTE_KVS,然后调用connector.update_state_after_alloc,记录reqs_to_load。然后生成SchedulerOutput,再传给connector并build_connector_meta,记录reqs_to_load和reqs_to_store

  2. worker部分,先提交上一步延迟提交的store,再提交这一步的load,start load kv,forward结束后再wait for save,把本步的store变成待提交任务,最后汇总

流程图:

sequenceDiagram
    participant U as User Script
    participant L as LLM/LLMEngine
    participant S as Scheduler
    participant C as OffloadingConnector(Scheduler)
    participant W as OffloadingConnector(Worker)
    participant H as OffloadingHandler

    U->>L: generate(prompts)
    L->>S: schedule()
    S->>C: get_num_new_matched_tokens()
    C-->>S: ext_tokens, load_kv_async
    S->>C: update_state_after_alloc()
    S->>C: build_connector_meta()
    S-->>W: kv_connector_metadata

    W->>W: start_load_kv()
    W->>H: transfer_async(CPU->GPU)
    H-->>W: async submitted

    Note over W: model forward / sampling

    W->>W: wait_for_save()/prepare_store_kv()
    W->>H: transfer_async(GPU->CPU) (deferred submit)
    W-->>S: finished_recving / finished_sending

    S->>C: update_connector_output()
    C->>C: complete_load()/complete_store()
    S->>S: update req states, free finished blocks

VLLM KV Connector解析
https://gentlecold.top/20260309/vllm_kv_connector/
作者
GentleCold
发布于
2026年3月9日
许可协议