[RFC]: KVBlocks and Metrics Publishing In Inference Frameworks #16669

@alec-flowers

Motivation.

To do effective planning and routing for distributed LLM inference, we need mechanisms to publish information about the state of the inference engine to other pieces of the distributed system.

Examples include:

  • Publishing events when blocks are added and removed from the local prefix tree so that a router can create a global prefix tree with data from all the workers.
  • Publishing real-time metrics so a router can load balance between workers if a worker is starting to queue requests.

Proposed Change.

Metrics

vLLM already has a mechanism for publishing metrics: classes that inherit from StatLoggerBase, such as PrometheusStatLogger. We would add a method to the relevant engine classes that accepts third-party stat loggers conforming to the StatLoggerBase interface.

This gives an inference framework that wraps vLLM (such as Dynamo) a way to inject custom logic for publishing metric events.

def add_stats_logger(self, name, logger: StatLoggerBase):
    self.engine.add_logger(name, logger)

or, as a constructor argument:

class AsyncLLM(EngineClient):

    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
        mm_registry: MultiModalRegistry = MULTIMODAL_REGISTRY,
        use_cached_outputs: bool = False,
        log_requests: bool = True,
        start_engine_loop: bool = True,
        stat_loggers: Optional[list[StatLoggerBase]] = None,  # new: third-party loggers
    ) -> None:
        ...
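As a rough illustration of what a wrapping framework might inject, here is a minimal sketch of a custom logger that forwards a load summary to a router. The `Stats` fields, the single `log` method, and the `publish` callback are assumptions made for this example; the `StatLoggerBase` defined here is a stand-in, and the real vLLM interface may differ.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Callable


@dataclass
class Stats:
    """Hypothetical stand-in for the engine's stats snapshot."""
    num_running: int
    num_waiting: int
    kv_cache_usage: float  # fraction of KV cache in use, in [0, 1]


class StatLoggerBase(ABC):
    """Stand-in base class for this sketch, not vLLM's actual definition."""

    @abstractmethod
    def log(self, stats: Stats) -> None:
        ...


class RouterStatLogger(StatLoggerBase):
    """Forwards a compact load summary to a caller-supplied publish function."""

    def __init__(self, worker_id: str, publish: Callable[[dict], None]) -> None:
        self.worker_id = worker_id
        self.publish = publish  # e.g. the send method of a message-bus client

    def log(self, stats: Stats) -> None:
        self.publish({
            "worker_id": self.worker_id,
            "num_running": stats.num_running,
            "num_waiting": stats.num_waiting,
            "kv_cache_usage": stats.kv_cache_usage,
        })


# Usage: collect published messages in a list instead of a real transport.
published = []
logger = RouterStatLogger("worker-0", published.append)
logger.log(Stats(num_running=4, num_waiting=2, kv_cache_usage=0.75))
```

A router receiving these messages could, for example, avoid workers whose `num_waiting` is non-zero.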

KVEvents

Publishing KVEvents requires a larger change: we need to introduce new KVEvent classes and add code to the BlockPool class to emit them. Because the BlockPool lives in a separate process from the client, we would use zmq to send these events back to the client, where a third-party inference framework can inject a KVBlockObserver to publish them.

# Classes to add. They need to be serializable with msgspec.

from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KVCacheEvent:
    """Base class for all KV cache-related events"""
    pass

@dataclass
class BlockStored(KVCacheEvent):
    block_hashes: List[int]
    parent_block_hash: Optional[int]
    token_ids: List[int]
    num_toks_per_block: List[int]
    lora_id: int

@dataclass
class BlockRemoved(KVCacheEvent):
    block_hashes: List[int]

@dataclass
class AllBlocksCleared(KVCacheEvent):
    """Event signaling that the entire prefix cache has been reset."""
    pass

class KVBlockObserver:
    def observe(self, event: KVCacheEvent) -> None:
        pass
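To make the router use case concrete, the sketch below shows one way a consumer of these events might track which workers hold which blocks. `WorkerBlockIndex` and `matched_prefix_len` are hypothetical names, and the sketch assumes that equal block hashes imply an identical token prefix, so a flat hash-to-workers map can stand in for a full prefix tree.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Optional, Set


@dataclass
class KVCacheEvent:
    pass


@dataclass
class BlockStored(KVCacheEvent):
    block_hashes: List[int]
    parent_block_hash: Optional[int]
    token_ids: List[int]
    num_toks_per_block: List[int]
    lora_id: int


@dataclass
class BlockRemoved(KVCacheEvent):
    block_hashes: List[int]


@dataclass
class AllBlocksCleared(KVCacheEvent):
    pass


class WorkerBlockIndex:
    """Hypothetical router-side index: block hash -> workers caching it."""

    def __init__(self) -> None:
        self.holders: Dict[int, Set[str]] = defaultdict(set)

    def apply(self, worker_id: str, event: KVCacheEvent) -> None:
        if isinstance(event, BlockStored):
            for h in event.block_hashes:
                self.holders[h].add(worker_id)
        elif isinstance(event, BlockRemoved):
            for h in event.block_hashes:
                self.holders[h].discard(worker_id)
        elif isinstance(event, AllBlocksCleared):
            for workers in self.holders.values():
                workers.discard(worker_id)

    def matched_prefix_len(self, worker_id: str, block_hashes: List[int]) -> int:
        """Count the leading blocks of a request already cached on a worker."""
        n = 0
        for h in block_hashes:
            if worker_id not in self.holders.get(h, ()):
                break
            n += 1
        return n


index = WorkerBlockIndex()
index.apply("w0", BlockStored([11, 22, 33], None, [1, 2, 3, 4, 5, 6], [2, 2, 2], 0))
index.apply("w1", BlockStored([11], None, [1, 2], [2], 0))
index.apply("w0", BlockRemoved([33]))
best_w0 = index.matched_prefix_len("w0", [11, 22, 33])
best_w1 = index.matched_prefix_len("w1", [11, 22, 33])
```

A router could send a request to the worker with the longest matched prefix, weighed against that worker's current load.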

In vLLM V1, we would hook into the BlockPool class, which is where the events would be created. The relevant methods and the events they would emit are:

def cache_full_blocks(self, ...)                           # emits BlockStored
def _maybe_evict_cached_block(self, block: KVCacheBlock)   # emits BlockRemoved
def reset_prefix_cache(self)                               # emits AllBlocksCleared
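The hook points above can be sketched with a toy block pool. `ToyBlockPool` is not vLLM's BlockPool: its storage, its method arguments, and the `emit` callback are assumptions made purely to show where each event type would be created.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional


@dataclass
class KVCacheEvent:
    pass


@dataclass
class BlockStored(KVCacheEvent):
    block_hashes: List[int]
    parent_block_hash: Optional[int]
    token_ids: List[int]
    num_toks_per_block: List[int]
    lora_id: int


@dataclass
class BlockRemoved(KVCacheEvent):
    block_hashes: List[int]


@dataclass
class AllBlocksCleared(KVCacheEvent):
    pass


class ToyBlockPool:
    """Minimal stand-in that emits one event per cache mutation."""

    def __init__(self, emit: Callable[[KVCacheEvent], None]) -> None:
        self.emit = emit
        self.cached: Dict[int, List[int]] = {}  # block hash -> token ids

    def cache_full_blocks(self, block_hashes: List[int],
                          token_blocks: List[List[int]],
                          parent_block_hash: Optional[int] = None,
                          lora_id: int = 0) -> None:
        for h, toks in zip(block_hashes, token_blocks):
            self.cached[h] = toks
        self.emit(BlockStored(
            block_hashes=list(block_hashes),
            parent_block_hash=parent_block_hash,
            token_ids=[t for toks in token_blocks for t in toks],
            num_toks_per_block=[len(toks) for toks in token_blocks],
            lora_id=lora_id,
        ))

    def _maybe_evict_cached_block(self, block_hash: int) -> bool:
        # Only emit BlockRemoved if the block was actually cached.
        if self.cached.pop(block_hash, None) is None:
            return False
        self.emit(BlockRemoved(block_hashes=[block_hash]))
        return True

    def reset_prefix_cache(self) -> None:
        self.cached.clear()
        self.emit(AllBlocksCleared())


events: List[KVCacheEvent] = []
pool = ToyBlockPool(events.append)
pool.cache_full_blocks([1, 2], [[10, 11], [12, 13]])
evicted = pool._maybe_evict_cached_block(1)
missing = pool._maybe_evict_cached_block(99)
pool.reset_prefix_cache()
```

In the proposed design the `emit` callback would hand events to the transport back to the client rather than append to a list.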

In the client, we could then inject a KVBlockObserver responsible for receiving the events from the engine process and publishing them.

def add_kv_block_observer(self, name, observer: KVBlockObserver):
    ...

or, as a constructor argument:

class AsyncLLM(EngineClient):

    def __init__(
        self,
        vllm_config: VllmConfig,
        executor_class: type[Executor],
        log_stats: bool,
        usage_context: UsageContext = UsageContext.ENGINE_CONTEXT,
        mm_registry: MultiModalRegistry = MULTIMODAL_REGISTRY,
        use_cached_outputs: bool = False,
        log_requests: bool = True,
        start_engine_loop: bool = True,
        stat_loggers: Optional[list[StatLoggerBase]] = None,
        kv_block_observer: Optional[list[KVBlockObserver]] = None,  # new: third-party observers
    ) -> None:
        ...
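The client-side flow could look roughly like the sketch below: serialized events arrive from the engine process, are decoded, and are fanned out to every registered observer. To stay self-contained it uses a `queue.Queue` and JSON in place of the zmq socket and msgspec encoding the proposal calls for; `encode`, `decode`, `drain`, and `RecordingObserver` are hypothetical names.

```python
import json
import queue
from dataclasses import asdict, dataclass
from typing import Dict, List, Type


@dataclass
class KVCacheEvent:
    pass


@dataclass
class BlockRemoved(KVCacheEvent):
    block_hashes: List[int]


@dataclass
class AllBlocksCleared(KVCacheEvent):
    pass


# Registry used to rebuild the right event class on the receiving side.
EVENT_TYPES: Dict[str, Type[KVCacheEvent]] = {
    cls.__name__: cls for cls in (BlockRemoved, AllBlocksCleared)
}


def encode(event: KVCacheEvent) -> bytes:
    return json.dumps({"type": type(event).__name__, "data": asdict(event)}).encode()


def decode(payload: bytes) -> KVCacheEvent:
    msg = json.loads(payload)
    return EVENT_TYPES[msg["type"]](**msg["data"])


class KVBlockObserver:
    def observe(self, event: KVCacheEvent) -> None:
        pass


class RecordingObserver(KVBlockObserver):
    """Example observer that simply remembers what it saw."""

    def __init__(self) -> None:
        self.seen: List[KVCacheEvent] = []

    def observe(self, event: KVCacheEvent) -> None:
        self.seen.append(event)


def drain(channel: "queue.Queue[bytes]", observers: List[KVBlockObserver]) -> int:
    """Deliver every pending event to every observer; return the event count."""
    delivered = 0
    while True:
        try:
            payload = channel.get_nowait()
        except queue.Empty:
            return delivered
        event = decode(payload)
        for observer in observers:
            observer.observe(event)
        delivered += 1


channel: "queue.Queue[bytes]" = queue.Queue()  # stand-in for the zmq socket
channel.put(encode(BlockRemoved([7, 8])))
channel.put(encode(AllBlocksCleared()))
observer = RecordingObserver()
count = drain(channel, [observer])
```

Tagging each message with its event type is what lets the receiver reconstruct the correct class; msgspec's tagged unions would be one way to get the same effect in the real implementation.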

CC List.

@pavanimajety @robertgshaw2-redhat
