[Prefill with Prefix Cache] Improve the efficiency of prefilling with prefix cache by allowing a larger batch size #3402
Conversation
rkooo567 left a comment
Can you add relevant tests to core/test_scheduler.py and core/test_block_manager.py?
vllm/core/block_manager.py
Can you add a docstring to indicate the difference between cached vs computed blocks?
Hi, thank you for your review! Since it is very late in my time zone, I plan to add unit tests and more docstrings tomorrow. I have reviewed my logic and think it should be correct, though it is a bit tricky. I will add more comments to make it easier to understand. Thank you!
vllm/core/scheduler.py
Should this be max(num_prompt_tokens - num_computed_tokens, 0)?
E.g., block size == 16, 4 blocks are cached (and 60 tokens are the prefix); if the prompt has 63 tokens, can this take a negative value?
vllm/core/block_manager.py
Q: Does this overestimate the number of computed tokens (since it is not guaranteed that all slots in the block have tokens allocated)? Can you write a docstring that explains the behavior here? (Either it can be overestimated, or if we rely on a certain assumption it should be documented.)
vllm/core/block_manager.py
Can you comment on why it breaks here? (I assume it relies on the assumption that blocks are computed in order?)
vllm/sequence.py
I'd intentionally avoided the type here so that torch doesn't need to be imported, since this layer deals with higher level sequence abstractions.
Hi, I have removed this modification. I made it before because Pylance in VS Code kept flagging this as a problem, which was a bit annoying. By the way, it is fine according to your explanation, thank you!
@MeloYang05 this looks great, I think it would be good to split into two PRs though since the improvements are independent?

OK thanks, I will split the PR into two independent ones.

@MeloYang05 Thanks for pointing out the incremental detokenization issue. Do you mind if I submit a PR fixing it in a more comprehensive way? You will be added as a co-author.

OK it's fine, I'm very glad to see it can be fixed in a more comprehensive way, thank you! I will only handle the batch size of prefix prefill in this PR.
vllm/core/block_manager.py
Nit:

```python
def get_num_computed_tokens(self, seq: Sequence) -> int:
    """Return the number of tokens that are already computed.
    NOTE: This excludes tokens from the last blocks.
    """
```
Hi, I have made some edits to ensure my modifications are compatible with the latest updates regarding the block manager abstraction.

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

This pull request has merge conflicts that must be resolved before it can be merged.
Hello everyone, I am testing the efficiency of prefilling with the prefix cache enabled to explore potential opportunities to apply techniques such as chunked prefill or even SplitFuse in vLLM. However, the current performance of prefilling with the prefix cache is not satisfactory, mainly for the following two reasons:

1. In scheduler.py, when determining whether a SequenceGroup can be added to the running queue, the can_allocate() method of the block_manager is not aware of the SequenceGroup's cached blocks. For example, assume there are 10 free blocks and the SequenceGroup requires 12 blocks, but 8 of those blocks are cached. In this scenario, the SequenceGroup should be allowed into the running queue, since it actually only needs 4 extra blocks. However, the current implementation does not check the cached blocks, so there is no difference between a prefix prefill and a normal prefill. In theory, prefix prefill should allow a much larger batch size than normal prefill and thereby increase overall throughput. Moreover, in the later check of num_batched_tokens, the current implementation also does not subtract the number of already computed tokens, which further limits the batch size.

2. incremental_detokenize. I also find that even when I fix the batch size to 1, the latency of a prefix prefill with only 1 new token (e.g., a prompt of length 257 where 256 tokens are cached) is still far higher than decoding 1 token. The gap mainly comes from the different execution time of incremental_detokenize for decoding versus prefix prefilling. For decoding, seq.tokens has already been computed, so it only needs to detokenize the new token id. In prefix prefill, however, seq.tokens is None, so it has to detokenize all the token ids, which is a huge gap. Additionally, incremental_detokenize actually detokenizes all the token ids more than once during prefilling: in the _decode_sequence method in llm_engine.py, the self._decode_logprobs(...) step calls incremental_detokenize many times, and each of these calls detokenizes all the token ids when seq.tokens is None, which I think is completely unnecessary. This leads to performance degradation, especially when the prompt is long.

In this PR, I primarily address the problem of enabling a larger batch size for prefix prefilling. The issue of repeated incremental_detokenize calls will be resolved in a separate PR by @Yard1, who will approach the fix in a more systematic way.

How do I fix the problems
Enable Larger Batch Size for Prefilling with Prefix Cache
To enable a larger batch size, we have to make the following two kinds of modifications:
Aware of the Cached Blocks
Cached blocks are the blocks stored in the block manager's cached_blocks table, where the hash value of the block is the key.

In block_manager.py, I add a method to get the number of cached blocks, and in the can_allocate() method I add a few lines so that it excludes the number of cached blocks from the blocks that still need to be allocated. A sketch of the idea is given below.
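A minimal, standalone sketch of this idea (the class and method names here, such as SketchBlockManager and get_num_cached_blocks, are illustrative stand-ins and not the exact code added in this PR):

```python
# Illustrative sketch only: a simplified block manager showing how counting
# cached blocks lets can_allocate() admit prefix-cached sequence groups.
from typing import Dict, List


class SketchBlockManager:
    def __init__(self, num_free_blocks: int, block_size: int = 16) -> None:
        self.block_size = block_size
        self.num_free_blocks = num_free_blocks
        # hash of a full logical block -> physical block (stand-in for cached_blocks)
        self.cached_blocks: Dict[int, object] = {}

    def get_num_cached_blocks(self, block_hashes: List[int]) -> int:
        """Count how many of the sequence's logical blocks are already cached."""
        return sum(1 for h in block_hashes if h in self.cached_blocks)

    def can_allocate(self, num_required_blocks: int, block_hashes: List[int]) -> bool:
        """Admit the sequence group if the *extra* (non-cached) blocks fit."""
        num_cached = self.get_num_cached_blocks(block_hashes)
        return num_required_blocks - num_cached <= self.num_free_blocks


# Example from the description: 10 free blocks, 12 required, 8 cached -> admit.
mgr = SketchBlockManager(num_free_blocks=10)
mgr.cached_blocks = {h: object() for h in range(8)}
print(mgr.can_allocate(12, block_hashes=list(range(12))))  # True
```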
Aware of the Computed Blocks
The computed blocks of a sequence should satisfy the following conditions:

- The block is in cached_blocks or in the evictor.
- block.computed is true.
- All blocks before it are also computed, i.e., the computed blocks form a contiguous prefix. For example, if blocks (0, 1, 2, 4) of a sequence are in cached_blocks and marked as computed, block 4 is not contiguous with (0, 1, 2), so the computed blocks are only (0, 1, 2).

Moreover, to align with #3239, the last block of a sequence is never regarded as a computed block.
Therefore, I write a method in block_manager.py to get the number of computed blocks of a sequence before it is allocated; a simplified sketch is shown below. I believe the logic of my method aligns with that of the get_all_computed_blocks method, which is invoked after the sequence has been allocated.
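A standalone sketch of that logic (the function name get_num_computed_blocks_before_allocation and the PhysicalBlock stand-in are hypothetical; the real implementation lives in block_manager.py):

```python
# Illustrative sketch only: count the computed blocks of a not-yet-allocated
# sequence by walking its block hashes in order and stopping at the first
# block that is not (cached or evicted) and computed.
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class PhysicalBlock:
    computed: bool = False


def get_num_computed_blocks_before_allocation(
    block_hashes: List[int],
    cached_blocks: Dict[int, PhysicalBlock],
    evictor: Dict[int, PhysicalBlock],
) -> int:
    num_computed = 0
    # The last block is never counted as computed, to align with #3239.
    for h in block_hashes[:-1]:
        block = cached_blocks.get(h) or evictor.get(h)
        if block is None or not block.computed:
            break  # computed blocks must form a contiguous prefix
        num_computed += 1
    return num_computed


# Blocks 0, 1, 2 and 4 are computed, but 3 is not -> only 3 blocks count.
table = {h: PhysicalBlock(computed=True) for h in (0, 1, 2, 4)}
print(get_num_computed_blocks_before_allocation(list(range(6)), table, {}))  # 3
```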
Additionally, in the _scheduler() method in scheduler.py, I have modified the check on the number of batched tokens to exclude the tokens that are already computed; a sketch follows.
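A hedged sketch of the adjusted token-budget check (all names here are illustrative; the actual change sits inside the scheduler's prefill scheduling loop):

```python
# Illustrative sketch only: when budgeting batched tokens for prefill,
# only the tokens that still need to be computed should count.
def fits_in_token_budget(
    num_prompt_tokens: int,
    num_computed_tokens: int,
    num_batched_tokens: int,
    max_num_batched_tokens: int,
) -> bool:
    # Clamp at zero, as raised in review: e.g. a 63-token prompt with 64
    # tokens' worth of cached blocks would otherwise yield -1 new tokens.
    new_tokens = max(num_prompt_tokens - num_computed_tokens, 0)
    return num_batched_tokens + new_tokens <= max_num_batched_tokens


print(fits_in_token_budget(63, 64, 0, 2048))  # True: 0 new tokens to prefill
```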
Other Optimizations
Moreover, since the hash values of logical blocks (seq.hash_of_block(logical_idx)) are now computed more frequently, I store the hash value of a logical block when it is first computed to improve efficiency (note that only a full logical block can store a hash value). A sketch of this memoization is shown below.
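A minimal sketch of the memoization, using a simplified stand-in for Sequence and assuming hash_of_block hashes the prefix of token ids up to the end of that block (the caching shown is the idea, not the exact code):

```python
# Illustrative sketch only: cache the hash of a full logical block the first
# time it is computed, so repeated hash_of_block() calls become cheap.
from typing import Dict, List, Optional


class SketchSequence:
    def __init__(self, token_ids: List[int], block_size: int = 16) -> None:
        self.token_ids = token_ids
        self.block_size = block_size
        self._block_hashes: Dict[int, int] = {}  # logical_idx -> cached hash

    def hash_of_block(self, logical_idx: int) -> Optional[int]:
        cached = self._block_hashes.get(logical_idx)
        if cached is not None:
            return cached
        end = (logical_idx + 1) * self.block_size
        if end > len(self.token_ids):
            return None  # only a full logical block can store a hash value
        block_hash = hash(tuple(self.token_ids[:end]))
        self._block_hashes[logical_idx] = block_hash  # memoize on first use
        return block_hash


seq = SketchSequence(list(range(40)))
print(seq.hash_of_block(1) == seq.hash_of_block(1))  # True, second call hits the cache
print(seq.hash_of_block(2))  # None: block 2 is not full (40 tokens, block size 16)
```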
There is also a small modification in the compute_full_blocks_in_seq method in block_manager.py:
max_full_block = seq.get_len() // self.block_size - 1. However, I believe there is no need to subtract 1 frommax_full_blockif the last block is full. This is because the last block will be excluded byblock_table[:-1]in the expressionfor b in takewhile(lambda b: b.computed, block_table[:-1])within theget_all_computed_blocksmethod. Given that the last block is full, it can also be marked as computed.Performance Improvement
Performance Improvement
I wrote a short script to test the performance gain after my modifications.
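A hedged reconstruction of the kind of benchmark described (the model path, shared-prefix length, and request count below are assumptions, not the values from the original script; it uses the offline LLM API with prefix caching enabled):

```python
# Hedged reconstruction of the benchmark idea only, not the PR's actual script.
import time

from vllm import LLM, SamplingParams

# Assumed model path; the description only says a Llama-2-7B FP16 model on an A800.
llm = LLM(model="meta-llama/Llama-2-7b-hf", enable_prefix_caching=True)
sampling_params = SamplingParams(temperature=0.0, max_tokens=1)

# Many prompts that share a long common prefix, differing only in the tail.
shared_prefix = "word " * 512
prompts = [shared_prefix + f"request {i}" for i in range(48)]

# Warm up the prefix cache with one request, then time the batched prefix prefills.
llm.generate(prompts[:1], sampling_params)
start = time.perf_counter()
llm.generate(prompts, sampling_params)
print(f"prefix prefill took {time.perf_counter() - start:.2f}s")
```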
I use the Llama-2-7B-FP16 model, and my GPU is an A800. Before the modifications, the maximum batch size for prefix prefilling is just 3, and the entire prefix prefilling process takes about 0.93 s. After the changes, the batch size reaches 48, and the prefix prefilling time is shortened to about 0.21 s (also with a simple fix for the repeated detokenization applied), which is more than a 4x speedup. I hope these changes help prepare vLLM for techniques that build on prefix prefilling.