[Core] Increase default `max_num_batched_tokens` for multimodal models #8028
Enabling chunked prefill causes some confusing errors for multimodal models, since `max_num_batched_tokens < num_multimodal_tokens` leads to a mismatched placeholder count when running the model. This PR partially solves the issue by increasing the default `max_num_batched_tokens` for multimodal models so that it is sufficient for most cases.
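For illustration only, the change amounts to picking a larger default token budget when the model is multimodal. The helper name and values below are hypothetical, not the actual constants used in this PR:

```python
# Hypothetical sketch, not the actual diff: choose a larger default
# max_num_batched_tokens for multimodal models so that a prompt's
# multimodal placeholder tokens fit into a single prefill chunk in
# most cases. Names and values are illustrative only.

_DEFAULT_MAX_NUM_BATCHED_TOKENS = 512        # example text-only default
_MULTIMODAL_MAX_NUM_BATCHED_TOKENS = 4096    # example multimodal default

def resolve_max_num_batched_tokens(
    user_value: int | None,
    is_multimodal_model: bool,
) -> int:
    """Return the effective max_num_batched_tokens for the scheduler."""
    if user_value is not None:
        # An explicit user setting is always respected.
        return user_value
    if is_multimodal_model:
        return _MULTIMODAL_MAX_NUM_BATCHED_TOKENS
    return _DEFAULT_MAX_NUM_BATCHED_TOKENS
```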
As indicated by the TODO, it would be more ideal to determine the number of multimodal tokens in the prompt and raise an error if we detect that chunked prefill would truncate them. However, this requires some refactoring for `LLMEngine` to access the multimodal registry used in the `ModelRunner`, so let's leave that to another PR.
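For reference, such a check could look roughly like the sketch below; the helper and its error message are hypothetical and assume the engine can query the prompt's multimodal token count:

```python
# Hypothetical validation sketch, not implemented in this PR: fail fast
# instead of hitting a confusing placeholder-count mismatch later.

def check_chunked_prefill_mm_compat(
    num_multimodal_tokens: int,
    max_num_batched_tokens: int,
) -> None:
    if num_multimodal_tokens > max_num_batched_tokens:
        raise ValueError(
            f"Chunked prefill would truncate the prompt's "
            f"{num_multimodal_tokens} multimodal tokens "
            f"(max_num_batched_tokens={max_num_batched_tokens}). "
            "Increase max_num_batched_tokens or disable chunked prefill."
        )
```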
As mentioned by @ywang96, another improvement would be to dynamically set the default `max_num_batched_tokens`, but that also requires access to the `ModelRunner`, as the maximum number of multimodal tokens is only available after `init_mm_limits_per_prompt` is called.
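A very rough sketch of that idea (again hypothetical, not part of this PR) could derive the default from the worst-case multimodal token count once it is known:

```python
# Hypothetical sketch of the dynamic-default idea: once the maximum number
# of multimodal tokens per prompt is known (after init_mm_limits_per_prompt),
# derive the default from it instead of using a fixed constant.

def dynamic_default_max_num_batched_tokens(
    base_default: int,
    max_mm_tokens_per_prompt: int,
) -> int:
    # Keep at least the base default, but make sure a single prompt's
    # multimodal tokens fit into one prefill chunk.
    return max(base_default, max_mm_tokens_per_prompt)
```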
FIX #7996