Merged
2 changes: 1 addition & 1 deletion docker/Dockerfile
@@ -375,7 +375,7 @@ RUN --mount=type=bind,from=build,src=/workspace/dist,target=/vllm-workspace/dist
# Install FlashInfer from source
ARG FLASHINFER_GIT_REPO="https://github.com/flashinfer-ai/flashinfer.git"
# Keep this in sync with "flashinfer" extra in setup.py
-ARG FLASHINFER_GIT_REF="v0.2.14.post1"
+ARG FLASHINFER_GIT_REF="v0.3.0"
high

With this upgrade to FlashInfer v0.3.0, it's worth checking whether the workarounds for the 'FlashInfer AOT wheel' issue are still necessary. The TODOs on lines 18 and 428 of this file relate to it. If v0.3.0 resolves the issue, those sections could be removed as part of this upgrade to simplify the Dockerfile.

# Flag to control whether to compile FlashInfer AOT kernels
# Set to "true" to enable AOT compilation:
# docker build --build-arg FLASHINFER_AOT_COMPILE=true ...
2 changes: 1 addition & 1 deletion setup.py
@@ -694,7 +694,7 @@ def _read_requirements(filename: str) -> list[str]:
"mistral_common[audio]"], # Required for audio processing
"video": [], # Kept for backwards compatibility
# FlashInfer should be updated together with the Dockerfile
-"flashinfer": ["flashinfer-python==0.2.14.post1"],
+"flashinfer": ["flashinfer-python==0.3.0"],
high

This upgrade to flashinfer-python==0.3.0 is great. The release notes for this version mention that dynamic tile size for MoE kernels is now enabled. This likely resolves the TODO on line 28 of vllm/model_executor/layers/quantization/utils/flashinfer_utils.py, which hardcodes tile_tokens_dim = 8 due to issues in a previous FlashInfer version. It would be beneficial to update that logic to take advantage of the new version's capabilities and potentially improve performance.
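
To make the suggestion concrete, here is a minimal sketch of what a dynamic choice of `tile_tokens_dim` could look like in place of the hardcoded `8`. Everything in it is an assumption for illustration: the helper names `next_power_of_two` and `pick_tile_tokens_dim`, the tokens-per-expert heuristic, and the clamp bounds of 8 and 64 are hypothetical and are not taken from FlashInfer or from `flashinfer_utils.py`. Which tile sizes v0.3.0 actually accepts would need to be confirmed against its release notes.

```python
def next_power_of_two(n: int) -> int:
    """Smallest power of two that is >= n (returns 1 for n <= 1)."""
    if n <= 1:
        return 1
    return 1 << (n - 1).bit_length()


def pick_tile_tokens_dim(
    num_tokens: int,
    top_k: int,
    num_experts: int,
    min_tile: int = 8,   # assumed lower bound; matches the currently hardcoded value
    max_tile: int = 64,  # assumed upper bound; not confirmed for any kernel
) -> int:
    """Hypothetical heuristic: size the tile to the average tokens per expert."""
    # Each token is routed to top_k experts, so this is the mean load per expert.
    tokens_per_expert = (num_tokens * top_k) // num_experts
    # Round up to a power of two, then clamp into the assumed supported range.
    tile = next_power_of_two(tokens_per_expert)
    return min(max(tile, min_tile), max_tile)
```

With a heuristic of this shape, small decode batches would keep the current value of 8, while large prefill batches would get a larger tile.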

# Optional deps for AMD FP4 quantization support
"petit-kernel": ["petit-kernel"],
},
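
Both hunks carry a comment asking that the Dockerfile ref and the `setup.py` pin be kept in sync. A small check along these lines could enforce that in CI. This is a sketch under stated assumptions: the function names are hypothetical, it assumes the git ref is the PyPI version prefixed with `v` (as in this diff), and it works on file contents passed in as strings rather than on any particular repository layout.

```python
import re


def dockerfile_flashinfer_ref(dockerfile_text: str) -> str:
    """Extract the FLASHINFER_GIT_REF value, without a leading 'v'."""
    match = re.search(
        r'^ARG FLASHINFER_GIT_REF="v?([^"]+)"', dockerfile_text, re.MULTILINE
    )
    if match is None:
        raise ValueError("FLASHINFER_GIT_REF not found in Dockerfile text")
    return match.group(1)


def setup_flashinfer_pin(setup_text: str) -> str:
    """Extract the version pinned as flashinfer-python==<version>."""
    match = re.search(r"flashinfer-python==([^\"']+)", setup_text)
    if match is None:
        raise ValueError("flashinfer-python pin not found in setup.py text")
    return match.group(1)


def versions_in_sync(dockerfile_text: str, setup_text: str) -> bool:
    """True when the Dockerfile ref and the setup.py pin name the same version."""
    return dockerfile_flashinfer_ref(dockerfile_text) == setup_flashinfer_pin(
        setup_text
    )
```

A caller would read `docker/Dockerfile` and `setup.py` from disk, pass their contents to `versions_in_sync`, and fail the build when it returns `False`.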