[Core] Upgrade to pytorch 2.2, remove cupy dependency, avoid nccl 2.19 bug #3442
Conversation
- `torch==2.1.1` -> `torch==2.2.0`
- `xformers==0.0.23.post1` -> `xformers==0.0.24`
- ROCm not updated because no `torch==2.2.0` containers have been published yet
Co-authored-by: Woosuk Kwon <[email protected]>
This reverts commit daca4e1.
Let me test with this modification first, and then test with torch 2.2.0. On my dev box, this fix works with torch 2.2.0.
I'm trying to bisect which pytorch nightly version first breaks vllm. The most annoying part is that we have to build xformers from source each time, which takes quite a long time.

# clone code
git clone https://github.com/vllm-project/vllm.git
cd vllm
git remote add ykc https://github.com/youkaichao/vllm.git
git fetch ykc fix_parallel_distributed_test
git checkout fix_parallel_distributed_test
# remove pytorch version requirements, as we will install nightly version
# modify requirements-build.txt requirements.txt pyproject.toml
git clone https://github.com/facebookresearch/xformers.git ../xformers
pushd ../xformers
git submodule update --init --recursive
popd
export DATE=20240316
conda create -n vllm-torch-$DATE python=3.9 -y
conda activate vllm-torch-$DATE
# install some basic requirements
pip install "numpy<2"
pip install ninja
# install pre-built pytorch nightly wheels
pip install https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2B989adb9a29-cp39-cp39-linux_x86_64.whl
pip install https://download.pytorch.org/whl/nightly/cu121/torch-2.4.0.dev20240316%2Bcu121-cp39-cp39-linux_x86_64.whl
# verify basic cuda functionality
python -c "import torch; print(torch.randn(5, 5, 5).cuda().sum())"
# verify pt2 stack
python -c "import torch; print(torch.compile(lambda x: x + 1)(torch.randn(5, 5, 5).cuda()).sum())"
# build xformers from source
pushd ../xformers
# only compile what you want
export TORCH_CUDA_ARCH_LIST=8.9
# avoid CPU overload
MAX_JOBS=6 python setup.py install
python setup.py clean
# verify xformers
python -c "from xformers.ops import memory_efficient_attention; import torch; data = torch.rand(8, 128, 32, 128).cuda(); output = memory_efficient_attention(data, data, data); print(output.sum())"
popd
# build vllm
pip install -r requirements-build.txt
pip install -r requirements-dev.txt
python setup.py install
python setup.py clean
# use installed version, not current directory
mv vllm vllm_link
# test vllm
TEST_DIST_MODEL=facebook/opt-125m python -m pytest -s tests/distributed/test_basic_distributed_correctness.py
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf python -m pytest -s tests/distributed/test_basic_distributed_correctness.py
# move back
mv vllm_link vllm

Tested a range of pytorch versions, starting with torch-2.2.0.dev20231116 because of pytorch/pytorch#112623. To gather more information, I also modified the following code (Lines 117 to 145 in 120157f):
Modified code is:

# Profile the memory usage of the model and get the maximum number of
# cache blocks that can be allocated with the remaining free memory.
torch.cuda.empty_cache()
torch.cuda.memory._record_memory_history()

# Execute a forward pass with dummy inputs to profile the memory usage
# of the model.
self.model_runner.profile_run()

# Calculate the number of blocks that can be allocated with the
# profiled peak memory.
torch.cuda.synchronize()
free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
# NOTE(woosuk): Here we assume that the other processes using the same
# GPU did not change their memory usage during the profiling.
peak_memory = self.init_gpu_memory - free_gpu_memory

cache_block_size = self.get_cache_block_size_bytes(
    block_size, cache_dtype)
num_gpu_blocks = int(
    (total_gpu_memory * gpu_memory_utilization - peak_memory) //
    cache_block_size)
num_cpu_blocks = int(cpu_swap_space // cache_block_size)
num_gpu_blocks = max(num_gpu_blocks, 0)
num_cpu_blocks = max(num_cpu_blocks, 0)
if self.model_runner.lora_manager:
    self.model_runner.remove_all_loras()
gc.collect()
torch.cuda.empty_cache()

import os
import multiprocessing
current_process = multiprocessing.current_process()
process_name = current_process.name
torch.cuda.memory._dump_snapshot(
    f"/home/gcpuser/vllm/stats-{process_name}-{os.getpid()}.pickle")
with open(f"/home/gcpuser/vllm/stats-{process_name}-{os.getpid()}.txt", "w") as f:
    print(f"{free_gpu_memory=}, {total_gpu_memory=}, {self.init_gpu_memory=}, {peak_memory=}", file=f)
    print(f"{block_size=}, {cache_block_size=}, {num_gpu_blocks=}, {num_cpu_blocks=}", file=f)
    from pprint import pprint
    pprint(torch.cuda.memory_stats(0), stream=f)
    pprint(torch.cuda.memory_stats(1), stream=f)
return num_gpu_blocks, num_cpu_blocks
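As a side note, the dumped snapshot can also be inspected offline; here is a minimal sketch (the pickle filename is whatever _dump_snapshot wrote above, shown with a made-up name):

import pickle

# Load a snapshot written by torch.cuda.memory._dump_snapshot(). The schema is
# internal to PyTorch; for a full visualization, drag the pickle into
# https://pytorch.org/memory_viz instead of parsing it by hand.
with open("stats-MainProcess-12345.pickle", "rb") as f:  # hypothetical filename
    snapshot = pickle.load(f)

print(type(snapshot))
if isinstance(snapshot, dict):
    print(list(snapshot.keys()))  # e.g. segment and trace information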
I wonder if we can instead just clean up states between tests using
import gc
import torch
gc.collect()
torch.cuda.empty_cache()
?
# manually load the nccl library
_path = os.path.dirname(os.path.abspath(__file__))
so_file = glob.glob(f"{_path}/../../lib/nvidia/nccl/lib/libnccl.so.*")[0]
Haven't looked into the whole PR, but I feel that this kind of absolute path for libnccl seems very dangerous. Many cloud providers may ship their own nccl package (e.g., AWS for EFA support). I'm not sure what the effect of this would be.
This is our secret version of nccl, living inside vllm; we ship it with our wheel. Please check the modification in setup.py. We can remove this after nccl fixes the bug, or after pytorch downgrades to nccl==2.18.
Expanding the path, you can see it is /path/to/vllm/lib/nvidia/nccl/lib/libnccl.so.2.18.3. I created a secret lib folder under the vllm library, and this .so file will come with the vllm installation.
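For readers unfamiliar with the mechanism, a minimal sketch of how a prebuilt .so can be shipped inside a Python wheel (an illustration of the general package_data approach, not the PR's actual setup.py; the directory layout is hypothetical):

from setuptools import find_packages, setup

# Sketch only: bundle a prebuilt libnccl under the vllm package directory so
# that it ships inside the wheel and lands next to the Python sources.
setup(
    name="vllm",
    packages=find_packages(),
    package_data={
        "vllm": ["lib/nvidia/nccl/lib/libnccl.so.*"],  # hypothetical layout
    },
)

With something like this, pip installs the .so alongside the Python sources, so it can later be located relative to __file__ as in the quoted snippet.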
Looks like torch is also coming with its own version? https://discuss.pytorch.org/t/how-can-i-change-nccl-version-in-pytorch/143942 @zhuohan123 do you know how AWS resolves this?
IMO, this is a great direction. I've seen many successful cases of reducing unnecessary external dependencies like this, and I've seen lots of users having to turn off cuda graph because cupy didn't work in their environment (this will resolve their issues).
Thanks for your comment! I will completely remove cupy.
Do you have a benchmark result before/after the PR?
with tempfile.TemporaryDirectory() as temp_dir:
    local_zip_path = (
        f"{temp_dir}/"
        "nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl")
Does vllm currently support amd arch (the wheel is only for x86)?
Do you mean arm64? We don't need to consider amd here because we are under the condition `if _is_cuda():`.
# a secret place
# Define the URL of the file and the directory to unzip to
file_url = ('https://files.pythonhosted.org/packages/44/6e/'
consider using a constant?
Actually, I wonder if we can support an env var so that we can also decide to load an arbitrary .so instead of always downloading our package when we build.
Detecting an env var at runtime is good. W.r.t. downloading our package when we build, we have to do this because the nccl brought in by torch==2.2.0 does not work.
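As an illustration only, a runtime override could look roughly like this (the env var name VLLM_NCCL_SO_PATH is hypothetical, not something this PR defines):

import glob
import os

# Hypothetical override: let users point vllm at their own libnccl build
# (e.g., a cloud provider's package) instead of the copy bundled in the wheel.
_path = os.path.dirname(os.path.abspath(__file__))
so_file = os.environ.get("VLLM_NCCL_SO_PATH")  # assumed env var name
if not so_file:
    # Fall back to the .so shipped inside the vllm wheel.
    so_file = glob.glob(f"{_path}/../../lib/nvidia/nccl/lib/libnccl.so.*")[0]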
logger.info(f'Creating target directory {target_dir} ...')
os.makedirs(target_dir)
# Check if the file is already downloaded
if os.path.exists(target_dir + "nvidia"):
nit, but

if os.path.exists(target_dir + "nvidia"):
    break
# Download the file
logger.info('Downloading file...')
....
....
We have no choice here; we cannot break or return, because it is not inside any function 👀
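If this were refactored later, one way to get an early exit (a sketch of the general pattern, not code from this PR) would be to wrap the download in a small helper:

import os

def ensure_nccl_downloaded(target_dir: str) -> None:
    # Hypothetical helper: skip the download when the wheel is already unpacked.
    if os.path.exists(os.path.join(target_dir, "nvidia")):
        return
    os.makedirs(target_dir, exist_ok=True)
    # ... download and unpack the nvidia-nccl wheel into target_dir here ...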
# ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId);
_c_ncclGetUniqueId = nccl.ncclGetUniqueId
_c_ncclGetUniqueId.restype = ctypes.c_int
_c_ncclGetUniqueId.argtypes = [ctypes.POINTER(NcclUniqueId)]
Hmm, should we use a more standard solution like pybind11 here? (How common is the practice of manually binding C functions to Python this way?)
I did this to explicitly avoid pybind11 or any other solution requiring additional building/compilation, to reduce binary dependencies. This code will work for any nccl version, as long as it exposes the ncclAllReduce function symbol.
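For context, a self-contained sketch of this ctypes pattern (the library filename is illustrative; the struct layout follows nccl.h, where ncclUniqueId is an opaque 128-byte buffer):

import ctypes

# Load a specific NCCL shared library directly (filename is illustrative).
nccl = ctypes.CDLL("libnccl.so.2.18.3")

class NcclUniqueId(ctypes.Structure):
    # nccl.h: typedef struct { char internal[128]; } ncclUniqueId;
    _fields_ = [("internal", ctypes.c_byte * 128)]

# ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId);
_c_ncclGetUniqueId = nccl.ncclGetUniqueId
_c_ncclGetUniqueId.restype = ctypes.c_int
_c_ncclGetUniqueId.argtypes = [ctypes.POINTER(NcclUniqueId)]

unique_id = NcclUniqueId()
assert _c_ncclGetUniqueId(ctypes.byref(unique_id)) == 0  # 0 == ncclSuccess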
# manually load the nccl library
_path = os.path.dirname(os.path.abspath(__file__))
so_file = glob.glob(f"{_path}/../../lib/nvidia/nccl/lib/libnccl.so.*")[0]
Q: is there no conflict with the nccl loaded by torch? (If so, why?)
No conflict. PyTorch will load libnccl.so.2, while our file is libnccl.so.2.18.3. That's the value of this PR: to separate the nccl versions used by vllm and pytorch.
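A quick way to sanity-check which version each loaded library reports, sketched with ctypes (the .so names are illustrative; ncclGetVersion packs the version into a single integer):

import ctypes

def nccl_version(so_name: str) -> int:
    # ncclResult_t ncclGetVersion(int* version);
    lib = ctypes.CDLL(so_name)
    version = ctypes.c_int()
    assert lib.ncclGetVersion(ctypes.byref(version)) == 0
    return version.value

# e.g. compare the library torch resolves vs. the copy bundled with vllm
print(nccl_version("libnccl.so.2"))       # whatever the loader finds for torch
print(nccl_version("libnccl.so.2.18.3"))  # the bundled copy (illustrative name)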
How does this solution scale to other backends, like AMD GPUs? Does it require another secret .so?
Our CI does not run on AMD GPUs. We don't know if AMD's rccl has a similar problem. Help is needed!
I don't have a benchmark result yet. I suppose it does not change much; we might only see some performance variation (up or down) because of the version changes in torch & xformers.
More supporting evidence for pinning
More supporting evidence for removing cupy: we get more and more bug reports about cupy's TCPStore recently, e.g. #3057 and #3617.
@WoosukKwon mentioned the PyPI package size limit. If we want to avoid the large wheel size, we can add an additional build flag.
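For illustration only, such a switch could be an environment variable checked in setup.py (the flag name VLLM_BUNDLE_NCCL below is hypothetical, not the one proposed in this thread):

import os

# Hypothetical build-time switch: only bundle the pinned NCCL .so when asked,
# keeping the default wheel small enough for PyPI.
BUNDLE_NCCL = os.environ.get("VLLM_BUNDLE_NCCL", "0") == "1"

if BUNDLE_NCCL:
    # download and unpack nvidia_nccl_cu12-2.18.3 into the package tree here
    pass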
Then, how does it work with the vLLM PyPI package? Also, a dumb question: can we use an older version of the NVIDIA NCCL package (https://pypi.org/project/nvidia-nccl-cu12/) instead of including it in our package?
Also, how does this work with AMD GPUs? As you know, AMD GPUs also use the ROCm version of CuPy because of some bugs when using PyTorch RCCL with HIP graphs.
No, it will break pytorch, because pytorch has a binary dependency on nccl 2.19.3.
Then the pypi version of vllm will use nccl 2.19.3, which costs more memory and would have lower performance when using cudagraph.
It seems RCCL has exactly the same API as NCCL. Ideally we should be able to just use
The benchmark in our CI is really just a placeholder. It doesn't run distributed (tp>1) and it only runs tiny models.
PR Checklist (Please read before submitting.)
Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.
PR Title and Classification
Only specific types of PRs will be reviewed. The PR title is prefixed appropriately to indicate the type of change. Please use one of the following:
- [Bugfix] for bug fixes.
- [CI/Build] for build or continuous integration improvements.
- [Doc] for documentation fixes and improvements.
- [Model] for adding a new model or improving an existing model. Model name should appear in the title.
- [Frontend] for changes on the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
- [Kernel] for changes affecting CUDA kernels or other compute kernels.
- [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
- [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
- [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.
Code Quality
The PR needs to meet the following code quality standards:
- Use format.sh to format your code.
- Add documentation to docs/source/ if the PR modifies the user-facing behaviors of vLLM. It helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes
Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not go through the PR.

What to Expect for the Reviews
The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:
- After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.

Thank You
Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!
vLLM will allocate all the available GPU memory, so we need to run the tests one by one. The solution is to pass arguments (the model name) via environment variables.
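A sketch of how a test can pick the model up from that environment variable (this mirrors the TEST_DIST_MODEL usage in the commands above, but is not the exact test code):

import os

import pytest

# One model per pytest invocation: the model name comes from the environment,
# so each run allocates GPU memory for exactly one model and frees it on exit.
MODELS = [os.environ["TEST_DIST_MODEL"]]

@pytest.mark.parametrize("model", MODELS)
def test_basic_distributed_correctness(model: str) -> None:
    # ... spin up the distributed engine for `model` and compare outputs ...
    pass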