
Conversation

@youkaichao (Member)

PR Checklist (Please read before submitting.)

Thank you for your contribution to vLLM! Before submitting the pull request, please ensure the PR meets the following criteria. This helps vLLM maintain the code quality and improve the efficiency of the review process.

PR Title and Classification

Only specific types of PRs will be reviewed. The PR title should be prefixed appropriately to indicate the type of change. Please use one of the following:

  • [Bugfix] for bug fixes.
  • [CI/Build] for build or continuous integration improvements.
  • [Doc] for documentation fixes and improvements.
  • [Model] for adding a new model or improving an existing model. Model name should appear in the title.
  • [Frontend] for changes to the vLLM frontend (e.g., OpenAI API server, LLM class, etc.)
  • [Kernel] for changes affecting CUDA kernels or other compute kernels.
  • [Core] for changes in the core vLLM logic (e.g., LLMEngine, AsyncLLMEngine, Scheduler, etc.)
  • [Hardware][Vendor] for hardware-specific changes. Vendor name should appear in the prefix (e.g., [Hardware][AMD]).
  • [Misc] for PRs that do not fit the above categories. Please use this sparingly.

Note: If the PR spans more than one category, please include all relevant prefixes.

Code Quality

The PR needs to meet the following code quality standards:

  • We adhere to the Google Python style guide and Google C++ style guide.
  • Pass all linter checks. Please use format.sh to format your code.
  • The code needs to be well documented to ensure future contributors can easily understand it.
  • Include sufficient tests to ensure the project stays correct and robust. This includes both unit tests and integration tests.
  • Please add documentation to docs/source/ if the PR modifies the user-facing behavior of vLLM. This helps vLLM users understand and utilize the new features or changes.

Notes for Large Changes

Please keep the changes as concise as possible. For major architectural changes (>500 LOC excluding kernel/data/config/test), we would expect a GitHub issue (RFC) discussing the technical design and justification. Otherwise, we will tag it with rfc-required and might not review the PR.

What to Expect for the Reviews

The goal of the vLLM team is to be a transparent reviewing machine. We would like to make the review process transparent and efficient and make sure no contributor feels confused or frustrated. However, the vLLM team is small, so we need to prioritize some PRs over others. Here is what you can expect from the review process:

  • After the PR is submitted, it will be assigned to a reviewer. Every reviewer will pick up PRs based on their expertise and availability.
  • After the PR is assigned, the reviewer will provide a status update every 2-3 days. If the PR is not reviewed within 7 days, please feel free to ping the reviewer or the vLLM team.
  • After the review, the reviewer will put an action-required label on the PR if there are changes required. The contributor should address the comments and ping the reviewer to re-review the PR.
  • Please respond to all comments within a reasonable time frame. If a comment isn't clear or you disagree with a suggestion, feel free to ask for clarification or discuss the suggestion.

Thank You

Finally, thank you for taking the time to read these guidelines and for your interest in contributing to vLLM. Your contributions make vLLM a great tool for everyone!


vLLM allocates all the available GPU memory, so we need to run the tests one by one. The solution is to pass arguments (the model name) via environment variables.
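
For illustration, here is a minimal sketch of that pattern (TEST_DIST_MODEL matches the commands quoted later in this thread; the test body is a stand-in, not the real test):

import os

import pytest

# The model under test comes from the environment, so each pytest invocation
# exercises exactly one model in a fresh process.
MODELS = [os.environ.get("TEST_DIST_MODEL", "facebook/opt-125m")]


@pytest.mark.parametrize("model", MODELS)
def test_basic_distributed_correctness(model):
    # Illustrative body only: the real test spins up a distributed vLLM engine
    # and compares its outputs against a HuggingFace baseline.
    assert model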

@youkaichao (Member Author)

Let me test with this modification first, and then test with torch 2.2.0. On my dev box, this fix works with torch 2.2.0.

@youkaichao (Member Author) commented Mar 16, 2024

I'm trying to bisect which PyTorch nightly version starts to fail vLLM. The most annoying part is that we have to build xformers from source each time, which takes quite a long time.

# clone the code

git clone https://github.com/vllm-project/vllm.git
cd vllm
git remote add ykc https://github.com/youkaichao/vllm.git
git fetch ykc fix_parallel_distributed_test
git checkout fix_parallel_distributed_test
# remove pytorch version requirements, as we will install nightly version
# modify requirements-build.txt requirements.txt pyproject.toml

git clone https://github.com/facebookresearch/xformers.git ../xformers
pushd ../xformers
git submodule update --init --recursive
popd

export DATE=20240316
conda create -n vllm-torch-$DATE python=3.9 -y
conda activate vllm-torch-$DATE

# install some basic requirements
pip install "numpy<2"
pip install ninja

# install pre-built pytorch nightly wheels
pip install https://download.pytorch.org/whl/nightly/pytorch_triton-3.0.0%2B989adb9a29-cp39-cp39-linux_x86_64.whl
pip install https://download.pytorch.org/whl/nightly/cu121/torch-2.4.0.dev20240316%2Bcu121-cp39-cp39-linux_x86_64.whl

# verify basic cuda functionality
python -c "import torch; print(torch.randn(5, 5, 5).cuda().sum())"
# verify pt2 stack
python -c "import torch; print(torch.compile(lambda x: x + 1)(torch.randn(5, 5, 5).cuda()).sum())"

# build xformers from source
pushd ../xformers
# only compile what you want
export TORCH_CUDA_ARCH_LIST=8.9
# avoid CPU overload
MAX_JOBS=6 python setup.py install
python setup.py clean
# verify xformers
python -c "from xformers.ops import memory_efficient_attention; import torch; data = torch.rand(8, 128, 32, 128).cuda(); output = memory_efficient_attention(data, data, data); print(output.sum())"
popd

# build vllm
pip install -r requirements-build.txt
pip install -r requirements-dev.txt
python setup.py install
python setup.py clean

# use installed version, not current directory
mv vllm vllm_link

# test vllm
TEST_DIST_MODEL=facebook/opt-125m python -m pytest -s tests/distributed/test_basic_distributed_correctness.py
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf python -m pytest -s tests/distributed/test_basic_distributed_correctness.py

# move back
mv vllm_link vllm

Tested the following PyTorch versions:

PyTorch version | Status
torch-2.2.0.dev20231115 | Good
torch-2.2.0.dev20231115 (built myself from commit cc11c0d11bfdcc1f1804cedfcd634a5e79715638) | Good
torch-2.2.0.dev20231116 | Good*
torch-2.2.0.dev20231117 | Good*
torch-2.2.0.dev20231118 | Good*
torch-2.2.0.dev20231119 | Good*
torch-2.2.0.dev20231120 | Good*
torch-2.2.0.dev20231121 | Good*
torch-2.2.0.dev20231121 (small wheel) | OOM
torch-2.2.0.dev20231122 | OOM
torch-2.2.0.dev20231122 (built myself from commit e0ec71deab2aedd6d44f4ea3e03b52bdaf5db3da) | Good
torch-2.2.0.dev20231123 | OOM
torch-2.2.0.dev20231125 (local build) | Good
torch-2.2.0.dev20231210 | OOM
torch-2.3.0.dev20231222 | OOM
torch-2.3.0.dev20231222 (built myself from commit a357a0f31519f96cff9839c1672a112539ba98ff) | OOM
torch-2.3.0.dev20231224 | OOM
torch-2.3.0.dev20240124 | OOM
torch-2.3.0.dev20240125 (built myself from commit 3c9076f070fab5b27eae3b7846755c98b7c97a1a) | Good
torch-2.3.0.dev20240224 | OOM
torch-2.3.0.dev20240316 | OOM

*: Starting with torch-2.2.0.dev20231116, because of pytorch/pytorch#112623, import torch triggers a cuInit call, making the torch module unusable in fork or pickle environments, which causes vllm to deadlock. To test these PyTorch versions, we had to defer import torch until the worker runs. The deadlock problem is solved in torch-2.2.0.dev20231222, because pytorch/pytorch#116201 disabled pytorch/pytorch#112623, so import torch no longer initializes the CUDA context.
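
A minimal sketch of that workaround (illustrative names, assuming a fork-started worker and at least two GPUs):

import multiprocessing


def worker_main(rank: int) -> None:
    # Deferred import: torch is only imported inside the child process, so the
    # parent never triggers cuInit before forking.
    import torch
    torch.cuda.set_device(rank)
    print(rank, torch.cuda.current_device())


if __name__ == "__main__":
    procs = [multiprocessing.Process(target=worker_main, args=(i,))
             for i in range(2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()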

The most interesting case is torch-2.2.0.dev20231121: per the table above, the regular wheel works while its small wheel version hits OOM. Their only difference is whether CUDA-related libraries are shipped with the wheel.

What's more, from PyTorch 2.1.0 to 2.2.0, the bundled nccl version was bumped from 2.18.3 to 2.19.3. The bug is not fixed in 2.20.3.

To rule out cupy as a confounder, I wrote a pure Python wrapper myself to call nccl, and indeed, just switching ncclAllReduce from 2.18.3 to 2.19.3 produced the OOM.

To gather more information, I also modified the following code:

vllm/vllm/worker/worker.py, lines 117 to 145 at 120157f:

        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        self.model_runner.profile_run()

        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # NOTE(woosuk): Here we assume that the other processes using the same
        # GPU did not change their memory usage during the profiling.
        peak_memory = self.init_gpu_memory - free_gpu_memory

        cache_block_size = self.get_cache_block_size_bytes(
            block_size, cache_dtype)
        num_gpu_blocks = int(
            (total_gpu_memory * gpu_memory_utilization - peak_memory) //
            cache_block_size)
        num_cpu_blocks = int(cpu_swap_space // cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        return num_gpu_blocks, num_cpu_blocks

Modified code is:

        # Profile the memory usage of the model and get the maximum number of
        # cache blocks that can be allocated with the remaining free memory.
        torch.cuda.empty_cache()
        torch.cuda.memory._record_memory_history()

        # Execute a forward pass with dummy inputs to profile the memory usage
        # of the model.
        self.model_runner.profile_run()

        # Calculate the number of blocks that can be allocated with the
        # profiled peak memory.
        torch.cuda.synchronize()
        free_gpu_memory, total_gpu_memory = torch.cuda.mem_get_info()
        # NOTE(woosuk): Here we assume that the other processes using the same
        # GPU did not change their memory usage during the profiling.
        peak_memory = self.init_gpu_memory - free_gpu_memory

        cache_block_size = self.get_cache_block_size_bytes(
            block_size, cache_dtype)
        num_gpu_blocks = int(
            (total_gpu_memory * gpu_memory_utilization - peak_memory) //
            cache_block_size)
        num_cpu_blocks = int(cpu_swap_space // cache_block_size)
        num_gpu_blocks = max(num_gpu_blocks, 0)
        num_cpu_blocks = max(num_cpu_blocks, 0)
        if self.model_runner.lora_manager:
            self.model_runner.remove_all_loras()
        gc.collect()
        torch.cuda.empty_cache()
        import os
        import multiprocessing
        current_process = multiprocessing.current_process()
        process_name = current_process.name
        torch.cuda.memory._dump_snapshot(f"/home/gcpuser/vllm/stats-{process_name}-{os.getpid()}.pickle")
        with open(f"/home/gcpuser/vllm/stats-{process_name}-{os.getpid()}.txt", "w") as f:
            print(f"{free_gpu_memory=}, {total_gpu_memory=}, {self.init_gpu_memory=}, {peak_memory=}", file=f)
            print(f"{block_size=}, {cache_block_size=}, {num_gpu_blocks=}, {num_cpu_blocks=}", file=f)
            from pprint import pprint
            pprint(torch.cuda.memory_stats(0), stream=f)
            pprint(torch.cuda.memory_stats(1), stream=f)
        return num_gpu_blocks, num_cpu_blocks
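
For reference, a sketch of how such a snapshot can be inspected offline (the file name is hypothetical; the "segments" layout matches recent torch.cuda.memory._snapshot output, and the pickle can also be loaded into https://pytorch.org/memory_viz):

import pickle

# Hypothetical path; substitute one of the stats-*.pickle files dumped above.
with open("stats-SpawnProcess-1-12345.pickle", "rb") as f:
    snapshot = pickle.load(f)

# Each segment corresponds to one block obtained from cudaMalloc, together with
# the individual allocations living inside it.
for seg in snapshot["segments"][:10]:
    print(seg["device"], seg["total_size"], len(seg["blocks"]))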

@rkooo567 (Collaborator) left a comment:

I wonder if we can instead just clean up state between tests using the following?

    import gc
    import torch
    gc.collect()
    torch.cuda.empty_cache()



# manually load the nccl library
_path = os.path.dirname(os.path.abspath(__file__))
so_file = glob.glob(f"{_path}/../../lib/nvidia/nccl/lib/libnccl.so.*")[0]
Member:

I haven't looked into the whole PR, but having this kind of absolute path for libnccl seems very dangerous. Many cloud providers ship their own nccl package (e.g., AWS for EFA support). I'm not sure what the effect of this would be.

@youkaichao (Member Author):

This is our secret version of nccl, living inside vllm; we ship it with our wheel. Please check the modification in setup.py.

We can remove this after nccl fixes the bug, or after pytorch downgrades to nccl==2.18.

@youkaichao (Member Author):

Expanding the path, you can see it is /path/to/vllm/lib/nvidia/nccl/lib/libnccl.so.2.18.3. I created a secret lib folder under the vllm library, and this .so file will come with the vllm installation.

Collaborator:

Looks like torch is also coming with its own version? https://discuss.pytorch.org/t/how-can-i-change-nccl-version-in-pytorch/143942 @zhuohan123 do you know how AWS resolves this?

@rkooo567 (Collaborator)

Imo, this is a great direction. I've seen many successful cases of reducing unnecessary external dependencies like this, and I've seen lots of users having to turn off cuda graph because cupy didn't work with their environment (this will resolve their issues).

@youkaichao (Member Author)

> because cupy didn't work with their environment (and it will resolve their issues)

Thanks for your comment! I will completely remove cupy dependency in a followup PR.

@rkooo567 (Collaborator) left a comment:

Do you have a benchmark result before/after the PR?

with tempfile.TemporaryDirectory() as temp_dir:
    local_zip_path = (
        f"{temp_dir}/"
        "nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl")
Collaborator:

Does vllm currently support amd arch (the wheel is only for x86)?

@youkaichao (Member Author):

Do you mean arm64? We don't need to consider AMD here because this code is under the if _is_cuda(): condition.

# a secret place

# Define the URL of the file and the directory to unzip to
file_url = ('https://files.pythonhosted.org/packages/44/6e/'
Collaborator:

consider using a constant?

Collaborator:

Actually, I wonder if we can support an env var so that we can also decide to load an arbitrary .so instead of always downloading our package when we build.

@youkaichao (Member Author):

Detecting an env var at runtime is good.

W.r.t. downloading our package when we build: we have to do this because the nccl brought in by torch==2.2.0 does not work.
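
A sketch of the runtime lookup being discussed, assuming the VLLM_NCCL_SO_PATH variable mentioned later in this thread and the bundled-library path from the snippet above:

import glob
import os

# Prefer an explicit user override; otherwise fall back to the bundled library.
so_file = os.environ.get("VLLM_NCCL_SO_PATH", "")
if not so_file:
    _path = os.path.dirname(os.path.abspath(__file__))
    candidates = glob.glob(f"{_path}/../../lib/nvidia/nccl/lib/libnccl.so.*")
    if not candidates:
        raise RuntimeError("Cannot find the bundled libnccl.so; set "
                           "VLLM_NCCL_SO_PATH to a NCCL/RCCL shared library.")
    so_file = candidates[0]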

logger.info(f'Creating target directory {target_dir} ...')
os.makedirs(target_dir)
# Check if the file is already downloaded
if os.path.exists(target_dir + "nvidia"):
Collaborator:

nit, but

if os.path.exists(target_dir + "nvidia"):
    break

# Download the file
logger.info('Downloading file...')
....
....

@youkaichao (Member Author):

We have no choice here; we cannot break or return, because this code is not inside any function 👀
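
For what it's worth, the same early-exit intent can be expressed at module level by inverting the check; a sketch with stand-in values for the surrounding setup.py names (logger, target_dir, and the elided download step):

import logging
import os

logger = logging.getLogger(__name__)
target_dir = "vllm/lib/"  # stand-in; the real value comes from setup.py

if not os.path.exists(target_dir + "nvidia"):
    logger.info('Downloading file...')
    # ... the download / unzip steps from the original code would go here ...
else:
    logger.info('NCCL library already present, skipping download.')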

# ncclResult_t ncclGetUniqueId(ncclUniqueId* uniqueId);
_c_ncclGetUniqueId = nccl.ncclGetUniqueId
_c_ncclGetUniqueId.restype = ctypes.c_int
_c_ncclGetUniqueId.argtypes = [ctypes.POINTER(NcclUniqueId)]
Collaborator:

Hmm, should we use a more standard solution like pybind11 here? (How common is it in practice to manually bind C to Python this way?)

@youkaichao (Member Author):

I did this explicitly to avoid pybind11 or any other solution requiring additional building/compilation, to reduce binary dependencies. This code will work for any nccl version, as long as it exposes the ncclAllReduce function symbol.
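
As a concrete illustration of that pattern, binding ncclAllReduce looks just like the ncclGetUniqueId snippet above (a sketch; the signature is copied from nccl.h, and the library path is a hypothetical stand-in for the one resolved by the loader code):

import ctypes

# Hypothetical stand-in for the path resolved by the loader code above.
so_file = "/path/to/libnccl.so.2.18.3"
nccl = ctypes.CDLL(so_file)

# ncclResult_t ncclAllReduce(const void* sendbuff, void* recvbuff, size_t count,
#                            ncclDataType_t datatype, ncclRedOp_t op,
#                            ncclComm_t comm, cudaStream_t stream);
_c_ncclAllReduce = nccl.ncclAllReduce
_c_ncclAllReduce.restype = ctypes.c_int
_c_ncclAllReduce.argtypes = [
    ctypes.c_void_p, ctypes.c_void_p, ctypes.c_size_t,
    ctypes.c_int, ctypes.c_int, ctypes.c_void_p, ctypes.c_void_p,
]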



# manually load the nccl library
_path = os.path.dirname(os.path.abspath(__file__))
so_file = glob.glob(f"{_path}/../../lib/nvidia/nccl/lib/libnccl.so.*")[0]
Collaborator:

Q: is there no conflict with the nccl loaded by torch? (If so, why?)

@youkaichao (Member Author):

No conflict. PyTorch loads libnccl.so.2, while our file is libnccl.so.2.18.3. That's the value of this PR: separating the nccl versions used by vllm and pytorch.
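
A quick Linux-only way to convince yourself of this (a sketch, to be run inside a process that has initialized both torch's NCCL and the vllm wrapper):

import os

# List every libnccl shared object currently mapped into this process; with both
# libraries loaded you should see libnccl.so.2 (torch) and libnccl.so.2.18.3 (vllm).
with open(f"/proc/{os.getpid()}/maps") as f:
    nccl_libs = sorted({line.split()[-1] for line in f if "libnccl" in line})
print(nccl_libs)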

Collaborator:

How does this solution scale to other backends, like AMD GPUs? Does it require another secret .so?

@youkaichao (Member Author):

Our CI does not run on AMD GPUs. We don't know if AMD RCCL has a similar problem. Help is needed!

@youkaichao (Member Author)

> Do you have a benchmark result before/after the PR?

I don't have a benchmark result yet. I suppose it does not change much; there might be some performance variation (up or down) because of the version changes in torch & xformers.

@youkaichao (Member Author)

More supporting evidence for pinning nccl versions:

  1. The nccl team did not consider 30x cudagraphs before; their ideal case is no more than 10 cudagraphs, and we are lucky that nccl 2.18.3 works for us: "Report of increased memory overhead during cudagraph capture with nccl >= 2.19" NVIDIA/nccl#1234 (comment)
  2. It seems quite common for users to pin nccl versions themselves, e.g. "nccl hangs in clusters with mixed cpu vendors (Intel/AMD), due to inconsistent ALGO/PROTO selection in tuning" NVIDIA/nccl#1136 pins nccl 2.17.1, and there are many other reports in https://discuss.pytorch.org/t/how-can-i-change-nccl-version-in-pytorch/143942/13

More supporting evidence for removing the cupy dependency:

We have been getting more and more bug reports about cupy's TCPStore recently, e.g. #3057 and #3617.

@youkaichao (Member Author)

@WoosukKwon mentioned the PyPI package size limit. If we want to avoid the large wheel size, we can have an additional build flag like DOWNLOAD_NCCL, turn it off by default, and turn it on in our CI & docker image. WDYT?
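
To make the proposal concrete, a hypothetical sketch of how such a flag could be read in setup.py (DOWNLOAD_NCCL is just the name floated above; _is_cuda is the existing helper referenced earlier in the review, and the download helper is a stand-in for the logic shown in the review snippets):

import os


def _download_nccl_enabled() -> bool:
    # Off by default; CI and the Docker image would export DOWNLOAD_NCCL=1.
    return os.environ.get("DOWNLOAD_NCCL", "0") == "1"


if _is_cuda() and _download_nccl_enabled():
    _fetch_and_unpack_nccl_wheel()  # stand-in for the download-and-unzip logic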

@WoosukKwon (Collaborator)

> If we want to avoid the large wheel size, we can have an additional build flag like DOWNLOAD_NCCL, and turn it off by default, but turn it on in our CI & docker image. WDYT?

Then, how does it work with the vLLM PyPI package?

Also, a dumb question: can we use an older version of the NVIDIA NCCL package (https://pypi.org/project/nvidia-nccl-cu12/) instead of including it in our package?

@WoosukKwon (Collaborator)

Also, how does this work with AMD GPUs? As you know, AMD GPUs also use the ROCm version of CuPy because of some bugs in using PyTorch RCCL with HIP graphs.

@youkaichao (Member Author)

> can we use an older version of the NVIDIA NCCL package pypi.org/project/nvidia-nccl-cu12 instead of including it in our package?

No, that will break pytorch, because pytorch has a binary dependency on nccl 2.19.3.

@youkaichao (Member Author)

> how does it work with the vLLM PyPI package if we have an additional build flag like DOWNLOAD_NCCL, and turn it off by default, but turn it on in our CI & docker image?

Then the PyPI version of vllm will use nccl 2.19.3, which costs more memory and would have lower performance when using cudagraph.

@youkaichao (Member Author)

> Also, how does this work with AMD GPUs? As you know, AMD GPUs also use the ROCm version of CuPy because of some bugs in using PyTorch RCCL with HIP graphs.

It seems RCCL has exactly the same API as NCCL. Ideally we should be able to just use VLLM_NCCL_SO_PATH to point to a valid RCCL path, and it should work. However, I don't have a working AMD environment, so I cannot test it.

@youkaichao (Member Author)

Benchmarks:

PT 2.1.2, f408d05 (current main): [benchmark results screenshot]

PT 2.2.0, 3d9332a (this branch): [benchmark results screenshot]

PT 2.2.1, 76f46f6 (this branch): [benchmark results screenshot]

@simon-mo (Collaborator)

The benchmark in our CI is really just a placeholder. It doesn't run distributed (tp>1), and it only runs tiny models.

@youkaichao (Member Author)

This PR has been separated into #3625 and #3805

@youkaichao closed this on Apr 3, 2024