Closed

Commits (64)
1a7d826
Upgrade to `torch==2.2.0`
hmellor Feb 7, 2024
7de363f
Remove `wheel` from `requirements-dev.txt`
hmellor Feb 7, 2024
9bc921d
Revert change to `Dockerfile.rocm`
hmellor Feb 12, 2024
76ab3e7
Kick CI
hmellor Feb 15, 2024
0109fd2
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Feb 15, 2024
4c616a8
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Feb 21, 2024
922aa0c
Update requirements.txt
hmellor Feb 22, 2024
193d73a
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Feb 22, 2024
bfcc926
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Feb 22, 2024
584e6ef
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Mar 4, 2024
daca4e1
Update to 2.2.1
hmellor Mar 4, 2024
015b7d4
Revert "Update to 2.2.1"
hmellor Mar 4, 2024
fef9e03
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Mar 7, 2024
cf400cb
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Mar 12, 2024
d77e855
Merge branch 'main' into pytorch-2.2.0-upgrade
hmellor Mar 15, 2024
75f05de
Update requirements.txt
hmellor Mar 15, 2024
e82cf3a
try to test one distributed at a time
youkaichao Mar 16, 2024
6d10bf5
upgrade to pytorch 2.2.0 by merging 'graphcore/pytorch-2.2.0-upgrade'
youkaichao Mar 16, 2024
4accd02
try pytorch 2.2.1
youkaichao Mar 16, 2024
a92346f
try to fix test
youkaichao Mar 21, 2024
e7f215b
use pip install to resolve the problem
youkaichao Mar 21, 2024
f99fe2a
remove nccl version to test
youkaichao Mar 21, 2024
0f3181f
move to Dockerfile
youkaichao Mar 21, 2024
6ef3843
fix version
youkaichao Mar 21, 2024
7db0e1b
use Dockerfile
youkaichao Mar 21, 2024
62650ae
try 2.2.0 first
youkaichao Mar 21, 2024
4ed16b9
place nccl install after vllm
youkaichao Mar 21, 2024
2d215df
patchelf
youkaichao Mar 21, 2024
0f6f243
update rpath for cupy
youkaichao Mar 21, 2024
da1df5e
try to write a custom pynccl
youkaichao Mar 22, 2024
b4085a1
add wget
youkaichao Mar 22, 2024
f77c9ae
delete logging code
youkaichao Mar 22, 2024
2766418
remove some debugging print
youkaichao Mar 22, 2024
0e18aed
use nccl 2.18.3
youkaichao Mar 23, 2024
7c531b0
add test for pynccl
youkaichao Mar 23, 2024
bbe3622
Merge remote-tracking branch 'origin' into fix_parallel_distributed_test
youkaichao Mar 23, 2024
1abf38e
fix linter
youkaichao Mar 23, 2024
5d661a6
update cupy_utils to pynccl
youkaichao Mar 23, 2024
99f96d7
rename cupy_utils to pynccl_utils
youkaichao Mar 23, 2024
b567f04
update import
youkaichao Mar 23, 2024
74fcf08
update pytorch in cmake
youkaichao Mar 23, 2024
43da101
add test with cudagraph
youkaichao Mar 23, 2024
37e7425
fix test; fix TORCH_CUDA_ARCH_LIST
youkaichao Mar 23, 2024
7e983f5
fix amd tests
youkaichao Mar 23, 2024
e3f8d5f
add pynccl test
youkaichao Mar 23, 2024
4e277ae
pack up libnccl.so
youkaichao Mar 23, 2024
a20d802
add .so in setup.py, and use programmatic path in pynccl
youkaichao Mar 23, 2024
dfc9d82
rename cupy --> pynccl
youkaichao Mar 23, 2024
8a5a011
rename cupy --> pynccl
youkaichao Mar 23, 2024
a009e31
rename cupy --> pynccl
youkaichao Mar 23, 2024
68e4792
rename cupy --> pynccl
youkaichao Mar 23, 2024
0a6fab1
fix wget install order
youkaichao Mar 23, 2024
a82a976
rename cupy --> pynccl
youkaichao Mar 23, 2024
1c6ec48
fix so filename and search path
youkaichao Mar 23, 2024
47ff82a
fix dockerfile
youkaichao Mar 23, 2024
b0c15c2
fix dockerfile
youkaichao Mar 23, 2024
0b4f7dd
download and use MANIFEST.in to force keeping .so file
youkaichao Mar 23, 2024
7942050
download and use MANIFEST.in to force keeping .so file
youkaichao Mar 23, 2024
20a3ec4
restore dockerfile
youkaichao Mar 23, 2024
0ca27b7
add lib file to package data
youkaichao Mar 23, 2024
a3c2340
add libnccl.so.2.18.3 via hard-coding
youkaichao Mar 23, 2024
71e2976
enable VLLM_NCCL_SO_PATH at runtime
youkaichao Mar 25, 2024
3d9332a
nit, os.makedirs(target_dir, exist_ok=True)
youkaichao Mar 25, 2024
76f46f6
upgrade to pt 2.2.1
youkaichao Mar 25, 2024
14 changes: 12 additions & 2 deletions .buildkite/test-pipeline.yaml
@@ -22,8 +22,18 @@ steps:
  working_dir: "/vllm-workspace/tests/distributed"
  num_gpus: 2 # only support 1 or 2 for now.

- label: Distributed Correctness Test
  command: pytest -v -s --forked test_basic_distributed_correctness.py
- label: Distributed pynccl Test
  command: pytest -v -s --forked test_pynccl.py
  working_dir: "/vllm-workspace/tests/distributed"
  num_gpus: 2 # only support 1 or 2 for now.

- label: Distributed Correctness Test-facebook/opt-125m
  command: TEST_DIST_MODEL=facebook/opt-125m pytest -v -s --forked test_basic_distributed_correctness.py
  working_dir: "/vllm-workspace/tests/distributed"
  num_gpus: 2 # only support 1 or 2 for now.

- label: Distributed Correctness Test-meta-llama/Llama-2-7b-hf
  command: TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest -v -s --forked test_basic_distributed_correctness.py
  working_dir: "/vllm-workspace/tests/distributed"
  num_gpus: 2 # only support 1 or 2 for now.

2 changes: 1 addition & 1 deletion .github/workflows/publish.yml
@@ -49,7 +49,7 @@ jobs:
matrix:
os: ['ubuntu-20.04']
python-version: ['3.8', '3.9', '3.10', '3.11']
pytorch-version: ['2.1.2'] # Must be the most recent version that meets requirements.txt.
pytorch-version: ['2.2.1'] # Must be the most recent version that meets requirements.txt.
cuda-version: ['11.8', '12.1']

steps:
5 changes: 4 additions & 1 deletion CMakeLists.txt
@@ -15,6 +15,9 @@ set(PYTHON_SUPPORTED_VERSIONS "3.8" "3.9" "3.10" "3.11")
# Supported NVIDIA architectures.
set(CUDA_SUPPORTED_ARCHS "7.0;7.5;8.0;8.6;8.9;9.0")

# used when building pytorch-related extensions
set(TORCH_CUDA_ARCH_LIST "7.0;7.5;8.0;8.6;8.9;9.0")
Member Author:
Note that PyTorch 2.2.0 has 9.0a support by default:

https://github.com/pytorch/pytorch/blob/19d27a13ea052230d9fb565a5b82e683e28d1697/Dockerfile#L60

while the nvcc in our Docker image does not support 9.0a.

Collaborator:
Doesn't this make our build system always compile the CUDA kernels for all architectures?

If I remember correctly, we only compiled the kernels for a single architecture by detecting the GPUs equipped on the user's machine (I'm not sure this is still true after we changed our build system to CMake, though), to reduce compile time. As an exception, we targeted all architectures when building Docker images or PyPI wheels.

Member Author:
This is used in the Docker image. It seems CMake inherits the build architecture list from PyTorch by default, so I have to set it explicitly (to avoid the 9.0a architecture, which the nvcc in our Docker image does not support).
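For illustration, a minimal sketch of the same idea at the Python level; the helper below and the rule of dropping `a`-suffixed arches are assumptions for illustration, not code from this PR:

```python
import os
import torch


def safe_torch_cuda_arch_list() -> str:
    """Build a TORCH_CUDA_ARCH_LIST without 'a' variants such as 9.0a."""
    arches = []
    # torch.cuda.get_arch_list() reports what the installed wheel was built
    # for, e.g. ['sm_70', 'sm_75', 'sm_80', 'sm_86', 'sm_90', 'sm_90a'].
    for arch in torch.cuda.get_arch_list():
        name = arch[len("sm_"):]
        if name.endswith("a"):  # skip arch variants the Docker nvcc rejects
            continue
        arches.append(f"{name[:-1]}.{name[-1]}")  # '90' -> '9.0'
    return ";".join(arches)


# torch.utils.cpp_extension honors this env var; whether a given CMake build
# does depends on how it queries torch (an assumption in this sketch).
os.environ.setdefault("TORCH_CUDA_ARCH_LIST", safe_torch_cuda_arch_list())
```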


# Supported AMD GPU architectures.
set(HIP_SUPPORTED_ARCHS "gfx908;gfx90a;gfx942;gfx1100")

@@ -28,7 +31,7 @@ set(HIP_SUPPORTED_ARCHS "gfx908;gfx90a;gfx942;gfx1100")
# requirements.txt files and should be kept consistent. The ROCm torch
# versions are derived from Dockerfile.rocm
#
set(TORCH_SUPPORTED_VERSION_CUDA "2.1.2")
set(TORCH_SUPPORTED_VERSION_CUDA "2.2.1")
set(TORCH_SUPPORTED_VERSION_ROCM_5X "2.0.1")
set(TORCH_SUPPORTED_VERSION_ROCM_6X "2.1.1")

3 changes: 3 additions & 0 deletions Dockerfile
@@ -15,6 +15,9 @@ RUN ldconfig /usr/local/cuda-12.1/compat/

WORKDIR /workspace

# used for downloading files
RUN apt install -y wget unzip

# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
1 change: 1 addition & 0 deletions MANIFEST.in
@@ -4,3 +4,4 @@ include CMakeLists.txt

recursive-include cmake *
recursive-include csrc *
recursive-include vllm/lib *
2 changes: 1 addition & 1 deletion pyproject.toml
@@ -5,7 +5,7 @@ requires = [
"ninja",
"packaging",
"setuptools >= 49.4.0",
"torch == 2.1.2",
"torch == 2.2.1",
"wheel",
]
build-backend = "setuptools.build_meta"
2 changes: 1 addition & 1 deletion requirements-build.txt
@@ -3,5 +3,5 @@ cmake>=3.21
ninja
packaging
setuptools>=49.4.0
torch==2.1.2
torch==2.2.1
wheel
4 changes: 2 additions & 2 deletions requirements.txt
@@ -4,9 +4,9 @@ psutil
ray >= 2.9
sentencepiece # Required for LLaMA tokenizer.
numpy
torch == 2.1.2
torch == 2.2.1
xformers == 0.0.25 # Requires PyTorch 2.2.1.
transformers >= 4.39.0 # Required for StarCoder2.
xformers == 0.0.23.post1 # Required for CUDA 12.1.
fastapi
uvicorn[standard]
pydantic >= 2.0 # Required for OpenAI server.
57 changes: 55 additions & 2 deletions setup.py
@@ -11,6 +11,12 @@
from shutil import which
import torch
from torch.utils.cpp_extension import CUDA_HOME
import zipfile
import shutil
import logging
import tempfile

logger = logging.getLogger(__name__)

ROOT_DIR = os.path.dirname(__file__)

@@ -188,6 +194,48 @@ def _install_punica() -> bool:
return bool(int(os.getenv("VLLM_INSTALL_PUNICA_KERNELS", "0")))


if _is_cuda():

# tricky part: NCCL 2.19 has a bug that increases the memory overhead of
# CUDA graphs. However, PyTorch has a binary dependency on NCCL 2.19, so
# simply running `pip install nvidia-nccl-cu12==2.18.3` would break PyTorch.
# We therefore download NCCL 2.18 manually and keep the library in a
# private location.

# Define the URL of the file and the directory to unzip to
file_url = ('https://files.pythonhosted.org/packages/44/6e/'
Collaborator:
consider using a constant?

Collaborator:
Actually, I wonder if we can support an env var so that we can also load an arbitrary .so instead of always downloading our own copy at build time.

Member Author:
Detecting an env var at runtime is good.

W.r.t. downloading our own copy at build time, we have to do this because the NCCL brought in by torch==2.2.0 does not work for us (see the runtime-loading sketch after the setup.py diff below).

'3c9cd7007072f8a63dae7b5eddd1cc1525fd357377467ce3a4749b02d5ff'
'/nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl')

logger.info('Installing NVIDIA NCCL library...')

target_dir = os.path.dirname(os.path.abspath(__file__)) + "/vllm/lib/"
with tempfile.TemporaryDirectory() as temp_dir:
local_zip_path = (
f"{temp_dir}/"
"nvidia_nccl_cu12-2.18.3-py3-none-manylinux1_x86_64.whl")
Collaborator:
Does vllm currently support amd arch (the wheel is only for x86)?

Member Author:
Do you mean arm64? We don't need to consider AMD here because we are inside the `if _is_cuda():` branch.

# make sure the target directory exists
os.makedirs(target_dir, exist_ok=True)
# Check if the file is already downloaded
if os.path.exists(target_dir + "nvidia"):
Collaborator:
nit, but

if os.path.exists(target_dir + "nvidia"):
    break

# Download the file
logger.info('Downloading file...')
....
....

Member Author:
We have no choice here, we cannot break or return, because it is not inside any function 👀

logger.info('library already exists.')
else:
# Download the file
logger.info('Downloading file...')
os.system(f"wget {file_url} -q -P {temp_dir}/")
# Unzip the file
logger.info('Unzipping file...')
with zipfile.ZipFile(local_zip_path, 'r') as zip_ref:
zip_ref.extractall(temp_dir)
shutil.rmtree(f"{temp_dir}/nvidia_nccl_cu12-2.18.3.dist-info")
os.remove(local_zip_path)
# Move the unzipped files to the target directory
logger.info('Moving files...')
os.system(f"mv {temp_dir}/nvidia {target_dir}")
so_path = f"{target_dir}/nvidia/nccl/lib/libnccl.so.2"
os.rename(so_path, so_path.replace(".so.2", ".so.2.18.3"))


def get_hipcc_rocm_version():
# Run the hipcc --version command
result = subprocess.run(['hipcc', '--version'],
@@ -330,7 +378,10 @@ def get_requirements() -> List[str]:
ext_modules.append(CMakeExtension(name="vllm._C"))

package_data = {
"vllm": ["py.typed", "model_executor/layers/fused_moe/configs/*.json"]
"vllm": [
"py.typed", "model_executor/layers/fused_moe/configs/*.json",
"lib/nvidia/nccl/lib/libnccl.so.2.18.3"
]
}
if os.environ.get("VLLM_USE_PRECOMPILED"):
package_data["vllm"].append("*.so")
@@ -362,6 +413,8 @@ def get_requirements() -> List[str]:
python_requires=">=3.8",
install_requires=get_requirements(),
ext_modules=ext_modules,
cmdclass={"build_ext": cmake_build_ext} if not _is_neuron() else {},
cmdclass={
"build_ext": cmake_build_ext if not _is_neuron() else build_ext,
},
package_data=package_data,
)
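Because the library is now bundled under `vllm/lib`, the runtime side needs a corresponding lookup. Below is a minimal sketch of that lookup, honoring the `VLLM_NCCL_SO_PATH` override mentioned in the commit log; the helper name and fallback order are illustrative assumptions, not the code merged in this PR:

```python
import ctypes
import glob
import os


def find_nccl_library() -> str:
    """Resolve a libnccl shared object, preferring an explicit override."""
    # 1. Explicit override, useful for system-provided or custom NCCL builds.
    so_path = os.environ.get("VLLM_NCCL_SO_PATH", "")
    if so_path:
        return so_path
    # 2. The copy that setup.py bundled under vllm/lib (see diff above).
    import vllm
    pkg_dir = os.path.dirname(vllm.__file__)
    candidates = glob.glob(
        os.path.join(pkg_dir, "lib", "nvidia", "nccl", "lib", "libnccl.so.*"))
    if candidates:
        return candidates[0]
    # 3. Fall back to whatever the dynamic linker can find.
    return "libnccl.so.2"


# Loading via ctypes keeps the wrapper independent of the NCCL that PyTorch
# itself links against.
nccl = ctypes.CDLL(find_nccl_library())
```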
16 changes: 13 additions & 3 deletions tests/distributed/test_basic_distributed_correctness.py
@@ -1,13 +1,23 @@
"""Compare the outputs of HF and distributed vLLM when using greedy sampling.

Run `pytest tests/distributed/test_basic_distributed_correctness.py --forked`.
vLLM will allocate all the available memory, so we need to run the tests one
by one. The solution is to pass the model name via an environment variable.
Run:

```sh
TEST_DIST_MODEL=facebook/opt-125m pytest \
    test_basic_distributed_correctness.py
TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf pytest \
    test_basic_distributed_correctness.py
```
"""
import os
import pytest
import torch

MODELS = [
"facebook/opt-125m",
"meta-llama/Llama-2-7b-hf",
os.environ["TEST_DIST_MODEL"],
]


88 changes: 88 additions & 0 deletions tests/distributed/test_pynccl.py
@@ -0,0 +1,88 @@
# This test is run with `pytest` (see .buildkite/test-pipeline.yaml);
# each test spawns its own worker processes via `multiprocessing`.
import os
import multiprocessing
import pytest
import torch
from vllm.model_executor.parallel_utils.pynccl import (
    NCCLCommunicator,
    ncclGetUniqueId,
)


def distributed_run(fn, world_size):
    number_of_processes = world_size
    processes = []
    for i in range(number_of_processes):
        env = os.environ.copy()
        env['RANK'] = str(i)
        env['WORLD_SIZE'] = str(number_of_processes)
        env['MASTER_ADDR'] = 'localhost'
        env['MASTER_PORT'] = '12345'
        p = multiprocessing.Process(target=fn, args=(env, ))
        processes.append(p)
        p.start()

    for p in processes:
        p.join()


def update_env(fn):

    def wrapper(env):
        import os
        os.environ.update(env)
        fn()

    return wrapper


@update_env
def worker_fn():
    comm = NCCLCommunicator()
    tensor = torch.ones(16, 1024, 1024, dtype=torch.float32).cuda(comm.rank)
    comm.all_reduce(tensor)
    result = tensor.mean().cpu().item()
    assert result == comm.world_size


@pytest.mark.skipif(torch.cuda.device_count() < 2,
                    reason="Need at least 2 GPUs to run the test.")
def test_pynccl():
    distributed_run(worker_fn, 2)


@update_env
def worker_fn_with_cudagraph():
    with torch.no_grad():
        graph = torch.cuda.CUDAGraph()
        comm = NCCLCommunicator()
        # run something in the default stream to initialize torch engine
        a = torch.ones((4, 4), device=f'cuda:{comm.rank}')
        torch.cuda.synchronize()
        with torch.cuda.graph(graph, stream=comm.stream):
            # ops issued during graph capture are recorded, not executed
            comm.all_reduce(a)
        comm.stream.synchronize()
        assert a.mean().cpu().item() == comm.world_size**0
        graph.replay()
        comm.stream.synchronize()
        # one replay executes the captured all-reduce exactly once
        assert a.mean().cpu().item() == comm.world_size**1


@pytest.mark.skipif(torch.cuda.device_count() < 2,
                    reason="Need at least 2 GPUs to run the test.")
def test_pynccl_with_cudagraph():
    distributed_run(worker_fn_with_cudagraph, 2)


def test_ncclGetUniqueId():
    unique_id = ncclGetUniqueId()
    # `list(unique_id.internal)` is something like this:
    # [34, -16, 23, 83, 109, -19, 59, 95, 2, 0, -86, 55, 10, -128, 0, 29, 0,
    # 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    # 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    # 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    # 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
    # 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
    # as long as the function doesn't raise an exception, we're good
    assert unique_id is not None
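The `pynccl` module imported above is added elsewhere in this PR and is not shown in this diff. As a heavily hedged sketch of the idea (a thin ctypes wrapper over the bundled `libnccl.so`, so that the collective can be issued from Python and captured into a CUDA graph), it might look roughly like the following; the class name, the gloo-based unique-id exchange, and the hard-coded NCCL enum values are illustrative assumptions rather than the PR's actual implementation:

```python
# Illustrative sketch only, not the pynccl module added by this PR.
import ctypes
import os

import torch
import torch.distributed as dist

# Path resolution as in the loader sketch above.
_nccl = ctypes.CDLL(os.environ.get("VLLM_NCCL_SO_PATH", "libnccl.so.2"))


class NcclUniqueId(ctypes.Structure):
    _fields_ = [("internal", ctypes.c_byte * 128)]


def nccl_get_unique_id() -> NcclUniqueId:
    uid = NcclUniqueId()
    assert _nccl.ncclGetUniqueId(ctypes.byref(uid)) == 0
    return uid


class PyNcclCommunicatorSketch:
    """Minimal NCCL communicator driven from Python via ctypes."""

    def __init__(self):
        self.rank = int(os.environ["RANK"])
        self.world_size = int(os.environ["WORLD_SIZE"])
        torch.cuda.set_device(self.rank)
        self.stream = torch.cuda.Stream(device=self.rank)

        # Rank 0 creates the unique id; a CPU (gloo) group broadcasts it,
        # reusing MASTER_ADDR/MASTER_PORT set by `distributed_run` above.
        dist.init_process_group(backend="gloo")
        uid = nccl_get_unique_id() if self.rank == 0 else NcclUniqueId()
        buf = torch.tensor([b & 0xFF for b in uid.internal],
                           dtype=torch.uint8)
        dist.broadcast(buf, src=0)
        ctypes.memmove(uid.internal, bytes(buf.tolist()), 128)

        self._comm = ctypes.c_void_p()
        assert _nccl.ncclCommInitRank(ctypes.byref(self._comm),
                                      self.world_size, uid, self.rank) == 0

    def all_reduce(self, tensor: torch.Tensor) -> None:
        # 7 == ncclFloat32, 0 == ncclSum in nccl.h.
        result = _nccl.ncclAllReduce(
            ctypes.c_void_p(tensor.data_ptr()),
            ctypes.c_void_p(tensor.data_ptr()),
            tensor.numel(), 7, 0, self._comm,
            ctypes.c_void_p(self.stream.cuda_stream))
        assert result == 0
```

With a wrapper of this shape, `worker_fn` above only needs to construct the communicator and call `all_reduce` on a CUDA tensor.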
8 changes: 4 additions & 4 deletions vllm/model_executor/parallel_utils/communication_op.py
@@ -4,12 +4,12 @@
import torch
from torch.distributed import ProcessGroup

from vllm.model_executor.parallel_utils import cupy_utils
from vllm.model_executor.parallel_utils import pynccl_utils
from vllm.model_executor.parallel_utils.parallel_state import (
get_tensor_model_parallel_rank,
get_tensor_model_parallel_world_size,
get_tensor_model_parallel_group,
is_cupy_nccl_enabled_for_all_reduce,
is_pynccl_enabled_for_all_reduce,
)
from vllm.model_executor.parallel_utils.custom_all_reduce import (
custom_all_reduce)
@@ -33,9 +33,9 @@ def tensor_model_parallel_all_reduce(input_: torch.Tensor) -> torch.Tensor:
    out = custom_all_reduce(input_)
    if out is not None:
        return out
    if is_cupy_nccl_enabled_for_all_reduce():
    if is_pynccl_enabled_for_all_reduce():
        # TODO: support multiple parallel groups.
        cupy_utils.all_reduce(input_)
        pynccl_utils.all_reduce(input_)
    else:
        torch.distributed.all_reduce(input_,
                                     group=get_tensor_model_parallel_group())
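For reference, `pynccl_utils` keeps the thin module-level shape that `cupy_utils` had. A minimal sketch of that shape, with illustrative names and under the same assumptions as the earlier sketches:

```python
# Illustrative sketch of the module-level wrapper, not the PR's exact code.
from typing import Optional

import torch

from vllm.model_executor.parallel_utils.pynccl import NCCLCommunicator

_COMM: Optional[NCCLCommunicator] = None  # set during distributed init


def init_process_group() -> None:
    global _COMM
    _COMM = NCCLCommunicator()


def is_initialized() -> bool:
    return _COMM is not None


def all_reduce(input_: torch.Tensor) -> None:
    assert _COMM is not None, "pynccl is not initialized"
    _COMM.all_reduce(input_)  # in-place all-reduce on the comm's stream
```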