Model Loading Stuck (in Ray?) #1846

Closed

qy1026 opened this issue Nov 30, 2023 · 21 comments

Comments

@qy1026

qy1026 commented Nov 30, 2023

python = 3.11.5
torch = 2.1.0 + cu121
vllm = 0.2.2
GPU: L40 * 4

I installed vllm with "pip install vllm".

Loading the vicuna-7b-v1.5 model gets STUCK when using the vLLM framework, while the FastChat framework works fine.

When I raise a KeyboardInterrupt, the traceback shows it is stuck at:

  File "./ray/_private/worker.py", line 769, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3211, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.check_status

KeyboardInterrupt
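Roughly, the script looks like this (a simplified sketch, not the exact file; the model path is shortened and the prompt is just an example):

```python
# Simplified sketch of the kind of script that hangs (paths shortened,
# prompt is only an example); the real script is not posted in full here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/vicuna-7b-v1.5",  # local model directory (placeholder)
    tensor_parallel_size=2,           # 2 GPUs, so vLLM starts Ray workers
)

# With tensor_parallel_size > 1 the hang happens while the Ray workers
# initialize, i.e. before generate() is ever reached.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```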

@simon-mo
Collaborator

What's your nvidia driver version and topology? You can get it via nvidia-smi and nvidia-smi topo

@qy1026
Author

qy1026 commented Nov 30, 2023

What's your nvidia driver version and topology? You can get it via nvidia-smi and nvidia-smi topo

Driver Version: 535.129.03
CUDA Version: 12.2

@simon-mo
Collaborator

Thanks. I suspected this might be the same issue as #1801, but it doesn't seem like it. Can you paste your full command-line arguments or script? Additionally, can you run ray stack while it is stuck and paste the output so we can see what is blocked? Lastly, your NCCL version would be useful.

python -c "import torch;print(torch.cuda.nccl.version())"

@qy1026
Author

qy1026 commented Nov 30, 2023

Thanks. I suspected this might be the same issue as #1801, but it doesn't seem like it. Can you paste your full command-line arguments or script? Additionally, can you run ray stack while it is stuck and paste the output so we can see what is blocked? Lastly, your NCCL version would be useful.

python -c "import torch;print(torch.cuda.nccl.version())"
  1. This is what the terminal shows when the code has been stuck for a long time (I use 2 L40s for vicuna-7b-v1.5):

(vv) xxx@vision11:~/llm_projects$ python "/home/xxx/llm_projects/vllm_test.py"
2023-11-30 09:49:16,217 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-30 09:49:17 llm_engine.py:72] Initializing an LLM engine with config: model='/data2_vision5/xxx/vicuna-7b-v1.5/', tokenizer='/data2_vision5/xxx/vicuna-7b-v1.5/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)

(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
(RayWorker pid=1080652) [2023-11-30 10:19:23,102 E 1080652 1080830] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
(RayWorker pid=1080652) [2023-11-30 10:19:23,114 E 1080652 1080830] logging.cc:104: Stack trace:
(RayWorker pid=1080652)  /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf199fa) [0x7fc24121e9fa] ray::operator<<()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf1c1b8) [0x7fc2412211b8] ray::TerminateHandler()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc24018635a] __cxxabiv1::__terminate()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc2401863c5]
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc24018634f]
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xc86dc5) [0x7f92ad1f8dc5] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc2401b0bf4] execute_native_thread_routine
(RayWorker pid=1080652) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fc242f67609] start_thread
(RayWorker pid=1080652) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fc242d32133] __clone
(RayWorker pid=1080652)
(RayWorker pid=1080652) *** SIGABRT received at time=1701310763 on cpu 64 ***
(RayWorker pid=1080652) PC: @     0x7fc242c5600b  (unknown)  raise
(RayWorker pid=1080652)     @     0x7fc242f73420       4048  (unknown)
(RayWorker pid=1080652)     @     0x7fc24018635a  (unknown)  __cxxabiv1::__terminate()
(RayWorker pid=1080652)     @     0x7fc240186070  (unknown)  (unknown)
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361: *** SIGABRT received at time=1701310763 on cpu 64 ***
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361: PC: @     0x7fc242c5600b  (unknown)  raise
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361:     @     0x7fc242f73420       4048  (unknown)
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361:     @     0x7fc24018635a  (unknown)  __cxxabiv1::__terminate()
(RayWorker pid=1080652) [2023-11-30 10:19:23,116 E 1080652 1080830] logging.cc:361:     @     0x7fc240186070  (unknown)  (unknown)
(RayWorker pid=1080652) Fatal Python error: Aborted
(RayWorker pid=1080652)
(RayWorker pid=1080652)
(RayWorker pid=1080652) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing
(RayWorker pid=1080652) , pandas._libs.tslibs.conversion
(RayWorker pid=1080652) , pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._json (total: 101)
2023-11-30 10:19:23,326 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa79f2a2a0976a2c753c732ae01000000 Worker ID: 45477f19145fb3424019689328b818fbf84c547b27e6eb4ffa8fc060 Node ID: 9c6f827858d621904b3d1cfb06057c57480da46c3edab524030f1e58 Worker IP address: 10.9.70.181 Worker port: 45575 Worker PID: 1080652 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "/home/xxx/llm_projects/vllm_test.py", line 35, in <module>
    llm = LLM(
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 181, in _init_workers_ray
    self._run_workers(
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 704, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: RayWorker
        actor_id: a79f2a2a0976a2c753c732ae01000000
        pid: 1080652
        namespace: 7cfbf24e-bae4-49d1-9ac2-d4db1cb83a89
        ip: 10.9.70.181
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1080651) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._json (total: 101)
2023-11-30 10:19:23,769 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffba61971047f999e0f9ceaab101000000 Worker ID: 0e4b1eeade4874f7080db5807059e066cc128f27de4d1404a6057750 Node ID: 9c6f827858d621904b3d1cfb06057c57480da46c3edab524030f1e58 Worker IP address: 10.9.70.181 Worker port: 34011 Worker PID: 1080651 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800305 milliseconds before timing out.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800305 milliseconds before timing out.
(RayWorker pid=1080651) [2023-11-30 10:19:23,561 E 1080651 1080832] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800305 milliseconds before timing out.
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:104: Stack trace:
(RayWorker pid=1080651)  /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf199fa) [0x7fecd8c0c9fa] ray::operator<<()
(RayWorker pid=1080651) /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf1c1b8) [0x7fecd8c0f1b8] ray::TerminateHandler()
(RayWorker pid=1080651)  [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorker pid=1080651) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fecd7b9ebf4] execute_native_thread_routine
(RayWorker pid=1080651) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fecda955609] start_thread
(RayWorker pid=1080651) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fecda720133] __clone
(RayWorker pid=1080651) *** SIGABRT received at time=1701310763 on cpu 72 ***
(RayWorker pid=1080651) PC: @     0x7fecda64400b  (unknown)  raise
(RayWorker pid=1080651)     @     0x7fecd7b7435a  (unknown)  __cxxabiv1::__terminate() [repeated 2x across cluster]
(RayWorker pid=1080651)     @     0x7fecd7b74070  (unknown)  (unknown)
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:361: *** SIGABRT received at time=1701310763 on cpu 72 ***
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:361: PC: @     0x7fecda64400b  (unknown)  raise
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:361:     @     0x7fecd7b7435a  (unknown)  __cxxabiv1::__terminate() [repeated 2x across cluster]
(RayWorker pid=1080651) [2023-11-30 10:19:23,569 E 1080651 1080832] logging.cc:361:     @     0x7fecd7b74070  (unknown)  (unknown)
(RayWorker pid=1080651) Fatal Python error: Aborted

  2. python -c "import torch;print(torch.cuda.nccl.version())": (2, 18, 1)
  3. I'm not a sudoer, so it seems I can't run ray stack.

@qy1026
Author

qy1026 commented Nov 30, 2023

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.
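For anyone hitting the same thing, one way to see which GPU pairs report peer-to-peer access is a small PyTorch check (just a sketch, not part of vLLM; it only shows what the driver advertises, not whether the link is actually healthy):

```python
# Sketch: print which GPU pairs report P2P access according to the driver.
# A pair that reports access but still hangs in NCCL points at a flaky link;
# nvidia-smi topo -m gives the corresponding topology view.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```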

@qy1026 qy1026 closed this as completed Nov 30, 2023
@mirkogolze

We have the same problem running inference with the FastChat vllm-worker. With a standard FastChat worker using HuggingFace Accelerate, parallel inference works, but with the vllm-worker the problem occurs. Accelerate and vLLM use the CUDA libraries in different ways, so perhaps something needs to change in vLLM.

@JenniePing

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

That works for me too, but it's weird 🧐. What's wrong with the connection?

@qy1026
Author

qy1026 commented Dec 7, 2023

We have the same problem running inference with the FastChat vllm-worker. With a standard FastChat worker using HuggingFace Accelerate, parallel inference works, but with the vllm-worker the problem occurs. Accelerate and vLLM use the CUDA libraries in different ways, so perhaps something needs to change in vLLM.

Do you also use L40 to run vLLM?

@qy1026
Author

qy1026 commented Dec 7, 2023

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

That works for me too, but it's weird 🧐. What's wrong with the connection?

Do you also use L40 to run vLLM?

@qy1026
Author

qy1026 commented Dec 7, 2023

When I add "export NCCL_P2P_DISABLE=1" to my ~/.bashrc, the code also works on the previous 2 GPUs.
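If you would rather not touch ~/.bashrc, a per-script variant is sketched below; it assumes Ray is started locally by vLLM so the workers inherit the driver's environment, and exporting the variable in the shell before launching remains the more robust option:

```python
# Sketch: disable NCCL peer-to-peer for this run only. The variable has to be
# set before CUDA/NCCL is initialized, i.e. before the LLM engine (and its Ray
# workers) are created. Assumes Ray is started locally by vLLM so the workers
# inherit this environment; otherwise export it where Ray is started.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM  # imported after the variable is set

llm = LLM(model="/path/to/vicuna-7b-v1.5", tensor_parallel_size=2)  # placeholder path
```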

@mirkogolze

Do you also use L40 to run vLLM?

No, we are using 2x A100 (40GB) with no link between them, and we tried it out with a Tesla T4 on another machine as well.

@mirkogolze

mirkogolze commented Dec 7, 2023

When I add "export NCCL_P2P_DISABLE=1" to my ~/.bashrc, the code also works on the previous 2 GPUs.

Great, this works on both machines, with the A100 and the Tesla T4. Does this have any performance impact? Is vLLM now doing all the coordination work between the GPUs?

@JenniePing

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

That works for me too, but it's weird 🧐. What's wrong with the connection?

Do you also use L40 to run vLLM?

No, I'm using 8x RTX 3090, but this works for me; the "export NCCL_P2P_DISABLE=1" approach works as well.

@JenniePing

When I add "export NCCL_P2P_DISABLE=1" to my ~/.bashrc, the code also works on the previous 2 GPUs.

Great, this works on both machines, with the A100 and the Tesla T4. Does this have any performance impact? Is vLLM now doing all the coordination work between the GPUs?

I think "export NCCL_P2P_DISABLE=1" does have an impact on performance. You can check issue NVIDIA/nccl-tests#117; it is used there on the 4090.

@cangyi071

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

How do I change GPUs? For example, if I have 8 GPUs in a machine, how can I specify GPU 3 and GPU 4?

@qy1026
Author

qy1026 commented Dec 15, 2023

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

How do I change GPUs? For example, if I have 8 GPUs in a machine, how can I specify GPU 3 and GPU 4?

Maybe CUDA_VISIBLE_DEVICES=3,4 would help?
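Something like the sketch below, either on the command line (CUDA_VISIBLE_DEVICES=3,4 python vllm_test.py) or at the very top of the script before anything initializes CUDA; the GPU indices are just the ones from your example and the model path is a placeholder:

```python
# Sketch: make only GPUs 3 and 4 visible to this process. This must happen
# before CUDA is initialized, so set it before importing torch/vllm, or
# export it on the command line instead (CUDA_VISIBLE_DEVICES=3,4 python vllm_test.py).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from vllm import LLM

# Inside this process the two visible GPUs are renumbered 0 and 1.
llm = LLM(model="/path/to/vicuna-7b-v1.5", tensor_parallel_size=2)  # placeholder path
```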

@mirkogolze

I am using Docker:
docker run --runtime nvidia --gpus '"device=1,2"'

@qy1026
Author

qy1026 commented Dec 15, 2023

I am using Docker: docker run --runtime nvidia --gpus '"device=1,2"'

Well, sorry, I don't know much about Docker commands.

@cangyi071

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

How do I change GPUs? For example, if I have 8 GPUs in a machine, how can I specify GPU 3 and GPU 4?

Maybe CUDA_VISIBLE_DEVICES=3,4 would help?

Thank you.

@cangyi071

I am using Docker: docker run --runtime nvidia --gpus '"device=1,2"'

Well, sorry, I don't know much about Docker commands.

The CUDA_VISIBLE_DEVICES=3,4 setting doesn't seem to be effective: although I've set it, the script continues to run on devices 1 and 2. I believe specifying the devices directly in the docker run command would be more useful.

@mirkogolze

I am not sure you are using CUDA_VISIBLE_DEVICES the right way. Have a look at #691 (comment).
