Model Loading Stuck (in Ray?) #1846

Closed

qy1026 opened this issue Nov 30, 2023 · 21 comments

Comments

@qy1026

qy1026 commented Nov 30, 2023

python = 3.11.5
torch = 2.1.0 + cu121
vllm = 0.2.2
GPU: L40 * 4

I installed vllm with "pip install vllm".

Loading the vicuna-7b-v1.5 model gets STUCK when using the vLLM framework, while the FastChat framework works fine.

When I raise a KeyboardInterrupt, the traceback shows it is stuck at:

  File "./ray/_private/worker.py", line 769, in get_objects
    data_metadata_pairs = self.core_worker.get_objects(
  File "python/ray/_raylet.pyx", line 3211, in ray._raylet.CoreWorker.get_objects
  File "python/ray/_raylet.pyx", line 449, in ray._raylet.check_status

KeyboardInterrupt
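Roughly, the script looks like this (a simplified sketch, not the exact file; the model path is shortened and the prompt is just an example):

```python
# Simplified sketch of the kind of script that hangs (paths shortened,
# prompt is only an example); the real script is not posted in full here.
from vllm import LLM, SamplingParams

llm = LLM(
    model="/path/to/vicuna-7b-v1.5",  # local model directory (placeholder)
    tensor_parallel_size=2,           # 2 GPUs, so vLLM starts Ray workers
)

# With tensor_parallel_size > 1 the hang happens while the Ray workers
# initialize, i.e. before generate() is ever reached.
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
for out in outputs:
    print(out.outputs[0].text)
```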

@simon-mo
Collaborator

What's your nvidia driver version and topology? You can get it via nvidia-smi and nvidia-smi topo

@qy1026
Author

qy1026 commented Nov 30, 2023

What's your nvidia driver version and topology? You can get it via nvidia-smi and nvidia-smi topo

Driver Version: 535.129.03
CUDA Version: 12.2

@simon-mo
Collaborator

Thanks. I suspected this might be the same issue as #1801, but it doesn't seem like it. Can you paste your full command-line arguments or script? Additionally, can you run ray stack while it is stuck and paste the output so we can see what is blocked? Lastly, your NCCL version would be useful.

python -c "import torch;print(torch.cuda.nccl.version())"

@qy1026
Author

qy1026 commented Nov 30, 2023

Thanks. I suspected this might be the same issue as #1801, but it doesn't seem like it. Can you paste your full command-line arguments or script? Additionally, can you run ray stack while it is stuck and paste the output so we can see what is blocked? Lastly, your NCCL version would be useful.

python -c "import torch;print(torch.cuda.nccl.version())"
  1. This is what the terminal shows when the code has been stuck for a long time (I use 2 L40s for vicuna-7b-v1.5):

(vv) xxx@vision11:~/llm_projects$ python "/home/xxx/llm_projects/vllm_test.py"
2023-11-30 09:49:16,217 INFO worker.py:1673 -- Started a local Ray instance.
INFO 11-30 09:49:17 llm_engine.py:72] Initializing an LLM engine with config: model='/data2_vision5/xxx/vicuna-7b-v1.5/', tokenizer='/data2_vision5/xxx/vicuna-7b-v1.5/', tokenizer_mode=auto, revision=None, tokenizer_revision=None, trust_remote_code=False, dtype=torch.float16, max_seq_len=4096, download_dir=None, load_format=auto, tensor_parallel_size=2, quantization=None, seed=0)

(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:474] [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
(RayWorker pid=1080652) [E ProcessGroupNCCL.cpp:915] [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
(RayWorker pid=1080652) [2023-11-30 10:19:23,102 E 1080652 1080830] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 1] NCCL watchdog thread terminated with exception: [Rank 1] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800265 milliseconds before timing out.
(RayWorker pid=1080652) [2023-11-30 10:19:23,114 E 1080652 1080830] logging.cc:104: Stack trace:
(RayWorker pid=1080652)  /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf199fa) [0x7fc24121e9fa] ray::operator<<()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf1c1b8) [0x7fc2412211b8] ray::TerminateHandler()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xb135a) [0x7fc24018635a] __cxxabiv1::__terminate()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xb13c5) [0x7fc2401863c5]
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xb134f) [0x7fc24018634f]
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/torch/lib/libtorch_cuda.so(+0xc86dc5) [0x7f92ad1f8dc5] c10d::ProcessGroupNCCL::ncclCommWatchdog()
(RayWorker pid=1080652) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fc2401b0bf4] execute_native_thread_routine
(RayWorker pid=1080652) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fc242f67609] start_thread
(RayWorker pid=1080652) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fc242d32133] __clone
(RayWorker pid=1080652)
(RayWorker pid=1080652) *** SIGABRT received at time=1701310763 on cpu 64 ***
(RayWorker pid=1080652) PC: @     0x7fc242c5600b  (unknown)  raise
(RayWorker pid=1080652)     @     0x7fc242f73420       4048  (unknown)
(RayWorker pid=1080652)     @     0x7fc24018635a  (unknown)  __cxxabiv1::__terminate()
(RayWorker pid=1080652)     @     0x7fc240186070  (unknown)  (unknown)
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361: *** SIGABRT received at time=1701310763 on cpu 64 ***
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361: PC: @     0x7fc242c5600b  (unknown)  raise
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361:     @     0x7fc242f73420       4048  (unknown)
(RayWorker pid=1080652) [2023-11-30 10:19:23,115 E 1080652 1080830] logging.cc:361:     @     0x7fc24018635a  (unknown)  __cxxabiv1::__terminate()
(RayWorker pid=1080652) [2023-11-30 10:19:23,116 E 1080652 1080830] logging.cc:361:     @     0x7fc240186070  (unknown)  (unknown)
(RayWorker pid=1080652) Fatal Python error: Aborted
(RayWorker pid=1080652)
(RayWorker pid=1080652)
(RayWorker pid=1080652) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing
(RayWorker pid=1080652) , pandas._libs.tslibs.conversion
(RayWorker pid=1080652) , pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._json (total: 101)
2023-11-30 10:19:23,326 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffa79f2a2a0976a2c753c732ae01000000 Worker ID: 45477f19145fb3424019689328b818fbf84c547b27e6eb4ffa8fc060 Node ID: 9c6f827858d621904b3d1cfb06057c57480da46c3edab524030f1e58 Worker IP address: 10.9.70.181 Worker port: 45575 Worker PID: 1080652 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
Traceback (most recent call last):
  File "/home/xxx/llm_projects/vllm_test.py", line 35, in <module>
    llm = LLM(
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/entrypoints/llm.py", line 93, in __init__
    self.llm_engine = LLMEngine.from_engine_args(engine_args)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 231, in from_engine_args
    engine = cls(*engine_configs,
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 108, in __init__
    self._init_workers_ray(placement_group)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 181, in _init_workers_ray
    self._run_workers(
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/vllm/engine/llm_engine.py", line 704, in _run_workers
    all_outputs = ray.get(all_outputs)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_private/auto_init_hook.py", line 24, in auto_init_wrapper
    return fn(*args, **kwargs)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_private/client_mode_hook.py", line 103, in wrapper
    return func(*args, **kwargs)
  File "/home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_private/worker.py", line 2565, in get
    raise value
ray.exceptions.RayActorError: The actor died unexpectedly before finishing this task.
        class_name: RayWorker
        actor_id: a79f2a2a0976a2c753c732ae01000000
        pid: 1080652
        namespace: 7cfbf24e-bae4-49d1-9ac2-d4db1cb83a89
        ip: 10.9.70.181
The actor is dead because its worker process has died. Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1080651) Extension modules: msgpack._cmsgpack, google._upb._message, psutil._psutil_linux, psutil._psutil_posix, setproctitle, yaml._yaml, uvloop.loop, ray._raylet, charset_normalizer.md, numpy.core._multiarray_umath, numpy.core._multiarray_tests, numpy.linalg._umath_linalg, numpy.fft._pocketfft_internal, numpy.random._common, numpy.random.bit_generator, numpy.random._bounded_integers, numpy.random._mt19937, numpy.random.mtrand, numpy.random._philox, numpy.random._pcg64, numpy.random._sfc64, numpy.random._generator, torch._C, torch._C._fft, torch._C._linalg, torch._C._nested, torch._C._nn, torch._C._sparse, torch._C._special, pydantic.typing, pydantic.errors, pydantic.version, pydantic.utils, pydantic.class_validators, pydantic.config, pydantic.color, pydantic.datetime_parse, pydantic.validators, pydantic.networks, pydantic.types, pydantic.json, pydantic.error_wrappers, pydantic.fields, pydantic.parse, pydantic.schema, pydantic.main, pydantic.dataclasses, pydantic.annotated_types, pydantic.decorator, pydantic.env_settings, pydantic.tools, pydantic, sentencepiece._sentencepiece, pyarrow.lib, pyarrow._hdfsio, pyarrow._fs, pyarrow._hdfs, pyarrow._gcsfs, pyarrow._s3fs, pandas._libs.tslibs.np_datetime, pandas._libs.tslibs.dtypes, pandas._libs.tslibs.base, pandas._libs.tslibs.nattype, pandas._libs.tslibs.timezones, pandas._libs.tslibs.ccalendar, pandas._libs.tslibs.fields, pandas._libs.tslibs.timedeltas, pandas._libs.tslibs.tzconversion, pandas._libs.tslibs.timestamps, pandas._libs.properties, pandas._libs.tslibs.offsets, pandas._libs.tslibs.strptime, pandas._libs.tslibs.parsing, pandas._libs.tslibs.conversion, pandas._libs.tslibs.period, pandas._libs.tslibs.vectorized, pandas._libs.ops_dispatch, pandas._libs.missing, pandas._libs.hashtable, pandas._libs.algos, pandas._libs.interval, pandas._libs.lib, pandas._libs.ops, pyarrow._compute, pandas._libs.arrays, pandas._libs.tslib, pandas._libs.sparse, pandas._libs.indexing, pandas._libs.index, pandas._libs.internals, pandas._libs.join, pandas._libs.writers, pandas._libs.window.aggregations, pandas._libs.window.indexers, pandas._libs.reshape, pandas._libs.groupby, pandas._libs.json, pandas._libs.parsers, pandas._libs.testing, pyarrow._parquet, pyarrow._json (total: 101)
2023-11-30 10:19:23,769 WARNING worker.py:2074 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. RayTask ID: ffffffffffffffffba61971047f999e0f9ceaab101000000 Worker ID: 0e4b1eeade4874f7080db5807059e066cc128f27de4d1404a6057750 Node ID: 9c6f827858d621904b3d1cfb06057c57480da46c3edab524030f1e58 Worker IP address: 10.9.70.181 Worker port: 34011 Worker PID: 1080651 Worker exit type: SYSTEM_ERROR Worker exit detail: Worker unexpectedly exits with a connection error code 2. End of file. There are some potential root causes. (1) The process is killed by SIGKILL by OOM killer due to high memory usage. (2) ray stop --force is called. (3) The worker is crashed unexpectedly due to SIGSEGV or other unexpected errors.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:474] [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800305 milliseconds before timing out.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:488] Some NCCL operations have failed or timed out. Due to the asynchronous nature of CUDA kernels, subsequent GPU operations might run on corrupted/incomplete data.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:494] To avoid data inconsistency, we are taking the entire process down.
(RayWorker pid=1080651) [E ProcessGroupNCCL.cpp:915] [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800305 milliseconds before timing out.
(RayWorker pid=1080651) [2023-11-30 10:19:23,561 E 1080651 1080832] logging.cc:97: Unhandled exception: St13runtime_error. what(): [Rank 0] NCCL watchdog thread terminated with exception: [Rank 0] Watchdog caught collective operation timeout: WorkNCCL(SeqNum=1, OpType=ALLREDUCE, NumelIn=1, NumelOut=1, Timeout(ms)=1800000) ran for 1800305 milliseconds before timing out.
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:104: Stack trace:
(RayWorker pid=1080651)  /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf199fa) [0x7fecd8c0c9fa] ray::operator<<()
(RayWorker pid=1080651) /home/xxx/anaconda3/envs/vv/lib/python3.10/site-packages/ray/_raylet.so(+0xf1c1b8) [0x7fecd8c0f1b8] ray::TerminateHandler()
(RayWorker pid=1080651)  [repeated 7x across cluster] (Ray deduplicates logs by default. Set RAY_DEDUP_LOGS=0 to disable log deduplication, or see https://docs.ray.io/en/master/ray-observability/ray-logging.html#log-deduplication for more options.)
(RayWorker pid=1080651) /home/xxx/anaconda3/envs/vv/bin/../lib/libstdc++.so.6(+0xdbbf4) [0x7fecd7b9ebf4] execute_native_thread_routine
(RayWorker pid=1080651) /lib/x86_64-linux-gnu/libpthread.so.0(+0x8609) [0x7fecda955609] start_thread
(RayWorker pid=1080651) /lib/x86_64-linux-gnu/libc.so.6(clone+0x43) [0x7fecda720133] __clone
(RayWorker pid=1080651) *** SIGABRT received at time=1701310763 on cpu 72 ***
(RayWorker pid=1080651) PC: @     0x7fecda64400b  (unknown)  raise
(RayWorker pid=1080651)     @     0x7fecd7b7435a  (unknown)  __cxxabiv1::__terminate() [repeated 2x across cluster]
(RayWorker pid=1080651)     @     0x7fecd7b74070  (unknown)  (unknown)
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:361: *** SIGABRT received at time=1701310763 on cpu 72 ***
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:361: PC: @     0x7fecda64400b  (unknown)  raise
(RayWorker pid=1080651) [2023-11-30 10:19:23,568 E 1080651 1080832] logging.cc:361:     @     0x7fecd7b7435a  (unknown)  __cxxabiv1::__terminate() [repeated 2x across cluster]
(RayWorker pid=1080651) [2023-11-30 10:19:23,569 E 1080651 1080832] logging.cc:361:     @     0x7fecd7b74070  (unknown)  (unknown)
(RayWorker pid=1080651) Fatal Python error: Aborted

  2. python -c "import torch;print(torch.cuda.nccl.version())": (2, 18, 1)
  3. I'm not a sudoer, so it seems I can't run ray stack.

@qy1026
Author

qy1026 commented Nov 30, 2023

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.
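For anyone hitting the same thing, one way to see which GPU pairs report peer-to-peer access is a small PyTorch check (just a sketch, not part of vLLM; it only shows what the driver advertises, not whether the link is actually healthy):

```python
# Sketch: print which GPU pairs report P2P access according to the driver.
# A pair that reports access but still hangs in NCCL points at a flaky link;
# nvidia-smi topo -m gives the corresponding topology view.
import torch

n = torch.cuda.device_count()
for i in range(n):
    for j in range(n):
        if i == j:
            continue
        ok = torch.cuda.can_device_access_peer(i, j)
        print(f"GPU {i} -> GPU {j}: peer access {'yes' if ok else 'no'}")
```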

@qy1026 qy1026 closed this as completed Nov 30, 2023
@mirkogolze

We have the same problem running inference with the FastChat vllm-worker. With a standard FastChat worker using HuggingFace Accelerate, parallel inference works, but with the vllm-worker the problem occurs. Accelerate and vLLM use the CUDA libraries in different ways, so perhaps something needs to change in vLLM.

@JenniePing

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

That works for me too, but it's weird 🧐. What's wrong with the connection?

@qy1026
Author

qy1026 commented Dec 7, 2023

We have the same problem running inference with the FastChat vllm-worker. With a standard FastChat worker using HuggingFace Accelerate, parallel inference works, but with the vllm-worker the problem occurs. Accelerate and vLLM use the CUDA libraries in different ways, so perhaps something needs to change in vLLM.

Do you also use L40 to run vLLM?

@qy1026
Author

qy1026 commented Dec 7, 2023

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

That works for me too, but it's weird 🧐. What's wrong with the connection?

Do you also use L40 to run vLLM?

@qy1026
Author

qy1026 commented Dec 7, 2023

When I add "export NCCL_P2P_DISABLE=1" to my ~/.bashrc, the code also works on the previous 2 GPUs.
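If you would rather not touch ~/.bashrc, a per-script variant is sketched below; it assumes Ray is started locally by vLLM so the workers inherit the driver's environment, and exporting the variable in the shell before launching remains the more robust option:

```python
# Sketch: disable NCCL peer-to-peer for this run only. The variable has to be
# set before CUDA/NCCL is initialized, i.e. before the LLM engine (and its Ray
# workers) are created. Assumes Ray is started locally by vLLM so the workers
# inherit this environment; otherwise export it where Ray is started.
import os
os.environ["NCCL_P2P_DISABLE"] = "1"

from vllm import LLM  # imported after the variable is set

llm = LLM(model="/path/to/vicuna-7b-v1.5", tensor_parallel_size=2)  # placeholder path
```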

@mirkogolze

Do you also use L40 to run vLLM?

No, we are using 2x A100 (40GB) with no link between them, and we tried it out with a Tesla T4 on another machine as well.

@mirkogolze

mirkogolze commented Dec 7, 2023

When I add "export NCCL_P2P_DISABLE=1" to my ~/.bashrc, the code also works on the previous 2 GPUs.

Great, this works on both machines, with the A100 and the Tesla T4. Does this have any performance impact? Is vLLM now doing all the coordination work between the GPUs?

@JenniePing

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

That works for me too, but it's weird 🧐. What's wrong with the connection?

Do you also use L40 to run vLLM?

No, I'm using 8x RTX 3090, but this works for me; the "export NCCL_P2P_DISABLE=1" approach works as well.

@JenniePing

When I add "export NCCL_P2P_DISABLE=1" to my ~/.bashrc, the code also works on the previous 2 GPUs.

Great, this works on both machines, with the A100 and the Tesla T4. Does this have any performance impact? Is vLLM now doing all the coordination work between the GPUs?

I think "export NCCL_P2P_DISABLE=1" does have an impact on performance. You can check issue NVIDIA/nccl-tests#117; it is used there on the 4090.

@cangyi071

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

How do I change GPUs? For example, if I have 8 GPUs in a machine, how can I specify GPU 3 and GPU 4?

@qy1026
Author

qy1026 commented Dec 15, 2023

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

How do I change GPUs? For example, if I have 8 GPUs in a machine, how can I specify GPU 3 and GPU 4?

Maybe CUDA_VISIBLE_DEVICES=3,4 would help?
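Something like the sketch below, either on the command line (CUDA_VISIBLE_DEVICES=3,4 python vllm_test.py) or at the very top of the script before anything initializes CUDA; the GPU indices are just the ones from your example and the model path is a placeholder:

```python
# Sketch: make only GPUs 3 and 4 visible to this process. This must happen
# before CUDA is initialized, so set it before importing torch/vllm, or
# export it on the command line instead (CUDA_VISIBLE_DEVICES=3,4 python vllm_test.py).
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "3,4"

from vllm import LLM

# Inside this process the two visible GPUs are renumbered 0 and 1.
llm = LLM(model="/path/to/vicuna-7b-v1.5", tensor_parallel_size=2)  # placeholder path
```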

@mirkogolze

I am using Docker:
docker run --runtime nvidia --gpus '"device=1,2"'

@qy1026
Author

qy1026 commented Dec 15, 2023

I am using Docker: docker run --runtime nvidia --gpus '"device=1,2"'

Well, sorry, I don't know much about Docker commands.

@cangyi071

There seems to be something wrong with the connection between two specific GPUs. When I use the other two GPUs, the code works fine.

How do I change GPUs? For example, if I have 8 GPUs in a machine, how can I specify GPU 3 and GPU 4?

Maybe CUDA_VISIBLE_DEVICES=3,4 would help?

Thank you.

@cangyi071

I am using Docker: docker run --runtime nvidia --gpus '"device=1,2"'

Well, sorry, I don't know much about Docker commands.

The CUDA_VISIBLE_DEVICES=3,4 setting doesn't seem to be effective: although I've set it, the script continues to run on devices 1 and 2. I believe specifying the devices directly in the docker run command would be more useful.

@mirkogolze

I am not sure you are using CUDA_VISIBLE_DEVICES the right way. Have a look at #691 (comment).
