Description
I'd like to report an issue with FastSeq where GPU memory is not released after a crash when running with a large batch size (BS).
Impact:
I can reproduce it every time. Because the memory is not released after the crash, I am concerned that if this package is released to users, they may hit the same issue and find it hard to recover from.
How to reproduce:
I tested on the gpu0 machine.
Below are the detailed steps to reproduce this issue:
- Run the docker image:
sudo docker run --gpus all --privileged --name fastseq_dev_py3_tiy -it adsbrainwestus2.azurecr.io/fastseq:dev-py3 /bin/bash
Inside the container:
- Create an RSA key and add it to the GitHub account (just to make it easy to download the code)
- mkdir tiy && cd tiy
- Install the latest fastseq:
git clone git@github.com:microsoft/fastseq.git
cd fastseq
pip install --editable ./
- cd benchmarks
- Set LOOP in utils.sh to 1
- Run nvidia-smi the first time: no memory occupation, which is expected.
- Run ./benchmark.sh fairseq+fastseq bart.large.cnn cnn_dm/len-1024.bin valid 256
It failed with a Bus error:
Processing Loop=1/1 Util=fairseq_v0.9.0+fastseq_v0.0.3 Model=bart.large.cnn Task=cnn_dm/len-1024.bin Split=valid BS=256
benchmark_seq.sh: line 55: 533 Bus error (core dumped) $util $data_dir --path $model_path --fp16 --task translation --batch-size $bs --gen-subset $split --truncate-source --bpe gpt2 --beam 4 --num-workers 4 --min-len 55 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen 2.0#--print-alignment#--print-step # KeyError: steps--skip-invalid-size-inputs-valid-test $* > $STDOUT_FILE 2> $STDERR_FILE
Failed at benchmark_seq.sh (line 80): $util $data_dir --path $model_path --fp16 --task translation --batch-size $bs --gen-subset $split --truncate-source --bpe gpt2 --beam 4 --num-workers 4 --min-len 55 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen 2.0#--print-alignment#--print-step # KeyError: steps--skip-invalid-size-inputs-valid-test $* > $STDOUT_FILE 2> $STDERR_FILE
- Run nvidia-smi the second time: there is memory occupation on GPU0.
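In case it helps others who hit this, a possible way to free the leaked memory is to find and kill the processes that still hold it. This is only a sketch based on my assumption that the memory is held by orphaned worker processes left behind after the crash (the actual PIDs will differ):
# list processes still holding GPU memory
nvidia-smi --query-compute-apps=pid,used_memory --format=csv
# alternatively, list everything that has the NVIDIA device files open
fuser -v /dev/nvidia*
# kill the leftover python processes from the crashed run (replace <pid> with the actual PID)
kill -9 <pid>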
Other information:
I reran it 5 times to check whether there was any information in fastseq.stderr. Most of the time, there was no error message at all.
- 4 times, there was no error message in fastseq.stderr:
root@6e86574394fb:/workspace/tiy/fastseq/benchmarks# cat /tmp/fastseq.stderr
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
- 1 time, an EOFError was recorded in fastseq.stderr:
root@6e86574394fb:/workspace/tiy/fastseq/benchmarks# cat /tmp/fastseq.stderr
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Traceback (most recent call last):
File "/opt/conda/lib/python3.6/multiprocessing/resource_sharer.py", line 142, in _serve
with self._listener.accept() as conn:
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 456, in accept
answer_challenge(c, self._authkey)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
message = connection.recv_bytes(256) # reject large message
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
buf = self._recv(4)
File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 383, in recv
raise EOFError
EOFError
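For what it is worth, the EOFError comes from the multiprocessing resource sharer, which together with the Bus error looks consistent with dataloader worker processes dying, possibly because the container runs out of shared memory. This is only my guess, not a confirmed root cause. Two things that could be tried (both use standard docker/fairseq options, but I have not verified that either one fixes the issue):
# give the container a larger /dev/shm (or use --ipc=host) when starting it
sudo docker run --gpus all --privileged --shm-size=8g --name fastseq_dev_py3_tiy -it adsbrainwestus2.azurecr.io/fastseq:dev-py3 /bin/bash
# and/or edit benchmark_seq.sh to pass --num-workers 0 so no worker processes are spawned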

