
Memory not released after crash with large BS on FastSeq #26

@yetingqiaqia

Description

I'd like to report an issue where FastSeq does not release GPU memory after crashing with a large batch size (BS).

Impact:
I can reproduce it every time. Because the GPU memory is not released after the crash, I am afraid that if we release this package, users may hit the same issue and will not find it easy to recover from.

How to reproduce:
I tested on the gpu0 machine.
Below are the detailed steps to reproduce the issue:

  • Run the Docker image:
    sudo docker run --gpus all --privileged --name fastseq_dev_py3_tiy -it adsbrainwestus2.azurecr.io/fastseq:dev-py3 /bin/bash

  • Inside the container:

  1. Create an RSA key and add it to your GitHub account (just to make it easy to download the code)
  2. mkdir tiy && cd tiy
  3. Install the latest fastseq:
    git clone git@github.com:microsoft/fastseq.git
    cd fastseq
    pip install --editable ./
  4. cd benchmarks
    Set LOOP in utils.sh to 1
  5. Run nvidia-smi for the first time; no memory is occupied, as expected:

[screenshot: nvidia-smi showing no GPU memory in use]

  6. Run ./benchmark.sh fairseq+fastseq bart.large.cnn cnn_dm/len-1024.bin valid 256
    It failed with a bus error:
    Processing Loop=1/1 Util=fairseq_v0.9.0+fastseq_v0.0.3 Model=bart.large.cnn Task=cnn_dm/len-1024.bin Split=valid BS=256
    benchmark_seq.sh: line 55: 533 Bus error (core dumped) $util $data_dir --path $model_path --fp16 --task translation --batch-size $bs --gen-subset $split --truncate-source --bpe gpt2 --beam 4 --num-workers 4 --min-len 55 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen 2.0 #--print-alignment #--print-step # KeyError: steps --skip-invalid-size-inputs-valid-test $* > $STDOUT_FILE 2> $STDERR_FILE
    Failed at benchmark_seq.sh (line 80): $util $data_dir --path $model_path --fp16 --task translation --batch-size $bs --gen-subset $split --truncate-source --bpe gpt2 --beam 4 --num-workers 4 --min-len 55 --max-len-b 140 --no-repeat-ngram-size 3 --lenpen 2.0 #--print-alignment #--print-step # KeyError: steps --skip-invalid-size-inputs-valid-test $* > $STDOUT_FILE 2> $STDERR_FILE

  7. Run nvidia-smi a second time; memory is still occupied on GPU0 (a cleanup sketch follows the screenshot):

[screenshot: nvidia-smi showing leaked memory on GPU0]
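For reference, here is how I reclaim the leaked memory manually today. This is only a workaround sketch, not a fix: it assumes the leaked memory is held by orphaned worker processes that survived the crash, and it kills every compute process on the GPU, so it should only be run when nothing else is using the machine.

    # List the processes currently holding GPU memory, if any.
    nvidia-smi --query-compute-apps=pid,process_name,used_gpu_memory --format=csv

    # Kill them all. Indiscriminate: do not run while other GPU jobs are live.
    nvidia-smi --query-compute-apps=pid --format=csv,noheader | xargs -r kill -9

    # Confirm the memory has been released.
    nvidia-smi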

Other information:
I re-ran the benchmark 5 times to check whether fastseq.stderr contained any useful information. Most of the time it contained no error message at all (two guesses about what is going on follow the traceback below).

  • 4 times, there was no error message in fastseq.stderr:

root@6e86574394fb:/workspace/tiy/fastseq/benchmarks# cat /tmp/fastseq.stderr
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "

  • 1 time, an EOFError was recorded in fastseq.stderr:

root@6e86574394fb:/workspace/tiy/fastseq/benchmarks# cat /tmp/fastseq.stderr
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
/opt/conda/lib/python3.6/site-packages/torch/distributed/distributed_c10d.py:102: UserWarning: torch.distributed.reduce_op is deprecated, please use torch.distributed.ReduceOp instead
warnings.warn("torch.distributed.reduce_op is deprecated, please use "
Traceback (most recent call last):
  File "/opt/conda/lib/python3.6/multiprocessing/resource_sharer.py", line 142, in _serve
    with self._listener.accept() as conn:
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 456, in accept
    answer_challenge(c, self._authkey)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 732, in answer_challenge
    message = connection.recv_bytes(256)         # reject large message
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 407, in _recv_bytes
    buf = self._recv(4)
  File "/opt/conda/lib/python3.6/multiprocessing/connection.py", line 383, in recv
    raise EOFError
EOFError
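Both the bus error and this EOFError look consistent with a worker process dying mid-run. One unverified guess: the PyTorch DataLoader workers (--num-workers 4) exchange tensors through /dev/shm, and Docker's default /dev/shm is only 64 MB, which is a classic cause of "Bus error (core dumped)". If that is the cause, starting the container with a larger shared-memory segment might avoid the crash, and therefore the leak, entirely:

    # Unverified guess: give the container more shared memory than Docker's
    # 64 MB default so the DataLoader workers do not hit SIGBUS on /dev/shm.
    sudo docker run --gpus all --privileged --shm-size=16g \
        --name fastseq_dev_py3_tiy -it \
        adsbrainwestus2.azurecr.io/fastseq:dev-py3 /bin/bash

(--ipc=host would have a similar effect. Even if this avoids the crash, the leak after a crash still seems worth fixing on its own.)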
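As for why fastseq.stderr is usually empty: the "Bus error (core dumped)" text is printed by the parent shell, not by the crashing process, so the 2> redirection in benchmark_seq.sh never captures it. A hypothetical helper like the one below (not part of the repo; the names are mine) could at least record which signal killed the run:

    # Hypothetical wrapper: run a command, append its stderr to the log, and
    # record the fatal signal, which the shell normally reports only to the
    # interactive terminal.
    STDERR_FILE=/tmp/fastseq.stderr
    run_and_log() {
        "$@" 2>> "$STDERR_FILE"
        local status=$?
        # In bash, exit status 128+N means the child died from signal N
        # (SIGBUS is 7, so a bus error shows up as status 135).
        if [ "$status" -ge 128 ]; then
            echo "killed by signal $((status - 128))" >> "$STDERR_FILE"
        fi
        return "$status"
    }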
