Description
"Stochastic" issue happening with training at some point. Training starts okay for x number of epochs and at some point this often happens with Pytorch Lightning
(quite close still to the Build a segmentation workflow (with PyTorch Lightning)
)
, and is probably propagating from Pytorch code? (e.g. fastai/fastai#23)
reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
TypeError: 'NoneType' object is not iterable
I first thought this was caused by the CacheDataset, since it is quite RAM-intensive:
train_ds = CacheDataset(data=datalist_train, transform=train_trans, cache_rate=1, num_workers=4)
val_ds = CacheDataset(data=datalist_val, transform=val_trans, cache_rate=1, num_workers=4)
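If RAM pressure from caching were the cause, a quick check would be to cache only a fraction of the data; a minimal sketch (cache_rate=0.25 is an arbitrary example value, not a recommendation):
from monai.data import CacheDataset

# Sketch: reduce CacheDataset RAM use by caching only part of the dataset.
train_ds = CacheDataset(data=datalist_train, transform=train_trans,
                        cache_rate=0.25,  # cache 25% of items instead of all
                        num_workers=4)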
Even so, the same behavior occurred with the vanilla Dataset:
train_ds = Dataset(data=datalist_train, transform=train_trans)
val_ds = Dataset(data=datalist_val, transform=val_trans)
using the same transforms (train_trans / val_trans) in both cases.
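Since the traceback below goes through the DataLoader worker queue (recvfds passes file descriptors between processes), one way to isolate the problem would be single-process loading with num_workers=0; a diagnostic sketch (the loader names and batch sizes are assumptions, not my actual settings):
from torch.utils.data import DataLoader

# Single-process loading bypasses the multiprocessing fd passing entirely.
train_loader = DataLoader(train_ds, batch_size=2, shuffle=True, num_workers=0)
val_loader = DataLoader(val_ds, batch_size=1, num_workers=0)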
I guess this depends on the environment in which the code is run, but do you have any ideas how to get rid of this?
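One workaround I have seen suggested for this exact RuntimeError (e.g. in the fastai issue linked above) is switching PyTorch's sharing strategy from the Linux default file_descriptor to file_system, so workers stop sending file descriptors over the socket that recvfds reads from; a sketch, not verified here:
import torch.multiprocessing

# Share tensors via the filesystem instead of passing file descriptors
# over a Unix socket between loader workers.
torch.multiprocessing.set_sharing_strategy('file_system')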
Full trace:
MONAI version: 0.2.0
Python version: 3.6.8 |Anaconda, Inc.| (default, Dec 30 2018, 01:22:34) [GCC 7.3.0]
Numpy version: 1.18.5
Pytorch version: 1.5.0
Optional dependencies:
Pytorch Ignite version: 0.3.0
Nibabel version: 3.1.0
scikit-image version: 0.17.2
Pillow version: 7.1.2
Tensorboard version: 2.2.2
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
CUDA_VISIBLE_DEVICES: [0]
| Name | Type | Params
-------------------------------------------
0 | _model | UNet | 4 M
1 | loss_function | DiceLoss | 0
Validation sanity check: 100%|████████████████████████████████████████████████████████████████████████████████████████| 1/1 [00:05<00:00, 5.89s/it]
current epoch: 0 current mean loss: 0.6968 best mean loss: 0.6968 (best dice at that loss 0.0061) at epoch 0
Epoch 1: 100%|███████████████████████████████████████████████████████████████████████████████| 222/222 [08:21<00:00, 2.26s/it, loss=0.604, v_num=0]
...
Epoch 41: 83%|████████████████████████████████████████████████████████████████▋ | 184/222 [06:40<01:22, 2.18s/it, loss=0.281, v_num=0]
Traceback (most recent call last):
trainer.fit(net)
site-packages/pytorch_lightning/trainer/trainer.py", line 918, in fit
self.single_gpu_train(model)
site-packages/pytorch_lightning/trainer/distrib_parts.py", line 176, in single_gpu_train
self.run_pretrain_routine(model)
site-packages/pytorch_lightning/trainer/trainer.py", line 1093, in run_pretrain_routine
self.train()
site-packages/pytorch_lightning/trainer/training_loop.py", line 375, in train
self.run_training_epoch()
site-packages/pytorch_lightning/trainer/training_loop.py", line 445, in run_training_epoch
enumerate(_with_is_last(train_dataloader)), "get_train_batch"
site-packages/pytorch_lightning/profiler/profilers.py", line 64, in profile_iterable
value = next(iterator)
site-packages/pytorch_lightning/trainer/training_loop.py", line 844, in _with_is_last
for val in it:
site-packages/torch/utils/data/dataloader.py", line 345, in __next__
data = self._next_data()
site-packages/torch/utils/data/dataloader.py", line 841, in _next_data
idx, data = self._get_data()
site-packages/torch/utils/data/dataloader.py", line 808, in _get_data
success, data = self._try_get_data()
site-packages/torch/utils/data/dataloader.py", line 761, in _try_get_data
data = self._data_queue.get(timeout=timeout)
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/queues.py", line 113, in get
return _ForkingPickler.loads(res)
site-packages/torch/multiprocessing/reductions.py", line 294, in rebuild_storage_fd
fd = df.detach()
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/resource_sharer.py", line 58, in detach
return reduction.recv_handle(conn)
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 182, in recv_handle
return recvfds(s, 1)[0]
File "/home/petteri/anaconda3/lib/python3.6/multiprocessing/reduction.py", line 161, in recvfds
len(ancdata))
RuntimeError: received 0 items of ancdata
Exception ignored in: <object repr() failed>
Traceback (most recent call last):
site-packages/tqdm/std.py", line 1086, in __del__
site-packages/tqdm/std.py", line 1293, in close
site-packages/tqdm/std.py", line 1471, in display
site-packages/tqdm/std.py", line 1089, in __repr__
site-packages/tqdm/std.py", line 1433, in format_dict
TypeError: 'NoneType' object is not iterable
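For reference, the other mitigation commonly suggested for "received 0 items of ancdata" is raising the process's soft limit on open file descriptors, since the error tends to appear once the fd quota is exhausted; a sketch, untested here:
import resource

# Raise the soft fd limit up to the hard limit, for this process only.
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))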