Status: Closed
Labels: has workaround, high priority, module: crash (Problem manifests as a hard crash, as opposed to a RuntimeError), module: dataloader (Related to torch.utils.data.DataLoader and Sampler), triaged (This issue has been looked at by a team member, and triaged and prioritized into an appropriate module)
Description
Editorial note: if you are having this problem, try running `torch.multiprocessing.set_sharing_strategy('file_system')` right after your import of torch.
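A minimal sketch of that workaround, assuming the default `file_descriptor` sharing strategy is what is in effect:

```python
import torch

# Switch tensor sharing between DataLoader workers from file descriptors
# to named files in shared memory, so each shared storage no longer
# consumes an open file descriptor in the consuming process.
torch.multiprocessing.set_sharing_strategy('file_system')
```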
I am using a `DataLoader` in my code with a custom `Dataset` class, and it worked fine during training for several epochs. However, when testing my model, after a bit less than 1k iterations I'm getting the following error:
```
RuntimeError                              Traceback (most recent call last)
/home/jfsantos/src/pytorch_models/test_model.py in <module>()
     82
     83 print('Generating samples...')
---> 84 for k, batch in tqdm(enumerate(test_loader)):
     85     f = G_test.audio_paths[k]
     86     spec, phase = spectrogram_from_file(f, window=window, step=step)

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/site-packages/tqdm/_tqdm.py in __iter__(self)
    831                 """, fp_write=getattr(self.fp, 'write', sys.stderr.write))
    832
--> 833             for obj in iterable:
    834                 yield obj
    835                 # Update and print the progressbar.

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/utils/data/dataloader.py in __next__(self)
    166         while True:
    167             assert (not self.shutdown and self.batches_outstanding > 0)
--> 168             idx, batch = self.data_queue.get()
    169             self.batches_outstanding -= 1
    170             if idx != self.rcvd_idx:

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/queues.py in get(self)
    343             res = self._reader.recv_bytes()
    344         # unserialize the data after having released the lock
--> 345         return ForkingPickler.loads(res)
    346
    347     def put(self, obj):

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/site-packages/torch/multiprocessing/reductions.py in rebuild_storage_fd(cls, df, size)
     68         fd = multiprocessing.reduction.rebuild_handle(df)
     69     else:
---> 70         fd = df.detach()
     71     try:
     72         storage = storage_from_cache(cls, fd_id(fd))

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/resource_sharer.py in detach(self)
     56         '''Get the fd. This should only be called once.'''
     57         with _resource_sharer.get_connection(self._id) as conn:
---> 58             return reduction.recv_handle(conn)
     59
     60

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/reduction.py in recv_handle(conn)
    179     '''Receive a handle over a local connection.'''
    180     with socket.fromfd(conn.fileno(), socket.AF_UNIX, socket.SOCK_STREAM) as s:
--> 181         return recvfds(s, 1)[0]
    182
    183 def DupFd(fd):

/home/jfsantos/anaconda3/envs/pytorch/lib/python3.5/multiprocessing/reduction.py in recvfds(sock, size)
    158         if len(ancdata) != 1:
    159             raise RuntimeError('received %d items of ancdata' %
--> 160                                len(ancdata))
    161     cmsg_level, cmsg_type, cmsg_data = ancdata[0]
    162     if (cmsg_level == socket.SOL_SOCKET and

RuntimeError: received 0 items of ancdata
```
However, if I just do `idxs = [k for k, batch in tqdm(enumerate(test_loader))]`, I do not have this issue.
I don't really know how to debug this, as my knowledge of how PyTorch shares data between worker processes is currently very limited, but I could help given some instructions. Does anyone have an idea of where I could start?
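Editorial follow-up: given the `file_system` workaround above, one plausible starting point is checking whether the process is exhausting its open-file limit; `recvfds` returning 0 items of ancdata is consistent with the receiver being unable to accept another file descriptor. A hedged diagnostic sketch using only the standard library (the interpretation is an assumption, not a confirmed diagnosis):

```python
import resource

# Assumption: the crash comes from hitting the per-process open-file limit
# (RLIMIT_NOFILE) while DataLoader workers pass tensor storages around as
# file descriptors. Inspect the current limits (equivalent to `ulimit -n`):
soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print('open-file limit: soft=%d, hard=%d' % (soft, hard))

# A possible mitigation besides set_sharing_strategy('file_system'):
# raise the soft limit up to the hard limit before building the DataLoader.
resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
```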