GCSFileSystem() hangs when called from multiple processes #379

@JackKelly

What happened:
In the two most recent releases of gcsfs (0.8.0 and 2021.04.0), calling gcsfs.GCSFileSystem() from multiple processes hangs, with no error message, if gcsfs.GCSFileSystem() has already been called in the same Python interpreter session.

The bug is not present in gcsfs 0.7.2 (with fsspec 0.8.7): all the code examples below run without hanging on that version.

Minimal Complete Verifiable Example:

The examples below assume that either gcsfs 2021.04.0 (with fsspec 2021.04.0) or gcsfs 0.8.0 (with fsspec 0.9.0) is installed.
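Since the behaviour depends on the exact gcsfs and fsspec versions, it may help to confirm what is actually installed before running the examples. This is just a small sketch using the standard library's importlib.metadata (available from Python 3.8); the helper name installed_version is made up for illustration.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version string of `package`, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Print the versions relevant to this report (None means "not installed").
for name in ("gcsfs", "fsspec"):
    print(name, installed_version(name))
```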

Create a fresh conda environment: conda create --name test_gcsfs python=3.8 gcsfs ipykernel

The last block of this code hangs:

from concurrent.futures import ProcessPoolExecutor
import gcsfs

# This line works fine! (And it's fine to repeat this line multiple times.)
gcs = gcsfs.GCSFileSystem()

# This block hangs, with no error messages:
with ProcessPoolExecutor() as executor:
    for i in range(8):
        future = executor.submit(gcsfs.GCSFileSystem)

If we skip the initial gcs = gcsfs.GCSFileSystem() call, the code works fine. The next example runs without hanging when started in a fresh Python interpreter. The only difference from the previous example is that gcs = gcsfs.GCSFileSystem() has been removed.

from concurrent.futures import ProcessPoolExecutor
import gcsfs

# This works fine:
with ProcessPoolExecutor() as executor:
    for i in range(8):
        future = executor.submit(gcsfs.GCSFileSystem)

Likewise, creating a ProcessPoolExecutor repeatedly works the first time but hangs on subsequent attempts:

from concurrent.futures import ProcessPoolExecutor
import gcsfs

def process_pool():
    with ProcessPoolExecutor(max_workers=1) as executor:
        for i in range(8):
            future = executor.submit(gcsfs.GCSFileSystem)

# The first attempt works fine:
process_pool()

# This second attempt hangs:
process_pool()
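Because the hang produces no error message, a timeout can make it easier to detect from a script. The following is only a rough sketch, not part of the original reproduction: the helper finishes_within is a made-up name, it uses a bare multiprocessing.Process rather than ProcessPoolExecutor, and it explicitly requests the "fork" start method (the default on Linux in Python 3.8, POSIX only).

```python
import multiprocessing


def finishes_within(target, timeout=30.0):
    """Run target() in a forked child; return True if it exits cleanly in time."""
    # Assumption: "fork" matches the default start method on Linux in Python 3.8.
    ctx = multiprocessing.get_context("fork")
    proc = ctx.Process(target=target)
    proc.start()
    proc.join(timeout)
    if proc.is_alive():
        # Still running after the timeout: treat it as a hang and clean up.
        proc.terminate()
        proc.join()
        return False
    # A zero exit code means target() returned without raising.
    return proc.exitcode == 0
```

With gcsfs installed, one could call gcsfs.GCSFileSystem() once in the parent and then check finishes_within(gcsfs.GCSFileSystem). I have not verified that a bare forked process hangs in exactly the same way as the ProcessPoolExecutor examples above.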

Anything else we should know:

Thank you so much for all your hard work on gcsfs - it's a hugely useful tool! Sorry to be reporting a bug!

I tested all this code in a Jupyter Lab notebook.

This issue might be related to this Stack Overflow question: https://stackoverflow.com/questions/66283634/use-gcsfilesystem-with-multiprocessing

Environment:

  • Dask version: Not installed
  • Python version: 3.8
  • Operating System: Ubuntu 20.10
  • Install method: conda, from conda-forge, using a fresh conda environment.
