Skip to content

FTPFileSystem incompatibility with pandas/pyarrow #705

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
matteosantama opened this issue Jul 14, 2021 · 7 comments
Open

FTPFileSystem incompatibility with pandas/pyarrow #705

matteosantama opened this issue Jul 14, 2021 · 7 comments

Comments

@matteosantama
Copy link

I've run across an issue, and to be honest I'm not quite sure which library it belongs in.

It starts with the line

df.to_parquet(
    path=path,
    partition_cols=partition_cols,
    storage_options=fs.storage_options
)

which takes me through the pandas library until reaching this function in pyarrow

https://github.com/apache/arrow/blob/e2238582e2a2bf20a68a967145fe1a7b2337a997/python/pyarrow/parquet.py#L1891

which internally calls the helper function _mkdir_if_not_exists()

https://github.com/apache/arrow/blob/e2238582e2a2bf20a68a967145fe1a7b2337a997/python/pyarrow/parquet.py#L1883

The problem is that for FTPFileSystem, this helper function does not actually create the directory because of this line

https://github.com/intake/filesystem_spec/blob/85bb2f3fef2aa12f7ec8497ea116c78c644b49ec/fsspec/spec.py#L1219

The simplest solution would be to override this _is_filestore() method in FTPFileSystem to return True.

@matteosantama
Copy link
Author

Upon further investigation, simply creating the directory before trying to store to parquet does not work because of the partition.

The creation of the partition directories also depend on the _mkdir_if_not_exists() method

https://github.com/apache/arrow/blob/e2238582e2a2bf20a68a967145fe1a7b2337a997/python/pyarrow/parquet.py#L2015

@martindurant
Copy link
Member

Hm, I don't know - people are certainly using fsspec with arrow/parquet. _isfilestore is supposed to indicate that the class is an exact implementation (subclass) of arrow's internal file system type; this is enforced for pyarrow<2.0, but arrow should cope with all our filesystems after that. Do you think there's something specific to FTP, for example see this note ?

@matteosantama
Copy link
Author

I have a test that runs the snippet for local, s3, and ftp filesystems. Local passes because _isfilestore() == True, s3 passes because you do not need to create a directory to save a file, and ftp fails because it is trying to create a file in a non-existent directory.

If the _isfilestore() flag is meant to indicate whether the filesystem is a subclass of arrow's internal file system type, I don't know why we would check it in the _mkdir_if_not_exists() call.

If you feel confident that FTPFileSystem should not have _isfilestore() == True, then this is a problem with pyarrow.

@martindurant
Copy link
Member

It may well be something non-ideal in arrow, but you should also be aware that both the filesystem and parquet/dataset APIs have been replaced in recent times.
What could be done on our end, if to have an auto_mkdir parameter which, if true, will do makedirs for every open(), as happens in LocalFileSystem.

@matteosantama
Copy link
Author

Oh I see. At what point would that get set to True though? I imagine it would default to False

@martindurant
Copy link
Member

Maybe it should depend on the pyarrow version (cc #411 , @jorisvandenbossche )

@jorisvandenbossche
Copy link
Contributor

Late answer:

  • On the actual issue of _isfilestore / FTPFileSystem: I never fully understood this _isfilestore, and I rather not change anything about that on the Arrow side (since it's in the "legacy" code path anyway). If setting it to True for FTPFileSystem helps for you, you can maybe patch it for yourself.
  • When using an ftp filesystem path with to_parquet, currently that still uses the legacy implementation (including _mkdir_if_not_exists). The better option probably is to already opt in to the new implementation by specifying df.to_parquet(..., use_legacy_dataset=False). In that case, it will still work with an FTPFileSystem, but based on the new filesystem implementation (which will automatically wrap fsspec based filesystems if it gets those).
    (at some point we will change this default, and for reading that's already the case, but for the writing path not yet)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants