Skip to content

BUG/ENH: consistent gzip compression arguments #35645

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 2 commits into from
Aug 13, 2020
Merged

BUG/ENH: consistent gzip compression arguments #35645

merged 2 commits into from
Aug 13, 2020

Conversation

twoertwein
Copy link
Member

@twoertwein twoertwein commented Aug 9, 2020

to_csv let's the user set all keyword arguments for gzip. Depending on whether the user provides a filename or a file object different keyword arguments can be set (gzip.open vs gzip.GzipFile).

This PR always uses gzip.GzipFile. The additional keyword arguments valid for gzip.open but not valid for gzip.GzipFile (encoding, errors, and newline) are still accessible:

g = TextIOWrapper(f, encoding=encoding, errors=errors, newline="")

Using gzip.GzipFile, also allows us to set mtime to create reproducible gzip archives.

Copy link
Member

@arw2019 arw2019 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

might we need to update the docstring or do you think it's good as is?

@twoertwein
Copy link
Member Author

updating the doc string is a good idea, will do that! I assume that this will affect multiple to_* methods. Is there a good strategy instead of copy pasting the same docstring multiple times?

@arw2019
Copy link
Member

arw2019 commented Aug 11, 2020

updating the doc string is a good idea, will do that! I assume that this will affect multiple to_* methods. Is there a good strategy instead of copy pasting the same docstring multiple times?

You could maybe add the more explicit explanation to doc/source/user_guide/io.rst, under Quoting, compression, and file format, and add brief Changed in version 1.2.0 notes in docstrings of affected methods

@twoertwein
Copy link
Member Author

The PR adding arguments for bz2/gzip #33398 mentioned that it affects to_csv, to_pickle, and to_json in the whatsnew but only updated the docstring for to_csv.

I could make sure that all three to_* methods have the same compression docstring and extend the mtime test case to also cover to_json and to_pickle.

@jreback jreback added the IO Data IO issues that don't fit into a more specific label label Aug 12, 2020
Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks good. sligthly OT, we want to add typing for the compression arg (I think we have an issue for this), similar to StorageOptions whereby we define it in pandas._typing.py

@jreback jreback added this to the 1.2 milestone Aug 12, 2020
@jreback
Copy link
Contributor

jreback commented Aug 12, 2020

cc @gfyoung @WillAyd @TomAugspurger if comments.

@twoertwein
Copy link
Member Author

twoertwein commented Aug 12, 2020

looks good. sligthly OT, we want to add typing for the compression arg (I think we have an issue for this), similar to StorageOptions whereby we define it in pandas._typing.py

I will look into that, I assume it is going to be:

class CompressionArgs(TypedDict, total=False):
    method: str
    compresslevel: Optional[int]
    mtime:Optional[int]
    compression:int
    allowZip64:bool
    strict_timestamps:bool

technically, there are a few more but users should not pass them (filename, fileobj, buffer (deprecated since python 3.0), mode).

I could make sure that all three to_* methods have the same compression docstring and extend the mtime test case to also cover to_json and to_pickle.

Do you have opinions about that? compression does not only affect to_csv.

Copy link
Member

@WillAyd WillAyd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm - nice PR

@pep8speaks
Copy link

pep8speaks commented Aug 12, 2020

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-13 02:56:33 UTC

@twoertwein
Copy link
Member Author

twoertwein commented Aug 12, 2020

oh, I didn't know that TypedDict requires python 3.8. I will simply use Mapping[str, Optional[Union[str, int, bool]]].

@@ -816,6 +827,8 @@ def close(self):
self.open_stream.close()
except (IOError, AttributeError):
pass
for file_handle in self.file_handles:
file_handle.close()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

probably unrelated to the recent CI issues, but we should definitely close those handles.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm, is there a ResoucceWarning?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I haven't seen any when reading/writing json files

@jreback jreback added the Typing type annotations, mypy/pyright type checking label Aug 13, 2020
@jreback jreback merged commit 59febbd into pandas-dev:master Aug 13, 2020
@jreback
Copy link
Contributor

jreback commented Aug 13, 2020

thanks @twoertwein very nice!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
IO Data IO issues that don't fit into a more specific label Typing type annotations, mypy/pyright type checking
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Deterministic gzip compressed outputs
6 participants