BUG/ENH: consistent gzip compression arguments #35645

twoertwein · 2020-08-09T17:11:31Z

closes Deterministic gzip compressed outputs #28103
tests added / passed
passes black pandas
passes git diff upstream/master -u -- "*.py" | flake8 --diff
whatsnew entry

to_csv lets the user set all keyword arguments for gzip. However, depending on whether the user provides a filename or a file object, different keyword arguments can be set (gzip.open vs. gzip.GzipFile).
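To make the mismatch concrete, here is a small standard-library sketch that compares the two signatures. It only inspects the Python `gzip` module and does not touch pandas; the variable names are illustrative.

```python
import gzip
import inspect

# Parameter names accepted by each entry point of the standard library.
open_params = set(inspect.signature(gzip.open).parameters)
gzipfile_params = set(inspect.signature(gzip.GzipFile).parameters)

# Keyword arguments that only one of the two accepts.
only_open = sorted(open_params - gzipfile_params)
only_gzipfile = sorted(gzipfile_params - open_params)

print("only gzip.open:    ", only_open)
print("only gzip.GzipFile:", only_gzipfile)
```

The first list should contain the text-mode arguments (encoding, errors, newline), and the second should contain mtime, which is why the set of usable keyword arguments depended on how the file was opened.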

This PR always uses gzip.GzipFile. The additional keyword arguments valid for gzip.open but not valid for gzip.GzipFile (encoding, errors, and ~~newline~~) are still accessible:

pandas/pandas/io/common.py

Line 512 in aefae55

g = TextIOWrapper(f, encoding=encoding, errors=errors, newline="")

Using gzip.GzipFile also allows us to set mtime to create reproducible gzip archives.
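A minimal sketch of the approach described above, using only the standard library: open a gzip.GzipFile with a fixed mtime and wrap it in a TextIOWrapper so that encoding and errors remain available. The helper name `write_gzip_text` and the in-memory buffer are assumptions for illustration, not the code in pandas/io/common.py.

```python
import gzip
import io


def write_gzip_text(text: str, mtime: int, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: gzip-compress text with a fixed header timestamp."""
    buffer = io.BytesIO()
    # GzipFile accepts mtime; gzip.open does not.
    gz = gzip.GzipFile(fileobj=buffer, mode="wb", mtime=mtime)
    # The wrapper restores the text-mode arguments that GzipFile lacks.
    # Closing the wrapper also closes gz, which writes the gzip trailer.
    with io.TextIOWrapper(gz, encoding=encoding, errors="strict", newline="") as g:
        g.write(text)
    # GzipFile does not close a file object it was handed, so buffer is still open.
    return buffer.getvalue()


first = write_gzip_text("a,b\n1,2\n", mtime=1)
second = write_gzip_text("a,b\n1,2\n", mtime=1)
print(first == second)
```

With the same mtime the two archives are byte-for-byte identical; leaving mtime unset makes GzipFile store the current time in the header, so repeated writes can differ.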

arw2019

lgtm

might we need to update the docstring or do you think it's good as is?

twoertwein · 2020-08-11T01:54:06Z

Updating the docstring is a good idea, will do that! I assume that this will affect multiple to_* methods. Is there a good strategy to avoid copy-pasting the same docstring multiple times?

arw2019 · 2020-08-11T06:54:34Z

> Updating the docstring is a good idea, will do that! I assume that this will affect multiple to_* methods. Is there a good strategy to avoid copy-pasting the same docstring multiple times?

You could maybe add the more explicit explanation to doc/source/user_guide/io.rst, under "Quoting, compression, and file format", and add brief "Changed in version 1.2.0" notes in the docstrings of the affected methods.

twoertwein · 2020-08-11T17:48:01Z

The PR adding arguments for bz2/gzip #33398 mentioned that it affects to_csv, to_pickle, and to_json in the whatsnew but only updated the docstring for to_csv.

I could make sure that all three to_* methods have the same compression docstring and extend the mtime test case to also cover to_json and to_pickle.

jreback

looks good. slightly OT, we want to add typing for the compression arg (I think we have an issue for this), similar to StorageOptions whereby we define it in pandas._typing.py

pandas/core/generic.py

jreback · 2020-08-12T15:46:37Z

cc @gfyoung @WillAyd @TomAugspurger if any comments.

twoertwein · 2020-08-12T16:11:02Z

> looks good. slightly OT, we want to add typing for the compression arg (I think we have an issue for this), similar to StorageOptions whereby we define it in pandas._typing.py

I will look into that, I assume it is going to be:

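from typing import Optional, TypedDict

class CompressionArgs(TypedDict, total=False):
    method: str  # e.g. "gzip", "bz2", "zip"
    compresslevel: Optional[int]
    mtime: Optional[int]  # gzip.GzipFile only
    compression: int  # zipfile.ZipFile
    allowZip64: bool  # zipfile.ZipFile
    strict_timestamps: bool  # zipfile.ZipFile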

Technically, there are a few more, but users should not pass them (filename, fileobj, buffering (deprecated since Python 3.0), mode).

> I could make sure that all three to_* methods have the same compression docstring and extend the mtime test case to also cover to_json and to_pickle.

Do you have opinions about that? compression does not only affect to_csv.

WillAyd

lgtm - nice PR

pep8speaks · 2020-08-12T17:15:35Z

Hello @twoertwein! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-08-13 02:56:33 UTC

twoertwein · 2020-08-12T17:29:10Z

Oh, I didn't know that TypedDict requires Python 3.8. I will simply use Mapping[str, Optional[Union[str, int, bool]]].
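A rough sketch of what the Mapping-based annotation could look like, together with one way a caller might separate the method name from the remaining keyword arguments. The alias names and the `split_compression` helper are assumptions for illustration; the definitions that actually landed in pandas/_typing.py may differ.

```python
from typing import Dict, Mapping, Optional, Tuple, Union

# Illustrative aliases; the names are assumptions, not confirmed pandas names.
CompressionValue = Optional[Union[str, int, bool]]
CompressionDict = Mapping[str, CompressionValue]
CompressionOptions = Optional[Union[str, CompressionDict]]


def split_compression(
    compression: CompressionOptions,
) -> Tuple[Optional[str], Dict[str, CompressionValue]]:
    """Hypothetical helper: return (method, extra keyword arguments)."""
    if compression is None or isinstance(compression, str):
        # A plain string such as "gzip" carries no extra arguments.
        return compression, {}
    kwargs = dict(compression)  # copy so the caller's mapping is left untouched
    method = kwargs.pop("method", None)
    if method is not None and not isinstance(method, str):
        raise TypeError("'method' must be a string or None")
    return method, kwargs


print(split_compression({"method": "gzip", "mtime": 1}))
```

Unlike a TypedDict, this alias cannot check individual keys, so a misspelled key such as "mtimee" would only fail later, when the arguments are forwarded to the compression backend.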

pandas/tests/io/test_compression.py

twoertwein · 2020-08-13T03:15:04Z

pandas/io/json/_json.py

@@ -816,6 +827,8 @@ def close(self):
                self.open_stream.close()
            except (IOError, AttributeError):
                pass
+        for file_handle in self.file_handles:
+            file_handle.close()


probably unrelated to the recent CI issues, but we should definitely close those handles.

hmm, is there a ResourceWarning?

I haven't seen any when reading/writing json files
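The pattern in the diff above can be sketched outside pandas as follows. The class `TrackedReader` is a hypothetical stand-in, not the JSON reader itself: it records every intermediate handle it opens and closes them all in close(), mirroring the added loop.

```python
import gzip
import io
from typing import IO, Any, List


class TrackedReader:
    """Hypothetical reader that remembers every handle it opens."""

    def __init__(self, raw: IO[bytes]) -> None:
        self.file_handles: List[IO[Any]] = []
        gz = gzip.GzipFile(fileobj=raw, mode="rb")
        self.file_handles.append(gz)  # intermediate handle owned by the reader
        self.open_stream = io.TextIOWrapper(gz, encoding="utf-8")

    def read(self) -> str:
        return self.open_stream.read()

    def close(self) -> None:
        try:
            self.open_stream.close()
        except (IOError, AttributeError):
            pass
        # Closing an already-closed handle is a no-op, so this loop is safe
        # even when closing open_stream has closed the wrapped handle too.
        for file_handle in self.file_handles:
            file_handle.close()


reader = TrackedReader(io.BytesIO(gzip.compress(b'{"a": 1}')))
content = reader.read()
reader.close()
print(content, all(handle.closed for handle in reader.file_handles))
```

In this sketch the wrapper already closes the GzipFile, so the loop is redundant; it matters when a handle is opened but not wrapped by the stream that gets closed.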

jreback · 2020-08-13T22:04:56Z

thanks @twoertwein very nice!

arw2019 approved these changes Aug 10, 2020

jreback added the IO label (Data IO issues that don't fit into a more specific label) Aug 12, 2020

jreback requested changes Aug 12, 2020

pandas/core/generic.py

jreback added this to the 1.2 milestone Aug 12, 2020

WillAyd approved these changes Aug 12, 2020

gfyoung reviewed Aug 12, 2020

pandas/tests/io/test_compression.py

twoertwein added 2 commits August 12, 2020 20:29

io/common: use gzip.GzipFile instead of gzip.open

97b751c

typing for compression

8204c88

twoertwein commented Aug 13, 2020

jreback added the Typing label (type annotations, mypy/pyright type checking) Aug 13, 2020

jreback approved these changes Aug 13, 2020

jreback merged commit 59febbd into pandas-dev:master Aug 13, 2020

dhimmel mentioned this pull request Aug 13, 2021

Deterministic gzip compressed outputs #28103

Closed
