Skip to content

BUG: fix the bad error raised by HDFStore.put() #38919

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Jan 5, 2021

Conversation

1MLightyears
Copy link
Contributor

@1MLightyears 1MLightyears commented Jan 3, 2021

  • closes BUG: small error in pandas/io/pytables.py #34274
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
    P.S. something was wrong with git diff upstream/master so I directly ran flake8 ./pandas/io/pytables.py as it's the only file changed

I was running into the same problem as #34274 and found where the error is. But (as I've just started using pandas for several days:sweet_smile:) I have few knowledge about pandas, so maybe I didn't make the full use of pandas' components.

As seen in #34274, if a DataFrame contains non-string elements and is about to be written into an HDF5 file by HDFStore.put(), a TypeError: object of type 'int' has no len() error is raised.

But it's not the right "error" expected. HDFStore.put() can't serialize some types of element, so report an exception on a column by column basis is actually needed.

This commit fixes this, now it raises TypeError: Cannot serialize the column [{column_No}] because its data contents are not string but [{non_string_type}] object dtype as expected.

@pep8speaks
Copy link

pep8speaks commented Jan 3, 2021

Hello @1MLightyears! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2021-01-05 00:15:24 UTC

@1MLightyears
Copy link
Contributor Author

The interesting thing is that in the failed tests (like Database / Linux_py37_cov (pull_request)), two of the test case were reported to be wrong, which both include the wrong raised error "object of type 'int' has no len()":
in pandas\tests\io\pytables\test_store.py, line 2052~2060:

# test append with invalid input to get good error messages

# list in column
df = tm.makeDataFrame()
df["invalid"] = [["a"]] * len(df)
assert df.dtypes["invalid"] == np.object_
msg = re.escape("object of type 'int' has no len()")
with pytest.raises(TypeError, match=msg):
    store.append("df", df)

line 2222~2226:

with ensure_clean_store(setup_path) as store:
    # this fails because we have a date in the object block......
    msg = "object of type 'int' has no len()"
    with pytest.raises(TypeError, match=msg):
        store.append("df_unimplemented", df)

should I change them too?

@jreback
Copy link
Contributor

jreback commented Jan 3, 2021

The interesting thing is that in the failed tests (like Database / Linux_py37_cov (pull_request)), two of the test case were reported to be wrong, which both include the wrong raised error "object of type 'int' has no len()":
in pandas\tests\io\pytables\test_store.py, line 2052~2060:

# test append with invalid input to get good error messages

# list in column
df = tm.makeDataFrame()
df["invalid"] = [["a"]] * len(df)
assert df.dtypes["invalid"] == np.object_
msg = re.escape("object of type 'int' has no len()")
with pytest.raises(TypeError, match=msg):
    store.append("df", df)

line 2222~2226:

with ensure_clean_store(setup_path) as store:
    # this fails because we have a date in the object block......
    msg = "object of type 'int' has no len()"
    with pytest.raises(TypeError, match=msg):
        store.append("df_unimplemented", df)

should I change them too?

yes these were wrong in the first place, your change should fix them.

pls also add a whatsnew note (1.3 bug fixes I/O)

@jreback jreback added Error Reporting Incorrect or improved errors from pandas IO HDF5 read_hdf, HDFStore labels Jan 3, 2021
* Remove bad error raised

* Add column label information when raising error for debugging

* Update test cases and comments
@1MLightyears
Copy link
Contributor Author

Now it's using a new approach to show the error column label, meanwhile keep the interface part as-is.

The test cases is also updated.

And ... as a new comer, actually I don't exactly know what a whatsnew note is 😅 I copied a commit format from git log and write my own in commit 99005b8, if I made that wrong, pls tell me and I'm willing to correct it.

About the two failed test cases, looks like it's involved with HTTP server, and should not be business of the HDF5 I/O part I commit. No idea about it so far.

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nan_rep,
encoding,
errors,
block_columns: List[str] = [],
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove the default arg, the annotation is enough

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the default [] is used to keep the interface consistency.
but since the function is only called once I think it's okay to remove, maybe
will remove it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes and never should you use mutable arguments

):
# block_columns(list[str]): the label of columns for debug info use.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add a doc-string template e.g. Parameter / Returns (just list the arg names, and document the new one)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

got it.
but giving a private function-which is only internally called once-a docstring looks a bit deluxe...
will add docstring.

):
"""
Ensure all elements in the given block are "string".
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok this is more confusing, let's just remove this doc-string entirely

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry english is not my first language and I don't totally understand why non-strings aren't okay to be written in HDF5.
it's just literally do this thing, Ensure all elements in the given block are "string".
It is now removed.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the elements are not actual strings but python objects we cann't serialize to this column type.

@jreback jreback added this to the 1.3 milestone Jan 4, 2021
@@ -271,6 +271,7 @@ I/O
- Allow custom error values for parse_dates argument of :func:`read_sql`, :func:`read_sql_query` and :func:`read_sql_table` (:issue:`35185`)
- Bug in :func:`to_hdf` raising ``KeyError`` when trying to apply
for subclasses of ``DataFrame`` or ``Series`` (:issue:`33748`).
- Bug in :func:`put` raising a wrong ``TypeError`` when saving a DataFrame with non-string dtype (:issue:`34274`)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs to be :meth:`~HDFStore.put`

say this is an improved error message

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh thank you for that
It's morning in my timezone here and I just wake up 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

heheh, no prob this PR looks good. just a small change.

@jreback jreback merged commit df2c7df into pandas-dev:master Jan 5, 2021
@jreback
Copy link
Contributor

jreback commented Jan 5, 2021

thanks @1MLightyears

luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Error Reporting Incorrect or improved errors from pandas IO HDF5 read_hdf, HDFStore
Projects
None yet
Development

Successfully merging this pull request may close these issues.

BUG: small error in pandas/io/pytables.py
3 participants