BUG: Fix using "inf"/"-inf" in na_values for csv with int index column #22169

Merged
jreback merged 14 commits into pandas-dev:master on Aug 9, 2018

Conversation

@Templarrr Templarrr commented Aug 2, 2018

The issue happens when you try to use 'inf' or '-inf' as part of na_values in read_csv.

Code snippet to reproduce:

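try:
    from StringIO import StringIO  # Python 2
except ImportError:
    from io import StringIO  # Python 3

import pandas as pd

dataset = StringIO('''index,col1,col2,col3
1,6,10,14
2,7,11,15
3,8,12,16
4,9,13,17
5,inf,-inf,bla
''')
na_values = ['inf', '-inf', 'bla']

# Reading with an integer index column triggers the error on unfixed versions.
df = pd.read_csv(dataset, na_values=na_values, index_col='index')
print(df)
print(df.dtypes)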

Without fix:

Traceback (most recent call last):
  File "/home/modintsov/workspace/DataRobot/playground.py", line 39, in <module>
    df = pd.read_csv(dataset, na_values=na_values, index_col='index')
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 678, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 446, in _read
    data = parser.read(nrows)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1036, in read
    ret = self._engine.read(nrows)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1922, in read
    index, names = self._make_index(data, alldata, names)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1426, in _make_index
    index = self._agg_index(index)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1520, in _agg_index
    arr, _ = self._infer_types(arr, col_na_values | col_na_fvalues)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/io/parsers.py", line 1600, in _infer_types
    mask = algorithms.isin(values, list(na_values))
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/algorithms.py", line 418, in isin
    values, _, _ = _ensure_data(values, dtype=dtype)
  File "/home/modintsov/.virtualenvs/dev/local/lib/python2.7/site-packages/pandas/core/algorithms.py", line 82, in _ensure_data
    return _ensure_int64(values), 'int64', 'int64'
  File "pandas/_libs/algos_common_helper.pxi", line 3227, in pandas._libs.algos.ensure_int64
  File "pandas/_libs/algos_common_helper.pxi", line 3232, in pandas._libs.algos.ensure_int64
OverflowError: cannot convert float infinity to integer
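
The last frame of the traceback comes down to a plain Python rule: a float infinity has no integer representation, so casting the na_values to integers for comparison against an integer index fails. Below is a minimal sketch of that failure, together with one way a membership check could fall back to object comparison instead of crashing. The helpers `to_exact_int` and `isin_with_fallback` are hypothetical names written only for illustration; they are not pandas functions, and the sketch is not claimed to be the change made in this PR.

```python
def to_exact_int(x):
    """Cast x to int, refusing values that are not exactly integral."""
    as_int = int(x)  # OverflowError for +/-inf, ValueError for NaN or "bla"
    if as_int != x:
        raise ValueError("not an exact integer: {!r}".format(x))
    return as_int


def isin_with_fallback(values, candidates):
    """Return a list of booleans: is each value among the candidates?

    Hypothetical helper (not pandas code). It tries an integer-only fast
    path first and falls back to comparing the original objects when any
    candidate cannot be represented exactly as an integer.
    """
    try:
        int_candidates = {to_exact_int(c) for c in candidates}
        return [v in int_candidates for v in values]
    except (TypeError, ValueError, OverflowError):
        # Fallback: no casting, so infinities and strings are harmless.
        candidate_set = set(candidates)
        return [v in candidate_set for v in values]


# The root cause in isolation: infinity cannot become an integer.
try:
    int(float("inf"))
except OverflowError as exc:
    message = str(exc)

# An integer index checked against NA markers that include infinities.
index_values = [1, 2, 3, 4, 5]
na_markers = [float("inf"), float("-inf")]
mask = isin_with_fallback(index_values, na_markers)
print(message)
print(mask)
```

Catching `OverflowError` alongside `TypeError` and `ValueError` is the design point here: without it, a single infinite NA marker aborts the whole comparison even though no integer index value could ever match it.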

With fix (as expected):

       col1  col2  col3
index
1       6.0  10.0  14.0
2       7.0  11.0  15.0
3       8.0  12.0  16.0
4       9.0  13.0  17.0
5       NaN   NaN   NaN
col1    float64
col2    float64
col3    float64
dtype: object

pep8speaks commented Aug 2, 2018

Hello @Templarrr! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on August 09, 2018 at 08:44 Hours UTC

codecov bot commented Aug 2, 2018

Codecov Report

Merging #22169 into master will decrease coverage by <.01%.
The diff coverage is 100%.

@@            Coverage Diff             @@
##           master   #22169      +/-   ##
==========================================
- Coverage   92.07%   92.07%   -0.01%     
==========================================
  Files         169      169              
  Lines       50684    50683       -1     
==========================================
- Hits        46668    46666       -2     
- Misses       4016     4017       +1
Flag        Coverage Δ
#multiple   90.48% <100%> (-0.01%) ⬇️
#single     42.34% <100%> (ø) ⬆️

Impacted Files                 Coverage Δ
pandas/core/algorithms.py      94.69% <100%> (ø) ⬆️
pandas/core/indexes/multi.py   95.25% <0%> (-0.09%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 3bcc2bb...e91a840. Read the comment docs.

@jreback
Contributor

jreback commented Aug 2, 2018

xref #10065
closes #17128

@jreback
Contributor

jreback commented Aug 2, 2018

I believe this closes #17128. Can you confirm and update the whatsnew to reflect that?

@@ -41,7 +41,7 @@ Bug Fixes

**Indexing**

-
- Fix OverflowError when trying to use 'inf' as na_value with int index column
Contributor

Add the issue number here. Use double backticks around OverflowError and na_value, and spell out int as integer.

@@ -369,3 +369,13 @@ def test_no_na_filter_on_index(self):
        expected = DataFrame({"a": [1, 4], "c": [3, 6]},
                             index=Index([np.nan, 5.0], name="b"))
        tm.assert_frame_equal(out, expected)

    def test_inf_na_values_with_int_index(self):
        data = "idx,col1,col2\n1,3,4\n2,inf,-inf"
Contributor

add the issue number here as a comment

@jreback jreback added the "Numeric Operations" (Arithmetic, Comparison, and Logical operations) and "IO CSV" (read_csv, to_csv) labels Aug 2, 2018
@jreback jreback added this to the 0.23.4 milestone Aug 2, 2018
@Templarrr
Author

Oh, wow, it's actually my colleague :)
Didn't know Liau Yung Siang reported that already.

Thanks for the review. It's already late in my timezone, but I'll update the PR tomorrow!

Michael Odintsov added 3 commits August 3, 2018 10:58
@Templarrr
Author

@jreback I've confirmed that this solves the error that @YS-L reported in #17128 and addressed your PR review comments. Can you look again?

Also, the 0.23.4 changelog was updated yesterday with today's date, but I don't see a newer pandas release on PyPI yet, so I'm not entirely sure: will my bugfix be part of 0.23.4, or should I move the changelog line to the 0.24.0 changelog?

@Templarrr
Author

The failure in travis is some kind of network glitch, unrelated to these changes :(

@Templarrr
Author

@jreback I see 0.23.4 was released, so this fix obviously didn't make it :) I've moved the changelog line to 0.24.0

@jreback jreback modified the milestones: 0.23.4, 0.23.5 Aug 6, 2018
@Templarrr
Author

@jreback I see you've moved the milestone to 0.23.5; I didn't know a 0.23.5 was planned :)
There is currently no 0.23.5 changelog. Should I create a new one and move my changelog line there?

@jreback
Contributor

jreback commented Aug 7, 2018

Yeah, we might do a 0.23.5. If you want to push an (empty) 0.23.5 whatsnew, please do, but in a new PR please.

@@ -607,6 +607,7 @@ Indexing
- Fixed ``DataFrame[np.nan]`` when columns are non-unique (:issue:`21428`)
- Bug when indexing :class:`DatetimeIndex` with nanosecond resolution dates and timezones (:issue:`11679`)
- Bug where indexing with a Numpy array containing negative values would mutate the indexer (:issue:`21867`)
- Fix ``OverflowError`` when trying to use 'inf' as ``na_value`` with integer index column (:issue:`17128`)
Contributor

can you move to 0.23.5 (rebase on master to see the whatsnew)

Author

Will do in a few minutes.

Author

@jreback done.

Contributor

@jreback jreback left a comment

ping on green.

@@ -40,3 +40,7 @@ Bug Fixes

-
-

Contributor

Can you put this in the IO section, and reference :func:`read_csv` here?

@jreback jreback merged commit 020e948 into pandas-dev:master Aug 9, 2018
@jreback
Contributor

jreback commented Aug 9, 2018

thanks @Templarrr

lumberbot-app bot pushed a commit that referenced this pull request Aug 9, 2018
@Templarrr
Author

You're welcome! Always glad to help :)

jreback pushed a commit that referenced this pull request Aug 9, 2018
Sup3rGeo pushed a commit to Sup3rGeo/pandas that referenced this pull request Oct 1, 2018
Labels
"IO CSV" (read_csv, to_csv), "Numeric Operations" (Arithmetic, Comparison, and Logical operations)
Development

Successfully merging this pull request may close these issues.

OverflowError in read_csv when specifying certain na_values
3 participants