read_csv issues with dict for na_values #19227

Closed
neilser opened this issue Jan 13, 2018 · 3 comments
Labels
Bug IO CSV read_csv, to_csv
@neilser
neilser commented Jan 13, 2018

Basically, I can't get a dictionary of na_values to work properly for me, no matter what I try. Pandas version is 0.22.0.
hack.csv contains:

113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008
729639,"qwer","",asdfkj,466.681,,252.373

Here are two variants of my code - the one with the list does what I expect, but the dict version doesn't:

df = pd.read_csv("hack.csv", header=None, keep_default_na=False, na_values=[214.008,'',"blah"])
df.head()

output from list version

        0     1       2       3        4        5        6
0  113125   NaN  /blaha  kjsdkj  412.166  225.874      NaN
1  729639  qwer     NaN  asdfkj  466.681      NaN  252.373

looks correct, but the dict version

df = pd.read_csv("hack.csv", header=None, 
                 keep_default_na=False, na_values={2:"",6:"214.008",1:"blah",0:'113125'})
df.head()

although clearly paying attention to the columns I specify, is simply refusing to create any NaNs in those columns:

        0     1       2       3        4        5        6
0  113125  blah  /blaha  kjsdkj  412.166  225.874  214.008
1  729639  qwer          asdfkj  466.681      NaN  252.373

So... I'm stuck. Any suggestions? I really want to have column-specific NaN handling so I need the dict.
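For anyone reproducing this, here is a minimal sketch of the column-specific form, assuming pandas 0.23 or later (the milestone the fix in this thread was assigned to). The two rows of hack.csv are inlined through io.StringIO, and each dict value is written as a list of strings. The list wrapping and the names CSV_TEXT and na_by_column are choices made for this sketch, not something the thread prescribes.

```python
import io

import pandas as pd

# The two rows of hack.csv from the report, inlined so the sketch is self-contained.
CSV_TEXT = (
    '113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008\n'
    '729639,"qwer","",asdfkj,466.681,,252.373\n'
)

# Keys are column positions (header=None gives integer labels).
# Values are lists of strings, matched against the raw cell text.
na_by_column = {0: ["113125"], 1: ["blah"], 2: [""], 6: ["214.008"]}

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=None,
    keep_default_na=False,
    na_values=na_by_column,
)

# Boolean mask showing which cells were turned into NaN.
print(df.isna())
```

On a fixed version, the listed marker in each of columns 0, 1, 2 and 6 should come back as NaN while the other values in those columns are left alone.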

Additionally, the dict version does create NaNs in columns I didn't specify in the dict, which goes against my expectations for the combination of keep_default_na=False and an explicit value for na_values. Maybe I'm misreading the docs on that point.
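The expectation described here can be checked with a small comparison. This is a sketch assuming pandas 0.23 or later; the two-column data and the names strict and lenient are made up for illustration. Column "b" is deliberately left out of the dict, so it shows what happens to unspecified columns under each setting of keep_default_na.

```python
import io

import pandas as pd

# Hypothetical two-column file: "b" has an empty cell and is NOT in the dict.
CSV_TEXT = "a,b\nx,\nblah,2\n"
na_by_column = {"a": ["blah"]}

# keep_default_na=False: only the markers listed for "a" should become NaN.
strict = pd.read_csv(
    io.StringIO(CSV_TEXT), keep_default_na=False, na_values=na_by_column
)

# keep_default_na=True: the default markers (including the empty string)
# also apply, so the empty cell in "b" becomes NaN as well.
lenient = pd.read_csv(
    io.StringIO(CSV_TEXT), keep_default_na=True, na_values=na_by_column
)

print(strict.isna())
print(lenient.isna())
```

If the empty cell in "b" shows up as NaN in the strict frame, the installed version most likely predates the fix.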

Finally, you may notice that I used "214.008" (and other quoted numeric values) in the dict above. This is because I get a "not iterable" error when I provide unquoted numbers. This is despite that having been flagged as an issue and fixed a while back. This feels like another buglet to me.
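One way to sidestep the scalar problem on an affected version is to normalize the dict before handing it to read_csv. The helper below is hypothetical (as_na_lists is not a pandas function) and only a sketch: it wraps bare scalars in a list and converts every marker to a string. It assumes str() of a number reproduces the text in the file, which holds for 214.008 but would not for a cell written as 1.50.

```python
def as_na_lists(na_values):
    """Return a copy of a per-column NA mapping where every value is a list of strings."""
    normalized = {}
    for column, markers in na_values.items():
        # Treat strings and non-iterables (ints, floats) as a single marker.
        if isinstance(markers, (str, bytes)) or not hasattr(markers, "__iter__"):
            markers = [markers]
        normalized[column] = [str(marker) for marker in markers]
    return normalized


print(as_na_lists({6: 214.008, 1: "blah", 2: ["", "n/a"]}))
# {6: ['214.008'], 1: ['blah'], 2: ['', 'n/a']}
```

The result can then be passed as na_values=as_na_lists(...) in the read_csv call above.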

Btw: to be picky, another doc-related quibble: I think the docs for keep_default_na are a bit misleading, in that they imply that keep_default_na=True should have no effect unless na_values is supplied (but in fact there is an effect even when na_values isn't supplied). It might be over-pedantic of me to care, but I feel that primary documentation really ought to be unambiguous. If anybody agrees with this pedantry I would be happy to propose a tweak ;-)
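The effect of keep_default_na when na_values isn't supplied can be seen with a short sketch. The data and the names with_defaults and without_defaults are invented for illustration.

```python
import io

import pandas as pd

# Hypothetical data: one cell holds the text "NA", another is empty.
CSV_TEXT = "a,b\nNA,1\nx,\n"

# Default behaviour (keep_default_na=True), no na_values given.
with_defaults = pd.read_csv(io.StringIO(CSV_TEXT))

# Same call with the default markers switched off, still no na_values.
without_defaults = pd.read_csv(io.StringIO(CSV_TEXT), keep_default_na=False)

print(with_defaults.isna())
print(without_defaults.isna())
```

With the defaults on, both the "NA" cell and the empty cell are read as NaN. With keep_default_na=False and no na_values, nothing is converted and the cells stay as strings.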

This issue was raised after I recently commented on two other issues with the problems described above, and @gfyoung suggested I ought to raise a new issue (comment links below).
#1657 (comment)
#12224 (comment)

Output of pd.show_versions()

NB: I quite likely have some "old" modules in the pile below, but I believe that I've updated pandas itself correctly, so any dependencies ought to have been updated too. If my errors aren't reproducible by others, it might indicate that there's a hidden version dependency(?) but I'm too much of a pandas noob to know how likely that is.

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Windows
OS-release: 10
machine: AMD64
processor: Intel64 Family 6 Model 158 Stepping 9, GenuineIntel
byteorder: little
LC_ALL: None
LANG: None
LOCALE: None.None

pandas: 0.22.0
pytest: 2.9.2
pip: 8.1.2
setuptools: 27.2.0
Cython: 0.24.1
numpy: 1.13.3
scipy: 1.0.0
pyarrow: None
xarray: None
IPython: 5.1.0
sphinx: 1.4.6
patsy: 0.4.1
dateutil: 2.5.3
pytz: 2016.6.1
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.4
feather: None
matplotlib: 2.0.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.3
lxml: 3.6.4
bs4: 4.5.1
html5lib: None
sqlalchemy: 1.0.13
pymysql: None
psycopg2: None
jinja2: 2.8
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@jorisvandenbossche jorisvandenbossche added Bug IO CSV read_csv, to_csv labels Jan 15, 2018
@jorisvandenbossche
Member

Thanks for the report.
Can you check what happens if you use column names (so that the dict keys are strings instead of integers)? Just to check whether that works or whether it is a general problem with na_values.
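A sketch of that check, assuming pandas 0.23 or later for the expected result. The column names in COLUMN_NAMES are made up, since hack.csv has no header row.

```python
import io

import pandas as pd

CSV_TEXT = (
    '113125,"blah","/blaha",kjsdkj,412.166,225.874,214.008\n'
    '729639,"qwer","",asdfkj,466.681,,252.373\n'
)

# Hypothetical names, supplied through `names` because the file has no header.
COLUMN_NAMES = ["id", "name", "path", "code", "x", "y", "z"]

df = pd.read_csv(
    io.StringIO(CSV_TEXT),
    header=None,
    names=COLUMN_NAMES,
    keep_default_na=False,
    # Dict keyed by column name instead of column position.
    na_values={"name": ["blah"], "path": [""], "z": ["214.008"]},
)

print(df[["name", "path", "z"]].isna())
```

If the string-keyed dict produces NaNs where the integer-keyed one does not, the problem is specific to integer keys rather than to na_values in general.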

@gfyoung
Member

gfyoung commented Jan 15, 2018

@jorisvandenbossche : So many bugs...that's all I can say. 😂 Luckily, the patch for them isn't so bad 😄 (PR coming for them soon).

gfyoung added a commit to forking-repos/pandas that referenced this issue Jan 16, 2018
Patches very buggy behavior of keep_default_na=False
whenever na_values is a dict

* Respect keep_default_na for column that doesn't
exist in na_values dictionary
* Don't crash / break when na_value is a scalar in
the na_values dictionary.

In addition, clarifies documentation on behavior of
keep_default_na with respect to na_filter and na_values.

Closes gh-19227.
@jreback jreback added this to the 0.23.0 milestone Jan 16, 2018
jreback pushed a commit that referenced this issue Jan 18, 2018
Patches very buggy behavior of keep_default_na=False
whenever na_values is a dict

* Respect keep_default_na for column that doesn't
exist in na_values dictionary
* Don't crash / break when na_value is a scalar in
the na_values dictionary.

In addition, clarifies documentation on behavior of
keep_default_na with respect to na_filter and na_values.

Closes gh-19227.
@neilser
Author

neilser commented Jan 18, 2018

Wow, talk about being late to the party! Sorry folks, I was tied up for the last two days and am only now getting to my backlog. But it looks as though it's all sewn up. Well done and thanks! :-)
