Empty cells make pandas use float, even if read_csv(dtype={'FOO': str}) is used #17810


Closed
MartinThoma opened this issue Oct 7, 2017 · 10 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv Usage Question

Comments

@MartinThoma

Code Sample, a copy-pastable example if possible

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pandas as pd

csv_path = 'test.csv'
df = pd.read_csv(csv_path, delimiter=';', quotechar='"',
                 decimal=',', encoding="ISO-8859-1", dtype={'FOO': str})
df.FOO = df.FOO.map(lambda n: n.zfill(6))
print(df)

test.csv:

FOO;BAR
01,23;4,56
1,23;45,6
;987

Problem description

When I use dtype={'FOO': str}, I expect pandas to treat the column as strings. This seems to work, but as soon as an empty cell is present, pandas appears to switch the column to float.

Expected Output

      FOO     BAR
0  001,23    4.56
1  001,23   45.60
2  000000  987.00

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.2
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.14
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


@bobhaffner
Contributor

bobhaffner commented Oct 7, 2017

I believe this is expected behavior.

From read_csv

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

Maybe the converters arg to read_csv is what you're after:
converters={'FOO': lambda x: str(x)}
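A runnable sketch of that suggestion, with io.StringIO standing in for the original test.csv (passing str directly is equivalent to the lambda):

```python
import io

import pandas as pd

# Inline stand-in for the original test.csv
data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

# converters is applied INSTEAD of dtype conversion, and converted columns
# also skip NA detection, so the empty cell comes through as '' rather than NaN
df = pd.read_csv(io.StringIO(data), delimiter=';', decimal=',',
                 converters={'FOO': str})
print(df.FOO.tolist())  # ['01,23', '1,23', '']
```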

@jorisvandenbossche
Member

@MartinThoma If you look at the values of the column, you will see pandas correctly preserved the data as strings (as you specified with dtype={'FOO': str}):

In [20]: df.FOO.values
Out[20]: array(['01,23', '1,23', nan], dtype=object)

The only 'gotcha' is that empty strings are still seen as missing values (and thus converted to NaN), and not kept as an empty string.

So your solution of filling the missing values with empty string (df.FOO.fillna(value="")) is actually fine.
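That workaround, sketched end-to-end (io.StringIO stands in for test.csv):

```python
import io

import pandas as pd

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'
df = pd.read_csv(io.StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str})

# The empty cell was parsed as NaN; replace it with '' before calling
# string methods such as zfill
df.FOO = df.FOO.fillna(value="").map(lambda n: n.zfill(6))
print(df.FOO.tolist())  # ['001,23', '001,23', '000000']
```

This reproduces the expected output from the issue report.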

@jorisvandenbossche
Member

The solution of using the converters arg (converters={'FOO': str}) is also fine (although I think it will be slower if you have a lot of data, but not sure).

@jreback
Contributor

jreback commented Oct 9, 2017

I seem to recall this issue coming up before. It would be helpful to link to prior discussions.

@jorisvandenbossche
Member

I don't directly find another related issue, apart from #1450, which you can actually do as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent the parsing of empty strings to NaNs.
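For completeness, a sketch of that variant (again with io.StringIO in place of test.csv):

```python
import io

import pandas as pd

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

# keep_default_na=False disables the built-in NA strings ('', 'NA', 'NaN', ...)
# and na_values=[] supplies no replacements, so the empty cell stays ''
df = pd.read_csv(io.StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str}, na_values=[], keep_default_na=False)
print(df.FOO.tolist())  # ['01,23', '1,23', '']
```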

@MartinThoma
Author

Why is na_values=[] required? What would happen without it?

@matanox

matanox commented Dec 14, 2018

Is it only me, or is the type inference and missing-data handling when reading input an idiosyncratic part of pandas DataFrames? Anyway, thanks for all the advice.

@mroeschke
Member

Seems like this is the intended behavior, which is documented in read_csv. Going to close, but happy to reopen if there are any suggestions to improve the documentation.

@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv labels Apr 30, 2020
@Ed1123

Ed1123 commented Sep 15, 2020

Take a look at this:

pd.read_csv('csv_file.csv', dtype={'special_id': int})

That code throws this error:
ValueError: Integer column has NA values in column 0

It happens because the given column has empty cells, which I expected to be treated as NaN. Without the dtype argument I get the values as floats.
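On newer pandas (0.24+), the nullable integer extension dtype 'Int64' (capital I) sidesteps that ValueError, since unlike NumPy's int64 it can represent missing values. A minimal sketch with made-up data (the column name and file contents are hypothetical):

```python
import io

import pandas as pd

# Hypothetical file contents: the second row has an empty special_id cell
data = 'special_id,name\n1,a\n,b\n3,c\n'

# dtype=int fails on the empty cell; the nullable 'Int64' dtype keeps the
# values as integers and stores the empty cell as pd.NA instead of forcing
# the whole column to float
df = pd.read_csv(io.StringIO(data), dtype={'special_id': 'Int64'})
print(df.special_id.dtype)              # Int64
print(df.special_id.dropna().tolist())  # [1, 3]
```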
