Empty cells make pandas use float, even if read_csv(dtype={'FOO': str}) is used #17810


Closed
MartinThoma opened this issue Oct 7, 2017 · 10 comments
Labels
Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv Usage Question

Comments

@MartinThoma

Code Sample, a copy-pastable example if possible

#!/usr/bin/env python3
# -*- coding: utf-8 -*-

import pandas as pd

csv_path = 'test.csv'
df = pd.read_csv(csv_path, delimiter=';', quotechar='"',
                 decimal=',', encoding="ISO-8859-1", dtype={'FOO': str})
df.FOO = df.FOO.map(lambda n: n.zfill(6))
print(df)

test.csv:

FOO;BAR
01,23;4,56
1,23;45,6
;987

Problem description

When I use dtype={'FOO': str}, I expect pandas to treat the column as strings. This seems to work, but as soon as an empty cell is present, pandas appears to switch the column to float.

Expected Output

      FOO     BAR
0  001,23    4.56
1  001,23   45.60
2  000000  987.00

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Linux
OS-release: 4.10.0-35-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.20.3
pytest: 3.2.2
pip: 9.0.1
setuptools: 20.7.0
Cython: None
numpy: 1.13.3
scipy: 0.19.0
xarray: None
IPython: 6.2.1
sphinx: None
patsy: None
dateutil: 2.6.1
pytz: 2017.2
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: 2.0.2
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml: None
bs4: None
html5lib: 1.0b10
sqlalchemy: 1.1.14
pymysql: None
psycopg2: None
jinja2: 2.9.6
s3fs: None
pandas_gbq: None
pandas_datareader: None


@bobhaffner
Contributor

bobhaffner commented Oct 7, 2017

I believe this is expected behavior.

From read_csv

dtype : Type name or dict of column -> type, default None
Data type for data or columns. E.g. {‘a’: np.float64, ‘b’: np.int32} Use str or object to preserve and not interpret dtype. If converters are specified, they will be applied INSTEAD of dtype conversion.

Maybe the converters arg to read_csv is what you're after:
converters={'FOO': lambda x: str(x)}
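A runnable sketch of that suggestion, with io.StringIO standing in for the original test.csv (passing str directly is equivalent to the lambda):

```python
import io

import pandas as pd

# Inline stand-in for the original test.csv
data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

# converters is applied INSTEAD of dtype conversion, and converted columns
# also skip NA detection, so the empty cell comes through as '' rather than NaN
df = pd.read_csv(io.StringIO(data), delimiter=';', decimal=',',
                 converters={'FOO': str})
print(df.FOO.tolist())  # ['01,23', '1,23', '']
```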

@jorisvandenbossche
Member

@MartinThoma If you look at the values of the column, you will see pandas correctly preserved the data as strings (as you specified with dtype={'FOO': str}):

In [20]: df.FOO.values
Out[20]: array(['01,23', '1,23', nan], dtype=object)

The only 'gotcha' is that empty strings are still seen as missing values (and thus converted to NaN), and not kept as an empty string.

So your solution of filling the missing values with empty string (df.FOO.fillna(value="")) is actually fine.
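That workaround, sketched end-to-end (io.StringIO stands in for test.csv):

```python
import io

import pandas as pd

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'
df = pd.read_csv(io.StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str})

# The empty cell was parsed as NaN; replace it with '' before calling
# string methods such as zfill
df.FOO = df.FOO.fillna(value="").map(lambda n: n.zfill(6))
print(df.FOO.tolist())  # ['001,23', '001,23', '000000']
```

This reproduces the expected output from the issue report.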

@jorisvandenbossche
Member

The solution of using the converters arg (converters={'FOO': str}) is also fine (although I think it will be slower if you have a lot of data, but not sure).

@jreback
Contributor

jreback commented Oct 9, 2017

I seem to recall this issue coming up before. It would be helpful to link to prior discussions.

@jorisvandenbossche
Member

I don't directly find another related issue, apart from #1450, which you can actually do as well: add na_values=[], keep_default_na=False to read_csv if you want to prevent the parsing of empty strings to NaNs.
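For completeness, a sketch of that variant (again with io.StringIO in place of test.csv):

```python
import io

import pandas as pd

data = 'FOO;BAR\n01,23;4,56\n1,23;45,6\n;987\n'

# keep_default_na=False disables the built-in NA strings ('', 'NA', 'NaN', ...)
# and na_values=[] supplies no replacements, so the empty cell stays ''
df = pd.read_csv(io.StringIO(data), delimiter=';', decimal=',',
                 dtype={'FOO': str}, na_values=[], keep_default_na=False)
print(df.FOO.tolist())  # ['01,23', '1,23', '']
```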

@MartinThoma
Author

Why is na_values=[] required? What would happen without it?

@matanox

matanox commented Dec 14, 2018

Is it only me, or is the type inference and missing-data handling when reading input an idiosyncratic part of pandas DataFrames? Anyway, thanks for all the advice.

@mroeschke
Member

Seems like this is the intended behavior, which is documented in read_csv. Going to close, but happy to reopen if there are any suggestions to improve the documentation.

@mroeschke mroeschke added Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv labels Apr 30, 2020
@Ed1123

Ed1123 commented Sep 15, 2020

Take a look at this:

pd.read_csv('csv_file.csv', dtype={'special_id': int})

That code throws this error:
ValueError: Integer column has NA values in column 0

It happens because the given column has empty cells, which I expected to be treated as NaN. Without the dtype argument I get the values as floats.
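On newer pandas (0.24+), the nullable integer extension dtype 'Int64' (capital I) sidesteps that ValueError, since unlike NumPy's int64 it can represent missing values. A minimal sketch with made-up data (the column name and file contents are hypothetical):

```python
import io

import pandas as pd

# Hypothetical file contents: the second row has an empty special_id cell
data = 'special_id,name\n1,a\n,b\n3,c\n'

# dtype=int fails on the empty cell; the nullable 'Int64' dtype keeps the
# values as integers and stores the empty cell as pd.NA instead of forcing
# the whole column to float
df = pd.read_csv(io.StringIO(data), dtype={'special_id': 'Int64'})
print(df.special_id.dtype)              # Int64
print(df.special_id.dropna().tolist())  # [1, 3]
```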
