pandas read_csv demo

This repository demonstrates the behavior covered in my Issue which was later addressed in a PR by @AlexKirko

Pandas has a great function for reading data from a csv into a dataframe and automatically assigning datatypes to the columns. I do, however, have one issue with how this function handles certain keywords in string fields. Let's look at an example.

Example

I get a csv from another piece of software that outputs strings, ints, floats, and enums. It also can output just the integer and string portions of an enum.

stringField	intField	floatField	enumField	enumField_int	enumField_str
pandas	0	0.5	0:false	0	false
hello	1	1.5	1:true	1	true
world	2	2.5	2:unknown	2	unknown

To read this csv into a dataframe, first import pandas.

#!/usr/bin/python3

import pandas as pd

Then use the read_csv function to read the table into a dataframe.

df = pd.read_csv('demo_more.csv')

Check the types assigned with the dtypes function.

print(df.dtypes)

The following will be returned.

stringField       object
intField           int64
floatField       float64
enumField         object
enumField_int      int64
enumField_str     object
dtype: object

As expected, stringField is a string, intField is an int, floatField is a float, enumField is a string, enumField_int is an int, and enumField_str is a string.

This behavior should be the same for a smaller set of data, like the following.

stringField	intField	floatField	enumField	enumField_int	enumField_str
pandas	0	0.5	0:false	0	false
hello	1	1.5	1:true	1	true

df = pd.read_csv('demo_less.csv')
print(df.dtypes)

However, the following is returned instead.

stringField       object
intField           int64
floatField       float64
enumField         object
enumField_int      int64
enumField_str       bool
dtype: object

The Problem

When read_csv sees a column of string values that could all represent a boolean value, it interprets that field as a boolean. This feature could potentially be useful for some applications, but in my case, booleans will always be represented as 0 and 1 and the only time true and false will appear in my data is when they are values in a string field.

I would expect there to be a parameter to read_csv to disable this behavior, but there is not.

There are parameters that allow for more keywords to be interpretted as true and false (true_values, false_values) but the default value of these parameters is none, and setting them to an empty list doesn't disable this behavior either.

This is the only way to disable this behavior purely with the pandas library

df = pd.read_csv('demo_less.csv', dtype=str).apply(pd.to_numeric, errors='ignore')

read_csv will interpret every column as a string, then the to_numeric function can be applied to each column to attempt to convert it to a numeric type. to_numeric does not treat string values as booleans even though read_csv does.

Now if the types are checked.

print(df.dtypes)

The following will be returned.

stringField       object
intField           int64
floatField       float64
enumField         object
enumField_int      int64
enumField_str     object
dtype: object

Name		Name	Last commit message	Last commit date
Latest commit History 2 Commits
.gitattributes		.gitattributes
README.md		README.md
demo.py		demo.py
demo_less.csv		demo_less.csv
demo_more.csv		demo_more.csv

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

pandas read_csv demo

Example

The Problem

About

Uh oh!

Releases

Packages

Languages

ryanbaekr/pandas_read_csv_demo

Folders and files

Latest commit

History

Repository files navigation

pandas read_csv demo

Example

The Problem

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages