Description
Research
- I have searched the [pandas] tag on StackOverflow for similar questions.
- I have asked my usage-related question on StackOverflow.
Link to question on StackOverflow
Question about pandas
The behavior of on_bad_lines='warn' puzzles me. The pandas team added functionality to handle lines with more separators than the other lines directly, so it does not seem strange to me to expect that some pandas option could handle lines with fewer separators in the same way.
Using pandas.read_csv with the on_bad_lines='warn' option, the case where a line has too many column delimiters works well: the bad lines are not loaded, and the skipped line numbers are reported on stderr:
import pandas as pd
from io import StringIO
data = StringIO("""
nom,f,nb
bat,F,52
cat,M,66,
caw,F,15
dog,M,66,,
fly,F,61
ant,F,21""")
df = pd.read_csv(data, sep=',', on_bad_lines='warn')
b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'
df.head(10)
# nom f nb
# 0 bat F 52
# 1 caw F 15
# 2 fly F 61
# 3 ant F 21
But when a line has fewer delimiters (here sep=',') than the other lines, the line is still loaded, with the missing fields filled with NaN:
import pandas as pd
from io import StringIO
data = StringIO("""
nom,f,nb
bat,F,52
catM66,
caw,F,15
dog,M66
fly,F,61
ant,F,21""")
df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
df.head(10)
# nom f nb
# 0 bat F 52
# 1 catM66 NaN NaN <==
# 2 caw F 15
# 3 dog M66 NaN <==
# 4 fly F 61
# 5 ant F 21
Is there a way to make read_csv skip lines that have fewer column delimiters than the other lines?
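For reference, one possible workaround is sketched below. It is an assumption on my side, not a built-in pandas option: pre-filter the text before it reaches read_csv and drop any row with fewer fields than the header. The helper name drop_short_lines is hypothetical (it is not part of pandas). It counts fields with the standard csv module, so a delimiter inside a quoted field is not miscounted.

```python
import csv
from io import StringIO

import pandas as pd

raw = """
nom,f,nb
bat,F,52
catM66,
caw,F,15
dog,M66
fly,F,61
ant,F,21"""


def drop_short_lines(text, sep=","):
    """Hypothetical helper: keep rows with at least as many fields as the header."""
    # Parse with the csv module and ignore blank lines (empty rows).
    rows = [row for row in csv.reader(StringIO(text), delimiter=sep) if row]
    expected = len(rows[0])  # assumes the first non-blank row is the header
    kept = [row for row in rows if len(row) >= expected]
    # Write the surviving rows back into an in-memory buffer for read_csv.
    out = StringIO()
    csv.writer(out, delimiter=sep, lineterminator="\n").writerows(kept)
    out.seek(0)
    return out


# Rows with too many fields are left in place so on_bad_lines can still handle them.
df = pd.read_csv(drop_short_lines(raw), sep=",", on_bad_lines="warn", dtype=str)
print(df)
```

With the sample above, the rows "catM66," and "dog,M66" each parse to two fields and are dropped before pandas sees them. This reads the whole input into memory, so it is only a sketch for small files.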