Skip to content

QST: handle lines with less separators than main lines in read_csv #48728

Open
@LuisBL

Description

@LuisBL

Research

  • I have searched the [pandas] tag on StackOverflow for similar questions.

  • I have asked my usage related question on StackOverflow.

Link to question on StackOverflow

https://stackoverflow.com/questions/73820090/make-pandas-read-csv-to-not-add-lines-with-less-columns-delimiters-than-the-main

Question about pandas

The on_bad_lines=warn question me. Pandas team added the functionality to directly handle lines with more separators than the main lines, that's why it don't seems strange to me that some other pandas option could handle in the same way lines with less separators than main lines

Using pandas.read_csv with on_bad_lines='warn' option the case there is too many
columns delimiters work well, bad lines are not loaded and stderr catch the bad lines
numbers:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    cat,M,66,
    caw,F,15
    dog,M,66,,
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn')

    b'Skipping line 4: expected 3 fields, saw 4\nSkipping line 6: expected 3 fields, saw 5\n'

    df.head(10)
    #    nom  f  nb
    # 0  bat  F  52
    # 1  caw  F  15
    # 2  fly  F  61
    # 3  ant  F  21

But in case the number of delimiter (here sep=,) is less than the main, the line
is added adding NaN.:

    import pandas as pd
    from io import StringIO
    data = StringIO("""
    nom,f,nb
    bat,F,52
    catM66,
    caw,F,15
    dog,M66
    fly,F,61
    ant,F,21""")
    df = pd.read_csv(data, sep=',', on_bad_lines='warn', dtype=str)
    df.head(10)

    #       nom    f   nb
    # 0     bat    F   52
    # 1  catM66  NaN  NaN            <==
    # 2     caw    F   15
    # 3     dog  M66  NaN            <==
    # 4     fly    F   61
    # 5     ant    F   21

Is there a way to make read_csv to not add lines with less columns delimiters than
the main lines ?

Metadata

Metadata

Assignees

No one assigned

    Labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions