Skip to content

Conversation

amithkk
Copy link
Contributor

@amithkk amithkk commented Aug 19, 2023

This adds the on_bad_lines argument to the pyarrow engine for the from_csv parser that closely follows the behaviour of the python engine. Internally utilizes pyarrow's invalid_row_handler. The built-in callable implementation slightly differs for pyarrow, so the difference is appropriately documented by pointing to pyarrow's documentation.

Usage Example:

example.csv:

a,b,c
acol1,bcol1,ccol1
acol2,ccol2

example.py

df_arrow = pd.read_csv(r"example.csv", engine='pyarrow', dtype_backend='pyarrow', on_bad_lines='warn')
print(df_arrow)

Console output

ParserWarning: Expected 3 columns, but found 2: acol2,ccol2
       a      b      c
0  acol1  bcol1  ccol1

@amithkk amithkk force-pushed the feat-pyarrow-bad-lines branch from 8b760eb to 62c873b Compare August 20, 2023 11:56
@lithomas1
Copy link
Contributor

Does this fix any tests?

There are some xfailed tests for on_bad_lines I think somewhere (sorry that I don't remember where) for the pyarrow engine.

@amithkk
Copy link
Contributor Author

amithkk commented Aug 21, 2023

@lithomas1 The tests have been refactored to not fail for both the "python" engine and the "pyarrow" engine (see: pandas/tests/io/parser/test_unsupported.py)

I didn't find any other tests that were invalidated (as confirmed by the CI).

@mroeschke mroeschke added IO CSV read_csv, to_csv Arrow pyarrow functionality labels Aug 21, 2023
@amithkk
Copy link
Contributor Author

amithkk commented Aug 21, 2023

@lithomas1 btw, is there any specific action I should be taking to request a review?

@amithkk amithkk force-pushed the feat-pyarrow-bad-lines branch from 4ee6141 to bee7351 Compare August 22, 2023 14:28
@amithkk
Copy link
Contributor Author

amithkk commented Aug 22, 2023

Rebased from master to pull in compat workaround so that CI Passes

@amithkk amithkk requested a review from mroeschke August 22, 2023 16:29
@mroeschke
Copy link
Member

Can you add a whatsnew entry to 2.2.0.rst?

@lithomas1
Copy link
Contributor

lithomas1 commented Aug 22, 2023

@lithomas1 The tests have been refactored to not fail for both the "python" engine and the "pyarrow" engine (see: pandas/tests/io/parser/test_unsupported.py)

I didn't find any other tests that were invalidated (as confirmed by the CI).

Try looking in pandas/tests/io/parser/common/test_read_errors.py

Those tests are currently skippped, but on_bad_lines is being used there. You should be able to make at least some of them pass.

(You can try xfailing the tests there by changing the "pyarrow_skip" to a "pyarrow_xfail")

@amithkk
Copy link
Contributor Author

amithkk commented Aug 23, 2023

@lithomas1 I've gotten all the tests within test_read_errors.py applicable for this change to pass.

@amithkk amithkk requested a review from mroeschke August 23, 2023 21:11
@amithkk
Copy link
Contributor Author

amithkk commented Aug 25, 2023

@lithomas1 Are there any other changes that you'd like me to look at from the test perspective?

@amithkk amithkk requested a review from lithomas1 August 25, 2023 20:51
@amithkk amithkk requested a review from mroeschke August 26, 2023 19:48
@amithkk
Copy link
Contributor Author

amithkk commented Aug 31, 2023

Any updates on this PR?

@mroeschke mroeschke added this to the 2.2 milestone Sep 1, 2023
Copy link
Member

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM cc @lithomas1 if you have any comments

@amithkk
Copy link
Contributor Author

amithkk commented Sep 8, 2023

@lithomas1 Anything else that needs to be covered in this pull request?

Copy link
Contributor

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the ping. I left some comments.

@amithkk amithkk requested a review from lithomas1 September 11, 2023 20:57
@amithkk
Copy link
Contributor Author

amithkk commented Sep 11, 2023

@lithomas1 I've chained the ArrowInvalid exception with ParserError so that the tests are uniform. Does this match what you had in mind?

@amithkk
Copy link
Contributor Author

amithkk commented Sep 20, 2023

@lithomas1 Awaiting your review

Copy link
Contributor

@lithomas1 lithomas1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me now. Thanks!

@mroeschke mroeschke merged commit d9a2ad1 into pandas-dev:main Sep 22, 2023
@mroeschke
Copy link
Member

Thanks @amithkk

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Arrow pyarrow functionality IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: Pandas 2.0 with pyarrow engine add the argument like 'skip_bad_lines=True'
3 participants