Detect Parsing errors in read_csv first row with index_col=False #40629
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -215,6 +215,13 @@ cdef extern from "parser/tokenizer.h": | |
int64_t header_start # header row start | ||
uint64_t header_end # header row end | ||
|
||
bint allow_leading_cols # Boolean: 1: can infer index col, 0: no index col | ||
bint skip_header_end # Boolean: 1: Header=None, | ||
# 0 Header is not None | ||
# This is used because header_end is | ||
# uint64_t so there is no valid NULL | ||
What do you mean by […]
The type of […] This led to incorrect logic to determine the header file because -1 would be interpreted as the max UINT64 value. I added an extra field here to effectively check if the value is null since we can't check -1.
Thanks for explaining, good to fix that! |
||
# value (i.e. header_end == -1). | ||
njriasan marked this conversation as resolved.
|
||
|
||
void *skipset | ||
PyObject *skipfunc | ||
int64_t skip_first_N_rows | ||
|
@@ -378,6 +385,7 @@ cdef class TextReader: | |
self.encoding_errors = PyBytes_AsString(encoding_errors) | ||
|
||
self.parser = parser_new() | ||
self.parser.allow_leading_cols = allow_leading_cols | ||
self.parser.chunksize = tokenize_chunksize | ||
|
||
self.mangle_dupe_cols = mangle_dupe_cols | ||
|
@@ -517,11 +525,13 @@ cdef class TextReader: | |
if header is None: | ||
# sentinel value | ||
self.parser.header_start = -1 | ||
self.parser.header_end = -1 | ||
self.parser.skip_header_end = True | ||
self.parser.header_end = 0 | ||
self.parser.header = -1 | ||
self.parser_start = 0 | ||
prelim_header = [] | ||
else: | ||
self.parser.skip_header_end = False | ||
if isinstance(header, list): | ||
if len(header) > 1: | ||
# need to artificially skip the final line | ||
|
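The review thread on `skip_header_end` turns on how -1 behaves in an unsigned 64-bit field. The sketch below reproduces that wrap-around with `ctypes` and contrasts a sentinel-based check with a flag-based one. The two `in_header_*` helpers and their comparison logic are hypothetical illustrations, not the tokenizer's actual code.

```python
import ctypes

UINT64_MAX = 2**64 - 1

# Storing -1 in an unsigned 64-bit integer wraps around to the largest value.
wrapped_header_end = ctypes.c_uint64(-1).value


def in_header_sentinel(lineno, header_end):
    """Hypothetical check that relies on header_end == -1 meaning 'no header'.

    With an unsigned field the sentinel is UINT64_MAX, so every line
    compares as being at or before the end of the header.
    """
    return lineno <= header_end


def in_header_flag(lineno, header_end, skip_header_end):
    """Hypothetical check that uses a separate boolean instead of a sentinel."""
    if skip_header_end:
        return False
    return lineno <= header_end


print(wrapped_header_end == UINT64_MAX)
print(in_header_sentinel(5, wrapped_header_end))
print(in_header_flag(5, 0, skip_header_end=True))
```

Carrying the "no header" state in its own boolean avoids depending on a value the unsigned type cannot represent, which is the reason given in the thread for adding the extra field.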
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -667,11 +667,10 @@ def test_blank_lines_between_header_and_data_rows(all_parsers, nrows): | |
def test_no_header_two_extra_columns(all_parsers): | ||
# GH 26218 | ||
column_names = ["one", "two", "three"] | ||
ref = DataFrame([["foo", "bar", "baz"]], columns=column_names) | ||
stream = StringIO("foo,bar,baz,bam,blah") | ||
parser = all_parsers | ||
df = parser.read_csv(stream, header=None, names=column_names, index_col=False) | ||
tm.assert_frame_equal(df, ref) | ||
with pytest.raises(ParserError, match="Expected 3 fields in line 1, saw 5"): | ||
In principle I am not against changing this, but doing it only for this case would cause this to fail and […] to work. Also, I am not sure we can do this without deprecating.
Can you elaborate on what you mean by work? Testing this example on my PR I get […] I understand the concern about deprecating. Do you have any advice on how I should modify the code to address that concern? I'm a first-time pandas contributor, so I'm not familiar with that process.
Haven't tested on your branch, sorry. Though we have tests for this, which would have been changed too then. @gfyoung Could you help here? Do you think we should deprecate this before changing?
Since we had tests for this behavior (whether intentional or not), I think I would lean towards deprecation. cc @pandas-dev/pandas-core - this is a bit of an odd case. While the behavior does look buggy, the fact that we have been testing it suggests there could have been something deliberate behind it. |
||
parser.read_csv(stream, header=None, names=column_names, index_col=False) | ||
|
||
|
||
def test_read_csv_names_not_accepting_sets(all_parsers): | ||
|
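The thread on `test_no_header_two_extra_columns` weighs two behaviors for a row with more fields than `names`: silently keeping the first three, or raising "Expected 3 fields in line 1, saw 5". The sketch below uses only the standard `csv` module to show what a deprecation path between the two could look like. `read_rows`, its `strict` flag, and `FieldCountError` are hypothetical names for illustration; none of this is pandas API or the PR's implementation.

```python
import csv
import warnings
from io import StringIO


class FieldCountError(ValueError):
    """Hypothetical stand-in for a parser error on too many fields."""


def read_rows(stream, names, strict=False):
    """Read CSV rows as dicts keyed by `names`.

    strict=False: drop extra fields but emit a FutureWarning (deprecation path).
    strict=True:  raise on the first row with too many fields.
    """
    rows = []
    for lineno, fields in enumerate(csv.reader(stream), start=1):
        if len(fields) > len(names):
            msg = f"Expected {len(names)} fields in line {lineno}, saw {len(fields)}"
            if strict:
                raise FieldCountError(msg)
            warnings.warn(msg + "; extra fields are dropped", FutureWarning, stacklevel=2)
            fields = fields[: len(names)]
        rows.append(dict(zip(names, fields)))
    return rows


column_names = ["one", "two", "three"]
try:
    read_rows(StringIO("foo,bar,baz,bam,blah"), column_names, strict=True)
except FieldCountError as err:
    print(err)
```

With a warning in place for a release cycle, callers relying on the truncation would see the change coming before the strict behavior becomes the default.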