
Conversation

jblackburne (Contributor):

Slight change to the logic in the tokenize_delimited() and tokenize_delim_customterm() functions of the C parser.

Fixes #10022.

I believe the new logic is correct, but perhaps someone with more familiarity can double-check.
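A minimal sketch of the failure mode, using the same input as the unit test this PR adds ('\n hello\nworld\n'); the exact reproduction attached to #10022 may differ:

from io import StringIO

import pandas as pd

data = '\n hello\nworld\n'
df = pd.read_csv(StringIO(data), header=None, engine='c')
print(df)
# Before the fix, the C tokenizer could emit a spurious all-NaN row when a
# chunk of input begins with the line terminator; after the fix the frame
# should hold exactly the two data rows.
assert len(df) == 2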

jreback (Contributor) commented Apr 29, 2015:

You need a test; as it stands, I'm not really sure what you are fixing.

jblackburne (Contributor, Author):

Ok, test forthcoming.

jblackburne (Contributor, Author):

There is a small self-contained test in the comments on issue #10022. Would it be desirable to make it into a unit test? It takes about a second to run on my machine.

jreback (Contributor) commented Apr 30, 2015:

Yep. The idea would be to add the test; it fails on master, and after your fix everything passes.

jreback added the IO CSV (read_csv, to_csv) label on Apr 30, 2015
jblackburne (Contributor, Author):

Unit test added. Any further comments?

@@ -359,6 +359,11 @@ def test_empty_field_eof(self):
                         names=list('abcd'), engine='c')
        assert_frame_equal(df, c)

    def test_chunk_begins_with_newline_whitespace(self):
        data = '\n hello\nworld\n'
Review comment (Contributor):

add the issue number as a comment here

jreback (Contributor) commented May 7, 2015:

Even though the fix is only in the C parser, IIRC this should work in the Python parser as well. Hence, please move the tests to test_parser.py, which tests with all parsers.
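A hedged sketch of an engine-agnostic check (illustrative only; the real pandas suite drives this through the shared test classes in test_parser.py, and the expected frame assumes the default skip_blank_lines and skipinitialspace behaviour):

from io import StringIO

import pandas as pd
from pandas.testing import assert_frame_equal

# see gh-10022: a chunk beginning with a newline must not produce an all-NaN row
data = '\n hello\nworld\n'
expected = pd.DataFrame([[' hello'], ['world']])

for engine in ('c', 'python'):
    result = pd.read_csv(StringIO(data), header=None, engine=engine)
    assert_frame_equal(result, expected)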

jreback added this to the 0.17.0 milestone on May 7, 2015
jreback (Contributor) commented May 7, 2015:

@jblackburne looks good.

  • Please add a release note in v0.16.1.
  • Please squash.
  • Ping when green; see here for detailed git instructions if you need them.

jreback modified the milestones: 0.16.1, 0.17.0 on May 7, 2015
jblackburne (Contributor, Author):

Squash all into a single commit?

shoyer (Member) commented May 7, 2015:

@jblackburne Yes, that's what @jreback is asking for.

…that start with newline.

Changed a condition in tokenize_delim_customterm to account for data chunks that start with terminator.

Added a unit test that fails in master and passes in this branch.

Moved new unit test in order to test all parser engines. Added GH issue number.

Added release note.
jblackburne force-pushed the read_csv-newline-chunk branch from 8f37413 to e693c3a on May 7, 2015 18:46
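The squashed commit above notes that a condition in tokenize_delim_customterm was changed to account for data chunks that start with the terminator. Purely as an illustration of that class of bug (a toy sketch, not the pandas C tokenizer; all names here are invented):

def tokenize(chunks, terminator='\n'):
    records = []
    current = None  # None means no record has been opened yet
    for chunk in chunks:
        for ch in chunk:
            if ch == terminator:
                if current is not None:
                    records.append(current)  # terminator closes an open record
                    current = None
                # else: a terminator at the start of a record (e.g. the chunk
                # begins with '\n') must emit nothing, otherwise a spurious
                # empty/NaN row appears
            else:
                current = (current or '') + ch
    if current is not None:
        records.append(current)
    return records

# The test data from this PR, split so the second chunk begins with the
# terminator; expect [' hello', 'world'], not an extra empty record.
print(tokenize(['\n hello', '\nworld\n']))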
jreback (Contributor) commented May 7, 2015:

Please take a look.

cc @evanpw
cc @mdmueller
cc @selasley

evanpw (Contributor) commented May 7, 2015:

The logic looks right to me.

jreback (Contributor) commented May 8, 2015:

@jblackburne please ping when this is green.

jblackburne (Contributor, Author):

Ok, green.

jreback added a commit that referenced this pull request on May 8, 2015
jreback merged commit 2840bea into pandas-dev:master on May 8, 2015
jblackburne deleted the read_csv-newline-chunk branch on May 9, 2015 00:42
Labels: IO CSV (read_csv, to_csv)
Linked issue (may be closed by merging this pull request): read_csv(engine='c') can insert spurious rows full of NaNs (#10022)
4 participants