Skip to content

BUG: Strings with exponent but no decimal point parsed as integers in python csv engine #10133

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
wants to merge 1 commit into from

Conversation

evanpw
Copy link
Contributor

@evanpw evanpw commented May 14, 2015

Fixes GH #9565. I also made the handling of out-of-range integers consistent with the C engine: parse as a string rather than a double.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv labels May 14, 2015
@jreback jreback added this to the 0.17.0 milestone May 14, 2015
if fval < fINT64_MAX and fval > fINT64_MIN:
try:
ints[i] = int(val)
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe this can still fail for an int > int64max (but less < than a uint64max)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It produces a string in that case:

In [4]: pd.read_csv(StringIO(str(np.iinfo(np.int64).max + 1)), header=None, engine='python').values
Out[4]: array([['9223372036854775808']], dtype=object)

In [5]: pd.read_csv(StringIO(str(np.iinfo(np.int64).min - 1)), header=None, engine='python').values
Out[5]: array([['-9223372036854775809']], dtype=object)

Do you want it to infer the dtype uint64? The C parser doesn't do that either currently.

I don't see how that line ints[i] = int(val) could ever throw an exception, though. Maybe there was some case that's now been eliminated?

@jreback
Copy link
Contributor

jreback commented May 14, 2015

pls give this a perf run on the -r csv and see if anything is changed

@jreback
Copy link
Contributor

jreback commented May 20, 2015

@evanpw looks good. (FYI if you want to add a confirming case for the out-of-range int64 you have above would be good - not sure that is really tested atm).

and pls run perf and report if anything amiss.

@evanpw
Copy link
Contributor Author

evanpw commented Jun 4, 2015

Integers near the boundary of int64 values were getting parsed as strings because the boundary comparison was performed using doubles. That's fixed in this last commit, and some tests added. Checked performance: no measurable difference with master. Rebased and moved whatsnew entry to 0.16.2.

@jreback
Copy link
Contributor

jreback commented Jun 4, 2015

fyi, you may have to rebase again as I just fixed a travis build error about 10 min ago.

@jreback jreback modified the milestones: 0.16.2, 0.17.0 Jun 4, 2015
enum: INT64_MAX
enum: INT64_MIN

from libc.stdint cimport UINT8_MAX, INT64_MAX, INT64_MIN
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this doesn't work on windows, just revert this section

@jreback
Copy link
Contributor

jreback commented Jun 5, 2015

couple of issues w.r.t. windows. I think some of the libraries are named differently. So just revert that part of the code. ping after travis is green.

@evanpw
Copy link
Contributor Author

evanpw commented Jun 7, 2015

cython libc imports reverted, and tests are green

@jreback
Copy link
Contributor

jreback commented Jun 9, 2015

merged via 38876df

thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Dtype Conversions Unexpected or buggy dtype conversions IO CSV read_csv, to_csv
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants