-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
BUG: Strings with exponent but no decimal point parsed as integers in python csv engine #10133
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
if fval < fINT64_MAX and fval > fINT64_MIN: | ||
try: | ||
ints[i] = int(val) | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe this can still fail for an int > int64max (but less < than a uint64max)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It produces a string in that case:
In [4]: pd.read_csv(StringIO(str(np.iinfo(np.int64).max + 1)), header=None, engine='python').values
Out[4]: array([['9223372036854775808']], dtype=object)
In [5]: pd.read_csv(StringIO(str(np.iinfo(np.int64).min - 1)), header=None, engine='python').values
Out[5]: array([['-9223372036854775809']], dtype=object)
Do you want it to infer the dtype uint64
? The C parser doesn't do that either currently.
I don't see how that line ints[i] = int(val)
could ever throw an exception, though. Maybe there was some case that's now been eliminated?
pls give this a perf run on the |
@evanpw looks good. (FYI if you want to add a confirming case for the out-of-range int64 you have above would be good - not sure that is really tested atm). and pls run perf and report if anything amiss. |
Integers near the boundary of int64 values were getting parsed as strings because the boundary comparison was performed using doubles. That's fixed in this last commit, and some tests added. Checked performance: no measurable difference with master. Rebased and moved whatsnew entry to 0.16.2. |
fyi, you may have to rebase again as I just fixed a travis build error about 10 min ago. |
enum: INT64_MAX | ||
enum: INT64_MIN | ||
|
||
from libc.stdint cimport UINT8_MAX, INT64_MAX, INT64_MIN |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't work on windows, just revert this section
couple of issues w.r.t. windows. I think some of the libraries are named differently. So just revert that part of the code. ping after travis is green. |
… python csv engine (GH 9565)
cython libc imports reverted, and tests are green |
merged via 38876df thanks! |
Fixes GH #9565. I also made the handling of out-of-range integers consistent with the C engine: parse as a string rather than a double.