BUG: str dtype ignored for column with dot #50364
Conversation
Could you please review my PR, @MarcoGorelli?

Hey - I'm not overly familiar with this part of the code, but I'll take a look. Could you start by adding a test, please?

Thank you. Yes, of course, I'll add the test.
pandas/io/parsers/python_parser.py
Outdated
```python
self.columns
and self.dtype
and self.columns[i] in self.dtype
and self.dtype[self.columns[i]] is str
```
This is too specific; we need something like `not is_numeric_dtype(self.dtype.get(self.columns[i]))`
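For illustration, a hedged sketch of what that generalized check could look like as standalone code (the `dtype` and `columns` names here stand in for the parser's `self.dtype` and `self.columns`):

```python
from pandas.api.types import is_numeric_dtype

# Hypothetical standalone version of the suggested check: for each column,
# skip thousands handling unless a numeric dtype was explicitly requested.
dtype = {"a": str, "b": "int64"}
columns = ["a", "b", "c"]

# is_numeric_dtype(None) is False, so columns with no requested dtype
# ("c" here) are treated as non-numeric by this check.
skip_thousands = [not is_numeric_dtype(dtype.get(col)) for col in columns]
print(skip_thousands)  # [True, False, True]
```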
Columns should already be defined here
Thank you @phofl, I corrected line number 882.
Hi @phofl. Could you please take a look at my last commit? Looks like CI failures are unrelated to my changes.
I'll have to look more closely, might get to it tomorrow.
Hi @natmokval, I think maybe update `no_convert_columns`?

I like the idea of updating `no_convert_columns`. Could you check if this works?

I tried updating `pandas/tests/io/parser/common/test_common_basic.py` and ran the tests locally. Advice on how to deal with this is much appreciated.

Thank you @yanxiaole for your suggestion. I tried to use it as-is and got some test failures. Then I made slight corrections and still have some failures. I'm still trying to find out what causes these failures.
Thanks for working on this, not an easy one. Looks like this correctly parses column `a`:

```python
In [1]: data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""

In [2]: df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
   ...: df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')

In [3]: df1
Out[3]:
                a      b     c
0       0000.7995  16000     0
1  3.03.001.00514      0  4000
2    4923.600.041  23000   131

In [4]: df2
Out[4]:
                a     b      c
0       0000.7995  16.0    0.0
1  3.03.001.00514   0.0    4.0
2    4923.600.041  23.0  131.0
```
@natmokval sorry, I was on holiday last week.
pandas/io/parsers/python_parser.py
Outdated
```python
assert self._col_indices is not None
for i in self._col_indices:
    if isinstance(self.dtype, dict) and not is_numeric_dtype(
        self.dtype.get(self.columns[i], None)
```
@MarcoGorelli the issue you mentioned is due to this line.
Since your test case doesn't specify the type of the 2nd and the 3rd column, here the code will treat them as non-numeric.
I think this behavior is reasonable; what do you think?
If you change the thousands separator to `,` it will work as expected. Probably we should raise an exception when `self.thousands == self.decimal`?
I don't think that the second and third columns should be parsed any differently if the dtype of `a` has been specified. As in,

```python
pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
pd.read_csv(io.StringIO(data), sep=';', thousands='.', engine='python')
```

should parse columns `'b'` and `'c'` in the same way.
> probably we should raise an exception when `self.thousands == self.decimal`?

sounds reasonable - this could be done in a separate PR if you're interested
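A minimal sketch of that suggested guard as a standalone function (the function name and placement are assumptions; in pandas it would belong in the reader's option validation):

```python
def validate_separators(thousands, decimal):
    # Hypothetical guard: parsing is ambiguous if the same character is
    # used both as the thousands separator and as the decimal point.
    if thousands is not None and thousands == decimal:
        raise ValueError("'thousands' and 'decimal' must be different")

validate_separators(thousands=".", decimal=",")  # fine
try:
    validate_separators(thousands=".", decimal=".")
except ValueError as exc:
    print(exc)  # 'thousands' and 'decimal' must be different
```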
When I run the tests locally I don't get failures like these, so I can't reproduce the problems locally to resolve them.
```python
result = parser.read_csv(
    StringIO(data), sep=";", dtype={"A": str, "B": int, "C": int}, thousands="."
)
```
You should specify these explicitly; `int` is different on 32-bit/Windows compared to Linux.
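For context, a quick illustration of why spelling the width out matters (the column values here are taken from the example data, not the actual test):

```python
import numpy as np
import pandas as pd

# np.dtype(int) is platform dependent: historically int32 on Windows and
# 32-bit builds (with NumPy < 2), int64 on 64-bit Linux/macOS. An expected
# frame built with plain `int` can therefore fail in CI on some platforms.
platform_int = np.dtype(int)
print(platform_int)  # int32 or int64 depending on platform and NumPy version

# Requesting "int64" explicitly makes the expectation unambiguous.
expected = pd.DataFrame({"B": [16000, 0, 23000]}, dtype="int64")
print(expected["B"].dtype)  # int64
```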
Thank you for your advice. I did specify the integer dtypes explicitly and that worked for me.
Seems like the CI failures this time are not related to my changes.
Thanks for sticking with this!
I've left a quick comment, will take a closer look this coming week
Looks like my previous comment has been addressed anyway, nice
thanks for updating - the code checks look good to me now (@phofl any further comments?)

What I'm less sure of are the tests - it looks like you've added quite a few tests, but I don't really follow what they're all testing. `test__search_replace_num_columns` looks duplicative of the others, unless I've missed something? Perhaps it can be removed?

I think the cases we need to check are:
- when `dtype` is specified to be `str`, that `thousands` is ignored (regardless of which other dtypes were specified - you have this)
- when `dtype` is numeric, that `thousands` is respected - you also already test this in `test_no_thousand_with_dot_convert_for_non_numeric_cols`

So, would `test_no_thousand_with_dot_convert_for_non_numeric_cols` be sufficient, or is there something the others are checking which I've missed?
Thank you for the comments.
Nice, looks to me like we're nearly there
Looks good to me! Leaving open for others to take a look though as I'm not familiar enough with this part of the code to merge
Some comments; implementation-wise it's good, I think.
```python
(
    """\
a;b;c
0000.7995;16.000;0
```
I might be missing something, but it looks like only the delimiter is changed between the different parametrizations? In this case please only parametrize over the delimiter and construct data in the test itself
The delimiter is not the only difference between the parametrizations. In the 3rd and 4th examples the value in column "b" was changed from `16,000` to `16,000.1`. I don't see a way to construct the data in the test and parametrize it over the delimiter.

I suggest using two different tests: one for the first two examples and one for the second two. In that case, we can move the data construction into the tests themselves.
Yes, 2 different tests then.
I split the test into 2 tests. Now the data are constructed in the tests themselves.
Should I add an example for `bool` dtype?
```python
0000.7995;16.000;0
3.03.001.00514;0;4.000
4923.600.041;23.000;131""",
{"a": str},
```
Please use `object` instead of `str`.
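For reference, `dtype=str` and `dtype=object` both yield an object-dtype column of strings from `read_csv` here; pandas tests conventionally spell the latter:

```python
from io import StringIO
import pandas as pd

# Explicitly requesting object dtype keeps the raw text, the same as str
# would for this column, but is the conventional spelling in the test suite.
df = pd.read_csv(StringIO("a\n0000.7995"), dtype={"a": object})
print(df["a"].dtype)    # object
print(df["a"].iloc[0])  # 0000.7995
```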
I changed dtype to object.
```python
",",
DataFrame(
    {
        "a": ["0000,7995", "3,03,001,00514", "4923,600,041"],
```
First column looks to be the same everywhere? If yes, please remove from parametrisation
The first column is not exactly the same in the different examples. In the last two examples, the delimiter was changed from `.` to `,`.

I could remove column "a" from the parametrizations and add it to the expected DataFrame in the test. Then we would have one part of the expected DataFrame in the parametrization and the other in the test. Wouldn't that make the code less readable?
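A sketch of what the reviewer's suggestion could look like, with only the delimiter parametrized and the data built inside the test (the helper name is hypothetical):

```python
def make_data(sep: str) -> str:
    # Build the CSV text from rows, so only the delimiter varies between
    # parametrizations; the expected frame can be built the same way.
    rows = [
        ["a", "b", "c"],
        ["0000.7995", "16.000", "0"],
        ["3.03.001.00514", "0", "4.000"],
        ["4923.600.041", "23.000", "131"],
    ]
    return "\n".join(sep.join(row) for row in rows)

print(make_data(";").splitlines()[0])  # a;b;c
```

With pytest this would pair with something like `@pytest.mark.parametrize("sep", [";", ","])` on the test function.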
pandas/io/parsers/python_parser.py
Outdated
```python
if (
    isinstance(self.dtype, dict)
    and self.columns[i] in self.dtype
    and not is_numeric_dtype(self.dtype[self.columns[i]])
```
Can we exclude bool here?
Sure, it's done.
Did you mean that we have to avoid processing a bool value as a numeric one?
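The reason an explicit exclusion is needed: `is_numeric_dtype` counts `bool` as numeric, so a check based only on `not is_numeric_dtype(...)` would not distinguish bool columns from genuinely numeric ones:

```python
from pandas.api.types import is_numeric_dtype

# bool is treated as numeric by is_numeric_dtype, hence the explicit
# bool exclusion discussed above.
print(is_numeric_dtype(bool))  # True
print(is_numeric_dtype(str))   # False
```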
@phofl, could you please take a look at this PR? I made the changes that you suggested.
I'd say let's ship this; if there are further comments they can be addressed in a follow-up.

Thanks @natmokval, well done on persevering with this one!
This is a direct attempt to fix the problem with 'dot' in strings. Maybe it would be better to use the `_no_thousands_columns` attribute for this purpose.
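A hedged standalone sketch of that alternative: compute the set of column indices for which the thousands separator should never be applied (the helper name is hypothetical; in the parser this would feed the `_no_thousands_columns` set):

```python
from pandas.api.types import is_numeric_dtype

def no_thousands_columns(columns, dtype):
    # Collect indices of columns whose requested dtype is non-numeric,
    # so the thousands separator is never applied to them.
    skip = set()
    for i, col in enumerate(columns):
        if isinstance(dtype, dict) and col in dtype and not is_numeric_dtype(dtype[col]):
            skip.add(i)
    return skip

print(no_thousands_columns(["a", "b", "c"], {"a": str, "b": "int64"}))  # {0}
```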