BUG: str dtype ignored for column with dot #50364


Merged: 27 commits into pandas-dev:main on Mar 16, 2023

Conversation

natmokval
Contributor

This is a direct attempt to fix the problem with 'dot' in strings. Maybe it would be better to use the `_no_thousands_columns` attribute for this purpose.

@natmokval
Contributor Author

Could you please review my PR, @MarcoGorelli?

@MarcoGorelli
Member

Hey - I'm not overly familiar with this part of the code, but I'll take a look

could you start by adding a test please?

@natmokval
Contributor Author

> Hey - I'm not overly familiar with this part of the code, but I'll take a look

Thank you.

> could you start by adding a test please?

Yes, of course, I’ll add the test.

self.columns
and self.dtype
and self.columns[i] in self.dtype
and self.dtype[self.columns[i]] is str
Member

This is too specific; we need something like `not is_numeric_dtype(self.dtype.get(self.columns[i]))`.

Columns should already be defined here
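
For illustration, a minimal sketch of this suggestion (the helper name is hypothetical; this is not the code that was merged):

```python
from pandas.api.types import is_numeric_dtype

def _skip_thousands_replacement(dtype_map, column) -> bool:
    # Skip the thousands-separator replacement when a non-numeric dtype
    # was requested for this column. Note that ``dict.get`` returns None
    # for unspecified columns and is_numeric_dtype(None) is False, so
    # unspecified columns would be skipped too (the caveat raised later
    # in this thread).
    return isinstance(dtype_map, dict) and not is_numeric_dtype(
        dtype_map.get(column)
    )
```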

Contributor Author

Thank you @phofl, I corrected line number 882.

Contributor Author

Hi @phofl, could you please take a look at my last commit? It looks like the CI failures are unrelated to my changes.

Member

I'll have to look more closely, might get to it tomorrow.

@mroeschke added the IO CSV (read_csv, to_csv) label on Dec 27, 2022
@yanxiaole

Hi @natmokval, I think updating `_no_thousands_columns` may be a better idea.
Could you check this commit? main...yanxiaole:pandas:issue-50270

@phofl
Member

phofl commented Dec 30, 2022

I like the idea of updating no_convert_columns. Could you check if this works?

@natmokval marked this pull request as draft on January 2, 2023 09:03
@natmokval
Contributor Author

natmokval commented Jan 5, 2023

> I like the idea of updating no_convert_columns. Could you check if this works?

I tried to use `_no_thousands_columns` to exclude non-numeric columns but got test failures.
Some of them are related to a dtype problem:

E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

The failures are in pandas/tests/io/parser/common/test_common_basic.py; locally, the same tests pass.

I ran the tests locally with `pytest pandas -n 4 -m "not slow and not network and not db and not single_cpu" -r sxXq`:

======= 189138 passed, 1418 skipped, 3340 xfailed, 14 warnings in 537.93s (0:08:57) =======

Advice on how to deal with this is much appreciated.

@natmokval
Contributor Author

> I think updating `_no_thousands_columns` may be a better idea.
> Could you check this commit? main...yanxiaole:pandas:issue-50270

Thank you for your suggestion, @yanxiaole. I tried using it as-is and got some test failures. Then I made slight corrections and still have some failures. I'm still trying to find out what causes them.

@MarcoGorelli
Member

Thanks for working on this, not an easy one

Looks like this correctly parses column `a`, but now `b` and `c` have changed:

In [1]: data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""

In [2]: df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
   ...: df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')

In [3]: df1
Out[3]: 
                a      b     c
0       0000.7995  16000     0
1  3.03.001.00514      0  4000
2    4923.600.041  23000   131

In [4]: df2
Out[4]: 
                a     b      c
0       0000.7995  16.0    0.0
1  3.03.001.00514   0.0    4.0
2    4923.600.041  23.0  131.0

@yanxiaole

@natmokval sorry, I was on holiday last week.
The failing case seems to be due to the 32-bit environment. I'm not familiar with this part; @phofl, could you take a look and give some advice?

assert self._col_indices is not None
for i in self._col_indices:
    if isinstance(self.dtype, dict) and not is_numeric_dtype(
        self.dtype.get(self.columns[i], None)

@MarcoGorelli the issue you mentioned is due to this line.
Since your test case doesn't specify the type of the 2nd and 3rd columns, the code here treats them as non-numeric.
I think the behavior is reasonable; what do you think?

If you change the thousands separator to `,`, it will work as expected. Should we perhaps raise an exception when `self.thousands == self.decimal`?

Member

I don't think the second and third columns should be parsed any differently when the dtype of `a` has been specified.

As in,

  • pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
  • pd.read_csv(io.StringIO(data), sep=';', thousands='.', engine='python')

should parse columns 'b' and 'c' in the same way.

> Should we perhaps raise an exception when `self.thousands == self.decimal`?

Sounds reasonable - this could be done in a separate PR if you're interested.
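
A hedged sketch of that separate-PR idea (a hypothetical standalone helper, not pandas code):

```python
def validate_separators(thousands: str | None, decimal: str = ".") -> None:
    # Reject ambiguous settings up front: a shared character makes a value
    # like '1.234' unparseable (thousands group or decimal fraction?).
    if thousands is not None and thousands == decimal:
        raise ValueError(
            f"thousands separator {thousands!r} must differ from "
            f"decimal separator {decimal!r}"
        )
```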

@natmokval
Contributor Author

When I run the tests locally, I don't get failures like these:

E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

I can't reproduce the problems locally to resolve them.
I'm out of ideas on how to debug the failures; any help is appreciated.

result = parser.read_csv(
    StringIO(data), sep=";", dtype={"A": str, "B": int, "C": int}, thousands="."
)
Member

You should specify these explicitly.

`int` is different on 32-bit/Windows compared to Linux.
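
To illustrate the platform difference (a standalone sketch, not part of the PR):

```python
import numpy as np

# The Python built-in ``int`` maps to NumPy's default integer, which is
# platform-dependent: 32-bit on Windows (under NumPy 1.x) and on 32-bit
# builds, 64-bit on most 64-bit Linux/macOS builds. Pinning np.int64 (or
# "int64") in the expected DataFrame keeps the comparison
# platform-independent.
print(np.dtype(int))       # int32 on Windows under NumPy 1.x, int64 on 64-bit Linux
print(np.dtype(np.int64))  # always int64
```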

Contributor Author

Thank you for your advice. I specified the dtype explicitly and that worked for me.

Seems like the CI failures this time are not related to my changes.

Member

@MarcoGorelli left a comment

Thanks for sticking with this!

I've left a quick comment, will take a closer look this coming week

Looks like my previous comment has been addressed anyway, nice

@natmokval marked this pull request as ready for review on February 18, 2023 21:10
Member

@MarcoGorelli left a comment

thanks for updating - the code checks look good to me now (@phofl any further comments?)

What I'm less sure of are the tests - it looks like you've added quite a few tests, but I don't really follow what they're all testing

`test__search_replace_num_columns` looks duplicative of the others, unless I've missed something? Perhaps it can be removed?

I think the cases we need to check are:

  • when dtype is specified to be str, that thousands is ignored (regardless of which other dtypes were specified - you have this)
  • when dtype is numeric, that thousands is respected - you also already test this in test_no_thousand_with_dot_convert_for_non_numeric_cols

So, would the test_no_thousand_with_dot_convert_for_non_numeric_cols be sufficient, or is there something the others are checking which I've missed?
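
For concreteness, a hedged sketch of those two cases in one test (hypothetical, not the test merged in this PR):

```python
from io import StringIO

import pandas as pd

data = "a;b\n1.234;5.678"
df = pd.read_csv(
    StringIO(data), sep=";", dtype={"a": str}, thousands=".", engine="python"
)

# dtype str: thousands='.' is ignored and the raw string survives
assert df.loc[0, "a"] == "1.234"

# numeric column (dtype unspecified, inferred): thousands='.' is respected
assert df.loc[0, "b"] == 5678
```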

@natmokval
Contributor Author

Thank you for the comments.
I agree: after adding parametrization to `test_no_thousand_with_dot_convert_for_non_numeric_cols`, `test__search_replace_num_columns` becomes duplicative. I deleted it.

Member

@MarcoGorelli left a comment

Nice, looks to me like we're nearly there

Member

@MarcoGorelli left a comment

Looks good to me! Leaving open for others to take a look though as I'm not familiar enough with this part of the code to merge

Member

@phofl left a comment

Some comments; implementation-wise it's good, I think.

(
"""\
a;b;c
0000.7995;16.000;0
Member

I might be missing something, but it looks like only the delimiter is changed between the different parametrizations? If so, please parametrize only over the delimiter and construct the data in the test itself.

Contributor Author

The delimiter is not the only difference between the parametrizations. In the third and fourth examples, the value in column "b" was changed from 16,000 to 16,000.1. I don't see a way to construct the data in the test and parametrize only over the delimiter.

I suggest using two different tests: one for the first two examples and one for the last two. In that case, we can move the data construction into the tests themselves.

Member

Yes, 2 different tests then.

Contributor Author

I split the test into two tests. Now the data is constructed in the tests themselves.
Should I add an example for the bool dtype?
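
For reference, a hedged sketch of the resulting shape (simplified; the merged tests differ in data and naming):

```python
from io import StringIO

import pytest

import pandas as pd

@pytest.mark.parametrize("thousands", [".", ","])
def test_no_thousand_convert_for_non_numeric_cols(thousands):
    # The data is constructed inside the test, so only the thousands
    # character varies across parametrizations.
    data = f"a;b\n0000{thousands}7995;16{thousands}000"
    result = pd.read_csv(
        StringIO(data),
        sep=";",
        dtype={"a": object},
        thousands=thousands,
        engine="python",
    )
    assert result.loc[0, "a"] == f"0000{thousands}7995"
    assert result.loc[0, "b"] == 16000
```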

0000.7995;16.000;0
3.03.001.00514;0;4.000
4923.600.041;23.000;131""",
{"a": str},
Member

Please use object instead of str

Contributor Author

I changed dtype to object.

",",
DataFrame(
    {
        "a": ["0000,7995", "3,03,001,00514", "4923,600,041"],
Member

First column looks to be the same everywhere? If yes, please remove from parametrisation

Contributor Author

The first column is not exactly the same in all examples: in the last two, the delimiter was changed from `.` to `,`.
I can remove column "a" from the parametrizations and add it to the expected DataFrame in the test, but then one part of the expected DataFrame would be in the parametrization and the other in the test.
Wouldn't that make the code less readable?

if (
    isinstance(self.dtype, dict)
    and self.columns[i] in self.dtype
    and not is_numeric_dtype(self.dtype[self.columns[i]])
Member

Can we exclude bool here?

Contributor Author

Sure, it's done.
Did you mean that we should avoid treating a bool value as numeric?
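
Putting the pieces together, a hedged sketch of the resulting check, condensed from the diff hunks above (the merged code may differ in detail):

```python
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def _keep_thousands_intact(dtype_map, column) -> bool:
    # A column is exempt from thousands-separator replacement when an
    # explicit dtype was requested for it and that dtype is not numeric.
    # bool is excluded separately because pandas counts bool as numeric
    # (is_numeric_dtype(bool) is True).
    return (
        isinstance(dtype_map, dict)
        and column in dtype_map
        and (
            not is_numeric_dtype(dtype_map[column])
            or is_bool_dtype(dtype_map[column])
        )
    )
```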

@natmokval
Contributor Author

@phofl, could you please take a look at this PR? I made the changes you suggested.

@MarcoGorelli added this to the 2.1 milestone on Mar 16, 2023
Member

@MarcoGorelli left a comment

I'd say let's ship this; if there are further comments, they can be addressed in a follow-up.

@MarcoGorelli
Member

Thanks @natmokval , well done on persevering with this one!

@MarcoGorelli merged commit 060b9da into pandas-dev:main on Mar 16, 2023
Labels: IO CSV (read_csv, to_csv)

Successfully merging this pull request may close these issues.

BUG: str dtype ignored for column with '.' if thousands='.' for python engine
5 participants