BUG: str dtype ignored for column with dot #50364


Merged: 27 commits into pandas-dev:main on Mar 16, 2023

Conversation

natmokval
Contributor

This is a direct attempt to fix the problem with 'dot' in strings. Maybe it would be better to use the `_no_thousands_columns` attribute for this purpose.

@natmokval
Contributor Author

Could you please review my PR, @MarcoGorelli?

@MarcoGorelli
Member

Hey - I'm not overly familiar with this part of the code, but I'll take a look

could you start by adding a test please?

@natmokval
Contributor Author

> Hey - I'm not overly familiar with this part of the code, but I'll take a look

Thank you.

> could you start by adding a test please?

Yes, of course, I’ll add the test.

self.columns
and self.dtype
and self.columns[i] in self.dtype
and self.dtype[self.columns[i]] is str
Member

This is too specific; we need something like `not is_numeric_dtype(self.dtype.get(self.columns[i]))`.

Columns should already be defined here
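
For illustration, a minimal sketch of this suggestion (the helper name is hypothetical; this is not the code that was merged):

```python
from pandas.api.types import is_numeric_dtype

def _skip_thousands_replacement(dtype_map, column) -> bool:
    # Skip the thousands-separator replacement when a non-numeric dtype
    # was requested for this column. Note that ``dict.get`` returns None
    # for unspecified columns and is_numeric_dtype(None) is False, so
    # unspecified columns would be skipped too (the caveat raised later
    # in this thread).
    return isinstance(dtype_map, dict) and not is_numeric_dtype(
        dtype_map.get(column)
    )
```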

Contributor Author

Thank you @phofl, I corrected line number 882.

Contributor Author

Hi @phofl, could you please take a look at my last commit? It looks like the CI failures are unrelated to my changes.

Member

I'll have to look more closely, might get to it tomorrow.

@mroeschke added the IO CSV (read_csv, to_csv) label on Dec 27, 2022
@yanxiaole

Hi @natmokval, I think updating `_no_thousands_columns` may be a better idea.
Could you check this commit? main...yanxiaole:pandas:issue-50270

@phofl
Member

phofl commented Dec 30, 2022

I like the idea of updating no_convert_columns. Could you check if this works?

@natmokval marked this pull request as draft on January 2, 2023 09:03
@natmokval
Contributor Author

natmokval commented Jan 5, 2023

> I like the idea of updating no_convert_columns. Could you check if this works?

I tried to use `_no_thousands_columns` to exclude non-numeric columns but got test failures.
Some of them are related to a dtype problem:

E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

The failures are in pandas/tests/io/parser/common/test_common_basic.py; locally, the same tests pass.

I ran the tests locally with `pytest pandas -n 4 -m "not slow and not network and not db and not single_cpu" -r sxXq`:

======= 189138 passed, 1418 skipped, 3340 xfailed, 14 warnings in 537.93s (0:08:57) =======

Advice on how to deal with this is much appreciated.

@natmokval
Contributor Author

> I think updating `_no_thousands_columns` may be a better idea.
> Could you check this commit? main...yanxiaole:pandas:issue-50270

Thank you for your suggestion, @yanxiaole. I tried using it as-is and got some test failures. Then I made slight corrections and still have some failures. I'm still trying to find out what causes them.

@MarcoGorelli
Member

Thanks for working on this, not an easy one

Looks like this correctly parses column `a`, but now `b` and `c` have changed:

In [1]: data = """a;b;c\n0000.7995;16.000;0\n3.03.001.00514;0;4.000\n4923.600.041;23.000;131"""

In [2]: df1 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='c')
   ...: df2 = pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')

In [3]: df1
Out[3]: 
                a      b     c
0       0000.7995  16000     0
1  3.03.001.00514      0  4000
2    4923.600.041  23000   131

In [4]: df2
Out[4]: 
                a     b      c
0       0000.7995  16.0    0.0
1  3.03.001.00514   0.0    4.0
2    4923.600.041  23.0  131.0

@yanxiaole

@natmokval sorry, I was on holiday last week.
The failing case seems to be due to the 32-bit environment. I'm not familiar with this part; @phofl, could you take a look and give some advice?

assert self._col_indices is not None
for i in self._col_indices:
    if isinstance(self.dtype, dict) and not is_numeric_dtype(
        self.dtype.get(self.columns[i], None)

@MarcoGorelli the issue you mentioned is due to this line.
Since your test case doesn't specify the type of the 2nd and 3rd columns, the code here treats them as non-numeric.
I think the behavior is reasonable; what do you think?

If you change the thousands separator to `,`, it will work as expected. Should we perhaps raise an exception when `self.thousands == self.decimal`?

Member

I don't think the second and third columns should be parsed any differently when the dtype of `a` has been specified.

As in,

  • pd.read_csv(io.StringIO(data), sep=';', dtype={'a': str}, thousands='.', engine='python')
  • pd.read_csv(io.StringIO(data), sep=';', thousands='.', engine='python')

should parse columns 'b' and 'c' in the same way.

> Should we perhaps raise an exception when `self.thousands == self.decimal`?

Sounds reasonable - this could be done in a separate PR if you're interested.
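
A hedged sketch of that separate-PR idea (a hypothetical standalone helper, not pandas code):

```python
def validate_separators(thousands: str | None, decimal: str = ".") -> None:
    # Reject ambiguous settings up front: a shared character makes a value
    # like '1.234' unparseable (thousands group or decimal fraction?).
    if thousands is not None and thousands == decimal:
        raise ValueError(
            f"thousands separator {thousands!r} must differ from "
            f"decimal separator {decimal!r}"
        )
```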

@natmokval
Contributor Author

When I run the tests locally, I don't get failures like these:

E       Attribute "dtype" are different
E       [left]:  int32
E       [right]: int64

I can't reproduce the problems locally to resolve them.
I'm out of ideas on how to debug the failures; any help is appreciated.

result = parser.read_csv(
    StringIO(data), sep=";", dtype={"A": str, "B": int, "C": int}, thousands="."
)
Member

You should specify these explicitly.

`int` is different on 32-bit/Windows compared to Linux.
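
To illustrate the platform difference (a standalone sketch, not part of the PR):

```python
import numpy as np

# The Python built-in ``int`` maps to NumPy's default integer, which is
# platform-dependent: 32-bit on Windows (under NumPy 1.x) and on 32-bit
# builds, 64-bit on most 64-bit Linux/macOS builds. Pinning np.int64 (or
# "int64") in the expected DataFrame keeps the comparison
# platform-independent.
print(np.dtype(int))       # int32 on Windows under NumPy 1.x, int64 on 64-bit Linux
print(np.dtype(np.int64))  # always int64
```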

Contributor Author

Thank you for your advice. I specified the dtype explicitly and that worked for me.

Seems like the CI failures this time are not related to my changes.

Member

@MarcoGorelli left a comment

Thanks for sticking with this!

I've left a quick comment, will take a closer look this coming week

Looks like my previous comment has been addressed anyway, nice

@natmokval marked this pull request as ready for review on February 18, 2023 21:10
Member

@MarcoGorelli left a comment

thanks for updating - the code checks look good to me now (@phofl any further comments?)

What I'm less sure of are the tests - it looks like you've added quite a few tests, but I don't really follow what they're all testing

`test__search_replace_num_columns` looks duplicative of the others, unless I've missed something? Perhaps it can be removed?

I think the cases we need to check are:

  • when dtype is specified to be str, that thousands is ignored (regardless of which other dtypes were specified - you have this)
  • when dtype is numeric, that thousands is respected - you also already test this in test_no_thousand_with_dot_convert_for_non_numeric_cols

So, would the test_no_thousand_with_dot_convert_for_non_numeric_cols be sufficient, or is there something the others are checking which I've missed?
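
For concreteness, a hedged sketch of those two cases in one test (hypothetical, not the test merged in this PR):

```python
from io import StringIO

import pandas as pd

data = "a;b\n1.234;5.678"
df = pd.read_csv(
    StringIO(data), sep=";", dtype={"a": str}, thousands=".", engine="python"
)

# dtype str: thousands='.' is ignored and the raw string survives
assert df.loc[0, "a"] == "1.234"

# numeric column (dtype unspecified, inferred): thousands='.' is respected
assert df.loc[0, "b"] == 5678
```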

@natmokval
Contributor Author

Thank you for the comments.
I agree: after adding parametrization to `test_no_thousand_with_dot_convert_for_non_numeric_cols`, `test__search_replace_num_columns` becomes duplicative. I deleted it.

Member

@MarcoGorelli left a comment

Nice, looks to me like we're nearly there

Member

@MarcoGorelli left a comment

Looks good to me! Leaving open for others to take a look though as I'm not familiar enough with this part of the code to merge

Member

@phofl left a comment

Some comments; implementation-wise it's good, I think.

(
"""\
a;b;c
0000.7995;16.000;0
Member

I might be missing something, but it looks like only the delimiter is changed between the different parametrizations? If so, please parametrize only over the delimiter and construct the data in the test itself.

Contributor Author

The delimiter is not the only difference between the parametrizations. In the third and fourth examples, the value in column "b" was changed from 16,000 to 16,000.1. I don't see a way to construct the data in the test and parametrize only over the delimiter.

I suggest using two different tests: one for the first two examples and one for the last two. In that case, we can move the data construction into the tests themselves.

Member

Yes, 2 different tests then.

Contributor Author

I split the test into two tests. Now the data is constructed in the tests themselves.
Should I add an example for the bool dtype?
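
For reference, a hedged sketch of the resulting shape (simplified; the merged tests differ in data and naming):

```python
from io import StringIO

import pytest

import pandas as pd

@pytest.mark.parametrize("thousands", [".", ","])
def test_no_thousand_convert_for_non_numeric_cols(thousands):
    # The data is constructed inside the test, so only the thousands
    # character varies across parametrizations.
    data = f"a;b\n0000{thousands}7995;16{thousands}000"
    result = pd.read_csv(
        StringIO(data),
        sep=";",
        dtype={"a": object},
        thousands=thousands,
        engine="python",
    )
    assert result.loc[0, "a"] == f"0000{thousands}7995"
    assert result.loc[0, "b"] == 16000
```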

0000.7995;16.000;0
3.03.001.00514;0;4.000
4923.600.041;23.000;131""",
{"a": str},
Member

Please use object instead of str

Contributor Author

I changed dtype to object.

",",
DataFrame(
    {
        "a": ["0000,7995", "3,03,001,00514", "4923,600,041"],
Member

First column looks to be the same everywhere? If yes, please remove from parametrisation

Contributor Author

The first column is not exactly the same in all examples: in the last two, the delimiter was changed from `.` to `,`.
I can remove column "a" from the parametrizations and add it to the expected DataFrame in the test, but then one part of the expected DataFrame would be in the parametrization and the other in the test.
Wouldn't that make the code less readable?

if (
    isinstance(self.dtype, dict)
    and self.columns[i] in self.dtype
    and not is_numeric_dtype(self.dtype[self.columns[i]])
Member

Can we exclude bool here?

Contributor Author

Sure, it's done.
Did you mean that we should avoid treating a bool value as numeric?
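
Putting the pieces together, a hedged sketch of the resulting check, condensed from the diff hunks above (the merged code may differ in detail):

```python
from pandas.api.types import is_bool_dtype, is_numeric_dtype

def _keep_thousands_intact(dtype_map, column) -> bool:
    # A column is exempt from thousands-separator replacement when an
    # explicit dtype was requested for it and that dtype is not numeric.
    # bool is excluded separately because pandas counts bool as numeric
    # (is_numeric_dtype(bool) is True).
    return (
        isinstance(dtype_map, dict)
        and column in dtype_map
        and (
            not is_numeric_dtype(dtype_map[column])
            or is_bool_dtype(dtype_map[column])
        )
    )
```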

@natmokval
Contributor Author

@phofl, could you please take a look at this PR? I made the changes you suggested.

@MarcoGorelli added this to the 2.1 milestone on Mar 16, 2023
Member

@MarcoGorelli left a comment

I'd say let's ship this; if there are further comments, they can be addressed in a follow-up.

@MarcoGorelli
Member

Thanks @natmokval , well done on persevering with this one!

@MarcoGorelli merged commit 060b9da into pandas-dev:main on Mar 16, 2023
Labels: IO CSV (read_csv, to_csv)

Successfully merging this pull request may close these issues.

BUG: str dtype ignored for column with '.' if thousands='.' for python engine
5 participants