TST: tests for GH4862, GH7401, GH7403, GH7405 #9292

behzadnouri · 2015-01-18T23:28:06Z

closes #4862
closes #7401
closes #7403
closes #7405

minor code change to #9061; otherwise only tests. the code change is to avoid #9170 issue.

jreback · 2015-01-19T00:26:01Z

pandas/core/reshape.py

-            return lev._shallow_copy(vals)
+            try:
+                return lev._shallow_copy(vals.astype(lev.dtype, copy=False))
+            except ValueError:


what does this fail on? why don't you always just do Index(vals.astype(.......), **lev._get_attributes_dict())? e.g. pass the new values to Index and let it figure out the correct index type (with the exsting attributes). This is a very common idiom.

few lines above it is inserting _get_na_value(lev.dtype.type) into the lev.values.astype('object'). if lev was, say, an Int64Index then vals.astype(lev.dtype) will fail with ValueError.

I think lev._shallow_copy is more efficient than Index(....). so i try _shallow_copy first.

I think these are really close in efficiency. But this is obfusciating the code. Either put this try: except: in another function, or just call the index constructor. We don't need another way of doing things.

_shallow_copy does not do any type inference at all; Index does inspect types quite a lot, in particular this line is very inefficient. if i do Index(vals.astype(lev.dtype, copy=False), ... or Index(vals, dtype=lev.dtype) it would still need a try ... catch block because of the possible ValueError

jreback · 2015-01-19T12:49:31Z

pandas/core/reshape.py

+            try:
+                return lev._shallow_copy(vals.astype(lev.dtype, copy=False))
+            except ValueError:  # when vals cannot be type casted to lev.dtype
+                return Index(vals, **lev._get_attributes_dict())



I get why you are doing the try: except, this ONLY needs to be around the astype though. Make this a separate method

behzadnouri · 2015-01-19T15:33:19Z

@jreback took _shallow_copy/Index to a separate functions, and removed check_dtype=False.

jreback · 2015-01-25T23:27:50Z

@behzadnouri ok this looks good. and tests ok on windows.

can you add a release note (you can put all the issues on the same one I think). ping when pushed.

behzadnouri · 2015-01-26T01:27:41Z

@jreback done!

TST: tests for GH4862, GH7401, GH7403, GH7405

jreback · 2015-01-26T01:29:28Z

thanks! a 4-some! excellent

seth-p · 2015-02-12T02:34:32Z

pandas/tests/test_frame.py

+        vals = [[3, 0, 1, 2, nan, nan, nan, nan],
+                [nan, nan, nan, nan, 4, 5, 6, 7]]
+        vals = list(map(list, zip(*vals)))
+        idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B')


It may not be obvious (at least it wasn't to me), but idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B') results in Float64Index([nan, 0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0], dtype='float64') -- i.e. the integer levels (i.e. what the user sees) get converted to floats. Thus you are testing that when unstacking a MultiIndex that leaves a single level with integer labels and NaNs, the result is a Float64Index. This is pretty much unavoidable -- unless you explicitly cast the integers to objects rather than float64 -- as MultIndex doesn't seem to support a single level. (Note that an integer MultiIndex level can have missing values, since they are represented by -1 in the labels and not NaN in the labels.) I think either (a) MultIndex should support a single level (and not convert to an Index), or (b) the unstacking code should convert such an integer-with-NaN index to object rather than float64. [I encountered this issue while trying to make the stack() code handle NaN labels propertly.]

unless you explicitly cast the integers to objects rather than float64

I actually explicitly cast to objects not float64. the index comes out as float because of type inference in Index.__new__, not in unstacking code.

>>> i array([nan, 1, 2], dtype=object) >>> type(i[1]) <class 'int'> >>> Index(i) Float64Index([nan, 1.0, 2.0], dtype='float64')

you have to specify a dtype=object when constructing an index to not have it coerce (if u want it to explicitly preserve it)
though can't think of a reason you would ever actually want to do this

if u really mean integer then we use -1 as a sentinel in the indexers and they are left as ints

@behzadnouri, yes, precisely, here the conversion of ints to foats is done by Index.__new__. However since this is being compared to the results of left = df.set_index(['A', 'B']).unstack(0), clearly the unstack() code is also doing such a conversion.

@jreback, understood. Two points:

First, other than specifying dtype='object', there doesn't seem to be a way to construct a (single-level) index of _int_s with NaNs. In a MultiIndex with more than one level, a level can be int with NaNs, but a MultiIndex with a single level gets converted automatically to an Index, and in the case of an int level with NaNs, a Float64Index.

In [27]: pd.Index([nan, 1, 2]) # note automatic conversion to Float64Index Out[27]: Float64Index([nan, 1.0, 2.0], dtype='float64') In [29]: pd.MultiIndex.from_tuples([(nan,), (1,), (2,)]) # note automatic conversion to Float64Index Out[29]: Float64Index([nan, 1.0, 2.0], dtype='float64') In [30]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')]) Out[30]: MultiIndex(levels=[[1, 2], ['a']], labels=[[-1, 0, 1], [0, 0, 0]]) In [31]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')]).levels[0] Out[31]: Int64Index([1, 2], dtype='int64') # note first level is Int64Index

Second, my point is that in the course of stacking and/or unstacking (and perhaps other operations), a single level in a multi-level MultiIndex can be "promoted" to be an Index unto itself, and in that case seems (at least currently in the case of stack() and unstack()) to be changed from ints to floats, which I don't think is a good thing to do to index values.

@behzadnouri, just to clarify, I don't doubt that within the unstacking code the conversion from ints to floats happens in Index.__init__(), but that doesn't change the fact that in such cases unstack() has the effect of converting ints to floats.

The reason I think this is a potential problem (which predates you, I'm sure, and is also present in stack()), is that if someone then wants to compare the values in the Index of the stacked/unstacked object to the values in the corresponding level of MultiIndex of the original DataFrame, they won't match.

@seth-p I think the unstack/stack code need to explicity pass the dtype into when the Index is constructed.

Currently you cannot have nans in an Int64Index, so unless you mark it as object it will by definition be coerced. A multi-index can represent this, but single level multi-indexes are not supported as that is just added complexity.

The question is what you want the behavior to be. Suppose df.index is a 2-level MultiIndex, the first level of which is (effectively) an Int64Index with missing values (i.e. -1 in labels), and that you then unstack the second level (i.e. df.unstack(1)). What should the result index be? There are two options: (a) Float64Index, and (b) Index(..., dtype='object'). With this PR, the result is (a). Personally I think (b) is preferable.

In : df = pd.DataFrame([[11,22],[33,44]], index=pd.MultiIndex.from_tuples([(1, 'a'), (None, 'b')], names=['ints', 'letters'])) In : df.index.levels[0] Out: Int64Index([1], dtype='int64') In : df.index.labels[0] Out: FrozenNDArray([0, -1], dtype='int8') In : result = df.unstack(1) In : result.index Out: Float64Index([nan, 1.0], dtype='float64')

On a related note, to see in general how messed up things are when have NaNs in indices, consider the following:

In [698]: idx = MultiIndex.from_tuples([(1, 'a'), (np.nan, 'b')]) In [699]: idx.levels[0] Out[699]: Int64Index([1], dtype='int64') In [700]: idx.get_values() Out[700]: array([(1.0, 'a'), (nan, 'b')], dtype=object) In [701]: idx[:1].get_values() Out[701]: array([(1, 'a')], dtype=object)

Why does idx.get_values() convert the int 1 to a float 1.0 just because there's a NaN somewhere else? (I can guess what the implementation is that would lead to this...) Is this a bug?

jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 19, 2015

jreback added this to the 0.16.0 milestone Jan 19, 2015

jreback reviewed Jan 19, 2015
View reviewed changes

behzadnouri force-pushed the tst-unstack branch from ed9f430 to 22a81f8 Compare January 19, 2015 12:48

jreback reviewed Jan 19, 2015
View reviewed changes

behzadnouri force-pushed the tst-unstack branch from 22a81f8 to 422bc20 Compare January 19, 2015 13:54

jreback mentioned this pull request Jan 25, 2015

BUG/WIP: fix pivot with nan indexes #7481

Closed

2 tasks

tests for issues with unstack with nan

e513486

behzadnouri force-pushed the tst-unstack branch from 422bc20 to e513486 Compare January 26, 2015 00:10

jreback added a commit that referenced this pull request Jan 26, 2015

Merge pull request #9292 from behzadnouri/tst-unstack

224a66d

TST: tests for GH4862, GH7401, GH7403, GH7405

jreback merged commit 224a66d into pandas-dev:master Jan 26, 2015

behzadnouri deleted the tst-unstack branch January 26, 2015 01:53

behzadnouri mentioned this pull request Feb 3, 2015

BUG: NaN level values in stack() and unstack() #9406

Closed

seth-p mentioned this pull request Feb 4, 2015

ENH/API: DataFrame.stack() support for level=None, sequentially=True/False, and NaN level values. #9023

Closed

seth-p reviewed Feb 12, 2015
View reviewed changes

behzadnouri mentioned this pull request Feb 14, 2015

BUG: unstack with nulls & Timedelta/DateTime index #9491

Merged

seth-p mentioned this pull request Feb 16, 2015

BUG: DataFrame.unstack() with two levels containing NaN #9497

Closed

jreback mentioned this pull request Apr 8, 2015

BUG: unstack fails for mixed dtype subset #7462

Closed

Uh oh!

TST: tests for GH4862, GH7401, GH7403, GH7405 #9292

TST: tests for GH4862, GH7401, GH7403, GH7405 #9292

Uh oh!

Conversation

behzadnouri commented Jan 18, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

behzadnouri commented Jan 19, 2015

Uh oh!

jreback commented Jan 25, 2015

Uh oh!

behzadnouri commented Jan 26, 2015

Uh oh!

jreback commented Jan 26, 2015

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!