Skip to content

TST: tests for GH4862, GH7401, GH7403, GH7405 #9292

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 1 commit into from
Jan 26, 2015

Conversation

behzadnouri
Copy link
Contributor

closes #4862
closes #7401
closes #7403
closes #7405

minor code change to #9061; otherwise only tests. the code change is to avoid #9170 issue.

@jreback jreback added Bug Dtype Conversions Unexpected or buggy dtype conversions MultiIndex Reshaping Concat, Merge/Join, Stack/Unstack, Explode labels Jan 19, 2015
@jreback jreback added this to the 0.16.0 milestone Jan 19, 2015
return lev._shallow_copy(vals)
try:
return lev._shallow_copy(vals.astype(lev.dtype, copy=False))
except ValueError:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this fail on? why don't you always just do Index(vals.astype(.......), **lev._get_attributes_dict())? e.g. pass the new values to Index and let it figure out the correct index type (with the exsting attributes). This is a very common idiom.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

few lines above it is inserting _get_na_value(lev.dtype.type) into the lev.values.astype('object'). if lev was, say, an Int64Index then vals.astype(lev.dtype) will fail with ValueError.

I think lev._shallow_copy is more efficient than Index(....). so i try _shallow_copy first.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think these are really close in efficiency. But this is obfusciating the code. Either put this try: except: in another function, or just call the index constructor. We don't need another way of doing things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_shallow_copy does not do any type inference at all; Index does inspect types quite a lot, in particular this line is very inefficient. if i do Index(vals.astype(lev.dtype, copy=False), ... or Index(vals, dtype=lev.dtype) it would still need a try ... catch block because of the possible ValueError

try:
return lev._shallow_copy(vals.astype(lev.dtype, copy=False))
except ValueError: # when vals cannot be type casted to lev.dtype
return Index(vals, **lev._get_attributes_dict())

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get why you are doing the try: except, this ONLY needs to be around the astype though. Make this a separate method

@behzadnouri
Copy link
Contributor Author

@jreback took _shallow_copy/Index to a separate functions, and removed check_dtype=False.

@jreback
Copy link
Contributor

jreback commented Jan 25, 2015

@behzadnouri ok this looks good. and tests ok on windows.

can you add a release note (you can put all the issues on the same one I think). ping when pushed.

@jreback jreback mentioned this pull request Jan 25, 2015
2 tasks
@behzadnouri
Copy link
Contributor Author

@jreback done!

jreback added a commit that referenced this pull request Jan 26, 2015
TST: tests for GH4862, GH7401, GH7403, GH7405
@jreback jreback merged commit 224a66d into pandas-dev:master Jan 26, 2015
@jreback
Copy link
Contributor

jreback commented Jan 26, 2015

thanks! a 4-some! excellent

vals = [[3, 0, 1, 2, nan, nan, nan, nan],
[nan, nan, nan, nan, 4, 5, 6, 7]]
vals = list(map(list, zip(*vals)))
idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It may not be obvious (at least it wasn't to me), but idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B') results in Float64Index([nan, 0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0], dtype='float64') -- i.e. the integer levels (i.e. what the user sees) get converted to floats. Thus you are testing that when unstacking a MultiIndex that leaves a single level with integer labels and NaNs, the result is a Float64Index. This is pretty much unavoidable -- unless you explicitly cast the integers to objects rather than float64 -- as MultIndex doesn't seem to support a single level. (Note that an integer MultiIndex level can have missing values, since they are represented by -1 in the labels and not NaN in the labels.) I think either (a) MultIndex should support a single level (and not convert to an Index), or (b) the unstacking code should convert such an integer-with-NaN index to object rather than float64. [I encountered this issue while trying to make the stack() code handle NaN labels propertly.]

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

unless you explicitly cast the integers to objects rather than float64

I actually explicitly cast to objects not float64. the index comes out as float because of type inference in Index.__new__, not in unstacking code.

>>> i
array([nan, 1, 2], dtype=object)
>>> type(i[1])
<class 'int'>
>>> Index(i)
Float64Index([nan, 1.0, 2.0], dtype='float64')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you have to specify a dtype=object when constructing an index to not have it coerce (if u want it to explicitly preserve it)
though can't think of a reason you would ever actually want to do this

if u really mean integer then we use -1 as a sentinel in the indexers and they are left as ints

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behzadnouri, yes, precisely, here the conversion of ints to foats is done by Index.__new__. However since this is being compared to the results of left = df.set_index(['A', 'B']).unstack(0), clearly the unstack() code is also doing such a conversion.

@jreback, understood. Two points:

First, other than specifying dtype='object', there doesn't seem to be a way to construct a (single-level) index of _int_s with NaNs. In a MultiIndex with more than one level, a level can be int with NaNs, but a MultiIndex with a single level gets converted automatically to an Index, and in the case of an int level with NaNs, a Float64Index.

In [27]: pd.Index([nan, 1, 2])  # note automatic conversion to Float64Index
Out[27]: Float64Index([nan, 1.0, 2.0], dtype='float64')

In [29]: pd.MultiIndex.from_tuples([(nan,), (1,), (2,)])  # note automatic conversion to Float64Index
Out[29]: Float64Index([nan, 1.0, 2.0], dtype='float64')

In [30]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')])
Out[30]:
MultiIndex(levels=[[1, 2], ['a']],
           labels=[[-1, 0, 1], [0, 0, 0]])

In [31]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')]).levels[0]
Out[31]: Int64Index([1, 2], dtype='int64')  # note first level is Int64Index

Second, my point is that in the course of stacking and/or unstacking (and perhaps other operations), a single level in a multi-level MultiIndex can be "promoted" to be an Index unto itself, and in that case seems (at least currently in the case of stack() and unstack()) to be changed from ints to floats, which I don't think is a good thing to do to index values.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@behzadnouri, just to clarify, I don't doubt that within the unstacking code the conversion from ints to floats happens in Index.__init__(), but that doesn't change the fact that in such cases unstack() has the effect of converting ints to floats.

The reason I think this is a potential problem (which predates you, I'm sure, and is also present in stack()), is that if someone then wants to compare the values in the Index of the stacked/unstacked object to the values in the corresponding level of MultiIndex of the original DataFrame, they won't match.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@seth-p I think the unstack/stack code need to explicity pass the dtype into when the Index is constructed.

Currently you cannot have nans in an Int64Index, so unless you mark it as object it will by definition be coerced. A multi-index can represent this, but single level multi-indexes are not supported as that is just added complexity.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The question is what you want the behavior to be. Suppose df.index is a 2-level MultiIndex, the first level of which is (effectively) an Int64Index with missing values (i.e. -1 in labels), and that you then unstack the second level (i.e. df.unstack(1)). What should the result index be? There are two options: (a) Float64Index, and (b) Index(..., dtype='object'). With this PR, the result is (a). Personally I think (b) is preferable.

In : df = pd.DataFrame([[11,22],[33,44]], index=pd.MultiIndex.from_tuples([(1, 'a'), (None, 'b')], names=['ints', 'letters']))
In : df.index.levels[0]
Out: Int64Index([1], dtype='int64')
In : df.index.labels[0]
Out: FrozenNDArray([0, -1], dtype='int8')

In : result = df.unstack(1)
In : result.index
Out: Float64Index([nan, 1.0], dtype='float64')

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

On a related note, to see in general how messed up things are when have NaNs in indices, consider the following:

In [698]: idx = MultiIndex.from_tuples([(1, 'a'), (np.nan, 'b')])

In [699]: idx.levels[0]
Out[699]: Int64Index([1], dtype='int64')

In [700]: idx.get_values()
Out[700]: array([(1.0, 'a'), (nan, 'b')], dtype=object)

In [701]: idx[:1].get_values()
Out[701]: array([(1, 'a')], dtype=object)

Why does idx.get_values() convert the int 1 to a float 1.0 just because there's a NaN somewhere else? (I can guess what the implementation is that would lead to this...) Is this a bug?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment