-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
TST: tests for GH4862, GH7401, GH7403, GH7405 #9292
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
return lev._shallow_copy(vals) | ||
try: | ||
return lev._shallow_copy(vals.astype(lev.dtype, copy=False)) | ||
except ValueError: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what does this fail on? why don't you always just do Index(vals.astype(.......), **lev._get_attributes_dict())
? e.g. pass the new values to Index
and let it figure out the correct index type (with the exsting attributes). This is a very common idiom.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
few lines above it is inserting _get_na_value(lev.dtype.type)
into the lev.values.astype('object')
. if lev
was, say, an Int64Index
then vals.astype(lev.dtype)
will fail with ValueError
.
I think lev._shallow_copy
is more efficient than Index(....)
. so i try _shallow_copy
first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these are really close in efficiency. But this is obfusciating the code. Either put this try: except: in another function, or just call the index constructor. We don't need another way of doing things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
_shallow_copy
does not do any type inference at all; Index
does inspect types quite a lot, in particular this line is very inefficient. if i do Index(vals.astype(lev.dtype, copy=False), ...
or Index(vals, dtype=lev.dtype)
it would still need a try ... catch
block because of the possible ValueError
ed9f430
to
22a81f8
Compare
try: | ||
return lev._shallow_copy(vals.astype(lev.dtype, copy=False)) | ||
except ValueError: # when vals cannot be type casted to lev.dtype | ||
return Index(vals, **lev._get_attributes_dict()) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I get why you are doing the try: except, this ONLY needs to be around the astype though. Make this a separate method
22a81f8
to
422bc20
Compare
@jreback took |
@behzadnouri ok this looks good. and tests ok on windows. can you add a release note (you can put all the issues on the same one I think). ping when pushed. |
422bc20
to
e513486
Compare
@jreback done! |
TST: tests for GH4862, GH7401, GH7403, GH7405
thanks! a 4-some! excellent |
vals = [[3, 0, 1, 2, nan, nan, nan, nan], | ||
[nan, nan, nan, nan, 4, 5, 6, 7]] | ||
vals = list(map(list, zip(*vals))) | ||
idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It may not be obvious (at least it wasn't to me), but idx = Index([nan, 0, 1, 2, 4, 5, 6, 7], name='B')
results in Float64Index([nan, 0.0, 1.0, 2.0, 4.0, 5.0, 6.0, 7.0], dtype='float64')
-- i.e. the integer levels (i.e. what the user sees) get converted to floats. Thus you are testing that when unstacking a MultiIndex that leaves a single level with integer labels and NaN
s, the result is a Float64Index
. This is pretty much unavoidable -- unless you explicitly cast the integers to object
s rather than float64
-- as MultIndex doesn't seem to support a single level. (Note that an integer MultiIndex level can have missing values, since they are represented by -1
in the labels and not NaN
in the labels.) I think either (a) MultIndex should support a single level (and not convert to an Index), or (b) the unstacking code should convert such an integer-with-NaN index to object
rather than float64
. [I encountered this issue while trying to make the stack()
code handle NaN
labels propertly.]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unless you explicitly cast the integers to objects rather than float64
I actually explicitly cast to objects not float64. the index comes out as float because of type inference in Index.__new__
, not in unstacking code.
>>> i
array([nan, 1, 2], dtype=object)
>>> type(i[1])
<class 'int'>
>>> Index(i)
Float64Index([nan, 1.0, 2.0], dtype='float64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you have to specify a dtype=object when constructing an index to not have it coerce (if u want it to explicitly preserve it)
though can't think of a reason you would ever actually want to do this
if u really mean integer then we use -1 as a sentinel in the indexers and they are left as ints
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behzadnouri, yes, precisely, here the conversion of ints to foats is done by Index.__new__
. However since this is being compared to the results of left = df.set_index(['A', 'B']).unstack(0)
, clearly the unstack()
code is also doing such a conversion.
@jreback, understood. Two points:
First, other than specifying dtype='object'
, there doesn't seem to be a way to construct a (single-level) index of _int_s with NaN
s. In a MultiIndex
with more than one level, a level can be int
with NaN
s, but a MultiIndex
with a single level gets converted automatically to an Index
, and in the case of an int
level with NaN
s, a Float64Index
.
In [27]: pd.Index([nan, 1, 2]) # note automatic conversion to Float64Index
Out[27]: Float64Index([nan, 1.0, 2.0], dtype='float64')
In [29]: pd.MultiIndex.from_tuples([(nan,), (1,), (2,)]) # note automatic conversion to Float64Index
Out[29]: Float64Index([nan, 1.0, 2.0], dtype='float64')
In [30]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')])
Out[30]:
MultiIndex(levels=[[1, 2], ['a']],
labels=[[-1, 0, 1], [0, 0, 0]])
In [31]: pd.MultiIndex.from_tuples([(nan, 'a'), (1, 'a'), (2, 'a')]).levels[0]
Out[31]: Int64Index([1, 2], dtype='int64') # note first level is Int64Index
Second, my point is that in the course of stacking and/or unstacking (and perhaps other operations), a single level in a multi-level MultiIndex
can be "promoted" to be an Index
unto itself, and in that case seems (at least currently in the case of stack()
and unstack()
) to be changed from ints to floats, which I don't think is a good thing to do to index values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@behzadnouri, just to clarify, I don't doubt that within the unstacking code the conversion from ints to floats happens in Index.__init__()
, but that doesn't change the fact that in such cases unstack()
has the effect of converting ints to floats.
The reason I think this is a potential problem (which predates you, I'm sure, and is also present in stack()
), is that if someone then wants to compare the values in the Index
of the stacked/unstacked object to the values in the corresponding level of MultiIndex
of the original DataFrame
, they won't match.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@seth-p I think the unstack/stack code need to explicity pass the dtype into when the Index is constructed.
Currently you cannot have nans in an Int64Index, so unless you mark it as object it will by definition be coerced. A multi-index can represent this, but single level multi-indexes are not supported as that is just added complexity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The question is what you want the behavior to be. Suppose df.index
is a 2-level MultiIndex
, the first level of which is (effectively) an Int64Index
with missing values (i.e. -1
in labels
), and that you then unstack the second level (i.e. df.unstack(1)
). What should the result index
be? There are two options: (a) Float64Index
, and (b) Index(..., dtype='object')
. With this PR, the result is (a). Personally I think (b) is preferable.
In : df = pd.DataFrame([[11,22],[33,44]], index=pd.MultiIndex.from_tuples([(1, 'a'), (None, 'b')], names=['ints', 'letters']))
In : df.index.levels[0]
Out: Int64Index([1], dtype='int64')
In : df.index.labels[0]
Out: FrozenNDArray([0, -1], dtype='int8')
In : result = df.unstack(1)
In : result.index
Out: Float64Index([nan, 1.0], dtype='float64')
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On a related note, to see in general how messed up things are when have NaN
s in indices, consider the following:
In [698]: idx = MultiIndex.from_tuples([(1, 'a'), (np.nan, 'b')])
In [699]: idx.levels[0]
Out[699]: Int64Index([1], dtype='int64')
In [700]: idx.get_values()
Out[700]: array([(1.0, 'a'), (nan, 'b')], dtype=object)
In [701]: idx[:1].get_values()
Out[701]: array([(1, 'a')], dtype=object)
Why does idx.get_values()
convert the int 1
to a float 1.0
just because there's a NaN
somewhere else? (I can guess what the implementation is that would lead to this...) Is this a bug?
closes #4862
closes #7401
closes #7403
closes #7405
minor code change to #9061; otherwise only tests. the code change is to avoid #9170 issue.