Skip to content

BUG/API: should MultiIndex with leading integer level fallback to positional? #33355

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
jbrockmendel opened this issue Apr 7, 2020 · 10 comments · Fixed by #33404
Closed

BUG/API: should MultiIndex with leading integer level fallback to positional? #33355

jbrockmendel opened this issue Apr 7, 2020 · 10 comments · Fixed by #33404
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Milestone

Comments

@jbrockmendel
Copy link
Member

jbrockmendel commented Apr 7, 2020


idx = pd.Index(range(3))
dti = pd.date_range("2016-01-01", periods=2)
mi = pd.MultiIndex.from_product([idx, dti])

ser = pd.Series(range(len(mi))[::-1], index=mi)

>>> ser
0  2016-01-01    5
   2016-01-02    4
1  2016-01-01    3
   2016-01-02    2
2  2016-01-01    1
   2016-01-02    0
dtype: int64
>>> ser[3]
2

My intuition is that ser[3] should behave like indexing on just the first level of the MultIIndex:

ser2 = ser.droplevel(1)

>>> ser2
0    5
0    4
1    3
1    2
2    1
2    0
dtype: int64
>>> ser2[3]
KeyError: 3

Should we avoid the fallback-to-positional behavior in this case?

@jbrockmendel jbrockmendel added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 7, 2020
@jorisvandenbossche
Copy link
Member

Didn't look in detail, but just from the title this might be related to / duplicate of #21677 (and a recent duplicate as #32188)

@jbrockmendel
Copy link
Member Author

Unless im misunderstanding those two issues, this is different. this is about the behavior of Series.__getitem__ with a MultiIndex and integer key

@TomAugspurger
Copy link
Contributor

They're two different issues.

My intuition is that ser[3] should behave like indexing on just the first level of the MultIIndex:

That's my intuition as well. I would call this a bug. IIRC we had a similar issue with incorrect fallback indexing and decided to deprecate the behavior (rather than just changing it). Does that sound correct?

@TomAugspurger TomAugspurger added Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 8, 2020
@jorisvandenbossche
Copy link
Member

Yeah, as I said, it was only from reading the title where I saw "MultiIndex", "level", and "positional" ;-)

So __getitem__ with integers is positional except for numerical indices, is that correct? (and there should normally be no fallback, I think. Eg for int or float index, it is always label based, and does not fall back to positional if the key is not present but would be within range positionally)

So the question then is, do we see MultiIndex as a numerical index or not .. ?

@jorisvandenbossche
Copy link
Member

Interpreting a single label as indexing the first level sounds logical, but is also not fully explicit.

So I was wondering what it would give if you do this explicitly (index the first level with 0, index the second level with selecting all). But this actually is failing:

In [65]: ser[pd.IndexSlice[0, :]]   
...

~/scipy/pandas/pandas/core/series.py in _get_values_tuple(self, key)
    961 
    962         # If key is contained, would have returned by now
--> 963         indexer, new_index = self.index.get_loc_level(key)
    964         return self._constructor(self._values[indexer], index=new_index).__finalize__(
    965             self,

~/scipy/pandas/pandas/core/indexes/multi.py in get_loc_level(self, key, level, drop_level)
   2865                             continue
   2866                         else:
-> 2867                             raise TypeError(key)
   2868 
   2869                     if indexer is None:

TypeError: (0, slice(None, None, None))

(or am I using IndexSlice wrongly?)

@jbrockmendel
Copy link
Member Author

(or am I using IndexSlice wrongly?)

I'm not sure I've ever seen IndexSlice used with Series.__getitem__; I usually think of it as being a .loc thing. Series.__getitem__

The relevant logic for how we handle integers in Series.__getitem__ is in Index.get_value:

    def get_value(self, series, key):
        [...]
        try:
            loc = self.get_loc(key)
        except KeyError:
            if not self._should_fallback_to_positional():
                raise
            elif is_integer(key):
                # If the Index cannot hold integer, then this is unambiguously
                # a locational lookup.
                loc = key
            else:
                raise

        return self._get_values_for_loc(series, loc, key)

    def _should_fallback_to_positional(self) -> bool:
        """
        If an integer key is not found, should we fall back to positional indexing?
        """
        if len(self) > 0 and (self.holds_integer() or self.is_boolean()):
            return False
        return True

But MultiIndex.get_value does not have that same _should_fallback_to_positional check. My thought ATM is that MultiIndex._should_fallback_to_positional should look like:

    def _should_fallback_to_positional(self) -> bool:
        return self.levels[0]._should_fallback_to_positional()

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 8, 2020

I'm not sure I've ever seen IndexSlice used with Series.__getitem__; I usually think of it as being a .loc thing.

With loc it indeed works:

In [71]: ser.loc[pd.IndexSlice[1, :]] 
Out[71]: 
1  2016-01-01    3
   2016-01-02    2
dtype: int64

But I would say that in this case getitem and loc should work similarly.

@jorisvandenbossche
Copy link
Member

jorisvandenbossche commented Apr 8, 2020

Some more related cases to consider

With the example series from above, but now a list indexer:

In [98]: ser 
Out[98]: 
0  2016-01-01    5
   2016-01-02    4
1  2016-01-01    3
   2016-01-02    2
2  2016-01-01    1
   2016-01-02    0
dtype: int64

In [99]: ser[[0, 1]] 
Out[99]: 
0  2016-01-01    5
   2016-01-02    4
dtype: int64

This is also doing positional indexing. So is this also a bug?

With a FloatIndex as first level, it seems even more broken:

In [111]: idx = pd.Index(np.arange(4., 7)) 
     ...: dti = pd.date_range("2016-01-01", periods=2) 
     ...: mi = pd.MultiIndex.from_product([idx, dti]) 
     ...:  
     ...: ser2 = pd.Series(range(len(mi))[::-1], index=mi)   

In [112]: ser2  
Out[112]: 
4.0  2016-01-01    5
     2016-01-02    4
5.0  2016-01-01    3
     2016-01-02    2
6.0  2016-01-01    1
     2016-01-02    0
dtype: int64

# positional fallback
In [113]: ser2[0]                                                                                                                                                                              
Out[113]: 5

# label-based ...
In [114]: ser2[5]                                                                                                                                                                              
Out[114]: 
2016-01-01    3
2016-01-02    2
dtype: int64

(while a single-level FloatIndex is always label-based with integers)

@jreback jreback added this to the 1.1 milestone Apr 8, 2020
@jbrockmendel
Copy link
Member Author

@jorisvandenbossche heads up: i edited your last comment to define ser2 to disambiguate from the OP example.

To make sure we're on the same page: my read of that is that ser2[0] should raise KeyError and ser2[5] is correct. Does that match what you expect? If so, this is fixed by #33404 (need to add a test for this new case)

The list-like case goes through a separate path, ill have to track it down.

@jbrockmendel
Copy link
Member Author

tentative look at the list-of-ints case: in Series._get_with we end up doing

        if isinstance(key, Index):
            key_type = key.inferred_type
        else:
            key_type = lib.infer_dtype(key, skipna=False)

        # Note: The key_type == "boolean" case should be caught by the
        #  com.is_bool_indexer check in __getitem__
        if key_type == "integer":
            # We need to decide whether to treat this as a positional indexer
            #  (i.e. self.iloc) or label-based (i.e. self.loc)
            if self.index.is_integer() or self.index.is_floating():
                return self.loc[key]
            elif isinstance(self.index, IntervalIndex):
                return self.loc[key]
            else:
                return self.iloc[key]

we should also have something equivalent to

if isinstance(self.index, MultiIndex) and (self.index.levels[0].is_integer() or self.index.levels[0].is_floating()):
    return self.loc[key]

I think the is_integer and is_floating checks there should just reduce to self.index._should_fallback_to_positional(), which would nicely encompass the MultiIndex case (not sure about IntervalIndex ATM). I'm starting a branch to address this glom of code.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug Indexing Related to indexing on series/frames, not to indexes themselves MultiIndex
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants