Bug behavior when checking element inclusion in non-unique MultiIndex #7724

Closed
tipanverella opened this issue Jul 10, 2014 · 17 comments · Fixed by #8526
Labels: Bug; Indexing (related to indexing on series/frames, not to indexes themselves); MultiIndex

Comments

@tipanverella

I am using pandas 0.14.0 (pandas.__version__ == '0.14.0').
This is my first bug report on GitHub, so please excuse any poor formatting.

This bug report might be related to, or duplicate:

  1. multi-index indexing for 3-level index behaving mysteriously. #2646
  2. get_loc() returns integer or slice or KeyError nondeterministic in multiindex data frame #6501

The behavior was explained to me by BrenBarn (http://stackoverflow.com/users/1427416/brenbarn) in an answer to the following StackOverflow question: http://stackoverflow.com/questions/24683023/having-issue-with-hierarchical-index-set-behavior/24684844#24684844

Essentially, the following behavior is not desirable:

import pandas as pd

print(pd.__version__)
# Both rows carry the same labels, so the index is non-unique.
WeirdIdx = pd.MultiIndex(levels=[[0], [1]], labels=[[0, 0], [0, 0]], names=[u'X', u'Y'])
print(WeirdIdx)
print((0, 0) in WeirdIdx)
print((1, 0) in WeirdIdx)
print((100, 0) in WeirdIdx)
print((100, 100) in WeirdIdx)

since it prints:
0.14.0
X Y
0 1
1
True
True
True
True

despite the fact that (100,0) and (100,100) are unambiguously not part of the index.
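For anyone hitting the same behavior, one possible workaround is to test membership against the index's own tuples rather than relying on `in`. This is only a sketch under stated assumptions: the index is rebuilt with `MultiIndex.from_tuples` so the snippet does not depend on the `labels` keyword used above, and the helper name `in_index` is hypothetical, not part of pandas.

```python
import pandas as pd

# Same two-row, non-unique index as in the report: both rows are (0, 1).
idx = pd.MultiIndex.from_tuples([(0, 1), (0, 1)], names=["X", "Y"])


def in_index(key, index):
    """Hypothetical helper: exact membership test against the index's tuples."""
    return key in set(index.tolist())


print(in_index((0, 1), idx))      # True
print(in_index((100, 0), idx))    # False
print(in_index((100, 100), idx))  # False
```

Building the set costs a pass over the index, so for repeated checks it is probably better to build it once and reuse it.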

@jreback
Contributor

jreback commented Jul 10, 2014

.get_loc((100,0)) returns a 0-len slice instead of raising here, so this is an untested edge case; will mark it as a bug (it's only on a non-unique MultiIndex). pretty odd to do this anyhow

why are you doing the is checking? why are you not simply indexing?
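To illustrate why an empty slice from `get_loc` would make every `in` check succeed, here is a toy model. It assumes containment is implemented as "`get_loc` did not raise `KeyError`", which is an assumption about the internals rather than something stated in this thread. `ToyIndex`, `get_loc_lenient`, `get_loc_strict` and `contains` are all hypothetical names; none of this is pandas source code.

```python
class ToyIndex:
    """Hypothetical stand-in for an index; not pandas code."""

    def __init__(self, keys):
        # Assumes equal keys are stored next to each other.
        self.keys = list(keys)

    def get_loc_lenient(self, key):
        # Mimics the reported behavior: a missing key yields an empty slice.
        positions = [i for i, k in enumerate(self.keys) if k == key]
        if not positions:
            return slice(0, 0)
        return slice(positions[0], positions[-1] + 1)

    def get_loc_strict(self, key):
        # Raises KeyError for a missing key instead of returning an empty slice.
        loc = self.get_loc_lenient(key)
        if loc.start == loc.stop:
            raise KeyError(key)
        return loc


def contains(index, key, get_loc):
    # Containment expressed as "get_loc did not raise KeyError".
    try:
        get_loc(index, key)
    except KeyError:
        return False
    return True


idx = ToyIndex([(0, 1), (0, 1)])
print(contains(idx, (100, 0), ToyIndex.get_loc_lenient))  # True (the bug)
print(contains(idx, (100, 0), ToyIndex.get_loc_strict))   # False
print(contains(idx, (0, 1), ToyIndex.get_loc_strict))     # True
```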

jreback added this to the 0.15.0 milestone Jul 10, 2014
@tipanverella
Author

I am not sure that I understand what you mean by "the is checking"?

@jreback
Contributor

jreback commented Jul 10, 2014

typo, should be in checking. e.g. why are you doing ('foo','bar') in multi_index at all?

@tipanverella
Author

oh, ok.
I have a dataframe that I am building with batches of data that I get hourly.
So I check whether or not this particular (date,hour) combination was previously procured before fetching it (the fetching is expensive!).

@jreback
Contributor

jreback commented Jul 10, 2014

and? so why are you checking if something is in the index?

@tipanverella
Author

the dataframe is indexed by (date, hour).

@jreback
Contributor

jreback commented Jul 10, 2014

why are you not simply using a DatetimeIndex. Then grouping/resampling as needed?

@jreback
Contributor

jreback commented Jul 10, 2014

checking like this is very inefficient

@tipanverella
Author

To be honest, I am not sure I know how to do what you have proposed.

@jreback
Contributor

jreback commented Jul 10, 2014

why don't you show some data, then what you are doing with it

@tipanverella
Author

So let's say I have a table like the following:

import datetime as dt
import pandas as pd

# Existing data, indexed by (day, hour); several rows share the same key.
frame = pd.DataFrame(
    [
        (dt.date(2014,1,1),13,'blue',1000),
        (dt.date(2014,1,1),13,'red',1001),
        (dt.date(2014,1,1),13,'green',1001),
        (dt.date(2014,2,1),17,'blue',2000),
        (dt.date(2014,3,1),18,'red',3000),
        (dt.date(2014,7,11),18,'greeb',4000),
        (dt.date(2014,7,11),19,'red',5000),
        (dt.date(2014,7,1),19,'blue',6000)
    ],
    columns = ['day','hour','color','trials']
).set_index(['day','hour'])

I might want to append and concat to it, with a process that essentially grabs a dataframe df:

df = pd.DataFrame(
    [
    (dt.date(2014,1,12),13,'blue',1000),
    (dt.date(2014,1,12),13,'red',1001),
    (dt.date(2014,1,12),13,'green',1001)
    ],
    columns = ['day','hour','color','trials']
).set_index(['day','hour'])

but the process that fetches df is expensive, so I need to check that I am not fetching something I already have.

@jreback
Contributor

jreback commented Jul 10, 2014

and so what with the given day/hour combo?

@tipanverella
Author

again, I am not sure I understand the question?

@jreback
Contributor

jreback commented Jul 10, 2014

You are much better off using datetimes (or Timestamps directly) rather than trying to use a combination of date/hour. It's much more efficient and more flexible.

In [9]: frame = pd.DataFrame(
   ...:     [
   ...:         (dt.datetime(2014,1,1,13),'blue',1000),
   ...:         (dt.datetime(2014,1,1,13),'red',1001),
   ...:         (dt.datetime(2014,1,1,13),'green',1001),
   ...:         (dt.datetime(2014,2,1,17),'blue',2000),
   ...:         (dt.datetime(2014,3,1,18),'red',3000),
   ...:         (dt.datetime(2014,7,11,18),'greeb',4000),
   ...:         (dt.datetime(2014,7,11,19),'red',5000),
   ...:         (dt.datetime(2014,7,1,19),'blue',6000)
   ...:     ],
   ...:     columns=['date','color','trials']
   ...: ).set_index('date')

In [10]: frame
Out[10]: 
                     color  trials
date                              
2014-01-01 13:00:00   blue    1000
2014-01-01 13:00:00    red    1001
2014-01-01 13:00:00  green    1001
2014-02-01 17:00:00   blue    2000
2014-03-01 18:00:00    red    3000
2014-07-11 18:00:00  greeb    4000
2014-07-11 19:00:00    red    5000
2014-07-01 19:00:00   blue    6000

In [11]: frame.index.hour
Out[11]: array([13, 13, 13, 17, 18, 18, 19, 19])

In [13]: frame[frame.index>'20140301']
Out[13]: 
                     color  trials
date                              
2014-03-01 18:00:00    red    3000
2014-07-11 18:00:00  greeb    4000
2014-07-11 19:00:00    red    5000
2014-07-01 19:00:00   blue    6000

In [14]: frame[frame.index>'20140301 18']
Out[14]: 
                     color  trials
date                              
2014-07-11 18:00:00  greeb    4000
2014-07-11 19:00:00    red    5000
2014-07-01 19:00:00   blue    6000

In [16]: frame[frame.index>dt.datetime(2014,3,1,18)]
Out[16]: 
                     color  trials
date                              
2014-07-11 18:00:00  greeb    4000
2014-07-11 19:00:00    red    5000
2014-07-01 19:00:00   blue    6000

This is the power of grouping: group by the day and color to produce a multi-indexed frame according to the aggregation scheme.

In [18]: frame.groupby([pd.Grouper(level='date',freq='D'),'color']).sum()
Out[18]: 
                  trials
date       color        
2014-01-01 blue     1000
           green    1001
           red      1001
2014-02-01 blue     2000
2014-03-01 red      3000
2014-07-01 blue     6000
2014-07-11 greeb    4000
           red      5000
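With a DatetimeIndex like the one above, the original "do I already have this hour?" question can be asked with a single Timestamp. The snippet below is a small sketch on a shortened copy of the data; the variable names `have` and `missing` are made up for illustration.

```python
import datetime as dt
import pandas as pd

frame = pd.DataFrame(
    [
        (dt.datetime(2014, 1, 1, 13), "blue", 1000),
        (dt.datetime(2014, 1, 1, 13), "red", 1001),
        (dt.datetime(2014, 2, 1, 17), "blue", 2000),
    ],
    columns=["date", "color", "trials"],
).set_index("date")

have = pd.Timestamp("2014-01-01 13:00")     # hour already in the frame
missing = pd.Timestamp("2014-01-12 13:00")  # hour not fetched yet

print(have in frame.index)     # True
print(missing in frame.index)  # False
```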

@tipanverella
Author

That should work fantastically! Thank you.

@BrenBarn

Just a note: Regardless of whether this particular use case can be done another way, I think we need to keep in mind what the API is. If the index containment API is that obj in index returns true if obj is in the index, and false otherwise, then it should return true if obj is in the index, and false otherwise, every single time. It doesn't matter why you're doing it. We need to keep the API consistent so that people can use it in a holistic way without digging through the docs and source code to unearth exceptions.

@jreback
Contributor

jreback commented Jul 11, 2014

cc @BrenBarn it's marked as a bug (and not hard to fix)
I just don't think this edge case was tested
