
get_loc() returns integer or slice or KeyError nondeterministic in multiindex data frame #6501


Closed
colinfang opened this issue Feb 27, 2014 · 7 comments

Comments

@colinfang

See the example below: if n is large, get_loc returns a slice; otherwise it returns an integer. The threshold for n varies from run to run (but is frequently around 25 or 50).
http://stackoverflow.com/questions/22067205/when-does-pandas-xs-drop-dimensions-and-how-can-i-force-it-to-not-to

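import numpy as np
import pandas as pd

n = 23
# n random rows plus one sentinel row whose key is (-1, -1, -1)
df = pd.DataFrame({'a': np.append(np.random.randint(0, 10, n), -1),
                   'b': np.append(np.random.randint(0, 10, n), -1),
                   'c': np.append(np.random.randint(0, 10, n), -1),
                   'value': np.random.randint(0, 100, n + 1)})

df.set_index(['a', 'b', 'c'], inplace=True)
# sortlevel was the API at the time of this report;
# later pandas releases replaced it with sort_index
df.sortlevel(inplace=True)

#display(df.xs((-1,-1,-1)))
df.index.get_loc((-1, -1, -1))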

The direct consequence is that xs returns a Series or a DataFrame (even if there is only one match) nondeterministically, depending on whether get_loc returns an integer or a slice.

What's more, if the key is not in the index, get_loc sometimes raises KeyError and sometimes returns slice(0, 0, None).

Run df.index.get_loc((-2,-1,-1)) several times and you will see. I suspect it depends on whether there are duplicate values in the MultiIndex.
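
A deterministic version of the example may make the suspected dependence on duplicates easier to check. The following is a minimal sketch that uses hand-picked index values instead of random ones; the values and variable names are illustrative assumptions, not part of the original report, and the exact return types may differ between pandas versions.

```python
import numpy as np
import pandas as pd

names = ['a', 'b', 'c']

# Every (a, b, c) combination occurs once, so the index is unique.
unique_idx = pd.MultiIndex.from_tuples(
    [(-1, -1, -1), (1, 2, 3), (4, 5, 6)], names=names)

# (1, 2, 3) occurs twice, so the index as a whole is no longer unique,
# even though (-1, -1, -1) itself still matches only one row.
dup_idx = pd.MultiIndex.from_tuples(
    [(-1, -1, -1), (1, 2, 3), (1, 2, 3)], names=names)

key = (-1, -1, -1)
loc_unique = unique_idx.get_loc(key)  # the "small n" case from the report
loc_dup = dup_idx.get_loc(key)        # the "large n" case from the report

# Show whether each index is unique and what kind of object came back.
print(unique_idx.is_unique, type(loc_unique).__name__)
print(dup_idx.is_unique, type(loc_dup).__name__)
```

If the return type tracks is_unique rather than n, that would support the duplicate-values explanation.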

@colinfang colinfang reopened this Feb 28, 2014
@jreback
Contributor

jreback commented Feb 28, 2014

This is defined behavior and depends heavily on the index type and on what you are doing with it. For all intents and purposes this is an internal method.

what are you using get_loc for?

@colinfang
Author

I use get_loc mainly to speed things up.
df[df.index.get_loc((2, 1, 7))] is faster than df.xs((2,1,7))

But the problem here is that xs returns a Series or a DataFrame (even if there is only one match) nondeterministically, and get_loc nondeterministically either raises an exception or does not.

@jreback
Contributor

jreback commented Feb 28, 2014

OK, you have to deal with the variable output then. I am not sure what you are trying to accomplish; what are you doing after you index?
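
One way to deal with the variable output is to normalize whatever get_loc returns into a slice before using it. The helper below is a hypothetical sketch (loc_to_slice is not a pandas function); it handles only the integer and slice cases discussed in this thread and is demonstrated on a plain list so that it runs without pandas.

```python
import operator


def loc_to_slice(loc):
    """Return a slice covering the same positions as an int-or-slice `loc`.

    Anything that is neither a slice nor an integer (for example a boolean
    mask) raises TypeError instead of being silently misinterpreted.
    """
    if isinstance(loc, slice):
        return loc
    pos = operator.index(loc)  # accepts Python ints and NumPy integer scalars
    # slice(-1, 0) would be empty, so the last position needs an open end.
    stop = pos + 1 if pos != -1 else None
    return slice(pos, stop)


rows = ['row0', 'row1', 'row2']
print(rows[loc_to_slice(1)])            # single position, still a list
print(rows[loc_to_slice(slice(0, 2))])  # slice passes through unchanged
```

With a DataFrame, passing the normalized slice to df.iloc would give a DataFrame in both cases, so downstream code does not need to branch on Series versus DataFrame.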

@colinfang
Author

So is it also defined behavior that xs raises KeyError while ix raises IndexError when the key is not found? I can live with it, but I feel the documentation needs to be improved so that I can catch these tricky edge cases earlier. This is what I am doing, if it helps: http://stackoverflow.com/questions/22046886/approach-to-speed-up-pandas-multilevel-index-selection

@jreback
Contributor

jreback commented Feb 28, 2014

Read this: http://pandas.pydata.org/pandas-docs/stable/indexing.html#fallback-indexing

You should not use ix, since you have an integer index; use loc instead, which will raise KeyError.

I wrote the answer to the SO question.

You still haven't shown what you are actually going to do.

Trying to speed up indexing is not the right thing to do.

Instead, you should groupby or iterate, depending on what you are actually trying to accomplish.
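
The two suggestions above can be sketched as follows: a label-based lookup through loc that treats KeyError as "not found", and a single groupby pass over the index levels in place of many individual lookups. The data, the helper name rows_or_none, and the choice of sum as the aggregation are illustrative assumptions, since the thread never states what the downstream computation is.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples(
    [(-1, -1, -1), (1, 2, 3), (1, 2, 3)], names=['a', 'b', 'c'])
df = pd.DataFrame({'value': [10, 20, 30]}, index=idx)


def rows_or_none(frame, key):
    """Label-based row lookup that returns None when the key is absent."""
    try:
        return frame.loc[key, :]
    except KeyError:
        return None


# A key that is not in the index is reported through KeyError.
missing = rows_or_none(df, (-2, -1, -1))
print(missing)

# One pass over all keys instead of a separate lookup per key.
totals = df.groupby(level=['a', 'b', 'c'])['value'].sum()
print(totals)
```

Whether groupby or explicit iteration fits better depends on what is done with each group afterwards.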

@jreback
Contributor

jreback commented Feb 28, 2014

see #6507 which is the only bug here

@hayd
Contributor

hayd commented Mar 1, 2014

Presumably the reason it's 25 to 50 is the probability of having multiple rows with the same (a, b, c) values. Examples that use random numbers make for tricky reproduction (it is usually best to fix the seed), though they are good for fuzz testing.
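
The probability argument can be made concrete with a birthday-problem style calculation. This is a rough sketch that assumes each of a, b and c is drawn independently and uniformly from 10 values, as in the original example, and it ignores the extra (-1, -1, -1) row; the function name is made up for illustration.

```python
def prob_duplicate(n_rows, n_values=10, n_levels=3):
    """Probability that n_rows uniformly random keys contain a repeat."""
    n_keys = n_values ** n_levels  # 10 * 10 * 10 possible (a, b, c) tuples
    p_all_distinct = 1.0
    for i in range(n_rows):
        # The (i + 1)-th key must avoid the i distinct keys drawn so far.
        p_all_distinct *= max(0.0, 1.0 - i / n_keys)
    return 1.0 - p_all_distinct


for n in (10, 23, 25, 50):
    print(n, round(prob_duplicate(n), 3))
```

Under these assumptions the chance of at least one repeated key is roughly one in four at n = 25 and about 70% at n = 50, which would be consistent with the threshold appearing to move between runs.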

@colinfang Since you're looking deep into the codebase, I recommend going a little further and exploring/tweaking the source while you do it - (If you fix this/something else, PRs are very welcome!) :D
