ENH: df.grep(col,pat) and df.dselect(col,"expr") #2460


Closed
ghost opened this issue Dec 9, 2012 · 39 comments
Labels
Ideas Long-Term Enhancement Discussions

Comments

@ghost commented Dec 9, 2012

wes, how would you feel about adding something like the following as
a DataFrame method? Especially with method chaining I would find this
useful. Will add handling for datatypes and so on.

In [32]: import re
    ...: from pandas.util.testing import makeCustomDataframe as mkdf
In [33]: def grep(df,col,pat):
    ...:     # keep rows whose value in `col` matches the regex `pat`
    ...:     vals=df[col]
    ...:     return df.ix[[bool(re.search(pat,unicode(x))) for x in vals]]
In [34]: df=mkdf(20,2)
    ...: print df
C0       C_l0_g0 C_l0_g1  \
R0                         
R_l0_g0     R0C0    R0C1   
R_l0_g1     R1C0    R1C1   
R_l0_g10    R2C0    R2C1   
R_l0_g11    R3C0    R3C1   
R_l0_g12    R4C0    R4C1   
R_l0_g13    R5C0    R5C1   
R_l0_g14    R6C0    R6C1   
R_l0_g15    R7C0    R7C1   
R_l0_g16    R8C0    R8C1   
R_l0_g17    R9C0    R9C1   
R_l0_g18   R10C0   R10C1   
R_l0_g19   R11C0   R11C1   
R_l0_g2    R12C0   R12C1   
R_l0_g20   R13C0   R13C1   
R_l0_g3    R14C0   R14C1   
R_l0_g4    R15C0   R15C1   
R_l0_g5    R16C0   R16C1   
R_l0_g6    R17C0   R17C1   
R_l0_g7    R18C0   R18C1   
R_l0_g8    R19C0   R19C1   

In [35]: grep(df,"C_l0_g0","R.*6")
Out[35]: 
C0       C_l0_g0 C_l0_g1  \
R0                         
R_l0_g14    R6C0    R6C1   
R_l0_g5    R16C0   R16C1  

@ghost (Author) commented Dec 9, 2012

also

def dselect(self, col, f):
    # treat the string as an implicit lambda body, e.g. "x > 50"
    f = eval("lambda x: " + f)
    return self.ix[[f(x) for x in self[col]]]

so this is possible:

In [13]: df=mkdf(20,2)
    ...: s=pd.Series(np.random.randint(1,100,len(df.index)),index=df.index,name="nums")
    ...: df=df.join(s)
    ...: print(df)
    ...: df.dselect("nums","x>50 or x<25").grep("C_l0_g0","[2|4|8]")
         C_l0_g0 C_l0_g1  nums  \
R0                               
R_l0_g0     R0C0    R0C1    55   
R_l0_g1     R1C0    R1C1    37   
R_l0_g10    R2C0    R2C1    61   
R_l0_g11    R3C0    R3C1    62   
R_l0_g12    R4C0    R4C1    75   
R_l0_g13    R5C0    R5C1    93   
R_l0_g14    R6C0    R6C1    31   
R_l0_g15    R7C0    R7C1    73   
R_l0_g16    R8C0    R8C1     6   
R_l0_g17    R9C0    R9C1    44   
R_l0_g18   R10C0   R10C1    97   
R_l0_g19   R11C0   R11C1    64   
R_l0_g2    R12C0   R12C1    39   
R_l0_g20   R13C0   R13C1    11   
R_l0_g3    R14C0   R14C1     5   
R_l0_g4    R15C0   R15C1    28   
R_l0_g5    R16C0   R16C1    25   
R_l0_g6    R17C0   R17C1    63   
R_l0_g7    R18C0   R18C1    21   
R_l0_g8    R19C0   R19C1    59   
Out[13]: 
         C_l0_g0 C_l0_g1  nums  \
R0                               
R_l0_g10    R2C0    R2C1    61   
R_l0_g12    R4C0    R4C1    75   
R_l0_g16    R8C0    R8C1     6   
R_l0_g3    R14C0   R14C1     5   
R_l0_g7    R18C0   R18C1    21  

not enough eval in pandas IMHO.

@ghost (Author) commented Dec 10, 2012

import re

import pandas as pd

def grep(self, col_or_series, pat):
    # accept either a column name or a column pulled off the frame
    if isinstance(col_or_series, pd.Series):
        col_or_series = col_or_series.name
    series = self[col_or_series]
    # keep rows whose value (as text) matches the regex
    return self.ix[[bool(re.search(pat, unicode(x))) for x in series]]

def dselect(self, col_or_series, f=None):
    if isinstance(col_or_series, pd.Series):
        col_or_series = col_or_series.name
    series = self[col_or_series]
    # a string is treated as an implicit lambda body, e.g. "x <= 5"
    if isinstance(f, basestring):
        f = eval("lambda x: " + f)
    return self.ix[[f(x) for x in series]]

pd.DataFrame.grep = grep
pd.DataFrame.dselect = dselect

dselect now accepts lambdas as well as eval snippets,
and you can pass in a DataFrame column with tab completion.
Not efficient, but handy:

df.grep(df.colA,u'st\.[\s$]').dselect(df.colB,"x<=5").dselect(df.colC,lambda x: x[0] == myfilterval)

@ghost (Author) commented Dec 11, 2012

@wesm, if this has your blessing I'll round it out into a PR.

@jreback (Contributor) commented Dec 11, 2012

for selecting columns, does grep overlap with select/filter?
What are the use cases for:

  • filter
  • select
  • grep

@ghost (Author) commented Dec 11, 2012

AFAICT select/filter operate exclusively on index labels rather than data.
dselect and grep (maybe dgrep would be a better name) operate on data rather than indices.

I have these monkey-patched onto pd.DataFrame at load time and find them very useful.

An example use case:
a dataset from which you wish to get only rows that match a certain regexp in a column called "names".
Right now the best I'm aware of would be something like:

df.set_index("names", drop=False, append=True).select(lambda x: bool(re.search("jerry.+", x[-1]))).reset_index(-1, drop=True)

edit: cleaned up the example. The bool() around re.search shouldn't be necessary, but currently is.

IMO, this is very intuitive:

df.grep("names","jerry.+")

and with some fleshing out (MultiIndex columns) it could be a useful addition to core pandas.

@changhiskhan (Contributor) commented

I'd prefer it if we didn't call it "grep". A more intuitive name like "search" or something would be better IMO. While you and I have no trouble understanding what "grepping" means, that's not the case for a lot of users.
And rather than having dselect, dgrep, or dsearch, whether to search the index, the data, or both should just be a keyword; one possible shape is sketched below.
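
A keyword-driven version might look something like this (purely illustrative: the name `search`, the `where` argument, and its values are assumptions, not an agreed API):

import re

def search(df, pat, col=None, where="data"):
    # where="data": regex-match the values in column `col`
    # where="index": regex-match the row labels instead
    if where == "index":
        mask = [bool(re.search(pat, str(x))) for x in df.index]
    else:
        mask = [bool(re.search(pat, str(x))) for x in df[col]]
    return df[mask]

# usage: search(df, "jerry.+", col="names")
#        search(df, "^R_l0_g1", where="index")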

@jreback (Contributor) commented Dec 11, 2012

In addition, I think it makes sense to deprecate select (or make it a convenience method that uses filter) and promote usage of filter instead (which can do anything select can anyhow). My 2c.

@ghost (Author) commented Dec 12, 2012

no strong opinion on naming here. whatever works.

rolling this into existing functions is a bit of a problem due to the existing
signatures:
select(crit, axis)
df.filter(items=None, like=None, regex=None)

  • filter would need an extra axis arg, in which case it makes sense to filter on row
    labels as well as data. So axis would either need an additional "data" axis, which
    is incompatible with the rest of the API, or an additional argument, which would really
    defeat the purpose of making the API more concise.
  • Also, if an items argument were to be added to select, it would have to be
    appended at the end so as not to break existing code:
    select(crit, axis, items)
    df.filter(items=None, like=None, regex=None, axis=None)
    and so the "items" arg would appear in different locations in similar functions. That's
    really bad design; I'd rather give up the whole thing than commit a change like that.
  • I could give up the implicit lambda in dselect and just enhance select to interpret
    a string crit arg as a regexp. But I think the implicit lambda is lovely.

pandas already separates data from indices on principle; would it be terrible
if the API reflected this as well?
There's a performance hit associated with data vs. index operations,
because of the optimizations for indices.
Might be useful to differentiate the performant from the convenient.

@ghost (Author) commented Dec 12, 2012

related #1844 (comment)

@ghost (Author) commented Dec 16, 2012

related #2064, i.e. "definitely more cowbell"

@ghost (Author) commented Apr 7, 2013

I find this functionality extremely useful when working with "civic hacking" data: small data
sets where you want to filter against a name in a column, and you really don't need to
convert it to an index.

Shouldn't pandas have some functionality (explicitly slow and unvectorized in the general case)
for slicing and filtering according to data values? This keeps coming up as a need in data I work
with, and while monkey-patching solves it for me, I'm guessing there are users who would benefit
from something like this in core. The funky "string as implicit lambda body" can go away; it was just
an idea.

Any committers +1 for something like this?

@hayd (Contributor) commented Apr 7, 2013

I definitely think something like this would be useful; however, maybe it's worth supporting multiple columns at once (...though now that I think about it, you could just chain it).

I hadn't seen this thread, but had been thinking something like this would be useful after DSM mentioned something about it on StackOverflow. (Note the terrible name choice.)

@jreback (Contributor) commented Apr 7, 2013

+1 for grep (search is OK too)

Also, I would consolidate filter and select into just filter (filter just calls select anyhow) and deprecate select.

then filter for labels, and grep for data

We had this discussion above about trying to combine these, but I am +1 for keeping them apart;
combining causes too much confusion. There should be one way for labels and one for data, which avoids ambiguity about what magic is happening.

@y-p grep should have an axis arg as well
(which I have also added to filter for my Series refactor in any event)

@ghost (Author) commented Apr 7, 2013

very true about axis. I'm completely open to name changes and more functionality,
just as long as the base form is concise.

Consider the funky lambda syntax gone; it's just a distraction and doesn't fit
in with the rest of the API.

@jreback (Contributor) commented Apr 7, 2013

I think you could accept a string (for a regex), a lambda/func, or even a more general evalable expression (in fact I am thinking of doing this in a more general way, mainly to be able to use numexpr)

e.g.

df['(df>0) & (df<5)'] is MUCH faster if I can defer the evaluation to numexpr. Of course this involves doing a compile and walking the AST, but it's actually not that hard (this is what numexpr's evaluate does, but we have to do translation of the frames for alignment and such before passing to numexpr)

and this is de facto what you are looking for, yes?
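
A minimal sketch of the deferred evaluation being described, handing a raw array to numexpr (the frame, column, and bounds are illustrative only):

import numexpr as ne
import numpy as np
import pandas as pd

df = pd.DataFrame({"A": np.random.randn(10) * 5})

# numexpr compiles the expression and evaluates it in one pass,
# avoiding the intermediate temporaries plain numpy would allocate
mask = ne.evaluate("(a > 0) & (a < 5)", local_dict={"a": df["A"].values})

print(df[mask])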

@ghost (Author) commented Apr 7, 2013

@jreback, filter and select, as in the existing DataFrame methods? Or the
methods here, dfilter/dselect/dgrep (the names are undecided)?

My choice would be to keep the data-filtering methods separate from the index
methods, to make the perf distinction clear. Just to clarify.

@jreback (Contributor) commented Apr 7, 2013

@y-p no, I agree. I was +1 for combining filter/select (and keeping them label-only),
then maybe having just a single method (grep?) for data filtering

@ghost (Author) commented Apr 7, 2013

a numexpr boost would be good, but the predicate is not always a strictly numerical/boolean expression;
it's actually string data/regex matching that I find most inconvenient to work with.

@ghost (Author) commented Apr 7, 2013

@jreback, the work you've done on boolean indexing with numexpr is great,
and I'm all for making things consistent across the API. Where are you going with that?
Also, what's going on with "OR" in the HDFStore Term deal? Should all that stuff and this
get unified?

@jreback (Contributor) commented Apr 7, 2013

@y-p no, I agree; that's why what I am proposing is a sub-case of what you are doing (e.g. you handle the regex matching / lambda eval, then you could always pass to the evaluator if necessary)

as an aside, this is what where does now (and what is backing getitem); maybe just change where a bit?

@ghost (Author) commented Apr 7, 2013

to clarify, what does where do?
The docstring says "cond: bool df or array"; it doesn't do string matching/lambdas AFAICT.

@jreback (Contributor) commented Apr 7, 2013

@y-p to your above comment, YES!

That was the point of core/expressions.py: to unify all of this syntactic stuff. The idea being that you could just give expressions.evaluate an expression and an object and have it parse/select whatever (and possibly use numexpr) and return you the results, so we have a unified syntax.

HDFStore Term syntax involves basically the same thing; you have to go full-fledged and parse the expression.

(index>0) | (values['A']>0) would be something nicer to support

@jreback (Contributor) commented Apr 7, 2013

@y-p

df.where(cond, other) gives you back the values of df that match cond, and otherwise gives you back other (or np.nan if you don't specify); it's used in getitem/setitem.

sounds like grep is basically a parser front end to this (e.g. you will create the mask)
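
For reference, a tiny illustration of that where behaviour next to mask-driven getitem (the frame is made up):

import pandas as pd

df = pd.DataFrame({"A": [1, -2, 3], "B": [-4, 5, -6]})

# where preserves the shape: non-matching cells become NaN
print(df.where(df > 0))

# boolean getitem on a row mask actually shrinks the frame
print(df[df["A"] > 0])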

@ghost (Author) commented Apr 7, 2013

ahh, so true.

@ghost (Author) commented Apr 7, 2013

About numexpr - I have no experience with it; let me read a little and get back to that.
My conception is that numexpr is only for evaluating a numpy-oriented boolean
expression faster. If you have a lambda predicate, it doesn't fit well into that
except through a generated numpy array mask.

is that totally wrong?

@jreback (Contributor) commented Apr 7, 2013

@y-p no, that is correct; that said, most operations we do fall into that category (except for maybe grep!)

but there exists commonality in that you want to parse an expression and evaluate it (locally), while I want to do this for passing to numexpr, hence a common 'expression' parser (that outputs an intermediate form suitable for either use case)

@ghost (Author) commented Apr 7, 2013

I see, the "dgrep" here is regex/str-dtype specific.
How would you disambiguate a string regex from a pd.expressions-compatible expr?
If there's a way to roll this into one, I would totally go for doing that here (and probably everywhere).

@ghost (Author) commented Apr 7, 2013

There are also the vectorized string methods to consider; I don't know
if matching is part of them right now, but it would be nice to hook into
the memoized machinery, whenever that lands (probably just a decorator to use?).

#1660, #2802 (comment)

@ghost (Author) commented Apr 7, 2013

@hayd, thanks, so there are users who are missing something like this.

@jreback (Contributor) commented Apr 7, 2013

@y-p I think it DOES make sense to separate functionality a bit, so I will propose this for data selection:

(we already beefed up index selection via iloc/loc and have filter, so no need to touch that, except maybe for combining filter/select)

  1. grep could handle a string expression for a regex, a lambda expression, or E (an expression indicator: a function that takes a string to hold and mark as an expression); see the sketch after this comment.
  2. getitem would handle dispatch to integer/label indexing, boolean indexing, and column accessing (it already does these), and would be able to handle an expression as a string (no E required)

could put this type of expression into HDFStore as well to consolidate that

alternatively, we could always force the user to use E to indicate this is an evaluable expression
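
A rough sketch of the dispatch in (1), with a minimal hypothetical E marker (none of this is an agreed API; the eval scope in particular is just one possible choice):

import re

class E(object):
    # hypothetical marker tagging a string as an evaluable expression
    def __init__(self, expr):
        self.expr = expr

def grep(df, col, crit):
    if isinstance(crit, E):
        # expression form: evaluated with the frame in scope,
        # yielding a boolean Series, e.g. "(df.A > 0) & (df.B > 0)"
        mask = eval(crit.expr, {"df": df})
    elif callable(crit):
        # lambda/function predicate applied per value
        mask = [bool(crit(x)) for x in df[col]]
    else:
        # plain string: treated as a regex over the column's values
        mask = [bool(re.search(crit, str(x))) for x in df[col]]
    return df[mask]

# e.g. grep(df, "names", "jerry.+")
#      grep(df, "nums", lambda x: x > 50)
#      grep(df, None, E("(df.A > 0) & (df.B > 0)"))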

@ghost (Author) commented Apr 7, 2013

  • I'm fine with "string is a regex"/"lambda is a predicate" combined, so this is now a single new method.
  • E is fine, but only if and when it becomes widely available in the library, for consistency.
  • How would getitem disambiguate a string label from a string expression? I vote for E everywhere
    if it's available.

@jreback (Contributor) commented Apr 7, 2013

@y-p yep, I think annotating with E might be a good idea; that said, an expression is not necessarily as simple as a string, e.g.

df['foo>0'] COULD be interpreted as the column named 'foo>0', but I think that is taking it too far; not sure we need to explicitly deal with that case

@ghost (Author) commented Apr 7, 2013

E can provide that flexibility, but once you overload the meaning of a single
type you create weird corner cases; that's why E() is safer.

... and that's a horrible thing to consider. I'm putting it in a new library
I'm working on, so I should know.

@jreback (Contributor) commented Apr 7, 2013

yep....

I think

df[E('( df.A > 0 ) & ( df.B > 0 )')] will be OK

or combining approaches (in reality I think we will just make df[E(....)] call df.grep directly, similar
to how we call df.where now for getting and setting of the boolean indexers):

df.grep(E('( df.A > 0 ) & ( df.B > 0 )'))

and E allows passing of options too....so that's all good

So I guess grep will be the data indexer then? (which will generate a boolean indexer)

@ghost (Author) commented Apr 7, 2013

it already creates an index in the comprehension, only it invokes self.ix rather than returning an index.
self.where returns a frame, not an indexer; wouldn't it be more consistent to do the same?
In any case, a DataFrame/Series is a valid indexer for getitem; it just uses the index, doesn't it?

@jreback (Contributor) commented Apr 8, 2013

where was meant as more of a top-level selector that doesn't change the shape of the frame,
similar to how boolean indexing works in a Series (but originally for a Frame), as opposed to selection,
which generally returns a smaller-shaped object (e.g. invokes ix and friends)

so the answer to your first question is that where really serves a different purpose and is more of a boolean n-d indexer (that also evaluates); really it's just a generalized, alignable version of np.where

the answer to your 2nd question is yes, you could just pass your generated object/index/whatever to ix

if it's easier to generate a boolean index (for all indices), do that; if it's easier to generate the direct index, do that.
I think you can choose based upon what makes your code simpler. That said, if you want a single consistent path (which is actually hard to do), I would go for always generating a boolean index, but at a possible performance cost. That is always the tradeoff, and why even though I prefer single paths, they never happen (and why I think some of the indexing code is like it is: to try to avoid code duplication, but more importantly to avoid some perf hits, like generating an entire boolean index just to select a few elements....). See the sketch below.
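
To make that tradeoff concrete, a small illustrative comparison (the frame is made up): a full-length boolean mask versus handing the indexer only the labels you want.

import pandas as pd

df = pd.DataFrame({"A": range(6)}, index=list("abcdef"))

# single consistent path: always materialize a full boolean mask
mask = df["A"] > 3
print(df[mask])

# direct path: give .loc just the labels of interest, no full mask
print(df.loc[["e", "f"]])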

@ghost (Author) commented Apr 8, 2013

The code is really trivial; the only question was incorporating selection methods on something
that isn't an index, which is a change from the status quo.

Both df.select and df.where return a full Series/DataFrame, which can also serve
as an indexer. I think it's best to do the same thing in new methods which are similar.

@ghost (Author) commented Apr 8, 2013

see #3276 for a candidate for the 0.11 sandbox.

ghost mentioned this issue Apr 8, 2013

@ghost (Author) commented Apr 9, 2013

moved to PR

ghost closed this as completed Apr 9, 2013