ENH: df.grep(col,pat) and df.dselect(col,"expr") #2460
also

    def dselect(self, col, f):
        f = eval("lambda x: " + f)
        return self.ix[[f(x) for x in self[col]]]

so this is possible:

In [13]: df=mkdf(20,2)
...: s=pd.Series(np.random.randint(1,100,len(df.index)),index=df.index,name="nums")
...: df=df.join(s)
...: print(df)
...: df.dselect("nums","x>50 or x<25").grep("C_l0_g0","[2|4|8]")
C_l0_g0 C_l0_g1 nums \
R0
R_l0_g0 R0C0 R0C1 55
R_l0_g1 R1C0 R1C1 37
R_l0_g10 R2C0 R2C1 61
R_l0_g11 R3C0 R3C1 62
R_l0_g12 R4C0 R4C1 75
R_l0_g13 R5C0 R5C1 93
R_l0_g14 R6C0 R6C1 31
R_l0_g15 R7C0 R7C1 73
R_l0_g16 R8C0 R8C1 6
R_l0_g17 R9C0 R9C1 44
R_l0_g18 R10C0 R10C1 97
R_l0_g19 R11C0 R11C1 64
R_l0_g2 R12C0 R12C1 39
R_l0_g20 R13C0 R13C1 11
R_l0_g3 R14C0 R14C1 5
R_l0_g4 R15C0 R15C1 28
R_l0_g5 R16C0 R16C1 25
R_l0_g6 R17C0 R17C1 63
R_l0_g7 R18C0 R18C1 21
R_l0_g8 R19C0 R19C1 59
Out[13]:
C_l0_g0 C_l0_g1 nums \
R0
R_l0_g10 R2C0 R2C1 61
R_l0_g12 R4C0 R4C1 75
R_l0_g16 R8C0 R8C1 6
R_l0_g3 R14C0 R14C1 5
R_l0_g7 R18C0 R18C1 21
not enough |
dselect now accepts lambdas as well as eval snippets,
|
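A minimal sketch of a `dselect` that takes either a callable or an eval snippet, written as a free function rather than a DataFrame method. The sample frame and its column names are made up for illustration, and plain boolean indexing stands in for `.ix`, which has since been removed from pandas. This is not the code from the eventual PR.

```python
import pandas as pd


def dselect(df, col, pred):
    """Keep the rows of df where pred holds for the value in column col."""
    if isinstance(pred, str):
        # Assumption: a string is the body of a one-argument lambda in x.
        # eval runs arbitrary code, so only trusted strings should be passed.
        pred = eval("lambda x: " + pred)
    # Apply the predicate element by element (explicitly unvectorized).
    mask = df[col].map(pred).astype(bool)
    return df[mask]


# Hypothetical sample data.
df = pd.DataFrame({"name": ["a", "b", "c", "d"], "nums": [55, 37, 6, 21]})

by_string = dselect(df, "nums", "x > 50 or x < 25")
by_lambda = dselect(df, "nums", lambda x: x > 50 or x < 25)
```

Both calls keep the rows whose `nums` value is 55, 6 or 21.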
@wesm, if this has your blessing I'll round it out into a PR. |
for selecting columns, does grep overlap with select/ filter?
|
AFAICT select/filter operate exclusively on index labels rather than data. I have these monkey-patched onto pd.DataFrame at load time and find them very useful. An example use case: df.set_index("names",False,True).select(lambda x: bool(re.search("jerry.+",x[-1]))).reset_index(-1,drop=True) edit: cleaned up the example. IMO, this is very intuitive: df.grep("names","jerry.+") and with some fleshing out (MultiIndex columns) could be a useful addition to core pandas. |
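To make the label-versus-data distinction concrete, here is a rough sketch. `grep` below is a hypothetical free function, not an existing pandas method, and the sample frame is invented. `DataFrame.filter(regex=..., axis=0)` is the existing call that matches against index labels.

```python
import re

import pandas as pd


def grep(df, col, pat):
    """Keep rows whose value in column col matches the regex pat (re.search)."""
    rx = re.compile(pat)
    mask = df[col].map(lambda v: bool(rx.search(str(v))))
    return df[mask]


# Hypothetical sample data: the row labels and the "names" column differ.
df = pd.DataFrame(
    {"names": ["jerry lee", "tom", "jerry", "ben jerryson"]},
    index=["jerry_row", "x", "y", "z"],
)

on_labels = df.filter(regex="jerry", axis=0)  # matches index labels only
on_data = grep(df, "names", "jerry.+")        # matches column values
```

`on_labels` holds only the row labelled `jerry_row`, while `on_data` holds the rows whose `names` value has at least one character after `jerry`.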
I'd prefer it if we didn't call it "grep". A more intuitive name like "search" or something would be better IMO. While you and I have no trouble understanding what "grepping" means, it's not the case for a lot of users. |
In addition, I think it makes sense to deprecate |
no strong opinion on naming here. whatever works. rolling this into existing functions is a bit of a problem due to the existing
pandas already separates data from indices on principle, would it be terrible |
related #1844 (comment) |
related #2064 , i.e. "definitely more cowbell" |
I find this functionality extremely useful when working with "civic hacking" data, small data Shouldn't pandas have (Explicitly slow and unvectorized in the general case) some functionality Any committers +1 for something like this? |
I definitely think something like this would be useful, however maybe it's useful to restrict to multiple columns at once (...though now I think about it you could just chain it). I hadn't seen this thread but had been thinking something like this would be useful after DSM mentioned something about it on StackOverflow. (Note the terrible name choice.) |
+1 for Also would consolidate then We had this discussion above about trying to combine these, but I am +1 for keeping apart, @y-p |
very true about Consider the funky lambda syntax gone, it's just a distraction and doesn't fit |
I think you could accept a string (for re), a lambda/func, or even a more general evalable expression (in fact I am thinking of doing this in a more general way, mainly to be able to use numexpr) e.g.
and this is de-facto what you are looking for, yes? |
@jreback , filter and select, as in the existing dataframe methods? or the my choice would be to keep the data filtering methods separate from the index |
@y-p no I agree, I was +1 for combining filter/select (and keep them as label only), |
numexpr boost would be good, but the predicate is not always a strictly numerical/bool expression, |
@jreback, The work you've done on boolean indexing with numexpr is great, |
@y-p no I agree, that's why what I am proposing is a sub-case of what you are doing (e.g. you handle the regular expression matching / lambda eval, then you could always pass to the evaluator if necessary) as an aside this is what |
clarify, what does |
@y-p to your above comment, YES! that was the point of the core/expressions.py, to unify all of this syntatical stuff (I am not sure if that is a word?). The idea being that you could just give
|
sounds like |
ahh, so true. |
About numexpr, I have no experience with - let me read a little and get back to that, is that totally wrong? |
@y-p no that is correct, that said most operations we do fall into that category (except for maybe grep!) but there exists commonality in that you want to parse an expression and evaluate (locally), while I want to do this for passing to numexpr, hence a common 'expression' parser (that outputs an intermediate form suitable for either use case) |
I see, the "dgrep" here is regex/str dtype specific. |
There's also the vectorized string methods to consider, don't know |
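For the regex-on-strings case specifically, the vectorized string methods already give a boolean mask that can be used for indexing. A small sketch, with an invented sample frame:

```python
import pandas as pd

# Hypothetical sample data, including a missing value.
df = pd.DataFrame({"names": ["jerry lee", "tom", None, "ben jerryson"]})

# Series.str.contains applies the regex to every element;
# na=False treats missing values as non-matches so the mask stays boolean.
mask = df["names"].str.contains("jerry.+", regex=True, na=False)
result = df[mask]
```

This covers the string-matching half of the proposal but not arbitrary predicates such as `x > 50 or x < 25`.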
@hayd, thanks, so there are users who are missing something like this. |
@y-p I think it DOES make sense to separate functionality a bit, so will propose this for data selection: (we already beefed up index selection via iloc/loc and have filter, so no need to touch that, maybe except for combining filter/select)
could put this type of expression into HDFStore as well to consolidate that. alternatively, we could always force the user to use E to indicate this is an evaluable expression |
|
@y-p yep I think annotating with E might be a good idea, that said an expression is necessarily not as simple as a string e.g.
|
E can provide that flexibility, but once you overload the meaning of single ... and that's a horrible thing to consider. I'm putting it in a new library |
yep.... I think
or combining approaches ( in reality I think we will just make df[E(....)] call df.grep directly (similar
and E allows passing of options too....so that's all good So I guess grep will be the data indexer then ? (which will generate a boolean indexer) |
it already creates an index in the comprehension, only it invokes self.ix rather than returning an index. |
so answer to your first question is that where is really serving a different purpose and is more of a your 2nd question is yes, you could just pass to ix your generated object/index whatever. if it's easier to generate a boolean index (for all indices), do that, if it's easier to generate the direct index, do that |
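The two options mentioned here, a boolean index over all rows or a direct index of the selected labels, can be sketched as follows. The sample frame is invented, the index is assumed to be unique, and `.loc` stands in for `.ix`, which has since been removed from pandas.

```python
import pandas as pd

# Hypothetical sample data with a unique index.
df = pd.DataFrame({"nums": [55, 37, 6]}, index=["a", "b", "c"])

# Option 1: a boolean indexer with one entry per row.
mask = (df["nums"] > 50) | (df["nums"] < 25)
by_mask = df.loc[mask]

# Option 2: the labels of the selected rows, derived here from the same mask.
labels = df.index[mask.to_numpy()]
by_labels = df.loc[labels]
```

With a unique index both forms select the same rows, so the helper can return whichever is cheaper to build.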
The code is really trivial, the only ? was incorporating selection methods on something both |
see #3276 for candidate for 0.11 sandbox. |
moved to PR |
wes, how would you feel about adding something like the following as
a DataFrame method? Especially with method chaining I would find this
useful. Will add handling for datatypes and so on.