
Shorter syntax for selecting data and expression evaluation (proposal) #18077

Closed
lpenguin opened this issue Nov 2, 2017 · 4 comments
Labels: Duplicate Report, Enhancement, Needs Discussion

Comments

@lpenguin

lpenguin commented Nov 2, 2017

Summary:
When we select data from a pandas DataFrame that has a long name, the code becomes bulky. For example:

subset = really_long_name_dataframe[
    really_long_name_dataframe['ints'].between(3, 6)
    & (really_long_name_dataframe['mul10'] != 40)
]

I'd like to propose a way to write this expression in a more compact form. We could use a reserved name as a stand-in for the DataFrame:

subset = really_long_name_dataframe[
    _['ints'].between(3, 6)  # _ is a substitution for really_long_name_dataframe
    & (_['mul10'] != 40)
]

Another case where this may be useful is applying operations to columns:

cubes = (
    really_long_name_dataframe['ints'] * really_long_name_dataframe['squares'] 
)

# Could be written as
cubes = (
    really_long_name_dataframe(_['ints'] * _['squares'])  # via __call__ magic function
)

This can come in very handy when we chain operations on a DataFrame:

(
    some_dataframe
    .groupby('squares')
    .count()
    .assign(sqrt=_.index.map(np.sqrt).astype(int))  # .assign() function
    .set_index(_.sqrt.map(str) + ' - ' + _.ints.map(str))  # .set_index() function
    [_['ints'].between(1, 20)]  # Selecting data, .__getitem__() function
    (_['sqrt'].map(np.log10) * _['ints'])  # Evaluating expressions, .__call__() function
)

I wrote a proof-of-concept module that makes pandas capable of this syntax (via monkey patching): https://github.com/lpenguin/pandas-query

@TomAugspurger
Contributor

TomAugspurger commented Nov 2, 2017

Thanks for the examples (and proof of concept!).

You may be aware of this already, but for indexing, __getitem__, loc, and iloc all accept a callable, so you can do

subset = really_long_name_dataframe[lambda df: df['ints'].between(3, 6) ...]

Likewise with assign.

Python's lambda isn't the shortest, but it may be an improvement. And you can refactor the lambdas out into standalone functions if they're reused.
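To make the callable-based approach concrete, here is a minimal sketch. The sample data and the helper name in_range_and_not_40 are made up for illustration; only the callable support in __getitem__, loc, and assign comes from the comment above.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
really_long_name_dataframe = pd.DataFrame({
    "ints": [1, 2, 3, 4, 5, 6, 7],
    "mul10": [10, 20, 30, 40, 50, 60, 70],
})

# __getitem__ accepts a callable: it receives the DataFrame being indexed
# and must return something valid for indexing (here, a boolean mask).
subset = really_long_name_dataframe[
    lambda df: df["ints"].between(3, 6) & (df["mul10"] != 40)
]


def in_range_and_not_40(df):
    """Reusable filter, refactored out of the lambda above."""
    return df["ints"].between(3, 6) & (df["mul10"] != 40)


# .loc accepts the same kind of callable.
subset_loc = really_long_name_dataframe.loc[in_range_and_not_40]

# .assign also accepts callables, which is what makes it usable in a chain.
with_squares = really_long_name_dataframe.assign(
    squares=lambda df: df["ints"] ** 2
)
```

Because the callable receives whatever object is being indexed, the long variable name only appears once, and the same function keeps working in the middle of a method chain where no variable name exists at all.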

IIRC, it's frowned upon for libraries to use _, since that's the name the interactive interpreter typically binds to the last returned value, and interactive use is important to pandas (a different identifier could be used, of course).

Making dataframes callable would be a big change with (I'm guessing) a lot of unintended negative consequences.

@chris-b1
Contributor

chris-b1 commented Nov 2, 2017

This is similar to the magic X used by dplython and pandas_ply, see also #13133
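For readers unfamiliar with the "magic X" idea, the following is a rough sketch of how such a placeholder could work. It is not the implementation used by dplython, pandas_ply, or pandas-query; the Deferred class and the name X are hypothetical, and only a handful of operators are supported. It relies on the fact that pandas calls a callable index key with the DataFrame.

```python
import operator

import pandas as pd


class Deferred:
    """Hypothetical placeholder that records an expression and
    evaluates it later, once a DataFrame is supplied."""

    def __init__(self, func):
        self._func = func  # maps a DataFrame to a value

    def __call__(self, df):
        # Evaluate the recorded expression against a concrete DataFrame.
        return self._func(df)

    def __getitem__(self, key):
        # X['col'] records "take column 'col'" without evaluating it yet.
        return Deferred(lambda df: self._func(df)[key])

    def _binary(self, op, other):
        def evaluate(df):
            right = other(df) if isinstance(other, Deferred) else other
            return op(self._func(df), right)
        return Deferred(evaluate)

    def __mul__(self, other):
        return self._binary(operator.mul, other)

    def __ne__(self, other):
        return self._binary(operator.ne, other)

    def __ge__(self, other):
        return self._binary(operator.ge, other)

    def __le__(self, other):
        return self._binary(operator.le, other)

    def __and__(self, other):
        return self._binary(operator.and_, other)


# The placeholder itself: evaluating it returns the DataFrame unchanged.
X = Deferred(lambda df: df)

# Hypothetical sample data, for illustration only.
df = pd.DataFrame({
    "ints": [1, 2, 3, 4, 5, 6, 7],
    "squares": [1, 4, 9, 16, 25, 36, 49],
    "mul10": [10, 20, 30, 40, 50, 60, 70],
})

# A Deferred is callable, so pandas evaluates it against df when indexing.
subset = df[(X["ints"] >= 3) & (X["ints"] <= 6) & (X["mul10"] != 40)]

# Expressions can also be evaluated explicitly.
cubes = (X["ints"] * X["squares"])(df)
```

Note that this sketch deliberately avoids recording method calls such as .between(3, 6): using __call__ both for evaluation and for recorded method calls is ambiguous, which is one of the design problems a real implementation has to solve.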

@gfyoung
Member

gfyoung commented Nov 3, 2017

Let's also not forget the .query method, which allows you to use SQL-like syntax.
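As a rough illustration of .query, the filter from the original post could be written as an expression string. The sample data below is made up for the example.

```python
import pandas as pd

# Hypothetical sample data, for illustration only.
really_long_name_dataframe = pd.DataFrame({
    "ints": [1, 2, 3, 4, 5, 6, 7],
    "mul10": [10, 20, 30, 40, 50, 60, 70],
})

# Column names are referenced directly inside the expression string,
# so the DataFrame name appears only once.
subset = really_long_name_dataframe.query(
    "ints >= 3 and ints <= 6 and mul10 != 40"
)
```

The trade-off is that the expression lives in a string, so editors and linters cannot check it the way they check ordinary Python code.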

gfyoung added the Enhancement and Needs Discussion labels Nov 3, 2017
@jreback
Contributor

jreback commented Nov 3, 2017

Yeah, if you want to address the issues discussed in the X issue #13133, please feel free.

jreback closed this as completed Nov 3, 2017
jreback added the Duplicate Report label Nov 3, 2017
jreback added this to the No action milestone Nov 3, 2017

5 participants