Shorter syntax for selecting data and expression evaluation (proposal) #18077

Closed
@lpenguin

Summary:
When we select data from a pandas DataFrame that has a long name, the code becomes bulky. For example:

subset = really_long_name_dataframe[
    really_long_name_dataframe['ints'].between(3, 6)
    & (really_long_name_dataframe['mul10'] != 40)
]

I'd like to propose a way to write this expression in a more compact form. We could use a reserved name as a substitute for the DataFrame:

subset = really_long_name_dataframe[
    _['ints'].between(3, 6)  # _ is a substitution for really_long_name_dataframe
    & (_['mul10'] != 40)
]

Another case where this may be useful is when we apply operations to columns:

cubes = (
    really_long_name_dataframe['ints'] * really_long_name_dataframe['squares'] 
)

# Could be written as
cubes = (
    really_long_name_dataframe(_['ints'] * _['squares'])  # via __call__ magic function
)

This can be very handy when we apply operations to a DataFrame as a chain:

(
    some_dataframe
    .groupby('squares')
    .count()
    .assign(sqrt=_.index.map(np.sqrt).astype(int))  # .assign() function
    .set_index(_.sqrt.map(str) + ' - ' + _.ints.map(str))  # .set_index() function
    [_['ints'].between(1, 20)]  # Selecting data, .__getitem__() function
    (_['sqrt'].map(np.log10) * _['ints'])  # Evaluating expressions, .__call__() function
)
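For comparison, here is a sketch of how similar expressions can be written without repeating the DataFrame name, using callables that pandas accepts in `[]`, `.loc[]`, `.assign()` and `.pipe()`. This is not the `_` syntax proposed above; the sample data and the variable names (`df`, `subset`, `products`, `result`) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Illustrative data only; not taken from the proposal.
df = pd.DataFrame({
    "ints": [1, 2, 3, 4, 5, 6],
    "mul10": [10, 20, 30, 40, 50, 60],
})

# Boolean selection: a callable passed to [] receives the DataFrame itself.
subset = df[lambda d: d["ints"].between(3, 6) & (d["mul10"] != 40)]

# Column arithmetic: pipe() passes the DataFrame to the function.
products = df.pipe(lambda d: d["ints"] * d["mul10"])

# Inside a chain, assign() and .loc[] accept callables as well,
# so each step sees the intermediate result of the previous one.
result = (
    df
    .assign(sqrt=lambda d: np.sqrt(d["ints"]))
    .loc[lambda d: d["ints"].between(1, 4)]
    .pipe(lambda d: d["sqrt"] * d["ints"])
)
```

The lambdas are more verbose than a placeholder, but each one is evaluated against the current intermediate DataFrame, which is the behaviour the chained example relies on.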

I wrote a proof-of-concept module that makes pandas capable of this syntax (via monkey patching): https://github.com/lpenguin/pandas-query

Labels: Duplicate Report, Enhancement, Needs Discussion