Closed
Description
Summary:
When we select data from a pandas DataFrame that has a long name, the code becomes bulky. For example:
subset = really_long_name_dataframe[
    really_long_name_dataframe['ints'].between(3, 6)
    & (really_long_name_dataframe['mul10'] != 40)
]
I'd like to propose a way to write this expression more compactly. We could use a reserved name as a placeholder for the DataFrame:
subset = really_long_name_dataframe[
    _['ints'].between(3, 6)  # _ is a placeholder for really_long_name_dataframe
    & (_['mul10'] != 40)
]
Another case where this would be useful is applying operations to columns:
cubes = (
    really_long_name_dataframe['ints'] * really_long_name_dataframe['squares']
)
# Could be written as
cubes = (
    really_long_name_dataframe(_['ints'] * _['squares'])  # via the __call__ magic method
)
This would be especially handy when operations are applied to a DataFrame as a chain:
(
    some_dataframe
    .groupby('squares')
    .count()
    .assign(sqrt=_.index.map(np.sqrt).astype(int))  # .assign() function
    .set_index(_.sqrt.map(str) + ' - ' + _.ints.map(str))  # .set_index() function
    [_['ints'].between(1, 20)]  # Selecting data, .__getitem__() function
    (_['sqrt'].map(np.log10) * _['ints'])  # Evaluating expressions, .__call__() function
)
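For comparison, here is a rough sketch of how close the callables that pandas already accepts get to this. As far as I know, `[]`, `.loc`, `.assign()` and `.pipe()` all take a function that receives the current DataFrame, so a `lambda` can stand in for `_`. The sample data below is made up for illustration and only reuses the column names from the examples above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the examples above.
some_dataframe = pd.DataFrame({
    'ints': [1, 2, 3, 4, 5, 6],
    'squares': [1, 4, 9, 16, 25, 36],
    'mul10': [10, 20, 30, 40, 50, 60],
})

# Selection: the callable receives the DataFrame that is being indexed.
subset = some_dataframe[
    lambda df: df['ints'].between(3, 6) & (df['mul10'] != 40)
]

# Column arithmetic: .pipe() passes the current frame to the callable.
cubes = some_dataframe.pipe(lambda df: df['ints'] * df['squares'])

# Chaining: each callable sees the frame produced by the previous step.
result = (
    some_dataframe
    .assign(sqrt=lambda df: df['squares'].map(np.sqrt).astype(int))
    .loc[lambda df: df['ints'].between(1, 20)]
)
```

This works in a chain without naming the intermediate frame, but every step needs its own `lambda df:` prefix, which is the noise the `_` placeholder is meant to remove.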
I wrote a proof-of-concept module that makes pandas support this syntax (via monkey patching): https://github.com/lpenguin/pandas-query