Skip to content

PERF: avoid copy in _obj_with_exclusions, _selected_obj #51090

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 5 commits into from
Feb 1, 2023
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v2.0.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -948,6 +948,8 @@ Performance improvements
- Performance improvement in :meth:`.SeriesGroupBy.value_counts` with categorical dtype (:issue:`46202`)
- Fixed a reference leak in :func:`read_hdf` (:issue:`37441`)
- Fixed a memory leak in :meth:`DataFrame.to_json` and :meth:`Series.to_json` when serializing datetimes and timedeltas (:issue:`40443`)
- Decreased memory usage in many :class:`DataFrameGroupBy` methods (:issue:`51090`)
-

.. ---------------------------------------------------------------------------
.. _whatsnew_200.bug_fixes:
Expand Down
6 changes: 3 additions & 3 deletions pandas/core/base.py
Original file line number Diff line number Diff line change
Expand Up @@ -213,12 +213,12 @@ def ndim(self) -> int:
@final
@cache_readonly
def _obj_with_exclusions(self):
if self._selection is not None and isinstance(self.obj, ABCDataFrame):
return self.obj[self._selection_list]

if isinstance(self.obj, ABCSeries):
return self.obj

if self._selection is not None:
return self.obj._getitem_nocopy(self._selection_list)

if len(self.exclusions) > 0:
# equivalent to `self.obj.drop(self.exclusions, axis=1)
# but this avoids consolidating and making a copy
Expand Down
19 changes: 19 additions & 0 deletions pandas/core/frame.py
Original file line number Diff line number Diff line change
Expand Up @@ -3663,6 +3663,25 @@ def _iter_column_arrays(self) -> Iterator[ArrayLike]:
for i in range(len(self.columns)):
yield self._get_column_array(i)

def _getitem_nocopy(self, key: list):
"""
Behaves like __getitem__, but returns a view in cases where __getitem__
would make a copy.
"""
# TODO(CoW): can be removed if/when we are always Copy-on-Write
indexer = self.columns._get_indexer_strict(key, "columns")[1]
new_axis = self.columns[indexer]

new_mgr = self._mgr.reindex_indexer(
new_axis,
indexer,
axis=0,
allow_dups=True,
copy=False,
only_slice=True,
)
return self._constructor(new_mgr)

def __getitem__(self, key):
check_dict_or_set_indexers(key)
key = lib.item_from_zerodim(key)
Expand Down
24 changes: 18 additions & 6 deletions pandas/core/groupby/groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -76,6 +76,7 @@ class providing the base-class of operations.
is_bool_dtype,
is_datetime64_dtype,
is_float_dtype,
is_hashable,
is_integer,
is_integer_dtype,
is_numeric_dtype,
Expand Down Expand Up @@ -723,13 +724,24 @@ def _get_index(self, name):
@cache_readonly
def _selected_obj(self):
# Note: _selected_obj is always just `self.obj` for SeriesGroupBy

if self._selection is None or isinstance(self.obj, Series):
if self._group_selection is not None:
return self.obj[self._group_selection]
if isinstance(self.obj, Series):
return self.obj
else:
return self.obj[self._selection]

if self._selection is not None:
if is_hashable(self._selection):
# i.e. a single key, so selecting it will return a Series.
# In this case, _obj_with_exclusions would wrap the key
# in a list and return a single-column DataFrame.
return self.obj[self._selection]
Comment on lines +730 to +735
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think any time self.obj[self._selection] results in a Series will be when self is a SeriesGroupBy and self.obj is a Series, which is handled above.

If the frame has duplicate columns, then we can windup in the case where self._selection is hashable and self is a DataFrameGroupBy, but in this case I think self.obj[self_selection] will then agree with self._obj_with_exclusions (both frames).

So I think we can still return obj_with_exclusions here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That logic works for SeriesGroupBy/DataFrameGroupBy, but this code path is shared by Resampler, which defines _gotitem kludgily. Until we change that, test_resample_groupby_agg_listlike would fail if we return obj_with_exclusions here

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until we change that

(this is part of why i pinged you about the non-unique columns in agg_dict_like. If we can make sure to always pass subset to _gotitem, i think that makes it a lot easier to de-kludge Resample._gotitem)


# Otherwise _selection is equivalent to _selection_list, so
# _selected_obj matches _obj_with_exclusions, so we can re-use
# that and avoid making a copy.
return self._obj_with_exclusions

if self._group_selection is not None:
return self._obj_with_exclusions
return self.obj

@final
def _dir_additions(self) -> set[str]:
Expand Down