Implement DataFrame.value_counts #27350

Closed
wants to merge 20 commits into from
1 change: 1 addition & 0 deletions doc/source/reference/frame.rst
@@ -176,6 +176,7 @@ Computations / descriptive stats
DataFrame.std
DataFrame.var
DataFrame.nunique
DataFrame.value_counts

Reindexing / selection / label manipulation
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
2 changes: 1 addition & 1 deletion doc/source/whatsnew/v1.0.0.rst
@@ -78,7 +78,7 @@ Other API changes
^^^^^^^^^^^^^^^^^

- :meth:`pandas.api.types.infer_dtype` will now return "integer-na" for integer and ``np.nan`` mix (:issue:`27283`)
-
- Added :meth:`pandas.core.frame.DataFrame.value_counts` (:issue:`5377`).
-

.. _whatsnew_1000.deprecations:
122 changes: 121 additions & 1 deletion pandas/core/frame.py
@@ -90,7 +90,7 @@
from pandas.core.index import Index, ensure_index, ensure_index_from_sequences
from pandas.core.indexes import base as ibase
from pandas.core.indexes.datetimes import DatetimeIndex
-from pandas.core.indexes.multi import maybe_droplevels
+from pandas.core.indexes.multi import MultiIndex, maybe_droplevels
from pandas.core.indexes.period import PeriodIndex
from pandas.core.indexing import check_bool_indexer, convert_to_index_sliceable
from pandas.core.internals import BlockManager
@@ -8455,6 +8455,126 @@ def isin(self, values):
self.columns,
)

    def value_counts(
        self, normalize=False, sort=True, ascending=False, bins=None, dropna=True
    ):
        """
        Return a Series containing counts of unique rows in the DataFrame.

        .. versionadded:: 1.0.0

        The returned Series will have a MultiIndex with one level per input
Contributor

We need a subset= argument (as the first argument) to define the columns to group on; the default would be all columns. This is in line with many other DataFrame methods.

        column.

        By default, rows that contain any NaN value are omitted from the
        results.

        By default, the resulting series will be in descending order so that the
        first element is the most frequently occurring row.

        Parameters
        ----------
        normalize : boolean, default False
            If True then the Series returned will contain the relative
            frequencies of the unique values.
        sort : boolean, default True
            Sort by frequencies.
        ascending : boolean, default False
            Sort in ascending order.
        bins : integer, optional
            This parameter is not yet supported and must be set to None (the
            default value). It exists to ensure compatibility with
            `Series.value_counts`.

            Rather than count values, group them into half-open bins,
            a convenience for ``pd.cut``, only works with single-column numeric
            data.
        dropna : boolean, default True
            This parameter is not yet supported and must be set to True (the
            default value). It exists to ensure compatibility with
            `Series.value_counts`.

            Don't include counts of rows containing NaN.

        Returns
        -------
        counts : Series

        See Also
        --------
        Series.value_counts: Equivalent method on Series.
Contributor

We have more options on Series.value_counts (dropna, for example); these need to be implemented.

Author

There's no option in groupby to not drop rows containing a NaN. How do I go about implementing that case?

Member

I would be OK with raising a NotImplementedError for that case

Author

Added. This changed the method pretty significantly. PTAL.

Author

The single-column case now works, but the code raises NotImplementedError for the multi-column case.
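
For the `dropna=False` question raised in this thread: later pandas releases (1.1 and newer) accept a `dropna` keyword on `groupby`, which was not available when this PR was written. The following is only a sketch under that assumption. `frame_value_counts` is a hypothetical helper name, not code from this PR.

```python
import numpy as np
import pandas as pd


def frame_value_counts(df, dropna=True):
    """Count unique rows of ``df`` (hypothetical helper, not the PR's code)."""
    # Assumption: pandas >= 1.1, where groupby accepts ``dropna``.
    # With dropna=False, rows whose keys contain NaN form their own groups.
    return df.groupby(df.columns.tolist(), dropna=dropna).size()


df = pd.DataFrame({"a": [1.0, 1.0, np.nan], "b": ["x", "x", "y"]})
with_nan = frame_value_counts(df, dropna=False)
without_nan = frame_value_counts(df, dropna=True)
```

With `dropna=True` the row containing NaN is excluded from the counts, which matches the default behaviour described in the docstring above.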


        Examples
        --------

        >>> df = pd.DataFrame({'num_legs': [2, 4, 4, 6],
        ...                    'num_wings': [2, 0, 0, 0]},
        ...                   index=['falcon', 'dog', 'cat', 'ant'])
        >>> df
                num_legs  num_wings
        falcon         2          2
        dog            4          0
        cat            4          0
        ant            6          0

        >>> df.value_counts()
        num_legs  num_wings
        4         0            2
        6         0            1
        2         2            1
        dtype: int64

        >>> df.value_counts(sort=False)
        num_legs  num_wings
        2         2            1
        4         0            2
        6         0            1
        dtype: int64

        >>> df.value_counts(ascending=True)
        num_legs  num_wings
        2         2            1
        6         0            1
        4         0            2
        dtype: int64

        >>> df.value_counts(normalize=True)
        num_legs  num_wings
        4         0            0.50
        6         0            0.25
        2         2            0.25
        dtype: float64

        >>> single_col_df = df[['num_legs']]
        >>> single_col_df.value_counts(bins=4)
        num_legs
        (3.0, 4.0]      2
        (5.0, 6.0]      1
        (1.995, 3.0]    1
        (4.0, 5.0]      0
        dtype: int64
        """

        # Some features not supported yet.
        if not dropna:
            raise NotImplementedError(
                "`dropna=False` not yet supported for dataframes."
            )
        if bins is not None:
Member

Here we're saying bins isn't implemented at all, but according to the docstring and examples the parameter is implemented for single column DataFrames, so I think you'd need a special case here as well

Author

Done. This was leftover from the special casing of parameters for single columns.

            raise NotImplementedError(
                "`bins` parameter not yet supported for dataframes."
            )

        counts = self.groupby(self.columns.tolist()).size()
Contributor

subset gets incorporated here; it must be list-like (or None)
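
One way the requested `subset` handling could look, written as a free function so it runs on its own. This is a sketch only: `value_counts_with_subset` is a hypothetical name, and the `subset=None` default meaning "all columns" follows the reviewer's description rather than any code in this PR.

```python
import pandas as pd


def value_counts_with_subset(df, subset=None, sort=True, ascending=False):
    """Count unique rows over ``subset`` columns (hypothetical sketch)."""
    # None means "group on every column"; otherwise expect a list-like
    # of column labels, as the review comment asks for.
    columns = df.columns.tolist() if subset is None else list(subset)
    counts = df.groupby(columns).size()
    if sort:
        counts = counts.sort_values(ascending=ascending)
    return counts


df = pd.DataFrame(
    {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
    index=["falcon", "dog", "cat", "ant"],
)
by_wings = value_counts_with_subset(df, subset=["num_wings"])
all_columns = value_counts_with_subset(df)
```

Note that a bare string passed as `subset` would be split into characters by `list(subset)`, which is one reason to insist on a list-like here.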

        if sort:
            counts.sort_values(ascending=ascending, inplace=True)
        if normalize:
            counts /= counts.sum()
        # Force MultiIndex index.
        if len(self.columns) == 1:
            counts.index = MultiIndex.from_arrays([counts.index])
Contributor

need to make sure the name of this is preserved (pass name= as well), though I think we infer this now, so just make sure to test
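
A small standalone check of the name-preservation point, as a sketch. It passes `names=` explicitly instead of relying on inference from the single `Index`, so it does not assume anything about how `MultiIndex.from_arrays` infers names in a given pandas version.

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4, 6]})

# Grouping on a one-element list yields a flat Index named "num_legs".
counts = df.groupby(["num_legs"]).size()

# Wrap the flat Index in a one-level MultiIndex, carrying the name over
# explicitly so the level name cannot be lost.
counts.index = pd.MultiIndex.from_arrays(
    [counts.index], names=[counts.index.name]
)
```

The single-column test in this PR expects exactly this shape: a one-level MultiIndex whose level is named after the column.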

        return counts

    # ----------------------------------------------------------------------
    # Add plotting methods to DataFrame
    plot = CachedAccessor("plot", pandas.plotting.PlotAccessor)
106 changes: 106 additions & 0 deletions pandas/tests/frame/test_analytics.py
@@ -2766,3 +2766,109 @@ def test_multiindex_column_lookup(self):
        result = df.nlargest(3, ("x", "b"))
        expected = df.iloc[[3, 2, 1]]
        tm.assert_frame_equal(result, expected)

Contributor

go ahead and make a new file, test_value_counts.py and put in pandas/tests/frame/analytics/test_value_counts.py (we will split / move analytics later)

    def test_data_frame_value_counts_unsorted(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        result = df.value_counts(sort=False)
        expected = pd.Series(
            data=[1, 2, 1],
            index=pd.MultiIndex.from_arrays(
                [(2, 4, 6), (2, 0, 0)], names=["num_legs", "num_wings"]
            ),
        )
        tm.assert_series_equal(result, expected)

    def test_data_frame_value_counts_ascending(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        result = df.value_counts(ascending=True)
        expected = pd.Series(
            data=[1, 1, 2],
            index=pd.MultiIndex.from_arrays(
                [(2, 6, 4), (2, 0, 0)], names=["num_legs", "num_wings"]
            ),
        )
        tm.assert_series_equal(result, expected)

    def test_data_frame_value_counts_default(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        result = df.value_counts()
        expected = pd.Series(
            data=[2, 1, 1],
            index=pd.MultiIndex.from_arrays(
                [(4, 6, 2), (0, 0, 2)], names=["num_legs", "num_wings"]
            ),
        )
        tm.assert_series_equal(result, expected)

    def test_data_frame_value_counts_normalize(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        result = df.value_counts(normalize=True)
        expected = pd.Series(
            data=[0.5, 0.25, 0.25],
            index=pd.MultiIndex.from_arrays(
                [(4, 6, 2), (0, 0, 2)], names=["num_legs", "num_wings"]
            ),
        )
        tm.assert_series_equal(result, expected)

    def test_data_frame_value_counts_dropna_not_supported_yet(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        with pytest.raises(NotImplementedError, match="not yet supported"):
            df.value_counts(dropna=False)

    def test_data_frame_value_counts_bins_not_supported(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        with pytest.raises(NotImplementedError, match="not yet supported"):
            df.value_counts(bins=2)

    def test_data_frame_value_counts_single_col_default(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        df_single_col = df[["num_legs"]]
        result = df_single_col.value_counts()
        expected = pd.Series(
            data=[2, 1, 1],
            index=pd.MultiIndex.from_arrays([[4, 6, 2]], names=["num_legs"]),
        )
        tm.assert_series_equal(result, expected)

    def test_data_frame_value_counts_single_col_bins(self):
        df = pd.DataFrame(
            {"num_legs": [2, 4, 4, 6], "num_wings": [2, 0, 0, 0]},
            index=["falcon", "dog", "cat", "ant"],
        )
        df_single_col = df[["num_legs"]]
        with pytest.raises(NotImplementedError, match="not yet supported"):
            _ = df_single_col.value_counts(bins=4)

    def test_data_frame_value_counts_empty(self):
        df_no_cols = pd.DataFrame()
        result = df_no_cols.value_counts()
        expected = pd.Series([], dtype=np.int64)
        tm.assert_series_equal(result, expected)

    def test_data_frame_value_counts_empty_normalize(self):
        df_no_cols = pd.DataFrame()
        result = df_no_cols.value_counts(normalize=True)
        expected = pd.Series([], dtype=np.float64)
        tm.assert_series_equal(result, expected)