
Implement DataFrame.value_counts #27350


Closed
wants to merge 20 commits

Conversation


@Strilanc Strilanc commented Jul 11, 2019

closes #5377

(This is a feature request stated in the form of a PR.)

This change makes it easy to count the number of times each unique row appears in a data frame. It also removes an unnecessary difference between DataFrame and Series (i.e. the existence of the value_counts methods).

        >>> df = pd.DataFrame({'num_legs': [2, 4, 4], 'num_wings': [2, 0, 0]},
        ...                   index=['falcon', 'dog', 'cat'])
        >>> df
                num_legs  num_wings
        falcon         2          2
        dog            4          0
        cat            4          0

        >>> df.value_counts()
        (4, 0)    2
        (2, 2)    1
        dtype: int64

        >>> df1col = df[['num_legs']]
        >>> df1col
                num_legs
        falcon         2
        dog            4
        cat            4

        >>> df1col.value_counts()
        (4,)    2
        (2,)    1
        dtype: int64
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
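
The behavior shown above can be approximated with a one-liner over tuples of row values (the approach named in this PR's original title); a rough sketch, not the final implementation:

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
    index=["falcon", "dog", "cat"],
)

# Collapse each row to a tuple, then reuse Series.value_counts.
# This matches the output shown above, with tuples as the index.
row_counts = df.apply(tuple, axis=1).value_counts()
print(row_counts)
```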

@Strilanc
Author

Strilanc commented Jul 11, 2019

A reasonable alternative may be df.groupby(df.columns.tolist(),as_index=False).size() as used here: https://stackoverflow.com/questions/35584085/how-to-count-duplicate-rows-in-pandas-dataframe

Mostly I just want there to be a simple way to count the rows by value.
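
The groupby alternative above can be exercised directly; note that in recent pandas versions `groupby(...).size()` with `as_index=False` returns a DataFrame with a `size` column (older versions behaved differently):

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})

# One row per unique (num_legs, num_wings) combination, with its count.
out = df.groupby(df.columns.tolist(), as_index=False).size()
print(out)
```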

@WillAyd
Member

WillAyd commented Jul 11, 2019

I think this would close #5377

Member

@WillAyd WillAyd left a comment

Thanks for the PR

@WillAyd WillAyd added API Design DataFrame DataFrame data structure labels Jul 11, 2019
@Strilanc Strilanc changed the title DataFrame.value_counts = lambda self: self.apply(tuple, 1).value_counts() Implement DataFrame.value_counts Jul 12, 2019
Contributor

@jreback jreback left a comment

Will have to review the semantics; at first glance this is not very obvious.

@Strilanc
Author

Are the test failures actually related to the change I'm making? They seem to be in unrelated places.

@WillAyd
Member

WillAyd commented Jul 13, 2019

Yeah, it looks like you need to run black on your changes.

@Strilanc
Author

I ran black. It appears to have improved the situation, but the remaining failures are still hidden away amongst tens of thousands of lines of console output in travis-ci.

@TomAugspurger
Contributor

TomAugspurger commented Jul 16, 2019 via email

@Strilanc
Author

Merging to master didn't fix it

@TomAugspurger
Contributor

TomAugspurger commented Jul 17, 2019 via email

@Strilanc
Author

I opened dask/dask#5109 on the dask repo

@Strilanc
Author

This is passing now that Dask has been updated.

Member

@WillAyd WillAyd left a comment

Can you add a whatsnew for v1.0.0?

@WillAyd WillAyd added this to the 1.0 milestone Aug 26, 2019
@Strilanc
Author

Added a whatsnew

Member

@WillAyd WillAyd left a comment

I think this looks good - @jreback thoughts?

"""
The number of times each unique row appears in the DataFrame.

Rows that contain any NaN value are omitted from the results.
Contributor

show a versionadded tag

Author

What version should I enter for the tag?

Member

1.0.0

Author

Done.

4 0 2
dtype: int64

>>> df1col = df[['num_legs']]
Contributor

the 2nd example is showing how this works for a Series?

Author

>>> type(df[['num_legs']])
<class 'pandas.core.frame.DataFrame'>


See Also
--------
Series.value_counts: Equivalent method on Series.
Contributor

We have more options on Series.value_counts (dropna, for example); these need to be implemented.

Author

There's no option in groupby to not drop rows containing a NaN. How do I go about implementing that case?
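
For reference, pandas 1.1 later added a dropna parameter to groupby itself, which makes this case straightforward; it was not available at the time of this PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num_legs": [2.0, 4.0, np.nan], "num_wings": [2, 0, 0]})

# pandas >= 1.1: dropna=False keeps the NaN-containing row as its own group;
# the default (dropna=True) silently drops it, which is the behavior
# discussed in this thread.
kept = df.groupby(df.columns.tolist(), dropna=False).size()
dropped = df.groupby(df.columns.tolist()).size()
print(kept.sum(), dropped.sum())
```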

Member

I would be OK with raising a NotImplementedError for that case

Author

Added. This changed the method pretty significantly. PTAL.

Author

The single-column case now works, but the code raises NotImplementedError for the multi-column case.

{"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
index=["falcon", "dog", "cat"],
)
actual = df.value_counts()
Contributor

use result rather than actual

Author

Done

Parameters
----------
normalize : boolean, default False
If True then the object returned will contain the relative
Contributor

"object" -> "Series"

Author

Done.

dropna=dropna,
)
# Move series name into its index, as happens in multi-column case.
return Series(data=series.values, index=series.index.set_names(series.name))
Contributor

Is this a MultiIndex? I think this method should always return a Series with a MultiIndex, even if it has one level.

Author

Done.

@@ -2766,3 +2766,84 @@ def test_multiindex_column_lookup(self):
result = df.nlargest(3, ("x", "b"))
expected = df.iloc[[3, 2, 1]]
tm.assert_frame_equal(result, expected)

def test_data_frame_value_counts(self):
Contributor

Can you split up this test? Roughly one test per "thing" you're testing (single column, raising for unsupported keyword, etc.)

Author

Done.

Contributor

@TomAugspurger TomAugspurger left a comment

Looks quite close. Just one request to avoid special casing frames with 1 column. Though perhaps wait to hear from other maintainers before changing anything, if you disagree.

),
)

# Some features are only supported for single-column data.
Contributor

I think these checks should be done first.

I'd rather have

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df.value_counts(dropna=False)
df[['A']].value_counts(dropna=False)

both raise. That way, we don't have the behavior depending on the shape.

Author

Done. I had to keep the special casing of single columns since otherwise you get an index that's not a multiindex.

@@ -90,7 +90,7 @@
from pandas.core.index import Index, ensure_index, ensure_index_from_sequences
from pandas.core.indexes import base as ibase
from pandas.core.indexes.datetimes import DatetimeIndex
from pandas.core.indexes.multi import maybe_droplevels
from pandas.core.indexes.multi import maybe_droplevels, MultiIndex
Member

MultiIndex should go before maybe_droplevels

Author

Done.

"`bins` parameter not yet supported for dataframes."
)

# Delegate to Series.value_counts for single-column data frames.
Member

Hmm @TomAugspurger, this is the special casing you were referring to, right? Sorry, it's kind of tough to tell from the history at this point.

Is there a particular reason why we would want to do this? I think the type of output should match the input.

Contributor

I think the problem was groupby returns a regular Index when you're grouping by a single column? And in all other cases you get a MultiIndex. So we need some kind of special condition for Series to ensure we get a 1-level MI back.

I think it'd be clearer to just do the counts = self.groupby(self.columns.tolist()).size() and then follow it up with a

if len(self.columns) == 1:
    counts.index = pd.MultiIndex.from_arrays([counts.index])
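
A runnable version of the suggestion above; value_counts_sketch is a hypothetical helper name, not pandas API:

```python
import pandas as pd

def value_counts_sketch(df):
    # Count unique rows; always return a Series indexed by a MultiIndex
    # with one level per column, regardless of the number of columns.
    counts = df.groupby(df.columns.tolist()).size()
    if len(df.columns) == 1:
        # groupby on a single column yields a flat Index, so wrap it to
        # keep the output shape independent of the input shape.
        counts.index = pd.MultiIndex.from_arrays(
            [counts.index], names=[df.columns[0]]
        )
    return counts

df1col = pd.DataFrame({"num_legs": [2, 4, 4]})
print(value_counts_sketch(df1col))
```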

Author

Done. This was leftover from the special casing of parameters for single columns.

raise NotImplementedError(
"`dropna=False` not yet supported for dataframes."
)
if bins is not None:
Member

Here we're saying bins isn't implemented at all, but according to the docstring and examples the parameter is implemented for single column DataFrames, so I think you'd need a special case here as well

Author

Done. This was leftover from the special casing of parameters for single columns.


.. versionadded:: 1.0.0

The returned Series will have a MultiIndex with one level per input
Contributor

We need to have a subset= argument (as the 1st arg) to define the columns to group on; the default would be all columns. This is in line with many other DataFrame methods.

"`bins` parameter not yet supported for dataframes."
)

counts = self.groupby(self.columns.tolist()).size()
Contributor

subset gets incorporated here; it must be list-like (or None)
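
One way the subset handling could look, sketched as a free function (value_counts_with_subset is a hypothetical name, not the PR's actual code):

```python
import pandas as pd

def value_counts_with_subset(df, subset=None, normalize=False):
    # Hypothetical sketch: subset names the columns to count on,
    # defaulting to all columns, in line with e.g. drop_duplicates.
    if subset is not None and not pd.api.types.is_list_like(subset):
        raise TypeError("subset must be list-like or None")
    columns = list(subset) if subset is not None else df.columns.tolist()
    counts = df.groupby(columns).size()
    if normalize:
        counts = counts / counts.sum()
    return counts

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
print(value_counts_with_subset(df, subset=["num_legs"]))
```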

counts /= counts.sum()
# Force MultiIndex index.
if len(self.columns) == 1:
counts.index = MultiIndex.from_arrays([counts.index])
Contributor

need to make sure the name of this is preserved (pass name= as well), though I think we infer this now, so just make sure to test
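
A quick check of the point above: groupby does infer the grouping column names as index level names, so the name survives without passing it explicitly:

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
counts = df.groupby(df.columns.tolist()).size()

# The grouping column names are carried over as index level names.
print(counts.index.names)
```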

@@ -2766,3 +2766,109 @@ def test_multiindex_column_lookup(self):
result = df.nlargest(3, ("x", "b"))
expected = df.iloc[[3, 2, 1]]
tm.assert_frame_equal(result, expected)

Contributor

go ahead and make a new file, test_value_counts.py and put in pandas/tests/frame/analytics/test_value_counts.py (we will split / move analytics later)
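
Following this request, the split-out tests might look roughly like the sketch below. It assumes the DataFrame.value_counts API that eventually shipped in pandas 1.1 (this PR itself was closed unmerged), and the file path is the reviewer's suggestion:

```python
# Sketch of pandas/tests/frame/analytics/test_value_counts.py, with one
# test per behavior as requested in review.
import pandas as pd


def test_value_counts_multi_column():
    df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
    result = df.value_counts()
    assert result[(4, 0)] == 2
    assert result[(2, 2)] == 1


def test_value_counts_normalize():
    df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
    result = df.value_counts(normalize=True)
    assert abs(result[(4, 0)] - 2 / 3) < 1e-9


def test_value_counts_subset():
    df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
    result = df.value_counts(subset=["num_legs"])
    assert result.max() == 2
    assert result.sum() == 3
```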

@Strilanc
Author

I think it's now been five cycles of doing what was asked only to be asked for more or for something different. Please state exactly what you want in full detail if you want me to continue putting effort into this PR. Are there any more arguments you're going to need? Test cases? State them.

@TomAugspurger
Contributor

@Strilanc we're defining a new API here, that can take some time as we think through all the design choices.

@Strilanc
Author

@TomAugspurger I understand. I'm asking you to do the thinking before I do the work.

@WillAyd
Member

WillAyd commented Oct 11, 2019

@Strilanc we of course try to be cognizant of contributor time, but reviews are an iterative process.

Are you still interested in working on this? If so, can you fix the merge conflict?

@WillAyd
Member

WillAyd commented Nov 7, 2019

Closing as stale, but @Strilanc ping if you'd like to pick this back up.

@WillAyd WillAyd closed this Nov 7, 2019
Labels
API Design DataFrame DataFrame data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: DataFrame.value_counts()
5 participants