
Implement DataFrame.value_counts #27350


Closed
wants to merge 20 commits

Conversation


@Strilanc Strilanc commented Jul 11, 2019

closes #5377

(This is a feature request stated in the form of a PR.)

This change makes it easy to count the number of times each unique row appears in a data frame. It also removes an unnecessary difference between DataFrame and Series (i.e. the existence of the value_counts methods).

        >>> df = pd.DataFrame({'num_legs': [2, 4, 4], 'num_wings': [2, 0, 0]},
        ...                   index=['falcon', 'dog', 'cat'])
        >>> df
                num_legs  num_wings
        falcon         2          2
        dog            4          0
        cat            4          0

        >>> df.value_counts()
        (4, 0)    2
        (2, 2)    1
        dtype: int64

        >>> df1col = df[['num_legs']]
        >>> df1col
                num_legs
        falcon         2
        dog            4
        cat            4

        >>> df1col.value_counts()
        (4,)    2
        (2,)    1
        dtype: int64
  • tests added / passed
  • passes black pandas
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
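
The behavior shown above can be approximated with a one-liner over tuples of row values (the approach named in this PR's original title); a rough sketch, not the final implementation:

```python
import pandas as pd

df = pd.DataFrame(
    {"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
    index=["falcon", "dog", "cat"],
)

# Collapse each row to a tuple, then reuse Series.value_counts.
# This matches the output shown above, with tuples as the index.
row_counts = df.apply(tuple, axis=1).value_counts()
print(row_counts)
```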

@Strilanc
Author

Strilanc commented Jul 11, 2019

A reasonable alternative may be df.groupby(df.columns.tolist(),as_index=False).size() as used here: https://stackoverflow.com/questions/35584085/how-to-count-duplicate-rows-in-pandas-dataframe

Mostly I just want there to be a simple way to count the rows by value.
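
The groupby alternative above can be exercised directly; note that in recent pandas versions `groupby(...).size()` with `as_index=False` returns a DataFrame with a `size` column (older versions behaved differently):

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})

# One row per unique (num_legs, num_wings) combination, with its count.
out = df.groupby(df.columns.tolist(), as_index=False).size()
print(out)
```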

@WillAyd
Member

WillAyd commented Jul 11, 2019

I think this would close #5377

Member

@WillAyd WillAyd left a comment

Thanks for the PR

@WillAyd WillAyd added API Design DataFrame DataFrame data structure labels Jul 11, 2019
@Strilanc Strilanc changed the title DataFrame.value_counts = lambda self: self.apply(tuple, 1).value_counts() Implement DataFrame.value_counts Jul 12, 2019
Contributor

@jreback jreback left a comment

Will have to review the semantics; at first glance this is not very obvious.

@Strilanc
Author

Are the test failures actually related to the change I'm making? They seem to be in unrelated places.

@WillAyd
Member

WillAyd commented Jul 13, 2019

Yeah, it looks like you need to run black on your changes.

@Strilanc
Author

I ran black. It appears to have improved the situation, but the remaining failures are still hidden away amongst tens of thousands of lines of console output in travis-ci.

@TomAugspurger
Contributor

TomAugspurger commented Jul 16, 2019 via email

@Strilanc
Author

Merging to master didn't fix it

@TomAugspurger
Contributor

TomAugspurger commented Jul 17, 2019 via email

@Strilanc
Author

I opened dask/dask#5109 on the dask repo

@Strilanc
Author

This is passing now that Dask has been updated.

Member

@WillAyd WillAyd left a comment

Can you add a whatsnew for v1.0.0?

@WillAyd WillAyd added this to the 1.0 milestone Aug 26, 2019
@Strilanc
Author

Added a whatsnew

Member

@WillAyd WillAyd left a comment

I think this looks good - @jreback thoughts?

"""
The number of times each unique row appears in the DataFrame.

Rows that contain any NaN value are omitted from the results.
Contributor

show a versionadded tag

Author

What version should I enter for the tag?

Member

1.0.0

Author

Done.

4 0 2
dtype: int64

>>> df1col = df[['num_legs']]
Contributor

the 2nd example is showing how this works for a Series?

Author

>>> type(df[['num_legs']])
<class 'pandas.core.frame.DataFrame'>


See Also
--------
Series.value_counts: Equivalent method on Series.
Contributor

We have more options on Series.value_counts (dropna, for example); these need to be implemented.

Author

There's no option in groupby to not drop rows containing a NaN. How do I go about implementing that case?
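
For reference, pandas 1.1 later added a dropna parameter to groupby itself, which makes this case straightforward; it was not available at the time of this PR:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"num_legs": [2.0, 4.0, np.nan], "num_wings": [2, 0, 0]})

# pandas >= 1.1: dropna=False keeps the NaN-containing row as its own group;
# the default (dropna=True) silently drops it, which is the behavior
# discussed in this thread.
kept = df.groupby(df.columns.tolist(), dropna=False).size()
dropped = df.groupby(df.columns.tolist()).size()
print(kept.sum(), dropped.sum())
```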

Member

I would be OK with raising a NotImplementedError for that case

Author

Added. This changed the method pretty significantly. PTAL.

Author

The single-column case now works, but the code raises NotImplementedError for the multi-column case.

{"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]},
index=["falcon", "dog", "cat"],
)
actual = df.value_counts()
Contributor

use result rather than actual

Author

Done

Parameters
----------
normalize : boolean, default False
If True then the object returned will contain the relative
Contributor

"object" -> "Series"

Author

Done.

dropna=dropna,
)
# Move series name into its index, as happens in multi-column case.
return Series(data=series.values, index=series.index.set_names(series.name))
Contributor

Is this a MultiIndex? I think this method should always return a Series with a MultiIndex, even if it has one level.

Author

Done.

@@ -2766,3 +2766,84 @@ def test_multiindex_column_lookup(self):
result = df.nlargest(3, ("x", "b"))
expected = df.iloc[[3, 2, 1]]
tm.assert_frame_equal(result, expected)

def test_data_frame_value_counts(self):
Contributor

Can you split up this test? Roughly one test per "thing" you're testing (single column, raising for unsupported keyword, etc.)

Author

Done.

Contributor

@TomAugspurger TomAugspurger left a comment

Looks quite close. Just one request to avoid special casing frames with 1 column. Though perhaps wait to hear from other maintainers before changing anything, if you disagree.

),
)

# Some features are only supported for single-column data.
Contributor

I think these checks should be done first.

I'd rather have

df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df.value_counts(dropna=False)
df[['A']].value_counts(dropna=False)

both raise. That way, we don't have the behavior depending on the shape.

Author

Done. I had to keep the special casing of single columns since otherwise you get an index that's not a multiindex.

@@ -90,7 +90,7 @@
from pandas.core.index import Index, ensure_index, ensure_index_from_sequences
from pandas.core.indexes import base as ibase
from pandas.core.indexes.datetimes import DatetimeIndex
from pandas.core.indexes.multi import maybe_droplevels
from pandas.core.indexes.multi import maybe_droplevels, MultiIndex
Member

MultiIndex should go before maybe_droplevels

Author

Done.

"`bins` parameter not yet supported for dataframes."
)

# Delegate to Series.value_counts for single-column data frames.
Member

Hmm @TomAugspurger, this is the special casing you were referring to, right? Sorry, it's kind of tough to tell from the history at this point.

Is there a particular reason why we would want to do this? I think the type of output should match the input.

Contributor

I think the problem was groupby returns a regular Index when you're grouping by a single column? And in all other cases you get a MultiIndex. So we need some kind of special condition for Series to ensure we get a 1-level MI back.

I think it'd be clearer to just do the counts = self.groupby(self.columns.tolist()).size() and then follow it up with a

if len(self.columns) == 1:
    counts.index = pd.MultiIndex.from_arrays([counts.index])
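
A runnable version of the suggestion above; value_counts_sketch is a hypothetical helper name, not pandas API:

```python
import pandas as pd

def value_counts_sketch(df):
    # Count unique rows; always return a Series indexed by a MultiIndex
    # with one level per column, regardless of the number of columns.
    counts = df.groupby(df.columns.tolist()).size()
    if len(df.columns) == 1:
        # groupby on a single column yields a flat Index, so wrap it to
        # keep the output shape independent of the input shape.
        counts.index = pd.MultiIndex.from_arrays(
            [counts.index], names=[df.columns[0]]
        )
    return counts

df1col = pd.DataFrame({"num_legs": [2, 4, 4]})
print(value_counts_sketch(df1col))
```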

Author

Done. This was leftover from the special casing of parameters for single columns.

raise NotImplementedError(
"`dropna=False` not yet supported for dataframes."
)
if bins is not None:
Member

Here we're saying bins isn't implemented at all, but according to the docstring and examples the parameter is implemented for single column DataFrames, so I think you'd need a special case here as well

Author

Done. This was leftover from the special casing of parameters for single columns.


.. versionadded:: 1.0.0

The returned Series will have a MultiIndex with one level per input
Contributor

We need to have a subset= argument (as the 1st arg) to define the columns to group on; the default would be all columns. This is in line with many other DataFrame methods.

"`bins` parameter not yet supported for dataframes."
)

counts = self.groupby(self.columns.tolist()).size()
Contributor

subset gets incorporated here; it must be list-like (or None)
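
One way the subset handling could look, sketched as a free function (value_counts_with_subset is a hypothetical name, not the PR's actual code):

```python
import pandas as pd

def value_counts_with_subset(df, subset=None, normalize=False):
    # Hypothetical sketch: subset names the columns to count on,
    # defaulting to all columns, in line with e.g. drop_duplicates.
    if subset is not None and not pd.api.types.is_list_like(subset):
        raise TypeError("subset must be list-like or None")
    columns = list(subset) if subset is not None else df.columns.tolist()
    counts = df.groupby(columns).size()
    if normalize:
        counts = counts / counts.sum()
    return counts

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
print(value_counts_with_subset(df, subset=["num_legs"]))
```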

counts /= counts.sum()
# Force MultiIndex index.
if len(self.columns) == 1:
counts.index = MultiIndex.from_arrays([counts.index])
Contributor

need to make sure the name of this is preserved (pass name= as well), though I think we infer this now, so just make sure to test
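
A quick check of the point above: groupby does infer the grouping column names as index level names, so the name survives without passing it explicitly:

```python
import pandas as pd

df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
counts = df.groupby(df.columns.tolist()).size()

# The grouping column names are carried over as index level names.
print(counts.index.names)
```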

@@ -2766,3 +2766,109 @@ def test_multiindex_column_lookup(self):
result = df.nlargest(3, ("x", "b"))
expected = df.iloc[[3, 2, 1]]
tm.assert_frame_equal(result, expected)

Contributor

go ahead and make a new file, test_value_counts.py and put in pandas/tests/frame/analytics/test_value_counts.py (we will split / move analytics later)
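
Following this request, the split-out tests might look roughly like the sketch below. It assumes the DataFrame.value_counts API that eventually shipped in pandas 1.1 (this PR itself was closed unmerged), and the file path is the reviewer's suggestion:

```python
# Sketch of pandas/tests/frame/analytics/test_value_counts.py, with one
# test per behavior as requested in review.
import pandas as pd


def test_value_counts_multi_column():
    df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
    result = df.value_counts()
    assert result[(4, 0)] == 2
    assert result[(2, 2)] == 1


def test_value_counts_normalize():
    df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
    result = df.value_counts(normalize=True)
    assert abs(result[(4, 0)] - 2 / 3) < 1e-9


def test_value_counts_subset():
    df = pd.DataFrame({"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]})
    result = df.value_counts(subset=["num_legs"])
    assert result.max() == 2
    assert result.sum() == 3
```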

@Strilanc
Author

I think it's now been five cycles of doing what was asked only to be asked for more or for something different. Please state exactly what you want in full detail if you want me to continue putting effort into this PR. Are there any more arguments you're going to need? Test cases? State them.

@TomAugspurger
Contributor

@Strilanc we're defining a new API here, that can take some time as we think through all the design choices.

@Strilanc
Author

@TomAugspurger I understand. I'm asking you to do the thinking before I do the work.

@WillAyd
Member

WillAyd commented Oct 11, 2019

@Strilanc we of course try to be cognizant of contributor time, but reviews are an iterative process.

Are you still interested in working on this? If so, can you fix the merge conflict?

@WillAyd
Member

WillAyd commented Nov 7, 2019

Closing as stale, but @Strilanc ping if you'd like to pick this back up.

@WillAyd WillAyd closed this Nov 7, 2019
Labels
API Design DataFrame DataFrame data structure
Projects
None yet
Development

Successfully merging this pull request may close these issues.

ENH: DataFrame.value_counts()
5 participants