-
-
Notifications
You must be signed in to change notification settings - Fork 18.6k
Implement DataFrame.value_counts #27350
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
…ts() This change makes it easy to count the number of times each unique row appears in a data frame. It also removes an unnecessary difference between DataFrame and Series (i.e. the existence of the value_counts methods).
A reasonable alternative may be Mostly I just want there to be a simple way to count the rows by value. |
I think this would close #5377 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will have to review the semantics at first glance this is not very obvious
Are the test failures actually related to the change I'm making? They seem to be in unrelated places. |
Yea looks like you need to run |
I ran black. It appears to have improved the situation, but the remaining failures are still hidden away amongst tens of thousands of line of console output in travis-ci. |
Failure is at https://travis-ci.org/pandas-dev/pandas/jobs/559227356#L2861,
which doesn't look related. Maybe try merging master and pushing again.
…On Tue, Jul 16, 2019 at 10:04 AM Craig Gidney ***@***.***> wrote:
I ran black. It appears to have improved the situation, but the remaining
failures are still hidden away amongst tens of thousands of line of console
output in travis-ci.
—
You are receiving this because you are subscribed to this thread.
Reply to this email directly, view it on GitHub
<#27350?email_source=notifications&email_token=AAKAOIXOKXMDHJNN7SKNEETP7XPO5A5CNFSM4IBXSGW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2BE2QY#issuecomment-511855939>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOISJHYI46746RZ3MXKLP7XPO5ANCNFSM4IBXSGWQ>
.
|
Merging to master didn't fix it |
Ah, Dask uses duck-typing to check whether an object is Dataframe like:
https://github.com/dask/dask/blob/4b2dbd4469423cf313ec4e9fc6967b606493a3b1/dask/utils.py#L1073
One of the checks is the lack of a DataFrame.value_counts. Do you have any
interest in opening an issue on the Dask tracker to find an alternative
implementation? You can skip the
test here for Dask <= 2.1.0 (the current version of dask).
…On Tue, Jul 16, 2019 at 8:54 PM Craig Gidney ***@***.***> wrote:
Merging to master didn't fix it
—
You are receiving this because you commented.
Reply to this email directly, view it on GitHub
<#27350?email_source=notifications&email_token=AAKAOIWFIB5FUDSQO5N6W3LP7Z3WJA5CNFSM4IBXSGW2YY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOD2CY4QY#issuecomment-512069187>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAKAOITDS6YSWECS2EUIXX3P7Z3WJANCNFSM4IBXSGWQ>
.
|
I opened dask/dask#5109 on the dask repo |
This is passing now that Dask has been updated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a whatsnew for v1.0.0?
Added a whatsnew |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this looks good - @jreback thoughts?
pandas/core/frame.py
Outdated
""" | ||
The number of times each unique row appears in the DataFrame. | ||
|
||
Rows that contain any NaN value are omitted from the results. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
show a versionadded tag
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What version should I enter for the tag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.0.0
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/frame.py
Outdated
4 0 2 | ||
dtype: int64 | ||
|
||
>>> df1col = df[['num_legs']] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the 2nd example is showing how this works for a Series?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
>>> type(df[['num_legs']])
pandas.core.frame.DataFrame
|
||
See Also | ||
-------- | ||
Series.value_counts: Equivalent method on Series. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we have more options on the Series.value_counts, dropna for example these need to be implemented
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's no option in group_by to not drop rows containing a NaN. How do I go about implementing that case?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would be OK with raising a NotImplementedError for that case
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added. This changed the method pretty significantly. PTAL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The single-column case now works, but the code raises NotImplementedError
for the multi-column case.
pandas/tests/frame/test_analytics.py
Outdated
{"num_legs": [2, 4, 4], "num_wings": [2, 0, 0]}, | ||
index=["falcon", "dog", "cat"], | ||
) | ||
actual = df.value_counts() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
use result rather than actual
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
pandas/core/frame.py
Outdated
Parameters | ||
---------- | ||
normalize : boolean, default False | ||
If True then the object returned will contain the relative |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"object" -> "Series"
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/frame.py
Outdated
dropna=dropna, | ||
) | ||
# Move series name into its index, as happens in multi-column case. | ||
return Series(data=series.values, index=series.index.set_names(series.name)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this a MultiIndex? I think this method should always return a Series with a MultiIndex, even if it has one level.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/tests/frame/test_analytics.py
Outdated
@@ -2766,3 +2766,84 @@ def test_multiindex_column_lookup(self): | |||
result = df.nlargest(3, ("x", "b")) | |||
expected = df.iloc[[3, 2, 1]] | |||
tm.assert_frame_equal(result, expected) | |||
|
|||
def test_data_frame_value_counts(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you split up this test? Roughly one test per "thing" you're testing (single column, raising for unsupported keyword, etc.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks quite close. Just one request to avoid special casing frames with 1 column. Though perhaps wait to hear from other maintainers before changing anything, if you disagree.
pandas/core/frame.py
Outdated
), | ||
) | ||
|
||
# Some features are only supported for single-column data. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think these checks should be done first.
I'd rather have
df = pd.DataFrame({"A": [1, 2], "B": [3, 4]})
df.value_counts(dropna=False)
df[['A']].value_counts(dropna=False)
both raise. That way, we don't have the behavior depending on the shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. I had to keep the special casing of single columns since otherwise you get an index that's not a multiindex.
pandas/core/frame.py
Outdated
@@ -90,7 +90,7 @@ | |||
from pandas.core.index import Index, ensure_index, ensure_index_from_sequences | |||
from pandas.core.indexes import base as ibase | |||
from pandas.core.indexes.datetimes import DatetimeIndex | |||
from pandas.core.indexes.multi import maybe_droplevels | |||
from pandas.core.indexes.multi import maybe_droplevels, MultiIndex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MultiIndex
should go before maybe_droplevels
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
pandas/core/frame.py
Outdated
"`bins` parameter not yet supported for dataframes." | ||
) | ||
|
||
# Delegate to Series.value_counts for single-column data frames. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hmm @TomAugspurger this is the special casing you were referring to right? Sorry kind of tough to tell from history at this point.
Is there a particular reason why we would want to do this? I think the type of output should match input
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the problem was groupby returns a regular Index when you're grouping by a single column? And in all other cases you get a MultiIndex. So we need some kind of special condition for Series to ensure we get a 1-level MI back.
I think it'd be clearer to just do the counts = self.groupby(self.columns.tolist()).size()
and then follow it up with a
if len(self.columns) == 1:
counts.index = pd.MultiIndex.from_arrays([counts.index])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. This was leftover from the special casing of parameters for single columns.
raise NotImplementedError( | ||
"`dropna=False` not yet supported for dataframes." | ||
) | ||
if bins is not None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here we're saying bins
isn't implemented at all, but according to the docstring and examples the parameter is implemented for single column DataFrames
, so I think you'd need a special case here as well
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. This was leftover from the special casing of parameters for single columns.
|
||
.. versionadded:: 1.0.0 | ||
|
||
The returned Series will have a MultiIndex with one level per input |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we need to have a subset=
argument (as the 1st arg) to define the columns to group on, default would be all; this is in-line with many other DataFrame methods
"`bins` parameter not yet supported for dataframes." | ||
) | ||
|
||
counts = self.groupby(self.columns.tolist()).size() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
subset get's incorporate here, it must be a list-like (or None)
counts /= counts.sum() | ||
# Force MultiIndex index. | ||
if len(self.columns) == 1: | ||
counts.index = MultiIndex.from_arrays([counts.index]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
need to make sure the name of this is preserved (pass name= as well), though I think we infer this now, so just make sure to test
@@ -2766,3 +2766,109 @@ def test_multiindex_column_lookup(self): | |||
result = df.nlargest(3, ("x", "b")) | |||
expected = df.iloc[[3, 2, 1]] | |||
tm.assert_frame_equal(result, expected) | |||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
go ahead and make a new file, test_value_counts.py and put in pandas/tests/frame/analytics/test_value_counts.py (we will split / move analytics later)
I think it's now been five cycles of doing what was asked only to be asked for more or for something different. Please state exactly what you want in full detail if you want me to continue putting effort into this PR. Are there any more arguments you're going to need? Test cases? State them. |
@Strilanc we're defining a new API here, that can take some time as we think through all the design choices. |
@TomAugspurger I understand. I'm asking you to the thinking before I do the work. |
@Strilanc we of course try to be cognizant of contributor time but reviews are an iterative process. Are you still interested in working this? If so can you fix merge conflict? |
Closing as stale but @Strilanc ping if you'd like to pick back up |
closes #5377
(This is a feature request stated in the form of a PR.)
This change makes it easy to count the number of times each unique row appears in a data frame. It also removes an unnecessary difference between DataFrame and Series (i.e. the existence of the
value_counts
methods).black pandas
git diff upstream/master -u -- "*.py" | flake8 --diff