-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
Updated value_counts documentation and implementation and added single label subset test #50955
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
@mroeschke this was a first attempt. Should the definition of subset be worded like by in groupby or linked to it? And I went with the most straight forward implementation and test if either needs to be more thorough please lmk |
The doc change looks good. Appears the test you added is failing. |
pandas/core/frame.py
Outdated
subset : mapping, function, label, list of labels, optional | ||
Columns to use when counting unique combinations. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For mapping and function, I think we need to indicate how they are used. Saying "Columns to use when counting unique combinations" isn't correct in this case, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you think the following would work?
Columns to use when counting unique combinations. If subset
is a function, it's called on each value of the object's index. If a mapping is passed, the Series or dict VALUES will be used to determine the columns or groups (the Series' values are first aligned; see align()
method).
This was derived from the by definition in the groupby docs page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If subset
does not support len
(in particular, functions), then it appears to me the current implementation will raise. I would also recommend not supporting functions. It may give our users whiplash to have a method that does value_counts on the columns all of a sudden doing it on the index 😆 !
I find the mapping argument case quite odd. I don't think I'd expect df.value_counts(subset={'a': 1, 'b': 2})
to work, but if I must choose a behavior, then my first guess would be that it behaves the same as df.value_counts(subset=list({'a': 1, 'b': 2}))
which would be using subset on the keys rather than values. I'm not strongly opposed to including mapping though, and if we do I think what you have looks good.
For supporting single-label arguments, the implementation where we check len(subset)
needs to be improved.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rhshadrach I agree with you and will make it just say subset : label, list of labels, optional
and then I think the saying "Column(s) to use when counting unique combinations" would be appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doc changes look good! It looks like some accidental changes are being made - I've commented below.
On L7036 we check len(subset)
, but subset can now be strings or integers. I think this should be changed to
if is_list_like(subset) and len(subset) == 1:
pandas/core/frame.py
Outdated
if not PYPY and using_copy_on_write(): | ||
if sys.getrefcount(self) <= 3: | ||
raise ChainedAssignmentError(_chained_assignment_msg) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why is this changing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did an upstream merge main in to my branch thinking it would be best to have my branch up to date with main so I will remove these changes.
pandas/core/frame.py
Outdated
@@ -6734,7 +6724,7 @@ def sort_values( | |||
else: | |||
return self.copy(deep=None) | |||
|
|||
if is_range_indexer(indexer, len(indexer)): | |||
if array_equal_fast(indexer, np.arange(0, len(indexer), dtype=indexer.dtype)): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
pandas/core/frame.py
Outdated
# error: Argument "qs" to "quantile" of "BlockManager" has incompatible type | ||
# "Index"; expected "Float64Index" | ||
res = data._mgr.quantile( | ||
qs=q, axis=1, interpolation=interpolation # type: ignore[arg-type] | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same
@@ -144,3 +144,20 @@ def test_data_frame_value_counts_dropna_false(nulls_fixture): | |||
) | |||
|
|||
tm.assert_series_equal(result, expected) | |||
|
|||
|
|||
def test_data_frame_value_counts_subset(nulls_fixture): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you add a case where the column labels are integers.
a989f51
to
e5f96b2
Compare
df = pd.DataFrame( | ||
{100: [2, 100, 5, 9], 200: [2, 6, 2, 6], 300: [4, 6, 2, 1]}, | ||
) | ||
result = df.value_counts([200]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think lists already have test coverage, my earlier request was to test df.value_counts(200)
. This will require the fix for the implementation I mentioned in #50955.
Also, instead of duplicating the code for the tests, can you parameterize this; e.g. add
@pytest.mark.parametrize("columns", (["first_name", "middle_name"], [0, 1]))
to the top of the test and then use columns[0]
and columns[1]
in the original test where appropriate.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My bad. I'll be more than happy to work on this.
Not sure why the following in Code Checks / Docstring validation, typing, and other manual pre-commit hooks (pull_request) is failing:
Is this a conflict with a change that has been merged into the main branch since I have opened the pull request? |
Looks like this may have been fixed, can you merge main & resolve conflicts to check. |
@@ -7057,7 +7065,7 @@ def value_counts( | |||
counts /= counts.sum() | |||
|
|||
# Force MultiIndex for single column | |||
if len(subset) == 1: | |||
if is_list_like(subset) and len(subset) == 1: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mroeschke - this will make value_counts return a MultiIndex for any list-like (one element or not) but an Index (non-multi) for a single label. Wanted a 2nd opinion to make sure that is desirable behavior.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that's reasonable behavior. Might be good to document that return difference
@tpackard1 - failures look network related; likely a hiccup. Can you try merging main again. |
Thanks @tpackard1 - I'm now seeing the failed test:
|
Hopefully everything should be good now. @rhshadrach should I go ahead and update the docs (and possibly the whatsnew?) that @mroeschke mentioned about the return difference or would it be best to open a separate PR for that? |
@tpackard1 - yes, updating the docs here would be good. I think a line in the whatsnew about not returning a MultiIndex for a single label would be good. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm
Thanks @tpackard1! |
Uh oh!
There was an error while loading. Please reload this page.