Skip to content

Updated value_counts documentation and implementation and added single label subset test #50955

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 14 commits into from
Feb 20, 2023

Conversation

tpackard1
Copy link
Contributor

@tpackard1 tpackard1 commented Jan 24, 2023

@tpackard1
Copy link
Contributor Author

@mroeschke this was a first attempt. Should the definition of subset be worded like by in groupby or linked to it? And I went with the most straight forward implementation and test if either needs to be more thorough please lmk

@mroeschke
Copy link
Member

The doc change looks good. Appears the test you added is failing.

@mroeschke mroeschke added Testing pandas testing functions or related to the test suite Docs Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff labels Jan 24, 2023
Comment on lines 6935 to 6936
subset : mapping, function, label, list of labels, optional
Columns to use when counting unique combinations.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For mapping and function, I think we need to indicate how they are used. Saying "Columns to use when counting unique combinations" isn't correct in this case, right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you think the following would work?

Columns to use when counting unique combinations. If subset is a function, it's called on each value of the object's index. If a mapping is passed, the Series or dict VALUES will be used to determine the columns or groups (the Series' values are first aligned; see align() method).

This was derived from the by definition in the groupby docs page.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If subset does not support len (in particular, functions), then it appears to me the current implementation will raise. I would also recommend not supporting functions. It may give our users whiplash to have a method that does value_counts on the columns all of a sudden doing it on the index 😆 !

I find the mapping argument case quite odd. I don't think I'd expect df.value_counts(subset={'a': 1, 'b': 2}) to work, but if I must choose a behavior, then my first guess would be that it behaves the same as df.value_counts(subset=list({'a': 1, 'b': 2})) which would be using subset on the keys rather than values. I'm not strongly opposed to including mapping though, and if we do I think what you have looks good.

For supporting single-label arguments, the implementation where we check len(subset) needs to be improved.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rhshadrach I agree with you and will make it just say subset : label, list of labels, optional and then I think the saying "Column(s) to use when counting unique combinations" would be appropriate.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Doc changes look good! It looks like some accidental changes are being made - I've commented below.

On L7036 we check len(subset), but subset can now be strings or integers. I think this should be changed to

if is_list_like(subset) and len(subset) == 1:

Comment on lines 3871 to 3907
if not PYPY and using_copy_on_write():
if sys.getrefcount(self) <= 3:
raise ChainedAssignmentError(_chained_assignment_msg)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this changing?

Copy link
Contributor Author

@tpackard1 tpackard1 Jan 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did an upstream merge main in to my branch thinking it would be best to have my branch up to date with main so I will remove these changes.

@@ -6734,7 +6724,7 @@ def sort_values(
else:
return self.copy(deep=None)

if is_range_indexer(indexer, len(indexer)):
if array_equal_fast(indexer, np.arange(0, len(indexer), dtype=indexer.dtype)):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

Comment on lines 10880 to 10884
# error: Argument "qs" to "quantile" of "BlockManager" has incompatible type
# "Index"; expected "Float64Index"
res = data._mgr.quantile(
qs=q, axis=1, interpolation=interpolation # type: ignore[arg-type]
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same

@@ -144,3 +144,20 @@ def test_data_frame_value_counts_dropna_false(nulls_fixture):
)

tm.assert_series_equal(result, expected)


def test_data_frame_value_counts_subset(nulls_fixture):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you add a case where the column labels are integers.

df = pd.DataFrame(
{100: [2, 100, 5, 9], 200: [2, 6, 2, 6], 300: [4, 6, 2, 1]},
)
result = df.value_counts([200])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think lists already have test coverage, my earlier request was to test df.value_counts(200). This will require the fix for the implementation I mentioned in #50955.

Also, instead of duplicating the code for the tests, can you parameterize this; e.g. add

@pytest.mark.parametrize("columns", (["first_name", "middle_name"], [0, 1]))

to the top of the test and then use columns[0] and columns[1] in the original test where appropriate.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My bad. I'll be more than happy to work on this.

@tpackard1
Copy link
Contributor Author

tpackard1 commented Feb 1, 2023

Not sure why the following in Code Checks / Docstring validation, typing, and other manual pre-commit hooks (pull_request) is failing:

 Error: /home/runner/work/pandas/pandas/pandas/core/indexes/base.py:285:EX02:pandas.Index:Examples do not pass tests:
**********************************************************************
Line 49, in pandas.Index
Failed example:
    pd.Index([1, 2, 3], dtype="uint8")
Expected:
    NumericIndex([1, 2, 3], dtype='uint8')
Got:
    Index([1, 2, 3], dtype='uint8')

Partially validate docstrings (EX02) DONE
Error: Process completed with exit code 1.

Is this a conflict with a change that has been merged into the main branch since I have opened the pull request?

@rhshadrach
Copy link
Member

Looks like this may have been fixed, can you merge main & resolve conflicts to check.

@@ -7057,7 +7065,7 @@ def value_counts(
counts /= counts.sum()

# Force MultiIndex for single column
if len(subset) == 1:
if is_list_like(subset) and len(subset) == 1:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mroeschke - this will make value_counts return a MultiIndex for any list-like (one element or not) but an Index (non-multi) for a single label. Wanted a 2nd opinion to make sure that is desirable behavior.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that's reasonable behavior. Might be good to document that return difference

@rhshadrach
Copy link
Member

@tpackard1 - failures look network related; likely a hiccup. Can you try merging main again.

@rhshadrach
Copy link
Member

Thanks @tpackard1 - I'm now seeing the failed test:

pandas/tests/frame/methods/test_value_counts.py::test_data_frame_value_counts_subset

@tpackard1
Copy link
Contributor Author

Hopefully everything should be good now. @rhshadrach should I go ahead and update the docs (and possibly the whatsnew?) that @mroeschke mentioned about the return difference or would it be best to open a separate PR for that?

@tpackard1 tpackard1 requested a review from rhshadrach February 8, 2023 14:29
@rhshadrach
Copy link
Member

rhshadrach commented Feb 9, 2023

@tpackard1 - yes, updating the docs here would be good. I think a line in the whatsnew about not returning a MultiIndex for a single label would be good.

Copy link
Member

@rhshadrach rhshadrach left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

@rhshadrach rhshadrach added this to the 2.0 milestone Feb 20, 2023
@rhshadrach rhshadrach merged commit 9a7bfe6 into pandas-dev:main Feb 20, 2023
@rhshadrach
Copy link
Member

Thanks @tpackard1!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Docs Testing pandas testing functions or related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DOC: DataFrame.count_values supports single label
3 participants