Updated value_counts documentation and implementation and added single label subset test #50955

tpackard1 · 2023-01-24T05:45:21Z

closes DOC: DataFrame.count_values supports single label #50829
Tests added and passed.
All code checks passed.

tpackard1 · 2023-01-24T05:51:15Z

@mroeschke this was a first attempt. Should the definition of subset be worded like by in groupby, or linked to it? I went with the most straightforward implementation and test; if either needs to be more thorough, please let me know.

mroeschke · 2023-01-24T18:15:05Z

The doc change looks good. Appears the test you added is failing.

rhshadrach · 2023-01-24T22:34:22Z

pandas/core/frame.py

+        subset : mapping, function, label, list of labels, optional
            Columns to use when counting unique combinations.


For mapping and function, I think we need to indicate how they are used. Saying "Columns to use when counting unique combinations" isn't correct in this case, right?

Do you think the following would work?

Columns to use when counting unique combinations. If subset is a function, it's called on each value of the object's index. If a mapping is passed, the Series or dict VALUES will be used to determine the columns or groups (the Series' values are first aligned; see align() method).

This was derived from the by definition in the groupby docs page.

If subset does not support len (in particular, functions), then it appears to me the current implementation will raise. I would also recommend not supporting functions. It may give our users whiplash to have a method that does value_counts on the columns all of a sudden doing it on the index 😆 !

I find the mapping argument case quite odd. I don't think I'd expect df.value_counts(subset={'a': 1, 'b': 2}) to work, but if I must choose a behavior, then my first guess would be that it behaves the same as df.value_counts(subset=list({'a': 1, 'b': 2})) which would be using subset on the keys rather than values. I'm not strongly opposed to including mapping though, and if we do I think what you have looks good.
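To illustrate the keys-versus-values point, here is a minimal sketch in plain Python using the dict from the comment above. The variable names are only for illustration; nothing here calls pandas.

```python
# Converting a mapping to a list iterates over its keys, not its values.
subset = {"a": 1, "b": 2}

keys_as_subset = list(subset)             # what list(...) would pass along
values_as_subset = list(subset.values())  # what the groupby-derived wording would use

print(keys_as_subset)    # ['a', 'b']
print(values_as_subset)  # [1, 2]
```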

For supporting single-label arguments, the implementation where we check len(subset) needs to be improved.

@rhshadrach I agree with you and will change it to just subset : label, list of labels, optional. With that, I think "Column(s) to use when counting unique combinations" would be appropriate.

rhshadrach

Doc changes look good! It looks like some accidental changes are being made - I've commented below.

On L7036 we check len(subset), but subset can now be strings or integers. I think this should be changed to

if is_list_like(subset) and len(subset) == 1:
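A small sketch of why the guard matters. The helper name is_single_element_list is hypothetical and only wraps the suggested condition; is_list_like is imported from pandas.api.types here, which is an assumption about where a reader outside the pandas codebase would get it.

```python
from pandas.api.types import is_list_like


def is_single_element_list(subset):
    """Hypothetical helper: True only for list-likes holding exactly one label."""
    # is_list_like is False for strings and integers, so short-circuit
    # evaluation means len() is never called on a scalar label.
    return is_list_like(subset) and len(subset) == 1


for candidate in (["a"], ["a", "b"], "a", 200):
    print(repr(candidate), is_single_element_list(candidate))

# Without the guard, an integer label fails outright, and a one-character
# string such as "a" would be mistaken for a one-element list.
try:
    len(200)
except TypeError as exc:
    print("len(200) raised:", exc)
```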

rhshadrach · 2023-01-29T12:27:35Z

pandas/core/frame.py

-        if not PYPY and using_copy_on_write():
-            if sys.getrefcount(self) <= 3:
-                raise ChainedAssignmentError(_chained_assignment_msg)
-


Why is this changing?

I merged upstream main into my branch, thinking it would be best to keep the branch up to date with main. I will remove these changes.

rhshadrach · 2023-01-29T12:27:42Z

pandas/core/frame.py

@@ -6734,7 +6724,7 @@ def sort_values(
            else:
                return self.copy(deep=None)

-        if is_range_indexer(indexer, len(indexer)):
+        if array_equal_fast(indexer, np.arange(0, len(indexer), dtype=indexer.dtype)):


rhshadrach · 2023-01-29T12:28:01Z

pandas/core/frame.py

+            # error: Argument "qs" to "quantile" of "BlockManager" has incompatible type
+            # "Index"; expected "Float64Index"
+            res = data._mgr.quantile(
+                qs=q, axis=1, interpolation=interpolation  # type: ignore[arg-type]
+            )


rhshadrach · 2023-01-29T12:38:11Z

pandas/tests/frame/methods/test_value_counts.py

@@ -144,3 +144,20 @@ def test_data_frame_value_counts_dropna_false(nulls_fixture):
    )

    tm.assert_series_equal(result, expected)
+
+
+def test_data_frame_value_counts_subset(nulls_fixture):


Can you add a case where the column labels are integers?

rhshadrach · 2023-01-31T04:32:20Z

pandas/tests/frame/methods/test_value_counts.py

+    df = pd.DataFrame(
+        {100: [2, 100, 5, 9], 200: [2, 6, 2, 6], 300: [4, 6, 2, 1]},
+    )
+    result = df.value_counts([200])


I think lists already have test coverage; my earlier request was to test df.value_counts(200). This will require the implementation fix I mentioned in #50955.

Also, instead of duplicating the code for the tests, can you parameterize this; e.g. add

@pytest.mark.parametrize("columns", (["first_name", "middle_name"], [0, 1]))

to the top of the test and then use columns[0] and columns[1] in the original test where appropriate.
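A rough sketch of what such a parametrized test could look like. The test name and the data are illustrative rather than the test that was merged, and it assumes a pandas version (2.0 or later) in which value_counts accepts a single label.

```python
import pandas as pd
import pytest


@pytest.mark.parametrize("columns", (["first_name", "middle_name"], [0, 1]))
def test_value_counts_single_label_sketch(columns):
    # Build the frame from the parametrized labels so the same body covers
    # both string and integer column names.
    df = pd.DataFrame(
        {
            columns[0]: ["John", "Anne", "John", "Beth"],
            columns[1]: ["Smith", "Lee", "Lee", "Louise"],
        }
    )
    result = df.value_counts(columns[0])  # single label, not a list

    # Look counts up by label so the check does not depend on tie ordering.
    assert result["John"] == 2
    assert result["Anne"] == 1
    assert result["Beth"] == 1
    assert result.index.name == columns[0]
```

Running the file with pytest executes the body once per entry in the parametrize tuple.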

My bad. I'll be more than happy to work on this.

tpackard1 · 2023-02-01T19:59:31Z

Not sure why the following in Code Checks / Docstring validation, typing, and other manual pre-commit hooks (pull_request) is failing:

 Error: /home/runner/work/pandas/pandas/pandas/core/indexes/base.py:285:EX02:pandas.Index:Examples do not pass tests:
**********************************************************************
Line 49, in pandas.Index
Failed example:
    pd.Index([1, 2, 3], dtype="uint8")
Expected:
    NumericIndex([1, 2, 3], dtype='uint8')
Got:
    Index([1, 2, 3], dtype='uint8')

Partially validate docstrings (EX02) DONE
Error: Process completed with exit code 1.

Is this a conflict with a change that has been merged into the main branch since I opened the pull request?

rhshadrach · 2023-02-01T22:07:25Z

Looks like this may have been fixed. Can you merge main and resolve conflicts to check?

rhshadrach · 2023-02-06T22:15:47Z

pandas/core/frame.py

@@ -7057,7 +7065,7 @@ def value_counts(
            counts /= counts.sum()

        # Force MultiIndex for single column
-        if len(subset) == 1:
+        if is_list_like(subset) and len(subset) == 1:


@mroeschke - this will make value_counts return a MultiIndex for any list-like (one element or not) but a regular (non-multi) Index for a single label. I wanted a second opinion to make sure that is the desired behavior.

I think that's reasonable behavior. It might be good to document that return difference.
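The return difference can be sketched as follows, reusing the integer-labelled frame from the test above. This assumes a pandas version that includes this change (2.0 or later); the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({100: [2, 100, 5, 9], 200: [2, 6, 2, 6], 300: [4, 6, 2, 1]})

by_label = df.value_counts(200)   # single label
by_list = df.value_counts([200])  # one-element list

# A single label gives a flat Index; a list-like gives a MultiIndex,
# even when it holds only one column.
print(isinstance(by_label.index, pd.MultiIndex))
print(isinstance(by_list.index, pd.MultiIndex))
print(by_list.index.nlevels)
```

The counts themselves are the same in both cases; only the index type of the result differs.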

rhshadrach · 2023-02-06T22:16:26Z

@tpackard1 - failures look network related; likely a hiccup. Can you try merging main again?

rhshadrach · 2023-02-08T04:06:41Z

Thanks @tpackard1 - I'm now seeing the failed test:

pandas/tests/frame/methods/test_value_counts.py::test_data_frame_value_counts_subset

tpackard1 · 2023-02-08T05:46:14Z

Hopefully everything should be good now. @rhshadrach should I go ahead and update the docs (and possibly the whatsnew?) that @mroeschke mentioned about the return difference or would it be best to open a separate PR for that?

rhshadrach · 2023-02-09T04:21:54Z

@tpackard1 - yes, updating the docs here would be good. I think a line in the whatsnew about not returning a MultiIndex for a single label would be good.

rhshadrach

lgtm

rhshadrach · 2023-02-20T03:27:47Z

Thanks @tpackard1!

mroeschke added the Testing, Docs, and Algos labels Jan 24, 2023

rhshadrach requested changes Jan 24, 2023

rhshadrach requested changes Jan 29, 2023

tpackard1 added 6 commits January 29, 2023 20:37

first value_counts commit

44c31db

updated test_data_frame_value_counts_subset

c20e994

updated subset docstring

da41988

Back to pre updated subset docstring commit

efae1bf

added integer column label test

245526b

updated value_counts subset test

e5f96b2

tpackard1 force-pushed the update-value_counts branch from a989f51 to e5f96b2 Compare January 30, 2023 04:46

tpackard1 requested a review from rhshadrach January 30, 2023 14:49

rhshadrach requested changes Jan 31, 2023

parameterized tests and fixed implimentation for integers

80d0576

resolved conflict issues

b90811f

rhshadrach reviewed Feb 6, 2023

tpackard1 added 3 commits February 6, 2023 22:17

Merge remote-tracking branch 'upstream/main' into update-value_counts

ea3677e

fixed docstring

ff87dc8

Merge remote-tracking branch 'upstream/main' into update-value_counts

343a701

fixed test_data_frame_value_counts_subset

a7f61c5

tpackard1 requested a review from rhshadrach February 8, 2023 14:29

tpackard1 added 2 commits February 18, 2023 12:58

Merge remote-tracking branch 'upstream/main' into update-value_counts

abed50a

updated whatsnew and docs

c3adbc9

rhshadrach approved these changes Feb 20, 2023

rhshadrach added this to the 2.0 milestone Feb 20, 2023

rhshadrach merged commit 9a7bfe6 into pandas-dev:main Feb 20, 2023
