API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376
Current coverage is 84.77% (diff: 75.00%)
@@             master   #14376   diff @@
==========================================
  Files           145      145
  Lines         51090    51133    +43
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          43315    43347    +32
- Misses         7775     7786    +11
  Partials          0        0
these are not going to be very efficient. Pls add some benchmarks.
I could not get the benchmarks to run with There is no 'fast-path' in the implementation of
can you rebase and post some asv numbers
OK, rebased. None of the proposed methods to run
----------
dropna : boolean, default True
    Don't include NaN in the counts.
"""
can you add a Returns and Examples sections
dropna : boolean, default True
    Don't include NaN in the counts.

Returns
can you add Examples section
'B': list('abxacc'),
'C': list('abbacx'),
})
expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})
test with both as_index=True and False
})
expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})
result = df.groupby('A', as_index=False).nunique()
tm.assert_frame_equal(result, expected)
also can you test with dropna=True and False
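For illustration, the effect of `dropna` on a grouped `nunique()` could be checked roughly as follows. This is only a sketch, not the test that went into the PR: the frame and the names `counts_dropped` and `counts_kept` are made up, and only column `B` is compared so the check does not depend on whether the grouping column appears in the result.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'b' mixes NaN with a value, group 'c' holds only NaN.
df = pd.DataFrame({'A': list('abbacc'),
                   'B': ['x', np.nan, 'y', 'x', np.nan, np.nan]})

# dropna=True (the default) ignores NaN when counting distinct values.
counts_dropped = df.groupby('A').nunique(dropna=True)['B']

# dropna=False counts NaN as one additional distinct value.
counts_kept = df.groupby('A').nunique(dropna=False)['B']

print(counts_dropped.tolist())
print(counts_kept.tolist())
```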
@xflr6 ok lgtm. just some added tests / docs. ping on green. don't worry about the benchmarks.
Thanks, extended the docs and tests, CI passed.
@xflr6 Nice enhancement! I added some more comments.
Examples
--------
>>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})
DataFrame -> pd.DataFrame
Changed (note that there are still some other example sections with bare DataFrame).
yep, but we try to have the standards for new code a bit higher :-)
Examples
--------
>>> df = DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
DataFrame -> pd.DataFrame
ham 1 1 2
spam 1 2 1

>>> df.groupby('id').filter(lambda g: (g.nunique() > 1).any())
This is a nice example (although more of the filter method and dataframe.nunique), but can you add one sentence introducing it? (explaining what we are going to do in the next example)
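As a sketch of what the filter example under discussion does: it keeps the rows of every group in which at least one column has more than one distinct value. The `value1`/`value2` columns and their contents below are assumptions for illustration (the docstring's actual columns are cut off in the diff context above), as is the name `varying`.

```python
import pandas as pd

# Hypothetical data: column names and values other than 'id' are assumptions.
df = pd.DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],
                   'value1': [1, 5, 5, 2, 5, 5],
                   'value2': list('abbaxy')})

# g.nunique() is DataFrame.nunique() on each group's sub-frame; a group is
# kept when any of its columns holds more than one distinct value.
varying = df.groupby('id').filter(lambda g: (g.nunique() > 1).any())

print(varying)
```

With this data the 'egg' rows are dropped because both of their value columns are constant within the group.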
""" | ||
from functools import partial
func = partial(Series.nunique, dropna=dropna)
return self.apply(lambda g: g.apply(func))
Is the use of functools.partial actually needed here? Extra keywords like dropna passed to apply will normally be passed through to the func.
TIL about kwargs-passing in apply(), simplified both nunique-methods with that.
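A minimal sketch of the point made in this exchange, assuming a small made-up frame: `DataFrame.apply` forwards extra keyword arguments to the applied function, so wrapping `Series.nunique` in `functools.partial` gives the same result as passing `dropna` directly. The names `with_partial` and `with_kwargs` are illustrative only.

```python
from functools import partial

import pandas as pd

# Hypothetical data with one missing value per column.
df = pd.DataFrame({'A': [1.0, None, 1.0], 'B': [1.0, 2.0, None]})

# Variant 1: bind dropna up front with functools.partial.
with_partial = df.apply(partial(pd.Series.nunique, dropna=False))

# Variant 2: let apply() pass the keyword through to Series.nunique.
with_kwargs = df.apply(pd.Series.nunique, dropna=False)

print(with_kwargs)
```

Both variants count NaN as a distinct value here, so the second form is simply the shorter way to write the first.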
f = lambda s: len(algorithms.unique1d(s.dropna()))
self._check_stat_op('nunique', f, has_skipna=False,
                    check_dtype=False, check_dates=True)
Can you add tests for the dropna and axis keywords as well?
Parameters
----------
axis : {0 or 'index', 1 or 'columns'}, default 0
    0 or 'index' for row-wise, 1 or 'columns' for column-wise
row-wise / column-wise is probably used a lot in other methods here as well (can you check?), but I find it a bit confusing in this case, as I would interpret 'column-wise' as "distinct observations for each column". That is not correct here, because that is what the default axis=0/'index' does. So axis=1 is more 'over/along the columns'.
But English is not my mother tongue. What do you think?
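To make the two readings concrete, here is a small sketch using the frame from the docstring example (the names `per_column` and `per_row` are illustrative). With the default `axis=0` the result has one count per column; with `axis=1` it has one count per row, computed across the columns.

```python
import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})

# axis=0 (default): distinct values within each column, indexed by column name.
per_column = df.nunique()

# axis=1: distinct values within each row, indexed by row label.
per_row = df.nunique(axis=1)

print(per_column)
print(per_row)
```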
As is, the wording is consistent with all the other methods such as count(): Wouldn't it be better to have a dedicated PR for that, in case all the axis docstrings are to be improved?
@xflr6 yes, we could have a PR for improving these kinds of things in general (we already use shared_docs for this type of thing anyhow), so these are pretty general. That is not the case here, as .nunique has separate doc-strings for Series/DataFrame, which is why @jorisvandenbossche is asking.
I'm ok with actually fixing that (so this would hook into our more general doc-string system).
actually Series.nunique is defined in pandas.core.base (so it's the same for Index). But these could easily hook into the same doc-string system as I said above.
Note that it is not fully consistent throughout frame.py either. There are also some methods that explain this differently (e.g. apply, mode; corrwith actually switches the row- and column-wise ("0 or 'index' to compute column-wise, 1 or 'columns' for row-wise")).
The thing is also that the explanation can be different depending on what the function does I think.
@xflr6 can you make the doc-string references to the axis consistent with other methods, e.g. this example from another method:
axis : {0 or 'index', 1 or 'columns'}, default 0
    Sort index/rows versus columns
I think you can simply drop the 2nd line of the axis param.
lgtm. @jorisvandenbossche if you have any final comments.
thanks @xflr6
One remaining remark is that I think row-wise / column-wise is a wrong explanation in this case.
@jorisvandenbossche I took that out on the merge, FYI
closes pandas-dev#14336

Author: Sebastian Bank <[email protected]>

Closes pandas-dev#14376 from xflr6/nunique and squashes the following commits:

a0558e7 [Sebastian Bank] use apply()-kwargs instead of partial, more tests, better examples
c8d3ac4 [Sebastian Bank] extend docs and tests
fd0f22d [Sebastian Bank] add simple benchmarks
5c4b325 [Sebastian Bank] API: add DataFrame.nunique() and DataFrameGroupBy.nunique()
git diff upstream/master | flake8 --diff