API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376

xflr6 · 2016-10-07T19:09:05Z

closes API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14336
tests added / passed
passes git diff upstream/master | flake8 --diff
whatsnew entry

codecov-io · 2016-10-07T19:55:42Z

Current coverage is 84.77% (diff: 75.00%)

Merging #14376 into master will decrease coverage by <.01%

@@             master     #14376   diff @@
==========================================
  Files           145        145          
  Lines         51090      51133    +43   
  Methods           0          0          
  Messages          0          0          
  Branches          0          0          
==========================================
+ Hits          43315      43347    +32   
- Misses         7775       7786    +11   
  Partials          0          0

Powered by Codecov. Last update 3e3434b...a0558e7

jreback · 2016-10-07T22:49:16Z

these are not going to be very efficient. Pls add some benchmarks.

xflr6 · 2016-10-31T17:22:13Z

I could not get the benchmarks to run with asv under win32, so now there is something basically following what is already there for count().

There is no 'fast-path' in the implementation of nunique() for describe(include='all') either, right?

jreback · 2016-12-16T23:56:41Z

can you rebase and post some asv numbers

xflr6 · 2016-12-18T14:26:49Z

OK, rebased. None of the proposed methods to run asv on this seem to work on my win machine, e.g.:

(C:\Users\User\Miniconda2) C:\Users\User\Desktop\pandas\asv_bench>asv run 4c13e.
.9ab44 -b _nunique
· Creating environments
· Discovering benchmarks
·· Uninstalling from conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-openp
yxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
·· Installing into conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-openpyx
l-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
· Running 3 total benchmarks (1 commits * 1 environments * 3 benchmarks)
[  0.00%] · For pandas commit hash 9ab44eec:
[  0.00%] ·· Building for conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-
openpyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[  0.00%] ··· Error running C:\Users\User\Desktop\pandas\asv_bench\env\eebfa40b3
935b6d282712e13c756274b\python.exe setup.py build
              STDOUT -------->
[...]
              build\temp.win-amd64-2.7\Release\pandas\src\datetime\np_datetime_s
trings.o:np_datetime_strings.c:(.text+0x12): undefined reference to `localtime_r
'
              collect2.exe: error: ld returned 1 exit status
              error: command 'C:\\Users\\User\\Miniconda2\\Scripts\\gcc.bat' fai
led with exit status 1

[  0.00%] ·· Skipping conda-py2.7-Cython-libpython-matplotlib-numexpr-numpy-open
pyxl-pytables-scipy-sqlalchemy-xlrd-xlsxwriter-xlwt
[ 33.33%] ··· Benchmark groupby.groupby_nunique.time_groupby_nunique skipped
[ 66.67%] ··· Benchmark groupby.GroupBySuite.time_nunique skipped
[100.00%] ··· Benchmark frame_methods.frame_nunique.time_frame_nunique skipped

jreback · 2016-12-20T13:55:05Z

pandas/core/groupby.py

+        ----------
+        dropna : boolean, default True
+            Don't include NaN in the counts.
+        """


can you add a Returns and Examples sections

jreback · 2016-12-20T13:55:14Z

pandas/core/frame.py

+        dropna : boolean, default True
+            Don't include NaN in the counts.
+
+        Returns


can you add Examples section

jreback · 2016-12-20T13:58:01Z

pandas/tests/groupby/test_groupby.py

+            'B': list('abxacc'),
+            'C': list('abbacx'),
+        })
+        expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})


test with both as_index=True and False

jreback · 2016-12-20T13:59:49Z

pandas/tests/groupby/test_groupby.py

+        })
+        expected = DataFrame({'A': [1] * 3, 'B': [1, 2, 1], 'C': [1, 1, 2]})
+        result = df.groupby('A', as_index=False).nunique()
+        tm.assert_frame_equal(result, expected)


also can you test with dropna=True and False

jreback · 2016-12-20T14:00:30Z

@xflr6 ok lgtm. just some added tests / docs. ping on green.

don't worry about the benchmarks.

xflr6 · 2016-12-28T20:27:06Z

Thanks, extended the docs and tests, CI passed.

jorisvandenbossche

@xflr6 Nice enhancement! I added some more comments.

jorisvandenbossche · 2016-12-28T23:47:08Z

pandas/core/frame.py

+
+        Examples
+        --------
+        >>> df = DataFrame({'A': [1, 2, 3], 'B': [1, 1, 1]})


DataFrame -> pd.DataFrame

Changed (note that there are still some other example sections with bare DataFrame).

yep, but we try to have the standards for new code a bit higher :-)

jorisvandenbossche · 2016-12-28T23:48:07Z

pandas/core/groupby.py

+
+        Examples
+        --------
+        >>> df = DataFrame({'id': ['spam', 'egg', 'egg', 'spam', 'ham', 'ham'],


DataFrame -> pd.DataFrame

jorisvandenbossche · 2016-12-28T23:49:33Z

pandas/core/groupby.py

+        ham    1       1       2
+        spam   1       2       1
+
+        >>> df.groupby('id').filter(lambda g: (g.nunique() > 1).any())


This is a nice example (although more of the filter method and dataframe.nunique), but can you add one sentence introducing it? (explaining what we are going to do in the next example)

jorisvandenbossche · 2016-12-28T23:50:31Z

pandas/core/groupby.py

+        """
+        from functools import partial
+        func = partial(Series.nunique, dropna=dropna)
+        return self.apply(lambda g: g.apply(func))


Is the use of functools.partial actually needed here? As extra keyword like dropna passed to apply will normally be passed through to the func

TIL about kwargs-passing in apply(), simplified both nunique-methods with that.

jorisvandenbossche · 2016-12-28T23:51:37Z

pandas/tests/frame/test_analytics.py

+        f = lambda s: len(algorithms.unique1d(s.dropna()))
+        self._check_stat_op('nunique', f, has_skipna=False,
+                            check_dtype=False, check_dates=True)
+


Can you add tests for the dropna and axis keywords as well?

jorisvandenbossche · 2016-12-28T23:57:00Z

pandas/core/frame.py

+        Parameters
+        ----------
+        axis : {0 or 'index', 1 or 'columns'}, default 0
+            0 or 'index' for row-wise, 1 or 'columns' for column-wise


row-wise / column-wise is probably used a lot in other methods here as well (can you check?), but, I find it in this case a bit confusing. As I would interpret 'column-wise' as "distinct observations for each column". And this is not correct, as that is the default of axis=0/'index'. So the axis=1 is more 'over/along the columns'
But English is not my mother tongue. What do you think?

As is, the wording is consistent with all the other methods such as count(): Wouldn't it be better to have a dedicated PR for that, in case all the axis docstrings are to be improved?

@xflr6 yes could have a PR for improving these kinds of things in general (we already use shared_docs for this type of thing anyhow), so these are pretty general. Here is not as .nunique has separate doc-strings for Series/DataFrame, which is why @jorisvandenbossche is asking.

ok with actually fixing that (so this would hook into our more general doc-strings system).

actually Series.nunique is defined in pandas.core.base (so its the same for Index). But these could easily hook into the same doc-string system as I said above.

Note that it is not fully consistent throughout frame.py as well. There are also some methods that explain this differently (eg apply, mode. corrwith actually switches the row and column-wise ("0 or 'index' to compute column-wise, 1 or 'columns' for row-wise")).

The thing is also that the explanation can be different depending on what the function does I think.

@xflr6 can you make the doc-string references to the axis consistent with other methods

e.g. example from another method.

axis : {0 or 'index', 1 or 'columns'}, default 0 Sort index/rows versus columns

I think you can simply drop the 2nd line of the axis parm

jorisvandenbossche · 2016-12-28T23:59:34Z

@xflr6 As @jreback said, it is no problem that you do not get the benchmarks working for this PR. But, just to try to solve it: there seems to go something wrong in building the environment. Can you yourself build pandas master in a conda env?

jreback · 2017-01-02T20:24:25Z

lgtm. @jorisvandenbossche if you have any final comments.

jreback · 2017-01-23T13:41:35Z

thanks @xflr6

jorisvandenbossche · 2017-01-23T20:37:01Z

One remaining remark is that I think row-wise / column-wise is a wrong explanation in this case.

jreback · 2017-01-23T20:45:41Z

@jorisvandenbossche I took that out on the merge, FYI

closes pandas-dev#14336 Author: Sebastian Bank <[email protected]> Closes pandas-dev#14376 from xflr6/nunique and squashes the following commits: a0558e7 [Sebastian Bank] use apply()-kwargs instead of partial, more tests, better examples c8d3ac4 [Sebastian Bank] extend docs and tests fd0f22d [Sebastian Bank] add simple benchmarks 5c4b325 [Sebastian Bank] API: add DataFrame.nunique() and DataFrameGroupBy.nunique()

jreback added Enhancement Groupby labels Oct 8, 2016

jreback reviewed Dec 20, 2016

View reviewed changes

pandas/core/frame.py

dropna : boolean, default True

Don't include NaN in the counts.

Returns

Copy link

Contributor

jreback Dec 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add Examples section

jreback reviewed Dec 20, 2016

View reviewed changes

xflr6 added 3 commits December 28, 2016 16:35

API: add DataFrame.nunique() and DataFrameGroupBy.nunique()

5c4b325

add simple benchmarks

fd0f22d

extend docs and tests

c8d3ac4

jorisvandenbossche reviewed Dec 28, 2016

View reviewed changes

use apply()-kwargs instead of partial, more tests, better examples

a0558e7

jreback added this to the 0.20.0 milestone Jan 2, 2017

jreback closed this in a1b6587 Jan 23, 2017

jreback mentioned this pull request Jan 23, 2017

PERF: DataFrame.groupby.nunique is non-performant #15197

Closed

jreback mentioned this pull request Mar 23, 2017

Add .nunique method to DataFrame and .groupby #15794

Closed

Uh oh!

API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376

API: add DataFrame.nunique() and DataFrameGroupBy.nunique() #14376

Uh oh!

Conversation

xflr6 commented Oct 7, 2016

Uh oh!

codecov-io commented Oct 7, 2016 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Current coverage is 84.77% (diff: 75.00%)

Uh oh!

jreback commented Oct 7, 2016

Uh oh!

xflr6 commented Oct 31, 2016

Uh oh!

jreback commented Dec 16, 2016

Uh oh!

xflr6 commented Dec 18, 2016

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jreback commented Dec 20, 2016

Uh oh!

xflr6 commented Dec 28, 2016

Uh oh!

jorisvandenbossche left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

jorisvandenbossche commented Dec 28, 2016

Uh oh!

jreback commented Jan 2, 2017

Uh oh!

jreback commented Jan 23, 2017

Uh oh!

jorisvandenbossche commented Jan 23, 2017

Uh oh!

jreback commented Jan 23, 2017

Uh oh!

Uh oh!

codecov-io commented Oct 7, 2016 •

edited

Loading