DOC: Update pandas.cut docstring #20104

ikoevska · 2018-03-10T10:08:23Z

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

[X] PR title is "DOC: update the docstring"
[X] The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
[X] The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
[X] The html version looks good: python doc/make.py --single <your-function-or-method>
[X] It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
############################ Docstring (pandas.cut) ############################
################################################################################

Bin `x` and return data about the bin to which each `x` value belongs.

Splits `x` into the specified number of equal-width half-open bins.
Based on the parameters specified and the input, returns data about
the half-open bins to which each value of `x` belongs or the bins
themselves.
Use `cut` when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, `cut` could convert ages to groups
of age ranges.

Parameters
----------
x : array-like
    The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or pandas.IntervalIndex
    If int, defines the number of equal-width bins in the range of `x`.
    The range of `x` is extended by .1% on each side to include the min or
    max values of `x`.
    If a sequence, defines the bin edges allowing for non-uniform width.
    No extension of the range of `x` is done.
right : bool, default 'True'
    Indicates whether the `bins` include the rightmost edge or not. If
    `right == True` (the default), then the `bins` [1,2,3,4] indicate
    (1,2], (2,3], (3,4].
labels : array or bool, optional
    Specifies the labels for the returned bins. Must be the same length as
    the resulting bins. If False, returns only integer indicators of the
    bins.
retbins : bool, default 'False'
    Whether to return the bins or not. Useful when bins is provided
    as a scalar.
precision : int, default '3'
    The precision at which to store and display the bins labels.
include_lowest : bool, default 'False'
    Whether the first interval should be left-inclusive or not.

Returns
-------
out : pandas.Categorical, Series, or ndarray
    An array-like object representing the respective bin for each value
    of `x`. The type depends on the value of `labels`.

    * True : returns a Series for Series `x` or a pandas.Categorical for
    pandas.Categorial `x`.

    * False : returns an ndarray of integers.
bins : numpy.ndarray of floats
    Returned when `retbins` is 'True'.

See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank
    or based on sample quantiles.
pandas.Categorical : Represents a categorical variable in
    classic R / S-plus fashion.
Series : One-dimensional ndarray with axis labels (including time series).
pandas.IntervalIndex : Immutable Index implementing an ordered,
    sliceable set. IntervalIndex represents an Index of intervals that
    are all closed on the same side.

Notes
-----
Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting pandas.Categorical object.

Examples
--------
>>> pd.cut(np.array([1,7,5,4,6,3]), 3)
... # doctest: +ELLIPSIS
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...

>>> pd.cut(np.array([1,7,5,4,6,3]), 3, retbins=True)
... # doctest: +ELLIPSIS
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3.   , 5.   , 7.   ]))

>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
...        3, labels=["good", "medium", "bad"])
... # doctest: +SKIP
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]

>>> pd.cut(np.ones(5, dtype='int64'), 4, labels=False)
array([1, 1, 1, 1, 1], dtype=int64)

>>> s = pd.Series(np.array([2,4,6,8,10]), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
... # doctest: +ELLIPSIS
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64]): [(1.992, 4.667] < (4.667, ...

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.cut" correct. :)

Comment: Resubmitting #20069 after a botched rebase.

jorisvandenbossche · 2018-03-10T10:15:39Z

Did you already update according to my comments?

ikoevska · 2018-03-10T10:30:08Z

Nope, updating right now. :)

ikoevska · 2018-03-10T11:53:10Z

@jorisvandenbossche Updated based on your comments and added some more examples. I've also updated the validation script output in the description of the PR.

jreback · 2018-03-10T13:30:09Z

pandas/core/reshape/tile.py

-    Return indices of half-open bins to which each value of `x` belongs.
+    Bin `x` and return data about the bin to which each `x` value belongs.
+
+    This function splits `x` into the specified number of equal-width half-


can remove function.

you say 'data' above and use 'information' here. pick one.

jreback · 2018-03-10T13:30:23Z

pandas/core/reshape/tile.py

+    This function splits `x` into the specified number of equal-width half-
+    open bins. Based on the parameters specified and the input, returns
+    information about the half-open bins to which each value of `x` belongs
+    or the bins themselves.


if retbins=True

I disagree with this suggested change. The sentence currently begins with "Based on the parameters specified and the input", which points the reader to the parameters and other information below. If I add "if retbins=True", I will be pressed to explain all the other options in the extended description as well. The sentence would also become far too long and less readable.
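
For readers following this thread, a minimal sketch of what `retbins=True` changes in the return value may help. It reuses the input array from the docstring examples in the PR description; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3])

# Default call: only the binned values are returned.
binned = pd.cut(values, 3)

# With retbins=True the call returns a tuple: (binned values, bin edges).
binned_again, edges = pd.cut(values, 3, retbins=True)

print(type(binned).__name__)  # Categorical
print(edges)                  # the 4 edges that delimit the 3 bins
```

The binned values are the same in both calls; `retbins` only adds the array of edges, which is mostly useful when `bins` was given as an int and the edges were computed by pandas.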

jreback · 2018-03-10T13:31:57Z

pandas/core/reshape/tile.py

-        Returned only if `retbins` is True.
+    out : pandas.Categorical or Series, or array of int if `labels` is 'False'
+        The return type depends on the input.
+        If the input is a Series, a Series of type category is returned.


not sure about the formatting here, @TomAugspurger ?

@TomAugspurger, still waiting for more feedback on that. Thanks!

jreback · 2018-03-10T13:32:24Z

pandas/core/reshape/tile.py

+        or based on sample quantiles.
+    pandas.Categorical : Represents a categorical variable in
+        classic R / S-plus fashion.
+    Series : One-dimensional ndarray with axis labels (including time series).


I don't think Series is needed here.

As this is something that you can input or get as an output from the command, I think this needs to be here. There is also an example with series below.

jreback · 2018-03-10T13:33:18Z

pandas/core/reshape/tile.py

@@ -88,7 +109,18 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
    Categories (3, object): [good < medium < bad]

    >>> pd.cut(np.ones(5), 4, labels=False)


specify dtype='int64' to np.ones (this is normally a platform int)
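
As a rough illustration of the dtype point (one possible reading of the comment, not a statement of the reviewer's intent): `np.ones` produces `float64` unless a dtype is passed, and with `labels=False` the result of `cut` is an array of integer bin positions whose exact integer width may depend on the platform. The variable names are hypothetical.

```python
import numpy as np
import pandas as pd

# np.ones produces float64 unless a dtype is given explicitly.
default_ones = np.ones(5)
int_ones = np.ones(5, dtype="int64")

# labels=False returns the integer position of each value's bin.
codes = pd.cut(int_ones, 4, labels=False)

print(default_ones.dtype, int_ones.dtype)
print(codes.dtype)  # an integer dtype; assumed here that the width may vary by platform
```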

codecov · 2018-03-12T17:53:15Z

Codecov Report

Merging #20104 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20104      +/-   ##
==========================================
+ Coverage    91.7%   91.72%   +0.02%     
==========================================
  Files         150      150              
  Lines       49165    49156       -9     
==========================================
+ Hits        45087    45090       +3     
+ Misses       4078     4066      -12

Flag         Coverage Δ
#multiple    90.11% <ø> (+0.02%)  ⬆️
#single      41.85% <ø> (-0.02%)  ⬇️

Impacted Files                         Coverage Δ
pandas/core/reshape/tile.py            93.37% <ø> (ø)       ⬆️
pandas/plotting/_core.py               82.23% <0%> (-0.04%) ⬇️
pandas/core/indexes/datetimelike.py    96.7% <0%> (-0.02%)  ⬇️
pandas/core/ops.py                     96.33% <0%> (-0.02%) ⬇️
pandas/core/generic.py                 95.84% <0%> (-0.02%) ⬇️
pandas/core/window.py                  96.31% <0%> (ø)      ⬆️
pandas/core/frame.py                   97.18% <0%> (ø)      ⬆️
pandas/core/indexes/multi.py           95.06% <0%> (ø)      ⬆️
pandas/core/strings.py                 98.32% <0%> (ø)      ⬆️
pandas/core/indexing.py                93.02% <0%> (ø)      ⬆️
... and 8 more

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 7169830...f22e45f. Read the comment docs.

TomAugspurger · 2018-03-12T18:01:04Z

pandas/core/reshape/tile.py

-        represented as categories when categorical data is returned.
-    bins : ndarray of floats
-        Returned only if `retbins` is True.
+    out : pandas.Categorical or Series, or array of int if `labels` is 'False'


I think something like

out : pandas.Categorical, Series, or ndarray
    An array-like object representing the bin for each value of `x`.
    The type depends on the value of `labels`.

    * True : returns a Series for Series `x` or a Categorical for
      Categorical `x`.
    * False : returns an ndarray of integers.

TomAugspurger · 2018-03-12T18:02:37Z

Looks like some git issues @ikoevska. LMK if you need some help.

TomAugspurger · 2018-03-12T18:03:42Z

Did you have any changes in 2dffb60? If not, I think

git reset --hard d24c749b0 
git merge upstream/master

should do it. It'll throw away changes from that commit though.

ikoevska · 2018-03-12T18:03:58Z

Yep, I did a sync with the upstream and then a rebase on master and this happened. Any suggestions on how to fix it?

UPDATE: @TomAugspurger that fixed it, thanks so much!

pep8speaks · 2018-03-12T19:07:41Z

Hello @ikoevska! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 15, 2018 at 20:28 Hours UTC

ikoevska · 2018-03-12T19:09:02Z

@jreback Can you take a look at the latest changes and my comments? Thanks!

TomAugspurger · 2018-03-12T19:24:22Z

pandas/core/reshape/tile.py

-    precision : int, optional
-        The precision at which to store and display the bins labels
-    include_lowest : bool, optional
+    precision : int, default '3'


No quotes around '3'. Quotes makes it look like a string.

TomAugspurger · 2018-03-12T19:25:45Z

pandas/core/reshape/tile.py

+        If int, defines the number of equal-width bins in the range of `x`.
+        The range of `x` is extended by .1% on each side to include the min or
+        max values of `x`.
+        If a sequence, defines the bin edges allowing for non-uniform width.


Could format these as

* int : the number of equal-...
* sequence : integers defining the bin edges (refer to `right` parameter.)
* IntervalIndex : sequence of Intervals to use for binning...

What happens with right and IntervalIndex? Is it ignored?
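
The three accepted forms of `bins` can be sketched as follows. The thread itself does not answer the question about `right`; the last two calls assume the behavior described in current pandas documentation, where `right` is ignored when `bins` is an IntervalIndex. Input values and variable names are illustrative.

```python
import numpy as np
import pandas as pd

values = np.array([0, 0.5, 1.5, 2.5, 4.5])

# int: number of equal-width bins spanning the range of the data.
by_count = pd.cut(values, 3)

# sequence: explicit edges; `right` decides which side of each bin is closed.
by_edges = pd.cut(values, [0, 1, 3, 5], right=False)

# IntervalIndex: the exact intervals to use; values outside them become NaN.
intervals = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
by_intervals = pd.cut(values, intervals)

# Assumption being checked: `right` has no effect with an IntervalIndex,
# because each interval already carries its own closed side.
by_intervals_left = pd.cut(values, intervals, right=False)

print(by_intervals)
```

Comparing `by_intervals.codes` with `by_intervals_left.codes` is a quick way to check that assumption on the installed pandas version.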

TomAugspurger · 2018-03-12T19:27:12Z

pandas/core/reshape/tile.py

-    include_lowest : bool, optional
+    precision : int, default '3'
+        The precision at which to store and display the bins labels.
+    include_lowest : bool, default 'False'


I think a single backtick around False, or just False. Not sure.

No quoting or backticks in this case

TomAugspurger · 2018-03-12T19:28:51Z

pandas/core/reshape/tile.py


    Examples
    --------
-    >>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
+    >>> pd.cut(np.array([1,7,5,4,6,3]), 3)


PEP8 on all these (spaces after ,)

jorisvandenbossche

Can you put some explanation between the different examples? (briefly explaining what the following example is doing, or how it differs from the previous one, to give some context for the reader looking at those examples)

jorisvandenbossche · 2018-03-12T20:31:36Z

pandas/core/reshape/tile.py

+        max values of `x`.
+        If a sequence, defines the bin edges allowing for non-uniform width.
+        No extension of the range of `x` is done.
+    right : bool, default 'True'


similar comment here for True -> no quotes (makes it look like a string)

jorisvandenbossche · 2018-03-12T20:32:21Z

pandas/core/reshape/tile.py

        bins.
-    retbins : bool, optional
-        Whether to return the bins or not. Can be useful if bins is given
+    retbins : bool, default 'False'


similar, no quotes

jorisvandenbossche · 2018-03-12T20:32:53Z

pandas/core/reshape/tile.py

-    include_lowest : bool, optional
+    precision : int, default '3'
+        The precision at which to store and display the bins labels.
+    include_lowest : bool, default 'False'


No quoting or backticks in this case

jorisvandenbossche · 2018-03-12T20:35:49Z

pandas/core/reshape/tile.py

-    a categorical variable. For example, `cut` could convert ages to groups
-    of age ranges.
+        * True : returns a Series for Series `x` or a pandas.Categorical for
+        pandas.Categorial `x`.


I think a categorical is returned for any array-like that is not a Series (not only for Categorical) ?

Also, is it worth mentioning that it is a Categorical of Intervals?

For Series input it's a Series with categorical dtype.

And yes, saying Categorical of Intervals somewhere is good. Not sure if here is the best place though.
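
A short sketch of the return types discussed here, using the inputs from the docstring examples. It is meant as a quick check against an installed pandas rather than as a specification; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Array input: a Categorical whose categories are intervals.
from_array = pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3)

# Series input: a Series with categorical dtype, index preserved.
from_series = pd.cut(pd.Series([2, 4, 6, 8, 10], index=list("abcde")), 3)

# labels=False: a plain ndarray of integer bin positions.
as_codes = pd.cut(np.array([1, 7, 5, 4, 6, 3]), 3, labels=False)

print(type(from_array).__name__)             # Categorical
print(type(from_array.categories).__name__)  # IntervalIndex
print(from_series.dtype)                     # category
print(type(as_codes).__name__)               # ndarray
```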

jorisvandenbossche · 2018-03-12T20:38:16Z

pandas/core/reshape/tile.py

-    the resulting Categorical object
+    See Also
+    --------
+    qcut : Discretize variable into equal-sized buckets based on rank


I suppose for qcut the bins will not be equal sized given it is based on quantiles?
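
The difference can be seen on a small skewed sample (chosen here only for illustration): `cut` produces bins of equal width, while `qcut` produces bins holding roughly equal numbers of values, so "equal-sized" in the `qcut` description refers to counts rather than widths.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 100])

equal_width = pd.cut(values, 4)   # 4 bins of equal width over the data range
equal_count = pd.qcut(values, 4)  # 4 bins with edges at the quartiles

# Count how many values land in each bin, in bin order.
width_counts = np.bincount(equal_width.codes, minlength=4)
quantile_counts = np.bincount(equal_count.codes, minlength=4)

print(width_counts)     # [7 0 0 1]
print(quantile_counts)  # [2 2 2 2]
```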

jorisvandenbossche · 2018-03-12T20:39:19Z

pandas/core/reshape/tile.py

+    qcut : Discretize variable into equal-sized buckets based on rank
+        or based on sample quantiles.
+    pandas.Categorical : Represents a categorical variable in
+        classic R / S-plus fashion.


You copied this from Categorical, so that is fine, but I think we should really change this explanation to something not referring to R, many of our users are learning pandas without knowing R :)

jorisvandenbossche · 2018-03-12T20:50:11Z

pandas/core/reshape/tile.py

@@ -26,69 +26,104 @@
 def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
        include_lowest=False):
    """
-    Return indices of half-open bins to which each value of `x` belongs.
+    Bin `x` and return data about the bin to which each `x` value belongs.


I find "return data about the bin" a bit "wordy" without saying much.
I suppose you want to be general because the return type can be either the bins, or the indices indexing into the bins? But I wouldn't worry about this: the summary line should be short and give an idea of the main use case. The extended summary can go more into detail (as you already do).

Starting from the above, I would cut that to the essential and just "Bin x", but that is maybe too short :-)
What about "Convert continuous values into discrete bins" or "Discretize values in specified bins" or "Bin values in specified intervals", but not sure what is the best wording.
@TomAugspurger any input?

Yes, this should be simplified. "Bin values into discrete intervals"?

I don't 100% like "in specified intervals" as you don't have to specify the intervals.

"Bin values into discrete intervals"?

+1, perfect combination of all my attempts :-)

[ci skip]

TomAugspurger

Updated, if anyone wants to take a look.

DOC: Update docs for pandas.cut

e50beb7

Udated with comments from Joris

49e002f

jreback added the Docs, Reshaping, and Interval labels Mar 10, 2018

jreback requested changes Mar 10, 2018

Updated as per comments

d24c749

TomAugspurger reviewed Mar 12, 2018

Merge remote-tracking branch 'upstream/master' into pandas-cut-patch-2

44861a6

ikoevska force-pushed the pandas-cut-patch-2 branch from 2dffb60 to 44861a6 Compare March 12, 2018 18:06

Updated as per comments

1f3caf6

Fixed whitespace issue

544af0e

TomAugspurger reviewed Mar 12, 2018

jorisvandenbossche reviewed Mar 12, 2018

Updated [ci skip]

f22e45f

[ci skip]

TomAugspurger approved these changes Mar 15, 2018

jorisvandenbossche approved these changes Mar 15, 2018

TomAugspurger merged commit 7ee65bc into pandas-dev:master Mar 16, 2018

ikoevska deleted the pandas-cut-patch-2 branch June 9, 2018 11:34
