DOC: Update pandas.cut docstring #20104

Merged: 7 commits into pandas-dev:master from pandas-cut-patch-2 on Mar 16, 2018

Conversation

ikoevska
Contributor

@ikoevska ikoevska commented Mar 10, 2018

Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):

  • [x] PR title is "DOC: update the docstring"
  • [x] The validation script passes: scripts/validate_docstrings.py <your-function-or-method>
  • [x] The PEP8 style check passes: git diff upstream/master -u -- "*.py" | flake8 --diff
  • [x] The html version looks good: python doc/make.py --single <your-function-or-method>
  • [x] It has been proofread on language by another sprint participant

Please include the output of the validation script below between the "```" ticks:

################################################################################
############################ Docstring (pandas.cut) ############################
################################################################################

Bin `x` and return data about the bin to which each `x` value belongs.

Splits `x` into the specified number of equal-width half-open bins.
Based on the parameters specified and the input, returns data about
the half-open bins to which each value of `x` belongs or the bins
themselves.
Use `cut` when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, `cut` could convert ages to groups
of age ranges.

Parameters
----------
x : array-like
    The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or pandas.IntervalIndex
    If int, defines the number of equal-width bins in the range of `x`.
    The range of `x` is extended by .1% on each side to include the min or
    max values of `x`.
    If a sequence, defines the bin edges allowing for non-uniform width.
    No extension of the range of `x` is done.
right : bool, default 'True'
    Indicates whether the `bins` include the rightmost edge or not. If
    `right == True` (the default), then the `bins` [1,2,3,4] indicate
    (1,2], (2,3], (3,4].
labels : array or bool, optional
    Specifies the labels for the returned bins. Must be the same length as
    the resulting bins. If False, returns only integer indicators of the
    bins.
retbins : bool, default 'False'
    Whether to return the bins or not. Useful when bins is provided
    as a scalar.
precision : int, default '3'
    The precision at which to store and display the bins labels.
include_lowest : bool, default 'False'
    Whether the first interval should be left-inclusive or not.

Returns
-------
out : pandas.Categorical, Series, or ndarray
    An array-like object representing the respective bin for each value
    of `x`. The type depends on the value of `labels`.

    * True : returns a Series for Series `x` or a pandas.Categorical for
    pandas.Categorical `x`.

    * False : returns an ndarray of integers.
bins : numpy.ndarray of floats
    Returned when `retbins` is 'True'.

See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank
    or based on sample quantiles.
pandas.Categorical : Represents a categorical variable in
    classic R / S-plus fashion.
Series : One-dimensional ndarray with axis labels (including time series).
pandas.IntervalIndex : Immutable Index implementing an ordered,
    sliceable set. IntervalIndex represents an Index of intervals that
    are all closed on the same side.

Notes
-----
Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting pandas.Categorical object.

Examples
--------
>>> pd.cut(np.array([1,7,5,4,6,3]), 3)
... # doctest: +ELLIPSIS
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...

>>> pd.cut(np.array([1,7,5,4,6,3]), 3, retbins=True)
... # doctest: +ELLIPSIS
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3.   , 5.   , 7.   ]))

>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
...        3, labels=["good", "medium", "bad"])
... # doctest: +SKIP
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]

>>> pd.cut(np.ones(5, dtype='int64'), 4, labels=False)
array([1, 1, 1, 1, 1], dtype=int64)

>>> s = pd.Series(np.array([2,4,6,8,10]), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
... # doctest: +ELLIPSIS
a    (1.992, 4.667]
b    (1.992, 4.667]
c    (4.667, 7.333]
d     (7.333, 10.0]
e     (7.333, 10.0]
dtype: category
Categories (3, interval[float64]): [(1.992, 4.667] < (4.667, ...

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.cut" correct. :)

Comment: Resubmitting #20069 after a botched rebase.
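
For readers following along, here is a small sketch of what the `bins`, `right`, and `include_lowest` descriptions in the docstring output above mean in practice. It is an illustration written against a recent pandas release, not part of the PR, and the variable names are made up for the example.

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3])

# Integer `bins`: three equal-width bins. With retbins=True the edges come back too.
binned, edges = pd.cut(values, 3, retbins=True)
print(edges)  # roughly [0.994, 3, 5, 7]: the lowest edge is nudged below min(values)

# Explicit edges: no range extension, so `right` and `include_lowest` decide
# what happens to values that sit exactly on an edge.
right_closed = pd.cut(values, [1, 4, 7])                      # (1, 4], (4, 7]
left_closed = pd.cut(values, [1, 4, 7], right=False)          # [1, 4), [4, 7)
with_lowest = pd.cut(values, [1, 4, 7], include_lowest=True)  # first bin also takes 1

# Code -1 marks a value that fell outside every bin (NaN in the Categorical).
print(right_closed.codes)  # the value 1 is not in (1, 4]
print(left_closed.codes)   # the value 7 is not in [4, 7)
print(with_lowest.codes)   # every value lands in a bin
```

Note that in this output only the lower edge moves while the upper edge stays at 7, which is worth keeping in mind when reading the "extended by .1% on each side" wording.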

@jorisvandenbossche
Member

Did you already update according to my comments?

@ikoevska
Contributor Author

Nope, updating right now. :)

@ikoevska
Contributor Author

@jorisvandenbossche Updated based on your comments and added some more examples. I've also updated the validation script output in the description of the PR.

@jreback jreback added the Docs, Reshaping (Concat, Merge/Join, Stack/Unstack, Explode), and Interval (Interval data type) labels Mar 10, 2018
Return indices of half-open bins to which each value of `x` belongs.
Bin `x` and return data about the bin to which each `x` value belongs.

This function splits `x` into the specified number of equal-width half-
Contributor

can remove function.

you say 'data' above and use 'information' here. pick one.

This function splits `x` into the specified number of equal-width half-
open bins. Based on the parameters specified and the input, returns
information about the half-open bins to which each value of `x` belongs
or the bins themselves.
Contributor

if retbins=True

Contributor Author

@ikoevska ikoevska Mar 12, 2018

I disagree with this suggested change. The complete sentence currently begins with "Based on the parameters specified and the input", which already points the reader to the parameters and other details below. If I add "if retbins=True", I will be pressed to explain every other option in the extended description as well. The sentence would also become far too long and less readable.

Returned only if `retbins` is True.
out : pandas.Categorical or Series, or array of int if `labels` is 'False'
The return type depends on the input.
If the input is a Series, a Series of type category is returned.
Contributor

not sure about the formatting here, @TomAugspurger ?

Contributor Author

@TomAugspurger, still waiting for more feedback on that. Thanks!

or based on sample quantiles.
pandas.Categorical : Represents a categorical variable in
classic R / S-plus fashion.
Series : One-dimensional ndarray with axis labels (including time series).
Contributor

I don't think Series is needed here

Contributor Author

As this is something that you can input or get as an output from the command, I think this needs to be here. There is also an example with series below.

@@ -88,7 +109,18 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
Categories (3, object): [good < medium < bad]

>>> pd.cut(np.ones(5), 4, labels=False)
Contributor

specify dtype='int64' to np.ones (this is normally a platform int)
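
The concern here seems to be that the printed result of the doctest depends on the platform's default integer. A hedged sketch of a check that does not rely on the printed `dtype` (the variable name is illustrative only):

```python
import numpy as np
import pandas as pd

# labels=False returns integer bin indicators instead of interval labels.
codes = pd.cut(np.ones(5, dtype="int64"), 4, labels=False)

# The exact integer width can differ between platforms, so inspect the dtype
# kind and the values rather than the repr shown in a doctest.
print(codes.dtype.kind)          # kind of the returned dtype (an integer kind)
print(len(set(codes.tolist())))  # identical inputs all share one bin
```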

@codecov

codecov bot commented Mar 12, 2018

Codecov Report

Merging #20104 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20104      +/-   ##
==========================================
+ Coverage    91.7%   91.72%   +0.02%     
==========================================
  Files         150      150              
  Lines       49165    49156       -9     
==========================================
+ Hits        45087    45090       +3     
+ Misses       4078     4066      -12
| Flag | Coverage Δ |
|---|---|
| #multiple | 90.11% <ø> (+0.02%) ⬆️ |
| #single | 41.85% <ø> (-0.02%) ⬇️ |

| Impacted Files | Coverage Δ |
|---|---|
| pandas/core/reshape/tile.py | 93.37% <ø> (ø) ⬆️ |
| pandas/plotting/_core.py | 82.23% <0%> (-0.04%) ⬇️ |
| pandas/core/indexes/datetimelike.py | 96.7% <0%> (-0.02%) ⬇️ |
| pandas/core/ops.py | 96.33% <0%> (-0.02%) ⬇️ |
| pandas/core/generic.py | 95.84% <0%> (-0.02%) ⬇️ |
| pandas/core/window.py | 96.31% <0%> (ø) ⬆️ |
| pandas/core/frame.py | 97.18% <0%> (ø) ⬆️ |
| pandas/core/indexes/multi.py | 95.06% <0%> (ø) ⬆️ |
| pandas/core/strings.py | 98.32% <0%> (ø) ⬆️ |
| pandas/core/indexing.py | 93.02% <0%> (ø) ⬆️ |

... and 8 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 7169830...f22e45f.

represented as categories when categorical data is returned.
bins : ndarray of floats
Returned only if `retbins` is True.
out : pandas.Categorical or Series, or array of int if `labels` is 'False'
Contributor

I think something like

out : pandas.Categorical, Series, or ndarray
    An array-like object representing the bin for each value of `x`.
    The type depends on the value of `labels`.

    * True : returns a Series for Series `x` or a Categorical for Categorical `x`.
    * False : returns an ndarray of integers.

@TomAugspurger
Contributor

Looks like some git issues @ikoevska. LMK if you need some help.

@TomAugspurger
Contributor

Did you have any changes in 2dffb60? If not, I think

git reset --hard d24c749b0 
git merge upstream/master

should do it. It'll throw away changes from that commit though.

@ikoevska
Contributor Author

ikoevska commented Mar 12, 2018

Yep, I did a sync with the upstream and then a rebase on master and this happened. Suggestions how to fix it?

UPDATE: @TomAugspurger that fixed it, thanks so much!

@ikoevska ikoevska force-pushed the pandas-cut-patch-2 branch from 2dffb60 to 44861a6 March 12, 2018 18:06
@pep8speaks

pep8speaks commented Mar 12, 2018

Hello @ikoevska! Thanks for updating the PR.

Cheers ! There are no PEP8 issues in this Pull Request. 🍻

Comment last updated on March 15, 2018 at 20:28 Hours UTC

@ikoevska
Contributor Author

@jreback Can you take a look at the latest changes and my comments? Thanks!

precision : int, optional
The precision at which to store and display the bins labels
include_lowest : bool, optional
precision : int, default '3'
Contributor

No quotes around '3'. Quotes makes it look like a string.

If int, defines the number of equal-width bins in the range of `x`.
The range of `x` is extended by .1% on each side to include the min or
max values of `x`.
If a sequence, defines the bin edges allowing for non-uniform width.
Contributor

Could format these as

* int : the number of equal-...
* sequence : integers defining the bin edges (refer to the `right` parameter).
* IntervalIndex : sequence of Intervals to use for binning...

Contributor

What happens with right and IntervalIndex? Is it ignored?
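
As far as I can tell from current pandas behavior (an assumption worth verifying against the version in use), `right` has no effect when `bins` is an IntervalIndex, because each interval already carries its own closed side. A quick sketch with made-up values:

```python
import pandas as pd

# from_tuples builds intervals that are closed on the right by default.
bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
values = [0, 0.5, 1.5, 2.5, 4.5]

default_right = pd.cut(values, bins)
right_false = pd.cut(values, bins, right=False)

# Values that fall in no interval get code -1 (NaN in the Categorical).
print(default_right.codes)
print(right_false.codes)
```

If the two printed code arrays match, `right` was not consulted for these bins. The same example also shows out-of-bounds values becoming NA, as the Notes section describes.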

include_lowest : bool, optional
precision : int, default '3'
The precision at which to store and display the bins labels.
include_lowest : bool, default 'False'
Contributor

I think a single backtick around False, or just False. Not sure.

Member

No quoting or backticks in this case


Examples
--------
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
>>> pd.cut(np.array([1,7,5,4,6,3]), 3)
Contributor

PEP8 on all these (spaces after ,)

Member

@jorisvandenbossche jorisvandenbossche left a comment

Can you put some explanation between the different examples? (shortly explaining what the following example is doing, or how it is different from the previous one, to give some context for the reader looking at those examples)

max values of `x`.
If a sequence, defines the bin edges allowing for non-uniform width.
No extension of the range of `x` is done.
right : bool, default 'True'
Member

similar comment here for True -> no quotes (makes it look like a string)

bins.
retbins : bool, optional
Whether to return the bins or not. Can be useful if bins is given
retbins : bool, default 'False'
Member

similar, no quotes

include_lowest : bool, optional
precision : int, default '3'
The precision at which to store and display the bins labels.
include_lowest : bool, default 'False'
Member

No quoting or backticks in this case

a categorical variable. For example, `cut` could convert ages to groups
of age ranges.
* True : returns a Series for Series `x` or a pandas.Categorical for
pandas.Categorical `x`.
Member

I think a Categorical is returned for any array-like that is not a Series (not only for a Categorical)?

Member

Also, is it worth mentioning that it is a Categorical of Intervals?

Contributor

For series input it's a Series with categorical dtype.

Contributor

And yes, saying Categorical of Intervals somewhere is good. Not sure if here is best though.
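
To make the return types discussed in this thread concrete, here is a minimal sketch, assuming a recent pandas release; the variable names are hypothetical and the data is borrowed from the Series example in the docstring.

```python
import numpy as np
import pandas as pd

arr = np.array([2, 4, 6, 8, 10])
ser = pd.Series(arr, index=["a", "b", "c", "d", "e"])

from_array = pd.cut(arr, 3)              # array-like input
from_series = pd.cut(ser, 3)             # Series input
as_codes = pd.cut(arr, 3, labels=False)  # integer indicators only

print(type(from_array).__name__)             # a Categorical
print(type(from_array.categories).__name__)  # its categories are intervals
print(type(from_series).__name__, from_series.dtype)  # a Series with category dtype
print(type(as_codes).__name__, as_codes)     # an ndarray of bin numbers
```

This matches the comments above: a non-Series array-like comes back as a Categorical whose categories are Intervals, while a Series comes back as a Series with categorical dtype.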

the resulting Categorical object
See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank
Member

I suppose for qcut the bins will not be equal sized given it is based on quantiles?
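
A short sketch of the difference raised here, using made-up data with one outlier: `cut` with an integer gives bins of nearly equal width, while `qcut` places the edges at sample quantiles so that each bin holds roughly the same number of values. This is an illustration, not text from the PR.

```python
import numpy as np
import pandas as pd

values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 100])

# cut: three bins of (nearly) equal width, so the outlier leaves most bins sparse.
cut_codes, cut_edges = pd.cut(values, 3, labels=False, retbins=True)
# qcut: edges are sample quantiles, so the counts per bin are balanced instead.
qcut_codes, qcut_edges = pd.qcut(values, 3, labels=False, retbins=True)

print(np.bincount(cut_codes, minlength=3))   # values per bin for cut
print(np.bincount(qcut_codes, minlength=3))  # values per bin for qcut
print(np.diff(cut_edges))                    # bin widths for cut
print(np.diff(qcut_edges))                   # bin widths for qcut
```

So "equal-sized" in the `qcut` summary refers to the number of observations per bucket, not to the width of the intervals.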

qcut : Discretize variable into equal-sized buckets based on rank
or based on sample quantiles.
pandas.Categorical : Represents a categorical variable in
classic R / S-plus fashion.
Member

You copied this from Categorical, so that is fine, but I think we should really change this explanation to something not referring to R, many of our users are learning pandas without knowing R :)

@@ -26,69 +26,104 @@
 def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
         include_lowest=False):
     """
-    Return indices of half-open bins to which each value of `x` belongs.
+    Bin `x` and return data about the bin to which each `x` value belongs.
Member

I find "return data about the bin" a bit "wordy" without saying much.
I suppose you want to be general because the return type can be either the bins, or the indices indexing into the bins? But I won't care about this: the summary line should be short and give an idea of the main use case. The extended summary can go more into detail (as you already do).

Starting from the above, I would cut that to the essential and just "Bin x", but that is maybe too short :-)
What about "Convert continuous values into discrete bins" or "Discretize values in specified bins" or "Bin values in specified intervals", but not sure what is the best wording.
@TomAugspurger any input?

Contributor

Yes, this should be simplified. "Bin values into discrete intervals"?

I don't 100% like "in specified intervals" as you don't have to specify the intervals.

Member

"Bin values into discrete intervals"?

+1, perfect combination of all my attempts :-)

Contributor

@TomAugspurger TomAugspurger left a comment

Updated, if anyone wants to take a look.

@TomAugspurger TomAugspurger merged commit 7ee65bc into pandas-dev:master Mar 16, 2018
@ikoevska ikoevska deleted the pandas-cut-patch-2 branch June 9, 2018 11:34
Labels
Docs, Interval (Interval data type), Reshaping (Concat, Merge/Join, Stack/Unstack, Explode)

5 participants