DOC: Fixed example & description for pandas.cut #20069

Closed
wants to merge 20 commits into from

Conversation

ikoevska
Contributor

@ikoevska ikoevska commented Mar 9, 2018

  • closes #xxxx
  • tests added / passed
  • passes git diff upstream/master -u -- "*.py" | flake8 --diff
  • whatsnew entry
################################################################################
############################ Docstring (pandas.cut) ############################
################################################################################

Return indices of half-open `bins` to which each value of `x` belongs.

Use `cut` when you need to segment and sort data values into bins or
buckets of data. This function is also useful for going from a continuous
variable to a categorical variable. For example, `cut` could convert ages
to groups of age ranges.

Parameters
----------
x : array-like
    Input array to be binned. It has to be 1-dimensional.
bins : int, sequence of scalars, or pandas.IntervalIndex
    If `bins` is an int, defines the number of equal-width bins in the
    range of `x`. The range of `x` is extended by .1% on each side to
    include the min or max values of `x`.
    If `bins` is a sequence, defines the bin edges allowing for
    non-uniform bin width. No extension of the range of `x` is done.
right : bool, optional, default 'True'
    Indicates whether the `bins` include the rightmost edge or not. If
    `right == True` (the default), then the `bins` [1,2,3,4] indicate
    (1,2], (2,3], (3,4].
labels : array or bool, optional
    Used as labels for the resulting `bins`. Must be of the same length as
    the resulting `bins`. If False, returns only integer indicators of the
    `bins`.
retbins : bool, optional, default 'False'
    Whether to return the `bins` or not. Useful when `bins` is provided
    as a scalar.
precision : int, optional, default '3'
    The precision at which to store and display the `bins` labels.
include_lowest : bool, optional, default 'False'
    Whether the first interval should be left-inclusive or not.

Returns
-------
out : pandas.Categorical or Series, or array of int if `labels` is 'False'
    The return type depends on the input.
    If the input is a Series, a Series of type category is returned.
    Else - pandas.Categorical is returned. `Bins` are represented as
    categories when categorical data is returned.
bins : numpy.ndarray of floats
    Returned only if `retbins` is 'True'.

See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank
    or based on sample quantiles.
pandas.Categorical : Represents a categorical variable in
    classic R / S-plus fashion.
Series : One-dimensional ndarray with axis labels (including time series).
pandas.IntervalIndex : Immutable Index implementing an ordered,
    sliceable set. IntervalIndex represents an Index of intervals that
    are all closed on the same side.

Notes
-----
Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting pandas.Categorical object.

Examples
--------
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
... # doctest: +ELLIPSIS
([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], ...
Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] ...

>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
...        3, labels=["good", "medium", "bad"])
... # doctest: +SKIP
[good, good, good, medium, bad, good]
Categories (3, object): [good < medium < bad]

>>> pd.cut(np.ones(5), 4, labels=False)
array([1, 1, 1, 1, 1], dtype=int64)

################################################################################
################################## Validation ##################################
################################################################################

Docstring for "pandas.cut" correct. :)
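
As an illustration of the `right`, `include_lowest`, and out-of-bounds behaviour described in the docstring above, here is a minimal sketch. The sample values and variable names are made up for this note, and `labels=False` is used only so the integer bin codes are easy to read.

```python
import numpy as np
import pandas as pd

# Made-up sample data; edges match the docstring's [1, 2, 3, 4] example.
values = np.array([1, 2, 3, 4, 10])
edges = [1, 2, 3, 4]

# Default right=True gives (1, 2], (2, 3], (3, 4].
# 1 sits on the open left edge and 10 is out of bounds, so both become NaN.
right_closed = pd.cut(values, edges, labels=False)
print(right_closed)  # codes: NaN, 0, 1, 2, NaN

# include_lowest=True makes the first interval left-inclusive, so 1 lands in bin 0.
with_lowest = pd.cut(values, edges, labels=False, include_lowest=True)
print(with_lowest)  # codes: 0, 0, 1, 2, NaN

# right=False gives [1, 2), [2, 3), [3, 4); now 4 falls outside every bin.
left_closed = pd.cut(values, edges, labels=False, right=False)
print(left_closed)  # codes: 0, 1, 2, NaN, NaN
```

Because NaN marks the unbinned values, the returned code arrays are floating point rather than integer in these three calls.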

Member

@jorisvandenbossche jorisvandenbossche left a comment

Thanks for the PR! Already added a few comments, will take a closer look later.

For the examples, can you start with a default example, i.e. one not using retbins=True? After that you can explicitly show the difference when you pass retbins=True. I would also show an example with a Series to illustrate the return type explanation.
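
One possible ordering of the examples requested here, sketched with the same sample array the docstring already uses. This is only an illustration of the suggestion, not the wording that ended up in pandas, and the variable names are assumptions.

```python
import numpy as np
import pandas as pd

data = np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1])

# 1. Default call: an array input gives back a pandas.Categorical.
binned = pd.cut(data, 3)
print(type(binned).__name__)

# 2. retbins=True additionally returns the computed bin edges
#    as a NumPy array (4 edges for 3 bins).
binned_again, bin_edges = pd.cut(data, 3, retbins=True)
print(len(bin_edges))

# 3. A Series input gives back a Series of category dtype.
binned_series = pd.cut(pd.Series(data), 3)
print(binned_series.dtype)
```

Starting with the plain call keeps the first output short, and the `retbins=True` call then differs from it only by the extra array of edges.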

Use `cut` when you need to segment and sort data values into bins or
buckets of data. This function is also useful for going from a continuous
variable to a categorical variable. For example, `cut` could convert ages
to groups of age ranges.
Member

Nice explanation!
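
The age example from that paragraph could look roughly like the following. The ages, bin edges, and group names are invented for illustration and do not come from the PR.

```python
import pandas as pd

# Hypothetical ages and age ranges, chosen only for this sketch.
ages = pd.Series([3, 17, 25, 42, 68, 90])
age_edges = [0, 18, 35, 65, 120]
age_labels = ["child", "young adult", "adult", "senior"]

# With the default right=True the bins are (0, 18], (18, 35], (35, 65], (65, 120].
groups = pd.cut(ages, bins=age_edges, labels=age_labels)
print(groups.tolist())
# ['child', 'child', 'young adult', 'adult', 'senior', 'senior']
```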

@@ -24,53 +24,64 @@
 def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
         include_lowest=False):
     """
-    Return indices of half-open bins to which each value of `x` belongs.
+    Return indices of half-open `bins` to which each value of `x` belongs.
Member

I know it was already there, but I am wondering if we can make this first sentence better, because I have to read it very carefully to actually understand it :)

Some ideas:

include the min or max values of `x`.
If `bins` is a sequence, defines the bin edges allowing for
non-uniform bin width. No extension of the range of `x` is done.
right : bool, optional, default 'True'
Member

You can leave out the "optional" here (it is indeed optional to specify it, but since it has a default value it is not purely optional).

(The same goes for the ones below where you have both 'optional' and 'default ...'.)
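
A sketch of how the parameter entries might read with "optional" dropped. `cut_stub` is a hypothetical placeholder function, not pandas code, and writing the default values without quotes is an assumption based on common numpydoc style rather than something requested in this review.

```python
def cut_stub(x, bins, right=True, retbins=False, precision=3,
             include_lowest=False):
    """
    Hypothetical stub that only demonstrates the parameter formatting.

    Parameters
    ----------
    x : array-like
        Input array to be binned.
    bins : int, sequence of scalars, or pandas.IntervalIndex
        Number of bins or explicit bin edges.
    right : bool, default True
        Whether the bins include the rightmost edge.
    retbins : bool, default False
        Whether to return the bins.
    precision : int, default 3
        Precision at which to store and display the bin labels.
    include_lowest : bool, default False
        Whether the first interval should be left-inclusive.
    """
    raise NotImplementedError("formatting illustration only")


print(cut_stub.__doc__)
```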

@codecov

codecov bot commented Mar 9, 2018

Codecov Report

Merging #20069 into master will increase coverage by 0.02%.
The diff coverage is n/a.

@@            Coverage Diff             @@
##           master   #20069      +/-   ##
==========================================
+ Coverage   91.69%   91.72%   +0.02%     
==========================================
  Files         150      150              
  Lines       49112    49112              
==========================================
+ Hits        45035    45047      +12     
+ Misses       4077     4065      -12

Flag        Coverage Δ
#multiple   90.1% <ø> (+0.02%) ⬆️
#single     41.86% <ø> (ø) ⬆️

Impacted Files                  Coverage Δ
pandas/core/reshape/tile.py     92.94% <ø> (ø) ⬆️
pandas/plotting/_converter.py   66.81% <0%> (+1.73%) ⬆️

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 1d73cf3...db337c1.

@pep8speaks

Hello @ikoevska! Thanks for updating the PR.

Line 1159:15: W291 trailing whitespace

@ikoevska
Contributor Author

I am closing this one as it got totally messed up after a rebase. I will open a new one with @jorisvandenbossche's comments applied.

@jorisvandenbossche
Member

@ikoevska For future reference, please update the existing PR rather than opening a new one. Even if you make a new branch locally, you can force-push to the same branch on your fork, and the PR gets updated. But no problem this time!

@ikoevska deleted the patch-1 branch on June 9, 2018, 11:34