DOC: Update pandas.cut docstring #20104

Merged
merged 7 commits into from
Mar 16, 2018
Changes from 2 commits
106 changes: 69 additions & 37 deletions pandas/core/reshape/tile.py
@@ -26,60 +26,81 @@
def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
include_lowest=False):
"""
Return indices of half-open bins to which each value of `x` belongs.
Bin `x` and return data about the bin to which each `x` value belongs.
Member

I find "return data about the bin" a bit "wordy" without saying much.
I suppose you want to be general because the return type can be either the bins, or the indices indexing into the bins? But I won't care about this: the summary line should be short and give an idea of the main use case. The extended summary can go more into detail (as you already do).

Starting from the above, I would cut that down to the essentials, just "Bin x", but that is maybe too short :-)
What about "Convert continuous values into discrete bins", "Discretize values in specified bins", or "Bin values in specified intervals"? I am not sure which wording is best.
@TomAugspurger any input?

Contributor

Yes, this should be simplified. "Bin values into discrete intervals"?

I don't 100% like "in specified intervals" as you don't have to specify the intervals.

Member

"Bin values into discrete intervals"?

+1, perfect combination of all my attempts :-)


This function splits `x` into the specified number of equal-width half-
Contributor

Can remove "function".

You say 'data' above and use 'information' here. Pick one.

open bins. Based on the parameters specified and the input, returns
information about the half-open bins to which each value of `x` belongs
or the bins themselves.
Contributor

if retbins=True

Contributor Author

@ikoevska ikoevska Mar 12, 2018

I disagree with this suggested change. Currently, the complete sentence begins with "Based on the parameters specified and the input", which points the reader to the parameters and other information below. If I add "if retbins=True", I will be pressed to explain all the other options in the extended description as well. The sentence would also become far too long, which would hurt readability.

Use `cut` when you need to segment and sort data values into bins. This
function is also useful for going from a continuous variable to a
categorical variable. For example, `cut` could convert ages to groups
of age ranges.
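
As a rough illustration of the ages example mentioned in the docstring text above: the bin edges and the label names below are made up for this sketch and are not part of the PR.

```python
import pandas as pd

ages = pd.Series([4, 17, 30, 64, 71])

# Hypothetical edges and labels, chosen only for illustration.
# With the default right=True the bins are (0, 18], (18, 65], (65, 120].
age_groups = pd.cut(ages, bins=[0, 18, 65, 120],
                    labels=["child", "adult", "senior"])

print(age_groups.tolist())
```

Because the input is a Series, the result is a Series of dtype category with one label per age.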

Parameters
----------
x : array-like
Input array to be binned. It has to be 1-dimensional.
bins : int, sequence of scalars, or IntervalIndex
If `bins` is an int, it defines the number of equal-width bins in the
range of `x`. However, in this case, the range of `x` is extended
by .1% on each side to include the min or max values of `x`. If
`bins` is a sequence it defines the bin edges allowing for
non-uniform bin width. No extension of the range of `x` is done in
this case.
right : bool, optional
Indicates whether the bins include the rightmost edge or not. If
right == True (the default), then the bins [1,2,3,4] indicate
The input array to be binned. Must be 1-dimensional.
bins : int, sequence of scalars, or pandas.IntervalIndex
If int, defines the number of equal-width bins in the range of `x`.
The range of `x` is extended by .1% on each side to include the min or
max values of `x`.
If a sequence, defines the bin edges allowing for non-uniform width.
Contributor

Could format these as

* int : the number of equal-...
* sequence : integers defining the bin edges (refer to `right` parameter.)
* IntervalIndex : sequence of Intervals to use for binning...
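
A small sketch of the difference between the int and sequence cases listed above, reusing the array from the docstring examples in this diff. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

x = np.array([1, 7, 5, 4, 6, 3])

# Integer bins: three equal-width bins. The lowest edge is moved slightly
# below min(x) so that the minimum value still lands in the first bin.
_, int_edges = pd.cut(x, 3, retbins=True)
print(int_edges)

# Sequence bins: the edges are used exactly as given, with no extension.
# With the default right=True the first bin is (1, 3], so the value 1
# is not covered by any bin and comes back as NaN.
by_edges = pd.cut(x, [1, 3, 5, 7])
print(by_edges)
```

Passing `include_lowest=True` in the sequence case would make the first interval left-inclusive, so that the value 1 is binned as well.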

Contributor

What happens with right and IntervalIndex? Is it ignored?
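
One way to check this question empirically is the sketch below. It assumes a pandas version where `pd.IntervalIndex.from_tuples` is available; the intervals and values are hypothetical. In the versions I am aware of, `right` has no effect when `bins` is an IntervalIndex, because the closed side comes from the intervals themselves.

```python
import pandas as pd

# Hypothetical, non-overlapping intervals; closed on the right by default.
bins = pd.IntervalIndex.from_tuples([(0, 1), (2, 3), (4, 5)])
values = [0, 1, 2.5, 3.5]

default = pd.cut(values, bins)
right_false = pd.cut(values, bins, right=False)

# Codes of -1 mark values that fall in no interval (0 and 3.5 here).
print(list(default.codes))
print(list(right_false.codes))
```

If the two printed lists match, `right` was not applied to the IntervalIndex bins.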

No extension of the range of `x` is done.
right : bool, default 'True'
Member

similar comment here for True -> no quotes (makes it look like a string)

Indicates whether the `bins` include the rightmost edge or not. If
`right == True` (the default), then the `bins` [1,2,3,4] indicate
(1,2], (2,3], (3,4].
labels : array or boolean, default None
Used as labels for the resulting bins. Must be of the same length as
the resulting bins. If False, return only integer indicators of the
labels : array or bool, optional
Specifies the labels for the returned bins. Must be the same length as
the resulting bins. If False, returns only integer indicators of the
bins.
retbins : bool, optional
Whether to return the bins or not. Can be useful if bins is given
retbins : bool, default 'False'
Member

similar, no quotes

Whether to return the bins or not. Useful when bins is provided
as a scalar.
precision : int, optional
The precision at which to store and display the bins labels
include_lowest : bool, optional
precision : int, default '3'
Contributor

No quotes around '3'. Quotes makes it look like a string.

The precision at which to store and display the bins labels.
include_lowest : bool, default 'False'
Contributor

I think a single backtick around False, or just False. Not sure.

Member

No quoting or backticks in this case

Whether the first interval should be left-inclusive or not.

Returns
-------
out : Categorical or Series or array of integers if labels is False
The return type (Categorical or Series) depends on the input: a Series
of type category if input is a Series else Categorical. Bins are
represented as categories when categorical data is returned.
bins : ndarray of floats
Returned only if `retbins` is True.
out : pandas.Categorical or Series, or array of int if `labels` is 'False'
Contributor

I think something like

out : pandas.Categorical, Series, or ndarray
    An array-like object representing the bin for each value of `x`.
    The type depends on the value of `labels`.

    * True : returns a Series for Series `x` or a Categorical for Categorical `x`.
    * False : returns an ndarray of integers.
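
To make the return types concrete, here is a minimal sketch using the values from the Series example in this diff. The variable names are assumptions for illustration. Note that a plain ndarray input (not only a Categorical) yields a Categorical.

```python
import numpy as np
import pandas as pd

arr = np.array([2, 4, 6, 8, 10])
ser = pd.Series(arr, index=list("abcde"))

from_array = pd.cut(arr, 3)                 # ndarray in, Categorical out
from_series = pd.cut(ser, 3)                # Series in, Series (category) out
codes_only = pd.cut(arr, 3, labels=False)   # ndarray in, integer codes out

print(type(from_array).__name__)
print(type(from_series).__name__, from_series.dtype)
print(type(codes_only).__name__, codes_only.tolist())
```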

The return type depends on the input.
If the input is a Series, a Series of type category is returned.
Contributor

not sure about the formatting here, @TomAugspurger ?

Contributor Author

@TomAugspurger, still waiting for more feedback on that. Thanks!

Else - pandas.Categorical is returned. Bins are represented as
categories when categorical data is returned.
bins : numpy.ndarray of floats
Returned when `retbins` is 'True'.

See Also
--------
qcut : Discretize variable into equal-sized buckets based on rank
Member

I suppose for qcut the bins will not be equal sized given it is based on quantiles?
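
A quick sketch of the distinction, on made-up skewed data: `cut` produces bins of (roughly) equal width, while `qcut` produces bins holding (roughly) equal numbers of observations, so its bin widths generally differ.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one large outlier.
data = [1, 2, 3, 4, 5, 100]

# labels=False returns the integer bin index for each value.
cut_codes = pd.cut(data, 2, labels=False)
qcut_codes = pd.qcut(data, 2, labels=False)

# Count how many values land in each of the two bins.
cut_counts = np.bincount(cut_codes, minlength=2)
qcut_counts = np.bincount(qcut_codes, minlength=2)

print(cut_counts.tolist())
print(qcut_counts.tolist())
```

With equal-width bins the outlier ends up alone in the upper bin, whereas the quantile-based split puts half of the values in each bin.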

or based on sample quantiles.
pandas.Categorical : Represents a categorical variable in
classic R / S-plus fashion.
Member

You copied this from Categorical, so that is fine, but I think we should really change this explanation to something not referring to R, many of our users are learning pandas without knowing R :)

Series : One-dimensional ndarray with axis labels (including time series).
Contributor

I don't think Series is needed here.

Contributor Author

As this is something that you can input or get as an output from the command, I think this needs to be here. There is also an example with series below.

pandas.IntervalIndex : Immutable Index implementing an ordered,
sliceable set. IntervalIndex represents an Index of intervals that
are all closed on the same side.

Notes
-----
The `cut` function can be useful for going from a continuous variable to
a categorical variable. For example, `cut` could convert ages to groups
of age ranges.

Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting Categorical object

Any NA values will be NA in the result. Out of bounds values will be NA in
the resulting pandas.Categorical object.
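
The NA behaviour described in the note above can be sketched as follows. The input values and bin edges are invented for the example.

```python
import numpy as np
import pandas as pd

# One missing value and one value (9.0) beyond the last bin edge.
x = np.array([0.5, np.nan, 2.5, 9.0])
result = pd.cut(x, [0, 1, 2, 3])

# Both the missing input and the out-of-bounds input come back as NaN.
print(pd.isna(result).tolist())
```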

Examples
--------
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True)
>>> pd.cut(np.array([1,7,5,4,6,3]), 3)
Contributor

PEP8 on all these (spaces after ,)

... # doctest: +ELLIPSIS
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...

>>> pd.cut(np.array([1,7,5,4,6,3]), 3, retbins=True)
... # doctest: +ELLIPSIS
([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], ...
Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] ...
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ...
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ...
array([0.994, 3. , 5. , 7. ]))

>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]),
... 3, labels=["good", "medium", "bad"])
@@ -88,7 +109,18 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3,
Categories (3, object): [good < medium < bad]

>>> pd.cut(np.ones(5), 4, labels=False)
Contributor

specify dtype='int64' to np.ones (this is normally a platform int)

array([1, 1, 1, 1, 1])
array([1, 1, 1, 1, 1], dtype=int64)

>>> s = pd.Series(np.array([2,4,6,8,10]), index=['a', 'b', 'c', 'd', 'e'])
>>> pd.cut(s, 3)
... # doctest: +ELLIPSIS
a (1.992, 4.667]
b (1.992, 4.667]
c (4.667, 7.333]
d (7.333, 10.0]
e (7.333, 10.0]
dtype: category
Categories (3, interval[float64]): [(1.992, 4.667] < (4.667, ...
"""
# NOTE: this binning code is changed a bit from histogram for var(x) == 0