DOC: Update pandas.cut docstring #20104
Changes from 2 commits
e50beb7
49e002f
d24c749
44861a6
1f3caf6
544af0e
f22e45f
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -26,60 +26,81 @@ | |
def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | ||
include_lowest=False): | ||
""" | ||
Return indices of half-open bins to which each value of `x` belongs. | ||
Bin `x` and return data about the bin to which each `x` value belongs. | ||
|
||
This function splits `x` into the specified number of equal-width half- | ||
Review comment: Can remove function. You say 'data' above and use 'information' here. Pick one. |
||
open bins. Based on the parameters specified and the input, returns | ||
information about the half-open bins to which each value of `x` belongs | ||
or the bins themselves. | ||
Review comment: if
Reply: I disagree with this suggested change. Currently, the complete sentence begins with "Based on the parameters specified and the input", which points to looking at the parameters and other information below. If I add "if retbins=True", I will be pressed to explain all other options in the extended description as well. The sentence would also become far too long and therefore harder to read. |
||
Use `cut` when you need to segment and sort data values into bins. This | ||
function is also useful for going from a continuous variable to a | ||
categorical variable. For example, `cut` could convert ages to groups | ||
of age ranges. | ||
|
||
Parameters | ||
---------- | ||
x : array-like | ||
Input array to be binned. It has to be 1-dimensional. | ||
bins : int, sequence of scalars, or IntervalIndex | ||
If `bins` is an int, it defines the number of equal-width bins in the | ||
range of `x`. However, in this case, the range of `x` is extended | ||
by .1% on each side to include the min or max values of `x`. If | ||
`bins` is a sequence it defines the bin edges allowing for | ||
non-uniform bin width. No extension of the range of `x` is done in | ||
this case. | ||
right : bool, optional | ||
Indicates whether the bins include the rightmost edge or not. If | ||
right == True (the default), then the bins [1,2,3,4] indicate | ||
The input array to be binned. Must be 1-dimensional. | ||
bins : int, sequence of scalars, or pandas.IntervalIndex | ||
If int, defines the number of equal-width bins in the range of `x`. | ||
The range of `x` is extended by .1% on each side to include the min or | ||
max values of `x`. | ||
If a sequence, defines the bin edges allowing for non-uniform width. | ||
Review comment: Could format these as
Review comment: What happens with |
||
No extension of the range of `x` is done. | ||
right : bool, default 'True' | ||
Review comment: Similar comment here for True -> no quotes (makes it look like a string) |
||
Indicates whether the `bins` include the rightmost edge or not. If | ||
`right == True` (the default), then the `bins` [1,2,3,4] indicate | ||
(1,2], (2,3], (3,4]. | ||
labels : array or boolean, default None | ||
Used as labels for the resulting bins. Must be of the same length as | ||
the resulting bins. If False, return only integer indicators of the | ||
labels : array or bool, optional | ||
Specifies the labels for the returned bins. Must be the same length as | ||
the resulting bins. If False, returns only integer indicators of the | ||
bins. | ||
retbins : bool, optional | ||
Whether to return the bins or not. Can be useful if bins is given | ||
retbins : bool, default 'False' | ||
Review comment: Similar, no quotes |
||
Whether to return the bins or not. Useful when bins is provided | ||
as a scalar. | ||
precision : int, optional | ||
The precision at which to store and display the bins labels | ||
include_lowest : bool, optional | ||
precision : int, default '3' | ||
Review comment: No quotes around |
||
The precision at which to store and display the bins labels. | ||
include_lowest : bool, default 'False' | ||
Review comment: I think a single backtick around
Reply: No quoting or backticks in this case |
||
Whether the first interval should be left-inclusive or not. | ||
|
||
Returns | ||
------- | ||
out : Categorical or Series or array of integers if labels is False | ||
The return type (Categorical or Series) depends on the input: a Series | ||
of type category if input is a Series else Categorical. Bins are | ||
represented as categories when categorical data is returned. | ||
bins : ndarray of floats | ||
Returned only if `retbins` is True. | ||
out : pandas.Categorical or Series, or array of int if `labels` is 'False' | ||
Review comment: I think something like
|
||
The return type depends on the input. | ||
If the input is a Series, a Series of type category is returned. | ||
Review comment: Not sure about the formatting here, @TomAugspurger?
Reply: @TomAugspurger, still waiting for more feedback on that. Thanks! |
||
Else - pandas.Categorical is returned. Bins are represented as | ||
categories when categorical data is returned. | ||
bins : numpy.ndarray of floats | ||
Returned when `retbins` is 'True'. | ||
|
||
See Also | ||
-------- | ||
qcut : Discretize variable into equal-sized buckets based on rank | ||
Review comment: I suppose for qcut the bins will not be equal sized given it is based on quantiles? |
||
or based on sample quantiles. | ||
pandas.Categorical : Represents a categorical variable in | ||
classic R / S-plus fashion. | ||
Review comment: You copied this from Categorical, so that is fine, but I think we should really change this explanation to something not referring to R; many of our users are learning pandas without knowing R :) |
||
Series : One-dimensional ndarray with axis labels (including time series). | ||
Review comment: I don't think Series is needed here
Reply: As this is something that you can pass as input or get as output from the function, I think this needs to be here. There is also an example with a Series below. |
||
pandas.IntervalIndex : Immutable Index implementing an ordered, | ||
sliceable set. IntervalIndex represents an Index of intervals that | ||
are all closed on the same side. | ||
|
||
Notes | ||
----- | ||
The `cut` function can be useful for going from a continuous variable to | ||
a categorical variable. For example, `cut` could convert ages to groups | ||
of age ranges. | ||
|
||
Any NA values will be NA in the result. Out of bounds values will be NA in | ||
the resulting Categorical object | ||
|
||
Any NA values will be NA in the result. Out of bounds values will be NA in | ||
the resulting pandas.Categorical object. | ||
|
||
Examples | ||
-------- | ||
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True) | ||
>>> pd.cut(np.array([1,7,5,4,6,3]), 3) | ||
Review comment: PEP8 on all these (spaces after |
||
... # doctest: +ELLIPSIS | ||
[(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ... | ||
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ... | ||
|
||
>>> pd.cut(np.array([1,7,5,4,6,3]), 3, retbins=True) | ||
... # doctest: +ELLIPSIS | ||
([(0.19, 3.367], (0.19, 3.367], (0.19, 3.367], (3.367, 6.533], ... | ||
Categories (3, interval[float64]): [(0.19, 3.367] < (3.367, 6.533] ... | ||
([(0.994, 3.0], (5.0, 7.0], (3.0, 5.0], (3.0, 5.0], (5.0, 7.0], ... | ||
Categories (3, interval[float64]): [(0.994, 3.0] < (3.0, 5.0] ... | ||
array([0.994, 3. , 5. , 7. ])) | ||
|
||
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), | ||
... 3, labels=["good", "medium", "bad"]) | ||
|
@@ -88,7 +109,18 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | |
Categories (3, object): [good < medium < bad] | ||
|
||
>>> pd.cut(np.ones(5), 4, labels=False) | ||
Review comment: specify |
||
array([1, 1, 1, 1, 1]) | ||
array([1, 1, 1, 1, 1], dtype=int64) | ||
|
||
>>> s = pd.Series(np.array([2,4,6,8,10]), index=['a', 'b', 'c', 'd', 'e']) | ||
>>> pd.cut(s, 3) | ||
... # doctest: +ELLIPSIS | ||
a (1.992, 4.667] | ||
b (1.992, 4.667] | ||
c (4.667, 7.333] | ||
d (7.333, 10.0] | ||
e (7.333, 10.0] | ||
dtype: category | ||
Categories (3, interval[float64]): [(1.992, 4.667] < (4.667, ... | ||
""" | ||
# NOTE: this binning code is changed a bit from histogram for var(x) == 0 | ||
|
||
|
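For reference, a minimal sketch of the `bins` behaviour the docstring describes, assuming `numpy` and `pandas` are installed. The input values come from the docstring example; the explicit edges `[2, 4, 6]` and the variable names are made up for illustration and are not part of the PR.

```python
import numpy as np
import pandas as pd

values = np.array([1, 7, 5, 4, 6, 3])

# Integer `bins`: three equal-width bins over the range of `values`.
# labels=False returns integer bin indicators instead of a Categorical,
# and retbins=True additionally returns the computed bin edges.
codes, edges = pd.cut(values, 3, labels=False, retbins=True)
print(codes)  # [0 2 1 1 2 0]
print(edges)  # lowest edge sits slightly below 1 so that the minimum is included

# Sequence `bins` (illustrative edges): the range is not extended,
# so values outside (2, 6] come back as NaN.
clipped = pd.cut(values, [2, 4, 6], labels=False)
print(clipped)  # [nan nan  1.  0.  1.  0.]
```

With `right=True` (the default) each bin is closed on the right, which is why `3` falls in the first bin and `5` in the second.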
Review comment: I find "return data about the bin" a bit "wordy" without saying much.
I suppose you want to be general because the return type can be either the bins, or the indices indexing into the bins? But I won't care about this: the summary line should be short and give an idea of the main use case. The extended summary can go more into detail (as you already do).
Starting from the above, I would cut that to the essential and just "Bin x", but that is maybe too short :-)
What about "Convert continuous values into discrete bins" or "Discretize values in specified bins" or "Bin values in specified intervals", but not sure what is the best wording.
@TomAugspurger any input?
Reply: Yes, this should be simplified. "Bin values into discrete intervals"?
I don't 100% like "in specified intervals" as you don't have to specify the intervals.
Reply: +1, perfect combination of all my attempts :-)
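To make the proposed summary line concrete, here is a small sketch of the age-group use case mentioned in the extended summary, assuming `pandas` is installed. The ages, the edges `[0, 18, 65, 120]`, and the label names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical ages indexed by letter.
ages = pd.Series([5, 17, 30, 64, 80], index=list("abcde"))

# Bin the continuous ages into three discrete, labelled intervals:
# (0, 18] -> "child", (18, 65] -> "adult", (65, 120] -> "senior".
groups = pd.cut(ages, bins=[0, 18, 65, 120], labels=["child", "adult", "senior"])
print(groups)
```

Because the input is a Series, the result is a Series of dtype `category` that keeps the original index, as the Returns section describes.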