-
-
Notifications
You must be signed in to change notification settings - Fork 18.5k
DOC: Update pandas.cut docstring #20104
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
Did you already update according to my comments? |
Nope, updating right now. :) |
@jorisvandenbossche Updated based on your comments and added some more examples. I've also updated the validation script output in the description of the PR. |
pandas/core/reshape/tile.py
Outdated
Return indices of half-open bins to which each value of `x` belongs. | ||
Bin `x` and return data about the bin to which each `x` value belongs. | ||
|
||
This function splits `x` into the specified number of equal-width half- |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can remove function.
you say 'data' above and use 'information' here. pick one.
pandas/core/reshape/tile.py
Outdated
This function splits `x` into the specified number of equal-width half- | ||
open bins. Based on the parameters specified and the input, returns | ||
information about the half-open bins to which each value of `x` belongs | ||
or the bins themselves. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if retbins=True
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I disagree with this suggested change. Currently, the complete sentence begins with "Based on the parameters specified and the input" which points to looking at parameters and other information below. If I add "if retbins=True", I will be pressed to explain all other options in the extended description as well. Not to mention that the sentence will become way too long and hence with reduced readability.
pandas/core/reshape/tile.py
Outdated
Returned only if `retbins` is True. | ||
out : pandas.Categorical or Series, or array of int if `labels` is 'False' | ||
The return type depends on the input. | ||
If the input is a Series, a Series of type category is returned. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure about the formatting here, @TomAugspurger ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@TomAugspurger, still waiting for more feedback on that. Thanks!
pandas/core/reshape/tile.py
Outdated
or based on sample quantiles. | ||
pandas.Categorical : Represents a categorical variable in | ||
classic R / S-plus fashion. | ||
Series : One-dimensional ndarray with axis labels (including time series). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I dont' think Series is needed here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As this is something that you can input or get as an output from the command, I think this needs to be here. There is also an example with series below.
pandas/core/reshape/tile.py
Outdated
@@ -88,7 +109,18 @@ def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | |||
Categories (3, object): [good < medium < bad] | |||
|
|||
>>> pd.cut(np.ones(5), 4, labels=False) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
specify dtype='int64'
to np.ones (this is normally a platform int)
Codecov Report
@@ Coverage Diff @@
## master #20104 +/- ##
==========================================
+ Coverage 91.7% 91.72% +0.02%
==========================================
Files 150 150
Lines 49165 49156 -9
==========================================
+ Hits 45087 45090 +3
+ Misses 4078 4066 -12
Continue to review full report at Codecov.
|
pandas/core/reshape/tile.py
Outdated
represented as categories when categorical data is returned. | ||
bins : ndarray of floats | ||
Returned only if `retbins` is True. | ||
out : pandas.Categorical or Series, or array of int if `labels` is 'False' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think something liek
out : pandas.Categorical, Series, or ndarray
An array-like object representing the bin for each value of `x`.
The type depends on the value of `labels`.
* True : returns a Series for Series `x` or a Categorical for Categorical `x`.
* False : returns an ndarray of integers.
Looks like some git issues @ikoevska. LMK if you need some help. |
Did you have any changes in 2dffb60? If not, I think
should do it. It'll throw away changes from that commit though. |
Yep, I did a sync with the upstream and then a rebase on master and this happened. Suggestions how to fix it? UPDATE: @TomAugspurger that fixed it, thanks so much! |
2dffb60
to
44861a6
Compare
Hello @ikoevska! Thanks for updating the PR. Cheers ! There are no PEP8 issues in this Pull Request. 🍻 Comment last updated on March 15, 2018 at 20:28 Hours UTC |
@jreback Can you take a look at the latest changes and my comments? Thanks! |
pandas/core/reshape/tile.py
Outdated
precision : int, optional | ||
The precision at which to store and display the bins labels | ||
include_lowest : bool, optional | ||
precision : int, default '3' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No quotes around '3'
. Quotes makes it look like a string.
pandas/core/reshape/tile.py
Outdated
If int, defines the number of equal-width bins in the range of `x`. | ||
The range of `x` is extended by .1% on each side to include the min or | ||
max values of `x`. | ||
If a sequence, defines the bin edges allowing for non-uniform width. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could format these as
* int : the number of equal-...
* sequence : integers defining the bin edges (refer to \`right\` parameter.)
* IntervalIndex : sequence of Intervals to use for binning...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happens with right
and IntervalIndex
? Is it ignored?
pandas/core/reshape/tile.py
Outdated
include_lowest : bool, optional | ||
precision : int, default '3' | ||
The precision at which to store and display the bins labels. | ||
include_lowest : bool, default 'False' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a single backtick around False
, or just False. Not sure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No quoting or backticks in this case
pandas/core/reshape/tile.py
Outdated
|
||
Examples | ||
-------- | ||
>>> pd.cut(np.array([.2, 1.4, 2.5, 6.2, 9.7, 2.1]), 3, retbins=True) | ||
>>> pd.cut(np.array([1,7,5,4,6,3]), 3) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PEP8 on all these (spaces after ,
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you put some explanation between the different examples? (shortly explaining what the following example is doing, or how it is different from the previous one, to give some context for the reader looking at those examples)
pandas/core/reshape/tile.py
Outdated
max values of `x`. | ||
If a sequence, defines the bin edges allowing for non-uniform width. | ||
No extension of the range of `x` is done. | ||
right : bool, default 'True' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar comment here for True -> no quotes (makes it look like a string)
pandas/core/reshape/tile.py
Outdated
bins. | ||
retbins : bool, optional | ||
Whether to return the bins or not. Can be useful if bins is given | ||
retbins : bool, default 'False' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
similar, no quotes
pandas/core/reshape/tile.py
Outdated
include_lowest : bool, optional | ||
precision : int, default '3' | ||
The precision at which to store and display the bins labels. | ||
include_lowest : bool, default 'False' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No quoting or backticks in this case
pandas/core/reshape/tile.py
Outdated
a categorical variable. For example, `cut` could convert ages to groups | ||
of age ranges. | ||
* True : returns a Series for Series `x` or a pandas.Categorical for | ||
pandas.Categorial `x`. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think a categorical is returned for any array-like that is not a Series (not only for Categorical) ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, is it worth mentioning that it is a Categorical of Intervals?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For series input it's a Series with categorical dtype.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And yes, saying Categorical of Intervals somewhere is good. NOt sure if here is best though.
pandas/core/reshape/tile.py
Outdated
the resulting Categorical object | ||
See Also | ||
-------- | ||
qcut : Discretize variable into equal-sized buckets based on rank |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suppose for qcut the bins will not be equal sized given it is based on quantiles?
pandas/core/reshape/tile.py
Outdated
qcut : Discretize variable into equal-sized buckets based on rank | ||
or based on sample quantiles. | ||
pandas.Categorical : Represents a categorical variable in | ||
classic R / S-plus fashion. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You copied this from Categorical, so that is fine, but I think we should really change this explanation to something not referring to R, many of our users are learning pandas without knowing R :)
pandas/core/reshape/tile.py
Outdated
@@ -26,69 +26,104 @@ | |||
def cut(x, bins, right=True, labels=None, retbins=False, precision=3, | |||
include_lowest=False): | |||
""" | |||
Return indices of half-open bins to which each value of `x` belongs. | |||
Bin `x` and return data about the bin to which each `x` value belongs. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I find "return data about the bin" a bit "wordy" without saying much.
I suppose you want to be general because the return type can be either the bins, or the indices indexing into the bins? But I won't care about this: the summary line should be short and give an idea of the main use case. The extended summary can go more into detail (as you already do).
Starting from the above, I would cut that to the essential and just "Bin x", but that is maybe too short :-)
What about "Convert continuous values into discrete bins" or "Discretize values in specified bins" or "Bin values in specified intervals", but not sure what is the best wording.
@TomAugspurger any input?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this should be simplified. "Bin values into discrete intervals"?
I don't 100% like "in specified intervals" as you don't have to specify the intervals.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Bin values into discrete intervals"?
+1, perfect combination of all my attempts :-)
[ci skip]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated, if anyone wants to take a look.
Checklist for the pandas documentation sprint (ignore this if you are doing
an unrelated PR):
scripts/validate_docstrings.py <your-function-or-method>
git diff upstream/master -u -- "*.py" | flake8 --diff
python doc/make.py --single <your-function-or-method>
Please include the output of the validation script below between the "```" ticks:
Comment: Resubmitting #20069 after a botched rebase.