ENH: Declare a BoolBlock as a NumericBlock #3162
This will break on certain operations which I know are not tested right now, specifically grouping on a bool type. In addition, most numeric operations do not make sense on bool types, e.g. what if you add three Trues? Let me take a look tomorrow.
If you can give me an example of a command that doesn't work properly with this change, I'd be happy to update this change to make both cases work properly and write tests for them. I really don't want to break any existing functionality, only make the explanatory commands on dataframes work more seamlessly for boolean values.
The following is with your change
At the very least these are confusing
This gives wrong results
To be fair these are also 'broken' in master as well
I'm confused as to how the result of the df.T.describe() is wrong? This As for the other operations, I don't think any of those are broken or give
And what python does:
I think it is just better to just say in pandas (because it is the case in I especially think the output of df.describe() above is an improvement, as -Dan On Mon, Mar 25, 2013 at 4:52 AM, jreback [email protected] wrote:
On second thought, you are right: better to have describe include bool types.
Updated commit.
@wesm this is OK by me. Any objections?
Can we get a unit test that illustrates how behavior has changed?
I'm also +1 on getting this in 0.11.
The only change in behavior (afaik) is that in mixed dataframes, boolean columns now appear when trying to do an operation on numeric data (.describe, .mean, .corr, etc), whereas previously they would not have. So in this most recent update I:
Note: the values returned by .describe() for a boolean column didn't change in any way, only the fact that if you call .describe() on a mixed dataframe they are now included.
@danbirken looks great! Can you set up Travis (if you update to master and then rebase, it will force an update)? Then we can merge. Thanks.
BUG: GH2641 fixes "df.decribe() with boolean column" This change will make all numeric operations on boolean data work, by transparently treating them as the integer values 1 and 0. This is not pandas-specific behavior; it is the default behavior of both numpy and python.
Done.
ENH: Declare a BoolBlock as a NumericBlock
BUG: #2641 fixes "df.decribe() with boolean column"
Numpy refers to a boolean type as a "numerical type", and both Python
and numpy cast True bools into the value 1 and False bools into the
value 0, so all numpy numerical operations always work.
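As a small illustration of the casting rule described above, here is a sketch in plain Python only (no numpy or pandas, and not code from this PR). It also answers the earlier question about adding three Trues. The variable names are made up for the example.

```python
# Plain Python: bool is a subclass of int, so True behaves as 1 and False as 0.
flags = [True, True, True]

is_int_subclass = issubclass(bool, int)

# Adding three Trues gives the integer 3.
total = sum(flags)

# The mean of a boolean list is the fraction of True values.
fraction_true = sum([False, False, True]) / 3

print(is_int_subclass, total, fraction_true)
```

Read this way, the sum of a boolean column is a count of True values and its mean is a proportion, which is why summary statistics on bools can be meaningful.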
This basically is to solve this issue, which I always found a bit puzzling:
import pandas
df = pandas.DataFrame({
'int1': [1, 2, 3],
'bool1': [False, False, True],
'bool2': [True, True, False],
})
print(df.corr())
print(df[['bool1', 'bool2']].corr())
After the change:
print(df.corr())
This also applies to quite a few other numeric operations on dataframes. When the dataframe is of mixed type, these operations default to using "is_numeric" to decide which columns to operate on.
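Which columns pandas picks by default for these operations has varied between versions, so the sketch below makes the selection explicit rather than relying on the internal "is_numeric" flag. It is not the implementation in this PR: `corr_including_bools` is a hypothetical helper name, and it assumes a pandas version that provides `DataFrame.select_dtypes`.

```python
import pandas as pd


def corr_including_bools(frame):
    """Hypothetical helper: correlate numeric and boolean columns together."""
    # Keep numeric and boolean columns; drop everything else (e.g. strings).
    selected = frame.select_dtypes(include=["number", "bool"])
    # Cast explicitly so that True becomes 1.0 and False becomes 0.0.
    return selected.astype(float).corr()


df = pd.DataFrame({
    'int1': [1, 2, 3],
    'bool1': [False, False, True],
    'bool2': [True, True, False],
})

corr = corr_including_bools(df)
print(corr)
```

With this data, `bool1` and `bool2` are exact complements, so their correlation comes out as -1, and all three columns appear in the result.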
I'm not sure how deep the rabbit hole goes for this change and how much stuff it might affect, but all of the tests pass (after of course editing the one that specifically tested this functionality). If there are other potential issues I'd be happy to look into them and make other related fixes.