
ENH: Declare a BoolBlock as a NumericBlock #3162

Merged
merged 1 commit into from
Apr 2, 2013

danbirken
Contributor

BUG: #2641 fixes "df.describe() with boolean column"

Numpy refers to a boolean type as a "numerical type", and both Python
and numpy cast True bools into the value 1 and False bools into the
value 0, so all numpy numerical operations always work.


This basically is to solve this issue, which I always found a bit puzzling:

import pandas

df = pandas.DataFrame({
    'int1': [1, 2, 3],
    'bool1': [False, False, True],
    'bool2': [True, True, False],
})

print(df.corr())

       int1
int1     1

print(df[['bool1', 'bool2']].corr())

       bool1  bool2
bool1      1     -1
bool2     -1      1

After the change:

print(df.corr())

          bool1     bool2      int1
bool1  1.000000 -1.000000  0.866025
bool2 -1.000000  1.000000 -0.866025
int1   0.866025 -0.866025  1.000000
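
The 0.866025 entries can be checked by hand. The following is only a sketch in plain Python, not code from this PR: the `pearson` helper is a hypothetical name, and it relies on nothing more than Python treating True as 1 and False as 0 in arithmetic.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (bools count as 0/1)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


# Same columns as in the example frame above.
int1 = [1, 2, 3]
bool1 = [False, False, True]
bool2 = [True, True, False]

r_int1_bool1 = pearson(int1, bool1)
r_int1_bool2 = pearson(int1, bool2)
r_bool1_bool2 = pearson(bool1, bool2)
print(round(r_int1_bool1, 6), round(r_int1_bool2, 6), round(r_bool1_bool2, 6))
```

The three values agree with the off-diagonal entries of the matrix above.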

This also applies to quite a few other numeric operations on dataframes, which, when the dataframe has mixed types, default to using "is_numeric" to decide which columns to perform the operation on.
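
For readers who want the same column selection without depending on the internal "is_numeric" flag, here is a rough sketch using the public `select_dtypes` method. It assumes a recent pandas; the `label` column is a hypothetical addition to make the frame mixed-type, and the explicit cast to float is my own choice so the result does not depend on how a given pandas version handles bool columns inside `corr()`.

```python
import pandas as pd

df = pd.DataFrame({
    "int1": [1, 2, 3],
    "bool1": [False, False, True],
    "bool2": [True, True, False],
    "label": ["a", "b", "c"],  # hypothetical non-numeric column
})

# Keep integer/float columns plus boolean columns; drop everything else.
numeric_like = df.select_dtypes(include=["number", "bool"])

# Cast to float explicitly before computing the correlation matrix.
corr = numeric_like.astype(float).corr()
print(corr.round(6))
```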

I'm not sure how deep the rabbit hole goes for this change and how much stuff it might affect, but all of the tests pass (after of course editing the one that specifically tested this functionality). If there are other potential issues I'd be happy to look into them and make other related fixes.

@jreback
Contributor

jreback commented Mar 25, 2013

this will break on certain operations which I know are not tested right now, specifically grouping on a bool type
we are trying to preserve the types as much as possible. I know you want the example to work, but even though python/numpy treat bools as numbers (for computational ease, as that is how they are implemented),
they really are not a numeric type, nor should they be cast automatically

in addition most numeric operations do not make sense on bool types, e.g. what if you add three Trues
you're free to astype of course

let me take a look tomorrow
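
For reference, a small sketch of the two points above: what adding three Trues gives in plain Python, and the explicit `astype` opt-in. The series name and values are made up for illustration.

```python
import pandas as pd

# Plain Python: bool is a subclass of int, so three Trues add up to 3.
three_trues = True + True + True

flags = pd.Series([True, True, True], name="flag")

# Explicit opt-in to arithmetic: True -> 1, False -> 0.
as_int = flags.astype(int)
total = as_int.sum()  # counts the True values

print(three_trues, flags.dtype, as_int.dtype, total)
```

The original series keeps its bool dtype; only the converted copy is an integer series.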

@danbirken
Contributor Author

If you can give me an example of a command that doesn't work properly with this change, I'd be happy to update this change to make both cases work properly and write tests for them. I really don't want to break any existing functionality, only make the exploratory commands on dataframes work more seamlessly for boolean values.

@jreback
Contributor

jreback commented Mar 25, 2013

The following is with your change

In [27]: df = pd.DataFrame(np.random.randn(8,3),columns=list('ABC'))

In [28]: df['bool'] = True

In [29]: df.loc[0:3,'bool'] = False

In [30]: df['string'] = 'foo'

In [31]: df.loc[0:5,'string'] = 'bar'

In [32]: df
Out[32]: 
          A         B         C   bool string
0 -0.417453 -1.640528 -0.719365  False    bar
1  0.893366 -0.020587 -0.167532  False    bar
2 -0.974600  0.221804  1.069638  False    bar
3 -0.460144 -2.064976 -1.585644  False    bar
4 -0.998505  0.510901  1.372870   True    bar
5  1.259004  0.472066 -0.521407   True    bar
6 -0.682720 -0.087097 -1.307866   True    foo
7 -0.907043  1.241735  0.534841   True    foo

At the very least these are confusing

In [33]: df.describe()
Out[33]: 
              A         B         C       bool
count  8.000000  8.000000  8.000000          8
mean  -0.286012 -0.170835 -0.165558        0.5
std    0.874272  1.121540  1.078218  0.5345225
min   -0.998505 -2.064976 -1.585644      False
25%   -0.923933 -0.475455 -0.866490          0
50%   -0.571432  0.100608 -0.344470        0.5
75%   -0.089748  0.481775  0.668540          1
max    1.259004  1.241735  1.372870       True
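
The numbers in the bool column can be reproduced with the standard library alone. This is a sketch under the assumption that the column is reduced as 0/1 integers, with four False and four True rows as in the frame above.

```python
import statistics

# The 'bool' column from the frame above: four False rows, then four True rows.
flags = [False] * 4 + [True] * 4
as_ints = [int(flag) for flag in flags]

mean = statistics.mean(as_ints)   # fraction of True values
std = statistics.stdev(as_ints)   # sample standard deviation (divides by n - 1)

# min/max compare the bools directly, so the results stay bool.
lowest, highest = min(flags), max(flags)
print(mean, round(std, 7), lowest, highest)
```

That matches the mixed look of the column: mean and std come out as numbers, while min and max come back as False and True.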

This gives wrong results

In [34]: df.T.describe()
Out[34]: 
               0      1      2         3     4     5         6         7
count   5.000000      5      5  5.000000     5     5  5.000000  5.000000
unique  5.000000      5      5  5.000000     5     5  5.000000  5.000000
top    -0.417453  False  False -1.585644  True  True -0.087097  0.534841
freq    1.000000      1      1  1.000000     1     1  1.000000  1.000000

To be fair these are also 'broken' in master as well
Do you have an opinion on what these should actually do?
There is an open issue about this: #2954

In [35]: df*2
Out[35]: 
          A         B         C  bool  string
0 -0.834906 -3.281057 -1.438730     0  barbar
1  1.786731 -0.041175 -0.335065     0  barbar
2 -1.949201  0.443608  2.139277     0  barbar
3 -0.920287 -4.129953 -3.171288     0  barbar
4 -1.997011  1.021803  2.745739     2  barbar
5  2.518008  0.944132 -1.042813     2  barbar
6 -1.365440 -0.174194 -2.615732     2  foofoo
7 -1.814087  2.483471  1.069681     2  foofoo

In [36]: df.groupby('string').sum()
Out[36]: 
               A         B         C  bool
string                                    
bar    -0.698332 -2.521321 -0.551440     2
foo    -1.589763  1.154638 -0.773026     2
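
A reduced sketch of the last two outputs, using only the `bool` and `string` columns from the frame above and assuming a recent pandas: multiplying maps False/True to 0/2, and the grouped sum counts the True rows per group.

```python
import pandas as pd

df = pd.DataFrame({
    "string": ["bar"] * 6 + ["foo"] * 2,
    "bool": [False] * 4 + [True] * 4,
})

# Arithmetic on a bool column treats False as 0 and True as 1.
doubled = df["bool"] * 2

# Summing a bool column within each group counts its True rows.
true_counts = df.groupby("string")["bool"].sum()

print(doubled.tolist())
print(true_counts.to_dict())
```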

@danbirken
Contributor Author

I'm confused as to how the result of the df.T.describe() is wrong? This
particular operation doesn't really make sense, but I think pandas is doing
the right thing.

As for the other operations, I don't think any of those are broken or give
the wrong results. Considering this is what numpy does:

import numpy
a = numpy.array([True, True, False])
a
array([ True, True, False], dtype=bool)
a * 2
array([2, 2, 0])

And what python does:

True * 2
2

I think it is better to just say that in pandas (because it is the case in
numpy and python) True is equal to 1, False is equal to 0, and all
operations on them will act as if that were the case; however, where
possible their Boolean type will be preserved. And with that assumption,
all of the other operations are "right".
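
To make "where possible their Boolean type will be preserved" concrete, here is a small numpy sketch (my own illustration, not part of the change): logical operations and max keep the bool dtype, while arithmetic and sums move to a numeric dtype.

```python
import numpy as np

a = np.array([True, True, False])

# Order-based reductions and logical operations keep the boolean dtype.
largest = a.max()
negated = ~a

# Arithmetic promotes to a numeric dtype, with True -> 1 and False -> 0.
doubled = a * 2
total = a.sum()
fraction_true = a.mean()

print(largest, negated.dtype, doubled.dtype, total, fraction_true)
```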

I especially think the output of df.describe() above is an improvement, as
that solves issue #2641.

-Dan

@jreback
Contributor

jreback commented Mar 31, 2013

on 2nd thought you are right, better to have describe include bool types
why don't you add a note in v0.11.0.txt and RELEASE.rst, and then we can merge this

@danbirken
Contributor Author

Updated commit.

@jreback
Contributor

jreback commented Mar 31, 2013

@wesm this ok by me....any objections?

@wesm
Member

wesm commented Apr 1, 2013

Can we get a unit test that illustrates how behavior has changed?

@wesm
Member

wesm commented Apr 1, 2013

I'm also +1 on getting this in 0.11

@danbirken
Contributor Author

The only change in behavior (afaik) is that in mixed dataframes, boolean columns now appear when trying to do an operation on numeric data (.describe, .mean, .corr, etc), whereas previously they would not have.

So in this most recent update I:
a) Fixed the test for get_numeric_data() on a DF so that it now expects the boolean column to be included
b) Added a test that a mixed dataframe correctly returns the boolean column from .describe(), and that the operations return the expected values.

Note: the values returned by .describe() for a boolean column didn't change in any way, only the fact that if you call .describe() on a mixed dataframe they are now included.
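
For anyone reading along later, this is roughly the shape of such a check. It is not the test added in this commit: the test name and data are hypothetical, and it uses the public `numeric_only=True` option of `DataFrame.mean` in a recent pandas rather than get_numeric_data().

```python
import pandas as pd


def test_bool_column_counts_as_numeric():
    # Hypothetical test name and data, for illustration only.
    df = pd.DataFrame({
        "int1": [1, 2, 3],
        "bool1": [False, False, True],
        "label": ["a", "b", "c"],
    })

    means = df.mean(numeric_only=True)

    # The boolean column is kept, the string column is dropped.
    assert list(means.index) == ["int1", "bool1"]
    assert means["int1"] == 2.0
    assert abs(means["bool1"] - 1 / 3) < 1e-12


test_bool_column_counts_as_numeric()
```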

@jreback
Contributor

jreback commented Apr 1, 2013

@danbirken looks great! can you set up travis (if you update to master and then rebase, it will force an update)? then we can merge. thanks

BUG: GH2641 fixes "df.describe() with boolean column"

This change will make all numeric operations on boolean data work, by
just transparently treating them as integer values 1 and 0.  This is
not pandas-specific behavior; it is the default behavior of both
numpy and python.
@danbirken
Contributor Author

Done.

jreback added a commit that referenced this pull request Apr 2, 2013
ENH: Declare a BoolBlock as a NumericBlock
@jreback jreback merged commit ed1618e into pandas-dev:master Apr 2, 2013