ENH: Declare a BoolBlock as a NumericBlock #3162

danbirken · 2013-03-25T00:43:23Z

BUG: #2641 fixes "df.decribe() with boolean column"

Numpy refers to a boolean type as a "numerical type", and both Python
and numpy cast True bools into the value 1 and False bools into the
value 0, so all numpy numerical operations always work.

This basically is to solve this issue, which I always found a bit puzzling:

import pandas

df = pandas.DataFrame({
    'int1': [1, 2, 3],
    'bool1': [False, False, True],
    'bool2': [True, True, False],
})

print(df.corr())

       int1
int1     1

print(df[['bool1', 'bool2']].corr())

       bool1  bool2
bool1      1     -1
bool2     -1      1

After the change:

print(df.corr())

          bool1     bool2      int1
bool1  1.000000 -1.000000  0.866025
bool2 -1.000000  1.000000 -0.866025
int1   0.866025 -0.866025  1.000000
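
The 0.866025 entry for int1 and bool1 can be reproduced by hand once the bools are read as 0 and 1. A minimal sketch in plain Python; the `pearson` helper is a name made up for this illustration, not a pandas function:

```python
import math

def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Bools take part in the arithmetic as 0 (False) and 1 (True).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

int1 = [1, 2, 3]
bool1 = [False, False, True]
bool2 = [True, True, False]

r_int_bool1 = pearson(int1, bool1)
r_bool1_bool2 = pearson(bool1, bool2)
print(round(r_int_bool1, 6))    # 0.866025
print(round(r_bool1_bool2, 6))  # -1.0
```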

This also applies to quite a few other numeric operations on DataFrames, which, when the DataFrame has mixed types, default to using "is_numeric" to decide which columns to operate on.

I'm not sure how deep the rabbit hole goes for this change and how much stuff it might affect, but all of the tests pass (after of course editing the one that specifically tested this functionality). If there are other potential issues I'd be happy to look into them and make other related fixes.

jreback · 2013-03-25T00:50:39Z

this will break on certain operations which I know are not tested right now, specifically grouping on a bool type
we are trying to preserve the types as much as possible. I know you want the example to work, but even though python/numpy treat bools as numbers (for computational ease, as that is how they are implemented),
they really are not a numeric type, nor should they be cast automatically

in addition, most numeric operations do not make sense on bool types, e.g. what if you add three Trues?
you're free to astype of course

let me take a look tomorrow
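
For reference, plain Python already gives an answer to the "three Trues" question, since `bool` is a subclass of `int`. An explicit cast, similar in spirit to the astype route mentioned above, makes the intent visible. A small sketch in plain Python, not pandas code:

```python
# bool is a subclass of int in Python, so arithmetic on bools yields ints.
three_trues = True + True + True
print(three_trues)        # 3
print(type(three_trues))  # <class 'int'>

flags = [True, True, False]
as_ints = [int(f) for f in flags]  # explicit cast instead of implicit coercion
print(as_ints)                     # [1, 1, 0]
print(sum(flags) == sum(as_ints))  # True
```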

danbirken · 2013-03-25T01:49:53Z

If you can give me an example of a command that doesn't work properly with this change, I'd be happy to update this change to make both cases work properly and write tests for them. I really don't want to break any existing functionality, only make the explanatory commands on dataframes work more seamlessly for boolean values.

jreback · 2013-03-25T11:52:26Z

The following is with your change

In [27]: df = pd.DataFrame(np.random.randn(8,3),columns=list('ABC'))

In [28]: df['bool'] = True

In [29]: df.loc[0:3,'bool'] = False

In [30]: df['string'] = 'foo'

In [31]: df.loc[0:5,'string'] = 'bar'

In [32]: df
Out[32]: 
          A         B         C   bool string
0 -0.417453 -1.640528 -0.719365  False    bar
1  0.893366 -0.020587 -0.167532  False    bar
2 -0.974600  0.221804  1.069638  False    bar
3 -0.460144 -2.064976 -1.585644  False    bar
4 -0.998505  0.510901  1.372870   True    bar
5  1.259004  0.472066 -0.521407   True    bar
6 -0.682720 -0.087097 -1.307866   True    foo
7 -0.907043  1.241735  0.534841   True    foo

At the very least these are confusing

In [33]: df.describe()
Out[33]: 
              A         B         C       bool
count  8.000000  8.000000  8.000000          8
mean  -0.286012 -0.170835 -0.165558        0.5
std    0.874272  1.121540  1.078218  0.5345225
min   -0.998505 -2.064976 -1.585644      False
25%   -0.923933 -0.475455 -0.866490          0
50%   -0.571432  0.100608 -0.344470        0.5
75%   -0.089748  0.481775  0.668540          1
max    1.259004  1.241735  1.372870       True
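
The mean and std in the bool column above are what one gets by reading False as 0 and True as 1. A quick check with the standard library, using the eight values from Out[32]; this is an illustration only, not how pandas computes it internally:

```python
import statistics

# The 'bool' column from Out[32]: rows 0-3 are False, rows 4-7 are True.
flags = [False] * 4 + [True] * 4
values = [int(f) for f in flags]  # read bools as 0/1

mean = statistics.mean(values)
std = statistics.stdev(values)    # sample standard deviation (divides by n - 1)
print(mean)           # 0.5
print(round(std, 7))  # 0.5345225
```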

This gives wrong results

In [34]: df.T.describe()
Out[34]: 
               0      1      2         3     4     5         6         7
count   5.000000      5      5  5.000000     5     5  5.000000  5.000000
unique  5.000000      5      5  5.000000     5     5  5.000000  5.000000
top    -0.417453  False  False -1.585644  True  True -0.087097  0.534841
freq    1.000000      1      1  1.000000     1     1  1.000000  1.000000

To be fair these are also 'broken' in master as well
Do you have an opinion on what these should actually do?
There is an open issue about this: #2954

In [35]: df*2
Out[35]: 
          A         B         C  bool  string
0 -0.834906 -3.281057 -1.438730     0  barbar
1  1.786731 -0.041175 -0.335065     0  barbar
2 -1.949201  0.443608  2.139277     0  barbar
3 -0.920287 -4.129953 -3.171288     0  barbar
4 -1.997011  1.021803  2.745739     2  barbar
5  2.518008  0.944132 -1.042813     2  barbar
6 -1.365440 -0.174194 -2.615732     2  foofoo
7 -1.814087  2.483471  1.069681     2  foofoo

In [36]: df.groupby('string').sum()
Out[36]: 
               A         B         C  bool
string                                    
bar    -0.698332 -2.521321 -0.551440     2
foo    -1.589763  1.154638 -0.773026     2
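
The bool totals in Out[36] amount to counting the True values per group. A sketch in plain Python, with the (string, bool) pairs copied from Out[32]; it stands in for the groupby only for illustration:

```python
from collections import defaultdict

# (string, bool) pairs from Out[32]: rows 0-5 are 'bar', rows 6-7 are 'foo'.
rows = (
    [("bar", False)] * 4
    + [("bar", True)] * 2
    + [("foo", True)] * 2
)

totals = defaultdict(int)
for key, flag in rows:
    totals[key] += flag  # each bool is added as 0 or 1

print(dict(totals))  # {'bar': 2, 'foo': 2}
```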

danbirken · 2013-03-30T19:04:03Z

I'm confused as to how the result of the df.T.describe() is wrong? This
particular operation doesn't really make sense, but I think pandas is doing
the right thing.

As for the other operations, I don't think any of those are broken or give
the wrong results. Considering this is what numpy does:

>>> import numpy
>>> a = numpy.array([True, True, False])
>>> a
array([ True,  True, False], dtype=bool)
>>> a * 2
array([2, 2, 0])

And what python does:

>>> True * 2
2

I think it is better to simply say that in pandas (as is the case in
numpy and python) True is equal to 1, False is equal to 0, and all
operations on them will act as if that were the case; however, where
possible their Boolean type will be preserved. With that assumption,
all of the other operations are "right".

I especially think the output of df.describe() above is an improvement, as
that solves issue #2641.

-Dan

jreback · 2013-03-31T00:31:48Z

on 2nd thought you are right, better to have describe include bool types
why don't you add a note in v0.11.0.txt and RELEASE.rst, and then we can merge this

danbirken · 2013-03-31T21:17:24Z

Updated commit.

jreback · 2013-03-31T21:27:50Z

@wesm this is ok by me....any objections?

wesm · 2013-04-01T02:02:22Z

Can we get a unit test that illustrates how behavior has changed?

wesm · 2013-04-01T02:02:53Z

I'm also +1 on getting this in 0.11

danbirken · 2013-04-01T04:32:12Z

The only change in behavior (afaik) is that in mixed dataframes, boolean columns now appear when trying to do an operation on numeric data (.describe, .mean, .corr, etc), whereas previously they would not have.

So in this most recent update I:
a) Fixed a test about the behavior of get_numeric_data() on a DF, including the boolean column
b) Added a test with a mixed dataframe correctly returning the boolean column when doing a .describe() along with verifying the operations return the expected values.

Note: the values returned by .describe() for a boolean column didn't change in any way, only the fact that if you call .describe() on a mixed dataframe they are now included.

jreback · 2013-04-01T10:10:54Z

@danbirken looks great! can you set up travis (if you update to master and then rebase, it will force an update)? then we can merge. thanks

BUG: GH2641 fixes "df.decribe() with boolean column" This change will make all numeric operations on boolean data work, by just transparently treating them as the integer values 1 and 0. This is not pandas-specific behavior; it is the default behavior of both numpy and python.

danbirken · 2013-04-01T18:39:40Z

Done.

ENH: Declare a BoolBlock as a NumericBlock

jreback added a commit that referenced this pull request Apr 2, 2013

Merge pull request #3162 from danbirken/bool_as_numeric_type

ed1618e

ENH: Declare a BoolBlock as a NumericBlock

jreback merged commit ed1618e into pandas-dev:master Apr 2, 2013

jreback mentioned this pull request Apr 2, 2013

df.decribe() with boolean column #2641

Closed
