Merge fails when dataframe contains categoricals #9426

lminer · 2015-02-05T22:54:07Z

Trying to perform a left merge between two dataframes using a column of type object. If I include categoricals in the right dataframe, I get the following error. Trying to reproduce with a toy dataset but no luck so far.

out = pd.merge(left, right, how='left', left_on='left_id', right_on='right_id')
Traceback (most recent call last):
  File ".../pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File ".../pandas/tools/merge.py", line 201, in get_result
    concat_axis=0, copy=self.copy)
  File ".../pandas/core/internals.py", line 4046, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File ".../pandas/core/internals.py", line 4135, in concatenate_join_units
    empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
  File ".../pandas/core/internals.py", line 4074, in get_empty_dtype_and_na
    dtypes[i] = unit.dtype
  File ".../pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
  File ".../pandas/core/internals.py", line 4349, in dtype
    self.block.fill_value)[0])
  File ".../pandas/core/common.py", line 1128, in _maybe_promote
    if issubclass(np.dtype(dtype).type, compat.string_types):
TypeError: data type not understood

jreback · 2015-02-05T22:57:47Z

Please post the output of pd.show_versions(), plus df.info() and df.head() for each frame.

lminer · 2015-02-05T23:18:59Z

df = pd.merge(left, right, how='left', left_on='b', right_on='c')

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.4.4
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
None

print left.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29040 entries, 0 to 29039
Data columns (total 2 columns):
a    29040 non-null object
b    29040 non-null object
dtypes: object(2)
memory usage: 680.6+ KB
None

print left.head()
                    a                   b
0  00640000008PbqmAAC  0013000000CBGKbAAP
1  00640000008PbqmAAC  0013000000CBGKbAAP
2  00640000008PbqmAAC  0013000000CBGKbAAP
3  00640000008PbqmAAC  0013000000CBGKbAAP
4  00640000008PbqmAAC  0013000000CBGKbAAP

print right.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2952 entries, 0 to 2951
Data columns (total 2 columns):
c    2952 non-null object
d    2952 non-null category
dtypes: category(1), object(1)
memory usage: 49.2+ KB
None

print right.head()
                    c     d
0  0014000000G3eszAAB  null
1  0014000000G3TTVAA3  null
2  0014000000G4H6yAAF  null
3  0014000000G4HpmAAF  null
4  0014000000G4IR8AAN  null

jreback · 2015-02-05T23:29:03Z

And are you merging in the categorical column? IIRC we allow this kind of object/categorical merging (as the merge column), but I would need a specific example to see what the issue is.

lminer · 2015-02-05T23:30:31Z

I'm merging on an object column and merging in a category column.
I have a reproducible example now:

right = pd.DataFrame({'c': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
 'd': {0: 'null', 1: 'null', 2: 'null', 3: 'null', 4: 'null'}})
right['d'] = right['d'].astype('category')

left = pd.DataFrame({'a': {0: 'f', 1: 'f', 2: 'f', 3: 'f', 4: 'f'},
 'b': {0: 'g', 1: 'g', 2: 'g', 3: 'g', 4: 'g'}})
df = pd.merge(left, right, how='left', left_on='b', right_on='c')

jreback · 2015-02-05T23:52:18Z

Hmm, I don't think this is tested (only with concat). OK, marking as a bug. I think it should be pretty easy to resolve, though. You are welcome to dig in if you'd like.
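
Until a fix lands, one possible stopgap is to cast the categorical column to object before merging and restore it afterwards. This is only a sketch and is not something proposed in this thread: the names `cat_cols` and `right_plain` are made up for illustration, and the idea that an object column sidesteps the failing dtype promotion is an assumption that has not been checked against 0.15.2.

```python
import pandas as pd

# Same frames as the reproducible example above, built from lists.
right = pd.DataFrame({"c": ["a", "b", "c", "d", "e"], "d": ["null"] * 5})
right["d"] = right["d"].astype("category")
left = pd.DataFrame({"a": ["f"] * 5, "b": ["g"] * 5})

# Assumption: the failure is tied to the categorical dtype, so merge a copy
# in which every categorical column has been cast to plain object.
cat_cols = right.select_dtypes(include=["category"]).columns
right_plain = right.copy()
for col in cat_cols:
    right_plain[col] = right_plain[col].astype(object)

merged = pd.merge(left, right_plain, how="left", left_on="b", right_on="c")

# Restore the categorical dtype, reusing the original categories so that
# unmatched rows (all of them in this example) are simply missing values.
for col in cat_cols:
    merged[col] = pd.Categorical(merged[col], categories=right[col].cat.categories)
```

The cost is a temporary object copy of each categorical column, which may matter for large frames.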

philippmuller · 2015-03-04T14:32:46Z

I just ran into this in production code. Any hints on how this could be fixed? I'd gladly try.

lminer · 2015-03-05T03:35:47Z

FYI, I don't get this bug in 0.15.1

hmgaudecker · 2015-03-05T09:32:23Z

@lminer: Confirmed here, downgrading helped.

philippmuller · 2015-03-05T09:38:42Z

@lminer, thanks! Confirmed as well, working fine with 0.15.1.

sebp · 2015-05-19T15:45:23Z

I'm getting the same error with 0.15.1, 0.15.2 and 0.16.1 (I didn't test any other versions):

import pandas
import numpy

df1 = pandas.DataFrame(numpy.random.randn(6, 3), columns=["a", "b", "c"])

df2 = pandas.DataFrame(numpy.random.randn(7, 4), columns=["g", "h", "a", "c"])
df2['h'] = pandas.Series(pandas.Categorical(["one", "one", "two", "one", "two", "two", "one"]))
df = pandas.concat((df1, df2))

Traceback with 0.16.1:

Traceback (most recent call last):
  File "/home/sebp/.PyCharm40/config/scratches/scratch", line 9, in <module>
    df = pandas.concat((df1, df2))
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py", line 755, in concat
    return op.get_result()
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py", line 926, in get_result
    mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4059, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4059, in <listcomp>
    for placement, join_units in concat_plan]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4152, in concatenate_join_units
    for ju in join_units]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4152, in <listcomp>
    for ju in join_units]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4401, in get_reindexed_values
    missing_arr = np.empty(self.shape, dtype=empty_dtype)
TypeError: data type not understood

pandas.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.7-200.fc21.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.utf8

pandas: 0.16.1
nose: 1.3.6
Cython: 0.21.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
None

df1.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 3 columns):
a    6 non-null float64
b    6 non-null float64
c    6 non-null float64
dtypes: float64(3)
memory usage: 192.0 bytes
None

df2.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 4 columns):
g    7 non-null float64
h    7 non-null category
a    7 non-null float64
c    7 non-null float64
dtypes: category(1), float64(3)
memory usage: 247.0 bytes
None

jreback · 2015-05-20T11:55:07Z

you need to specify the axis

In [11]: pandas.concat((df1, df2),axis=1)
Out[11]: 
          a         b         c         g    h         a         c
0  0.165903  0.653897 -0.922319  0.005155  one -0.061915 -0.073384
1  0.046094 -0.064848  1.967550  1.503491  one -0.390496  1.337330
2  0.791940 -0.896089 -1.598779  1.001303  two  1.536334 -0.367639
3 -0.253877  1.135221 -0.264409  0.149479  one -1.929875 -0.021116
4  1.083382  0.366590  1.833362 -0.277670  two -0.971455 -0.179325
5 -0.053932  0.099949 -0.545455 -0.946396  two  0.436236 -1.000864
6       NaN       NaN       NaN -0.120871  one  0.304886 -1.347874

jreback · 2015-05-20T11:55:58Z

Though there is an error with axis=0, hmm.

jreback · 2015-05-20T11:59:03Z

@sebp new issue is #10177
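
For the axis=0 case, a similar stopgap would be to cast the categorical column to object before concatenating. Again this is a sketch under assumptions, not a fix from the thread: `df2_plain` and `stacked` are illustrative names, and whether the cast avoids the error on the affected versions has not been verified here.

```python
import numpy as np
import pandas as pd

# Frames shaped like the ones in the report above.
df1 = pd.DataFrame(np.random.randn(6, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.random.randn(7, 4), columns=["g", "h", "a", "c"])
df2["h"] = pd.Categorical(["one", "one", "two", "one", "two", "two", "one"])

# Assumption: with "h" as plain object, the rows contributed by df1 can be
# filled with NaN without hitting the categorical dtype.
df2_plain = df2.copy()
df2_plain["h"] = df2_plain["h"].astype(object)

stacked = pd.concat((df1, df2_plain), axis=0)
```

The six rows coming from `df1` have no value for `g` or `h`, so those cells come back as missing values; `h` can be converted back with `astype("category")` afterwards if the dtype matters.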

jreback added the Bug, Categorical (Categorical Data Type), and Reshaping (Concat, Merge/Join, Stack/Unstack, Explode) labels Feb 5, 2015

jreback added this to the 0.16.0 milestone Feb 5, 2015

jreback mentioned this issue Mar 6, 2015

BUG: Regression in merging Categorical and object dtypes (GH9426) #9597

Merged

jreback closed this as completed in #9597 Mar 6, 2015

jreback mentioned this issue May 20, 2015

BUG: concat on axis=0 with categorical #10177

Closed
