Merge fails when dataframe contains categoricals #9426

Closed
lminer opened this issue Feb 5, 2015 · 13 comments · Fixed by #9597
Labels: Bug, Categorical, Reshaping

@lminer

lminer commented Feb 5, 2015

I'm trying to perform a left merge between two dataframes on a column of type object. If the right dataframe includes categorical columns, I get the following error. I've tried to reproduce it with a toy dataset, but no luck so far.

out = pd.merge(left, right, how='left', left_on='left_id', right_on='right_id')
Traceback (most recent call last):
  File ".../pandas/tools/merge.py", line 39, in merge
    return op.get_result()
  File ".../pandas/tools/merge.py", line 201, in get_result
    concat_axis=0, copy=self.copy)
  File ".../pandas/core/internals.py", line 4046, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File ".../pandas/core/internals.py", line 4135, in concatenate_join_units
    empty_dtype, upcasted_na = get_empty_dtype_and_na(join_units)
  File ".../pandas/core/internals.py", line 4074, in get_empty_dtype_and_na
    dtypes[i] = unit.dtype
  File ".../pandas/src/properties.pyx", line 34, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:40664)
  File ".../pandas/core/internals.py", line 4349, in dtype
    self.block.fill_value)[0])
  File ".../pandas/core/common.py", line 1128, in _maybe_promote
    if issubclass(np.dtype(dtype).type, compat.string_types):
TypeError: data type not understood
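
A note on the last frame: it calls np.dtype(dtype), and the message matches what NumPy raises when it is handed something it cannot interpret as a NumPy dtype. The sketch below assumes (this is an inference from the traceback, not something confirmed in this thread) that the value reaching that call is the pandas categorical dtype of the right-hand column. The names cat_dtype and numpy_accepts_categorical are made up for illustration.

```python
import numpy as np
import pandas as pd

# A categorical column like 'd' in the right-hand frame of the report.
cat_dtype = pd.Series(["null"] * 5).astype("category").dtype

try:
    # NumPy has no native counterpart for the pandas categorical dtype,
    # so converting it is expected to raise TypeError.
    np.dtype(cat_dtype)
    numpy_accepts_categorical = True
except TypeError as exc:
    numpy_accepts_categorical = False
    print("TypeError:", exc)
```

The exact wording of the message differs between NumPy versions; only the exception type is relied on here.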
@jreback
Contributor

jreback commented Feb 5, 2015

pd.show_versions()

df.info() and df.head() for each frame

@lminer
Author

lminer commented Feb 5, 2015

df = pd.merge(left, right, how='left', left_on='b', right_on='c')

INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8

pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.4.4
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
None
print left.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29040 entries, 0 to 29039
Data columns (total 2 columns):
a    29040 non-null object
b    29040 non-null object
dtypes: object(2)
memory usage: 680.6+ KB
None
print left.head()
                    a                   b
0  00640000008PbqmAAC  0013000000CBGKbAAP
1  00640000008PbqmAAC  0013000000CBGKbAAP
2  00640000008PbqmAAC  0013000000CBGKbAAP
3  00640000008PbqmAAC  0013000000CBGKbAAP
4  00640000008PbqmAAC  0013000000CBGKbAAP
print right.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2952 entries, 0 to 2951
Data columns (total 2 columns):
c    2952 non-null object
d    2952 non-null category
dtypes: category(1), object(1)
memory usage: 49.2+ KB
None
print right.head()
                    c     d
0  0014000000G3eszAAB  null
1  0014000000G3TTVAA3  null
2  0014000000G4H6yAAF  null
3  0014000000G4HpmAAF  null
4  0014000000G4IR8AAN  null

@jreback
Contributor

jreback commented Feb 5, 2015

And are you merging in the categorical column? IIRC we allow this kind of object/cat merging (as the merge column), but I would need a specific example to see what the issue is.

@lminer
Author

lminer commented Feb 5, 2015

I'm merging on an object column and merging in a category column.
I have a reproducible example now:

right = pd.DataFrame({'c': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
 'd': {0: 'null', 1: 'null', 2: 'null', 3: 'null', 4: 'null'}})
right['d'] = right['d'].astype('category')

left = pd.DataFrame({'a': {0: 'f', 1: 'f', 2: 'f', 3: 'f', 4: 'f'},
 'b': {0: 'g', 1: 'g', 2: 'g', 3: 'g', 4: 'g'}})
df = pd.merge(left, right, how='left', left_on='b', right_on='c')
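
For anyone who wants to check their own install, here is the same example as a self-contained script with the import added and the merge wrapped in try/except. It is only a sketch: the check in the else branch assumes that a working left merge with no matching keys leaves the right-hand columns entirely missing, which is ordinary left-join behaviour rather than anything stated in this thread.

```python
import pandas as pd

# Right frame: object key column 'c' plus a categorical column 'd'.
right = pd.DataFrame({"c": ["a", "b", "c", "d", "e"], "d": ["null"] * 5})
right["d"] = right["d"].astype("category")

# Left frame: two object columns; no value in 'b' matches any value in 'c'.
left = pd.DataFrame({"a": ["f"] * 5, "b": ["g"] * 5})

try:
    df = pd.merge(left, right, how="left", left_on="b", right_on="c")
except TypeError as exc:
    # Affected versions are expected to end up here.
    df = None
    print("merge failed:", exc)
else:
    # Every left row is kept; the unmatched right-hand columns are all missing.
    print(len(df), df["c"].isnull().all(), df["d"].isnull().all())
```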

@jreback
Contributor

jreback commented Feb 5, 2015

hmm I don't think this is tested (only with concat). ok, marking as a bug. I think pretty easy to resolve though. You are welcome to dig in if you'd like.

@jreback added the Bug, Categorical, and Reshaping labels Feb 5, 2015
@jreback added this to the 0.16.0 milestone Feb 5, 2015
@philippmuller

I just ran into this in production code. Any hints on how this could be fixed? I'd gladly try.

@lminer
Author

lminer commented Mar 5, 2015

FYI, I don't get this bug in 0.15.1

@hmgaudecker

@lminer: Confirmed here, downgrading helped.

@philippmuller

@lminer, thanks! Confirmed as well, working fine with 0.15.1.
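
If downgrading is not an option, one possible workaround (not suggested by anyone in this thread, and not verified against 0.15.2) is to cast the categorical columns to object before merging and rebuild the categoricals afterwards. The sketch below assumes the failure is triggered only by the categorical column being carried through the merge; cat_cols, right_plain and merged are illustrative names.

```python
import pandas as pd

right = pd.DataFrame({"c": ["a", "b", "c", "d", "e"], "d": ["null"] * 5})
right["d"] = right["d"].astype("category")
left = pd.DataFrame({"a": ["f"] * 5, "b": ["g"] * 5})

# Find the categorical columns and merge against a copy where they are plain object.
cat_cols = [col for col in right.columns if str(right[col].dtype) == "category"]
right_plain = right.copy()
for col in cat_cols:
    right_plain[col] = right_plain[col].astype(object)

merged = pd.merge(left, right_plain, how="left", left_on="b", right_on="c")

# Rebuild each categorical with the categories of the original column.
for col in cat_cols:
    merged[col] = pd.Categorical(merged[col], categories=right[col].cat.categories)
```

The cost is a temporary object copy of each categorical column, which may matter for large frames.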

@sebp

sebp commented May 19, 2015

I'm getting the same error with 0.15.1, 0.15.2 and 0.16.1 (I didn't test any other versions):

import pandas
import numpy

df1 = pandas.DataFrame(numpy.random.randn(6, 3), columns=["a", "b", "c"])

df2 = pandas.DataFrame(numpy.random.randn(7, 4), columns=["g", "h", "a", "c"])
df2['h'] = pandas.Series(pandas.Categorical(["one", "one", "two", "one", "two", "two", "one"]))
df = pandas.concat((df1, df2))

Traceback with 0.16.1:

Traceback (most recent call last):
  File "/home/sebp/.PyCharm40/config/scratches/scratch", line 9, in <module>
    df = pandas.concat((df1, df2))
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py", line 755, in concat
    return op.get_result()
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py", line 926, in get_result
    mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4059, in concatenate_block_managers
    for placement, join_units in concat_plan]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4059, in <listcomp>
    for placement, join_units in concat_plan]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4152, in concatenate_join_units
    for ju in join_units]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4152, in <listcomp>
    for ju in join_units]
  File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4401, in get_reindexed_values
    missing_arr = np.empty(self.shape, dtype=empty_dtype)
TypeError: data type not understood

pandas.show_versions():

INSTALLED VERSIONS
------------------
commit: None
python: 3.4.3.final.0
python-bits: 64
OS: Linux
OS-release: 3.19.7-200.fc21.x86_64
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: de_DE.utf8

pandas: 0.16.1
nose: 1.3.6
Cython: 0.21.2
numpy: 1.9.2
scipy: 0.15.1
statsmodels: 0.6.1
IPython: 3.1.0
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 2.4.1
pytz: 2015.4
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.3
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: None
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
sqlalchemy: None
pymysql: None
psycopg2: None
None

df1.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6 entries, 0 to 5
Data columns (total 3 columns):
a    6 non-null float64
b    6 non-null float64
c    6 non-null float64
dtypes: float64(3)
memory usage: 192.0 bytes
None

df2.info():

<class 'pandas.core.frame.DataFrame'>
Int64Index: 7 entries, 0 to 6
Data columns (total 4 columns):
g    7 non-null float64
h    7 non-null category
a    7 non-null float64
c    7 non-null float64
dtypes: category(1), float64(3)
memory usage: 247.0 bytes
None

@jreback
Contributor

jreback commented May 20, 2015

you need to specify the axis

In [11]: pandas.concat((df1, df2),axis=1)
Out[11]: 
          a         b         c         g    h         a         c
0  0.165903  0.653897 -0.922319  0.005155  one -0.061915 -0.073384
1  0.046094 -0.064848  1.967550  1.503491  one -0.390496  1.337330
2  0.791940 -0.896089 -1.598779  1.001303  two  1.536334 -0.367639
3 -0.253877  1.135221 -0.264409  0.149479  one -1.929875 -0.021116
4  1.083382  0.366590  1.833362 -0.277670  two -0.971455 -0.179325
5 -0.053932  0.099949 -0.545455 -0.946396  two  0.436236 -1.000864
6       NaN       NaN       NaN -0.120871  one  0.304886 -1.347874

@jreback
Contributor

jreback commented May 20, 2015

though there is an error with axis=0, hmm
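
To compare the two axes on a given install, here is a sketch based on @sebp's example. The seeded RandomState and the names side_by_side and stacked are additions for illustration, and the shapes in the comments assume the default outer join on the other axis.

```python
import numpy as np
import pandas as pd

rng = np.random.RandomState(0)  # seeded only so that reruns give the same values
df1 = pd.DataFrame(rng.randn(6, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(rng.randn(7, 4), columns=["g", "h", "a", "c"])
df2["h"] = pd.Series(pd.Categorical(["one", "one", "two", "one", "two", "two", "one"]))

# axis=1 aligns on the row index: 7 rows, 3 + 4 = 7 columns.
side_by_side = pd.concat((df1, df2), axis=1)
print(side_by_side.shape)

# axis=0 (the default) stacks rows, so df1 has to be padded with a missing 'h'.
try:
    stacked = pd.concat((df1, df2))
except TypeError as exc:
    stacked = None
    print("concat along axis=0 failed:", exc)
else:
    # 6 + 7 = 13 rows, union of the column names = 5 columns.
    print(stacked.shape)
```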

@jreback
Contributor

jreback commented May 20, 2015

@sebp new issue is #10177
