Merge fails when dataframe contains categoricals #9426
Please post pd.show_versions(), plus df.info() and df.head() for each frame. |
df = pd.merge(left, right, how='left', left_on='b', right_on='c')
INSTALLED VERSIONS
------------------
commit: None
python: 2.7.9.final.0
python-bits: 64
OS: Linux
OS-release: 3.13.0-45-generic
machine: x86_64
processor: x86_64
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
pandas: 0.15.2
nose: 1.3.4
Cython: 0.21
numpy: 1.9.1
scipy: 0.15.1
statsmodels: None
IPython: 2.3.1
sphinx: 1.2.3
patsy: 0.3.0
dateutil: 1.5
pytz: 2014.9
bottleneck: None
tables: 3.1.1
numexpr: 2.3.1
matplotlib: 1.4.2
openpyxl: 1.8.5
xlrd: 0.9.3
xlwt: 0.7.5
xlsxwriter: 0.5.7
lxml: 3.4.0
bs4: 4.3.2
html5lib: None
httplib2: None
apiclient: None
rpy2: 2.4.4
sqlalchemy: 0.9.7
pymysql: None
psycopg2: None
None
print left.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 29040 entries, 0 to 29039
Data columns (total 2 columns):
a 29040 non-null object
b 29040 non-null object
dtypes: object(2)
memory usage: 680.6+ KB
None
print left.head()
a b
0 00640000008PbqmAAC 0013000000CBGKbAAP
1 00640000008PbqmAAC 0013000000CBGKbAAP
2 00640000008PbqmAAC 0013000000CBGKbAAP
3 00640000008PbqmAAC 0013000000CBGKbAAP
4 00640000008PbqmAAC 0013000000CBGKbAAP
print right.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 2952 entries, 0 to 2951
Data columns (total 2 columns):
c 2952 non-null object
d 2952 non-null category
dtypes: category(1), object(1)
memory usage: 49.2+ KB
None
print right.head()
c d
0 0014000000G3eszAAB null
1 0014000000G3TTVAA3 null
2 0014000000G4H6yAAF null
3 0014000000G4HpmAAF null
4 0014000000G4IR8AAN null |
And are you merging in the categorical column? IIRC we allow this kind of object/categorical merging (as the merge column), but I would need a specific example to see what the issue is. |
I'm merging on an object column and merging in a category column.
right = pd.DataFrame({'c': {0: 'a', 1: 'b', 2: 'c', 3: 'd', 4: 'e'},
'd': {0: 'null', 1: 'null', 2: 'null', 3: 'null', 4: 'null'}})
right['d'] = right['d'].astype('category')
left = pd.DataFrame({'a': {0: 'f', 1: 'f', 2: 'f', 3: 'f', 4: 'f'},
'b': {0: 'g', 1: 'g', 2: 'g', 3: 'g', 4: 'g'}})
df = pd.merge(left, right, how='left', left_on='b', right_on='c') |
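One possible workaround while the bug is open, as a minimal sketch only: cast the categorical column to plain object before merging, then rebuild the categorical afterwards. This is an assumption and not something confirmed in this thread; the name `right_plain` is made up for illustration.

```python
import pandas as pd

# Same toy frames as in the report above.
right = pd.DataFrame({'c': ['a', 'b', 'c', 'd', 'e'],
                      'd': ['null'] * 5})
right['d'] = right['d'].astype('category')
left = pd.DataFrame({'a': ['f'] * 5, 'b': ['g'] * 5})

# Assumed workaround: merge with the categorical column cast to object,
# so the merge only ever sees plain object columns.
right_plain = right.copy()  # hypothetical helper name
right_plain['d'] = right_plain['d'].astype(object)
merged = pd.merge(left, right_plain, how='left', left_on='b', right_on='c')

# Rebuild the categorical, reusing the original categories so that
# unobserved categories are not dropped.
merged['d'] = pd.Categorical(merged['d'], categories=right['d'].cat.categories)
```

No key in `left['b']` matches `right['c']` here, so `c` and `d` come back entirely missing; the point is only that the merge completes and `d` ends up categorical again.
|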
hmm I don't think this is tested (only with concat). ok, marking as a bug. I think pretty easy to resolve though. You are welcome to dig in if you'd like. |
I just ran into this in production code. Any hints on how this could be fixed? I'd gladly try. |
FYI, I don't get this bug in 0.15.1 |
@lminer: Confirmed here, downgrading helped. |
@lminer, thanks! Confirmed as well, working fine with 0.15.1. |
I'm getting the same error with 0.15.1, 0.15.2 and 0.16.1 (I didn't test any other versions):
import pandas
import numpy
df1 = pandas.DataFrame(numpy.random.randn(6, 3), columns=["a", "b", "c"])
df2 = pandas.DataFrame(numpy.random.randn(7, 4), columns=["g", "h", "a", "c"])
df2['h'] = pandas.Series(pandas.Categorical(["one", "one", "two", "one", "two", "two", "one"]))
df = pandas.concat((df1, df2))
Traceback with 0.16.1:
Traceback (most recent call last):
File "/home/sebp/.PyCharm40/config/scratches/scratch", line 9, in <module>
df = pandas.concat((df1, df2))
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py", line 755, in concat
return op.get_result()
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/tools/merge.py", line 926, in get_result
mgrs_indexers, self.new_axes, concat_axis=self.axis, copy=self.copy)
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4059, in concatenate_block_managers
for placement, join_units in concat_plan]
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4059, in <listcomp>
for placement, join_units in concat_plan]
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4152, in concatenate_join_units
for ju in join_units]
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4152, in <listcomp>
for ju in join_units]
File "/home/sebp/anaconda3/lib/python3.4/site-packages/pandas/core/internals.py", line 4401, in get_reindexed_values
missing_arr = np.empty(self.shape, dtype=empty_dtype)
TypeError: data type not understood
pandas.show_versions():
df1.info():
df2.info():
|
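The traceback above ends in `np.empty(self.shape, dtype=empty_dtype)`, which suggests NumPy is handed a dtype it does not recognise (presumably the category dtype) while building the fill values for the frame that lacks column `h`. Under that reading, a sketch of a workaround is to cast the categorical column to object before concatenating. This is an assumption, not a confirmed fix, and `df2_plain` is a made-up name.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.random.randn(6, 3), columns=["a", "b", "c"])
df2 = pd.DataFrame(np.random.randn(7, 4), columns=["g", "h", "a", "c"])
df2["h"] = pd.Categorical(["one", "one", "two", "one", "two", "two", "one"])

# Assumed workaround: with an object column, the rows coming from df1
# (which has no "h") can simply be filled with NaN.
df2_plain = df2.copy()  # hypothetical helper name
df2_plain["h"] = df2_plain["h"].astype(object)
df = pd.concat((df1, df2_plain))
```

The result has the union of the columns, with `h` missing for the six rows that came from `df1`. It can be converted back with `df["h"].astype("category")` if the categorical dtype is needed afterwards.
|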
you need to specify the axis
|
though there is an error of |
Trying to perform a left merge between two dataframes using a column of type object. If I include categoricals in the right dataframe, I get the following error. Trying to reproduce with a toy dataset but no luck so far.