
Pandas assign malfunctions with if/else conditional check #30357


Closed
JayMan91 opened this issue Dec 19, 2019 · 3 comments


@JayMan91

Problem description

I need to create a new column in a pandas DataFrame from the values of another column. For instance, given the city column in the following DataFrame, the new column should copy the value from the city column only if it is in a given list; otherwise the corresponding entries should be populated as "other". The following code snippet works as expected.

import pandas as pd

df = pd.DataFrame({'city': ['Kolkata','Delhi','Mumbai','Bankura','Dhaka','Jaipur','Goa',
                           'Delhi','Mumbai','Kolkata'],'temp':[i for i in range(10)]})
# axis=1 passes one row at a time, so x.city is a single string here
df['MajorCity'] = df.apply(lambda x: x.city if x.city in ['Kolkata','Delhi','Mumbai'] else 'other',axis=1)
print(df)

Output

city     temp  MajorCity
Kolkata  0     Kolkata
Delhi    1     Delhi
Mumbai   2     Mumbai
Bankura  3     other
Dhaka    4     other
Jaipur   5     other
Goa      6     other
Delhi    7     Delhi
Mumbai   8     Mumbai
Kolkata  9     Kolkata

But when I try to implement it with the pandas assign function, as below:

df = pd.DataFrame({'city': ['Kolkata','Delhi','Mumbai','Bankura','Dhaka','Jaipur','Goa',
                           'Delhi','Mumbai','Kolkata'],'temp':[i for i in range(10)]})
df.assign(MajorCity = lambda x:x.city if x.city in ['Kolkata','Delhi','Mumbai'] else 'other')

I received the following error:

Error message:

ValueError Traceback (most recent call last)
in
1 df = pd.DataFrame({'city': ['Kolkata','Delhi','Mumbai','Bankura','Dhaka','Jaipur','Goa',
2 'Delhi','Mumbai','Kolkata'],'temp':[i for i in range(10)]})
----> 3 df.assign(MajorCity = lambda x:x.city if x.city in ['Kolkata','Delhi','Mumbai'] else 'other')

~/anaconda3/lib/python3.7/site-packages/pandas/core/frame.py in assign(self, **kwargs)
3667 if PY36:
3668 for k, v in kwargs.items():
-> 3669 data[k] = com.apply_if_callable(v, data)
3670 else:
3671 # <= 3.5: do all calculations first...

~/anaconda3/lib/python3.7/site-packages/pandas/core/common.py in apply_if_callable(maybe_callable, obj, **kwargs)
363
364 if callable(maybe_callable):
--> 365 return maybe_callable(obj, **kwargs)
366
367 return maybe_callable

in <lambda>(x)
1 df = pd.DataFrame({'city': ['Kolkata','Delhi','Mumbai','Bankura','Dhaka','Jaipur','Goa',
2 'Delhi','Mumbai','Kolkata'],'temp':[i for i in range(10)]})
----> 3 df.assign(MajorCity = lambda x:x.city if x.city in ['Kolkata','Delhi','Mumbai'] else 'other')

~/anaconda3/lib/python3.7/site-packages/pandas/core/generic.py in __nonzero__(self)
1553 "The truth value of a {0} is ambiguous. "
1554 "Use a.empty, a.bool(), a.item(), a.any() or a.all().".format(
-> 1555 self.__class__.__name__
1556 )
1557 )

ValueError: The truth value of a Series is ambiguous. Use a.empty, a.bool(), a.item(), a.any() or a.all().

INSTALLED VERSIONS
------------------
commit : None
python : 3.7.3.final.0
python-bits : 64
OS : Linux
OS-release : 4.15.0-72-generic
machine : x86_64
processor : x86_64
byteorder : little
LC_ALL : None
LANG : en_US.UTF-8
LOCALE : en_US.UTF-8

pandas : 0.25.3
numpy : 1.16.4
pytz : 2019.1
dateutil : 2.8.0
pip : 19.1.1
setuptools : 41.0.1
Cython : 0.29.10
pytest : 4.6.2
hypothesis : None
sphinx : 2.1.0
blosc : None
feather : None
xlsxwriter : 1.1.8
lxml.etree : 4.3.3
html5lib : 1.0.1
pymysql : None
psycopg2 : None
jinja2 : 2.10.1
IPython : 7.5.0
pandas_datareader: None
bs4 : 4.7.1
bottleneck : 1.2.1
fastparquet : None
gcsfs : None
lxml.etree : 4.3.3
matplotlib : 3.1.0
numexpr : 2.6.9
odfpy : None
openpyxl : 2.6.2
pandas_gbq : None
pyarrow : None
pytables : None
s3fs : None
scipy : 1.2.1
sqlalchemy : 1.3.4
tables : 3.5.2
xarray : None
xlrd : 1.2.0
xlwt : 1.3.0
xlsxwriter : 1.1.8

@asishm
Contributor

asishm commented Dec 20, 2019

The callable you pass to df.assign receives the entire DataFrame as input, whereas df.apply(callable, axis=1) passes each row to the callable.

In the case of

df.assign(MajorCity = lambda x:x.city if x.city in ['Kolkata','Delhi','Mumbai'] else 'other')

the x in the lambda function is df, and x.city is a Series equivalent to df.city. Evaluating x.city in ['a', 'b', 'c'] on a whole Series is what leads to the error. In the apply case, x is a single row of df, so x.city is a single value.
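
To make the difference concrete, here is a minimal sketch on a small made-up frame. The helper names `record_apply` and `record_assign` and the column name `city_copy` are hypothetical, used only for illustration. It records what type of object each callable receives, then reproduces the membership test on a whole Series.

```python
import pandas as pd

# Small illustrative frame (not the full data from the report)
df = pd.DataFrame({'city': ['Kolkata', 'Delhi', 'Bankura'], 'temp': [0, 1, 2]})

seen = {}

def record_apply(x):
    # With axis=1, x is one row of df (a Series indexed by column name)
    seen['apply'] = type(x).__name__
    return x.city

def record_assign(x):
    # With assign, x is the whole DataFrame
    seen['assign'] = type(x).__name__
    return x.city

df.apply(record_apply, axis=1)
df.assign(city_copy=record_assign)

# Membership on a whole Series: the list compares each element with ==,
# which yields a boolean Series, and Python then asks for its truth value.
error_message = None
try:
    df.city in ['Kolkata', 'Delhi']
except ValueError as exc:
    error_message = str(exc)

print(seen)
print(error_message)
```

The `in` operator needs a single True/False answer, but comparing a Series to a string gives one boolean per row, which is why pandas refuses to pick a truth value.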

@TomAugspurger
Contributor

Thanks for the explanation @asishm!

TomAugspurger added this to the No action milestone Dec 20, 2019
@JayMan91
Author

Thanks for the explanation. Is there any way to create the new column with the pandas assign function in this case?
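
The thread does not record an answer to this. One possible approach, offered as a sketch rather than as the maintainers' recommendation, is to make the callable vectorized: since assign passes the whole DataFrame, `Series.isin` can build a boolean mask and `Series.where` can keep the matching values and fill the rest with "other". The variable names `major` and `result` are assumptions for illustration.

```python
import pandas as pd

df = pd.DataFrame({'city': ['Kolkata', 'Delhi', 'Mumbai', 'Bankura', 'Dhaka', 'Jaipur', 'Goa',
                            'Delhi', 'Mumbai', 'Kolkata'],
                   'temp': list(range(10))})

major = ['Kolkata', 'Delhi', 'Mumbai']

# x is the whole DataFrame; isin gives one boolean per row,
# and where keeps the city where the mask is True, else 'other'
result = df.assign(MajorCity=lambda x: x.city.where(x.city.isin(major), 'other'))

print(result)
```

This avoids the row-by-row lambda entirely, so it should also scale better than `apply(..., axis=1)` on larger frames.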
