DataFrame query method - numexpr safety check fails #22435

Closed
machow opened this issue Aug 21, 2018 · 9 comments · Fixed by #43301
Labels: Bug, expressions (pd.eval, query)

Comments

@machow

machow commented Aug 21, 2018

Code Sample, a copy-pastable example if possible

# Your code here
import pandas as pd
df = pd.DataFrame({'a': ['1','2','3'], 'b': [4,5,6]})
df.query("a.astype('int') < 2")

raises TypeError: unhashable type: 'numpy.ndarray'

Problem description

Background
When using numexpr, Pandas has an internal function, _check_ne_builtin_clash, for detecting when a variable used in a method like query clashes with a numexpr built-in.

Here's an example of the function raising an error as intended:

df = pd.DataFrame({'abs': [1,2,3]})
df.query("abs > 2")
# Raises NumExprClobberingError: Variables ... overlap with builtins: ('abs')

Mostly, the names it protects against are math functions such as sin, cos, and sum.

Why my original example fails

The trouble with my original code is that _check_ne_builtin_clash checks the names on both sides of the BinaryExpr AST node corresponding to "a.astype('int') < 2".
It does this by putting them into a frozenset.
However, the LHS ends up being a Constant node whose name is array([1,2,3]). That name is an ndarray, which is not hashable, so building the frozenset fails.
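The failure can be reproduced without pandas. The sketch below is a minimal stand-in, not pandas code: FakeTerm and collect_names are hypothetical names, and a plain list plays the role of the ndarray, since both are unhashable.

```python
class FakeTerm:
    """Hypothetical stand-in for a parsed term; only `name` matters here."""

    def __init__(self, name):
        self.name = name


def collect_names(terms):
    # Every name must be hashable to become a frozenset element.
    return frozenset(term.name for term in terms)


# Ordinary column names are strings, so this works.
names_ok = collect_names([FakeTerm("a"), FakeTerm("b")])

# A list stands in for the ndarray produced by evaluating a.astype('int').
try:
    collect_names([FakeTerm([1, 2, 3]), FakeTerm(2)])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

print(sorted(names_ok))
print(error_message)
```

The second call fails while the set is being built, before any comparison against the numexpr built-in names can happen.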

Solution

It seems like the helper function _check_ne_builtin_clash should consider any name that is unhashable safe, since it can't conflict with the function names being searched for. If this seems like a reasonable behavior, let me know and I will submit a PR!

code for function:

def _check_ne_builtin_clash(expr):
    """Attempt to prevent foot-shooting in a helpful way.

    Parameters
    ----------
    terms : Term
        Terms can contain
    """
    names = expr.names
    overlap = names & _ne_builtins

    if overlap:
        s = ', '.join(map(repr, overlap))
        raise NumExprClobberingError('Variables in expression "{expr}" '
                                     'overlap with builtins: ({s})'
                                     .format(expr=expr, s=s))
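A minimal sketch of the suggestion above, assuming the names have already been collected into a plain iterable. Everything here is hypothetical and stdlib-only: NE_BUILTINS is a small illustrative subset rather than the real _ne_builtins, and ClobberingError stands in for NumExprClobberingError.

```python
NE_BUILTINS = frozenset({"sin", "cos", "abs", "sum"})  # illustrative subset only


class ClobberingError(NameError):
    """Hypothetical stand-in for NumExprClobberingError."""


def is_hashable(obj):
    # hash() raises TypeError for unhashable objects such as lists or ndarrays.
    try:
        hash(obj)
    except TypeError:
        return False
    return True


def check_builtin_clash(names):
    # Unhashable names cannot be members of a frozenset of strings, so skip them.
    hashable_names = frozenset(n for n in names if is_hashable(n))
    overlap = hashable_names & NE_BUILTINS
    if overlap:
        joined = ", ".join(map(repr, sorted(overlap)))
        raise ClobberingError(f"Variables overlap with builtins: ({joined})")
    return hashable_names


# An unhashable "name" (a list standing in for an ndarray) no longer breaks the check.
checked = check_builtin_clash(["b", [1, 2, 3]])

# A real clash is still reported.
try:
    check_builtin_clash(["abs"])
    clash_message = None
except ClobberingError as exc:
    clash_message = str(exc)

print(checked)
print(clash_message)
```

Note that the real function reads expr.names, and a later comment in this thread reports that the exception is raised while that set is being built, so in pandas the filter would probably have to live where the names are collected rather than in the check itself.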

code for var names it looks for:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/computation/ops.py#L20-L26

Expected Output

> df.query("a.astype('int') < 2")
   a  b
0  1  4

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.2.1
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.24
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.4.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 4.2.2
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None

@machow changed the title from "query - numexpr safety check fails" to "DataFrame query method - numexpr safety check fails" on Aug 21, 2018
@mroeschke added the Bug and Numeric Operations (Arithmetic, Comparison, and Logical operations) labels on Jan 13, 2019
@jbrockmendel added the expressions (pd.eval, query) label on Oct 22, 2019
@ianozsvald
Contributor

I'd like to add another example to reinforce the above message. I came across this when using Pandas on a remote machine with numexpr installed while using Pandas locally without it: the remote version failed and the local version ran.

Using the df from the parent comment, both of the following queries work if numexpr is not installed, and both fail the same way if it is installed:

import pandas as pd
df = pd.DataFrame({'a': ['1','2','3'], 'b': [4,5,6]}) # same as parent
df.query("a.astype('int') < 2") # same as parent
df.query('b.abs() < 5') # new example

Both of the query lines raise the same exception:

~/miniconda3/envs/.../pandas/core/computation/expr.py in names(self)
    786         if is_term(self.terms):
    787             return frozenset([self.terms.name])
--> 788         return frozenset(term.name for term in com.flatten(self.terms))
    789 
    790 
TypeError: unhashable type: 'numpy.ndarray'

I ran the above example locally without numexpr installed (it succeeds), then with it installed (it fails), then without it again (it succeeds again).

I'm using Pandas 1.0.3, Python 3.8, numexpr 2.7.1.
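Since the behaviour reported here depends on whether numexpr is importable in the current environment, it can help to check that before comparing results across machines. A small stdlib-only sketch (has_module is a hypothetical helper name, not part of pandas):

```python
import importlib.util


def has_module(name):
    # find_spec returns None when a top-level module cannot be found.
    return importlib.util.find_spec(name) is not None


numexpr_available = has_module("numexpr")
print("numexpr importable:", numexpr_available)
```

Running this in both the local and the remote environment shows quickly whether the two are comparable.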

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.6.7-050607-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.0.3
numpy            : 1.17.5
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.1.post20200529
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.15.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : 0.49.1

@machow
Author

machow commented Aug 11, 2020

I tested the example in my original post, and the one by the commenter above and both seem to work w/ v1.1. Going to close--thanks for all the work put into pandas!

@machow closed this as completed on Aug 11, 2020
@simonjayhawkins
Member

@machow I'll reopen this, as it does not appear to be fixed in 1.1.0 or master. Did you have numexpr installed?

@simonjayhawkins
Member

Also, we should generally add tests to prevent regressions before closing issues.

@simonjayhawkins added this to the Contributions Welcome milestone on Aug 17, 2020
@machow
Author

machow commented Aug 17, 2020

Ah, thanks @simonjayhawkins -- two years elapsed from the time this was opened, and I didn't see a response from any pandas devs, so assumed it may have gone stale (I likely don't have time to submit a PR for this anymore, but am happy to test).

edit: thanks for the pointer--after installing numexpr the error reappears. Any feedback on the original suggestion?

It seems like the helper function _check_ne_builtin_clash should consider any name that is unhashable safe, since it can't conflict with the function names being searched for.

@mroeschke removed the Numeric Operations (Arithmetic, Comparison, and Logical operations) label on Jun 22, 2021
@AlexisMignon
Contributor

Hello,

Is there anything new on this issue?

The following code:

import pandas as pd

df = pd.DataFrame([[0.0, 0.0], [0.0, 0.0]], columns=["A", "B"])
df.query("A.isnull()")

Crashes with the message:

Traceback (most recent call last):
 File "test_pandas.py", line 4, in <module>
   df.query("A.isnull()")
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 4055, in query
   res = self.eval(expr, **kwargs)
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 4186, in eval
   return _eval(expr, inplace=inplace, **kwargs)
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 353, in eval
   ret = eng_inst.evaluate()
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/engines.py", line 80, in evaluate
   res = self._evaluate()
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/engines.py", line 120, in _evaluate
   _check_ne_builtin_clash(self.expr)
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/engines.py", line 36, in _check_ne_builtin_clash
   names = expr.names
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 833, in names
   return frozenset([self.terms.name])
TypeError: unhashable type: 'Series'

This happens only when numexpr is installed (pandas==1.3.2, numexpr==2.7.3).

@AlexisMignon
Contributor

The following solution was proposed by @machow :

Solution

It seems like the helper function _check_ne_builtin_clash should consider any name that is unhashable safe, since it can't conflict with the function names being searched for. If this seems like a reasonable behavior, let me know and I will submit a PR!

It is unfortunately not enough. I replaced the following code:

    @property
    def names(self):
        """
        Get the names in an expression.
        """
        if is_term(self.terms):
            return frozenset([self.terms.name])
        return frozenset(term.name for term in com.flatten(self.terms))

by

    @property
    def names(self):
        """
        Get the names in an expression.
        """
        if is_term(self.terms):
            if self.terms.name.__hash__ is not None:
                return frozenset([self.terms.name])
            else:
                return frozenset()
        return frozenset(term.name for term in com.flatten(self.terms))

This is probably not the best way, but it allowed me to get further in the execution.
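The patch above guards only the single-term branch, while the traceback earlier in this thread points at the flatten branch. The sketch below applies the same __hash__ test to both branches. It is stdlib-only and hypothetical throughout: FakeTerm, is_term, flatten, and hashable_names are stand-ins, not the pandas implementations.

```python
class FakeTerm:
    """Hypothetical stand-in for a pandas Term; only `name` matters here."""

    def __init__(self, name):
        self.name = name


def is_term(obj):
    return isinstance(obj, FakeTerm)


def flatten(items):
    # Simplified stand-in for com.flatten: walk nested lists and tuples.
    for item in items:
        if isinstance(item, (list, tuple)):
            yield from flatten(item)
        else:
            yield item


def hashable_names(terms):
    if is_term(terms):
        candidates = [terms.name]
    else:
        candidates = [term.name for term in flatten(terms)]
    # Unhashable types such as list or ndarray set __hash__ to None.
    return frozenset(n for n in candidates if type(n).__hash__ is not None)


# Single-term branch: the unhashable name is dropped.
single = hashable_names(FakeTerm([0.0, 0.0]))

# Multi-term branch: only the unhashable name is dropped.
nested = hashable_names([FakeTerm([1, 2, 3]), [FakeTerm("b"), FakeTerm(2)]])

print(single)
print(nested)
```

As the rest of this comment explains, getting past the name check is only the first step: the evaluated constant still has to be handed to numexpr in a usable form.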

At some point:

return ne.evaluate(s, local_dict=scope)

a string is passed to numexpr.evaluate. In my example above, this string is:

"'0    False\\n1    False\\nName: A, dtype: bool'"

It seems that the result of parsing the expression with PandasExprVisitor is a Constant term holding the already-evaluated result of the expression. When it is passed to ne.evaluate(), it is the string representation that is used.
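One way to read the string quoted above: it matches the repr of the printed form of the result, so numexpr would receive a quoted string literal rather than the boolean data. The sketch below only reproduces that string. FakeSeries is a hypothetical stand-in, and this is not a claim about how pandas builds the string internally.

```python
class FakeSeries:
    """Hypothetical stand-in whose text form mimics the printed Series."""

    def __str__(self):
        return "0    False\n1    False\nName: A, dtype: bool"


result = FakeSeries()

# repr() of the printed text yields a quoted literal with escaped newlines.
as_expression = repr(str(result))

print(as_expression)
```

The values themselves are no longer recoverable from such a string, which is consistent with the observation that guarding the name check alone is not enough.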

@untanglereality

untanglereality commented Sep 9, 2021

If anyone is having trouble with the unhashable type error when using Pandas query, you can add the engine="python" argument, provided performance isn't a problem.

Example:

orders.query("item_name.str.contains('Chicken')", engine="python")

You can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
Source: DataFrame.query documentation

You can also use the old-style masking instead.

orders[orders.item_name.str.contains('Chicken')]

@jiagengliu
Copy link

jiagengliu commented Jul 31, 2022

If anyone is having trouble with unhashable type error when using the Pandas query, you can upgrade to pandas 1.4 (which requires Python 3.8).

pip install pandas==1.4.3 fixes the problem for me.

9 participants