DataFrame query method - numexpr safety check fails #22435

machow · 2018-08-21T00:54:01Z

Code Sample, a copy-pastable example if possible

# Your code here
import pandas as pd
df = pd.DataFrame({'a': ['1','2','3'], 'b': [4,5,6]})
df.query("a.astype('int') < 2")

raises TypeError: unhashable type: 'numpy.ndarray'

Problem description

Background
When using numexpr, Pandas has an internal function, _check_ne_builtin_clash, for detecting when a variable used in a method like query clashes with a numexpr built-in.

Here's an example of the function raising an error as intended:

df = pd.DataFrame({'abs': [1,2,3]})
df.query("abs > 2")
# Raises NumExprClobberingError: Variables ... overlap with builtins: ('abs')

Mostly, the names it protects against are math functions like sin, cos, sum, etc.

Why my original example fails

The trouble with my original code is that _check_ne_builtin_clash is checking the names on both sides of the BinaryExpr AST node corresponding to "a.astype('int') < 2".
It does this by putting them into a frozenset.
However, the LHS ends up being a Constant node, with the name array([1,2,3]), which is an ndarray, so is not hashable.
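The failure mode can be reproduced in isolation. The sketch below is not pandas code: FakeArray and FakeTerm are hypothetical stand-ins for an unhashable value (such as an ndarray) and for a term that carries it as its name. Building a frozenset from those names raises the same kind of TypeError.

```python
class FakeArray:
    """Hypothetical stand-in for an unhashable object such as numpy.ndarray."""
    __hash__ = None  # setting __hash__ to None marks instances as unhashable


class FakeTerm:
    """Hypothetical stand-in for an expression term that exposes a .name."""
    def __init__(self, name):
        self.name = name


# LHS was already evaluated to an array-like constant; RHS is the literal 2.
terms = [FakeTerm(FakeArray()), FakeTerm(2)]

try:
    names = frozenset(term.name for term in terms)
except TypeError as exc:
    error_message = str(exc)  # mentions "unhashable type: 'FakeArray'"
    print(error_message)
```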

Solution

It seems like the helper function _check_ne_builtin_clash should consider any name that is unhashable safe, since it can't conflict with the function names being searched for. If this seems like a reasonable behavior, let me know and I will submit a PR!

code for function:

pandas/pandas/core/computation/engines.py

Lines 23 to 38 in b822535

    def _check_ne_builtin_clash(expr):
        """Attempt to prevent foot-shooting in a helpful way.

        Parameters
        ----------
        terms : Term
            Terms can contain
        """
        names = expr.names
        overlap = names & _ne_builtins

        if overlap:
            s = ', '.join(map(repr, overlap))
            raise NumExprClobberingError('Variables in expression "{expr}" '
                                         'overlap with builtins: ({s})'
                                         .format(expr=expr, s=s))

code for var names it looks for:

https://github.com/pandas-dev/pandas/blob/master/pandas/core/computation/ops.py#L20-L26
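
The proposed behaviour could be sketched roughly as follows. This is only an illustration, not the pandas implementation: NE_BUILTINS is a hypothetical stand-in for the real _ne_builtins set, clashing_names is an invented helper, and a plain list stands in for the unhashable ndarray.

```python
# Hypothetical subset of the numexpr function names pandas guards against.
NE_BUILTINS = frozenset({"abs", "sin", "cos", "sum"})


def clashing_names(names):
    """Return the hashable names that collide with NE_BUILTINS.

    Unhashable names (e.g. an array held by a constant term) are skipped,
    since they can never be equal to a builtin function name.
    """
    hashable = set()
    for name in names:
        try:
            hash(name)
        except TypeError:
            continue  # unhashable, so it cannot clash with a string name
        hashable.add(name)
    return frozenset(hashable) & NE_BUILTINS


print(clashing_names(["abs", 2]))        # frozenset({'abs'})
print(clashing_names([[1, 2, 3], 2]))    # frozenset()
```

Note that this only covers the name check itself; it says nothing about what numexpr does with the expression afterwards.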

Expected Output

> df.query("a.astype('int') < 2")
   a  b
0  1  4

Output of `pd.show_versions()`

INSTALLED VERSIONS

commit: None
python: 3.5.2.final.0
python-bits: 64
OS: Darwin
OS-release: 15.6.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.23.4
pytest: 3.2.1
pip: 9.0.1
setuptools: 40.0.0
Cython: 0.24
numpy: 1.15.0
scipy: 1.1.0
pyarrow: None
xarray: None
IPython: 6.5.0
sphinx: 1.4.9
patsy: 0.5.0
dateutil: 2.7.3
pytz: 2018.5
blosc: None
bottleneck: 1.1.0
tables: 3.2.2
numexpr: 2.6.5
feather: None
matplotlib: 2.2.2
openpyxl: 2.3.2
xlrd: 1.0.0
xlwt: 1.1.2
xlsxwriter: 0.9.2
lxml: 4.2.2
bs4: 4.6.0
html5lib: 1.0.1
sqlalchemy: 1.2.10
pymysql: None
psycopg2: None
jinja2: 2.10
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None


ianozsvald · 2020-06-03T13:46:05Z

I'd like to add another example to reinforce the above message. I came across this when using Pandas on a remote machine with numexpr whilst using Pandas locally without numexpr - the remote version failed, the local version ran.

Using the df from the parent comment, both of the following queries work if numexpr is not installed, and both fail in the same way if it is installed:

import pandas as pd
df = pd.DataFrame({'a': ['1','2','3'], 'b': [4,5,6]}) # same as parent
df.query("a.astype('int') < 2") # same as parent
df.query('b.abs() < 5') # new example

Both of the query lines raise the same exception:

~/miniconda3/envs/.../pandas/core/computation/expr.py in names(self)
    786         if is_term(self.terms):
    787             return frozenset([self.terms.name])
--> 788         return frozenset(term.name for term in com.flatten(self.terms))
    789 
    790 
TypeError: unhashable type: 'numpy.ndarray'

I ran the above example locally without numexpr, then with (which fails), then without again (which succeeds again).
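
Since the behaviour depends on whether numexpr is importable, a quick check like the one below can help when comparing two environments. It is a minimal sketch using only the standard library; numexpr_available and engine_hint are names made up for this example, and the hint only mirrors the idea that the numexpr engine is used when the package is present.

```python
import importlib.util


def numexpr_available():
    """Return True if the numexpr package can be found in this environment."""
    return importlib.util.find_spec("numexpr") is not None


# Which engine a default query would be expected to pick in this environment.
engine_hint = "numexpr" if numexpr_available() else "python"
print(engine_hint)
```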

I'm using Pandas 1.0.3, Python 3.8, numexpr 2.7.1.

pd.show_versions()

INSTALLED VERSIONS
------------------
commit           : None
python           : 3.8.2.final.0
python-bits      : 64
OS               : Linux
OS-release       : 5.6.7-050607-generic
machine          : x86_64
processor        : x86_64
byteorder        : little
LC_ALL           : None
LANG             : en_GB.UTF-8
LOCALE           : en_GB.UTF-8

pandas           : 1.0.3
numpy            : 1.17.5
pytz             : 2020.1
dateutil         : 2.8.1
pip              : 20.1.1
setuptools       : 47.1.1.post20200529
Cython           : None
pytest           : None
hypothesis       : None
sphinx           : None
blosc            : None
feather          : None
xlsxwriter       : None
lxml.etree       : None
html5lib         : None
pymysql          : None
psycopg2         : None
jinja2           : 2.11.2
IPython          : 7.15.0
pandas_datareader: None
bs4              : None
bottleneck       : None
fastparquet      : None
gcsfs            : None
lxml.etree       : None
matplotlib       : 3.2.1
numexpr          : None
odfpy            : None
openpyxl         : None
pandas_gbq       : None
pyarrow          : None
pytables         : None
pytest           : None
pyxlsb           : None
s3fs             : None
scipy            : 1.4.1
sqlalchemy       : None
tables           : None
tabulate         : None
xarray           : None
xlrd             : None
xlwt             : None
xlsxwriter       : None
numba            : 0.49.1

machow · 2020-08-11T16:59:52Z

I tested the example in my original post, and the one by the commenter above and both seem to work w/ v1.1. Going to close--thanks for all the work put into pandas!

simonjayhawkins · 2020-08-17T14:34:04Z

@machow i'll reopen this as this does not appear to be fixed in 1.1.0 or master. did you have numexpr installed?

simonjayhawkins · 2020-08-17T14:34:52Z

also we should generally add tests to prevent regression before closing issues.

machow · 2020-08-17T17:02:53Z

Ah, thanks @simonjayhawkins -- two years elapsed from the time this was opened, and I didn't see a response from any pandas devs, so assumed it may have gone stale (I likely don't have time to submit a PR for this anymore, but am happy to test).

edit: thanks for the pointer--after installing numexpr the error reappears. Any feedback on the original suggestion?

It seems like the helper function _check_ne_builtin_clash should consider any name that is unhashable safe, since it can't conflict with the function names being searched for.

AlexisMignon · 2021-08-27T15:56:40Z

Hello,

Is there anything new about this issue?

The following code:

import pandas as pd

df = pd.DataFrame([[0.0, 0.0], [0.0, 0.0]], columns=["A", "B"])
df.query("A.isnull()")

Crashes with the message:

Traceback (most recent call last):
 File "test_pandas.py", line 4, in <module>
   df.query("A.isnull()")
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 4055, in query
   res = self.eval(expr, **kwargs)
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/frame.py", line 4186, in eval
   return _eval(expr, inplace=inplace, **kwargs)
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/eval.py", line 353, in eval
   ret = eng_inst.evaluate()
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/engines.py", line 80, in evaluate
   res = self._evaluate()
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/engines.py", line 120, in _evaluate
   _check_ne_builtin_clash(self.expr)
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/engines.py", line 36, in _check_ne_builtin_clash
   names = expr.names
 File "/home/amignon/Projets/OffreFormation/exploratory-data-analysis-in-python/venv/lib/python3.8/site-packages/pandas/core/computation/expr.py", line 833, in names
   return frozenset([self.terms.name])
TypeError: unhashable type: 'Series'

This happens only when numexpr is installed (pandas==1.3.2 and numexpr==2.7.3).

AlexisMignon · 2021-08-30T10:46:14Z

The following solution was proposed by @machow :

Solution

It seems like the helper function _check_ne_builtin_clash should consider any name that is unhashable safe, since it can't conflict with the function names being searched for. If this seems like a reasonable behavior, let me know and I will submit a PR!

It is unfortunately not enough. I've replaced the code starting at line 827 of pandas/pandas/core/computation/expr.py (in c979bd8)

    @property
    def names(self):
        """
        Get the names in an expression.
        """
        if is_term(self.terms):
            return frozenset([self.terms.name])
        return frozenset(term.name for term in com.flatten(self.terms))

by

    @property
    def names(self):
        """
        Get the names in an expression.
        """
        if is_term(self.terms):
            if self.terms.name.__hash__ is not None:
                return frozenset([self.terms.name])
            else:
                return frozenset()
        return frozenset(term.name for term in com.flatten(self.terms))

which is probably not the best way. It however allowed me to go further in the execution.

At some point, at line 121 of pandas/pandas/core/computation/engines.py (in c979bd8):

    return ne.evaluate(s, local_dict=scope)

a string is passed to numexpr.evaluate. In my example above, this string is:

"'0    False\\n1    False\\nName: A, dtype: bool'"

It seems that the result of the expression parsing with PandasExprVisitor is a Constant term associated with the evaluation of the expression. When passed to ne.evaluate(), it's the string representation which is used.

untanglereality · 2021-09-09T22:19:12Z

If anyone is having trouble with unhashable type error when using Pandas query, you can add engine="python" argument if the performance isn't a problem.

Example:

orders.query("item_name.str.contains('Chicken')", engine="python")

You can pass engine='python' to evaluate an expression using Python itself as a backend. This is not recommended as it is inefficient compared to using numexpr as the engine.
Source: DataFrame.query documentation

You can also use the old-style masking instead.

orders[orders.item_name.str.contains('Chicken')]

jiagengliu · 2022-07-31T18:25:55Z

If anyone is having trouble with unhashable type error when using the Pandas query, you can upgrade to pandas 1.4 (which requires Python 3.8).

pip install pandas==1.4.3 fixes the problem for me.

machow changed the title ~~query - numexpr safety check fails~~ DataFrame query method - numexpr safety check fails Aug 21, 2018

mroeschke added Bug Numeric Operations Arithmetic, Comparison, and Logical operations labels Jan 13, 2019

jbrockmendel added the expressions pd.eval, query label Oct 22, 2019

machow closed this as completed Aug 11, 2020

simonjayhawkins reopened this Aug 17, 2020

simonjayhawkins added this to the Contributions Welcome milestone Aug 17, 2020

mroeschke removed the Numeric Operations Arithmetic, Comparison, and Logical operations label Jun 22, 2021

AlexisMignon mentioned this issue Aug 30, 2021

BUG: Solves errors when calling series methods in DataFrame.query with numexpr #43301

Merged

4 tasks

jreback modified the milestones: Contributions Welcome, 1.4 Aug 31, 2021

jreback closed this as completed in #43301 Sep 25, 2021

bl-young mentioned this issue Jul 8, 2022

in geo.filtered_fips: TypeError: unhashable type: 'Series' USEPA/flowsa#238

Closed

katxiao mentioned this issue Sep 27, 2022

Catch typeerror in new row synthesis query sdv-dev/SDMetrics#234

Merged

npatki mentioned this issue Feb 24, 2023

Fix ValueError in NewRowSynthesis if size of row_filter exceeds 31 sdv-dev/SDMetrics#313

Merged
