Extremely Slow Block.setitem() when value is a builtin python type #25756
Comments
Updated code snippet, had a typo.

Could you simplify your performance test to just highlight the assignment portion (if that's the main issue)? I don't see how the loop is really relevant (and it's not really idiomatic to assign values in a DataFrame by looping).

Apologies for the poor example. I've simplified the code to illustrate the point more directly.

Thanks for simplifying the example. When assigning arrays with only one value, I could see an optimization where we don't need to cast to object and can treat the assignment the same as a scalar. Definitely feel free to put up a PR!
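The optimization suggested here — treating an array whose elements are all equal like a scalar assignment — might be sketched as follows. The helper name and placement are hypothetical illustrations, not pandas internals:

```python
import numpy as np

def maybe_collapse_uniform(value):
    """If `value` is a non-empty array whose elements are all equal,
    return the single scalar instead, so the assignment can take the
    fast scalar path. (Hypothetical helper, not pandas' actual code.)"""
    arr = np.asarray(value)
    if arr.size > 0 and (arr == arr.flat[0]).all():
        return arr.flat[0]
    return value
```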
These look comparable on master now. Could use an ASV benchmark.
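A minimal ASV benchmark along the lines suggested above might look like this sketch. The class and method names are assumptions, not part of pandas' actual benchmark suite:

```python
import numpy as np
import pandas as pd


class SetitemBuiltinScalar:
    """Time .loc scalar assignment with a python bool vs a numpy value."""

    def setup(self):
        # A reasonably sized all-bool DataFrame (size is illustrative).
        self.df = pd.DataFrame(np.zeros((100_000, 10), dtype=bool))

    def time_setitem_python_bool(self):
        # Builtin bool: the value has no .dtype attribute.
        self.df.loc[0, 0] = True

    def time_setitem_numpy_bool(self):
        # numpy scalar: carries a dtype, historically the fast path.
        self.df.loc[0, 0] = np.bool_(True)
```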
Hi @kykosic @mroeschke, I'm just starting out on my open source journey and I'm interested in this issue. Is there any way you can assign it to me?
take |
@mroeschke |
Code Sample

This code example simply calls setitem repeatedly via a LocIndexer on a reasonably sized DataFrame of all bools. The key is that the call takes orders of magnitude longer when the value being set is a python bool than when it is an np.array of bool.

Problem description
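A reproduction along these lines can be sketched as follows; the DataFrame size and repeat count are illustrative choices, not from the original snippet:

```python
import timeit

import numpy as np
import pandas as pd

# A reasonably sized DataFrame of all bools.
df = pd.DataFrame(np.zeros((100_000, 10), dtype=bool))

def set_python_bool():
    # Builtin python bool: historically triggered the slow object-cast path.
    df.loc[0, 0] = True

def set_numpy_bool():
    # Value wrapped in a numpy array: carries a dtype, so it stays fast.
    df.loc[0, 0] = np.array(True)

print("python bool:   %.4fs" % timeit.timeit(set_python_bool, number=100))
print("np.array bool: %.4fs" % timeit.timeit(set_numpy_bool, number=100))
```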
Code profiling reveals that almost all of the computation time in the slow run is spent in ensure_object, called in cast.py:78 from Block.setitem.

In the above example, a bool True is passed to the blocks.py:Block.setitem(...) method via a LocIndexer. Since a builtin bool has no dtype attribute, the dtype is set to infer. This causes cast.py:maybe_downcast_to_dtype(...) to be called with dtype == 'infer', and if the dataset is large the ensure_object(...) call takes a long time to run. Wrapping the value in a numpy array, np.array(True), dramatically reduces the computation time since the value then has a dtype.
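The dtype distinction described above can be seen directly in a small sketch:

```python
import numpy as np

# A builtin bool carries no dtype, so pandas falls back to dtype
# inference (the 'infer' path that ends up calling ensure_object
# on large blocks).
assert not hasattr(True, "dtype")

# Wrapping the value in a numpy array gives it an explicit dtype,
# which lets the cast machinery skip the expensive inference.
assert np.array(True).dtype == np.bool_
```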
This behavior did not exist in older versions of pandas (pre-0.20 or so). Could we do some pre-defined type checking in Block.setitem(...) to infer builtin python dtypes instead of requiring objects with a dtype attribute? I would be happy to code a fix for this, but I am unsure of the reasoning behind only supporting objects with a dtype attribute as a value.

Output of pd.show_versions():
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.25.0.dev0+266.g707c7201a
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None