Skip to content

Extremely Slow Block.setitem() when value is a builtin python type. #25756

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Closed
kykosic opened this issue Mar 17, 2019 · 8 comments · Fixed by #53177
Closed

Extremely Slow Block.setitem() when value is a builtin python type. #25756

kykosic opened this issue Mar 17, 2019 · 8 comments · Fixed by #53177
Assignees
Labels
Benchmark Performance (ASV) benchmarks good first issue Indexing Related to indexing on series/frames, not to indexes themselves

Comments

@kykosic
Copy link
Contributor

kykosic commented Mar 17, 2019

Code Sample

This code example is simply calling repeated setitem via a LocIndexer on a reasonably sized DataFrame of all bools. The key is that it takes exponentially longer to call this when you're setting a python bool type value versus a np.array[bool] value.

from datetime import datetime
import pandas as pd
import numpy as np

# Create dataframe of 500 columns, 1 year of dates, all False values
df = pd.DataFrame(
    False,
    columns=np.arange(500).astype(str),
    index=pd.date_range('2010-01-01', '2011-01-01')
)

def test(true_value):
    """ Test time for assigning a slice `True` and `np.array(True)` """
    tmp_df = df.copy()
    
    start = datetime(2010, 5, 1)
    end = datetime(2010, 9, 1)
    tmp_df.loc[start:end, :] = true_value

print("True")
%timeit test(True)
print("\nnp.array(True)")
%timeit test(np.array(True))
Output:

True
3.4 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.array(True)
512 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Problem description

A code profiling reveals that almost all the computation time in the longer run is due to ensure_object being called in cast.py:78 from a Block.setitem.

In the above example, a bool True is passed to the blocks.py:Block.setitem(...) method via LocIndexer. Since a builtin bool has no dtype attribute, the dtype is set to infer. This causes the cast.py:maybe_downcast_to_dtype(...) to be called with dtype == 'infer', and if the dataset is large the ensure_object(...) method takes a long time to run. Wrapping the value in a numpy array np.array(True) exponentially reduces the computation since it then has a dtype.

This behavior did not exist in older versions of pandas (pre-20ish). Could we do some pre-defined type checking in Block.setitem(...) to infer builtin python dtypes instead of requiring dtype attributed objects?

I would be happy to code a fix for this, but I am unsure of the reasoning behind only supporting objects with a dtype attribute as a value.

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.25.0.dev0+266.g707c7201a
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

@kykosic kykosic changed the title Extremely Slow Block.setitem() when value is a builtin type. Extremely Slow Block.setitem() when value is a builtin python type. Mar 17, 2019
@kykosic
Copy link
Contributor Author

kykosic commented Mar 17, 2019

Updated code snippet, had a typo.

@mroeschke
Copy link
Member

Could you simplify your performance test to just highlight the assignment portion (if thats the main issue)? I don't see how the loop is really relevant (and not really idiomatic to assign values in a DataFrame with looping).

@kykosic
Copy link
Contributor Author

kykosic commented Mar 20, 2019

Apologies for the poor example. I've simplified the code to illustrate the point more directly.

@mroeschke
Copy link
Member

Thanks for simplifying the example.

When assigning arrays with only value, I could see an optimization where we don't need to cast to object and treat the assignment the same as a scalar. Definitely feel free to put a PR!

@mroeschke mroeschke added Indexing Related to indexing on series/frames, not to indexes themselves Performance Memory or execution speed performance labels Mar 20, 2019
@mroeschke
Copy link
Member

These looks comparable on master now. Could use a ASV benchmark.

In [16]: from datetime import datetime
    ...: import pandas as pd
    ...: import numpy as np
    ...:
    ...: # Create dataframe of 500 columns, 1 year of dates, all False values
    ...: df = pd.DataFrame(
    ...:     False,
    ...:     columns=np.arange(500).astype(str),
    ...:     index=pd.date_range('2010-01-01', '2011-01-01')
    ...: )
    ...:
    ...: def test(true_value):
    ...:     """ Test time for assigning a slice `True` and `np.array(True)` """
    ...:     tmp_df = df.copy()
    ...:
    ...:     start = datetime(2010, 5, 1)
    ...:     end = datetime(2010, 9, 1)
    ...:     tmp_df.loc[start:end, :] = true_value
    ...:
    ...: print("True")
    ...: %timeit test(True)
    ...: print("\nnp.array(True)")
    ...: %timeit test(np.array(True))
True
218 µs ± 3.52 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

np.array(True)
233 µs ± 18.3 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions Benchmark Performance (ASV) benchmarks and removed Performance Memory or execution speed performance Needs Tests Unit test(s) needed to prevent regressions labels Jun 27, 2021
@aswinj18
Copy link

Hi @kykosic @mroeschke ,

I just starting out on my open source journey. I'm interested in the issue. Is there any way you guys can assign the issue to me?

@aswinj18
Copy link

take

@aswinj18
Copy link

aswinj18 commented Apr 30, 2023

@mroeschke
The issue seems to be fixed. Is there anything that I could help with, since the issue is still open?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Benchmark Performance (ASV) benchmarks good first issue Indexing Related to indexing on series/frames, not to indexes themselves
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants