Extremely Slow Block.setitem() when value is a builtin python type. #25756

@kykosic

Description

Code Sample

This code example calls setitem via a LocIndexer on a reasonably sized DataFrame of all bools. The key point is that the assignment is dramatically slower (roughly 7x in the timings below) when the value being set is a builtin Python bool rather than an np.array of bool.

from datetime import datetime
import pandas as pd
import numpy as np

# Create dataframe of 500 columns, 1 year of dates, all False values
df = pd.DataFrame(
    False,
    columns=np.arange(500).astype(str),
    index=pd.date_range('2010-01-01', '2011-01-01')
)

def test(true_value):
    """Time assigning `true_value` to a four-month row slice."""
    tmp_df = df.copy()

    start = datetime(2010, 5, 1)
    end = datetime(2010, 9, 1)
    tmp_df.loc[start:end, :] = true_value

print("True")
%timeit test(True)
print("\nnp.array(True)")
%timeit test(np.array(True))
Output:

True
3.4 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.array(True)
512 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Problem description

Profiling reveals that almost all of the computation time in the slower run is spent in ensure_object, called at cast.py:78 from Block.setitem.
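For reference, a minimal way to reproduce that profile, using the test() function defined above (the output file name is arbitrary):

import cProfile
import pstats

# Profile the slow path; most cumulative time should show up under
# pandas.core.dtypes.cast (ensure_object / maybe_downcast_to_dtype)
cProfile.run("test(True)", "setitem.prof")
pstats.Stats("setitem.prof").sort_stats("cumulative").print_stats(10)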

In the above example, a bool True is passed to the blocks.py Block.setitem(...) method via the LocIndexer. Since a builtin bool has no dtype attribute, the dtype is set to 'infer'. This causes cast.py:maybe_downcast_to_dtype(...) to be called with dtype == 'infer', and when the dataset is large the ensure_object(...) call takes a long time to run. Wrapping the value in a numpy array, np.array(True), avoids this entirely, since the value then carries a dtype.
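A minimal illustration of the divergence, assuming the dispatch on a .dtype attribute described above:

import numpy as np

py_value = True
np_value = np.array(True)

# A builtin bool carries no dtype, so Block.setitem falls back to
# dtype='infer' and maybe_downcast_to_dtype scans via ensure_object.
print(hasattr(py_value, "dtype"))  # False -> slow 'infer' path
print(np_value.dtype)              # bool  -> dtype known up front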

This behavior did not exist in older versions of pandas (pre-0.20 or so). Could we do some pre-defined type checking in Block.setitem(...) to infer builtin Python dtypes, instead of requiring objects with a dtype attribute?

I would be happy to code a fix for this, but I am unsure of the reasoning behind only supporting objects with a dtype attribute as a value.
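A rough sketch of what such a check might look like; the helper and mapping here are purely illustrative, not the actual pandas internals:

import numpy as np

# Hypothetical mapping from builtin Python scalar types to numpy dtypes,
# consulted before falling back to the 'infer' path.
_BUILTIN_DTYPES = {
    bool: np.dtype(bool),
    int: np.dtype(np.int64),
    float: np.dtype(np.float64),
    complex: np.dtype(np.complex128),
}

def infer_scalar_dtype(value):
    """Return a numpy dtype for plain Python scalars, else None."""
    return _BUILTIN_DTYPES.get(type(value))

print(infer_scalar_dtype(True))   # bool
print(infer_scalar_dtype(1.5))    # float64
print(infer_scalar_dtype("x"))    # None -> fall back to existing inference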

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.25.0.dev0+266.g707c7201a
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels

Benchmark (Performance (ASV) benchmarks), Indexing (Related to indexing on series/frames, not to indexes themselves), good first issue
