Extremely Slow Block.setitem() when value is a builtin python type. #25756

@kykosic

Description

Code Sample

This code example calls setitem via a LocIndexer on a reasonably sized DataFrame of all bools. The key point is that the assignment is dramatically slower (roughly 7x in the timings below) when the value being set is a builtin Python bool rather than an np.array of bool.

from datetime import datetime
import pandas as pd
import numpy as np

# Create dataframe of 500 columns, 1 year of dates, all False values
df = pd.DataFrame(
    False,
    columns=np.arange(500).astype(str),
    index=pd.date_range('2010-01-01', '2011-01-01')
)

def test(true_value):
    """Time assigning `true_value` to a four-month row slice."""
    tmp_df = df.copy()

    start = datetime(2010, 5, 1)
    end = datetime(2010, 9, 1)
    tmp_df.loc[start:end, :] = true_value

print("True")
%timeit test(True)
print("\nnp.array(True)")
%timeit test(np.array(True))
Output:

True
3.4 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

np.array(True)
512 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Problem description

Profiling reveals that almost all of the computation time in the slower run is spent in ensure_object, called at cast.py:78 from Block.setitem.
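For reference, a minimal way to reproduce that profile, using the test() function defined above (the output file name is arbitrary):

import cProfile
import pstats

# Profile the slow path; most cumulative time should show up under
# pandas.core.dtypes.cast (ensure_object / maybe_downcast_to_dtype)
cProfile.run("test(True)", "setitem.prof")
pstats.Stats("setitem.prof").sort_stats("cumulative").print_stats(10)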

In the above example, a bool True is passed to the blocks.py Block.setitem(...) method via the LocIndexer. Since a builtin bool has no dtype attribute, the dtype is set to 'infer'. This causes cast.py:maybe_downcast_to_dtype(...) to be called with dtype == 'infer', and when the dataset is large the ensure_object(...) call takes a long time to run. Wrapping the value in a numpy array, np.array(True), avoids this entirely, since the value then carries a dtype.
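A minimal illustration of the divergence, assuming the dispatch on a .dtype attribute described above:

import numpy as np

py_value = True
np_value = np.array(True)

# A builtin bool carries no dtype, so Block.setitem falls back to
# dtype='infer' and maybe_downcast_to_dtype scans via ensure_object.
print(hasattr(py_value, "dtype"))  # False -> slow 'infer' path
print(np_value.dtype)              # bool  -> dtype known up front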

This behavior did not exist in older versions of pandas (pre-0.20 or so). Could we do some pre-defined type checking in Block.setitem(...) to infer builtin Python dtypes, instead of requiring objects with a dtype attribute?

I would be happy to code a fix for this, but I am unsure of the reasoning behind only supporting objects with a dtype attribute as a value.
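A rough sketch of what such a check might look like; the helper and mapping here are purely illustrative, not the actual pandas internals:

import numpy as np

# Hypothetical mapping from builtin Python scalar types to numpy dtypes,
# consulted before falling back to the 'infer' path.
_BUILTIN_DTYPES = {
    bool: np.dtype(bool),
    int: np.dtype(np.int64),
    float: np.dtype(np.float64),
    complex: np.dtype(np.complex128),
}

def infer_scalar_dtype(value):
    """Return a numpy dtype for plain Python scalars, else None."""
    return _BUILTIN_DTYPES.get(type(value))

print(infer_scalar_dtype(True))   # bool
print(infer_scalar_dtype(1.5))    # float64
print(infer_scalar_dtype("x"))    # None -> fall back to existing inference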

Output of pd.show_versions()

INSTALLED VERSIONS

commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8

pandas: 0.25.0.dev0+266.g707c7201a
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None

Labels

Benchmark (Performance (ASV) benchmarks), Indexing (Related to indexing on series/frames, not to indexes themselves), good first issue
