-
-
Notifications
You must be signed in to change notification settings - Fork 18.8k
Description
Code Sample
This code example is simply calling repeated setitem
via a LocIndexer
on a reasonably sized DataFrame of all bools. The key is that it takes exponentially longer to call this when you're setting a python bool
type value versus a np.array[bool]
value.
from datetime import datetime
import pandas as pd
import numpy as np
# Create dataframe of 500 columns, 1 year of dates, all False values
df = pd.DataFrame(
False,
columns=np.arange(500).astype(str),
index=pd.date_range('2010-01-01', '2011-01-01')
)
def test(true_value):
""" Test time for assigning a slice `True` and `np.array(True)` """
tmp_df = df.copy()
start = datetime(2010, 5, 1)
end = datetime(2010, 9, 1)
tmp_df.loc[start:end, :] = true_value
print("True")
%timeit test(True)
print("\nnp.array(True)")
%timeit test(np.array(True))
Output:
True
3.4 ms ± 39.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
np.array(True)
512 µs ± 4.91 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
Problem description
A code profiling reveals that almost all the computation time in the longer run is due to ensure_object
being called in cast.py:78
from a Block.setitem
.
In the above example, a bool True
is passed to the blocks.py:Block.setitem(...)
method via LocIndexer
. Since a builtin bool has no dtype
attribute, the dtype is set to infer
. This causes the cast.py:maybe_downcast_to_dtype(...)
to be called with dtype == 'infer'
, and if the dataset is large the ensure_object(...)
method takes a long time to run. Wrapping the value in a numpy array np.array(True)
exponentially reduces the computation since it then has a dtype
.
This behavior did not exist in older versions of pandas (pre-20ish). Could we do some pre-defined type checking in Block.setitem(...)
to infer builtin python dtypes instead of requiring dtype
attributed objects?
I would be happy to code a fix for this, but I am unsure of the reasoning behind only supporting objects with a dtype
attribute as a value.
Output of pd.show_versions()
INSTALLED VERSIONS
commit: None
python: 3.6.7.final.0
python-bits: 64
OS: Darwin
OS-release: 18.2.0
machine: x86_64
processor: i386
byteorder: little
LC_ALL: None
LANG: en_US.UTF-8
LOCALE: en_US.UTF-8
pandas: 0.25.0.dev0+266.g707c7201a
pytest: None
pip: 19.0.3
setuptools: 40.8.0
Cython: 0.29.6
numpy: 1.16.2
scipy: None
pyarrow: None
xarray: None
IPython: 7.3.0
sphinx: None
patsy: None
dateutil: 2.8.0
pytz: 2018.9
blosc: None
bottleneck: None
tables: None
numexpr: None
feather: None
matplotlib: None
openpyxl: None
xlrd: None
xlwt: None
xlsxwriter: None
lxml.etree: None
bs4: None
html5lib: None
sqlalchemy: None
pymysql: None
psycopg2: None
jinja2: None
s3fs: None
fastparquet: None
pandas_gbq: None
pandas_datareader: None
gcsfs: None