Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this issue exists on the latest version of pandas.
- I have confirmed this issue exists on the main branch of pandas.
Reproducible Example
Conversion to pyarrow datatypes changes the performance drastically. I did a bit of profiling, and it looks like agg is to blame here. With the recent introduction of PEP 668, testing the code on the latest branch is cumbersome, so I didn't. There are also potentially relevant issues, but they are not the same: #50121, #46505
from datetime import datetime
import numpy as np
import pandas as pd
symbols = 1000
start = datetime(2023, 1, 1)
end = datetime(2023, 1, 2)
data_cols = ['A', 'B', 'C', 'D', 'E']
agg_props = {'A': 'first', 'B': 'max', 'C': 'min', 'D': 'last', 'E': 'sum'}
base, sample = '1min', '5min'
def pandas_resample(df: pd.DataFrame):
    return (df
            .sort_values(['sid', 'timestamp'])
            .set_index('timestamp')
            .groupby('sid')
            .resample(sample, label='left', closed='left')
            .agg(agg_props)
            .reset_index()
            )
_rng = np.random.default_rng(123)
timestamps = pd.date_range(start, end, freq=base)
df = pd.DataFrame({'timestamp': pd.DatetimeIndex(timestamps),
                   **{_col: _rng.integers(50, 150, len(timestamps)) for _col in data_cols}})
ids = pd.DataFrame({'sid': _rng.integers(1000, 2000, symbols)})
df['id'] = 1
ids['id'] = 1
full_df = ids.merge(df, on='id').drop(columns=['id'])
%timeit pandas_resample(full_df.copy())
1.68 s ± 50.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
full_df['sid'] = full_df['sid'].astype("uint16[pyarrow]")
full_df['A'] = full_df['A'].astype("int16[pyarrow]")
full_df['B'] = full_df['B'].astype("int16[pyarrow]")
full_df['C'] = full_df['C'].astype("int16[pyarrow]")
full_df['D'] = full_df['D'].astype("int16[pyarrow]")
full_df['E'] = full_df['E'].astype("int16[pyarrow]")
%timeit pandas_resample(full_df.copy())
36.4 s ± 1.26 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
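For anyone who wants to check where the time goes, here is a minimal profiling sketch using the standard-library cProfile and pstats modules. It is not the exact profiling run behind the numbers above: the helper name profile_resample, the two-column aggregation, and the tiny three-symbol frame are assumptions chosen so it runs quickly.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_resample(df: pd.DataFrame, rule: str = "5min", top: int = 10) -> str:
    """Profile a groupby/resample/agg pipeline and return the top entries
    sorted by cumulative time. Hypothetical helper, for illustration only."""
    agg_props = {"A": "first", "B": "max"}
    profiler = cProfile.Profile()
    profiler.enable()
    (df.set_index("timestamp")
       .groupby("sid")
       .resample(rule, label="left", closed="left")
       .agg(agg_props))
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


# Small NumPy-backed frame: 3 symbols, one hour of 1-minute rows each.
rng = np.random.default_rng(0)
timestamps = pd.date_range("2023-01-01", periods=60, freq="1min")
n_rows = 3 * len(timestamps)
small = pd.DataFrame({
    "sid": np.repeat([1, 2, 3], len(timestamps)),
    "timestamp": np.tile(timestamps.to_numpy(), 3),
    "A": rng.integers(50, 150, n_rows),
    "B": rng.integers(50, 150, n_rows),
})

report = profile_resample(small)
print(report)
```

Casting the data columns with astype as in the example above and calling profile_resample again on the converted frame should make the two reports directly comparable.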
Installed Versions
INSTALLED VERSIONS
commit : 965ceca
python : 3.11.3.final.0
python-bits : 64
OS : Linux
OS-release : 6.4.2-arch1-1
Version : #1 SMP PREEMPT_DYNAMIC Thu, 06 Jul 2023 18:35:54 +0000
machine : x86_64
processor :
byteorder : little
LC_ALL : en_GB.UTF-8
LANG : en_GB.UTF-8
LOCALE : en_GB.UTF-8
pandas : 2.0.2
numpy : 1.25.0
pytz : 2023.3
dateutil : 2.8.2
setuptools : 68.0.0
pip : 23.1.2
Cython : 0.29.36
pytest : 7.4.0
hypothesis : 6.75.3
sphinx : 7.0.1
blosc : None
feather : None
xlsxwriter : None
lxml.etree : 4.9.2
html5lib : 1.1
pymysql : None
psycopg2 : None
jinja2 : 3.1.2
IPython : 8.14.0
pandas_datareader: None
bs4 : 4.12.2
bottleneck : None
brotli : None
fastparquet : None
fsspec : 2022.11.0
gcsfs : None
matplotlib : 3.7.1
numba : None
numexpr : None
odfpy : None
openpyxl : 3.1.2
pandas_gbq : None
pyarrow : 10.0.1
pyreadstat : None
pyxlsb : None
s3fs : None
scipy : 1.11.1
snappy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : 2023.1.0
xlrd : None
zstandard : None
tzdata : 2023.3
qtpy : None
pyqt5 : None
Prior Performance
No response