Description
Pandas version checks
- I have checked that this issue has not already been reported.
- I have confirmed this bug exists on the latest version of pandas.
- I have confirmed this bug exists on the main branch of pandas.
Reproducible Example
import pandas as pd
df = pd.read_parquet("mypath/myfile.parquet", engine="pyarrow")
df = df.convert_dtypes(dtype_backend="pyarrow")
len(df[df.full_path.str.endswith("90_WW")])
Issue Description
On large datasets (24 million rows), string operations fail with "pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays". This happens with Python 3.11 and pandas 2.2.2. The traceback shows the error being raised in pyarrow's ChunkedArray.take while the boolean mask is applied. It seems to be related to this previous issue: #55606
Traceback (most recent call last):
File "C:\Users\corey\Analytics_Software\pycharm\PyCharm 2024.1.1\plugins\python\helpers\pydev\pydevconsole.py", line 364, in runcode
coro = func()
^^^^^^
File "<input>", line 1, in <module>
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\frame.py", line 4093, in __getitem__
return self._getitem_bool_array(key)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\frame.py", line 4155, in _getitem_bool_array
return self._take_with_is_copy(indexer, axis=0)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\generic.py", line 4153, in _take_with_is_copy
result = self.take(indices=indices, axis=axis)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\generic.py", line 4133, in take
new_data = self._mgr.take(
^^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\internals\managers.py", line 894, in take
return self.reindex_indexer(
^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\internals\managers.py", line 687, in reindex_indexer
new_blocks = [
^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\internals\managers.py", line 688, in <listcomp>
blk.take_nd(
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\internals\blocks.py", line 1307, in take_nd
new_values = algos.take_nd(
^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\array_algos\take.py", line 114, in take_nd
return arr.take(indexer, fill_value=fill_value, allow_fill=allow_fill)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pandas\core\arrays\arrow\array.py", line 1309, in take
return type(self)(self._pa_array.take(indices))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow\\table.pxi", line 1052, in pyarrow.lib.ChunkedArray.take
File "C:\Users\corey\Miniconda3\envs\analytics311\Lib\site-packages\pyarrow\compute.py", line 487, in take
return call_function('take', [data, indices], options, memory_pool)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "pyarrow\\_compute.pyx", line 590, in pyarrow._compute.call_function
File "pyarrow\\_compute.pyx", line 385, in pyarrow._compute.Function.call
File "pyarrow\\error.pxi", line 155, in pyarrow.lib.pyarrow_internal_check_status
File "pyarrow\\error.pxi", line 92, in pyarrow.lib.check_status
pyarrow.lib.ArrowInvalid: offset overflow while concatenating arrays
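A rough sketch of why the row count may matter here. It assumes the cause is that Arrow's `string` type stores 32-bit offsets, so a single contiguous string array can hold at most 2**31 - 1 bytes of data, and that `take` combines the chunks into one array before indexing. Neither assumption has been confirmed for this dataset, and the helper names below are illustrative only.

```python
# Largest offset a 32-bit-offset Arrow string array can represent.
INT32_MAX = 2**31 - 1


def max_avg_bytes_per_row(n_rows: int, offset_limit: int = INT32_MAX) -> int:
    """Largest whole average string length (in bytes) that stays under the limit."""
    return offset_limit // n_rows


def fits_in_one_array(n_rows: int, avg_bytes: int, offset_limit: int = INT32_MAX) -> bool:
    """True if n_rows strings of avg_bytes each fit in one 32-bit-offset array."""
    return n_rows * avg_bytes <= offset_limit


n_rows = 24_000_000  # row count reported in this issue
threshold = max_avg_bytes_per_row(n_rows)
print(threshold)  # 89
print(fits_in_one_array(n_rows, 89))  # True
print(fits_in_one_array(n_rows, 90))  # False
```

Under these assumptions, a 24-million-row column of paths averaging about 90 bytes or more would exceed the limit once its chunks are concatenated. The actual average length of `full_path` was not measured.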
Expected Behavior
String operations should work with the Arrow backend on large datasets.
Installed Versions
INSTALLED VERSIONS
commit : d9cdd2e
python : 3.11.9.final.0
python-bits : 64
OS : Windows
OS-release : 10
Version : 10.0.22621
machine : AMD64
processor : Intel64 Family 6 Model 165 Stepping 2, GenuineIntel
byteorder : little
LC_ALL : None
LANG : None
LOCALE : English_United States.1252
pandas : 2.2.2
numpy : 2.1.0
pytz : 2024.1
dateutil : 2.9.0.post0
setuptools : 72.1.0
pip : 24.2
Cython : None
pytest : None
hypothesis : None
sphinx : None
blosc : None
feather : None
xlsxwriter : None
lxml.etree : None
html5lib : None
pymysql : None
psycopg2 : None
jinja2 : None
IPython : None
pandas_datareader : None
adbc-driver-postgresql: None
adbc-driver-sqlite : None
bs4 : None
bottleneck : None
dataframe-api-compat : None
fastparquet : None
fsspec : None
gcsfs : None
matplotlib : None
numba : None
numexpr : None
odfpy : None
openpyxl : None
pandas_gbq : None
pyarrow : 17.0.0
pyreadstat : None
python-calamine : None
pyxlsb : None
s3fs : None
scipy : None
sqlalchemy : None
tables : None
tabulate : None
xarray : None
xlrd : None
zstandard : None
tzdata : 2024.1
qtpy : None
pyqt5 : None