BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file #41241
Comments
I've run into this as well on pandas 1.2.4 and pyarrow 3.0.0. A simple repro is:
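A sketch along those lines, assuming a zero-row frame round-tripped through to_parquet (not the original snippet; the file name is arbitrary):

```python
import pandas as pd

# Zero-row frame with a nullable integer column, written out as parquet.
df = pd.DataFrame({"value": pd.array([], dtype="Int64")})
df.to_parquet("tmp.parquet", index=False)

# Reading it back without the new keyword returns an empty DataFrame; adding
# use_nullable_dtypes=True on pandas 1.2.4 / pyarrow 3.0.0 raises
# "ValueError: need at least one array to concatenate".
result = pd.read_parquet("tmp.parquet", use_nullable_dtypes=True)
print(result.dtypes)
```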
It seems just the presence of the nullable dtype is enough to trigger the error.
hmm. cc @jorisvandenbossche
Fixed in commit [1fb626d] BUG: Handle zero-chunked pyarrow.ChunkedArray in StringArray (#41052). @bob-zhao-work can you try on master?
@simonjayhawkins
Thanks @nakatomotoi. pandas has a test suite that is run on CI when a PR is opened. This issue requires a test to be added to the test suite so that we can close the issue knowing that future similar regressions should be less likely. See https://github.com/pandas-dev/pandas/issues?q=is%3Aissue+is%3Aclosed+label%3A%22Needs+Tests%22 for issues like this that have been closed, and check out the associated PRs for inspiration. The developer guide is https://pandas.pydata.org/pandas-docs/dev/development/index.html
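A regression test along these lines would cover it (the test name and the use of pytest's tmp_path fixture are illustrative assumptions, not the test that was eventually merged):

```python
import pandas as pd
import pandas._testing as tm


def test_read_parquet_use_nullable_dtypes_empty_file(tmp_path):
    # GH 41241: use_nullable_dtypes=True failed on an empty parquet file with
    # "ValueError: need at least one array to concatenate".
    path = tmp_path / "empty.parquet"
    expected = pd.DataFrame({"value": pd.array([], dtype="Int64")})
    expected.to_parquet(path, index=False)

    result = pd.read_parquet(path, use_nullable_dtypes=True)

    tm.assert_frame_equal(result, expected)
```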
@simonjayhawkins
take
I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.
Code Sample, a copy-pastable example
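A sketch of the failing call, with a placeholder path standing in for the empty Spark-written file described below:

```python
import pandas as pd

# "Authors.parquet" is a placeholder for the empty parquet file written by Spark.
# The same call without use_nullable_dtypes=True returns an empty DataFrame.
df = pd.read_parquet("Authors.parquet", use_nullable_dtypes=True)
```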
Problem description
I get an error when adding the new parameter use_nullable_dtypes to pd.read_parquet().
If I remove it, everything goes back to normal.
OS: Ubuntu 16
Python: 3.8
An empty parquet file from Spark causes the problem. Its schema (table "Authors") is:
AuthorId                 int64
Rank                     int32
NormalizedName           string
DisplayName              string
LastKnownAffiliationId   int64
PaperCount               int64
PaperFamilyCount         int64
CitationCount            int64
CreatedDate              date32[day]
Error message:
File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
return impl.read(
File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
return self.api.parquet.read_table(
File "pyarrow/array.pxi", line 751, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1668, in pyarrow.lib.Table._to_pandas
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 792, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in
return [_reconstruct_block(item, columns, extension_columns)
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 751, in _reconstruct_block
pd_ext_arr = pandas_dtype.__from_arrow__(arr)
File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/integer.py", line 121, in __from_arrow__
return IntegerArray._concat_same_type(results)
File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 271, in _concat_same_type
data = np.concatenate([x._data for x in to_concat])
File "<array_function internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
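The traceback suggests the nullable integer dtype's __from_arrow__ is handed a pyarrow ChunkedArray with zero chunks, which is how an empty parquet file is read back. A minimal illustration of that code path outside of read_parquet (a sketch, not taken from the report) would be:

```python
import pyarrow as pa
import pandas as pd

# An empty parquet file comes back from pyarrow as a ChunkedArray with zero chunks.
chunked = pa.chunked_array([], type=pa.int64())

# On pandas 1.2.4 this ends in IntegerArray._concat_same_type with nothing to
# concatenate, hence the ValueError above; per the comments earlier in the
# thread, master returns an empty IntegerArray here instead.
arr = pd.Int64Dtype().__from_arrow__(chunked)
print(arr)
```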
Expected Output
Read the empty parquet file and return an empty DataFrame.
Output of pd.show_versions()
pandas: 1.2.4