BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file #41241

bob-zhao-work · 2021-04-30T18:40:56Z

I have checked that this issue has not already been reported.
I have confirmed this bug exists on the latest version of pandas.
(optional) I have confirmed this bug exists on the master branch of pandas.

Code Sample, a copy-pastable example

import pandas as pd
df_pq = pd.read_parquet(x, use_nullable_dtypes=True)  # x: path to the empty parquet file

Problem description

I get an error when I add the new parameter use_nullable_dtypes to pd.read_parquet().
If I remove it, everything goes back to normal.
OS: Ubuntu 16
Python: 3.8

An empty parquet file written by Spark causes the problem. Its schema is:

Authors,AuthorId,int64
Authors,Rank,int32
Authors,NormalizedName,string
Authors,DisplayName,string
Authors,LastKnownAffiliationId,int64
Authors,PaperCount,int64
Authors,PaperFamilyCount,int64
Authors,CitationCount,int64
Authors,CreatedDate,date32[day]

Error message:

df_pq = pd.read_parquet(x,use_nullable_dtypes = True)

File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
return impl.read(
File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
return self.api.parquet.read_table(
File "pyarrow/array.pxi", line 751, in pyarrow.lib._PandasConvertible.to_pandas
File "pyarrow/table.pxi", line 1668, in pyarrow.lib.Table._to_pandas
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 792, in table_to_blockmanager
blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in _table_to_blocks
return [_reconstruct_block(item, columns, extension_columns)
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in <listcomp>
return [_reconstruct_block(item, columns, extension_columns)
File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 751, in _reconstruct_block
pd_ext_arr = pandas_dtype.from_arrow(arr)
File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/integer.py", line 121, in from_arrow
return IntegerArray._concat_same_type(results)
File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 271, in _concat_same_type
data = np.concatenate([x._data for x in to_concat])
File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate

Expected Output

Read the empty parquet file and return an empty DataFrame.

Output of `pd.show_versions()`

1.2.4


toryhaavik · 2021-06-03T15:30:47Z

I've run into this as well on pandas 1.2.4 and pyarrow 3.0.0. A simple repro is:

import pandas as pd
df = pd.DataFrame({"value": pd.array([], dtype=pd.Int64Dtype()})
df.to_parquet("/path")
df2 = pd.read_parquet("/path")
...
ValueError: need at least one array to concatenate

It seems that just the presence of the nullable dtype is enough to trigger the error.

jreback · 2021-06-03T17:50:19Z

hmm. cc @jorisvandenbossche

simonjayhawkins · 2021-06-07T16:03:04Z

> I've run into this as well on pandas 1.2.4 and pyarrow 3.0.0. a simple repro is:

fixed in commit: [1fb626d] BUG: Handle zero-chunked pyarrow.ChunkedArray in StringArray (#41052)

@bob-zhao-work can you try on master?
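
For readers following the traceback: it ends in np.concatenate being called with an empty list, which is consistent with a zero-row file producing a column with zero chunks. The sketch below reproduces that failure with plain NumPy and shows one possible guard. The helper name concat_chunks is hypothetical and this is not the code from the referenced commit, only an illustration of the failure mode.

```python
import numpy as np


def concat_chunks(chunks, dtype):
    """Concatenate per-chunk arrays, tolerating an empty list of chunks.

    Hypothetical helper for illustration only; not pandas' actual fix.
    """
    if not chunks:
        # Zero chunks: return an empty array of the requested dtype
        # instead of handing an empty list to np.concatenate.
        return np.empty(0, dtype=dtype)
    return np.concatenate(chunks)


# The unguarded call fails the same way as the last frame of the traceback.
try:
    np.concatenate([])
except ValueError as exc:
    print(exc)

# The guarded version returns an empty, correctly typed array.
empty = concat_chunks([], np.int64)
print(empty.shape, empty.dtype)
```

The guard keeps the dtype even when there is no data, which is what lets the caller build an empty column of the right type.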

nakatomotoi · 2021-08-25T07:18:02Z

@simonjayhawkins
Hello, I would like to work on this issue. This will be my first contribution to OSS.
I understand that I can test this commit on the master branch, is that correct?

simonjayhawkins · 2021-08-25T09:37:20Z

Thanks @nakatomotoi. pandas has a test suite that is run on CI when a PR is opened. This issue requires a test to be added to the test suite so that we can close the issue knowing that future similar regressions should be less likely.

See https://github.com/pandas-dev/pandas/issues?q=is%3Aissue+is%3Aclosed+label%3A%22Needs+Tests%22 for issues like this that have been closed and check out the associated PRs for inspiration.

The developer guide is https://pandas.pydata.org/pandas-docs/dev/development/index.html

nakatomotoi · 2021-08-25T09:46:11Z

@simonjayhawkins
Thank you very much for your kind reply.
Now I am going to try to add a test case.

nakatomotoi · 2021-09-01T08:51:52Z

take

bob-zhao-work added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 30, 2021

bob-zhao-work changed the title ~~BUG:new param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file~~ BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file May 1, 2021

jreback added IO Parquet parquet, feather Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2021

simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 7, 2021

code sample for pandas-dev#41241

c426ece

mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 20, 2021

github-actions bot assigned nakatomotoi Sep 1, 2021

nakatomotoi mentioned this issue Sep 8, 2021

TST: add test to read empty array #43459

Merged

4 tasks

jreback added this to the 1.4 milestone Sep 12, 2021

jreback closed this as completed in #43459 Sep 29, 2021

yokomotod mentioned this issue Mar 19, 2025

fix: empty record dtypes googleapis/python-bigquery#2147

Merged

4 tasks
