BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file #41241

Closed
2 of 3 tasks
bob-zhao-work opened this issue Apr 30, 2021 · 7 comments · Fixed by #43459
Labels
good first issue IO Parquet parquet, feather Needs Tests Unit test(s) needed to prevent regressions Regression Functionality that used to work in a prior pandas version
Comments

bob-zhao-work commented Apr 30, 2021

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • (optional) I have confirmed this bug exists on the master branch of pandas.


Code Sample, a copy-pastable example

import pandas as pd
df_pq = pd.read_parquet(x, use_nullable_dtypes=True)  # x: path to the empty Parquet file

Problem description

I get an error when I add the new parameter use_nullable_dtypes to pd.read_parquet().
If I remove it, everything goes back to normal.
OS: Ubuntu 16
Python: 3.8

An empty Parquet file written by Spark causes the problem. Its schema is:

Authors,AuthorId,int64
Authors,Rank,int32
Authors,NormalizedName,string
Authors,DisplayName,string
Authors,LastKnownAffiliationId,int64
Authors,PaperCount,int64
Authors,PaperFamilyCount,int64
Authors,CitationCount,int64
Authors,CreatedDate,date32[day]

Error message:

df_pq = pd.read_parquet(x,use_nullable_dtypes = True)

  File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 459, in read_parquet
    return impl.read(
  File "/vjan/lib/python3.8/site-packages/pandas/io/parquet.py", line 221, in read
    return self.api.parquet.read_table(
  File "pyarrow/array.pxi", line 751, in pyarrow.lib._PandasConvertible.to_pandas
  File "pyarrow/table.pxi", line 1668, in pyarrow.lib.Table._to_pandas
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 792, in table_to_blockmanager
    blocks = _table_to_blocks(options, table, categories, ext_columns_dtypes)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in _table_to_blocks
    return [_reconstruct_block(item, columns, extension_columns)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 1133, in
    return [_reconstruct_block(item, columns, extension_columns)
  File "/vjan/lib/python3.8/site-packages/pyarrow/pandas_compat.py", line 751, in _reconstruct_block
    pd_ext_arr = pandas_dtype.__from_arrow__(arr)
  File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/integer.py", line 121, in __from_arrow__
    return IntegerArray._concat_same_type(results)
  File "/vjan/lib/python3.8/site-packages/pandas/core/arrays/masked.py", line 271, in _concat_same_type
    data = np.concatenate([x._data for x in to_concat])
  File "<__array_function__ internals>", line 5, in concatenate
ValueError: need at least one array to concatenate
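
The last frames of the traceback suggest that `_concat_same_type` was handed an empty list, which would happen if the column arrived with zero chunks. The following sketch reproduces only that final step with NumPy and shows one possible guard. `concat_or_empty` is a hypothetical helper for illustration, not the code pandas uses.

```python
import numpy as np


def concat_or_empty(arrays, dtype):
    """Concatenate `arrays`, or return an empty array of `dtype` if there are none."""
    if not arrays:
        # np.concatenate raises ValueError on an empty sequence, so handle it here.
        return np.array([], dtype=dtype)
    return np.concatenate(arrays)


# The unguarded call fails in the same way as the bottom of the traceback.
try:
    np.concatenate([])
except ValueError as exc:
    message = str(exc)
    print(message)

# The guarded version returns an empty, correctly typed array instead.
empty = concat_or_empty([], np.int64)
print(empty.dtype, empty.shape)
```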

Expected Output

Read the empty Parquet file and return an empty DataFrame.

Output of pd.show_versions()

1.2.4

@bob-zhao-work bob-zhao-work added Bug Needs Triage Issue that has not been reviewed by a pandas team member labels Apr 30, 2021
@bob-zhao-work bob-zhao-work changed the title BUG:new param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file BUG: New param [use_nullable_dtypes] of pd.read_parquet() can't handle empty parquet file May 1, 2021
@toryhaavik

I've run into this as well on pandas 1.2.4 and pyarrow 3.0.0. A simple repro is:

import pandas as pd
df = pd.DataFrame({"value": pd.array([], dtype=pd.Int64Dtype()})
df.to_parquet("/path")
df2 = pd.read_parquet("/path")
...
ValueError: need at least one array to concatenate

It seems that just the presence of the nullable dtype is enough to trigger the error.

Contributor

jreback commented Jun 3, 2021

hmm. cc @jorisvandenbossche

@jreback jreback added IO Parquet parquet, feather Regression Functionality that used to work in a prior pandas version and removed Needs Triage Issue that has not been reviewed by a pandas team member labels Jun 3, 2021
simonjayhawkins added a commit to simonjayhawkins/pandas that referenced this issue Jun 7, 2021
@simonjayhawkins
Member

> I've run into this as well on pandas 1.2.4 and pyarrow 3.0.0. a simple repro is:

This is fixed in commit [1fb626d] BUG: Handle zero-chunked pyarrow.ChunkedArray in StringArray (#41052).

@bob-zhao-work can you try on master?

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed Bug labels Aug 20, 2021
@nakatomotoi
Contributor

@simonjayhawkins
Hello, I would like to work on this issue. This will be my first contribution to OSS.
I understand that I can test this commit on the master branch. Is that correct?

@simonjayhawkins
Member

Thanks @nakatomotoi. pandas has a test suite that is run on CI when a PR is opened. This issue requires a test to be added to the test suite so that we can close the issue knowing that similar regressions are less likely in the future.

See https://github.com/pandas-dev/pandas/issues?q=is%3Aissue+is%3Aclosed+label%3A%22Needs+Tests%22 for issues like this that have been closed, and check out the associated PRs for inspiration.

The developer guide is https://pandas.pydata.org/pandas-docs/dev/development/index.html

@nakatomotoi
Contributor

@simonjayhawkins
Thank you very much for your kind reply.
Now I am going to try to add a test case.

@nakatomotoi
Contributor

take

6 participants