[Python][Parquet] pa.schema() silently drops metadata if a schema object is passed #38575

@igozali

Describe the bug, including details regarding any error messages, version, and platform.

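This looks like a regression between pyarrow 13.0.0 and 14.0.0: metadata passed to pa.schema() together with an existing schema object is silently dropped.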

Here's a simple repro script:

import pyarrow as pa
import pyarrow.parquet as pq

print(pa.__version__)

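# Attach metadata by passing an existing Schema object to pa.schema()
metadata = {b"foo": b"bar"}
schema = pa.schema([pa.field("foo", pa.int32())])
wrapped_schema = pa.schema(schema, metadata=metadata)

# Write an empty Parquet file using the wrapped schema
w = pq.ParquetWriter(
    "foo.parquet",
    wrapped_schema,
    flavor="spark",
    compression="snappy",
)
w.close()

# Read the schema back and compare its metadata with the original
s = pq.read_schema("foo.parquet")
print(f"{s.metadata=} {metadata=}")
print(f"{s.metadata == metadata=}")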

Running the script above on pyarrow 13 and 14 gives these outputs:

(arrow13)
[11:43:42 | last: 43s] (  0) | ~
igozali@host $ python test.py
13.0.0
s.metadata={b'foo': b'bar'} metadata={b'foo': b'bar'}
s.metadata == metadata=True

(arrow13)
[11:43:44 | last: 0s] (  0) | ~
igozali@host $ conda activate arrow14

(arrow14)
[11:43:49 | last: 0s] (  0) | ~
igozali@host $ python test.py
14.0.0
s.metadata=None metadata={b'foo': b'bar'}
s.metadata == metadata=False

A few workarounds:

  1. Use schema.with_metadata() instead of pa.schema().
  2. Instead of pa.schema(old_schema, metadata=...), pass a list of fields, e.g. pa.schema(list(old_schema), metadata=...). This seems to fix it too.

Component(s)

Parquet, Python
