v2.metadata and v3.metadata encode fill_value bytes differently  #2322

@rabernat

Here I am creating an array in each format and specifying the fill_value as raw bytes, b'X':

import zarr

fv = b'X'

a = zarr.create(shape=10, dtype=bytes, zarr_version=2, fill_value=fv)
ad = a.metadata.to_dict()
print(ad)
# -> {'shape': (10,), 'fill_value': 'WA==', 'attributes': {}, 'zarr_format': 2, 'order': 'C', 'filters': None, 'dimension_separator': '.', 'compressor': None, 'chunks': (10,), 'dtype': '|S0'}


b = zarr.create(shape=10, dtype=bytes, zarr_version=3, fill_value=fv)
bd = b.metadata.to_dict()
print(bd)
# -> {'shape': (10,), 'fill_value': (88,), 'chunk_grid': {'name': 'regular', 'configuration': {'chunk_shape': (10,)}}, 'attributes': {}, 'zarr_format': 3, 'data_type': <DataType.bytes: 'bytes'>, 'chunk_key_encoding': {'name': 'default', 'configuration': {'separator': '/'}}, 'codecs': ({'name': 'vlen-bytes', 'configuration': {}},), 'node_type': 'array', 'storage_transformers': ()}

assert zarr.core.metadata.v2.ArrayV2Metadata.from_dict(ad).fill_value == fv
assert zarr.core.metadata.v3.ArrayV3Metadata.from_dict(bd).fill_value == fv

As we can see, the fill value is encoded quite differently in the two formats: a base64 string ('WA==') for v2 and a tuple of ints ((88,)) for v3. Remarkably, it gets translated back to the original bytes in both cases.

In both cases, the bytes are going through this path:

elif isinstance(value, Sequence):
    out_dict[key] = tuple(v.to_dict() if isinstance(v, Metadata) else v for v in value)

This converts the bytes to a tuple of ints.
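
To illustrate why that branch turns bytes into ints, here is a minimal standalone sketch using only the standard library (it does not call into zarr): bytes objects satisfy the Sequence check, and iterating over them yields integers.

```python
from collections.abc import Sequence

fv = b"X"

# bytes is registered as a Sequence, so it matches the isinstance check
assert isinstance(fv, Sequence)

# Iterating over bytes yields ints, so tuple() produces a tuple of ints
as_tuple = tuple(fv)
print(as_tuple)  # (88,)

# The original bytes can be recovered from the tuple of ints
assert bytes(as_tuple) == fv
```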

However, for v2, #2286 added this additional special handling for fill_value:

if dtype.kind in "SV":
    fill_value_encoded = _data.get("fill_value")
    if fill_value_encoded is not None:
        fill_value = base64.standard_b64decode(fill_value_encoded)
        _data["fill_value"] = fill_value
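
For reference, the base64 round trip that matches the 'WA==' value in the v2 output above can be reproduced with the standard library alone. This is only a sketch of the encoding itself, not of zarr's internal code path:

```python
import base64

fv = b"X"

# Encode the raw bytes as a base64 string, as seen in the v2 metadata dict
encoded = base64.standard_b64encode(fv).decode("ascii")
print(encoded)  # WA==

# Decoding recovers the original bytes, as the v2 from_dict handling does
decoded = base64.standard_b64decode(encoded)
assert decoded == fv
```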

According to the V3 spec:

Raw data types (r)
An array of integers, with length equal to <N>, where each integer is in the range [0, 255].

The v3 behavior (a tuple of ints, one per byte) seems in line with this.
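
To make the difference concrete, here is a small sketch that normalizes either serialized form back to bytes. The helper name `fill_value_to_bytes` is hypothetical and is not part of zarr; it only mirrors the two encodings shown in the outputs above.

```python
import base64


def fill_value_to_bytes(encoded, zarr_format):
    """Hypothetical helper: turn a serialized bytes fill value back into bytes."""
    if zarr_format == 2:
        # v2 metadata stores the fill value as a base64 string
        return base64.standard_b64decode(encoded)
    if zarr_format == 3:
        # v3 metadata stores a sequence of ints, each in the range [0, 255]
        if any(not 0 <= i <= 255 for i in encoded):
            raise ValueError("each integer must be in the range [0, 255]")
        return bytes(encoded)
    raise ValueError(f"unsupported zarr_format: {zarr_format}")


v2_result = fill_value_to_bytes("WA==", 2)
v3_result = fill_value_to_bytes((88,), 3)
assert v2_result == v3_result == b"X"
```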

This is relevant to pydata/xarray#5475
