This is a prototype for querying metadata about Zarr arrays with DataFusion, an extensible query engine written in Rust.
In particular, we assume there is a Zarr store with multiple 1-dimensional arrays:
- Inside a Zarr group named "meta":
  - An array named "date" with n timestamps, stored as a numpy datetime64[ms] array.
  - An array named "collection" with n string values, stored as a VariableLengthUTF8 array.
  - An array named "bbox" with n string values, stored as a VariableLengthUTF8 array, where each string is a WKT-encoded Polygon (or MultiPolygon) with the bounding box of that Zarr record. In the future, we will likely use a binary encoding like WKB, but Zarr's binary dtype is not currently well-specified.
This data schema may change over time.
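To make the layout above more concrete, here is a minimal sketch of how the values for a single record could be prepared before they are written to the three arrays. It uses only the Python standard library and does not touch Zarr itself. The helper names `bbox_to_wkt` and `to_epoch_ms`, the example collection name, and the coordinates are all hypothetical and are not part of this package.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bbox_to_wkt(min_x, min_y, max_x, max_y):
    """Encode an axis-aligned bounding box as a WKT Polygon with a closed ring."""
    corners = [
        (min_x, min_y),
        (max_x, min_y),
        (max_x, max_y),
        (min_x, max_y),
        (min_x, min_y),  # WKT rings repeat the first point to close the polygon
    ]
    ring = ", ".join(f"{x} {y}" for x, y in corners)
    return f"POLYGON (({ring}))"


def to_epoch_ms(moment):
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch."""
    return (moment - EPOCH) // timedelta(milliseconds=1)


# One illustrative record; every value here is made up for the example.
record = {
    "date": to_epoch_ms(datetime(2024, 1, 1, tzinfo=timezone.utc)),
    "collection": "example-collection",
    "bbox": bbox_to_wkt(0, 0, 1, 1),
}

print(record["bbox"])  # POLYGON ((0 0, 1 0, 1 1, 0 1, 0 0))
```

Integer milliseconds since the epoch line up with the millisecond resolution of datetime64[ms], and the WKT string is what would be stored as one element of the "bbox" array.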
DataFusion distributes Python bindings via the datafusion PyPI package.
In addition, DataFusion-Python supports custom table providers. These let you define a custom data source as a standalone Rust package, compile it as its own Python package, and then load it into DataFusion-Python at runtime.
Note
The underlying DataFusion TableProvider ABI is not entirely stable, so for now you must use the same version of DataFusion-Python as the version of DataFusion used to compile the custom table provider.
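One way to guard against a version mismatch is to check the installed datafusion package before loading the provider. The sketch below is only an illustration: `BUILT_AGAINST` is a hypothetical placeholder for whatever DataFusion version the provider was compiled with, and comparing only the major version component is an assumption about how the two projects are versioned, not a documented guarantee.

```python
import re
from importlib.metadata import PackageNotFoundError, version


def major_version(version_string):
    """Return the leading numeric (major) component of a version string."""
    match = re.match(r"\d+", version_string)
    if match is None:
        raise ValueError(f"no leading version number in {version_string!r}")
    return int(match.group())


def versions_match(installed, built_against):
    """Assumption: two releases are compatible when their major versions agree."""
    return major_version(installed) == major_version(built_against)


# Hypothetical placeholder: replace with the DataFusion version the
# table provider was actually compiled against.
BUILT_AGAINST = "43.0.0"

try:
    installed = version("datafusion")
except PackageNotFoundError:
    installed = None  # datafusion is not installed in this environment

if installed is not None and not versions_match(installed, BUILT_AGAINST):
    print(
        f"Warning: datafusion {installed} is installed, but the table provider "
        f"was built against DataFusion {BUILT_AGAINST}."
    )
```

A check like this only reports a likely mismatch; it cannot confirm that two versions are ABI-compatible.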
from zarr_datafusion_search import ZarrTable
from datafusion import SessionContext
# Create a new DataFusion session context
ctx = SessionContext()
# Register a specific Zarr store as a table named "zarr_data"
ctx.register_table_provider("zarr_data", ZarrTable("zarr_store.zarr"))
# Now you can run SQL queries against the Zarr data
df = ctx.sql("SELECT * FROM zarr_data;")
df.show()