developmentseed/zarr-datafusion-search

zarr-datafusion-search

This is a prototype for querying metadata about Zarr arrays with DataFusion, an extensible query engine written in Rust.

Zarr Schema

The prototype assumes a Zarr store that contains several 1-dimensional arrays, each of the same length n:

  • Inside a Zarr group named "meta"
    • An array named "date" with n timestamps, stored as a numpy datetime64[ms] array

    • An array named "collection" with n string values, stored as a VariableLengthUTF8 array

    • An array named "bbox" with n string values, stored as a VariableLengthUTF8 array, where each string is a WKT-encoded Polygon (or MultiPolygon) with the bounding box of that Zarr record.

      In the future, we will likely use a binary encoding like WKB, but Zarr's binary dtype is not currently well-specified.

This data schema may change over time.
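As a rough illustration of the values each array is expected to hold, the sketch below builds the three parallel columns in plain Python, without writing a Zarr store. The helper names `bbox_to_wkt` and `to_epoch_ms` and the sample records are hypothetical and not part of this package. It relies on the fact that a numpy datetime64[ms] value is stored as an integer count of milliseconds since the Unix epoch.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def bbox_to_wkt(minx, miny, maxx, maxy):
    """Encode a bounding box as a WKT Polygon with a closed ring (hypothetical helper)."""
    ring = [(minx, miny), (maxx, miny), (maxx, maxy), (minx, maxy), (minx, miny)]
    coords = ", ".join(f"{x} {y}" for x, y in ring)
    return f"POLYGON (({coords}))"


def to_epoch_ms(dt):
    """Convert a timezone-aware datetime to integer milliseconds since the Unix epoch."""
    return (dt - EPOCH) // timedelta(milliseconds=1)


# Made-up sample records: (timestamp, collection name, bounding box).
records = [
    (datetime(2024, 1, 1, tzinfo=timezone.utc), "collection-a", (0, 0, 1, 1)),
    (datetime(2024, 1, 2, tzinfo=timezone.utc), "collection-b", (-10, -5, 10, 5)),
]

# One value per record in each column, so all three columns have length n.
date = [to_epoch_ms(ts) for ts, _, _ in records]
collection = [name for _, name, _ in records]
bbox = [bbox_to_wkt(*box) for _, _, box in records]
```

Each list corresponds to one array in the "meta" group. Writing them to an actual Zarr store with the dtypes listed above is left out here, since the exact calls depend on the Zarr library version in use.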

Python API

DataFusion distributes Python bindings via the datafusion PyPI package.

In addition, DataFusion-Python supports custom table providers. These let you define a custom data source as a standalone Rust package, compile it as its own Python package, and then load it into DataFusion-Python at runtime.

Note

The underlying DataFusion TableProvider ABI is not entirely stable, so for now you must use the same version of DataFusion-Python as the version of DataFusion used to compile the custom table provider.
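One way to guard against a version mismatch is to check the installed datafusion package before loading the provider. The following is a minimal sketch using only the standard library. The function names are hypothetical, the `built_against` value is something you would have to supply yourself (this package is not known to expose it), and comparing only the major version is an assumption about which granularity matters.

```python
from importlib.metadata import PackageNotFoundError, version


def major(v: str) -> int:
    """Return the leading (major) component of a version string such as '50.1.0'."""
    return int(v.split(".")[0])


def same_major(installed: str, built_against: str) -> bool:
    """True when both version strings share the same major version."""
    return major(installed) == major(built_against)


def check_datafusion(built_against: str) -> bool:
    """Compare the installed datafusion package with the version the provider was built against.

    Returns False when datafusion is not installed at all.
    """
    try:
        installed = version("datafusion")
    except PackageNotFoundError:
        return False
    return same_major(installed, built_against)
```

Calling `check_datafusion(...)` with the DataFusion version your build of the table provider was compiled against, before registering the table, turns a silent ABI mismatch into an early, explicit failure.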

from zarr_datafusion_search import ZarrTable
from datafusion import SessionContext

# Create a new DataFusion session context
ctx = SessionContext()

# Register a specific Zarr store as a table named "zarr_data"
ctx.register_table_provider("zarr_data", ZarrTable("zarr_store.zarr"))

# Now you can run SQL queries against the Zarr data
df = ctx.sql("SELECT * FROM zarr_data;")
df.show()

About

Internal exploration for zarr + DataFusion
