Add arrow-avro `SchemaStore` and fingerprinting #8039

jecsand838 · 2025-08-04T23:16:38Z

Which issue does this PR close?

- Part of Add Avro Support #4886
- Pre-work for Implement arrow-avro SchemaStore and Fingerprinting To Enable Schema Resolution #8006

Rationale for this change

Apache Avro’s single object encoding prefixes every record with the marker 0xC3 0x01 followed by a Rabin schema fingerprint so that readers can identify the correct writer schema without carrying the full definition in each message.
While the current arrow‑avro implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.
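
To make the framing concrete, here is a minimal sketch of splitting a single-object message into its fingerprint and body. It assumes the layout from the Avro specification (2-byte marker, then an 8-byte little-endian fingerprint); the function name `split_single_object` is hypothetical and is not the API added by this PR.

```rust
/// Two-byte marker that starts every single-object encoded message.
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Hypothetical helper: returns `(fingerprint, avro_body)` when `msg`
/// starts with a well-formed single-object header, `None` otherwise.
fn split_single_object(msg: &[u8]) -> Option<(u64, &[u8])> {
    // Header = 2 marker bytes + 8 fingerprint bytes.
    if msg.len() < 10 || !msg.starts_with(&SINGLE_OBJECT_MAGIC) {
        return None;
    }
    let mut fp_bytes = [0u8; 8];
    fp_bytes.copy_from_slice(&msg[2..10]);
    // The fingerprint is stored little-endian.
    Some((u64::from_le_bytes(fp_bytes), &msg[10..]))
}
```

The returned body slice borrows from the input, so identifying the writer schema does not copy the payload.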

The Avro specification recommends using the 64‑bit CRC‑64‑AVRO (Rabin) fingerprint of a schema's Parsing Canonical Form as the key for looking up that schema in a local schema store or registry.
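
For reference, the fingerprint algorithm can be sketched as below. This is a direct port of the Java reference routine printed in the Avro specification, not the code from this PR; the names `EMPTY`, `fp_table`, and `rabin_fingerprint` are chosen here for illustration.

```rust
/// Seed value from the Avro spec; also the fingerprint of empty input.
const EMPTY: u64 = 0xc15d213aa4d7a795;

/// Builds the 256-entry lookup table used by the byte-wise update.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for i in 0..256 {
        let mut fp = i as u64;
        for _ in 0..8 {
            // XOR in EMPTY only when the low bit is set
            // (the mask is all ones or all zeros).
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        table[i] = fp;
    }
    table
}

/// CRC-64-AVRO over `data`, normally the UTF-8 bytes of a schema's
/// Parsing Canonical Form.
fn rabin_fingerprint(data: &[u8]) -> u64 {
    // Rebuilt on every call for brevity; real code would cache the table.
    let table = fp_table();
    let mut fp = EMPTY;
    for &b in data {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

Because the input is the canonical form, two schema documents that differ only in whitespace, attribute order, or non-essential attributes map to the same fingerprint.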

This PR introduces SchemaStore and fingerprinting to enable:

- Zero‑copy schema identification for decoding streaming Avro messages published in single‑object format (e.g. Kafka, Pulsar, etc.) into Arrow.
- Dynamic schema evolution, by laying the foundation to resolve writer/reader schema differences on the fly.

NOTE: Integration with `Decoder` and `Reader` is coming in the next PR.

What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| `schema.rs` | New `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical‑form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
| `lib.rs` | `mod schema` is now `pub` |
| Unit tests | New tests covering fingerprint generation, store registration/lookup, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| Docs & Examples | Extensive inline docs with examples on all new public methods / structs. |

Are these changes tested?

Yes. New tests cover:

- Fingerprinting against the canonical examples from the Avro spec.
- `SchemaStore` behavior: deduplication, duplicate registration, and lookup.
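
The store behavior listed above (deduplication on repeated registration, lookup, and an error for unknown fingerprints) can be illustrated with a toy fingerprint-keyed map. Everything in this sketch, including `ToySchemaStore`, `StoreError`, and the method signatures, is hypothetical and does not reproduce the `SchemaStore` API from this PR.

```rust
use std::collections::HashMap;

/// Hypothetical error type for the toy store.
#[derive(Debug, PartialEq)]
enum StoreError {
    UnknownFingerprint(u64),
}

/// Toy store mapping a 64-bit fingerprint to a canonical schema string.
#[derive(Default)]
struct ToySchemaStore {
    schemas: HashMap<u64, String>,
}

impl ToySchemaStore {
    /// Registers a schema under its fingerprint. Returns `false` and leaves
    /// the existing entry untouched when the fingerprint is already known.
    fn register(&mut self, fingerprint: u64, canonical: &str) -> bool {
        if self.schemas.contains_key(&fingerprint) {
            return false;
        }
        self.schemas.insert(fingerprint, canonical.to_string());
        true
    }

    /// Looks up the schema for a fingerprint read from a message header.
    fn lookup(&self, fingerprint: u64) -> Result<&str, StoreError> {
        self.schemas
            .get(&fingerprint)
            .map(String::as_str)
            .ok_or(StoreError::UnknownFingerprint(fingerprint))
    }
}
```

In this sketch the caller supplies the fingerprint; a real store would more likely compute it from the schema at registration time.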

Are there any user-facing changes?

N/A

jecsand838 · 2025-08-04T23:17:59Z

@scovich @alamb Here's that first PR for the SchemaStore work.

scovich

LGTM.

Aside: I get the appeal of zero-copy schemas, but I'm pretty sure this schema store will be very difficult to use in practice unless all possible schemas are known up front. Adding a new schema to the store partway through decoding will be ~impossible. But that's a problem with the existing schema API, not this new schema store.

scovich · 2025-08-05T13:04:52Z

arrow-avro/src/schema.rs

+                        let field_type =
+                            build_canonical(&f.r#type, child_ns.as_deref().or(enclosing_ns))?;
+                        Ok(format!(
+                            r#"{{"name":{},"type":{}}}"#,


What's the difference between this and the json! macro (since we anyway have a dependency on serde_json crate)? I guess the macro uses too much whitespace that avro canonical schema forbids?

I believe the whitespace is handled automatically by Serde, but the canonical form also mandates attribute order, absence of extraneous keys, and deterministic byte output. As I understand it, json! produces a serde_json::Value whose serialization order depends on map implementation and cargo features, and always allocates owned Strings.

Building the fragment with format! avoids those pitfalls.

Makes sense, yup!
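
To illustrate the point of this thread, here is a minimal std-only sketch of building canonical-form fragments with `format!`, so that attribute order and byte output are fixed by the format string. The helpers `json_quote`, `canonical_field`, and `canonical_record` are hypothetical and much simpler than the PR's `build_canonical`; they only assume the attribute order required by the Avro spec (`name`, `type`, `fields`, ...).

```rust
/// Hypothetical helper: quotes `s` as a JSON string with minimal escaping.
fn json_quote(s: &str) -> String {
    let mut out = String::with_capacity(s.len() + 2);
    out.push('"');
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out.push('"');
    out
}

/// Emits one record field; `canonical_type` must already be canonical JSON.
fn canonical_field(name: &str, canonical_type: &str) -> String {
    format!(r#"{{"name":{},"type":{}}}"#, json_quote(name), canonical_type)
}

/// Emits a record with attributes in the fixed order name, type, fields.
fn canonical_record(fullname: &str, fields: &[(&str, &str)]) -> String {
    let fields: Vec<String> = fields
        .iter()
        .map(|(name, ty)| canonical_field(name, ty))
        .collect();
    format!(
        r#"{{"name":{},"type":"record","fields":[{}]}}"#,
        json_quote(fullname),
        fields.join(",")
    )
}
```

Since the output never passes through a map type, the bytes fed to the fingerprint do not depend on any map implementation or crate feature.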


alamb

Thank you @jecsand838 and @scovich

(I just quickly skimmed this PR, and am mostly relying on @scovich 's review)

Looks well tested and well commented to me

jecsand838 · 2025-08-05T17:14:18Z

@scovich

> LGTM.
>
> Aside: I get the appeal of zero-copy schemas, but I'm pretty sure this schema store will be very difficult to use in practice unless all possible schemas are known up front. Adding a new schema to the store partway through decoding will be ~impossible. But that's a problem with the existing schema API, not this new schema store.

Avro pretty much requires you to know all possible schemas upfront. The one inconvenience I can foresee is related to developing a SchemaStore trait which can stay up to date with an external registry. However in a real world scenario what would likely occur is the current reader_schema becoming another writer_schema and a new reader_schema being assigned. So this would have more complications than just the lifetimes.

I think for this initial implementation it's acceptable to have the caller responsible for making a new Decoder upon schema change. Just my 2 cents though of course.
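
As a rough sketch of what "a new Decoder upon schema change" could look like on the caller's side, the loop below tracks the writer fingerprint of the active decoder and replaces the decoder whenever an incoming message carries a different one. `ToyDecoder` and `decode_all` are hypothetical stand-ins; the real `Decoder` integration is left to the follow-up PR.

```rust
/// Hypothetical stand-in for a decoder bound to a single writer schema.
struct ToyDecoder {
    writer_fingerprint: u64,
    decoded: usize,
}

impl ToyDecoder {
    fn new(writer_fingerprint: u64) -> Self {
        ToyDecoder { writer_fingerprint, decoded: 0 }
    }

    /// Pretends to decode one message body.
    fn decode(&mut self, _body: &[u8]) {
        self.decoded += 1;
    }
}

/// Decodes `(fingerprint, body)` pairs in order and returns how many
/// decoders the caller had to create along the way.
fn decode_all(messages: &[(u64, &[u8])]) -> usize {
    let mut current: Option<ToyDecoder> = None;
    let mut created = 0;
    for (fp, body) in messages {
        let needs_new = match &current {
            Some(decoder) => decoder.writer_fingerprint != *fp,
            None => true,
        };
        if needs_new {
            // Schema changed (or first message): start a fresh decoder.
            current = Some(ToyDecoder::new(*fp));
            created += 1;
        }
        if let Some(decoder) = current.as_mut() {
            decoder.decode(body);
        }
    }
    created
}
```

A stream that alternates between two writer schemas would rebuild the decoder on every switch, which is the cost this simple approach accepts.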

alamb · 2025-08-05T19:27:58Z

Let's keep the code flowing

Add arrow-avro schema store and fingerprinting

06914a1

github-actions bot added the arrow Changes to the arrow crate label Aug 4, 2025

scovich approved these changes Aug 5, 2025


Refactor arrow-avro schema handling: Add AvroSchema wrapper, modify…

b7ba42e

… `SchemaStore` to use `AvroSchema`, and adjust related tests and logic.

alamb approved these changes Aug 5, 2025


alamb merged commit 5dd3463 into apache:main Aug 5, 2025
24 checks passed

jecsand838 deleted the avro-schema-store branch August 7, 2025 02:23
