@jecsand838
Contributor

Which issue does this PR close?

Rationale for this change

Apache Avro’s single object encoding prefixes every record with the marker 0xC3 0x01 followed by a Rabin schema fingerprint so that readers can identify the correct writer schema without carrying the full definition in each message.
While the current arrow‑avro implementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.
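
To make the framing concrete, here is a minimal sketch of splitting a single-object message into its fingerprint and body. It assumes the layout given in the Avro specification (the two marker bytes, then the 8-byte fingerprint in little-endian order, then the Avro-encoded datum). `split_single_object` is a hypothetical helper for illustration only, and the constant's type here is an assumption rather than the definition added in this PR.

```rust
/// Two-byte marker that opens every single-object encoded message.
/// (The array type is an assumption made for this sketch.)
const SINGLE_OBJECT_MAGIC: [u8; 2] = [0xC3, 0x01];

/// Hypothetical helper: returns (fingerprint, body) if `msg` is framed
/// as a single-object message, or `None` otherwise.
fn split_single_object(msg: &[u8]) -> Option<(u64, &[u8])> {
    // Header = 2 marker bytes + 8 fingerprint bytes.
    if msg.len() < 10 || msg[..2] != SINGLE_OBJECT_MAGIC {
        return None;
    }
    // The fingerprint is stored little-endian.
    let mut fp_bytes = [0u8; 8];
    fp_bytes.copy_from_slice(&msg[2..10]);
    Some((u64::from_le_bytes(fp_bytes), &msg[10..]))
}
```

The returned body borrows from the input, so identifying the schema involves no copy of the payload.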

The Avro specification recommends using the 64‑bit CRC‑64‑AVRO (Rabin) fingerprint of a schema's Parsing Canonical Form to look up the schema in a local schema store or registry.
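
For reference, the fingerprint algorithm can be sketched as a direct Rust port of the table-driven routine published in the Avro specification. The function names `fp_table` and `fingerprint64` are chosen for this sketch and are not necessarily those used in `schema.rs`; the table is rebuilt on every call purely to keep the example short.

```rust
/// Fingerprint of the empty byte string; also used as the polynomial
/// when building the lookup table (per the Avro specification).
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table.
fn fp_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for i in 0..256 {
        let mut fp = i as u64;
        for _ in 0..8 {
            // XOR in the polynomial only when the low bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        table[i] = fp;
    }
    table
}

/// CRC-64-AVRO (Rabin) fingerprint of `buf`, which should hold the
/// UTF-8 bytes of a schema's Parsing Canonical Form.
fn fingerprint64(buf: &[u8]) -> u64 {
    let table = fp_table();
    let mut fp = EMPTY;
    for &b in buf {
        fp = (fp >> 8) ^ table[((fp ^ b as u64) & 0xff) as usize];
    }
    fp
}
```

A production implementation would normally compute the table once (for example in a `const` or a lazily initialised static) rather than per call.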

This PR introduces SchemaStore and fingerprinting to enable:

  • Zero‑copy schema identification for decoding streaming Avro messages published in single‑object format (e.g., via Kafka or Pulsar) into Arrow.
  • Dynamic schema evolution, by laying the foundation for resolving writer/reader schema differences on the fly.

NOTE: Integration with Decoder and Reader is coming in the next PR.

What changes are included in this PR?

| Area | Highlights |
| --- | --- |
| schema.rs | New Fingerprint, SchemaStore, and SINGLE_OBJECT_MAGIC; canonical‑form generator; Rabin fingerprint calculator; compare_schemas helper. |
| lib.rs | mod schema is now pub |
| Unit tests | New tests covering fingerprint generation, store registration/lookup, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| Docs & Examples | Extensive inline docs with examples on all new public methods / structs. |

Are these changes tested?

Yes. New tests cover:

  1. Fingerprinting against the canonical examples from the Avro spec
  2. SchemaStore behavior: deduplication, duplicate registration, and lookup.
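
The store behavior exercised by these tests can be illustrated with a toy, self-contained model. `ToySchemaStore`, its methods, and the bitwise `rabin` helper are all hypothetical names for this sketch; the real `SchemaStore` works with the crate's own schema and fingerprint types, and its signatures may differ.

```rust
use std::collections::HashMap;

/// Bitwise CRC-64-AVRO (Rabin) fingerprint; a compact variant of the
/// table-driven routine in the Avro specification.
fn rabin(bytes: &[u8]) -> u64 {
    const EMPTY: u64 = 0xc15d_213a_a4d7_a795;
    let mut fp = EMPTY;
    for &b in bytes {
        fp ^= b as u64;
        for _ in 0..8 {
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
    }
    fp
}

/// Hypothetical stand-in for a schema store: fingerprint -> canonical form.
#[derive(Default)]
struct ToySchemaStore {
    schemas: HashMap<u64, String>,
}

impl ToySchemaStore {
    /// Registers a canonical-form schema and returns its fingerprint.
    /// Registering the same schema again is a no-op (deduplication).
    fn register(&mut self, canonical: &str) -> Result<u64, String> {
        let fp = rabin(canonical.as_bytes());
        if let Some(existing) = self.schemas.get(&fp) {
            return if existing.as_str() == canonical {
                Ok(fp)
            } else {
                Err(format!("fingerprint collision for {:#x}", fp))
            };
        }
        self.schemas.insert(fp, canonical.to_owned());
        Ok(fp)
    }

    /// Looks up a schema by fingerprint; `None` means unknown.
    fn lookup(&self, fp: u64) -> Option<&str> {
        self.schemas.get(&fp).map(String::as_str)
    }

    /// Number of distinct schemas held.
    fn len(&self) -> usize {
        self.schemas.len()
    }
}
```

Registering `"long"` twice yields the same fingerprint and a store of size one, while looking up a fingerprint that was never registered returns `None`.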

Are there any user-facing changes?

N/A

@github-actions github-actions bot added the arrow (Changes to the arrow crate) label Aug 4, 2025
@jecsand838
Contributor Author

@scovich @alamb Here's that first PR for the SchemaStore work.

Contributor

@scovich scovich left a comment

LGTM.

Aside: I get the appeal of zero-copy schemas, but I'm pretty sure this schema store will be very difficult to use in practice unless all possible schemas are known up front. Adding a new schema to the store partway through decoding will be ~impossible. But that's a problem with the existing schema API, not this new schema store.

let field_type =
build_canonical(&f.r#type, child_ns.as_deref().or(enclosing_ns))?;
Ok(format!(
r#"{{"name":{},"type":{}}}"#,
Contributor

What's the difference between this and the json! macro (since we anyway have a dependency on serde_json crate)? I guess the macro uses too much whitespace that avro canonical schema forbids?

Contributor Author

I believe the whitespace is handled automatically by Serde, but the canonical form also mandates attribute order, absence of extraneous keys, and deterministic byte output. As I understand it, json! produces a serde_json::Value whose serialization order depends on map implementation and cargo features, and always allocates owned Strings.

Building the fragment with format! avoids those pitfalls.
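
As a rough illustration of that approach (not the code in this PR), the sketch below emits canonical-form fragments with a fixed key order and no whitespace using only `format!`. The helper names are hypothetical, and the sketch assumes names are valid Avro names, so no JSON string escaping is needed.

```rust
/// Renders one record field with keys in canonical order: "name", then "type".
/// `ty` must already be canonical JSON text (e.g. `"long"`, quotes included).
fn canonical_field(name: &str, ty: &str) -> String {
    format!(r#"{{"name":"{}","type":{}}}"#, name, ty)
}

/// Renders a record with keys in canonical order: "name", "type", "fields".
/// `fullname` is the namespace-qualified record name.
fn canonical_record(fullname: &str, fields: &[(&str, &str)]) -> String {
    let body: Vec<String> = fields
        .iter()
        .map(|(name, ty)| canonical_field(name, ty))
        .collect();
    format!(
        r#"{{"name":"{}","type":"record","fields":[{}]}}"#,
        fullname,
        body.join(",")
    )
}
```

Because the key order is fixed in the format string, the output bytes are the same no matter which map type or feature flags a JSON library is built with.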

Contributor

Makes sense, yup!

… `SchemaStore` to use `AvroSchema`, and adjust related tests and logic.
Contributor

@alamb alamb left a comment

Thank you @jecsand838 and @scovich

(I just quickly skimmed this PR, and am mostly relying on @scovich 's review)

Looks well tested and well commented to me

@jecsand838
Contributor Author

@scovich

> LGTM.
>
> Aside: I get the appeal of zero-copy schemas, but I'm pretty sure this schema store will be very difficult to use in practice unless all possible schemas are known up front. Adding a new schema to the store partway through decoding will be ~impossible. But that's a problem with the existing schema API, not this new schema store.

Avro pretty much requires you to know all possible schemas upfront. The one inconvenience I can foresee is related to developing a SchemaStore trait which can stay up to date with an external registry. However, in a real-world scenario, the current reader_schema would likely become another writer_schema and a new reader_schema would be assigned, so this would involve more complications than just the lifetimes.

I think for this initial implementation it's acceptable to make the caller responsible for creating a new Decoder upon schema change. Just my 2 cents though, of course.

@alamb alamb merged commit 5dd3463 into apache:main Aug 5, 2025
24 checks passed
@alamb
Contributor

alamb commented Aug 5, 2025

Let's keep the code flowing

@jecsand838 jecsand838 deleted the avro-schema-store branch August 7, 2025 02:23
