- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1k
          Add arrow-avro SchemaStore and fingerprinting
          #8039
        
          New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Aside: I get the appeal of zero-copy schemas, but I'm pretty sure this schema store will be very difficult to use in practice unless all possible schemas are known up front. Adding a new schema to the store partway through decoding will be ~impossible. But that's a problem with the existing schema API, not this new schema store.
| let field_type = | ||
| build_canonical(&f.r#type, child_ns.as_deref().or(enclosing_ns))?; | ||
| Ok(format!( | ||
| r#"{{"name":{},"type":{}}}"#, | 
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What's the difference between this and the json! macro (since we anyway have a dependency on serde_json crate)? I guess the macro uses too much whitespace that avro canonical schema forbids?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe the whitespace is handled automatically by Serde, but the canonical form also mandates attribute order, absence of extraneous keys, and deterministic byte output. As I understand it, json! produces a serde_json::Value whose serialization order depends on map implementation and cargo features, and always allocates owned Strings.
Building the fragment with format! avoids those pitfalls.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Makes sense, yup!
… `SchemaStore` to use `AvroSchema`, and adjust related tests and logic.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @jecsand838 and @scovich
(I just quickly skimmed this PR, and am mostly relying on @scovich 's review)
Looks well tested and well commented to me
| 
 Avro pretty much requires you to know all possible schemas upfront. The one inconvenience I can foresee is related to developing a  I think for this initial implementation it's acceptable to have the caller responsible for making a new  | 
| Let's keep the code flowing | 
Which issue does this PR close?
Part of Add Avro Support #4886
Pre-work for Implement arrow-avro SchemaStore and Fingerprinting To Enable Schema Resolution #8006
Rationale for this change
Apache Avro’s single object encoding prefixes every record with the marker
0xC3 0x01followed by aRabinschema fingerprint so that readers can identify the correct writer schema without carrying the full definition in each message.While the current
arrow‑avroimplementation can read container files, it cannot ingest these framed messages or handle streams where the writer schema changes over time.The Avro specification recommends computing a 64‑bit CRC‑64‑AVRO (Rabin) hashed fingerprint of the parsed canonical form of a schema to look up the
Schemafrom a local schema store or registry.This PR introduces
SchemaStoreand fingerprinting to enable:NOTE: Integration with
DecoderandReadercoming in next PR.What changes are included in this PR?
schema.rsFingerprint,SchemaStore, andSINGLE_OBJECT_MAGIC; canonical‑form generator; Rabin fingerprint calculator;compare_schemashelper.lib.rsmod schemais nowpubAre these changes tested?
Yes. New tests cover:
SchemaStorebehavior deduplication, duplicate registration, and lookup.Are there any user-facing changes?
N/A