Commit 5dd3463
Add arrow-avro SchemaStore and fingerprinting (#8039)
# Which issue does this PR close?
- Part of #4886
- Pre-work for #8006
# Rationale for this change
Apache Avro’s [single object
encoding](https://avro.apache.org/docs/1.11.1/specification/#single-object-encoding)
prefixes every record with the two-byte marker `0xC3 0x01` followed by
the 8-byte little-endian Rabin [schema
fingerprint](https://avro.apache.org/docs/1.11.1/specification/#schema-fingerprints)
so that readers can identify the correct writer schema without carrying
the full definition in each message.
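
As a rough illustration of that framing, the sketch below splits a single-object message into its fingerprint and Avro payload. The names `MAGIC` and `split_single_object` are hypothetical and are not the API added by this PR; the layout (two marker bytes, then an 8-byte little-endian fingerprint, then the binary-encoded record) follows the Avro specification.

```rust
/// Marker bytes that open every single-object-encoded Avro message.
/// (Hypothetical name; this PR exposes its own `SINGLE_OBJECT_MAGIC`.)
const MAGIC: [u8; 2] = [0xC3, 0x01];

/// Header length: 2 marker bytes + 8 fingerprint bytes.
const HEADER_LEN: usize = 10;

/// Splits a framed message into `(fingerprint, payload)`.
/// Returns `None` if the message is too short or the marker is missing.
fn split_single_object(message: &[u8]) -> Option<(u64, &[u8])> {
    if message.len() < HEADER_LEN || message[..2] != MAGIC {
        return None;
    }
    // The fingerprint is stored little-endian right after the marker.
    let mut fp_bytes = [0u8; 8];
    fp_bytes.copy_from_slice(&message[2..HEADER_LEN]);
    // The payload is borrowed from the input, so nothing is copied.
    Some((u64::from_le_bytes(fp_bytes), &message[HEADER_LEN..]))
}
```

Because the payload is returned as a borrowed slice, identifying the schema does not require copying the record bytes.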
While the current `arrow-avro` implementation can read container files,
it cannot ingest these framed messages or handle streams where the
writer schema changes over time.
The Avro specification recommends computing a 64-bit CRC-64-AVRO (Rabin)
fingerprint of the [parsed canonical form of a
schema](https://avro.apache.org/docs/1.11.1/specification/#parsing-canonical-form-for-schemas)
and using it to look up the `Schema` in a local schema store or registry.
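
For reference, here is a minimal Rust port of the CRC-64-AVRO routine that the Avro specification gives in Java. The function names are illustrative and are not necessarily those used in `schema.rs`. The input is assumed to be the UTF-8 bytes of a schema's parsing canonical form, which this sketch does not compute.

```rust
/// Seed value from the Avro specification; it is also the fingerprint
/// of the empty byte string.
const EMPTY: u64 = 0xc15d_213a_a4d7_a795;

/// Builds the 256-entry lookup table described in the specification.
fn fingerprint_table() -> [u64; 256] {
    let mut table = [0u64; 256];
    for (i, slot) in table.iter_mut().enumerate() {
        let mut fp = i as u64;
        for _ in 0..8 {
            // XOR in EMPTY only when the lowest bit is set.
            fp = (fp >> 1) ^ (EMPTY & (fp & 1).wrapping_neg());
        }
        *slot = fp;
    }
    table
}

/// Computes the 64-bit Rabin fingerprint of `data`.
/// The table is rebuilt on every call to keep the sketch short;
/// real code would cache it.
fn rabin_fingerprint(data: &[u8]) -> u64 {
    let table = fingerprint_table();
    let mut fp = EMPTY;
    for &byte in data {
        let index = ((fp ^ u64::from(byte)) & 0xff) as usize;
        fp = (fp >> 8) ^ table[index];
    }
    fp
}
```

Calling `rabin_fingerprint(canonical_form.as_bytes())` yields the value that single-object encoding stores, little-endian, after the marker bytes.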
This PR introduces **`SchemaStore`** and **fingerprinting** to enable:
* **Zero‑copy schema identification** for decoding streaming Avro
messages published in single‑object format (e.g. Kafka, Pulsar, etc.)
into Arrow.
* **Dynamic schema evolution** by laying the foundation to resolve
writer/reader schema differences on the fly.
**NOTE:** Integration with `Decoder` and `Reader` will come in the next PR.
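
To make the idea of a fingerprint-keyed store concrete, here is a toy sketch. `ToySchemaStore` and its methods are hypothetical and deliberately simpler than the `SchemaStore` added in this PR: it assumes the caller has already computed the 64-bit fingerprint and keeps schemas as plain JSON strings.

```rust
use std::collections::HashMap;

/// Hypothetical, simplified store mapping fingerprints to schema JSON.
#[derive(Debug, Default)]
struct ToySchemaStore {
    schemas: HashMap<u64, String>,
}

impl ToySchemaStore {
    /// Registers `schema_json` under `fingerprint`.
    /// Returns `Ok(true)` if newly inserted, `Ok(false)` if the same
    /// schema was already registered, and `Err` if a different schema
    /// already occupies that fingerprint.
    fn register(&mut self, fingerprint: u64, schema_json: &str) -> Result<bool, String> {
        match self.schemas.get(&fingerprint) {
            Some(existing) if existing.as_str() == schema_json => Ok(false),
            Some(_) => Err(format!("fingerprint collision for {fingerprint:#018x}")),
            None => {
                self.schemas.insert(fingerprint, schema_json.to_owned());
                Ok(true)
            }
        }
    }

    /// Looks up the writer schema for a fingerprint read from a message.
    fn lookup(&self, fingerprint: u64) -> Result<&str, String> {
        self.schemas
            .get(&fingerprint)
            .map(String::as_str)
            .ok_or_else(|| format!("unknown fingerprint {fingerprint:#018x}"))
    }
}
```

A decoder built on such a store would read the fingerprint from each message header, call `lookup`, and report an error when the fingerprint is unknown.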
# What changes are included in this PR?
| Area | Highlights |
| --- | --- |
| **`schema.rs`** | *New* `Fingerprint`, `SchemaStore`, and `SINGLE_OBJECT_MAGIC`; canonical‑form generator; Rabin fingerprint calculator; `compare_schemas` helper. |
| **`lib.rs`** | `mod schema` is now `pub`. |
| **Unit tests** | New tests covering fingerprint generation, store registration/lookup, unknown‑fingerprint errors, and interaction with UTF8‑view decoding. |
| **Docs & Examples** | Extensive inline docs with examples on all new public methods / structs. |
# Are these changes tested?
Yes. New tests cover:
1. **Fingerprinting** against the canonical examples from the Avro spec.
2. **`SchemaStore` behavior**: deduplication, duplicate registration, and
lookup.
# Are there any user-facing changes?
N/A

SchemaStore and fingerprinting (#8039)
1 parent a3d144f commit 5dd3463
3 files changed, +564 -5 lines changed