Added arrow-avro schema resolution foundations and type promotion #8047
Conversation
Thanks @jecsand838
Note to other reviewers: I found this PR easier to understand with a whitespace-blind diff: https://github.com/apache/arrow-rs/pull/8047/files?w=1 (specifically, it makes the refactoring into `Maker` easier to follow).
> Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.
Is this the only new test? I didn't see any others for the promotion rules added in this PR.
@alamb That's a good callout. I was trying to be mindful of this PR's size, and most of the tests I wrote are e2e tests. I probably should have included more unit tests in this PR though.
I went ahead and pushed up another 9 unit tests in `codec.rs` covering all promotion resolution paths. I also wired the schema resolution path up to `arrow-avro/src/reader/mod.rs` and updated the `schema_store` and `handle_prefix` tests to use valid promotion paths, to give us extra coverage.
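For background, the promotion paths those unit tests cover come from Avro's schema-resolution matrix. A minimal, standalone sketch of that matrix (the names here are illustrative, not the actual `codec.rs` API):

```rust
// Illustrative sketch of the Avro promotion matrix; not the arrow-avro codec.rs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PrimitiveType {
    Int,
    Long,
    Float,
    Double,
    Bytes,
    String,
}

/// Returns true if a writer type may be promoted to the reader type
/// under Avro's schema-resolution rules.
fn is_legal_promotion(writer: PrimitiveType, reader: PrimitiveType) -> bool {
    use PrimitiveType::*;
    matches!(
        (writer, reader),
        (Int, Long)
            | (Int, Float)
            | (Int, Double)
            | (Long, Float)
            | (Long, Double)
            | (Float, Double)
            | (Bytes, String)
            | (String, Bytes)
    )
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn promotion_matrix() {
        use PrimitiveType::*;
        // Legal widenings and bytes/string interop
        for (w, r) in [(Int, Long), (Long, Double), (Float, Double), (Bytes, String)] {
            assert!(is_legal_promotion(w, r));
        }
        // Narrowing or unrelated conversions are illegal
        assert!(!is_legal_promotion(Double, Float));
        assert!(!is_legal_promotion(Long, Int));
    }
}
```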
Thank you (both for the tests and being mindful about our limited review capacity)
- Add support for reader/writer schema resolution with promotions (e.g., `int` to `long`, `float` to `double`) in `RecordDecoder` and related components.
- Enhance decoding logic to handle schema evolution for various Avro types.
- Introduce tests for schema promotions and compatibility.
…n support in `resolve_type`, wired up the handle prefix logic in `reader/mod.rs`, and fixed illegal schema promotions in existing tests.
🚀
I just resolved the conflicts.
I think this was due to my change in (specifically, I didn't add the correct label). I fixed the label and retriggered the job, and it now passed. Thanks again @jecsand838
#8124)

# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

Avro allows safe widening between numeric primitives and interoperability between `bytes` and UTF-8 `string` during **schema resolution**. Implementing promotion-aware decoding lets us:

- Honor the Avro spec's resolution matrix directly in the reader, improving interoperability with evolving schemas.
- Decode **directly into the target Arrow type** (avoiding extra passes and temporary arrays).
- Produce clear errors for **illegal promotions**, instead of surprising behavior. (Per the spec, unresolved writer/reader mismatches are errors.)

# What changes are included in this PR?

**Core decoding (`arrow-avro/src/reader/record.rs`):**

- Add promotion-aware decoder variants:
  - `Int32ToInt64`, `Int32ToFloat32`, `Int32ToFloat64`
  - `Int64ToFloat32`, `Int64ToFloat64`
  - `Float32ToFloat64`
  - `BytesToString`, `StringToBytes`
- Teach `Decoder::try_new` to inspect `ResolutionInfo::Promotion` and select the appropriate variant, so conversion happens **as we decode**, not after.
- Extend `decode`, `append_null`, and `flush` to handle the new variants and materialize the correct Arrow arrays (`Int64Array`, `Float32Array`, `Float64Array`, `StringArray`, `BinaryArray`).
- Keep existing behavior for `Utf8View` for non-promoted strings; promotions to `string` materialize a `StringArray` (not `StringViewArray`) for correctness and simplicity. (StringView remains available for native UTF-8 paths.)

**Integration tests & helpers (`arrow-avro/src/reader/mod.rs`):**

- Add utilities to load a file's **writer schema** JSON and synthesize a **reader schema** with field-level promotions (`make_reader_schema_with_promotions`).
- Add cross-codec tests on `alltypes_plain` (no compression, snappy, zstd, bzip2, xz) that validate:
  - Mixed numeric promotions to `float`/`double` and `int to long`.
  - `bytes to string` and `string to bytes`.
  - Timestamp/timezone behavior unchanged.
- Add **negative** test ensuring **illegal promotions** (e.g., `boolean to double`) produce a descriptive error.

# Are these changes tested?

Yes.

- **Unit tests** (in `record.rs`) for each promotion path:
  - `int to long`, `int to float`, `int to double`
  - `long to float`, `long to double`
  - `float to double`
  - `bytes to string` (including non-ASCII UTF-8) and `string to bytes`
  - Verifies that **illegal** promotions fail fast.
- **Integration tests** (in `mod.rs`) reading real `alltypes_plain` Avro files across multiple compression codecs, asserting exact Arrow outputs for promoted fields.
- Existing tests continue to pass.

# Are there any user-facing changes?

N/A
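The key idea in that decode path is widening at read time rather than casting afterwards. A minimal, self-contained sketch of that idea, assuming Avro's zigzag varint encoding for `int`/`long` (the `PromotedInt` type and helpers below are illustrative, not the PR's actual decoder):

```rust
// Read an unsigned LEB128-style varint, advancing `pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed input
        }
    }
}

// Avro ints/longs are zigzag-encoded varints.
fn read_zigzag_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let raw = read_varint(buf, pos)?;
    Some(((raw >> 1) as i64) ^ -((raw & 1) as i64))
}

// Hypothetical "promotion-aware" buffer: the writer wrote an `int`,
// but we append it already widened to the reader's target type.
enum PromotedInt {
    ToInt64(Vec<i64>),
    ToFloat64(Vec<f64>),
}

impl PromotedInt {
    fn decode(&mut self, buf: &[u8], pos: &mut usize) -> Option<()> {
        let v = read_zigzag_long(buf, pos)? as i32;
        match self {
            PromotedInt::ToInt64(values) => values.push(v as i64),
            PromotedInt::ToFloat64(values) => values.push(v as f64),
        }
        Some(())
    }
}
```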
# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

When reading Avro into Arrow with a projection or a reader schema that omits some writer fields, we were still decoding those writer-only fields item-by-item. This is unnecessary work and can dominate CPU time for large arrays/maps or deeply nested records.

Avro's binary format explicitly allows fast skipping for arrays/maps by encoding data in blocks: when the count is negative, the next `long` gives the byte size of the block, enabling O(1) skipping of that block without decoding each item. This PR teaches the record reader to recognize and leverage that, and to avoid constructing decoders for fields we will skip altogether.

# What changes are included in this PR?

**Reader / decoding architecture**

- **Skip-aware record decoding**:
  - At construction time, we now precompute per-record **skip decoders** for writer fields that the reader will ignore.
  - Introduced a resolved-record path (`RecordResolved`) that carries:
    - `writer_to_reader` mapping for field alignment,
    - a prebuilt list of **skip decoders** for fields not present in the reader,
    - the set of active per-field decoders for the projected fields.
- **Codec builder enhancements**: In `arrow-avro/src/codec.rs`, record construction now:
  - Builds Arrow `Field`s and their decoders only for fields that are read,
  - Builds `skip_decoders` (via `build_skip_decoders`) for fields to ignore.
- **Error handling and consistency**: Kept existing strict-mode behavior; improved internal branching to avoid inconsistent states during partial decodes.

**Tests**

- **Unit tests (in `arrow-avro/src/reader/record.rs`)**
  - Added focused tests that exercise the new skip logic:
    - Skipping writer-only fields inside **arrays** and **maps** (including negative-count block skipping and mixed multi-block payloads).
    - Skipping nested structures within records to ensure offsets and lengths remain correct for the fields that are read.
    - Ensured nullability and union handling remain correct when adjacent fields are skipped.
- **Integration tests (in `arrow-avro/src/reader/mod.rs`)**
  - Added end-to-end test using `avro/alltypes_plain.avro` to validate that projecting a subset of fields (reader schema omits some writer fields) both:
    - Produces the correct Arrow arrays for the selected fields, and
    - Avoids decoding skipped fields (validated indirectly via behavior and block boundaries).
  - The test covers compressed and uncompressed variants already present in the suite to ensure behavior is consistent across codecs.

# Are these changes tested?

- **New unit tests** cover:
  - Fast skipping for arrays/maps using negative block counts and block sizes (per Avro spec).
  - Nested and nullable scenarios to ensure correct offsets, validity bitmaps, and flush behavior when adjacent fields are skipped.
- **New integration test** in `reader/mod.rs`:
  - Reads `avro/alltypes_plain.avro` with a reader schema that omits several writer fields and asserts the resulting `RecordBatch` matches the expected arrays while exercising the skip path.
- Existing promotion, enum, decimal, fixed, and union tests continue to pass, ensuring no regressions in unrelated areas.

# Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
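For reference, the negative-count block convention from the Avro spec is what makes the skip path cheap. A hedged, standalone sketch of that skipping logic (names are illustrative, not the PR's `skip_decoders` API):

```rust
/// Read an Avro zigzag-encoded long (varint) from `buf`, advancing `pos`.
fn read_zigzag_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(((value >> 1) as i64) ^ -((value & 1) as i64));
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed input
        }
    }
}

/// Skip an entire Avro array/map value starting at `pos`.
/// `skip_item` is only invoked for blocks that do not carry a byte size.
fn skip_blocks(
    buf: &[u8],
    pos: &mut usize,
    mut skip_item: impl FnMut(&[u8], &mut usize) -> Option<()>,
) -> Option<()> {
    loop {
        let count = read_zigzag_long(buf, pos)?;
        if count == 0 {
            return Some(()); // zero count terminates the array/map
        } else if count < 0 {
            // Negative count: the next long is the block size in bytes -> O(1) skip.
            let size = read_zigzag_long(buf, pos)?;
            *pos = pos.checked_add(usize::try_from(size).ok()?)?;
        } else {
            // Positive count: no size prefix, so items must be skipped one by one.
            for _ in 0..count {
                skip_item(buf, pos)?;
            }
        }
    }
}
```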
# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

Avro `enum` values are **encoded by index** but are **semantically identified by symbol name**. During schema evolution it is legal for the writer and reader to use different enum symbol *orders* so long as the **symbol set is compatible**. The Avro specification requires that, when resolving a writer enum against a reader enum, the value be mapped **by symbol name**, not by the writer's numeric index. If the writer's symbol is not present in the reader's enum and the reader defines a default, the default is used; otherwise it is an error.

# What changes are included in this PR?

**Core changes**

- Implement **writer to reader enum symbol remapping**:
  - Build a fast lookup table at schema resolution time from **writer enum index to reader enum index** using symbol **names**.
  - Apply this mapping during decode so the produced Arrow dictionary keys always reference the **reader's** symbol order.
  - If a writer symbol is not found in the reader enum, surface a clear error.

# Are these changes tested?

Yes. This PR adds comprehensive **unit tests** for enum mapping in `reader/record.rs` and a **real-file integration test** in `reader/mod.rs` using `avro/simple_enum.avro`.

# Are there any user-facing changes?

N/A due to `arrow-avro` not being public yet.
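A small, self-contained sketch of the name-based remapping idea described above (the function and its signature are illustrative, not the arrow-avro API):

```rust
use std::collections::HashMap;

/// Build a writer-index -> reader-index table by matching enum symbol names.
/// A writer symbol missing from the reader enum falls back to the reader's
/// declared default, if any; otherwise it is an error.
fn build_enum_mapping(
    writer_symbols: &[&str],
    reader_symbols: &[&str],
    reader_default: Option<&str>,
) -> Result<Vec<i32>, String> {
    let reader_index: HashMap<&str, i32> = reader_symbols
        .iter()
        .enumerate()
        .map(|(i, s)| (*s, i as i32))
        .collect();
    let default_index = reader_default.and_then(|d| reader_index.get(d).copied());
    writer_symbols
        .iter()
        .map(|symbol| {
            reader_index
                .get(symbol)
                .copied()
                .or(default_index)
                .ok_or_else(|| format!("writer symbol {symbol:?} not found in reader enum"))
        })
        .collect()
}

// At decode time, a writer index `w` becomes `mapping[w as usize]`,
// so the Arrow dictionary keys always reference the reader's symbol order.
```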
Which issue does this PR close?
Rationale for this change
This change introduces the foundation in `codec.rs` for supporting Avro schema evolution, a key feature of the Avro specification. It enables reading Avro data when the writer's schema and the reader's schema do not match exactly but are compatible according to Avro's resolution rules. This makes data consumption more robust and flexible.
This approach focuses on "annotating" each `AvroDataType` with an optional `ResolutionInfo` and then building the `Codec` using the `reader_schema`. This `ResolutionInfo` will be used downstream in my next PR by the `RecordDecoder` to efficiently read and decode the raw record bytes into the `reader_schema`.
Once this is merged in, promotion schema resolution support will need to be added to the `RecordDecoder` in a follow-up PR. These `RecordDecoder` updates will resemble this:
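(A minimal illustration of that shape; the names below are hypothetical and not the actual arrow-avro internals.)

```rust
// Illustrative only: the real RecordDecoder reads Avro binary directly; here the
// already-decoded writer value is passed in, and the match on `Promotion` shows
// where widening happens before the value lands in the reader-typed buffer.
#[derive(Clone, Copy)]
enum Promotion {
    IntToLong,
    IntToDouble,
    LongToDouble,
}

struct PromotedColumn {
    promotion: Promotion,
    longs: Vec<i64>,
    doubles: Vec<f64>,
}

impl PromotedColumn {
    fn append_writer_value(&mut self, writer_value: i64) {
        match self.promotion {
            Promotion::IntToLong => self.longs.push(writer_value),
            Promotion::IntToDouble | Promotion::LongToDouble => {
                self.doubles.push(writer_value as f64)
            }
        }
    }
}
```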
What changes are included in this PR?
- `Maker` struct: handles type promotions such as `int` to `long` or `string` to `bytes`.
- `ResolutionInfo`: An enum that captures the necessary information for resolving schema differences (a rough sketch of these types follows this list).
- `ResolvedRecord`: A struct that holds the mapping between writer and reader record fields.
- `AvroLiteral`: Represents Avro default values.
- `Promotion`: An enum for different kinds of type promotions.
- `EnumMapping`: A struct for enum symbol mapping.
- `AvroFieldBuilder`: The `AvroFieldBuilder` has been updated to accept both a writer's and an optional reader's schema to facilitate schema resolution.
- `PartialEq` derivations: `PartialEq` has been derived for several structs to simplify testing.
- Refactoring: The schema parsing logic has been moved into the `Maker` struct for better organization.
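As referenced in the list above, a rough, illustrative sketch of how these resolution-metadata types could be shaped (these are not the actual `codec.rs` definitions):

```rust
// Illustrative shapes only, not the arrow-avro codec.rs definitions.

/// How a writer type resolves against the reader schema.
enum ResolutionInfo {
    /// Writer type must be widened to the reader type (e.g. int -> long).
    Promotion(Promotion),
    /// Writer record resolved against a reader record.
    Record(ResolvedRecord),
    /// Writer enum remapped onto the reader's symbol order.
    EnumMapping(EnumMapping),
    /// Field absent in the writer; fill from the reader's default.
    DefaultValue(AvroLiteral),
}

enum Promotion {
    IntToLong,
    IntToFloat,
    IntToDouble,
    LongToFloat,
    LongToDouble,
    FloatToDouble,
    BytesToString,
    StringToBytes,
}

struct ResolvedRecord {
    /// For each writer field, the index of the matching reader field (None = skip).
    writer_to_reader: Vec<Option<usize>>,
}

struct EnumMapping {
    /// writer symbol index -> reader symbol index
    mapping: Vec<i32>,
}

enum AvroLiteral {
    Null,
    Boolean(bool),
    Long(i64),
    Double(f64),
    String(String),
    Bytes(Vec<u8>),
}
```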
Are these changes tested?
Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.
Are there any user-facing changes?
N/A
Follow-up PRs
- Add promotion-aware schema resolution support to the `RecordDecoder`.