Added arrow-avro schema resolution foundations and type promotion #8047
Conversation
Thanks @jecsand838
Note to other reviewers: I found this PR easier to understand with a whitespace-blind diff: https://github.com/apache/arrow-rs/pull/8047/files?w=1 (specifically, it makes the refactoring into `Maker` easier to follow).
> Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.
Is this the only new test? I didn't see any others for the promotion rules added in this PR.
@alamb That's a good callout. I was trying to be mindful of this PR's size, and most of the tests I wrote are e2e tests. I probably should have included more unit tests in this PR though.
I went ahead and pushed up another 9 unit tests in `codec.rs` covering all promotion resolution paths. I also wired the schema resolution path up to `arrow-avro/src/reader/mod.rs` and updated the `schema_store` and `handle_prefix` tests to use valid promotion paths, to give us extra coverage.
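For background, the promotion paths those unit tests cover come from Avro's schema-resolution matrix. A minimal, standalone sketch of that matrix (the names here are illustrative, not the actual `codec.rs` API):

```rust
// Illustrative sketch of the Avro promotion matrix; not the arrow-avro codec.rs API.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum PrimitiveType {
    Int,
    Long,
    Float,
    Double,
    Bytes,
    String,
}

/// Returns true if a writer type may be promoted to the reader type
/// under Avro's schema-resolution rules.
fn is_legal_promotion(writer: PrimitiveType, reader: PrimitiveType) -> bool {
    use PrimitiveType::*;
    matches!(
        (writer, reader),
        (Int, Long)
            | (Int, Float)
            | (Int, Double)
            | (Long, Float)
            | (Long, Double)
            | (Float, Double)
            | (Bytes, String)
            | (String, Bytes)
    )
}

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn promotion_matrix() {
        use PrimitiveType::*;
        // Legal widenings and bytes/string interop
        for (w, r) in [(Int, Long), (Long, Double), (Float, Double), (Bytes, String)] {
            assert!(is_legal_promotion(w, r));
        }
        // Narrowing or unrelated conversions are illegal
        assert!(!is_legal_promotion(Double, Float));
        assert!(!is_legal_promotion(Long, Int));
    }
}
```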
Thank you (both for the tests and being mindful about our limited review capacity)
- Add support for reader/writer schema resolution with promotions (e.g., `int` to `long`, `float` to `double`) in `RecordDecoder` and related components.
- Enhance decoding logic to handle schema evolution for various Avro types.
- Introduce tests for schema promotions and compatibility.
…n support in `resolve_type`, wired up the handle prefix logic in `reader/mod.rs`, and fixed illegal schema promotions in existing tests.
🚀
I just resolved the conflicts.
I think this was due to my change in (specifically, I didn't add the correct label). I fixed the label and retriggered the job, and it now passed. Thanks again @jecsand838
#8124)

# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

Avro allows safe widening between numeric primitives and interoperability between `bytes` and UTF-8 `string` during **schema resolution**. Implementing promotion-aware decoding lets us:

- Honor the Avro spec's resolution matrix directly in the reader, improving interoperability with evolving schemas.
- Decode **directly into the target Arrow type** (avoiding extra passes and temporary arrays).
- Produce clear errors for **illegal promotions**, instead of surprising behavior. (Per the spec, unresolved writer/reader mismatches are errors.)

# What changes are included in this PR?

**Core decoding (`arrow-avro/src/reader/record.rs`):**

- Add promotion-aware decoder variants:
  - `Int32ToInt64`, `Int32ToFloat32`, `Int32ToFloat64`
  - `Int64ToFloat32`, `Int64ToFloat64`
  - `Float32ToFloat64`
  - `BytesToString`, `StringToBytes`
- Teach `Decoder::try_new` to inspect `ResolutionInfo::Promotion` and select the appropriate variant, so conversion happens **as we decode**, not after.
- Extend `decode`, `append_null`, and `flush` to handle the new variants and materialize the correct Arrow arrays (`Int64Array`, `Float32Array`, `Float64Array`, `StringArray`, `BinaryArray`).
- Keep existing behavior for `Utf8View` for non-promoted strings; promotions to `string` materialize a `StringArray` (not `StringViewArray`) for correctness and simplicity. (StringView remains available for native UTF-8 paths.)

**Integration tests & helpers (`arrow-avro/src/reader/mod.rs`):**

- Add utilities to load a file's **writer schema** JSON and synthesize a **reader schema** with field-level promotions (`make_reader_schema_with_promotions`).
- Add cross-codec tests on `alltypes_plain` (no compression, snappy, zstd, bzip2, xz) that validate:
  - Mixed numeric promotions to `float`/`double` and `int to long`.
  - `bytes to string` and `string to bytes`.
  - Timestamp/timezone behavior unchanged.
- Add **negative** test ensuring **illegal promotions** (e.g., `boolean to double`) produce a descriptive error.

# Are these changes tested?

Yes.

- **Unit tests** (in `record.rs`) for each promotion path:
  - `int to long`, `int to float`, `int to double`
  - `long to float`, `long to double`
  - `float to double`
  - `bytes to string` (including non-ASCII UTF-8) and `string to bytes`
  - Verifies that **illegal** promotions fail fast.
- **Integration tests** (in `mod.rs`) reading real `alltypes_plain` Avro files across multiple compression codecs, asserting exact Arrow outputs for promoted fields.
- Existing tests continue to pass.

# Are there any user-facing changes?

N/A
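The key idea in that decode path is widening at read time rather than casting afterwards. A minimal, self-contained sketch of that idea, assuming Avro's zigzag varint encoding for `int`/`long` (the `PromotedInt` type and helpers below are illustrative, not the PR's actual decoder):

```rust
// Read an unsigned LEB128-style varint, advancing `pos`.
fn read_varint(buf: &[u8], pos: &mut usize) -> Option<u64> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(value);
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed input
        }
    }
}

// Avro ints/longs are zigzag-encoded varints.
fn read_zigzag_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let raw = read_varint(buf, pos)?;
    Some(((raw >> 1) as i64) ^ -((raw & 1) as i64))
}

// Hypothetical "promotion-aware" buffer: the writer wrote an `int`,
// but we append it already widened to the reader's target type.
enum PromotedInt {
    ToInt64(Vec<i64>),
    ToFloat64(Vec<f64>),
}

impl PromotedInt {
    fn decode(&mut self, buf: &[u8], pos: &mut usize) -> Option<()> {
        let v = read_zigzag_long(buf, pos)? as i32;
        match self {
            PromotedInt::ToInt64(values) => values.push(v as i64),
            PromotedInt::ToFloat64(values) => values.push(v as f64),
        }
        Some(())
    }
}
```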
# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

When reading Avro into Arrow with a projection or a reader schema that omits some writer fields, we were still decoding those writer-only fields item-by-item. This is unnecessary work and can dominate CPU time for large arrays/maps or deeply nested records.

Avro's binary format explicitly allows fast skipping for arrays/maps by encoding data in blocks: when the count is negative, the next `long` gives the byte size of the block, enabling O(1) skipping of that block without decoding each item. This PR teaches the record reader to recognize and leverage that, and to avoid constructing decoders for fields we will skip altogether.

# What changes are included in this PR?

**Reader / decoding architecture**

- **Skip-aware record decoding**:
  - At construction time, we now precompute per-record **skip decoders** for writer fields that the reader will ignore.
  - Introduced a resolved-record path (`RecordResolved`) that carries:
    - `writer_to_reader` mapping for field alignment,
    - a prebuilt list of **skip decoders** for fields not present in the reader,
    - the set of active per-field decoders for the projected fields.
- **Codec builder enhancements**: In `arrow-avro/src/codec.rs`, record construction now:
  - Builds Arrow `Field`s and their decoders only for fields that are read,
  - Builds `skip_decoders` (via `build_skip_decoders`) for fields to ignore.
- **Error handling and consistency**: Kept existing strict-mode behavior; improved internal branching to avoid inconsistent states during partial decodes.

**Tests**

- **Unit tests (in `arrow-avro/src/reader/record.rs`)**
  - Added focused tests that exercise the new skip logic:
    - Skipping writer-only fields inside **arrays** and **maps** (including negative-count block skipping and mixed multi-block payloads).
    - Skipping nested structures within records to ensure offsets and lengths remain correct for the fields that are read.
    - Ensured nullability and union handling remain correct when adjacent fields are skipped.
- **Integration tests (in `arrow-avro/src/reader/mod.rs`)**
  - Added end-to-end test using `avro/alltypes_plain.avro` to validate that projecting a subset of fields (reader schema omits some writer fields) both:
    - Produces the correct Arrow arrays for the selected fields, and
    - Avoids decoding skipped fields (validated indirectly via behavior and block boundaries).
  - The test covers compressed and uncompressed variants already present in the suite to ensure behavior is consistent across codecs.

# Are these changes tested?

- **New unit tests** cover:
  - Fast skipping for arrays/maps using negative block counts and block sizes (per Avro spec).
  - Nested and nullable scenarios to ensure correct offsets, validity bitmaps, and flush behavior when adjacent fields are skipped.
- **New integration test** in `reader/mod.rs`:
  - Reads `avro/alltypes_plain.avro` with a reader schema that omits several writer fields and asserts the resulting `RecordBatch` matches the expected arrays while exercising the skip path.
- Existing promotion, enum, decimal, fixed, and union tests continue to pass, ensuring no regressions in unrelated areas.

# Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
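For reference, the negative-count block convention from the Avro spec is what makes the skip path cheap. A hedged, standalone sketch of that skipping logic (names are illustrative, not the PR's `skip_decoders` API):

```rust
/// Read an Avro zigzag-encoded long (varint) from `buf`, advancing `pos`.
fn read_zigzag_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let (mut value, mut shift) = (0u64, 0u32);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return Some(((value >> 1) as i64) ^ -((value & 1) as i64));
        }
        shift += 7;
        if shift >= 64 {
            return None; // malformed input
        }
    }
}

/// Skip an entire Avro array/map value starting at `pos`.
/// `skip_item` is only invoked for blocks that do not carry a byte size.
fn skip_blocks(
    buf: &[u8],
    pos: &mut usize,
    mut skip_item: impl FnMut(&[u8], &mut usize) -> Option<()>,
) -> Option<()> {
    loop {
        let count = read_zigzag_long(buf, pos)?;
        if count == 0 {
            return Some(()); // zero count terminates the array/map
        } else if count < 0 {
            // Negative count: the next long is the block size in bytes -> O(1) skip.
            let size = read_zigzag_long(buf, pos)?;
            *pos = pos.checked_add(usize::try_from(size).ok()?)?;
        } else {
            // Positive count: no size prefix, so items must be skipped one by one.
            for _ in 0..count {
                skip_item(buf, pos)?;
            }
        }
    }
}
```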
# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

Avro `enum` values are **encoded by index** but are **semantically identified by symbol name**. During schema evolution it is legal for the writer and reader to use different enum symbol *orders* so long as the **symbol set is compatible**. The Avro specification requires that, when resolving a writer enum against a reader enum, the value be mapped **by symbol name**, not by the writer's numeric index. If the writer's symbol is not present in the reader's enum and the reader defines a default, the default is used; otherwise it is an error.

# What changes are included in this PR?

**Core changes**

- Implement **writer to reader enum symbol remapping**:
  - Build a fast lookup table at schema resolution time from **writer enum index to reader enum index** using symbol **names**.
  - Apply this mapping during decode so the produced Arrow dictionary keys always reference the **reader's** symbol order.
  - If a writer symbol is not found in the reader enum, surface a clear error.

# Are these changes tested?

Yes. This PR adds comprehensive **unit tests** for enum mapping in `reader/record.rs` and a **real-file integration test** in `reader/mod.rs` using `avro/simple_enum.avro`.

# Are there any user-facing changes?

N/A due to `arrow-avro` not being public yet.
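A small, self-contained sketch of the name-based remapping idea described above (the function and its signature are illustrative, not the arrow-avro API):

```rust
use std::collections::HashMap;

/// Build a writer-index -> reader-index table by matching enum symbol names.
/// A writer symbol missing from the reader enum falls back to the reader's
/// declared default, if any; otherwise it is an error.
fn build_enum_mapping(
    writer_symbols: &[&str],
    reader_symbols: &[&str],
    reader_default: Option<&str>,
) -> Result<Vec<i32>, String> {
    let reader_index: HashMap<&str, i32> = reader_symbols
        .iter()
        .enumerate()
        .map(|(i, s)| (*s, i as i32))
        .collect();
    let default_index = reader_default.and_then(|d| reader_index.get(d).copied());
    writer_symbols
        .iter()
        .map(|symbol| {
            reader_index
                .get(symbol)
                .copied()
                .or(default_index)
                .ok_or_else(|| format!("writer symbol {symbol:?} not found in reader enum"))
        })
        .collect()
}

// At decode time, a writer index `w` becomes `mapping[w as usize]`,
// so the Arrow dictionary keys always reference the reader's symbol order.
```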
Which issue does this PR close?
Rationale for this change
This change introduces the foundation in `codec.rs` for supporting Avro schema evolution, a key feature of the Avro specification. It enables reading Avro data when the writer's schema and the reader's schema do not match exactly but are compatible according to Avro's resolution rules. This makes data consumption more robust and flexible.
This approach focuses on "annotating" each `AvroDataType` with an optional `ResolutionInfo` and then building the `Codec` using the `reader_schema`. This `ResolutionInfo` will be used downstream in my next PR by the `RecordDecoder` to efficiently read and decode the raw record bytes into the `reader_schema`.
Once this is merged in, promotion schema resolution support will need to be added to the `RecordDecoder` in a follow-up PR. These `RecordDecoder` updates will resemble this:
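(A minimal illustration of that shape; the names below are hypothetical and not the actual arrow-avro internals.)

```rust
// Illustrative only: the real RecordDecoder reads Avro binary directly; here the
// already-decoded writer value is passed in, and the match on `Promotion` shows
// where widening happens before the value lands in the reader-typed buffer.
#[derive(Clone, Copy)]
enum Promotion {
    IntToLong,
    IntToDouble,
    LongToDouble,
}

struct PromotedColumn {
    promotion: Promotion,
    longs: Vec<i64>,
    doubles: Vec<f64>,
}

impl PromotedColumn {
    fn append_writer_value(&mut self, writer_value: i64) {
        match self.promotion {
            Promotion::IntToLong => self.longs.push(writer_value),
            Promotion::IntToDouble | Promotion::LongToDouble => {
                self.doubles.push(writer_value as f64)
            }
        }
    }
}
```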
What changes are included in this PR?
- `Maker` struct: handles type promotions such as `int` to `long` or `string` to `bytes`.
- `ResolutionInfo`: An enum that captures the necessary information for resolving schema differences (a rough sketch of these types follows this list).
- `ResolvedRecord`: A struct that holds the mapping between writer and reader record fields.
- `AvroLiteral`: Represents Avro default values.
- `Promotion`: An enum for different kinds of type promotions.
- `EnumMapping`: A struct for enum symbol mapping.
- `AvroFieldBuilder`: The `AvroFieldBuilder` has been updated to accept both a writer's and an optional reader's schema to facilitate schema resolution.
- `PartialEq` derivations: `PartialEq` has been derived for several structs to simplify testing.
- Refactoring: The schema parsing logic has been moved into the `Maker` struct for better organization.
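As referenced in the list above, a rough, illustrative sketch of how these resolution-metadata types could be shaped (these are not the actual `codec.rs` definitions):

```rust
// Illustrative shapes only, not the arrow-avro codec.rs definitions.

/// How a writer type resolves against the reader schema.
enum ResolutionInfo {
    /// Writer type must be widened to the reader type (e.g. int -> long).
    Promotion(Promotion),
    /// Writer record resolved against a reader record.
    Record(ResolvedRecord),
    /// Writer enum remapped onto the reader's symbol order.
    EnumMapping(EnumMapping),
    /// Field absent in the writer; fill from the reader's default.
    DefaultValue(AvroLiteral),
}

enum Promotion {
    IntToLong,
    IntToFloat,
    IntToDouble,
    LongToFloat,
    LongToDouble,
    FloatToDouble,
    BytesToString,
    StringToBytes,
}

struct ResolvedRecord {
    /// For each writer field, the index of the matching reader field (None = skip).
    writer_to_reader: Vec<Option<usize>>,
}

struct EnumMapping {
    /// writer symbol index -> reader symbol index
    mapping: Vec<i32>,
}

enum AvroLiteral {
    Null,
    Boolean(bool),
    Long(i64),
    Double(f64),
    String(String),
    Bytes(Vec<u8>),
}
```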
Are these changes tested?
Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.
Are there any user-facing changes?
N/A
Follow-up PRs
- Add promotion-aware schema resolution support to the `RecordDecoder`.