
Conversation

jecsand838 (Contributor)

Which issue does this PR close?

Rationale for this change

This change introduces the foundation in codec.rs for supporting Avro schema evolution, a key feature of the Avro specification. It enables reading Avro data when the writer's schema and the reader's schema do not match exactly but are compatible under Avro's resolution rules. This makes data consumption more robust and flexible.

This approach focuses on "annotating" each AvroDataType with an optional ResolutionInfo and then building the Codec using the reader_schema. This ResolutionInfo will be used downstream in my next PR by the RecordDecoder to efficiently read and decode the raw record bytes into the reader_schema.

Once this is merged in, promotion schema resolution support will need to be added to the RecordDecoder in a follow-up PR. Those RecordDecoder updates will resemble the following:

    Promotion::IntToLong => Int32ToInt64(BufferBuilder::new(DEFAULT_CAPACITY)),
    Promotion::IntToFloat => Int32ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
    Promotion::IntToDouble => Int32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
    Promotion::LongToFloat => Int64ToFloat32(BufferBuilder::new(DEFAULT_CAPACITY)),
    Promotion::LongToDouble => Int64ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
    Promotion::FloatToDouble => Float32ToFloat64(BufferBuilder::new(DEFAULT_CAPACITY)),
    Promotion::BytesToString => BytesToString(
        OffsetBufferBuilder::new(DEFAULT_CAPACITY),
        BufferBuilder::new(DEFAULT_CAPACITY),
    ),
    Promotion::StringToBytes => StringToBytes(
        OffsetBufferBuilder::new(DEFAULT_CAPACITY),
        BufferBuilder::new(DEFAULT_CAPACITY),
    ),

What changes are included in this PR?

  • Schema Resolution Logic: The core of this PR is the new schema resolution logic, which is encapsulated in the Maker struct. This handles:
    • Type Promotions: E.g., promoting int to long or string to bytes.
    • Default Values: Using default values from the reader's schema when a field is missing in the writer's schema.
    • Record Evolution: Resolving differences in record fields between the writer and reader schemas. This includes adding or removing fields.
    • Enum Evolution: Mapping enum symbols between the writer's and reader's schemas.
  • New Data Structures: Several new data structures have been added to support schema resolution (see the sketch after this list):
    • ResolutionInfo: An enum that captures the necessary information for resolving schema differences.
    • ResolvedRecord: A struct that holds the mapping between writer and reader record fields.
    • AvroLiteral: Represents Avro default values.
    • Promotion: An enum for different kinds of type promotions.
    • EnumMapping: A struct for enum symbol mapping.
  • Updated AvroFieldBuilder: The AvroFieldBuilder has been updated to accept both a writer's schema and an optional reader's schema to facilitate schema resolution.
  • PartialEq Derivations: PartialEq has been derived for several structs to simplify testing.
  • Refactoring: The schema parsing logic has been refactored from a standalone function into the new Maker struct for better organization.
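
For orientation, here is a minimal sketch of how these new types might fit together. The type names come from this PR's description, but every field and variant shown here is an illustrative assumption rather than the merged definition:

```rust
// Illustrative sketch only: shapes are assumed from the PR description.

/// Stand-in for an Avro default value (hypothetical variants).
#[derive(Debug, Clone, PartialEq)]
pub enum AvroLiteral {
    Null,
    Boolean(bool),
    Long(i64),
    String(String),
}

/// Mapping between writer and reader record fields (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
pub struct ResolvedRecord {
    /// For each writer field, the reader field it maps to (None = skipped).
    pub writer_to_reader: Vec<Option<usize>>,
}

/// Writer-index to reader-index table for enum symbols (hypothetical layout).
#[derive(Debug, Clone, PartialEq)]
pub struct EnumMapping {
    pub writer_to_reader: Vec<i32>,
}

/// The promotion variants named in this PR.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum Promotion {
    IntToLong,
    IntToFloat,
    IntToDouble,
    LongToFloat,
    LongToDouble,
    FloatToDouble,
    BytesToString,
    StringToBytes,
}

/// Resolution metadata attached to an AvroDataType when the writer and
/// reader schemas differ but are compatible.
#[derive(Debug, Clone, PartialEq)]
pub enum ResolutionInfo {
    Promotion(Promotion),
    DefaultValue(AvroLiteral),
    Record(ResolvedRecord),
    EnumMapping(EnumMapping),
}

fn main() {
    // A writer `int` resolved against a reader `long`:
    let info = ResolutionInfo::Promotion(Promotion::IntToLong);
    assert_eq!(info, ResolutionInfo::Promotion(Promotion::IntToLong));
}
```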

Are these changes tested?

Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.

Are there any user-facing changes?

N/A (arrow-avro is not yet public).

Follow-up PRs

  • Promotion schema resolution support in RecordDecoder
  • Default value schema resolution support (codec + decoder)
  • Enum mapping schema resolution support (codec + decoder)
  • Skip value schema resolution support (codec + decoder)
  • Record resolution support (codec + decoder)

github-actions bot added the arrow label (Changes to the arrow crate) on Aug 5, 2025.
jecsand838 (Contributor, Author):

@alamb @scovich

Here's the first schema resolution PR. I left out the RecordDecoder changes because they would add another ~500 LOC.

I also listed the follow-up PRs I'm planning in order to complete schema resolution.

jecsand838 force-pushed the avro-schema-resolution-promotions branch from 920e15d to 88b4c99 on August 6, 2025 at 20:37.
alamb (Contributor) left a comment:


Thanks @jecsand838

Notes to other reviewers: I found this PR easier to understand with a whitespace-blind diff: https://github.com/apache/arrow-rs/pull/8047/files?w=1 (specifically, I found it makes the refactoring into Maker easier to follow).

alamb (Contributor) commented on a review thread, quoting the PR description:

> Yes, new unit tests have been added to verify the schema resolution logic, including tests for type promotions and handling of default values.

Is this the only new test? I didn't see any others for the promotion rules added in this PR.

jecsand838 (Contributor, Author) replied:

@alamb That's a good callout. I was trying to be mindful of this PR's size, and most of the tests I wrote are e2e tests. I probably should have included more tests in this PR though.

I went ahead and pushed up another 9 unit tests in codec.rs covering all promotion resolution paths. I also wired up the schema resolution path to arrow-avro/src/reader/mod.rs and updated the schema_store and handle_prefix tests to have valid promotion paths to give us extra coverage.

alamb (Contributor) replied on Aug 11, 2025:

Thank you (both for the tests and for being mindful of our limited review capacity).

- Add support for reader/writer schema resolution with promotions (e.g., int to long, float to double) in `RecordDecoder` and related components.
- Enhance decoding logic to handle schema evolution for various Avro types.
- Introduce tests for schema promotions and compatibility.
jecsand838 changed the title from "Added arrow-avro schema resolution foundations and type promotion in codec.rs" to "Added arrow-avro schema resolution foundations and type promotion" on Aug 11, 2025.
…n support in `resolve_type`, wired up the handle prefix logic in `reader/mod.rs`, and fixed illegal schema promotions in existing tests.
alamb (Contributor) commented on Aug 12, 2025:

🚀

alamb (Contributor) commented on Aug 12, 2025:

GitHub says this PR has conflicts that need to be fixed before I can merge it:

[screenshot: GitHub merge-conflict notice, 2025-08-12]

jecsand838 force-pushed the avro-schema-resolution-promotions branch from 19d5b4f to 4768dfc on August 12, 2025 at 19:10.
jecsand838 (Contributor, Author):

@alamb

I just resolved the conflicts.
Also, I seem to be getting an error on the dev_pr check:

Error: HttpError: You do not have permission to create labels on this repository.: {"resource":"Repository","field":"label","code":"unauthorized"}
Error: You do not have permission to create labels on this repository.: {"resource":"Repository","field":"label","code":"unauthorized"}

github-actions bot added the arrow-avro label (arrow-avro crate) on Aug 12, 2025.
alamb (Contributor) commented on Aug 12, 2025:

> @alamb
>
> I just resolved the conflicts. Also, I seem to be getting an error on the dev_pr check:
>
> Error: HttpError: You do not have permission to create labels on this repository.: {"resource":"Repository","field":"label","code":"unauthorized"}
> Error: You do not have permission to create labels on this repository.: {"resource":"Repository","field":"label","code":"unauthorized"}

I think this was due to my change in

(specifically, I didn't add the correct label)

I fixed the label and retriggered the job and it now passed.

Thanks again @jecsand838

alamb merged commit c52db65 into apache:main on Aug 12, 2025 (23 of 24 checks passed).
alamb pushed a commit that referenced this pull request on Aug 14, 2025 (#8124):

# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

Avro allows safe widening between numeric primitives and
interoperability between `bytes` and UTF‑8 `string` during **schema
resolution**.

Implementing promotion-aware decoding lets us:
- Honor the Avro spec’s resolution matrix directly in the reader,
improving interoperability with evolving schemas.
- Decode **directly into the target Arrow type** (avoiding extra passes
and temporary arrays).
- Produce clear errors for **illegal promotions**, instead of surprising
behavior. (Per the spec, unresolved writer/reader mismatches are
errors.)

# What changes are included in this PR?

**Core decoding (`arrow-avro/src/reader/record.rs`):**
- Add promotion-aware decoder variants:
  - `Int32ToInt64`, `Int32ToFloat32`, `Int32ToFloat64`
  - `Int64ToFloat32`, `Int64ToFloat64`
  - `Float32ToFloat64`
  - `BytesToString`, `StringToBytes`
- Teach `Decoder::try_new` to inspect `ResolutionInfo::Promotion` and
select the appropriate variant, so conversion happens **as we decode**,
not after (see the sketch after this list).
- Extend `decode`, `append_null`, and `flush` to handle the new variants
and materialize the correct Arrow arrays (`Int64Array`, `Float32Array`,
`Float64Array`, `StringArray`, `BinaryArray`).
- Keep existing behavior for `Utf8View` for non-promoted strings;
promotions to `string` materialize a `StringArray` (not
`StringViewArray`) for correctness and simplicity. (StringView remains
available for native UTF‑8 paths.)
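
To make the decode-time conversion concrete, here is a self-contained sketch of the idea behind a variant like `Int32ToInt64`: the writer's Avro `int` is read off the wire and widened to `i64` in the same pass, with no intermediate `Int32Array`. The `read_zigzag_long` helper follows the Avro varint/zig-zag wire format; everything else is illustrative rather than the PR's actual code.

```rust
// Hedged sketch: widen an Avro `int` to i64 while decoding, as an
// `Int32ToInt64`-style variant would, instead of casting afterwards.

fn read_zigzag_long(buf: &[u8], pos: &mut usize) -> Option<i64> {
    let (mut value, mut shift) = (0u64, 0);
    loop {
        let byte = *buf.get(*pos)?;
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            // Zig-zag decode: (n >> 1) ^ -(n & 1)
            return Some((value >> 1) as i64 ^ -((value & 1) as i64));
        }
        shift += 7;
    }
}

fn main() {
    // The writer encoded Avro `int` 3 (zig-zag varint 0x06); the reader
    // schema says `long`, so we widen during decode rather than after.
    let bytes = [0x06u8];
    let mut pos = 0;
    let writer_int = read_zigzag_long(&bytes, &mut pos).unwrap() as i32;
    let promoted = i64::from(writer_int); // Promotion::IntToLong
    assert_eq!(promoted, 3i64);
}
```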

**Integration tests & helpers (`arrow-avro/src/reader/mod.rs`):**
- Add utilities to load a file’s **writer schema** JSON and synthesize a
**reader schema** with field-level promotions
(`make_reader_schema_with_promotions`); a hedged sketch follows this list.
- Add cross‑codec tests on `alltypes_plain` (no compression, snappy,
zstd, bzip2, xz) that validate:
  - Mixed numeric promotions to `float`/`double` and `int to long`.
  - `bytes to string` and `string to bytes`.
  - Timestamp/timezone behavior unchanged.
- Add **negative** test ensuring **illegal promotions** (e.g., `boolean
to double`) produce a descriptive error.
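
As a rough illustration of what synthesizing such a reader schema can look like, here is a hedged sketch using `serde_json`. The helper name matches the PR, but this signature and the bare-string type rewrite (which ignores union types) are simplifying assumptions:

```rust
use std::collections::HashMap;

// Hypothetical sketch: rewrite selected field types in a writer schema's
// JSON to produce a reader schema that exercises promotions.
fn make_reader_schema_with_promotions(
    writer_schema_json: &str,
    promotions: &HashMap<&str, &str>, // field name -> promoted type
) -> serde_json::Value {
    let mut schema: serde_json::Value =
        serde_json::from_str(writer_schema_json).expect("valid Avro JSON");
    if let Some(fields) = schema["fields"].as_array_mut() {
        for field in fields {
            let name = field["name"].as_str().unwrap_or_default().to_string();
            if let Some(new_type) = promotions.get(name.as_str()) {
                // Rewrite the field's type, e.g. "int" -> "long".
                field["type"] = serde_json::Value::String((*new_type).to_string());
            }
        }
    }
    schema
}

fn main() {
    let writer = r#"{"type":"record","name":"t","fields":[{"name":"id","type":"int"}]}"#;
    let promos = HashMap::from([("id", "long")]);
    let reader = make_reader_schema_with_promotions(writer, &promos);
    assert_eq!(reader["fields"][0]["type"], "long");
}
```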

# Are these changes tested?

Yes.
- **Unit tests** (in `record.rs`) for each promotion path:
  - `int to long`, `int to float`, `int to double`
  - `long to float`, `long to double`
  - `float to double`
  - `bytes to string` (including non‑ASCII UTF‑8) and `string to bytes`
  - Verifies that **illegal** promotions fail fast.
- **Integration tests** (in `mod.rs`) reading real `alltypes_plain` Avro
files across multiple compression codecs, asserting exact Arrow outputs
for promoted fields.
- Existing tests continue to pass.

# Are there any user-facing changes?

N/A
alamb pushed a commit that referenced this pull request on Sep 4, 2025:
# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

When reading Avro into Arrow with a projection or a reader schema that
omits some writer fields, we were still decoding those writer‑only
fields item‑by‑item. This is unnecessary work and can dominate CPU time
for large arrays/maps or deeply nested records.

Avro’s binary format explicitly allows fast skipping for arrays/maps by
encoding data in blocks: when the count is negative, the next `long`
gives the byte size of the block, enabling O(1) skipping of that block
without decoding each item. This PR teaches the record reader to
recognize and leverage that, and to avoid constructing decoders for
fields we will skip altogether.
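
A minimal sketch of that block-skip rule, assuming hypothetical helper names (the real skip decoders are more involved): each block starts with a zig-zag-encoded item count, and a negative count is followed by the block's byte size, which allows an O(1) jump.

```rust
fn read_zigzag_long(buf: &[u8], pos: &mut usize) -> i64 {
    let (mut value, mut shift) = (0u64, 0);
    loop {
        let byte = buf[*pos];
        *pos += 1;
        value |= u64::from(byte & 0x7F) << shift;
        if byte & 0x80 == 0 {
            return (value >> 1) as i64 ^ -((value & 1) as i64);
        }
        shift += 7;
    }
}

/// Advance `pos` past an entire Avro array without materializing items.
fn skip_array(buf: &[u8], pos: &mut usize, skip_item: &dyn Fn(&[u8], &mut usize)) {
    loop {
        let count = read_zigzag_long(buf, pos);
        if count == 0 {
            break; // end-of-array marker
        } else if count < 0 {
            // Negative count: the next long is the block size in bytes.
            let block_bytes = read_zigzag_long(buf, pos) as usize;
            *pos += block_bytes; // O(1) skip of the whole block
        } else {
            // No byte size recorded; fall back to skipping item by item.
            for _ in 0..count {
                skip_item(buf, pos);
            }
        }
    }
}

fn main() {
    // One block: count -1 (zig-zag 0x01), block size 1 byte (0x02),
    // one encoded item (0x06), then the end-of-array marker (0x00).
    let bytes = [0x01u8, 0x02, 0x06, 0x00];
    let mut pos = 0;
    skip_array(&bytes, &mut pos, &|_, p| *p += 1);
    assert_eq!(pos, bytes.len());
}
```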

# What changes are included in this PR?

**Reader / decoding architecture**
- **Skip-aware record decoding**:
- At construction time, we now precompute per-record **skip decoders**
for writer fields that the reader will ignore.
  - Introduced a resolved-record path (`RecordResolved`) that carries:
    - `writer_to_reader` mapping for field alignment,
- a prebuilt list of **skip decoders** for fields not present in the
reader,
    - the set of active per-field decoders for the projected fields.
- **Codec builder enhancements**: In `arrow-avro/src/codec.rs`, record
construction now:
- Builds Arrow `Field`s and their decoders only for fields that are
read,
- Builds `skip_decoders` (via `build_skip_decoders`) for fields to
ignore.
- **Error handling and consistency**: Kept existing strict-mode
behavior; improved internal branching to avoid inconsistent states
during partial decodes.

**Tests**
- **Unit tests (in `arrow-avro/src/reader/record.rs`)**
  - Added focused tests that exercise the new skip logic:
- Skipping writer‑only fields inside **arrays** and **maps** (including
negative‑count block skipping and mixed multi‑block payloads).
- Skipping nested structures within records to ensure offsets and
lengths remain correct for the fields that are read.
- Ensured nullability and union handling remain correct when adjacent
fields are skipped.
- **Integration tests (in `arrow-avro/src/reader/mod.rs`)**
- Added end‑to‑end test using `avro/alltypes_plain.avro` to validate
that projecting a subset of fields (reader schema omits some writer
fields) both:
    - Produces the correct Arrow arrays for the selected fields, and
- Avoids decoding skipped fields (validated indirectly via behavior and
block boundaries).
- The test covers compressed and uncompressed variants already present
in the suite to ensure behavior is consistent across codecs.

# Are these changes tested?

- **New unit tests** cover:
- Fast skipping for arrays/maps using negative block counts and block
sizes (per Avro spec).
- Nested and nullable scenarios to ensure correct offsets, validity
bitmaps, and flush behavior when adjacent fields are skipped.
- **New integration test** in `reader/mod.rs`:
- Reads `avro/alltypes_plain.avro` with a reader schema that omits
several writer fields and asserts the resulting `RecordBatch` matches
the expected arrays while exercising the skip path.
- Existing promotion, enum, decimal, fixed, and union tests continue to
pass, ensuring no regressions in unrelated areas.

# Are there any user-facing changes?

N/A since `arrow-avro` is not public yet.
alamb pushed a commit that referenced this pull request on Sep 6, 2025:
# Which issue does this PR close?

- Part of #4886
- Follows up on #8047

# Rationale for this change

Avro `enum` values are **encoded by index** but are **semantically
identified by symbol name**. During schema evolution it is legal for the
writer and reader to use different enum symbol *orders* so long as the
**symbol set is compatible**. The Avro specification requires that, when
resolving a writer enum against a reader enum, the value be mapped **by
symbol name**, not by the writer’s numeric index. If the writer’s symbol
is not present in the reader’s enum and the reader defines a default,
the default is used; otherwise it is an error.
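
A hedged sketch of the name-based remapping described above (the helper name, signature, and error handling are illustrative): build the writer-index to reader-index table once, falling back to the reader's default symbol when a writer symbol is absent, and error otherwise.

```rust
use std::collections::HashMap;

fn build_enum_mapping(
    writer_symbols: &[&str],
    reader_symbols: &[&str],
    reader_default: Option<&str>,
) -> Result<Vec<i32>, String> {
    // Index the reader's symbols by name for O(1) lookup.
    let reader_index: HashMap<&str, i32> = reader_symbols
        .iter()
        .enumerate()
        .map(|(i, s)| (*s, i as i32))
        .collect();
    let default_index = reader_default.and_then(|d| reader_index.get(d).copied());
    writer_symbols
        .iter()
        .map(|symbol| {
            reader_index
                .get(symbol)
                .copied()
                .or(default_index) // fall back to the reader's default symbol
                .ok_or_else(|| format!("enum symbol {symbol:?} not in reader schema"))
        })
        .collect()
}

fn main() {
    // Writer order differs from reader order; mapping goes by name.
    let writer = ["RED", "GREEN", "BLUE"];
    let reader = ["BLUE", "RED", "GREEN"];
    let mapping = build_enum_mapping(&writer, &reader, None).unwrap();
    assert_eq!(mapping, vec![1, 2, 0]); // writer idx -> reader idx
}
```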

# What changes are included in this PR?

**Core changes**
- Implement **writer to reader enum symbol remapping**:
- Build a fast lookup table at schema resolution time from **writer enum
index to reader enum index** using symbol **names**.
- Apply this mapping during decode so the produced Arrow dictionary keys
always reference the **reader’s** symbol order.
- If a writer symbol is not found in the reader enum, surface a clear
error.

# Are these changes tested?

Yes. This PR adds comprehensive **unit tests** for enum mapping in
`reader/record.rs` and a **real‑file integration test** in
`reader/mod.rs` using `avro/simple_enum.avro`.

# Are there any user-facing changes?

N/A due to `arrow-avro` not being public yet.