Compute data buffer length by using start and end values in offset buffer #5756

@viirya

Describe the bug

Encountered an issue when importing an empty array with the variable-size binary layout (e.g., a string array) from Java Arrow.

Java Arrow and arrow-rs differ in how they compute the length of the data buffer: apache/arrow#41610 (comment)

This is how Java Arrow imports a Utf8 array:

try (ArrowBuf offsets = importOffsets(type, VarCharVector.OFFSET_WIDTH)) {
  final int start = offsets.getInt(0);
  final int end = offsets.getInt(fieldNode.getLength() * (long) VarCharVector.OFFSET_WIDTH);
  final int len = end - start;
  ...
}

For an empty array, fieldNode.getLength() is 0, so start and end are both read from slot 0 of the one-element offset buffer. So even if the offset buffer is not initialized, end - start is always 0, which is the correct length of the data buffer. That is why the added roundtrip tests pass.

arrow-rs, however, takes the last value of the offset buffer, i.e., end, as the length of the data buffer. If that value is not initialized to zero, the computed length of the data buffer is incorrect.
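
The difference between the two computations can be sketched with a plain int array standing in for the offset buffer. The class and method names below are hypothetical and not part of Java Arrow or arrow-rs, and the value 7 is an arbitrary stand-in for whatever an uninitialized slot happens to contain.

```java
// Hypothetical helper for illustration only; not an Arrow API.
final class DataLengthSketch {
    // Length as in the Java Arrow snippet above: last offset minus first offset.
    static int lengthFromSpan(int[] offsets, int valueCount) {
        int start = offsets[0];
        int end = offsets[valueCount];
        return end - start;
    }

    // Length as this issue describes the arrow-rs behavior: the last offset alone.
    static int lengthFromLastOffset(int[] offsets, int valueCount) {
        return offsets[valueCount];
    }

    public static void main(String[] args) {
        // Empty array: a single offset slot that was never zeroed.
        int[] uninitialized = {7};
        System.out.println(lengthFromSpan(uninitialized, 0));       // prints 0
        System.out.println(lengthFromLastOffset(uninitialized, 0)); // prints 7
    }
}
```

With a zero-initialized slot both methods return 0, which is why the problem only shows up when the exporter leaves the slot uninitialized.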

This is what the spec says about the first offset value:

Generally the first slot in the offsets array is 0, and the last slot is the length of the values array.
When serializing this layout, we recommend normalizing the offsets to start at 0.

It looks like the first value doesn't have to be 0, although it generally is. So Java Arrow's approach seems (more) correct.
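
As a small illustration of offsets that do not start at 0, the sketch below decodes two strings from a byte array through absolute offsets and counts the bytes the values cover. All names and sample data are made up for this example. It only shows the arithmetic and makes no claim about how either library sizes the imported buffer.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical illustration; the names here are not part of any Arrow library.
final class NonZeroFirstOffset {
    // Decode value i of a variable-size binary layout from absolute offsets.
    static String valueAt(byte[] data, int[] offsets, int i) {
        int from = offsets[i];
        int to = offsets[i + 1];
        return new String(data, from, to - from, StandardCharsets.UTF_8);
    }

    // Number of data bytes the values cover: last offset minus first offset.
    static int referencedBytes(int[] offsets) {
        return offsets[offsets.length - 1] - offsets[0];
    }

    public static void main(String[] args) {
        byte[] data = "xyzhiarrow".getBytes(StandardCharsets.UTF_8);
        int[] offsets = {3, 5, 10}; // first slot is 3, not 0
        System.out.println(valueAt(data, offsets, 0)); // prints hi
        System.out.println(valueAt(data, offsets, 1)); // prints arrow
        System.out.println(referencedBytes(offsets));  // prints 7; the last offset is 10
    }
}
```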

To Reproduce

Expected behavior

Additional context
