@NguyenHoangSon96 NguyenHoangSon96 commented Jul 29, 2025

Rationale for this change
The InfluxDB 3 client library uses arrow-js underneath, but arrow-js does not support the Utf8View datatype, which causes the error reported in the linked issue.

Checklist

  • All tests pass (yarn test)
  • Build completes (yarn build)

I have added a new test for the Utf8View datatype.

NOTE: We would appreciate a prompt review of this PR, because without Utf8View support influxdb3-js users cannot query tables that use Utf8View in InfluxDB 3.

This PR includes breaking changes to public APIs?
No. The change adds functionality but does not modify any existing API behavior.

Closes #44

@NguyenHoangSon96 NguyenHoangSon96 changed the title from "feat: add support for utf8view types" to "feat: add support for utf8view type" Jul 29, 2025
@NguyenHoangSon96 NguyenHoangSon96 marked this pull request as draft July 29, 2025 08:38
@NguyenHoangSon96 NguyenHoangSon96 marked this pull request as ready for review July 31, 2025 02:47
@NguyenHoangSon96
Author

Hi @trxcllnt
Can you help me review this PR?
I can't add you to the Reviewers for some reason, so I commented here.
This is the first time I have created a PR for arrow-js, sorry if I did something incorrectly 😃

@amoeba amoeba requested review from domoritz and trxcllnt July 31, 2025 20:54
@NguyenHoangSon96
Author

Hi
yarn test:bundle fixed.

Member

@domoritz domoritz left a comment

I have not run the code locally but it looks reasonable to me and tests are passing.

@NguyenHoangSon96
Author

NguyenHoangSon96 commented Aug 2, 2025

> I have not run the code locally, but it looks reasonable to me, and the tests are passing.

Hi @domoritz
Thank you. Can you merge it?
And once it is merged, how long will it take for a new version of arrow-js to be released?

@trxcllnt
Contributor

trxcllnt commented Aug 2, 2025

Maybe I'm missing something about this PR, but this seems like the Utf8View just duplicates everything from Utf8, and doesn't actually provide a "view" over Utf8 bytes? I assume a real Utf8View implementation would include a scalar type that lazily decodes the bytes into a JS string on demand?

If this is just duplicating the Utf8 code, we should just interpret the Utf8View typeId as a Utf8 typeId and reuse all the existing codepaths. We already get complaints about library bundle size, we shouldn't add to it if it can be helped.
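
As a rough illustration of the lazy decoding described above, a scalar could hold the raw UTF-8 bytes and decode them only when a string is actually requested. This is a hypothetical sketch: `LazyUtf8` is not an arrow-js class, and the only thing it assumes is the standard `TextDecoder`/`TextEncoder` globals.

```typescript
// Hypothetical scalar wrapper for illustration; not part of arrow-js.
const utf8Decoder = new TextDecoder("utf-8");

class LazyUtf8 {
  private readonly bytes: Uint8Array;
  private cached: string | undefined = undefined;

  constructor(bytes: Uint8Array) {
    this.bytes = bytes;
  }

  // The encoded byte length is known without decoding anything.
  get byteLength(): number {
    return this.bytes.byteLength;
  }

  // Decode on first access, then reuse the cached string.
  toString(): string {
    if (this.cached === undefined) {
      this.cached = utf8Decoder.decode(this.bytes);
    }
    return this.cached;
  }
}

const lazyValue = new LazyUtf8(new TextEncoder().encode("héllo"));
console.log(lazyValue.byteLength); // 6, because "é" takes two bytes in UTF-8
console.log(lazyValue.toString()); // "héllo"
```

Whether such a wrapper is worth its cost compared with decoding eagerly to a JS string is a design question this sketch does not settle.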

@domoritz
Member

domoritz commented Aug 3, 2025

Good catch and agreed that we should not just duplicate code if the logic is the same.

@kou kou requested a review from Copilot August 5, 2025 03:48

Copilot AI left a comment

Pull Request Overview

This PR adds support for the Utf8View datatype to the Apache Arrow JavaScript library, which was preventing InfluxDB3 users from querying tables that use this type. The implementation follows the existing pattern for string types like Utf8 and LargeUtf8.

  • Adds Utf8View datatype class and corresponding builder
  • Implements visitor pattern support for Utf8View across all visitor classes
  • Adds comprehensive test coverage for the new type

Reviewed Changes

Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/type.ts | Defines the new Utf8View datatype class |
| src/builder/utf8view.ts | Implements Utf8ViewBuilder for creating Utf8View vectors |
| src/visitor/*.ts | Adds Utf8View support to all visitor pattern implementations |
| src/fb/utf8-view.ts | Generated FlatBuffers definition for Utf8View |
| test/unit/builders/utf8view-tests.ts | Comprehensive test suite for Utf8ViewBuilder |
| test/unit/vector/vector-tests.ts | Vector tests for Utf8View functionality |

return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) });
}
public visitUtf8View<T extends type.Utf8View>(type: T, { length, nullCount } = this.nextFieldNode()) {
return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) });

Copilot AI Aug 5, 2025

The visitUtf8View method implementation is identical to visitUtf8, but Utf8View has a different internal representation that may require different handling of data buffers. According to the Utf8View specification, it uses a view struct and may point to multiple data buffers, which differs from the simple offset-based approach used by Utf8.

Suggested change
return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) });
// Read the null bitmap as usual
const nullBitmap = this.readNullBitmap(type, nullCount);
// Read the value offsets as usual
const valueOffsets = this.readOffsets(type);
// Read the view struct buffer (describes mapping to data buffers)
const viewStruct = this.readViewStruct(type, length);
// Read the referenced data buffers (could be multiple)
const dataBuffers = this.readDataBuffers(type, viewStruct);
return makeData({
type,
length,
nullCount,
nullBitmap,
valueOffsets,
viewStruct,
dataBuffers
});
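
For reference, the Arrow columnar format lays out each BinaryView/Utf8View element as a fixed 16-byte struct, and the view layout has no offsets buffer. The following is a minimal sketch of reading one element under that layout. The names `readViewBytes` and `readViewString` are hypothetical helpers, not arrow-js APIs, and the sketch assumes the views buffer and the variadic data buffers are already available as `Uint8Array`s.

```typescript
// View layout per the Arrow columnar format (16 bytes per element, little-endian):
//   bytes 0..3   : length of the value in bytes (int32)
//   length <= 12 : bytes 4..15 hold the value inline, zero-padded
//   length > 12  : bytes 4..7 prefix, bytes 8..11 data buffer index, bytes 12..15 offset
const VIEW_SIZE = 16;
const INLINE_MAX = 12;

// Hypothetical helper: returns the raw bytes of element `index`.
function readViewBytes(views: Uint8Array, index: number, dataBuffers: Uint8Array[]): Uint8Array {
  const base = index * VIEW_SIZE;
  const dv = new DataView(views.buffer, views.byteOffset + base, VIEW_SIZE);
  const length = dv.getInt32(0, true);
  if (length < 0) {
    throw new RangeError(`negative view length at index ${index}`);
  }
  if (length <= INLINE_MAX) {
    // Short values live directly inside the view struct.
    return views.subarray(base + 4, base + 4 + length);
  }
  // Long values are referenced by (buffer index, offset) into a data buffer.
  const bufferIndex = dv.getInt32(8, true);
  const offset = dv.getInt32(12, true);
  const buffer = dataBuffers[bufferIndex];
  if (buffer === undefined) {
    throw new RangeError(`view at index ${index} references missing buffer ${bufferIndex}`);
  }
  return buffer.subarray(offset, offset + length);
}

// Hypothetical helper: decodes element `index` as a UTF-8 string.
function readViewString(views: Uint8Array, index: number, dataBuffers: Uint8Array[]): string {
  return new TextDecoder("utf-8").decode(readViewBytes(views, index, dataBuffers));
}
```

Null handling is left out here: validity would still come from the null bitmap, as with other Arrow types.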

Copilot uses AI. Check for mistakes.
Comment on lines 351 to 342
const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0);
const values: string[] = new Array(valueOffsets.length - 1).fill(null);
[...valueOffsets.slice(1)]
.map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null)
.reduce((map, length, i) => {
if (length !== null) {
if (length > 0) {
do {
values[i] = randomString(length);
} while (map.has(values[i]));
return map.set(values[i], i);
}
values[i] = '';
}
return map;
}, new Map<string, number>());
const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i]));
return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) };

Copilot AI Aug 5, 2025

The generateUtf8View function uses the same implementation as generateUtf8 with valueOffsets and simple byte encoding, but Utf8View should use a different internal structure with view structs that contain length and either inline data or buffer references. This doesn't match the Utf8View specification.

Suggested change
const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0);
const values: string[] = new Array(valueOffsets.length - 1).fill(null);
[...valueOffsets.slice(1)]
.map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null)
.reduce((map, length, i) => {
if (length !== null) {
if (length > 0) {
do {
values[i] = randomString(length);
} while (map.has(values[i]));
return map.set(values[i], i);
}
values[i] = '';
}
return map;
}, new Map<string, number>());
const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i]));
return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) };
// Generate random string values, similar to generateUtf8
const values: string[] = new Array(length).fill(null);
for (let i = 0; i < length; ++i) {
if (isValid(nullBitmap, i)) {
// Random string length between 10 and 20
values[i] = randomString(10 + Math.floor(Math.random() * 11));
}
}
// Now, for each value, create a view struct
// We'll use the convention: { length: number, data: Uint8Array } for all values
// (If the Utf8View spec requires inline vs. buffer, you can split here, but for simplicity, always use Uint8Array)
const viewStructs: { length: number, data: Uint8Array | null }[] = [];
for (let i = 0; i < length; ++i) {
if (!isValid(nullBitmap, i)) {
viewStructs.push({ length: 0, data: null });
} else {
const utf8 = encodeUtf8(values[i]);
viewStructs.push({ length: utf8.length, data: utf8 });
}
}
// The vector should be constructed from the array of view structs
return {
values: () => values,
vector: new Vector([makeData({
type,
length,
nullCount,
nullBitmap,
data: viewStructs
})])
};
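
For comparison, the view layout in the Arrow columnar format is a fixed 16-byte struct per element rather than an array of objects. Below is a minimal sketch of encoding strings that way. It assumes a single variadic data buffer, and `encodeViews` is a hypothetical helper name, not an arrow-js function. Null tracking is assumed to stay in the separate null bitmap.

```typescript
// Hypothetical helper: packs strings into a views buffer plus one data buffer.
// Each view is 16 bytes: int32 length, then either 12 inline bytes or
// a 4-byte prefix, an int32 buffer index, and an int32 offset.
function encodeViews(values: (string | null)[]): { views: Uint8Array; data: Uint8Array } {
  const encoder = new TextEncoder();
  const encoded = values.map((v) => (v === null ? null : encoder.encode(v)));

  // Only values longer than 12 bytes are stored out of line.
  const dataLength = encoded.reduce(
    (total, bytes) => total + (bytes !== null && bytes.length > 12 ? bytes.length : 0),
    0,
  );

  const views = new Uint8Array(values.length * 16);
  const data = new Uint8Array(dataLength);
  const dv = new DataView(views.buffer);
  let offset = 0;

  encoded.forEach((bytes, i) => {
    if (bytes === null) {
      return; // null slots keep an all-zero view
    }
    const base = i * 16;
    dv.setInt32(base, bytes.length, true);
    if (bytes.length <= 12) {
      views.set(bytes, base + 4); // inline value, remaining bytes stay zero
    } else {
      views.set(bytes.subarray(0, 4), base + 4); // 4-byte prefix
      dv.setInt32(base + 8, 0, true); // buffer index: always 0 in this sketch
      dv.setInt32(base + 12, offset, true); // offset into the data buffer
      data.set(bytes, offset);
      offset += bytes.length;
    }
  });

  return { views, data };
}
```

A test generator built on this would pass `views` and `[data]` to the vector under test and compare the decoded strings against the original values.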

Comment on lines 40 to 44
// @ts-ignore
protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { }
}

(Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending;

Copilot AI Aug 5, 2025

Using prototype copying with type assertions to share implementation between builders is fragile and makes the code harder to maintain. Consider using composition or inheritance instead of prototype manipulation.

Suggested change
// @ts-ignore
protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { }
}
(Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending;
protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void {
// Delegate to BinaryBuilder's _flushPending implementation using composition
// This assumes BinaryBuilder's _flushPending is compatible and does not rely on internal state
// If not, copy the logic here or extract to a shared helper function
(BinaryBuilder.prototype as any)._flushPending.call(this, pending, pendingLength);
}
}

}
public setValue(index: number, value: string) {
return super.setValue(index, encodeUtf8(value) as any);
}

Copilot AI Aug 5, 2025

The @ts-ignore comment suppresses TypeScript errors without explanation. This makes it difficult to understand what error is being suppressed and why it's safe to ignore.

Suggested change
}
}
// TypeScript cannot track that we are intentionally overriding this method by direct prototype assignment below.
// It reports a type error because the method is replaced at runtime. This is safe because the assigned method is compatible.

@kou
Member

kou commented Aug 5, 2025

We can release a new version whenever we want because we split the JS implementation from apache/arrow.
For example, we can release a new version once this is merged.

@NguyenHoangSon96
Author

NguyenHoangSon96 commented Aug 5, 2025

@trxcllnt @domoritz
Hi guys, thank you for your inputs.
I will be working on implementing the view type properly.
I will do some more research.

@ahirner

ahirner commented Sep 2, 2025

Hi @NguyenHoangSon96 did your research turn up something?

AFAICS, proper view support needs the FlatBuffers definitions updated to at least Arrow format version 1.4. This can be a separate PR. Afterwards, variadicBufferCounts can be used from RecordBatch.

What I don't get is whether/how far arrow-js can defer materializing views as strings/bytes since I'm not familiar with this repo @domoritz @trxcllnt .
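
To make the role of variadicBufferCounts concrete: in the Arrow IPC format, a RecordBatch carries one count per view-typed field, giving the number of variadic data buffers that follow that field's validity and views buffers. The sketch below splits a flat buffer list accordingly. It is a simplification under stated assumptions: a flat schema containing only Utf8 and Utf8View columns, counts represented as plain numbers, and hypothetical names (`ColumnKind`, `splitBuffers`) that are not arrow-js APIs.

```typescript
// Hypothetical column kinds for this sketch; not arrow-js types.
type ColumnKind = "utf8" | "utf8view";

// Splits the flat IPC buffer list into per-column buffer groups.
//   utf8     : validity, offsets, data                 (always 3 buffers)
//   utf8view : validity, views, then N data buffers    (N from variadicBufferCounts)
function splitBuffers(
  kinds: ColumnKind[],
  buffers: Uint8Array[],
  variadicBufferCounts: number[],
): Uint8Array[][] {
  const columns: Uint8Array[][] = [];
  let bufferIndex = 0;
  let variadicIndex = 0;

  for (const kind of kinds) {
    let count = 3;
    if (kind === "utf8view") {
      // One entry is consumed per view-typed column, in schema order.
      const variadic = variadicBufferCounts[variadicIndex];
      if (variadic === undefined) {
        throw new Error("missing variadicBufferCounts entry for view column");
      }
      variadicIndex += 1;
      count = 2 + variadic;
    }
    if (bufferIndex + count > buffers.length) {
      throw new Error("buffer list is shorter than the schema requires");
    }
    columns.push(buffers.slice(bufferIndex, bufferIndex + count));
    bufferIndex += count;
  }
  return columns;
}
```

For a schema of `["utf8view", "utf8", "utf8view"]` with counts `[2, 0]`, this yields groups of 4, 3, and 2 buffers. How long the views can stay undecoded after this step is a separate question that the sketch does not address.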

@NguyenHoangSon96
Author

Hi @ahirner
I have worked out how to implement the Utf8View type and am writing unit tests right now.
It's just that I'm still busy with other projects at work.
I will try to finish this next week. 😀

@NguyenHoangSon96
Author

Hi guys
I closed this branch because I force-pushed after rebasing on the main branch.
I moved the branch to here 331.
I don't have time to work on this anymore, so if anyone is able to continue and finish it, I would very much appreciate it.
Sorry that I can't finish this myself.

GeorgeLeePatterson added a commit to GeorgeLeePatterson/arrow-js that referenced this pull request Oct 31, 2025
## What's Changed

This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+),
enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion
that use view types for efficient string handling.

## Implementation Details

### Core Type Support
- Added BinaryView and Utf8View type classes with view struct layout constants
- Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24
- Data class support for variadic buffer management

### Visitor Pattern
- Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data)
- Set visitor: Marks as immutable (read-only)
- VectorLoader: Reads from IPC format with variadicBufferCounts
- TypeComparator, TypeCtor: Type system integration
- JSON visitors: Explicitly unsupported (throws error)

### FlatBuffers
- Generated schema files for BinaryView, Utf8View, ListView, LargeListView
- Script to regenerate from Arrow format definitions

## What Works
- Reading BinaryView/Utf8View columns from Arrow IPC files
- Accessing values with proper inline/out-of-line handling
- Variadic buffer management
- Type checking and comparison

## Testing
- ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts)
- ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling
- ✅ TypeScript compiles without errors
- ✅ All existing tests pass
- ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds)

## Use Cases
- Reading query results from DataFusion 50.0+ with view types enabled
- Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns
- Processing Arrow IPC streams from any system using view types

## Future Work (Separate PRs)
- Builders for write operations
- ListView/LargeListView type implementation
- Additional test coverage

Closes apache#311
Related to apache#225
kou pushed a commit that referenced this pull request Nov 19, 2025
## What's Changed

This PR adds read support for BinaryView and Utf8View types (Arrow
format 1.4.0+), enabling arrow-js to consume IPC data from systems like
InfluxDB 3.0 and DataFusion that use view types for efficient string
handling.

## Implementation Details

### Core Type Support
- Added BinaryView and Utf8View type classes with view struct layout
constants
- Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24
- Data class support for variadic buffer management

### Visitor Pattern
- Get visitor: Implements proper view semantics (16-byte structs,
inline/out-of-line data)
- Set visitor: Marks as immutable (read-only)
- VectorLoader: Reads from IPC format with variadicBufferCounts
- TypeComparator, TypeCtor: Type system integration
- JSON visitors
- Builders

### FlatBuffers
- Generated schema files for BinaryView, Utf8View
- Introduced `scripts/update_flatbuffers.sh` to regenerate from Arrow
format definitions

## What Works
- Reading BinaryView/Utf8View columns from Arrow IPC as well as JSON
- Accessing values with proper inline/out-of-line handling
- Variadic buffer management
- Type checking and comparison
- BinaryView and Utf8View Builders

## Testing
- [X] Unit tests for BinaryView and Utf8View
- [X] Tests verify both inline (≤12 bytes) and out-of-line data handling
- [X] TypeScript compiles without errors
- [X] All existing tests pass
- [X] Builders verified
- [X] Verified against DataFusion 50.0.3 integration, not included in
this PR (enables native view types, removing need for configuration
change in DataFusion's SessionConfig)

## Future Work (Separate PRs)
- ~~Builders for write operations~~
- ListView/LargeListView type implementation
- ~~Additional test coverage~~

Closes #311
Related to #225

---------

Co-authored-by: Paul Taylor <[email protected]>
Divyanshu-s13 pushed a commit to Divyanshu-s13/arrow-js that referenced this pull request Nov 22, 2025
Development

Successfully merging this pull request may close these issues.

[JS] Add support for StringView types

5 participants