-
Notifications
You must be signed in to change notification settings - Fork 16
feat: add support for utf8view type #225
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add support for utf8view type #225
Conversation
|
Hi @trxcllnt |
|
Hi |
domoritz
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not run the code locally but it looks reasonable to me and tests are passing.
Hi @domoritz |
|
Maybe I'm missing something about this PR, but this seems like the If this is just duplicating the Utf8 code, we should just interpret the |
|
Good catch and agreed that we should not just duplicate code if the logic is the same. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull Request Overview
This PR adds support for the Utf8View datatype to the Apache Arrow JavaScript library, which was preventing InfluxDB3 users from querying tables that use this type. The implementation follows the existing pattern for string types like Utf8 and LargeUtf8.
- Adds Utf8View datatype class and corresponding builder
- Implements visitor pattern support for Utf8View across all visitor classes
- Adds comprehensive test coverage for the new type
Reviewed Changes
Copilot reviewed 34 out of 34 changed files in this pull request and generated 4 comments.
Show a summary per file
| File | Description |
|---|---|
| src/type.ts | Defines the new Utf8View datatype class |
| src/builder/utf8view.ts | Implements Utf8ViewBuilder for creating Utf8View vectors |
| src/visitor/*.ts | Adds Utf8View support to all visitor pattern implementations |
| src/fb/utf8-view.ts | Generated FlatBuffers definition for Utf8View |
| test/unit/builders/utf8view-tests.ts | Comprehensive test suite for Utf8ViewBuilder |
| test/unit/vector/vector-tests.ts | Vector tests for Utf8View functionality |
src/visitor/vectorloader.ts
Outdated
| return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) }); | ||
| } | ||
| public visitUtf8View<T extends type.Utf8View>(type: T, { length, nullCount } = this.nextFieldNode()) { | ||
| return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) }); |
Copilot
AI
Aug 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The visitUtf8View method implementation is identical to visitUtf8, but Utf8View has a different internal representation that may require different handling of data buffers. According to the Utf8View specification, it uses a view struct and may point to multiple data buffers, which differs from the simple offset-based approach used by Utf8.
| return makeData({ type, length, nullCount, nullBitmap: this.readNullBitmap(type, nullCount), valueOffsets: this.readOffsets(type), data: this.readData(type) }); | |
| // Read the null bitmap as usual | |
| const nullBitmap = this.readNullBitmap(type, nullCount); | |
| // Read the value offsets as usual | |
| const valueOffsets = this.readOffsets(type); | |
| // Read the view struct buffer (describes mapping to data buffers) | |
| const viewStruct = this.readViewStruct(type, length); | |
| // Read the referenced data buffers (could be multiple) | |
| const dataBuffers = this.readDataBuffers(type, viewStruct); | |
| return makeData({ | |
| type, | |
| length, | |
| nullCount, | |
| nullBitmap, | |
| valueOffsets, | |
| viewStruct, | |
| dataBuffers | |
| }); |
| const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0); | ||
| const values: string[] = new Array(valueOffsets.length - 1).fill(null); | ||
| [...valueOffsets.slice(1)] | ||
| .map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null) | ||
| .reduce((map, length, i) => { | ||
| if (length !== null) { | ||
| if (length > 0) { | ||
| do { | ||
| values[i] = randomString(length); | ||
| } while (map.has(values[i])); | ||
| return map.set(values[i], i); | ||
| } | ||
| values[i] = ''; | ||
| } | ||
| return map; | ||
| }, new Map<string, number>()); | ||
| const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i])); | ||
| return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) }; |
Copilot
AI
Aug 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The generateUtf8View function uses the same implementation as generateUtf8 with valueOffsets and simple byte encoding, but Utf8View should use a different internal structure with view structs that contain length and either inline data or buffer references. This doesn't match the Utf8View specification.
| const valueOffsets = createVariableWidthOffsets32(length, nullBitmap, 10, 20, nullCount != 0); | |
| const values: string[] = new Array(valueOffsets.length - 1).fill(null); | |
| [...valueOffsets.slice(1)] | |
| .map((o, i) => isValid(nullBitmap, i) ? o - valueOffsets[i] : null) | |
| .reduce((map, length, i) => { | |
| if (length !== null) { | |
| if (length > 0) { | |
| do { | |
| values[i] = randomString(length); | |
| } while (map.has(values[i])); | |
| return map.set(values[i], i); | |
| } | |
| values[i] = ''; | |
| } | |
| return map; | |
| }, new Map<string, number>()); | |
| const data = createVariableWidthBytes(length, nullBitmap, valueOffsets, (i) => encodeUtf8(values[i])); | |
| return { values: () => values, vector: new Vector([makeData({ type, length, nullCount, nullBitmap, valueOffsets, data })]) }; | |
| // Generate random string values, similar to generateUtf8 | |
| const values: string[] = new Array(length).fill(null); | |
| for (let i = 0; i < length; ++i) { | |
| if (isValid(nullBitmap, i)) { | |
| // Random string length between 10 and 20 | |
| values[i] = randomString(10 + Math.floor(Math.random() * 11)); | |
| } | |
| } | |
| // Now, for each value, create a view struct | |
| // We'll use the convention: { length: number, data: Uint8Array } for all values | |
| // (If the Utf8View spec requires inline vs. buffer, you can split here, but for simplicity, always use Uint8Array) | |
| const viewStructs: { length: number, data: Uint8Array | null }[] = []; | |
| for (let i = 0; i < length; ++i) { | |
| if (!isValid(nullBitmap, i)) { | |
| viewStructs.push({ length: 0, data: null }); | |
| } else { | |
| const utf8 = encodeUtf8(values[i]); | |
| viewStructs.push({ length: utf8.length, data: utf8 }); | |
| } | |
| } | |
| // The vector should be constructed from the array of view structs | |
| return { | |
| values: () => values, | |
| vector: new Vector([makeData({ | |
| type, | |
| length, | |
| nullCount, | |
| nullBitmap, | |
| data: viewStructs | |
| })]) | |
| }; |
src/builder/utf8view.ts
Outdated
| // @ts-ignore | ||
| protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { } | ||
| } | ||
|
|
||
| (Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending; |
Copilot
AI
Aug 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using prototype copying with type assertions to share implementation between builders is fragile and makes the code harder to maintain. Consider using composition or inheritance instead of prototype manipulation.
| // @ts-ignore | |
| protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { } | |
| } | |
| (Utf8ViewBuilder.prototype as any)._flushPending = (BinaryBuilder.prototype as any)._flushPending; | |
| protected _flushPending(pending: Map<number, Uint8Array | undefined>, pendingLength: number): void { | |
| // Delegate to BinaryBuilder's _flushPending implementation using composition | |
| // This assumes BinaryBuilder's _flushPending is compatible and does not rely on internal state | |
| // If not, copy the logic here or extract to a shared helper function | |
| (BinaryBuilder.prototype as any)._flushPending.call(this, pending, pendingLength); | |
| } | |
| } |
src/builder/utf8view.ts
Outdated
| } | ||
| public setValue(index: number, value: string) { | ||
| return super.setValue(index, encodeUtf8(value) as any); | ||
| } |
Copilot
AI
Aug 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The @ts-ignore comment suppresses TypeScript errors without explanation. This makes it difficult to understand what error is being suppressed and why it's safe to ignore.
| } | |
| } | |
| // TypeScript cannot track that we are intentionally overriding this method by direct prototype assignment below. | |
| // It reports a type error because the method is replaced at runtime. This is safe because the assigned method is compatible. |
|
We can release a new version whenever we release a new version because we split the JS implementation from apache/arrow. |
|
Hi @NguyenHoangSon96 did your research turn up something? AFAICS, proper view support needs updated flatbuffers with at least version 1.4. This can be a separate PR. Afterwards, What I don't get is whether/how far |
|
Hi @ahirner |
855dd14 to
0f5964d
Compare
378b5c6 to
0f5964d
Compare
|
Hi guys |
## What's Changed This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling. ## Implementation Details ### Core Type Support - Added BinaryView and Utf8View type classes with view struct layout constants - Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24 - Data class support for variadic buffer management ### Visitor Pattern - Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data) - Set visitor: Marks as immutable (read-only) - VectorLoader: Reads from IPC format with variadicBufferCounts - TypeComparator, TypeCtor: Type system integration - JSON visitors: Explicitly unsupported (throws error) ### FlatBuffers - Generated schema files for BinaryView, Utf8View, ListView, LargeListView - Script to regenerate from Arrow format definitions ## What Works - Reading BinaryView/Utf8View columns from Arrow IPC files - Accessing values with proper inline/out-of-line handling - Variadic buffer management - Type checking and comparison ## Testing - ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts) - ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling - ✅ TypeScript compiles without errors - ✅ All existing tests pass - ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds) ## Use Cases - Reading query results from DataFusion 50.0+ with view types enabled - Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns - Processing Arrow IPC streams from any system using view types ## Future Work (Separate PRs) - Builders for write operations - ListView/LargeListView type implementation - Additional test coverage Closes apache#311 Related to apache#225
This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling. - Added BinaryView and Utf8View type classes with view struct layout constants - Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24 - Data class support for variadic buffer management - Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data) - Set visitor: Marks as immutable (read-only) - VectorLoader: Reads from IPC format with variadicBufferCounts - TypeComparator, TypeCtor: Type system integration - JSON visitors: Explicitly unsupported (throws error) - Generated schema files for BinaryView, Utf8View, ListView, LargeListView - Script to regenerate from Arrow format definitions - Reading BinaryView/Utf8View columns from Arrow IPC files - Accessing values with proper inline/out-of-line handling - Variadic buffer management - Type checking and comparison - ✅ Unit tests for BinaryView and Utf8View (test/unit/ipc/view-types-tests.ts) - ✅ Tests verify both inline (≤12 bytes) and out-of-line data handling - ✅ TypeScript compiles without errors - ✅ All existing tests pass - ✅ Verified with DataFusion 50.0.3 integration (enables native view types, removing need for workarounds) - Reading query results from DataFusion 50.0+ with view types enabled - Consuming InfluxDB 3.0 Arrow data with Utf8View/BinaryView columns - Processing Arrow IPC streams from any system using view types - Builders for write operations - ListView/LargeListView type implementation - Additional test coverage Closes apache#311 Related to apache#225
## What's Changed This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling. ## Implementation Details ### Core Type Support - Added BinaryView and Utf8View type classes with view struct layout constants - Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24 - Data class support for variadic buffer management ### Visitor Pattern - Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data) - Set visitor: Marks as immutable (read-only) - VectorLoader: Reads from IPC format with variadicBufferCounts - TypeComparator, TypeCtor: Type system integration - JSON visitors - Builders ### FlatBuffers - Generated schema files for BinaryView, Utf8View - Introduced `scripts/update_flatbuffers.sh` to regenerate from Arrow format definitions ## What Works - Reading BinaryView/Utf8View columns from Arrow IPC as well as JSON - Accessing values with proper inline/out-of-line handling - Variadic buffer management - Type checking and comparison - BinaryView and Utf8View Builders ## Testing - [X] Unit tests for BinaryView and Utf8View - [X] Tests verify both inline (≤12 bytes) and out-of-line data handling - [X] TypeScript compiles without errors - [X] All existing tests pass - [X] Builders verified - [X] Verified against DataFusion 50.0.3 integration, not included in this PR (enables native view types, removing need for configuration change in DataFusion's SessionConfig) ## Future Work (Separate PRs) - ~~Builders for write operations~~ - ListView/LargeListView type implementation - ~~Additional test coverage~~ Closes #311 Related to #225 --------- Co-authored-by: Paul Taylor <[email protected]>
## What's Changed This PR adds read support for BinaryView and Utf8View types (Arrow format 1.4.0+), enabling arrow-js to consume IPC data from systems like InfluxDB 3.0 and DataFusion that use view types for efficient string handling. ## Implementation Details ### Core Type Support - Added BinaryView and Utf8View type classes with view struct layout constants - Type enum entries: Type.BinaryView = 23, Type.Utf8View = 24 - Data class support for variadic buffer management ### Visitor Pattern - Get visitor: Implements proper view semantics (16-byte structs, inline/out-of-line data) - Set visitor: Marks as immutable (read-only) - VectorLoader: Reads from IPC format with variadicBufferCounts - TypeComparator, TypeCtor: Type system integration - JSON visitors - Builders ### FlatBuffers - Generated schema files for BinaryView, Utf8View - Introduced `scripts/update_flatbuffers.sh` to regenerate from Arrow format definitions ## What Works - Reading BinaryView/Utf8View columns from Arrow IPC as well as JSON - Accessing values with proper inline/out-of-line handling - Variadic buffer management - Type checking and comparison - BinaryView and Utf8View Builders ## Testing - [X] Unit tests for BinaryView and Utf8View - [X] Tests verify both inline (≤12 bytes) and out-of-line data handling - [X] TypeScript compiles without errors - [X] All existing tests pass - [X] Builders verified - [X] Verified against DataFusion 50.0.3 integration, not included in this PR (enables native view types, removing need for configuration change in DataFusion's SessionConfig) ## Future Work (Separate PRs) - ~~Builders for write operations~~ - ListView/LargeListView type implementation - ~~Additional test coverage~~ Closes apache#311 Related to apache#225 --------- Co-authored-by: Paul Taylor <[email protected]>
Rationale for this change
InfluxDB3 client library uses arrow-js underneath, but arrow-js does not support
Utf8Viewdatatype, so it caused this error to happen Issue.Checklist
I have added a new test for the
Utf8Viewdatatype.NOTE: Please, we need this PR to be approved because it prevents influxdb3-js users from querying some tables that use
Utf8Viewin Influxdb3.This PR includes breaking changes to public APIs?
No. The change adds functionality but does not modify any existing API behavior.
Closes #44