feat: Implement IPC RecordBatch body buffer compression #14
Conversation
src/ipc/reader.ts
Outdated
const combined = new Uint8Array(totalSize);

for (const [i, decompressedBuffer] of decompressedBuffers.entries()) {
    combined.set(decompressedBuffer, newBufferRegions[i].offset);
We should be able to implement this without copying the inflated data back into a single contiguous ArrayBuffer.
I think it's possible to implement a VirtualUint8Array class that takes an array of Uint8Array chunks and implements the necessary methods to behave like a contiguous Uint8Array. I'm going to experiment with that approach soon.
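For illustration, a minimal version of such a wrapper could look like the sketch below (hypothetical code that never became part of the PR; only the byteLength/subarray surface the reader needs is shown):

class VirtualUint8Array {
    readonly byteLength: number;
    constructor(private chunks: Uint8Array[]) {
        this.byteLength = chunks.reduce((total, chunk) => total + chunk.byteLength, 0);
    }
    // Zero-copy when the requested range falls inside a single chunk, which is the
    // common case here since each BufferRegion maps to one decompressed buffer.
    subarray(begin: number, end: number = this.byteLength): Uint8Array {
        let chunkStart = 0;
        for (const chunk of this.chunks) {
            const chunkEnd = chunkStart + chunk.byteLength;
            if (begin >= chunkStart && end <= chunkEnd) {
                return chunk.subarray(begin - chunkStart, end - chunkStart);
            }
            chunkStart = chunkEnd;
        }
        // A full implementation would stitch together ranges that span multiple chunks.
        throw new RangeError(`range [${begin}, ${end}) spans multiple chunks`);
    }
}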
I think that might be more complicated than necessary.
IIUC, the new logic loops through all buffers, decompresses them, and collects them into a list. Then it packs all the decompressed buffers into a contiguous ArrayBuffer that matches the equivalent IPC format without compression.
In order to avoid the last step of re-packing into an ArrayBuffer, we'd need to return the list of uncompressed buffers and use a VectorLoader instance that accepts the list and selects the buffers by index (vs. the current behavior which accepts the contiguous ArrayBuffer and slices from it). Luckily, that's exactly what the JSONVectorLoader does!
I don't think you can use the JSONVectorLoader directly, since it assumes the list of buffers are JSON-encoded representations of the values, but you could implement a new CompressedVectorLoader class that closely follows its structure but doesn't call methods like packBools() and binaryDataFromJSON().
The logic in your function here would need to also return a list of BufferRegion instances whose offset field corresponds to the Array index of each decompressed buffer (rather than the byteOffset of each buffer in the contiguous ArrayBuffer).
Something like this:
export class CompressedVectorLoader extends VectorLoader {
    private sources: any[][];
    constructor(sources: Uint8Array[][], nodes: FieldNode[], buffers: BufferRegion[], dictionaries: Map<number, Vector<any>>, metadataVersion: MetadataVersion) {
        super(new Uint8Array(0), nodes, buffers, dictionaries, metadataVersion);
        this.sources = sources;
    }
    protected readNullBitmap<T extends DataType>(_type: T, nullCount: number, { offset } = this.nextBufferRange()) {
        return nullCount <= 0 ? new Uint8Array(0) : this.sources[offset];
    }
    protected readOffsets<T extends DataType>(_type: T, { offset } = this.nextBufferRange()) {
        return this.sources[offset];
    }
    protected readTypeIds<T extends DataType>(_type: T, { offset } = this.nextBufferRange()) {
        return this.sources[offset];
    }
    protected readData<T extends DataType>(_type: T, { offset } = this.nextBufferRange()) {
        return this.sources[offset];
    }
}
I ended up solving this issue without implementing a VirtualUint8Array. Instead, I modified the body parameter signature in _loadVectors and the VectorLoader constructor to accept Uint8Array | Uint8Array[].
It worked out nicely because the class already has a buffersIndex parameter that points to the correct buffer, and in my case, the decompression order matches the BufferRegion[] sequence. This approach required minimal changes, and thanks to the type signature, TypeScript will prevent errors in future modifications to VectorLoader.
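Roughly, each read helper then either slices the contiguous body or picks an already-decompressed chunk by index. A simplified, hypothetical sketch (the helper name and the region shape are illustrative, not the actual diff):

function selectBuffer(body: Uint8Array | Uint8Array[], bufferIndex: number, region: { offset: number; length: number }): Uint8Array {
    return Array.isArray(body)
        ? body[bufferIndex]                                            // one decompressed chunk per BufferRegion
        : body.subarray(region.offset, region.offset + region.length); // slice of the contiguous IPC body
}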
Your suggested approach (with CompressedVectorLoader) is also interesting—it would help isolate the logic for compressed buffers. If you think it’s the better solution, I can refactor the code to use it instead.
Could you help me figure out how to properly test this functionality?
Do you intend to add compression support to the writer? Typically that's how we'd test this sort of behavior, since the reader and writer are duals.
No problem—I’ll refactor this to use the CompressedVectorLoader class instead of overloading the type signatures.
As for testing, I do plan to add compression support to the writer, though likely not for another month (depending on my project’s needs). Initially, I assumed the tests should be entirely independent, but I agree that aligning them with the writer’s behavior makes more sense and will be more maintainable long-term.
I can take a look at adding it to the writer. I wouldn't want to merge this PR without at least a limited set of tests, and verifying we can read what we write is the easiest way to do that.
Okay, I'll try to find time this weekend to implement compression support in the writer.
Hi, @trxcllnt!
I've implemented compression support for the writer and done some minor refactoring to improve the structure. Here are the key changes:
- Added compression support for the writer (debugged and tested)
- Successfully verified LZ4 writer locally - it works correctly
- Small refactoring to streamline the code
- Introduced codec validators to prevent potential library mismatch issues
The main motivation for validators came from realizing that the current CompressionRegistry approach might cause problems for users when trying to match compression/decompression libraries across different environments.
Could you please review my changes, especially the validation logic? Maybe you can suggest something about ZSTD validation?
class Lz4FrameValidator implements CompressionValidator {
    private readonly LZ4_FRAME_MAGIC = new Uint8Array([4, 34, 77, 24]);
    private readonly MIN_HEADER_LENGTH = 7; // 4 (magic) + 2 (FLG + BD) + 1 (header checksum) = 7 min bytes

    isValidCodecEncode(codec: Codec): boolean {
Since many libraries use the raw LZ4 format instead of the framed format, I decided to add validation for the encode function. This ensures that Arrow files compressed with LZ4 can be correctly read in other languages. Initially, I considered comparing compressed and decompressed buffers, but due to optional metadata flags, this might not be reliable. Instead, I validate that the encode function generates a correct metadata header. I'm unsure if similar validation is needed for decode since users should notice if their data decompresses incorrectly.
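The essence of the check, as a sketch (assuming a codec whose encode function takes and returns a Uint8Array; not the exact PR code):

const LZ4_FRAME_MAGIC = [0x04, 0x22, 0x4d, 0x18]; // 4, 34, 77, 24
const MIN_HEADER_LENGTH = 7;                      // magic + FLG + BD + header checksum

function encodesLz4Frames(encode: (data: Uint8Array) => Uint8Array): boolean {
    // Compress a small sample and verify it starts with the LZ4 frame magic number.
    const sample = encode(new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8]));
    return sample.byteLength >= MIN_HEADER_LENGTH
        && LZ4_FRAME_MAGIC.every((byte, i) => sample[i] === byte);
}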
class ZstdValidator implements CompressionValidator {
    // private readonly ZSTD_MAGIC = new Uint8Array([40, 181, 47, 253]);
    isValidCodecEncode(_: Codec): boolean {
        console.warn('ZSTD encode validator is not implemented yet.');
        return true;
    }
}
For ZSTD, I need to research how its metadata is structured and whether different formats exist (similar to LZ4's raw vs. framed formats). This will help determine if additional validation is necessary.
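For reference, a ZSTD frame also begins with a fixed little-endian magic number (0xFD2FB528, i.e. the bytes 40, 181, 47, 253 from the commented-out constant above), so an analogous encode check is possible; whether ZSTD has a raw-vs-framed split comparable to LZ4's is the open question. A sketch of the magic check only:

const ZSTD_MAGIC = [0x28, 0xb5, 0x2f, 0xfd]; // 40, 181, 47, 253

function encodesZstdFrames(encode: (data: Uint8Array) => Uint8Array): boolean {
    const sample = encode(new Uint8Array([1, 2, 3, 4, 5, 6, 7, 8]));
    return sample.byteLength >= 4 && ZSTD_MAGIC.every((byte, i) => sample[i] === byte);
}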
@trxcllnt Before finalizing the PR, I'd like to add proper tests. Could you advise whether there are specific test patterns or files I should follow as a reference?

@Djjanks Since compression mode is an option on the reader and writer, the easiest way to integrate is probably to add/update the existing stream-writer and file-writer tests. It looks like the RecordBatchFileWriter tests need to be updated to accept writer options, and both will need to also pass the compression option to the reader, but that should be straightforward.
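For illustration, a round-trip test could look roughly like this (a sketch: the compressionType writer option, the CompressionType export location, and codec registration happening beforehand are all assumptions about this branch):

import { tableFromArrays, tableFromIPC, RecordBatchStreamWriter, CompressionType } from 'apache-arrow';

test('round-trips an LZ4-compressed stream', async () => {
    const source = tableFromArrays({ ids: Int32Array.from([1, 2, 3]), names: ['a', 'b', 'c'] });
    // Write with compression enabled (the option name is an assumption), then read it back.
    const writer = RecordBatchStreamWriter.writeAll(source, { compressionType: CompressionType.LZ4_FRAME });
    const result = tableFromIPC(await writer.toUint8Array());
    expect(result.numRows).toBe(source.numRows);
});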
Djjanks left a comment:
@trxcllnt Thanks for the testing guidance. I've implemented the stream-writer and file-writer tests successfully, but I've run into some problems with:
- the dictionary batch compression logic
- importing the zstd libraries
Can you give your advice on these points?
src/ipc/writer.ts
Outdated
if ((padding = ((size + 7) & ~7) - size) > 0) {
    this._writePadding(padding);
}
protected _writeBodyBuffers(buffers: ArrayBufferView[], batchType: "record" | "dictionary" = "record") {
According to the Arrow format documentation, only record batches are compressed, not dictionary batches. Is this correct? I added the batchType attribute to avoid the bufGroupSize compression logic for dictionary batches. Have I understood this correctly?
I don't think that's correct; where did you read that in the documentation?
The compression section of the documentation focuses on record batches, but it doesn't specifically mention that dictionary batches should also be compressed. However, I agree that, logically, dictionary compression should be included.
import { Codec, compressionRegistry } from 'apache-arrow/ipc/compression/registry';
import * as lz4js from 'lz4js';

export async function registerCompressionCodecs(): Promise<void> {
I've implemented ZSTD compression with async initialization, since most popular libraries require WASM/Node.js. I used a dynamic import for ZSTD to avoid bundling issues. I duplicated the registration logic from the stream writer because separating it into a shared module caused Jest import errors. Maybe somebody knows a more elegant way to import zstd?
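For reference, the general shape with a dynamic import might be something like this (a sketch; 'some-zstd-wasm-lib' and its init/compress/decompress API are placeholders rather than a specific package, and the CompressionType import path is an assumption):

import { Codec, compressionRegistry } from 'apache-arrow/ipc/compression/registry';
import { CompressionType } from 'apache-arrow';

export async function registerZstdCodec(): Promise<void> {
    // Dynamic import keeps the WASM-backed dependency out of the main bundle.
    const zstd = await import('some-zstd-wasm-lib'); // placeholder module name
    await zstd.init();                               // many WASM builds require an async init step
    const codec: Codec = {
        encode: (data: Uint8Array) => zstd.compress(data),
        decode: (data: Uint8Array) => zstd.decompress(data)
    };
    compressionRegistry.set(CompressionType.ZSTD, codec);
}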
This method worked for us as well - if it helps, I needed to use @oneidentity/zstd-js instead since zstd-codec wouldn't bundle correctly in our code base
We would love to see lz4 support in arrow-js. @westonpace @trxcllnt any chance you could give this another review?
}
protected _loadDictionaryBatch(header: metadata.DictionaryBatch, body: any) {

protected _loadDictionaryBatch(header: metadata.DictionaryBatch, body: Uint8Array) {
I've been trying this PR at my company using zstd encoding, and can confirm that dictionary batches are in fact compressed as well.
I needed to add the following lines for the zstd decompression to work fully for dictionary vectors:
let data: Data<any>[];
if (header.data.compression != null) {
    const codec = compressionRegistry.get(header.data.compression.type);
    if (codec?.decode && typeof codec.decode === 'function') {
        const { decommpressedBody, buffers } = this._decompressBuffers(header.data, body, codec);
        data = this._loadCompressedVectors(header.data, decommpressedBody, [type]);
        header = new metadata.DictionaryBatch(new metadata.RecordBatch(
            header.data.length,
            header.data.nodes,
            buffers,
            null
        ), id, isDelta);
    } else {
        throw new Error('Dictionary batch is compressed but codec not found');
    }
} else {
    data = this._loadVectors(header.data, body, [type]);
}
otherwise this PR has been working great as-is at scale for us!
Thank you! I have made a new commit with the correct compression and decompression dictionary functionality. Previously, compression did not work on dictionary batches. Could you please try it in your project?
Works great, thanks so much! Note that I'm only able to test the decompression, since we do not use the compression paths in our codebase (we use https://arrow.apache.org/java/18.2.0/reference/org.apache.arrow.vector/org/apache/arrow/vector/ipc/ArrowStreamWriter.html to send compressed data to the frontend)
OK, now compression works too. Thank you for testing! At my work we only use LZ4 decompression.
How can I use this?
I’m using this in a web application. Here’s how you can try it out in your own project:
1. Clone the fork and switch to the feature branch:
git clone https://github.com/Djjanks/arrow-js.git
cd arrow-js
git checkout feature/arrow-compression
2. Build the package (Linux or WSL recommended):
yarn install
yarn build
3. Link the library locally (assuming you use npm):
cd targets/apache-arrow/
npm link
4. Link it in your project:
npm uninstall apache-arrow
npm link apache-arrow
5. Register the codec you want to use.
For example, with LZ4 (see tests or the PR for more examples):
import { Codec, compressionRegistry, CompressionType } from 'apache-arrow';
import * as lz4js from 'lz4js';

const lz4Codec: Codec = {
    encode(data: Uint8Array): Uint8Array { return lz4js.compress(data); },
    decode(data: Uint8Array): Uint8Array { return lz4js.decompress(data); }
};
compressionRegistry.set(CompressionType.LZ4_FRAME, lz4Codec);

It's not the most convenient setup, but it's enough to experiment with compression support right now.
For a cleaner workflow, it’s better to wait until the PR is merged into the main repo.
@trxcllnt
    Uint32,
    Vector
} from 'apache-arrow';
import { Codec, compressionRegistry } from 'apache-arrow/ipc/compression/registry';
Are these not exported in the top-level export? It seems like they should be, since they'd allow users to register their own implementations?
Good catch, thanks! Codec wasn’t exported at the top level before. I’ve fixed that so now both compressionRegistry and Codec are available directly from the main package exports.
Super exciting! Any chance we could see an arrow-js release to make this more widely available?
Let's discuss it in #283.
I think I understand this after reading through the code, but it would be great to have some documentation included in the repo about how to use this feature. Documentation is not a strength of Arrow JS.
Could you open a new issue for documentation?
Rationale for this change
This change introduces support for reading compressed Arrow IPC streams in JavaScript. The primary motivation is the need to read Arrow IPC streams in the browser when they are transmitted over the network in a compressed format to reduce network load.
Several reasons support this enhancement:
What changes are included in this PR?
Additional notes:
Not all JavaScript LZ4 libraries are compatible with the Arrow IPC format. For example, some popular packages emit the raw LZ4 block format rather than the LZ4 frame format that Arrow expects.
This can result in silent or cryptic errors. To improve the developer experience, we could validate registered codecs (for example, by checking that the encoder produces a valid LZ4 frame header) and fail fast with a clear error.
After decompressing the buffers, new BufferRegion entries are calculated to match the uncompressed data layout. A new metadata.RecordBatch is constructed with the updated buffer regions and passed into _loadVectors().
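As an illustration of that layout recalculation (a sketch only; the helper name is hypothetical), each decompressed buffer is assigned a fresh 8-byte-aligned offset, mirroring how an uncompressed IPC body is laid out:

function rebuildRegions(decompressed: Uint8Array[]): { offset: number; length: number }[] {
    let offset = 0;
    return decompressed.map((buffer) => {
        const region = { offset, length: buffer.byteLength };
        offset += (buffer.byteLength + 7) & ~7; // advance to the next 8-byte boundary
        return region;
    });
}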
This introduces a mutation-like pattern that may break assumptions in the current design. However, it's necessary because:
When reconstructing the metadata, the compression field is explicitly set to null, since the data is already decompressed in memory.
This decision is somewhat debatable — feedback is welcome on whether it's better to retain the original compression metadata or to reflect the current state of the buffer (uncompressed). The current implementation assumes the latter.
Are these changes tested?
Are there any user-facing changes?
Yes, Arrow JS users can now read compressed IPC streams, assuming they register an appropriate codec using compressionRegistry.set().
Example:
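A minimal sketch (assuming the lz4js package and the top-level exports discussed above; the fetch URL is illustrative):

import { Codec, compressionRegistry, CompressionType, tableFromIPC } from 'apache-arrow';
import * as lz4js from 'lz4js';

// Register an LZ4 frame codec once at startup...
const lz4Codec: Codec = {
    encode: (data: Uint8Array) => lz4js.compress(data),
    decode: (data: Uint8Array) => lz4js.decompress(data)
};
compressionRegistry.set(CompressionType.LZ4_FRAME, lz4Codec);

// ...then read a compressed IPC stream as usual.
const bytes = new Uint8Array(await fetch('/data.arrows').then((res) => res.arrayBuffer()));
const table = tableFromIPC(bytes);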
This change does not affect writing or serialization.
This PR includes breaking changes to public APIs.
No. The change adds functionality but does not modify any existing API behavior.
This PR contains a "Critical Fix".
No. This is a new feature, not a critical fix.
Checklist
- Tests pass (yarn test)
- Build succeeds (yarn build)

Closes #109.