Implement our own custom hasher #2

Synicix · 2025-11-08T03:30:16Z

Replaces arrow-digest, whose hash depends on the order of the fields and which lacks support for some data types that we may need in the future.

Other fixes:

For certain types such as decimal and time unit, arrow-digest doesn't hash the metadata that describes the bytes, which can lead to collisions between different values that share the same byte representation. For example, 1.20 (precision 3, scale 2) and 12.0 (precision 3, scale 1) both have the byte representation of the number 120 in Arrow, since precision and scale are stored in metadata.
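The collision described above can be reproduced with a small sketch. Everything here is an assumption for illustration: the `Decimal` struct is a hypothetical stand-in for an Arrow decimal value, the function names are invented, and `DefaultHasher` from the standard library stands in for whichever digest the crate actually uses.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical stand-in for an Arrow decimal value: the unscaled integer
/// lives in the data buffer, precision and scale live in the type metadata.
pub struct Decimal {
    pub unscaled: i128,
    pub precision: u8,
    pub scale: i8,
}

/// Hashes only the value bytes and ignores the metadata.
/// Two decimals with the same unscaled integer always collide here.
pub fn hash_bytes_only(d: &Decimal) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(&d.unscaled.to_le_bytes());
    h.finish()
}

/// Feeds precision and scale into the hasher ahead of the value bytes,
/// so values that differ only in metadata produce different input streams.
pub fn hash_with_metadata(d: &Decimal) -> u64 {
    let mut h = DefaultHasher::new();
    h.write(&[d.precision]);
    h.write(&d.scale.to_le_bytes());
    h.write(&d.unscaled.to_le_bytes());
    h.finish()
}
```

With `Decimal { unscaled: 120, precision: 3, scale: 2 }` (1.20) and `Decimal { unscaled: 120, precision: 3, scale: 1 }` (12.0), `hash_bytes_only` returns the same value for both, while `hash_with_metadata` receives different input bytes for each.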

Fixes PLT-451, PLT-544, PLT-561

eywalker · 2025-12-01T22:48:07Z

@Synicix would you mind outlining the exact form by which you are serializing the Arrow array? I'd like to have the serialization rule well documented so that in theory we could replicate the implementation in other places.

Synicix · 2025-12-04T03:43:30Z

@eywalker

Added a README section about the hashing
Added a test to check that the flatten function works correctly (it turned out there was a bug, which I fixed)
Yanked and reuploaded 0.0.2

Copilot

Pull request overview

This PR implements a custom Arrow data hashing system to replace the arrow-digest dependency, addressing field order dependency issues and improving support for decimal types by including metadata (precision/scale) in the hash to prevent collisions.

Key Changes:

Custom ArrowDigester struct with support for multiple Arrow data types and nested structures
Field-order-independent hashing using alphabetically sorted field names
Enhanced decimal type hashing that includes precision and scale metadata
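The field-order independence listed above can be sketched as follows. This is a minimal illustration, not the PR's code: `FieldSketch` and `order_independent_schema_hash` are hypothetical names, the type label is a plain string, and `DefaultHasher` stands in for the real digest.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical schema field: a name plus a type label such as "Int32".
pub struct FieldSketch {
    pub name: String,
    pub data_type: String,
}

/// Digests a schema so that the declaration order of the fields does not matter.
pub fn order_independent_schema_hash(fields: &[FieldSketch]) -> u64 {
    // Sort references to the fields by name (type label breaks ties);
    // the caller's slice is left untouched.
    let mut sorted: Vec<&FieldSketch> = fields.iter().collect();
    sorted.sort_by(|a, b| {
        a.name
            .cmp(&b.name)
            .then_with(|| a.data_type.cmp(&b.data_type))
    });

    let mut h = DefaultHasher::new();
    for f in sorted {
        // Length-prefix each string so adjacent entries cannot blur together.
        h.write(&(f.name.len() as u64).to_le_bytes());
        h.write(f.name.as_bytes());
        h.write(&(f.data_type.len() as u64).to_le_bytes());
        h.write(f.data_type.as_bytes());
    }
    h.finish()
}
```

A schema declared as `(a: Int32, b: Utf8)` and one declared as `(b: Utf8, a: Int32)` produce the same input stream after sorting, so they hash identically, while swapping the types between the two names changes the stream.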

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.


| File | Description |
| --- | --- |
| src/arrow_digester.rs | New custom hasher implementation with support for various Arrow data types, nested structures, and comprehensive test coverage |
| src/pyarrow.rs | Updated FFI integration to use custom `ArrowDigester` instead of `arrow-digest`, with improved safety annotations |
| src/lib.rs | Added arrow_digester module declaration |
| Cargo.toml | Updated dependencies (removed arrow-digest, added digest/postcard/serde), configured strict clippy lints, changed edition to '2024' |
| README.md | Added documentation explaining the hashing system architecture and design choices |
| cspell.json | Added "uids" to dictionary and cleaned up empty ignoreWords |
| .vscode/settings.json | Added VSCode workspace configuration for Rust development |
| .github/workflows/clippy.yml | Added CI workflow for Rust syntax, style, and format checks |


Copilot

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.


eywalker · 2025-12-04T15:52:50Z

.github/workflows/clippy.yml

As far as I can tell, this only runs the checks and doesn't apply the formatting, does it? Shall we make it so that the code gets auto-formatted?

In VS Code, with the settings I set, it auto-formats on save. Or do you mean having the GitHub Action push a commit that auto-formats the code?

eywalker · 2025-12-04T15:53:29Z

.vscode/settings.json

Generally I don't think we should include VS Code settings, and if we do, we should keep them minimal and limited to things that apply to everyone.

Much of it lets people who use VS Code have everything set up with the correct configuration, such as rust-analyzer and formatting. Ideally I want to keep it somewhere, but I can simplify it to the pure essentials.

Let's reduce this to the bare minimum. There are definitely some entries, like the Python 3 interpreter path, that shouldn't be set and can't be expected to be the same across different working environments.

eywalker · 2025-12-04T15:55:15Z

src/pyarrow.rs

+        reason = "Need to convert raw pointers to Arrow data structures"
+    )]
+    #[expect(
+        clippy::multiple_unsafe_ops_per_block,
+        clippy::expect_used,
+        reason = "Okay since we are doing the same operation of dereferencing pointers, Will add proper errors later"


I agree with the suggestion here. @Synicix mind making the changes?


Synicix · 2025-12-11T10:01:39Z

@eywalker
Updates:

Changed postcard to JSON
Fixed hash collision issues involving binary and list arrays, similar to the StringArray fix
Added additional tests to confirm that the code is hashing correctly (found some bugs and fixed them)
Split the hasher into a core (arrow_digester_core.rs) and a user-facing library side (lib.rs/ArrowDigester), where SHA-256 is set as the digester
Added 3 bytes of versioning so end users can keep track of which StarFix version was used in the hash computation of an Arrow table/array
Split unit tests and integration tests so they are more separate
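Two of the items above, the binary/list collision fix and the version bytes, can be illustrated with a rough sketch. It assumes the collision comes from element boundaries being lost when bytes are concatenated, and that the fix is to write each element's length before its bytes. The `VERSION_TAG` value, its layout, and both function names are hypothetical and not taken from the crate.

```rust
/// Hypothetical 3-byte version tag; the real values and layout used by the
/// crate are not stated in this thread.
pub const VERSION_TAG: [u8; 3] = [0, 0, 2];

/// Naive encoding: concatenates the elements with no boundaries, so
/// ["ab", "c"] and ["a", "bc"] both become the bytes "abc".
pub fn encode_concat(items: &[&[u8]]) -> Vec<u8> {
    items.concat()
}

/// Writes the version tag first, then each element as an 8-byte
/// little-endian length followed by the element's bytes.
pub fn encode_versioned(items: &[&[u8]]) -> Vec<u8> {
    let mut out = VERSION_TAG.to_vec();
    for item in items {
        out.extend_from_slice(&(item.len() as u64).to_le_bytes());
        out.extend_from_slice(item);
    }
    out
}
```

Feeding the output of `encode_versioned` into a digest keeps `["ab", "c"]` and `["a", "bc"]` distinct, and the leading three bytes let a reader tell which encoding version produced the input.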

I will tag all the Linear issues here tomorrow.

Fixes PLT-583, PLT-560, PLT-558


eywalker · 2025-12-04T16:07:49Z

src/arrow_digester.rs

+    }
+
+    /// Hash an array directly without needing to create an `ArrowDigester` instance on the user side
+    pub fn hash_array(array: &dyn Array) -> Vec<u8> {


Could you expose the hashing of the schema alone? I'd have a use case for that.

I have updated the code so that the schema-hashing function is exported to the Python side.


Synicix · 2026-01-07T00:55:25Z

Update:

Exported all the new functions and improvements to the Python side (updated the Example Python Usage notebook to reflect the exported functions)
Updated the _internal bindings to use the new exported functions
Renamed process_arrow_table to hash_record_batch on the Python side to match the Rust-side naming.

Fixes: PLT-657

Synicix added 11 commits November 6, 2025 09:07

Update settings for vs code

f4231a8

Add custom hasher framework

6e37676

Fix clippy errors

28761b3

Add list hashing

e6336c5

Add decimal hashing

1fc9bf5

Rename hasher to arrow_digester

72b1fb8

Add String hashing

e377ee8

Add binary and string hashing

df3a21b

Add time hashing

1f6577a

Change to new custom hasher and remove old one.

1fff344

Add rust tests

5a4371a

Synicix marked this pull request as ready for review November 13, 2025 10:51

Synicix requested a review from eywalker November 13, 2025 10:51

Synicix added 4 commits November 14, 2025 03:30

Fix all clippy recommendations

934e6e7

Remove incorrect categories

da3a892

Update categories

80653d1

Update clippy actions

8da263c

Synicix added 3 commits December 4, 2025 03:24

Update read me to include section about hashing

4730702

Change delimiter from _ to __

2b9da46

Add test for field name extraction and fix logic bug

485a544

Copilot AI review requested due to automatic review settings December 4, 2025 03:40

Copilot started reviewing on behalf of Synicix December 4, 2025 03:40

Up the version due to bug

c2ff003

Copilot finished reviewing on behalf of Synicix December 4, 2025 03:42

Copilot AI reviewed Dec 4, 2025


Synicix and others added 3 commits December 3, 2025 19:46

Update README.md

669647a

Co-authored-by: Copilot <[email protected]>

Update README.md

882adca

Co-authored-by: Copilot <[email protected]>

Update src/arrow_digester.rs

3f083ba

Not sure why clippy didn't catch this Co-authored-by: Copilot <[email protected]>

Copilot finished reviewing on behalf of Synicix December 4, 2025 05:06

Copilot AI reviewed Dec 4, 2025


Synicix added 4 commits December 4, 2025 06:44

Move some dependenices into dev

5bb6414

Fix decimal and possible string array collision

ca3426c

Remove unused panic

08bc1f1

Fix cargo fmt error

cb9cf80

eywalker requested changes Dec 4, 2025


Synicix added 13 commits December 8, 2025 23:19

Change postcard hashing to json hashing

9ce2993

Change delimiter to /

81726f5

Add binary array len hashing to resolve hash collision problem & tests

932abe1

Save progress on redesigning null handling

58ac701

Patch nullbits handling and included datatypes into schema definition

7728bfe

Move actual arrow_digester logic to core and private it, while making a sha256 as public

c2e2564

Up clippy version

63fb32a

Update hashing to meet new arrow format

d4a233e

Remove stale file that was already move to the lib module

5a86fbc

Add nullable and non-nullable tests

b9b6384

Add 3 bytes at the start for versioning

45cb028

Add documentation about hashing

2f866e4

Add test to confirm update in batches and hashing all at once results in the same hash, and fix bug related to it

4f4b577

Add test to check for consistent hashing when one batch is null but the next is not.

23fc982

eywalker requested changes Jan 5, 2026


eywalker requested changes Jan 6, 2026


Synicix added 4 commits January 6, 2026 20:23

feat: Remove some python interp settings

bfa2b17

feat: update some stale comments

8b9db23

feat: Expose new functions to python side

70effd5

feat: remove eadianness file

3591940

Fix fmt error

26990d8
