Collaborator

@Synicix Synicix commented Nov 8, 2025

Replaces arrow-digest, whose hash depends on the order of fields and which lacks support for some data types that we may need in the future.

Other fixes:

  • For certain types, such as decimal and time unit, arrow-digest doesn't hash the type metadata along with the bytes, which can lead to collisions between two different values that share the same byte representation. For example, 1.20 (scale: 2, precision: 3) and 12.0 (scale: 1, precision: 3) both have the byte representation of the number 120 in Arrow, since scale and precision are stored in metadata.
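
The collision described above can be reproduced with a minimal sketch. The `Decimal` struct and both encoding functions are hypothetical stand-ins written only for illustration; they are not the API of arrow-digest or of this PR, and the byte layout is an assumption.

```rust
/// Hypothetical stand-in for an Arrow decimal value: the unscaled integer
/// plus the type metadata (precision and scale).
pub struct Decimal {
    pub unscaled: i128,
    pub precision: u8,
    pub scale: i8,
}

/// Serializes only the value bytes. Two decimals with the same unscaled
/// integer but different scales produce identical output.
pub fn encode_value_only(d: &Decimal) -> Vec<u8> {
    d.unscaled.to_le_bytes().to_vec()
}

/// Serializes precision and scale first, then the value bytes, so the
/// metadata becomes part of whatever is fed to the hash function.
pub fn encode_with_metadata(d: &Decimal) -> Vec<u8> {
    let mut out = vec![d.precision];
    out.extend_from_slice(&d.scale.to_le_bytes());
    out.extend_from_slice(&d.unscaled.to_le_bytes());
    out
}

/// 1.20 stored as the unscaled integer 120 with scale 2.
pub fn one_point_two() -> Decimal {
    Decimal { unscaled: 120, precision: 3, scale: 2 }
}

/// 12.0 stored as the unscaled integer 120 with scale 1.
pub fn twelve() -> Decimal {
    Decimal { unscaled: 120, precision: 3, scale: 1 }
}
```

Feeding `encode_value_only` output to any hash function gives the same digest for both values, while `encode_with_metadata` gives the hash function distinct inputs.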

Fixes PLT-451, PLT-544, PLT-561

@Synicix Synicix marked this pull request as ready for review November 13, 2025 10:51
@Synicix Synicix requested a review from eywalker November 13, 2025 10:51
Contributor

eywalker commented Dec 1, 2025

@Synicix would you mind outlining the exact form by which you are serializing the Arrow array? I'd like to have the serialization rule well documented so that in theory we could replicate the implementation in other places.

Copilot AI review requested due to automatic review settings December 4, 2025 03:40
Collaborator Author

Synicix commented Dec 4, 2025

@eywalker

  • Added readme about the hashing
  • Added test to check if the flatten function was working correctly (turns out there was a bug and I fixed it)
  • Yanked and reuploaded 0.0.2

Copilot AI left a comment

Pull request overview

This PR implements a custom Arrow data hashing system to replace the arrow-digest dependency, addressing field order dependency issues and improving support for decimal types by including metadata (precision/scale) in the hash to prevent collisions.

Key Changes:

  • Custom ArrowDigester struct with support for multiple Arrow data types and nested structures
  • Field-order-independent hashing using alphabetically sorted field names
  • Enhanced decimal type hashing that includes precision and scale metadata
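
A minimal sketch of the field-order-independent idea, assuming fields are sorted by name before being fed to the hasher. The `Field` alias, `canonical_order`, and `schema_digest` are hypothetical names, and `DefaultHasher` from the standard library stands in for the SHA-256 digest used by the PR so that the example has no external dependencies.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// Hypothetical (name, data type) pair standing in for an Arrow field.
pub type Field<'a> = (&'a str, &'a str);

/// Returns the fields sorted alphabetically by name.
pub fn canonical_order<'a>(fields: &[Field<'a>]) -> Vec<Field<'a>> {
    let mut sorted = fields.to_vec();
    sorted.sort_by(|a, b| a.0.cmp(b.0));
    sorted
}

/// Hashes the fields in canonical order, so the caller's field order
/// does not affect the result.
pub fn schema_digest(fields: &[Field<'_>]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for (name, data_type) in canonical_order(fields) {
        // Length prefixes keep ("ab", "c") distinct from ("a", "bc").
        hasher.write_u64(name.len() as u64);
        hasher.write(name.as_bytes());
        hasher.write_u64(data_type.len() as u64);
        hasher.write(data_type.as_bytes());
    }
    hasher.finish()
}
```

Calling `schema_digest` with `[("id", "Int64"), ("name", "Utf8")]` and with the same fields in reverse order yields the same value, because both inputs reduce to the same sorted sequence.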

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.

| File | Description |
| --- | --- |
| src/arrow_digester.rs | New custom hasher implementation with support for various Arrow data types, nested structures, and comprehensive test coverage |
| src/pyarrow.rs | Updated FFI integration to use custom ArrowDigester instead of arrow-digest, with improved safety annotations |
| src/lib.rs | Added arrow_digester module declaration |
| Cargo.toml | Updated dependencies (removed arrow-digest, added digest/postcard/serde), configured strict clippy lints, changed edition to '2024' |
| README.md | Added documentation explaining the hashing system architecture and design choices |
| cspell.json | Added "uids" to dictionary and cleaned up empty ignoreWords |
| .vscode/settings.json | Added VSCode workspace configuration for Rust development |
| .github/workflows/clippy.yml | Added CI workflow for Rust syntax, style, and format checks |

Synicix and others added 3 commits December 3, 2025 19:46
Co-authored-by: Copilot <[email protected]>
Co-authored-by: Copilot <[email protected]>
Not sure why clippy didn't catch this

Co-authored-by: Copilot <[email protected]>
Copilot AI left a comment

Pull request overview

Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.


Contributor

As far as I can tell, this just runs the checks but doesn't apply the formatting, does it? Shall we make it so that the code gets auto-formatted?

Collaborator Author

In VS Code, with the settings I added, it does auto-format on save. Or do you mean having the GitHub Action push a commit that auto-formats the code?

Contributor

Generally, I don't think we should include VS Code settings, and if we do, we should keep them minimal and limited to things that apply to everyone.

Collaborator Author

Much of it lets people who use VS Code have everything set up with the correct configuration, such as rust-analyzer and formatting. Ideally I want to keep it somewhere, but I can simplify it to the pure essentials.

Contributor

Let's reduce this to the bare minimum. There are definitely some entries, like the Python 3 interpreter path, that shouldn't be set, since they can't be expected to be the same across different working environments.

Comment on lines +16 to +21
reason = "Need to convert raw pointers to Arrow data structures"
)]
#[expect(
clippy::multiple_unsafe_ops_per_block,
clippy::expect_used,
reason = "Okay since we are doing the same operation of dereferencing pointers, Will add proper errors later"
Contributor

I agree with the suggestion here. @Synicix mind making the changes?

Collaborator Author

Synicix commented Dec 11, 2025

@eywalker
Updates:

  • Switched from postcard to JSON
  • Fixed hash collisions involving binary and list arrays, similar to the StringArray fix
  • Added additional tests to confirm that the code is hashing correctly (found some bugs and fixed them)
  • Split the hasher into a core (arrow_digester_core.rs) and a user-facing library side (lib.rs/ArrowDigester), where SHA256 is set as the digester
  • Added 3 bytes of versioning so that end users can keep track of which StarFix version was used in the hash computation of an Arrow table/array
  • Split unit tests and integration tests so that they are more separate
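
Two of the items above can be illustrated with a rough sketch. It assumes the binary/list collision comes from concatenating variable-length values without boundaries, and that a length prefix resolves it; the PR does not spell out the actual fix, so this is only one plausible approach. The `VERSION` constant and its major/minor/patch layout are likewise hypothetical, as are all function names.

```rust
/// Hypothetical 3-byte version tag (major, minor, patch). The real layout
/// and values used by the PR are not stated in this conversation.
pub const VERSION: [u8; 3] = [0, 0, 2];

/// Concatenates values with no boundaries: ["ab", "c"] and ["a", "bc"]
/// produce the same bytes, so they would hash identically.
pub fn encode_naive(values: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(v);
    }
    out
}

/// Writes each value's length (u64, little-endian) before its bytes,
/// which makes the element boundaries part of the hashed input.
pub fn encode_length_prefixed(values: &[&[u8]]) -> Vec<u8> {
    let mut out = Vec::new();
    for v in values {
        out.extend_from_slice(&(v.len() as u64).to_le_bytes());
        out.extend_from_slice(v);
    }
    out
}

/// Prepends the version tag to a finished digest so consumers can tell
/// which version of the hashing scheme produced it.
pub fn with_version(digest: &[u8]) -> Vec<u8> {
    let mut out = VERSION.to_vec();
    out.extend_from_slice(digest);
    out
}
```

With this layout, a reader of a stored hash can inspect the first three bytes before comparing the remainder, and hashes produced under different scheme versions never compare equal by accident.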

Will tag all the Linear issues here tomorrow.

Fixes PLT-583, PLT-560, PLT-558

}

/// Hash an array directly without needing to create an `ArrowDigester` instance on the user side
pub fn hash_array(array: &dyn Array) -> Vec<u8> {
Contributor

Could you expose the hashing of the schema alone? I have a use case for that.

Collaborator Author

I have updated the code so that the schema-hashing function is exported to the Python side.

Collaborator Author

Synicix commented Jan 7, 2026

Update:

  • Exported all the new functions and improvements to the Python side (updated the Example Python Usage notebook to reflect the exported functions)
  • Updated the _internal bindings to use the new exported functions
  • Renamed process_arrow_table to hash_record_batch on the Python side to match the Rust-side naming

Fixes: PLT-657
