Implement our own custom hasher #2
Conversation
@Synicix would you mind outlining the exact form in which you are serializing the Arrow array? I'd like to have the serialization rule well documented so that, in theory, we could replicate the implementation elsewhere.
Pull request overview
This PR implements a custom Arrow data hashing system to replace the arrow-digest dependency, addressing field order dependency issues and improving support for decimal types by including metadata (precision/scale) in the hash to prevent collisions.
Key Changes:
- Custom `ArrowDigester` struct with support for multiple Arrow data types and nested structures
- Field-order-independent hashing using alphabetically sorted field names
- Enhanced decimal type hashing that includes precision and scale metadata
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 12 comments.
| File | Description |
|---|---|
| src/arrow_digester.rs | New custom hasher implementation with support for various Arrow data types, nested structures, and comprehensive test coverage |
| src/pyarrow.rs | Updated FFI integration to use custom ArrowDigester instead of arrow-digest, with improved safety annotations |
| src/lib.rs | Added arrow_digester module declaration |
| Cargo.toml | Updated dependencies (removed arrow-digest, added digest/postcard/serde), configured strict clippy lints, changed edition to '2024' |
| README.md | Added documentation explaining the hashing system architecture and design choices |
| cspell.json | Added "uids" to dictionary and cleaned up empty ignoreWords |
| .vscode/settings.json | Added VSCode workspace configuration for Rust development |
| .github/workflows/clippy.yml | Added CI workflow for Rust syntax, style, and format checks |
Not sure why clippy didn't catch this. Co-authored-by: Copilot <[email protected]>
Pull request overview
Copilot reviewed 9 out of 9 changed files in this pull request and generated 20 comments.
As far as I can tell, this just runs the checks but doesn't apply the formatting, does it? Shall we make it so that the code gets auto-formatted?
In VS Code, with the settings I set, it auto-formats when you save. Or do you mean having the GitHub Action force a commit that auto-formats it?
Generally I don't think we should include VS Code settings, and if we do, we should keep them minimal, limited to things that would be applicable to everyone.
A lot of it allows people who use VS Code to have everything set up with the correct configuration, like rust-analyzer and formatting. Ideally I'd like to keep it somewhere, but I can simplify it to the pure essentials.
Let's reduce this to the bare minimum. There are definitely some entries, like the Python 3 interpreter path, that shouldn't be set and expected to be the same across different working environments.
```rust
    reason = "Need to convert raw pointers to Arrow data structures"
)]
#[expect(
    clippy::multiple_unsafe_ops_per_block,
    clippy::expect_used,
    reason = "Okay since we are doing the same operation of dereferencing pointers; will add proper errors later"
```
I agree with the suggestion here. @Synicix mind making the changes?
… a sha256 as public
… in the same hash, and fix bug related to it
@eywalker
Will tag all the Linear issues here tomorrow. Fixes PLT-583, PLT-560, PLT-558
src/arrow_digester.rs
Outdated
```rust
/// Hash an array directly without needing to create an `ArrowDigester` instance on the user side
pub fn hash_array(array: &dyn Array) -> Vec<u8> {
```
Could you expose the hashing of the schema alone? I'd have a use case for that.
I have updated the code so that the schema-hashing function is also exported to the Python side.
Update:
Fixes: PLT-657
Replacement for arrow-digest, since its hash depends on the order of field elements and it is missing support for some additional data types that we may need in the future.
Other fixes:
Fixes PLT-451, PLT-544, PLT-561