-
Notifications
You must be signed in to change notification settings - Fork 25.6k
Add nullable to Categorize #118176
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Closed
Closed
Add nullable to Categorize #118176
Changes from all commits
Commits
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, this is conceptually not true. Actually, this should be
Nullability.TRUE.In the csv tests, we have test cases that emit nulls: either from actual
nullinputs, or from empty strings.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CATEGORIZEof null, empty string, only whitespace, or only stopword tokens, is null.Just look at categorize.csv-spec and search for "null" for many examples.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Then why that check in
FoldNull?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Disclaimer: I'm not an expert on this part of the code. @nik9000 or @ivancea probably are the right persons to ask.
My thoughts:
CATEGORIZEis a stateful function, that uses a specializedCategorizeBlockHashto gather the results. I think folding the nulls breaks that. (Note that commenting out the line also breaks a bunch of CSV tests.)Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ivancea provided some context on this to me. I think this is required for the following case, which can be seen when removing the check in
FoldNullbefore running:FoldNullpropagates thenullfrom the row even into theAggregate:The reason this can happen is that the
CATEGORIZEmust remain in theAggregate's groupings.I think the correct course of action is:
Categorize.Nullableand set it toNullable.TRUEas we can only know if it'll produce a null after running/folding it.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Csv test proposal that I just had pass locally:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ivan is adding a similar test case here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And probably CI passed in this PR because we were missing the right tests, no? @alex-spies @ivancea @jan-elastic
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I think the
nullablemethod may currently only be relevant forFoldNull, where the explicit exemption forCATEGORIZEmakes it kinda irrelevant.That is to say, I can't think of a test case that will make your PR fail @astefan . (But I think we should still not use
nullable()this way, as it would be a breach of contract).FWIW, the
rowtest case above, or the one from Ivan's PR at least make things fail when you just remove the extra check forCATEGORIZEfromFoldNullwithout also overridingCategorize.nullable().There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added here an extra testcase (ROW null), apart from the null-in-param one the PR had already: #118013
Also, updated the
Categorize#nullable()to make sense (An empty string will also return null), and commented the why of theCategorizeexception inFoldNull