[SPARK-17916][SPARK-25241][SQL][FOLLOWUP] Fix empty string being parsed as null when nullValue is set. #22367
Conversation
Test build #95840 has finished for PR 22367 at commit
Force-pushed from 7eac385 to e23098c
jenkins, retest this, please
Test build #95845 has finished for PR 22367 at commit
docs/sql-programming-guide.md
Outdated
- In version 2.3 and earlier, CSV rows are considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selection of the id column consists of a row with one column value 1234, but in Spark 2.3 and earlier it is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
- Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `false`.
- Since Spark 2.4, metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during statistics computation.
- Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings were equal to `null` values and didn't correspond to any characters in saved CSV files. For example, the row `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to the empty (not quoted) string.
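The empty-string bullet above can be illustrated outside Spark with a small Python sketch. `write_row` is a hypothetical helper that mimics the described write behaviour, not actual Spark code:

```python
def write_row(values, empty_value='""', null_value=""):
    """Render one CSV row the way the migration note describes:
    nulls map to null_value, empty strings map to empty_value
    (quoted "" since Spark 2.4, unquoted before)."""
    out = []
    for v in values:
        if v is None:
            out.append(null_value)
        elif v == "":
            out.append(empty_value)
        else:
            out.append(str(v))
    return ",".join(out)

row = ["a", None, "", 1]
# Spark 2.4 default: the empty string survives as a quoted ""
print(write_row(row))                  # a,,"",1
# Pre-2.4 behaviour: set emptyValue to the empty (not quoted) string
print(write_row(row, empty_value=""))  # a,,,1
```

Note how only the empty-string field changes between the two calls; nulls are written identically in both modes, which is exactly why the quoting is needed to tell them apart on read.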
Not a big deal at all, but I would personally avoid abbreviations (didn't) in the documentation.
avoided
@MaxGekk, mind adding
docs/sql-programming-guide.md
Outdated
- Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings were equal to `null` values and didn't correspond to any characters in saved CSV files. For example, the row `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to the empty (not quoted) string.
This sounds more like a bug fix, though. I don't feel strongly about documenting it, but I am okay with it.
    // When there are empty strings or the values set in `nullValue`, put the
    // index as the suffix.
    if (value == null || value.isEmpty || value == options.nullValue ||
        value == options.emptyValueInRead) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@MaxGekk, can we get rid of this one (and the one in `CSVInferSchema.scala`) for now, since we target 2.4? IIRC (need to double-check), this behaviour of `makeSafeHeader` comes from R's `read.csv`. We should check whether it is coherent.
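For context, the `makeSafeHeader` behaviour under discussion can be sketched with a simplified Python model (a toy under stated assumptions, not the actual Spark implementation): header fields that are null, empty, or equal to `nullValue` get a positional `_c<index>` name.

```python
def make_safe_header(header_row, null_value=""):
    # Toy model: a header field that is missing, empty, or equal to
    # nullValue is replaced by a positional name built from its index.
    return [
        f"_c{i}" if (v is None or v == "" or v == null_value) else v
        for i, v in enumerate(header_row)
    ]

print(make_safe_header(["id", "", None, "name"]))
# ['id', '_c1', '_c2', 'name']
```

The question raised above is whether the added `emptyValueInRead` check should also feed into this renaming, or whether that should wait until the R-compatibility behaviour is verified.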
  def inferField(typeSoFar: DataType, field: String, options: CSVOptions): DataType = {
-   if (field == null || field.isEmpty || field == options.nullValue) {
+   if (field == null || field.isEmpty || field == options.nullValue ||
+       field == options.emptyValueInRead) {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, this will exclude empty strings from type inference. I wonder whether that makes sense. Let's target this change for 3.0.0 if you feel the same way.
I kept the changes because the test https://github.com/apache/spark/pull/22367/files#diff-0433a2d4247fdef1fc57ab95d99218cfR108 seems reasonable to me. Without the changes it fails, since the inferred type becomes StringType. In any case, I will revert this if you insist.
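To make the trade-off concrete, here is a toy Python model of the inference step (hypothetical and greatly simplified relative to `CSVInferSchema.inferField`): treating the empty value like `nullValue` keeps a mostly-numeric column from collapsing to string.

```python
def infer_field(type_so_far, field, null_value="", empty_value='""'):
    # Toy model: null-like and empty-like fields keep the type inferred
    # so far; otherwise try integer, else fall back to string.
    if field is None or field == "" or field in (null_value, empty_value):
        return type_so_far
    try:
        int(field)
        return ("IntegerType"
                if type_so_far in ("NullType", "IntegerType")
                else "StringType")
    except ValueError:
        return "StringType"

inferred = "NullType"
for f in ["1", '""', "2"]:  # a column with an empty value in the middle
    inferred = infer_field(inferred, f)
print(inferred)  # IntegerType
```

Without the empty-value check, the `""` field would fail the integer parse and pin the column to StringType, which is the failure mode the linked test exercises.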
This reverts commit 48e143d.
Test build #95898 has finished for PR 22367 at commit
Test build #95899 has finished for PR 22367 at commit
Oh, wait. Why does this target branch-2.4? You can open this against master and backport if it's needed.
Just to make sure, the changes don't have conflicts in
I will open another PR with the same changes for
@HyukjinKwon The same for the
Usually we merge into master and backport to other branches when it's needed. https://spark.apache.org/contributing.html
…sed as null when nullValue is set.

## What changes were proposed in this pull request?

In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234
Closes #22367

## How was this patch tested?

It was tested by `CSVSuite` and new tests added in the PR #22234

Closes #22389 from MaxGekk/csv-empty-value-master.

Lead-authored-by: Mario Molina <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit c9cb393)
Signed-off-by: hyukjinkwon <[email protected]>
What changes were proposed in this pull request?
In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234
How was this patch tested?
It was tested by `CSVSuite` and new tests added in the PR #22234.