
Conversation

@MaxGekk
Member

@MaxGekk MaxGekk commented Sep 8, 2018

What changes were proposed in this pull request?

In this PR, I propose a new CSV option `emptyValue` and an update to the SQL Migration Guide that describes how to revert to the previous behavior, when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234

How was this patch tested?

It was tested by `CSVSuite` and by the new tests added in PR #22234.

@SparkQA

SparkQA commented Sep 9, 2018

Test build #95840 has finished for PR 22367 at commit 7eac385.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk MaxGekk force-pushed the csv-empty-value-2.4 branch from 7eac385 to e23098c Compare September 9, 2018 07:38
@MaxGekk
Member Author

MaxGekk commented Sep 9, 2018

jenkins, retest this, please

@SparkQA

SparkQA commented Sep 9, 2018

Test build #95845 has finished for PR 22367 at commit e23098c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

- In version 2.3 and earlier, a CSV row is considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column returns a row with the single column value 1234, but in Spark 2.3 and earlier the result is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
- Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `false`.
- Since Spark 2.4, metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during statistics computation.
- Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings were equal to `null` values and were not reflected by any characters in saved CSV files. For example, the row `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to the empty (not quoted) string.
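To illustrate the last bullet, here is a hypothetical Python sketch (not Spark code) of how a writer can distinguish empty strings from nulls; the helper name and defaults are assumptions made for this example:

```python
def to_csv_row(values, empty_value='""', null_value=''):
    """Serialize one row the way the Spark 2.4 CSV writer is described above:
    nulls produce no characters, empty strings produce the quoted empty string."""
    out = []
    for v in values:
        if v is None:
            out.append(null_value)    # null -> nothing between the commas
        elif v == '':
            out.append(empty_value)   # empty string -> "" (the 2.4 default)
        else:
            out.append(str(v))
    return ','.join(out)

# Spark 2.4 default vs. the restored 2.3 behavior (emptyValue set to ''):
to_csv_row(['a', None, '', 1])                  # 'a,,"",1'
to_csv_row(['a', None, '', 1], empty_value='')  # 'a,,,1'
```

Setting `empty_value=''` here mirrors setting the CSV option `emptyValue` to an empty string, which makes empty strings and nulls indistinguishable again in the output.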
Member


Not a big deal at all, but I would avoid the abbreviation (didn't) in the documentation, personally.

Member Author


avoided

@HyukjinKwon
Member

@MaxGekk, mind adding Closes #22234 at the end of PR description so that we can automatically close that one?

- In version 2.3 and earlier, a CSV row is considered malformed if at least one column value in the row is malformed. The CSV parser drops such rows in the DROPMALFORMED mode or outputs an error in the FAILFAST mode. Since Spark 2.4, a CSV row is considered malformed only when it contains malformed column values requested from the CSV datasource; other values can be ignored. As an example, a CSV file contains the "id,name" header and one row "1234". In Spark 2.4, selecting the id column returns a row with the single column value 1234, but in Spark 2.3 and earlier the result is empty in the DROPMALFORMED mode. To restore the previous behavior, set `spark.sql.csv.parser.columnPruning.enabled` to `false`.
- Since Spark 2.4, file listing for computing statistics is done in parallel by default. This can be disabled by setting `spark.sql.parallelFileListingInStatsComputation.enabled` to `false`.
- Since Spark 2.4, metadata files (e.g. Parquet summary files) and temporary files are not counted as data files when calculating table size during statistics computation.
- Since Spark 2.4, empty strings are saved as quoted empty strings `""`. In version 2.3 and earlier, empty strings were equal to `null` values and were not reflected by any characters in saved CSV files. For example, the row `"a", null, "", 1` was written as `a,,,1`. Since Spark 2.4, the same row is saved as `a,,"",1`. To restore the previous behavior, set the CSV option `emptyValue` to the empty (not quoted) string.
Member


This sounds rather like a bug fix though. I don't feel strongly about documenting this, but I am okay with it.

// When there are empty strings or the values set in `nullValue`, put the
// index as the suffix.
if (value == null || value.isEmpty || value == options.nullValue ||
    value == options.emptyValueInRead) {
Member


@MaxGekk, can we get rid of this one (and the one in CSVInferSchema.scala) for now since we target 2.4? IIRC (need to double check), this behaviour of makeSafeHeader is from R's read.csv. We should check whether it is coherent or not.
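For context, a hypothetical Python sketch of the header-sanitizing idea under discussion (the real makeSafeHeader lives in Spark's Scala code; the function name here and the `_c{i}` fallback are assumptions made for this example):

```python
def make_safe_header(row, null_value=''):
    """Hypothetical simplification of the behavior discussed above:
    header cells that are null, empty, or equal to nullValue are
    replaced with a positional column name."""
    return [h if h not in (None, '', null_value) else f'_c{i}'
            for i, h in enumerate(row)]

make_safe_header(['id', '', None])  # ['id', '_c1', '_c2']
```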

  def inferField(typeSoFar: DataType, field: String, options: CSVOptions): DataType = {
-   if (field == null || field.isEmpty || field == options.nullValue) {
+   if (field == null || field.isEmpty || field == options.nullValue ||
+       field == options.emptyValueInRead) {
Member


Also, this will exclude empty strings from type inference. Wonder if that makes sense. Let's target this change for 3.0.0 if you feel the same way.

Member Author


I kept the changes because the test https://github.com/apache/spark/pull/22367/files#diff-0433a2d4247fdef1fc57ab95d99218cfR108 seems reasonable to me. Without the changes it fails since the inferred type becomes StringType. In any case, I will revert this if you insist.
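To illustrate the inference point being discussed, here is a hypothetical Python model (not the Scala implementation) of how skipping empty strings changes the inferred column type; the function name, the two-type lattice, and the type labels are all assumptions made for this example:

```python
def infer_column_type(values, skip_empty):
    """Toy model of CSV type inference: nulls never affect the result;
    with skip_empty=True, empty strings are treated like nulls too,
    so they no longer force the column to StringType."""
    inferred = None
    for v in values:
        if v is None or (skip_empty and v == ''):
            continue  # skipped values do not participate in inference
        try:
            int(v)
            t = 'IntegerType'
        except ValueError:
            t = 'StringType'
        if inferred is None:
            inferred = t
        elif inferred != t:
            inferred = 'StringType'  # mixed types fall back to string
    return inferred or 'StringType'

infer_column_type(['1', '', '2'], skip_empty=True)   # 'IntegerType'
infer_column_type(['1', '', '2'], skip_empty=False)  # 'StringType'
```

This is the behavioral difference behind the review comment: with the change, a numeric column containing empty cells still infers as an integer type instead of degrading to StringType.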

@MaxGekk
Member Author

MaxGekk commented Sep 10, 2018

mind adding Closes #22234 at the end of PR description so that we can automatically close that one?

Just in case, this PR for branch-2.4 but the original #22234 for master.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95898 has finished for PR 22367 at commit 40cfa28.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 11, 2018

Test build #95899 has finished for PR 22367 at commit a56d001.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Oh, wait, why does this target branch-2.4? You can open this against master and backport it if needed.

@MaxGekk
Member Author

MaxGekk commented Sep 11, 2018

Oh, wait. why does this target branch-2.4?

Just to make sure the changes don't have conflicts in branch-2.4. @HyukjinKwon Is this PR not mergeable to master?

You can open this against master and backport if it's needed

I will open another PR with the same changes for master.

@MaxGekk
Member Author

MaxGekk commented Sep 11, 2018

@HyukjinKwon The same for the master branch: #22389

@HyukjinKwon
Member

Usually we merge into master and backport to other branches when it's needed.

https://spark.apache.org/contributing.html

  1. Open a pull request against the master branch of apache/spark. (Only in special cases would the PR be opened against other branches.)

asfgit pushed a commit that referenced this pull request Sep 11, 2018
…sed as null when nullValue is set.

## What changes were proposed in this pull request?

In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes #22234
Closes #22367

## How was this patch tested?

It was tested by `CSVSuite` and new tests added in the PR #22234

Closes #22389 from MaxGekk/csv-empty-value-master.

Lead-authored-by: Mario Molina <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
(cherry picked from commit c9cb393)
Signed-off-by: hyukjinkwon <[email protected]>
@asfgit asfgit closed this in c9cb393 Sep 11, 2018
fjh100456 pushed a commit to fjh100456/spark that referenced this pull request Sep 13, 2018
…sed as null when nullValue is set.

## What changes were proposed in this pull request?

In the PR, I propose new CSV option `emptyValue` and an update in the SQL Migration Guide which describes how to revert previous behavior when empty strings were not written at all. Since Spark 2.4, empty strings are saved as `""` to distinguish them from saved `null`s.

Closes apache#22234
Closes apache#22367

## How was this patch tested?

It was tested by `CSVSuite` and new tests added in the PR apache#22234

Closes apache#22389 from MaxGekk/csv-empty-value-master.

Lead-authored-by: Mario Molina <[email protected]>
Co-authored-by: Maxim Gekk <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
@MaxGekk MaxGekk deleted the csv-empty-value-2.4 branch August 17, 2019 13:33