Conversation

@AngersZhuuuu
Contributor

What changes were proposed in this pull request?

snappy-java has released v1.1.7.5; upgrade to the latest version.

Fixed in v1.1.7.4

Fixed in v1.1.7.5

  • Fixes java.lang.NoClassDefFoundError: org/xerial/snappy/pool/DefaultPoolFactory in 1.1.7.4

xerial/snappy-java@1.1.7.3...1.1.7.5

v1.1.7.5 release note:
xerial/snappy-java@edc4ec2

Why are the changes needed?

Picks up the bug fixes in snappy-java 1.1.7.4 and 1.1.7.5, including the `NoClassDefFoundError` noted above.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Not needed; this is a dependency version upgrade only.

cloud-fan and others added 30 commits April 18, 2019 13:24
## What changes were proposed in this pull request?

This backports a tiny part of another change:
apache@4bdfda9#diff-3c792ce7265b69b448a984caf629c96bR161
... which just works around the possibility that the local python interpreter is 'python3' or 'python2' when running the spark-submit tests.

I'd like to backport to 2.3 too.

Without this, the test fails on my Mac (which has a custom Homebrew install), but it may affect others as well.

## How was this patch tested?

Existing tests.

Closes apache#24407 from srowen/Python23check.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

Have Jenkins test against Python 3.6 (instead of 3.4).

## How was this patch tested?

Extensive testing on both the CentOS and Ubuntu Jenkins workers revealed that 2.4 doesn't like Python 3.6... :(

NOTE: this is just for branch-2.4

PLEASE DO NOT MERGE

Closes apache#24379 from shaneknapp/update-python-executable.

Authored-by: shane knapp <[email protected]>
Signed-off-by: shane knapp <[email protected]>
## What changes were proposed in this pull request?

This backports:
apache@ab1650d
apache@7857c6d

which collectively updates Jackson to 2.9.8.

## How was this patch tested?

Existing tests.

Closes apache#24418 from srowen/SPARK-24601.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

When a fatal error (such as `StackOverflowError`) is thrown from `receiveAndReply`, we should try our best to notify the sender. Otherwise, the sender will hang until it times out.

In addition, when a MessageLoop dies unexpectedly, it should resubmit a new one so that the Dispatcher keeps working.
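
A minimal sketch of the intended behavior; the types below are simplified stand-ins, not Spark's actual RPC classes:

```scala
import scala.util.control.NonFatal

// Simplified stand-in for the reply context passed to receiveAndReply (illustrative only).
trait CallContext { def reply(msg: Any): Unit; def sendFailure(e: Throwable): Unit }

def receiveAndReplySafely(context: CallContext)(handler: => Any): Unit = {
  try {
    context.reply(handler)
  } catch {
    case NonFatal(e) => context.sendFailure(e)
    case t: Throwable =>
      // Fatal errors (e.g. StackOverflowError): still try to notify the sender so it
      // fails fast instead of hanging until the RPC timeout, then rethrow so the
      // surrounding loop can decide to resubmit a fresh MessageLoop.
      try context.sendFailure(t) finally throw t
  }
}
```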

## How was this patch tested?

New unit tests.

Closes apache#24396 from zsxwing/SPARK-27496.

Authored-by: Shixiong Zhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 009059e)
Signed-off-by: Dongjoon Hyun <[email protected]>
…Interval change to migration guide

Add note about spark.executor.heartbeatInterval change to migration guide
See also apache#24329

N/A

Closes apache#24432 from srowen/SPARK-27419.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d4a16f4)
Signed-off-by: Wenchen Fan <[email protected]>
…st 1.9.3

## What changes were proposed in this pull request?

Unify commons-beanutils deps to latest 1.9.3
Backport of apache#24378

## How was this patch tested?

Existing tests.

Closes apache#24433 from srowen/SPARK-27469.2.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…h column containing null values

## What changes were proposed in this pull request?
This PR is a follow-up of apache#24286. As gatorsmile pointed out, the estimation for a column with null values is inaccurate as well.

```
> select key from test;
2
NULL
1
spark-sql> desc extended test key;
col_name key
data_type int
comment NULL
min 1
max 2
num_nulls 1
distinct_count 2
```

The distinct count should be `distinct_count + 1` when the column contains null values.
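
A small illustration of the adjustment (a hypothetical helper, not the actual estimation code):

```scala
// If the column has nulls, NULL itself contributes one extra distinct value.
def adjustedDistinctCount(distinctCount: BigInt, nullCount: BigInt): BigInt =
  if (nullCount > 0) distinctCount + 1 else distinctCount

// For the example above: distinct_count = 2, num_nulls = 1  =>  3 distinct values.
assert(adjustedDistinctCount(BigInt(2), BigInt(1)) == BigInt(3))
```
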
## How was this patch tested?

Existing tests & new UT added.

Closes apache#24436 from pengbo/aggregation_estimation.

Authored-by: pengbo <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d9b2ce0)
Signed-off-by: Dongjoon Hyun <[email protected]>
…k on Scala-2.12 build

## What changes were proposed in this pull request?

Since [SPARK-27274](https://issues.apache.org/jira/browse/SPARK-27274) deprecated Scala-2.11 at Spark 2.4.1, we need to test Scala-2.12 more. This PR aims to fix the Python test script on Scala-2.12 build in `branch-2.4`.

**BEFORE**
```
$ dev/change-scala-version.sh 2.12

$ build/sbt -Pscala-2.12 package

$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
Traceback (most recent call last):
  File "python/run-tests.py", line 70, in <module>
    raise Exception("Cannot find assembly build directory, please build Spark first.")
Exception: Cannot find assembly build directory, please build Spark first.
```

**AFTER**
```
$ python/run-tests.py --python-executables python2.7 --modules pyspark-sql
Running PySpark tests. Output is in /Users/dongjoon/APACHE/spark/python/unit-tests.log
Will test against the following Python executables: ['python2.7']
Will test the following Python modules: ['pyspark-sql']
Starting test(python2.7): pyspark.sql.tests
...
```

## How was this patch tested?

Manually do the above procedure because Jenkins doesn't test Scala-2.12 in `branch-2.4`.

Closes apache#24439 from dongjoon-hyun/SPARK-27544.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… `kafka-0-8` profile for Scala-2.12

## What changes were proposed in this pull request?

Since SPARK-27274 deprecated Scala-2.11 at Spark 2.4.1, we need to test Scala-2.12 more.
Since Kafka 0.8 doesn't have Scala-2.12 artifacts, e.g., `org.apache.kafka:kafka_2.12:jar:0.8.2.1`, this PR aims to fix the `test-dependencies.sh` script to understand the Scala binary version.
```
$ dev/change-scala-version.sh 2.12

$ dev/test-dependencies.sh
Using `mvn` from path: /usr/local/bin/mvn
Using `mvn` from path: /usr/local/bin/mvn
Performing Maven install for hadoop-2.6
Using `mvn` from path: /usr/local/bin/mvn
[ERROR] Failed to execute goal on project spark-streaming-kafka-0-8_2.12: Could not resolve dependencies for project org.apache.spark:spark-streaming-kafka-0-8_2.12:jar:spark-335572: Could not find artifact org.apache.kafka:kafka_2.12:jar:0.8.2.1 in central (https://repo.maven.apache.org/maven2) -> [Help 1]
```

## How was this patch tested?

Manually do `dev/change-scala-version.sh 2.12` and `dev/test-dependencies.sh`.
The script should show `DO NOT MATCH` message instead of Maven `[ERROR]`.
```
$ dev/test-dependencies.sh
Using `mvn` from path: /usr/local/bin/mvn
...
Generating dependency manifest for hadoop-3.1
Using `mvn` from path: /usr/local/bin/mvn
Spark's published dependencies DO NOT MATCH the manifest file (dev/spark-deps).
To update the manifest file, run './dev/test-dependencies.sh --replace-manifest'.
diff --git a/dev/deps/spark-deps-hadoop-2.6 b/dev/pr-deps/spark-deps-hadoop-2.6
...
```

Closes apache#24445 from dongjoon-hyun/SPARK-27550.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…nsSuite

## What changes were proposed in this pull request?

Update `HiveExternalCatalogVersionsSuite` to test 2.4.2, as 2.4.1 will be removed from the mirror network soon.

## How was this patch tested?

N/A

Closes apache#24452 from cloud-fan/release.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit b7f9830)
Signed-off-by: Wenchen Fan <[email protected]>
## What changes were proposed in this pull request?

Right now the Kafka source v2 doesn't support null values. The issue is in `org.apache.spark.sql.kafka010.KafkaRecordToUnsafeRowConverter.toUnsafeRow`, which doesn't handle null values.
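
A sketch of the kind of fix, using a simplified row-writer stand-in rather than the real `UnsafeRowWriter`:

```scala
// Simplified stand-ins; the real converter writes into an UnsafeRow.
trait RowWriter {
  def setNullAt(ordinal: Int): Unit
  def write(ordinal: Int, bytes: Array[Byte]): Unit
}
final case class KafkaRecord(key: Array[Byte], value: Array[Byte])

def writeKeyAndValue(writer: RowWriter, record: KafkaRecord): Unit = {
  // Tombstone records (and null keys) are legal in Kafka, so write SQL NULL
  // instead of dereferencing a null byte array.
  if (record.key == null) writer.setNullAt(0) else writer.write(0, record.key)
  if (record.value == null) writer.setNullAt(1) else writer.write(1, record.value)
}
```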

## How was this patch tested?

add new unit tests

Closes apache#24441 from uncleGen/SPARK-27494.

Authored-by: uncleGen <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit d2656aa)
Signed-off-by: Wenchen Fan <[email protected]>
…in HiveExternalCatalogVersionsSuite

## What changes were proposed in this pull request?

We can get the latest downloadable Spark versions from https://dist.apache.org/repos/dist/release/spark/

## How was this patch tested?

manually.

Closes apache#24454 from cloud-fan/test.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
…ackendSuite

## What changes were proposed in this pull request?

The test "RequestExecutors reflects node blacklist and is serializable" is flaky because of multi threaded access of the mock task scheduler. For details check [Mockito FAQ (occasional exceptions like: WrongTypeOfReturnValue)](https://github.com/mockito/mockito/wiki/FAQ#is-mockito-thread-safe). So instead of mocking the task scheduler in the test TaskSchedulerImpl is simply subclassed.

This multithreaded access of the `nodeBlacklist()` method is coming from:
1) the unit test thread via calling of the method `prepareRequestExecutors()`
2) the `DriverEndpoint.onStart` which runs a periodic task that ends up calling this method
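
A minimal sketch of the approach, with a stand-in trait instead of the real `TaskSchedulerImpl`:

```scala
import java.util.concurrent.atomic.AtomicReference

// Stand-in for the slice of the scheduler the backend reads (illustrative only).
trait SchedulerView { def nodeBlacklist(): Set[String] }

// A real subclass with atomically updated state, instead of a Mockito mock that is
// not safe to stub or verify while another thread is concurrently calling it.
class TestScheduler extends SchedulerView {
  private val blacklist = new AtomicReference(Set.empty[String])
  def setNodeBlacklist(nodes: Set[String]): Unit = blacklist.set(nodes)
  override def nodeBlacklist(): Set[String] = blacklist.get()
}
```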

## How was this patch tested?

Existing unit test.

(cherry picked from commit e4e4e2b)

Closes apache#24474 from attilapiros/SPARK-26891-branch-2.4.

Authored-by: “attilapiros” <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…mons-crypto.

The commons-crypto library does some questionable error handling internally,
which can lead to JVM crashes if some call into native code fails and cleans
up state it should not.

Until the library is fixed, this change adds some workarounds in Spark code
so that when an error is detected on the commons-crypto side, Spark avoids
calling further into the library.

Tested with existing and added unit tests.

Closes apache#24476 from vanzin/SPARK-25535-2.4.

Authored-by: Marcelo Vanzin <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… count

This PR consists of the `test` components of apache#23665 only, minus the associated patch from that PR.

It adds a new unit test to `JsonSuite` which verifies that the `count()` returned from a `DataFrame` loaded from JSON containing empty lines does not include those empty lines in the record count. The test runs `count` prior to otherwise reading data from the `DataFrame`, so as to catch future cases where a pre-parsing optimization might result in `count` results inconsistent with existing behavior.

This PR is intended to be deployed alongside apache#23667; `master` currently causes the test to fail, as described in [SPARK-26745](https://issues.apache.org/jira/browse/SPARK-26745).
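
A sketch of what such a test exercises (the path and session setup are illustrative):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("json-empty-lines").getOrCreate()

// A JSON-lines file that contains blank lines between records (path is hypothetical).
val df = spark.read.json("/tmp/records_with_empty_lines.json")

// Run count() before any other action so that a count-only fast path, if one exists,
// is what gets exercised; blank lines must not be counted as records.
val n = df.count()
assert(n == df.collect().length)
```
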

Manual testing, existing `JsonSuite` unit tests.

Closes apache#23674 from sumitsu/json_emptyline_count_test.

Authored-by: Branden Smith <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
(cherry picked from commit 63bced9)
Signed-off-by: Dongjoon Hyun <[email protected]>
…H in Hive UDAF adapter

## What changes were proposed in this pull request?

This is a follow-up of apache#24144. apache#24144 missed one case: when the hash aggregate falls back to sort aggregate, the life cycle of the UDAF is: INIT -> UPDATE -> MERGE -> FINISH.

However, not all Hive UDAF can support it. Hive UDAF knows the aggregation mode when creating the aggregation buffer, so that it can create different buffers for different inputs: the original data or the aggregation buffer. Please see an example in the [sketches library](https://github.com/DataSketches/sketches-hive/blob/7f9e76e9e03807277146291beb2c7bec40e8672b/src/main/java/com/yahoo/sketches/hive/cpc/DataToSketchUDAF.java#L107). The buffer for UPDATE may not support MERGE.

This PR updates the Hive UDAF adapter in Spark to support INIT -> UPDATE -> MERGE -> FINISH, by turning it into INIT -> UPDATE -> FINISH + INIT -> MERGE -> FINISH.
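
A conceptual sketch of the split life cycle; the `Evaluator` trait below is a simplified stand-in for Hive's UDAF evaluator, not the real API:

```scala
// Simplified evaluator interface (illustrative only).
trait Evaluator[B, P] {
  def newUpdateBuffer(): B                 // buffer that consumes original input rows
  def newMergeBuffer(): B                  // buffer that consumes partial aggregation results
  def update(buf: B, input: Any): Unit
  def merge(buf: B, partial: P): Unit
  def terminatePartial(buf: B): P
  def terminate(buf: B): Any
}

// No buffer ever sees both UPDATE and MERGE:
//   update buffer: INIT -> UPDATE* -> terminatePartial
//   merge  buffer: INIT -> MERGE*  -> terminate
class TwoBufferAdapter[B, P](eval: Evaluator[B, P]) {
  private val updateBuf = eval.newUpdateBuffer()
  private val mergeBuf  = eval.newMergeBuffer()
  def update(input: Any): Unit = eval.update(updateBuf, input)
  def merge(partial: P): Unit = eval.merge(mergeBuf, partial)
  def finish(): Any = {
    // Fold the UPDATE side's partial result into the MERGE side before terminating.
    eval.merge(mergeBuf, eval.terminatePartial(updateBuf))
    eval.terminate(mergeBuf)
  }
}
```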

## How was this patch tested?

a new test case

Closes apache#24459 from cloud-fan/hive-udaf.

Authored-by: Wenchen Fan <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 7432e7d)
Signed-off-by: Wenchen Fan <[email protected]>
…2.9.8

## What changes were proposed in this pull request?

This reverts commit 6f394a2.

In general, we need to be very cautious about the Jackson upgrade in the patch releases, especially when this upgrade could break the existing behaviors of the external packages or data sources, and generate different results after the upgrade. The external packages and data sources need to change their source code to keep the original behaviors. The upgrade requires more discussions before releasing it, I think.

In the previous PR apache#22071, we turned off `spark.master.rest.enabled` by default and added the following claim in our security doc:
> The Rest Submission Server and the MesosClusterDispatcher do not support authentication.  You should ensure that all network access to the REST API & MesosClusterDispatcher (port 6066 and 7077 respectively by default) are restricted to hosts that are trusted to submit jobs.

We need to understand whether this Jackson CVE applies to Spark. Before officially releasing it, we need more input from all of you. For now, I would suggest reverting this upgrade from the upcoming 2.4.3 release, which is trying to fix the accidental default Scala version change in pre-built artifacts.

## How was this patch tested?

N/A

Closes apache#24493 from gatorsmile/revert24418.

Authored-by: gatorsmile <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… 2.4 release script

## What changes were proposed in this pull request?
This PR is to cherry-pick all the missing and relevant commits that were merged to master but not to branch-2.4.

Previously, dbtsai used the release script in the branch 2.4 to release 2.4.1.

After more investigation, I found it is risky to make a 2.4 release by using the release script in the master branch since the release script has various changes. It could easily introduce unnoticeable issues, like what we did for 2.4.2.

Thus, I would cherry-pick all the missing fixes and use the updated release script to release 2.4.3.

## How was this patch tested?
N/A

Closes apache#24503 from gatorsmile/upgradeReleaseScript.

Lead-authored-by: Wenchen Fan <[email protected]>
Co-authored-by: gatorsmile <[email protected]>
Co-authored-by: wright <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
…h shell env

Although we use the shebang `#!/usr/bin/env bash`, `minikube docker-env` returns invalid commands in non-bash environments and causes failures at `eval`, because it only recognizes the default shell. We had better add the `--shell bash` option explicitly in our bash script.

```bash
$ bash -c 'eval $(minikube docker-env)'
bash: line 0: set: -g: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
bash: line 0: set: -g: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
bash: line 0: set: -g: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]
bash: line 0: set: -g: invalid option
set: usage: set [-abefhkmnptuvxBCHP] [-o option-name] [--] [arg ...]

$ bash -c 'eval $(minikube docker-env --shell bash)'
```

Manual. Run the script in a non-bash shell environment.
```
bin/docker-image-tool.sh -m -t testing build
```

Closes apache#24517 from dongjoon-hyun/SPARK-27626.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 6c2d351)
Signed-off-by: Dongjoon Hyun <[email protected]>
…s such as loss only during fitting phase

## What changes were proposed in this pull request?

When the `transform(...)` method is called on a `LinearRegressionModel` created directly from coefficients and an intercept, the following exception is encountered.

```
java.util.NoSuchElementException: Failed to find a default value for loss
	at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
	at org.apache.spark.ml.param.Params$$anonfun$getOrDefault$2.apply(params.scala:780)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.ml.param.Params$class.getOrDefault(params.scala:779)
	at org.apache.spark.ml.PipelineStage.getOrDefault(Pipeline.scala:42)
	at org.apache.spark.ml.param.Params$class.$(params.scala:786)
	at org.apache.spark.ml.PipelineStage.$(Pipeline.scala:42)
	at org.apache.spark.ml.regression.LinearRegressionParams$class.validateAndTransformSchema(LinearRegression.scala:111)
	at org.apache.spark.ml.regression.LinearRegressionModel.validateAndTransformSchema(LinearRegression.scala:637)
	at org.apache.spark.ml.PredictionModel.transformSchema(Predictor.scala:192)
	at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
	at org.apache.spark.ml.PipelineModel$$anonfun$transformSchema$5.apply(Pipeline.scala:311)
	at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
	at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
	at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
	at org.apache.spark.ml.PipelineModel.transformSchema(Pipeline.scala:311)
	at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:74)
	at org.apache.spark.ml.PipelineModel.transform(Pipeline.scala:305)
```

This is because `validateAndTransformSchema()` is called during both the training and scoring phases, but the checks against training-related params like `loss` should really be performed during the training phase only, I think; please correct me if I'm missing anything :)

This issue was first reported for mleap (combust/mleap#455) because, basically, when we serialize the Spark transformers for mleap, we only serialize the params that are relevant for scoring. We do have the option to de-serialize the serialized transformers back into Spark for scoring again, but in that case we no longer have all the training params.
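
A sketch of the idea (a standalone simplification, not the actual `LinearRegressionParams` code): gate checks that read training-only params behind the fitting flag.

```scala
object SchemaValidationSketch {
  // `loss` is passed by name because resolving it on a model built directly from
  // coefficients may fail: there is neither a set value nor a default.
  def validateAndTransformSchema(fitting: Boolean, loss: => String): Unit = {
    if (fitting) {
      val l = loss
      require(Set("squaredError", "huber").contains(l), s"Unsupported loss: $l")
      // other training-only checks (solver, epsilon, ...) would also live here
    }
    // scoring-time checks (e.g. the feature column type) stay outside the fitting branch
  }
}
```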

## How was this patch tested?
Added a unit test to check this scenario.

Please let me know if there's anything additional required, this is the first PR that I've raised in this project.

Closes apache#24509 from ancasarb/linear_regression_params_fix.

Authored-by: asarb <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 4241a72)
Signed-off-by: Sean Owen <[email protected]>
…tabase

## What changes were proposed in this pull request?
**Description from JIRA**
For the JDBC option `query`, we generate an alias whose identifier name starts with an underscore: s"(${subquery}) _SPARK_GEN_JDBC_SUBQUERY_NAME${curId.getAndIncrement()}". This is not supported by Oracle.
Oracle doesn't seem to allow identifier names that start with a non-alphabetic character (unless they are quoted), and it has length restrictions as well. [link](https://docs.oracle.com/cd/B19306_01/server.102/b14200/sql_elements008.htm)

In this PR, the generated alias name 'SPARK_GEN_JDBC_SUBQUERY_NAME<int value>' drops the "_" prefix and is also shortened so that it does not exceed the identifier length limit.
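
A sketch of the resulting alias generation (the exact alias string is illustrative, not necessarily the one in the patch):

```scala
import java.util.concurrent.atomic.AtomicLong

object JdbcSubqueryAlias {
  private val curId = new AtomicLong(0L)

  // Short alias starting with a letter, so no quoting is needed and Oracle's
  // identifier length limit is respected.
  def wrap(subquery: String): String =
    s"($subquery) SPARK_GEN_SUBQ_${curId.getAndIncrement()}"
}
```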

## How was this patch tested?
Tests are added for MySQL, PostgreSQL, Oracle and DB2 to ensure enough coverage.

Closes apache#24532 from dilipbiswal/SPARK-27596.

Authored-by: Dilip Biswal <[email protected]>
Signed-off-by: gatorsmile <[email protected]>
(cherry picked from commit 6001d47)
Signed-off-by: gatorsmile <[email protected]>
…cationMetrics

## What changes were proposed in this pull request?

Choose the last record in each chunk when calculating metrics with downsampling in `BinaryClassificationMetrics`.

## How was this patch tested?

A new unit test is added to verify thresholds from downsampled records.

Closes apache#24470 from shishaochen/spark-mllib-binary-metrics.

Authored-by: Shaochen Shi <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit d5308cd)
Signed-off-by: Sean Owen <[email protected]>
…rrectly

## What changes were proposed in this pull request?

If the interval is `0`, neither the value `0` nor the unit is shown at all. For example, this happens in explain plans and in the Spark Web UI on the `EventTimeWatermark` diagram.

**BEFORE**
```scala
scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#0: timestamp, interval 1 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#0]

scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#3: timestamp, interval
+- StreamingRelation FileSource[/tmp/t], [ts#3]
```

**AFTER**
```scala
scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "1 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#0: timestamp, interval 1 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#0]

scala> spark.readStream.schema("ts timestamp").parquet("/tmp/t").withWatermark("ts", "0 microsecond").explain
== Physical Plan ==
EventTimeWatermark ts#3: timestamp, interval 0 microseconds
+- StreamingRelation FileSource[/tmp/t], [ts#3]
```

## How was this patch tested?

Pass the Jenkins with the updated test case.

Closes apache#24516 from dongjoon-hyun/SPARK-27624.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 614a5cc)
Signed-off-by: Dongjoon Hyun <[email protected]>
## What changes were proposed in this pull request?

When following the example for using `spark.streams().awaitAnyTermination()`,
otherwise valid PySpark code will output the following error:

```
Traceback (most recent call last):
  File "pyspark_app.py", line 182, in <module>
    spark.streams().awaitAnyTermination()
TypeError: 'StreamingQueryManager' object is not callable
```

Docs URL: https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#managing-streaming-queries

This changes the documentation line to properly call the method on the `StreamingQueryManager` class, i.e. `spark.streams.awaitAnyTermination()`, since `streams` is a property rather than a callable:
https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.streaming.StreamingQueryManager

## How was this patch tested?

After changing the syntax, the error no longer occurs and the PySpark application works.

This is a docs-only change.

Closes apache#24547 from asaf400/patch-1.

Authored-by: Asaf Levy <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 09422f5)
Signed-off-by: HyukjinKwon <[email protected]>
…cutor in PythonRunner

## What changes were proposed in this pull request?

Backport apache#24542 to 2.4.

## How was this patch tested?

existing tests

Closes apache#24552 from jiangxb1987/SPARK-25139-2.4.

Authored-by: Xingbo Jiang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
10110346 and others added 20 commits October 4, 2019 13:49
### What changes were proposed in this pull request?
This is a clean cherry-pick of apache#22725 from master to 2.4.

This is a follow-up of apache#21601; `StreamFileInputFormat` and `WholeTextFileInputFormat` have the same problem:

`Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
java.io.IOException: Minimum split size pernode 5123456 cannot be larger than maximum split size 4194304
        at org.apache.hadoop.mapreduce.lib.input.CombineFileInputFormat.getSplits(CombineFileInputFormat.java: 201)
	at org.apache.spark.rdd.BinaryFileRDD.getPartitions(BinaryFileRDD.scala:52)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:254)
	at org.apache.spark.rdd.RDD$$anonfun$partitions$2.apply(RDD.scala:252)
	at scala.Option.getOrElse(Option.scala:121)
	at org.apache.spark.rdd.RDD.partitions(RDD.scala:252)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2138)`

### Why are the changes needed?
This is an existing bug which was fixed in master, but not backported to 2.4.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
The original patch added a unit test.

Ran the unit test that was added in the original patch and manually verified the changes by creating a multi-line CSV and loading it in spark-shell.

Closes apache#26026 from dhruve/fix/SPARK-25753/2.4.

Authored-by: liuxian <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…bernetes

### What changes were proposed in this pull request?

The current docker image used by Kubernetes is `openjdk:8-alpine`. It is no longer supported and was removed with the commit docker-library/openjdk@3eb0351#diff-f95ffa3d1377774732c33f7b8368e099.

This PR proposes to move to a supported docker image.

### Why are the changes needed?

I think there are at least two reasons:

1. According to the commit, Alpine/musl is not officially supported by the OpenJDK project.
2. As there are no more OpenJDK 8 Alpine images, new JDK updates, including security fixes, are not applied to it. See below:

```
docker run -it --rm openjdk:8-alpine java -version
openjdk version "1.8.0_212"
OpenJDK Runtime Environment (IcedTea 3.12.0) (Alpine 8.212.04-r0)
OpenJDK 64-Bit Server VM (build 25.212-b04, mixed mode)
```
```
docker run -it --rm openjdk:8-jdk-slim java -version
openjdk version "1.8.0_222"
OpenJDK Runtime Environment (build 1.8.0_222-b10)
OpenJDK 64-Bit Server VM (build 25.222-b10, mixed mode)
```

### Does this PR introduce any user-facing change?

Yes. This changes the base docker image of Spark.

### How was this patch tested?

Existing tests.

Closes apache#26046 from viirya/SPARK-28938-2.4.

Authored-by: Liang-Chi Hsieh <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
The `LICENSE` file has a minor issue: an incorrect path.
This PR fixes it.
This PR will fix it.

### Why are the changes needed?
This is a minor bug.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Existing UTs.

Closes apache#26050 from beliefer/resolve-minor-license-issue.

Authored-by: gengjiaan <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
RDD dependencies and partitions can be simultaneously accessed and mutated
by user threads and Spark's scheduler threads, so access must be thread-safe.
In particular, as partitions and dependencies are lazily initialized, before
this change they could get initialized multiple times, which would leave the
scheduler with an inconsistent view of the pending stages and cause it to
get stuck.
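
A sketch of the kind of thread-safe lazy initialization involved (a simplified stand-in, not the actual `RDD` code):

```scala
// Double-checked, lock-protected lazy initialization: the expensive computation runs
// at most once even if user threads and scheduler threads race on the first access.
final class LazilyComputed[T](compute: () => T) {
  @volatile private var cached: Option[T] = None

  def get: T = {
    val c = cached
    if (c.isDefined) c.get
    else this.synchronized {
      // Re-check under the lock so only one thread ever runs compute().
      if (cached.isEmpty) cached = Some(compute())
      cached.get
    }
  }
}
```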

Tested with existing unit tests.

Closes apache#25951 from squito/SPARK-28917.

Authored-by: Imran Rashid <[email protected]>
Signed-off-by: Marcelo Vanzin <[email protected]>
(cherry picked from commit 0da667d)
Signed-off-by: Marcelo Vanzin <[email protected]>
This PR updates commons-beanutils to 1.9.4.

CVE fixed in 1.9.4: http://commons.apache.org/proper/commons-beanutils/javadocs/v1.9.4/RELEASE-NOTES.txt

No.

Existing UTs.

Closes apache#26069 from peter-toth/SPARK-29410-update-commons-beanutils-to-1.9.4.

Authored-by: Peter Toth <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
Minor version bump of Netty to patch reported CVE.

Patches: https://www.cvedetails.com/cve/CVE-2019-16869/

No

Compiled locally using `mvn clean install -DskipTests`

Closes apache#26099 from Fokko/SPARK-29445.

Authored-by: Fokko Driesprong <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit b5b1b69)
Signed-off-by: Sean Owen <[email protected]>
### What changes were proposed in this pull request?

This PR aims to update the validation check on `length` from `length >= 0` to `length >= -1` in order to allow setting `-1` to keep the default value.

### Why are the changes needed?

At Apache Spark 2.2.0, [SPARK-18702](https://github.com/apache/spark/pull/16133/files#diff-2c5519b1cf4308d77d6f12212971544fR27-R38) adds `class FileBlock` with the default `length` value, `-1`, initially.

There is no way to set `filePath` only while keeping `length` as `-1`.
```scala
def set(filePath: String, startOffset: Long, length: Long): Unit = {
  require(filePath != null, "filePath cannot be null")
  require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
  require(length >= 0, s"length ($length) cannot be negative")
  inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
}
```

For compressed files (like gzip), the split size can be set to -1. This was allowed until Spark 2.1 but regressed starting with Spark 2.2.x. Please note that a split length of -1 also means the length is unknown, which is a valid scenario. Thus, a split length of -1 should be acceptable, as it was before Spark 2.2.
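
A sketch of the relaxed check, mirroring the snippet above (the message wording is illustrative):

```scala
def set(filePath: String, startOffset: Long, length: Long): Unit = {
  require(filePath != null, "filePath cannot be null")
  require(startOffset >= 0, s"startOffset ($startOffset) cannot be negative")
  // -1 is allowed: it is FileBlock's default and means "length unknown".
  require(length >= -1, s"length ($length) cannot be smaller than -1")
  inputBlock.set(new FileBlock(UTF8String.fromString(filePath), startOffset, length))
}
```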

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This updates a corner case of the requirement check. Manually checked the code.

Closes apache#26123 from praneetsharma/fix-SPARK-27259.

Authored-by: prasha2 <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 57edb42)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

Backport of apache#26093 to `branch-2.4`

### Why are the changes needed?

https://issues.apache.org/jira/browse/SPARK-27812
https://issues.apache.org/jira/browse/SPARK-27927

We need the fix fabric8io/kubernetes-client#1768, which was released in version 4.6 of the client. The root cause of the problem is better explained in apache#25785.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

This patch was tested manually using a simple pyspark job

```python
from pyspark.sql import SparkSession

if __name__ == '__main__':
    spark = SparkSession.builder.getOrCreate()
```

The expected behaviour of this "job" is that both the Python and JVM processes exit automatically after the main runs. This is the case for Spark versions <= 2.4. On version 2.4.3, the JVM process hangs because there is a non-daemon thread running:

```
"OkHttp WebSocket https://10.96.0.1/..." apache#121 prio=5 os_prio=0 tid=0x00007fb27c005800 nid=0x24b waiting on condition [0x00007fb300847000]
"OkHttp WebSocket https://10.96.0.1/..." apache#117 prio=5 os_prio=0 tid=0x00007fb28c004000 nid=0x247 waiting on condition [0x00007fb300e4b000]
```
This is caused by a bug in the `kubernetes-client` library, which is fixed in the version that we are upgrading to.

When the mentioned job is run with this patch applied, the behaviour from Spark <= 2.4.0 is restored and both processes terminate successfully.

Closes apache#26152 from igorcalabria/k8s-client-update-2.4.

Authored-by: igor.calabria <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
… string to timestamp

### What changes were proposed in this pull request?
* Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':'
* Added a test to make sure this works.

### Why are the changes needed?
In a couple of scenarios while converting from String to Timestamp, `DateTimeUtils.stringToTimestamp` throws an array-out-of-bounds exception if there is a trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.

Closes apache#26143 from rahulsmahadev/SPARK-29494.

Lead-authored-by: Rahul Mahadev <[email protected]>
Co-authored-by: Rahul Shivu Mahadev <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
(cherry picked from commit 4cfce3e)
Signed-off-by: Sean Owen <[email protected]>
…nverting string to timestamp"

This reverts commit 4d476ed.
…rting string to timestamp

### What changes were proposed in this pull request?
* Adding an additional check in `stringToTimestamp` to handle cases where the input has trailing ':'
* Added a test to make sure this works.

### Why are the changes needed?
In a couple of scenarios while converting from String to Timestamp, `DateTimeUtils.stringToTimestamp` throws an array-out-of-bounds exception if there is a trailing ':'. The behavior of this method requires it to return `None` in case the format of the string is incorrect.

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Added a test in the `DateTimeTestUtils` suite to test if my fix works.

Closes apache#26171 from rahulsmahadev/araryOB.

Authored-by: Rahul Mahadev <[email protected]>
Signed-off-by: Sean Owen <[email protected]>
… older releases

### What changes were proposed in this pull request?

Fall back to archive.apache.org in `build/mvn` to download Maven, in case the ASF mirrors no longer have an older release.

### Why are the changes needed?

If an older release's specified Maven doesn't exist in the mirrors, `build/mvn` will fail.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Manually tested different paths and failures by commenting in/out parts of the script and modifying it directly.

Closes apache#25667 from srowen/SPARK-28963.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit df39855)
Signed-off-by: Dongjoon Hyun <[email protected]>
…rrorServlet

### What changes were proposed in this pull request?

Don't include `$path` from user query in the error response.

### Why are the changes needed?

The path could contain input that is then rendered as HTML in the error response. It's not clear whether it's exploitable, but better safe than sorry as the path info really isn't that important in this context.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Existing tests.

Closes apache#26211 from srowen/SPARK-29556.

Authored-by: Sean Owen <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 8009468)
Signed-off-by: Dongjoon Hyun <[email protected]>
This adds the `typesafe` Bintray repo for `sbt-mima-plugin`.

Since Oct 21, the following plugin versions cause [Jenkins failures](https://amplab.cs.berkeley.edu/jenkins/view/Spark%20QA%20Test%20(Dashboard)/job/spark-branch-2.4-test-sbt-hadoop-2.6/611/console) due to the missing jars.

- `branch-2.4`: `sbt-mima-plugin:0.1.17` is missing.
- `master`: `sbt-mima-plugin:0.3.0` is missing.

These versions of `sbt-mima-plugin` seem to have been removed from the old repo.

```
$ rm -rf ~/.ivy2/

$ build/sbt scalastyle test:scalastyle
...
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	::          UNRESOLVED DEPENDENCIES         ::
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
[warn] 	:: com.typesafe#sbt-mima-plugin;0.1.17: not found
[warn] 	::::::::::::::::::::::::::::::::::::::::::::::
```

No.

Check `GitHub Action` linter result. This PR should pass. Or, manual check.
(Note that Jenkins PR builder didn't fail until now due to the local cache.)

Closes apache#26217 from dongjoon-hyun/SPARK-29560.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit f23c5d7)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?
We need a new mechanism by which downstream operators can notify their parents that they may release their output data streams. In this PR, we implement the mechanism as below (a minimal sketch follows the list):
- Add a function named `cleanupResources` to SparkPlan, which by default calls the children's `cleanupResources`; an operator that needs resource cleanup should override it with its own cleanup and also call `super.cleanupResources`, like SortExec in this PR.
- Add supporting logic on the trigger side, SortMergeJoinExec in this PR, which makes sure `cleanupResources` is called to do the cleanup job for all of its upstream (children) operators.
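
The classes below are stand-ins for SparkPlan, SortExec and SortMergeJoinExec, not the real operators:

```scala
// Stand-ins for SparkPlan / SortExec / SortMergeJoinExec (illustrative only).
abstract class PlanNode {
  def children: Seq[PlanNode] = Nil
  // Default behaviour: just propagate cleanup down the plan tree.
  def cleanupResources(): Unit = children.foreach(_.cleanupResources())
}

class SortNode(child: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(child)
  override def cleanupResources(): Unit = {
    // ... free this operator's own resources (e.g. the external sorter's memory) ...
    super.cleanupResources()
  }
}

class SortMergeJoinNode(left: PlanNode, right: PlanNode) extends PlanNode {
  override def children: Seq[PlanNode] = Seq(left, right)
  // The consuming side triggers cleanup once it knows no more rows are needed.
  def onNoMoreRowsNeeded(): Unit = cleanupResources()
}
```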

### Why are the changes needed?
Bugfix for a SortMergeJoin memory leak, plus a general framework for SparkPlan resource cleanup.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
UT: Added a new test suite, JoinWithResourceCleanSuite, to check both the standard and the code-generation scenarios.

Integration test: tested with driver/executor default memory set to 1g, local mode with 10 threads. The test below (thanks to taosaildrone for providing it [here](apache#23762 (comment))) passes with this PR.

```
from pyspark.sql.functions import rand, col

spark.conf.set("spark.sql.join.preferSortMergeJoin", "true")
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1)

r1 = spark.range(1, 1001).select(col("id").alias("timestamp1"))
r1 = r1.withColumn('value', rand())
r2 = spark.range(1000, 1001).select(col("id").alias("timestamp2"))
r2 = r2.withColumn('value2', rand())
joined = r1.join(r2, r1.timestamp1 == r2.timestamp2, "inner")
joined = joined.coalesce(1)
joined.explain()
joined.show()
```

Closes apache#26210 from xuanyuanking/SPARK-21492-backport.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
…database style iterator

### What changes were proposed in this pull request?
Reimplement the iterator in UnsafeExternalRowSorter in database style. This can be done by reusing the `RowIterator` in our code base.

### Why are the changes needed?
During the work in apache#26164, after introducing a var `isReleased` in `hasNext`, it is possible that `isReleased` is false when `hasNext` is called but becomes true before `next` is called. A safer way is to use a database-style iterator: `advanceNext` and `getRow`.
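
A sketch of the database-style contract (a simplified stand-in for `RowIterator`):

```scala
trait RowIteratorLike[T] {
  // One call both advances and reports availability, so there is no separate
  // hasNext answer that can be invalidated before next() is called.
  def advanceNext(): Boolean
  def getRow: T
}

// Example adapter over a plain Scala Iterator.
class IteratorAdapter[T](underlying: Iterator[T]) extends RowIteratorLike[T] {
  private var current: Option[T] = None
  override def advanceNext(): Boolean =
    if (underlying.hasNext) { current = Some(underlying.next()); true }
    else { current = None; false }
  override def getRow: T =
    current.getOrElse(throw new NoSuchElementException("call advanceNext() first"))
}
```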

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
Existing UT.

Closes apache#26229 from xuanyuanking/SPARK-21492-follow-up.

Authored-by: Yuanjian Li <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit 9e77d48)
Signed-off-by: Wenchen Fan <[email protected]>
### What changes were proposed in this pull request?
Remove the requirement of `fetch_size >= 0` from JDBCOptions to allow a negative fetch size.

### Why are the changes needed?

Namely, to allow fetching data in a streaming manner (row-by-row fetch) from a MySQL database.
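
A usage sketch (connection details are placeholders; MySQL's Connector/J streams rows one at a time when the fetch size is `Integer.MIN_VALUE`):

```scala
// Assumes an existing SparkSession `spark` and a reachable MySQL instance.
val df = spark.read
  .format("jdbc")
  .option("url", "jdbc:mysql://localhost:3306/testdb")
  .option("dbtable", "big_table")
  .option("user", "test")
  .option("password", "secret")
  // Negative value previously rejected by JDBCOptions; enables row-by-row streaming on MySQL.
  .option("fetchsize", Integer.MIN_VALUE.toString)
  .load()
```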

### Does this PR introduce any user-facing change?
No

### How was this patch tested?
Unit test (JDBCSuite)

This closes apache#26230.

Closes apache#26244 from fuwhu/SPARK-21287-FIX.

Authored-by: fuwhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit 92b2529)
Signed-off-by: Dongjoon Hyun <[email protected]>
### What changes were proposed in this pull request?

The `SparkSession.sql()` method's parsing does not run under the current SparkSession's conf, so some parser-related configuration is not honored in multi-threaded situations.

In this PR, we add a SQLConf parameter to AbstractSqlParser and initialize it with the SessionState's conf.
Each SparkSession's parsing then uses its own SessionState's SQLConf and is therefore thread-safe.
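
A sketch of the shape of the change (`ParserSketch` and the conf-key handling are illustrative, not the actual `AbstractSqlParser`):

```scala
// The parser captures the session's conf at construction time instead of reading
// a shared/global conf, so concurrent sessions parse with their own settings.
class ParserSketch(conf: Map[String, String]) {
  private val caseSensitive =
    conf.getOrElse("spark.sql.caseSensitive", "false").toBoolean

  def normalizeIdentifier(name: String): String =
    if (caseSensitive) name else name.toLowerCase
}
```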

### Why are the changes needed?
Fix bug

### Does this PR introduce any user-facing change?
NO

### How was this patch tested?
NO

Closes apache#26240 from AngersZhuuuu/SPARK-29530-V2.4.

Authored-by: angerszhu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
…he table's ownership

### What changes were proposed in this pull request?

This PR backports apache#26160 to branch-2.4.

### Why are the changes needed?
Backport from master.

### Does this PR introduce any user-facing change?
No.

### How was this patch tested?
unit test

Closes apache#26248 from wangyum/SPARK-29498-branch-2.4.

Authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>