Building Apache Spark

Note: This package uses Log4j. Please see the details here for updates on security vulnerabilities.

The instructions provided below specify the steps to build Apache Spark version 3.3.0 in Standalone Mode on Linux on IBM Z for the following distributions:

  • RHEL (7.8, 7.9, 8.4, 8.6, 9.0)
  • SLES (12 SP5, 15 SP3, 15 SP4)
  • Ubuntu (18.04, 20.04, 22.04)

Limitation: Hive and ORC do not currently fully support big-endian systems. The use of these components should be avoided if possible.

The binary for Apache Spark version 3.3.0 can be downloaded from here. It works after installing Java and building LevelDB JNI from source, as described in Step 2.2 and Step 2.4. Please note that the steps in the Documentation are the only verification performed on the binary.
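
A minimal sketch of using that prebuilt binary, assuming the downloaded archive is named spark-3.3.0-bin.tgz (a hypothetical name; use whatever file the link above provides) and that Java from Step 2.2 and LevelDB JNI from Step 2.4 are already in place:

tar -xzf spark-3.3.0-bin.tgz && cd spark-3.3.0-bin*    # hypothetical archive name

# Point the environment at the Java runtime (Step 2.2) and the LevelDB JNI
# native library (Step 2.4) before running anything from the binary
export JAVA_HOME=/opt/openjdk/11/
export PATH="${JAVA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${LEVELDBJNI_HOME}/META-INF/native/linux64/s390x${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"

# Basic sanity check: print the Spark version
./bin/spark-submit --version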

General Notes:

  • When following the steps below, please use a user with standard permissions unless otherwise specified.

  • A directory /<source_root>/ will be referred to in these instructions; this is a temporary writable directory that can be created anywhere you like.
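
For example, a minimal way to create such a directory and point the SOURCE_ROOT variable used throughout these instructions at it (the path below is only an illustration; any writable location works):

mkdir -p "$HOME/spark_build"        # example location; any writable directory is fine
export SOURCE_ROOT="$HOME/spark_build"
cd "${SOURCE_ROOT}"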

Step 1. Build using script

If you want to build Spark using manual steps, go to STEP 2.

Use the following commands to build Spark using the build script. Please make sure you have wget installed.

wget -q https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.3.0/build_spark.sh

# Build Spark
bash build_spark.sh    # provide the -h option to print the help menu

If the build completes successfully, go to STEP 5. In case of errors, check the logs for more details, or go to STEP 2 and follow the manual build steps.
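
For example, one way to keep a log of the scripted build for troubleshooting (the log file name is arbitrary; -h is the only script option referenced in these instructions):

bash build_spark.sh -h                          # print the help menu
bash build_spark.sh 2>&1 | tee build_spark.log  # run the build, keeping a log copy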

Step 2. Build Prerequisites for Apache Spark

2.1) Install the dependencies

export SOURCE_ROOT=/<source_root>/
  • RHEL (7.8, 7.9, 8.4, 8.6, 9.0)

    sudo yum groupinstall -y 'Development Tools'
    sudo yum install -y wget tar git libtool autoconf make curl python3
  • SLES (12 SP5)

    sudo zypper install -y wget tar git libtool autoconf curl gcc make gcc-c++ zip unzip gzip gawk python36
    sudo update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.6 40
  • SLES (15 SP3, 15 SP4)

    sudo zypper install -y wget tar git libtool autoconf curl gcc make gcc-c++ zip unzip gzip gawk python3
  • Ubuntu (18.04, 20.04, 22.04)

    sudo apt-get update
    sudo apt-get install -y wget tar git libtool autoconf build-essential curl apt-transport-https
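
Whichever distribution is used, a quick sanity check that the core tools from this step are on the PATH could look like the following sketch:

# Report any of the common prerequisites that did not get installed
for tool in wget tar git make curl python3; do
    command -v "$tool" >/dev/null || echo "missing: $tool"
done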

2.2) Install Java

Install Java 8 (required for building LevelDB JNI):

cd "${SOURCE_ROOT}"
curl -SL -o jdk8.tar.gz "https://github.com/ibmruntimes/semeru8-binaries/releases/download/jdk8u332-b09_openj9-0.32.0/ibm-semeru-open-jdk_s390x_linux_8u332b09_openj9-0.32.0.tar.gz"
sudo mkdir -p /opt/openjdk/8/
sudo tar -zxf jdk8.tar.gz -C /opt/openjdk/8/ --strip-components 1
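
To confirm the Java 8 toolchain unpacked correctly, its version can be queried directly:

# Should report an IBM Semeru (OpenJ9) build of Java 1.8
/opt/openjdk/8/bin/java -version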

Install Java 11:

  • With OpenJDK
    • RHEL (7.8, 7.9, 8.4, 8.6, 9.0)
    sudo yum install -y java-11-openjdk java-11-openjdk-devel
    • SLES (12 SP5, 15 SP3, 15 SP4)
    sudo zypper install -y java-11-openjdk java-11-openjdk-devel
    • Ubuntu (18.04, 20.04, 22.04)
    sudo DEBIAN_FRONTEND=noninteractive apt-get install -y openjdk-11-jdk
export JAVA_HOME=/<Path to OpenJDK>/
export PATH="${JAVA_HOME}/bin:${PATH}"
  • With Eclipse Adoptium Temurin Runtime (previously known as AdoptOpenJDK hotspot)
cd "${SOURCE_ROOT}"
sudo mkdir -p /opt/openjdk/11/
curl -SL -o jdk11.tar.gz "https://github.com/adoptium/temurin11-binaries/releases/download/jdk-11.0.15%2B10/OpenJDK11U-jdk_s390x_linux_hotspot_11.0.15_10.tar.gz"
sudo tar -zxf jdk11.tar.gz -C /opt/openjdk/11/ --strip-components 1
export JAVA_HOME=/opt/openjdk/11/
export PATH="${JAVA_HOME}/bin:${PATH}"

Note: At the time these build instructions were created, Apache Spark version 3.3.0 was verified with OpenJDK 11 (the latest distro-provided version) and the Eclipse Adoptium Temurin Runtime (build 11.0.15+10) on all of the above distributions.
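
Whichever Java 11 option was chosen, a quick check that JAVA_HOME and PATH point at the intended runtime:

# Both should refer to the Java 11 runtime selected above
echo "${JAVA_HOME}"
java -version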

2.3) Install Maven

wget -O apache-maven-3.8.6.tar.gz "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=maven/maven-3/3.8.6/binaries/apache-maven-3.8.6-bin.tar.gz"
tar -xzf apache-maven-3.8.6.tar.gz
export PATH="${PATH}:${SOURCE_ROOT}/apache-maven-3.8.6/bin"
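
A short verification that the extracted Maven is the one on the PATH:

# Should report Apache Maven 3.8.6 and the Java 11 runtime configured in Step 2.2
mvn -version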

2.4) Build LevelDB JNI

  • Download and configure Snappy

    cd "${SOURCE_ROOT}"
    wget https://github.com/google/snappy/releases/download/1.1.3/snappy-1.1.3.tar.gz
    tar -zxvf snappy-1.1.3.tar.gz
    export SNAPPY_HOME="${SOURCE_ROOT}/snappy-1.1.3"
    cd "${SNAPPY_HOME}"
    ./configure --disable-shared --with-pic
    make
    sudo make install
  • Download the source code for LevelDB and LevelDB JNI

    cd "${SOURCE_ROOT}"
    git clone -b s390x https://github.com/linux-on-ibm-z/leveldb.git
    git clone -b leveldbjni-1.8-s390x https://github.com/linux-on-ibm-z/leveldbjni.git
  • Set the environment variables

    export LEVELDB_HOME="${SOURCE_ROOT}/leveldb"
    export LEVELDBJNI_HOME="${SOURCE_ROOT}/leveldbjni"
    export LIBRARY_PATH="${SNAPPY_HOME}"
    export C_INCLUDE_PATH="${LIBRARY_PATH}"
    export CPLUS_INCLUDE_PATH="${LIBRARY_PATH}"
  • Apply the LevelDB patch

    cd "${LEVELDB_HOME}"
    git apply "${LEVELDBJNI_HOME}/leveldb.patch"
    make libleveldb.a
  • Build the jar file

    cd "${LEVELDBJNI_HOME}"
    JAVA_HOME="/opt/openjdk/8/" PATH="/opt/openjdk/8/bin/:${PATH}" mvn clean install -P download -Plinux64-s390x -DskipTests
    JAVA_HOME="/opt/openjdk/8/" PATH="/opt/openjdk/8/bin/:${PATH}" jar -xvf "${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar"
    export LD_LIBRARY_PATH="${LEVELDBJNI_HOME}/META-INF/native/linux64/s390x${LD_LIBRARY_PATH:+:${LD_LIBRARY_PATH}}"
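
After the jar build above, the following sketch confirms that the s390x jar was produced and that the native library sits where LD_LIBRARY_PATH expects it:

# Native library extracted from the jar and referenced by LD_LIBRARY_PATH
ls -l "${LEVELDBJNI_HOME}/META-INF/native/linux64/s390x"

# Jar built for linux64-s390x
ls -l "${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar"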

2.5) Set Environment Variables

export MAVEN_OPTS="-Xss128m -Xmx3g -XX:ReservedCodeCacheSize=1g"

Step 3. Build Apache Spark

3.1) Clone Spark Repository

cd "${SOURCE_ROOT}"
git clone -b v3.3.0 https://github.com/apache/spark.git

3.2) Apply Patches to Fix Known Issues

    cd spark
    wget -O - "https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/3.3.0/patch/spark.diff" | git apply
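
To confirm the patch applied cleanly, standard git commands can be used, for example:

cd "${SOURCE_ROOT}/spark"
git diff --stat    # summarizes the files touched by spark.diff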

3.3) Disable Tests for Unsupported Components

The ORC component does not currently fully support big-endian systems. Run the following commands to disable its test suites:

cd "${SOURCE_ROOT}/spark"
for f in \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcColumnarBatchReaderSuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcEncryptionSuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcFilterSuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcPartitionDiscoverySuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcQuerySuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcSourceSuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcTest.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1FilterSuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV1SchemaPruningSuite.scala \
    sql/core/src/test/scala/org/apache/spark/sql/execution/datasources/orc/OrcV2SchemaPruningSuite.scala \
    sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcPartitionDiscoverySuite.scala \
    sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala \
    sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala \
    sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala \
    sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala
do
    mv "${f}" "${f}.orig"
done
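
To verify the renames, the count of .orig files should match the fifteen suites listed above:

cd "${SOURCE_ROOT}/spark"
find sql -name '*.orig' | wc -l    # expected: 15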

3.4) Build Spark

  cd "${SOURCE_ROOT}/spark"
  ./build/mvn -DskipTests clean package
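
Optionally, a binary distribution tarball can also be produced with Spark's dev/make-distribution.sh script; a minimal sketch (the --name value is arbitrary, and additional Maven profiles can be appended as with a regular mvn invocation):

cd "${SOURCE_ROOT}/spark"
./dev/make-distribution.sh --name s390x-build --tgz    # produces a spark-3.3.0-bin-s390x-build.tgz style tarball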

Step 4. Run the test cases (Optional)

  • Run the Whole Java Test Suites

    cd "${SOURCE_ROOT}/spark"
    ./build/mvn test -fn -DwildcardSuites=none
  • Run an Individual Java Test (For example JavaAPISuite)

    cd "${SOURCE_ROOT}/spark"
    ./build/mvn -DwildcardSuites=none -Dtest=org.apache.spark.streaming.JavaAPISuite test
  • Run the Whole Scala Test Suites

    cd "${SOURCE_ROOT}/spark"
    ./build/mvn test -fn -Dtest=none -pl '!sql/hive'
  • Run an Individual Scala Test (For example DataFrameCallbackSuite)

    cd "${SOURCE_ROOT}/spark"
    ./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.sql.util.DataFrameCallbackSuite test

Note:

  1. Hive does not currently support big-endian systems and so its test suites are skipped.
  2. Tests can also be run using sbt.
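
For example, a single Scala suite can also be run through the bundled sbt wrapper, roughly as follows (a sketch; the module and suite names mirror the Maven examples above):

cd "${SOURCE_ROOT}/spark"
./build/sbt "sql/testOnly org.apache.spark.sql.util.DataFrameCallbackSuite"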

Note: The following Scala test cases have been observed to fail intermittently. They should pass on rerun.

  • Spark Project Core
    • SparkContextSuite
      • SPARK-33084: Add jar support Ivy URI -- test param key case sensitive
    • DAGSchedulerSuite
      • Failures in different stages should not trigger an overall abort
    • ExecutorSuite
      • SPARK-33587: isFatalError
  • Spark Project SQL
    • SQLAppStatusListenerSuite
      • driver side SQL metrics
    • JDBCV2Suite
      • column name with non-ascii
  • Spark Project Streaming
    • StreamingContextSuite
      • SPARK-18560 Receiver data should be deserialized properly.
      • stop gracefully
  • Kafka 0.10+ Source for Structured Streaming
    • DirectKafkaStreamSuite
      • offset recovery

Note: The following Scala test failures are observed on RHEL 8.x, 9.0 and SLES 15 SP4, on both s390x and amd64, when building Spark with OpenJDK 11.

  • Spark Project Core
    • UISuite
      • http -> https redirect applies to all URIs

Step 5. Start Apache Spark Shell

cd "${SOURCE_ROOT}/spark"
./bin/spark-shell
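
As a quick functional check of the build, the bundled SparkPi example can be run from the same directory (a sketch; the trailing argument is the number of partitions used by the example):

cd "${SOURCE_ROOT}/spark"
./bin/run-example SparkPi 10    # should print an approximation of Pi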
