Building Apache Spark

The instructions below describe the steps to build Apache Spark version 2.3.2 in Standalone Mode on Linux on IBM Z for the following distributions:

  • RHEL (7.4, 7.5, 7.6)
  • SLES (12 SP4, 15)
  • Ubuntu (16.04, 18.04, 19.04)

General Notes:

  • When following the steps below, please use a standard (non-root) user unless otherwise specified.

  • These instructions refer to a directory /<source_root>/; this is a temporary, writable working directory that you may place anywhere you like.
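  • For example, a temporary build directory can be created and exported as shown below; the path $HOME/spark_build is only an illustration, any writable location works:

     mkdir -p $HOME/spark_build
     export SOURCE_ROOT=$HOME/spark_build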

Step 1. Build using script

If you want to build Spark using manual steps, go to STEP 2.

Use the following commands to build Spark using the build script. Please make sure you have wget installed.

wget -q https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/2.3.2/build_spark.sh

# Build Spark
bash build_spark.sh   

If the build completes successfully, go to STEP 4. In case of error, check logs for more details or go to STEP 2 to follow manual build steps.
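
To make troubleshooting easier, one option (a sketch only, assuming the standard tee utility) is to capture the script output to a file while the build runs:

    bash build_spark.sh 2>&1 | tee build_spark.log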

Step 2. Build the prerequisites (manual build)

2.1) Install the dependencies
export SOURCE_ROOT=/<source_root>/
  • RHEL (7.4, 7.5, 7.6)

     sudo yum groupinstall -y 'Development Tools' 
     sudo yum install -y wget tar git libtool autoconf maven make
    • With AdoptOpenJDK

      • Download and install AdoptOpenJDK (OpenJDK8 with Eclipse OpenJ9) from here
    • With IBM SDK

      • Download and Install IBM SDK from here
  • SLES (12-SP4, 15)

     sudo zypper install -y wget tar git libtool autoconf gcc make  gcc-c++ zip unzip
    • With AdoptOpenJDK

      • Download and install AdoptOpenJDK (OpenJDK8 with Eclipse OpenJ9) from here
    • With IBM SDK

      • Download and Install IBM SDK from here
    • Install maven

      cd $SOURCE_ROOT
      wget http://mirrors.estointernet.in/apache/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz 
      tar -xvf apache-maven-3.6.1-bin.tar.gz
      export PATH=$PATH:$SOURCE_ROOT/apache-maven-3.6.1/bin
  • Ubuntu (16.04, 18.04, 19.04)

     sudo apt-get install -y wget tar git libtool autoconf build-essential maven
    • With AdoptOpenJDK

      • Download and install AdoptOpenJDK (OpenJDK8 with Eclipse OpenJ9) from here
    • With IBM SDK

      • Download and Install IBM SDK from here
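  • Verify the Java and Maven installation (optional, any distribution)

     A quick check like the sketch below helps catch PATH problems before the longer builds start; the JDK path shown is an example only and should point at wherever the chosen JDK was installed:

      export JAVA_HOME=/opt/jdk8          # example path only
      export PATH=$JAVA_HOME/bin:$PATH
      java -version                       # should report the JDK installed above
      mvn -version                        # should report Maven 3.x running on that JDK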
2.2) Build LevelDB JNI
  • Download and configure Snappy

    cd $SOURCE_ROOT
    wget https://github.com/google/snappy/releases/download/1.1.3/snappy-1.1.3.tar.gz
    tar -zxvf  snappy-1.1.3.tar.gz
    export SNAPPY_HOME=`pwd`/snappy-1.1.3
    cd ${SNAPPY_HOME}
    ./configure --disable-shared --with-pic
    make
  • Download the source code for LevelDB and LevelDB JNI

    cd $SOURCE_ROOT
    git clone -b s390x https://github.com/linux-on-ibm-z/leveldb.git
    git clone -b leveldbjni-1.8-s390x https://github.com/linux-on-ibm-z/leveldbjni.git
  • Set the environment variables

    export JAVA_HOME=/<path to JDK>/
    export PATH=$JAVA_HOME/bin:$PATH
    export LEVELDB_HOME=`pwd`/leveldb
    export LEVELDBJNI_HOME=`pwd`/leveldbjni
    export LIBRARY_PATH=${SNAPPY_HOME}
    export C_INCLUDE_PATH=${LIBRARY_PATH}
    export CPLUS_INCLUDE_PATH=${LIBRARY_PATH}
  • Apply the LevelDB patch

    cd ${LEVELDB_HOME}
    git apply ${LEVELDBJNI_HOME}/leveldb.patch
    make libleveldb.a
  • Build the jar file

    cd ${LEVELDBJNI_HOME}
    mvn clean install -P download -Plinux64-s390x -DskipTests
    jar -xvf ${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SOURCE_ROOT/leveldbjni/META-INF/native/linux64/s390x
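  • Verify the native libraries (optional)

    If the Maven build succeeds, the static libraries and the extracted JNI library should now be present. The file names below are the usual libtool and leveldbjni names and are offered only as a suggested sanity check:

    ls ${SNAPPY_HOME}/.libs/libsnappy.a
    ls ${LEVELDB_HOME}/libleveldb.a
    ls $SOURCE_ROOT/leveldbjni/META-INF/native/linux64/s390x/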
2.3) Build ZSTD JNI
  • Install sbt
    • RHEL (7.4, 7.5, 7.6)

      curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
      sudo yum install -y sbt
    • SLES (12 SP4, 15)

      cd $SOURCE_ROOT
      wget https://piccolo.link/sbt-1.2.8.zip
      unzip sbt-1.2.8.zip
      export PATH=$PATH:$SOURCE_ROOT/sbt/bin/
    • Ubuntu (16.04, 18.04, 19.04)

      echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
      sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
      sudo apt-get update
      sudo apt-get install -y sbt
  • Build zstd-jni (all distributions)

    cd $SOURCE_ROOT
    git clone https://github.com/luben/zstd-jni.git
    cd zstd-jni
    git checkout v1.3.8-2
    sbt compile test package
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SOURCE_ROOT/zstd-jni/target/classes/linux/s390x/
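  • Verify the zstd-jni native library (optional)

    After sbt finishes, the s390x native library should be present in the directory added to LD_LIBRARY_PATH above; listing it is a quick sanity check (the exact .so file name can vary between zstd-jni versions):

    ls $SOURCE_ROOT/zstd-jni/target/classes/linux/s390x/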
2.4) Set Environment Variables
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
export HADOOP_USER_NAME="hadoop"    # IBM SDK only
ulimit -s unlimited
ulimit -n 999999
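
These variables, together with the LD_LIBRARY_PATH entries from 2.2 and 2.3, must be present in the shell that runs the Spark build and tests. One convenient option is to collect them in a small helper file and source it in each new session; the sketch below uses illustrative paths only:

    # spark-build-env.sh -- example helper, adjust the paths to your system
    export SOURCE_ROOT=$HOME/spark_build           # example path
    export JAVA_HOME=/opt/jdk8                     # example path
    export PATH=$JAVA_HOME/bin:$PATH
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SOURCE_ROOT/leveldbjni/META-INF/native/linux64/s390x
    export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SOURCE_ROOT/zstd-jni/target/classes/linux/s390x
    export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
    # export HADOOP_USER_NAME="hadoop"             # uncomment when building with IBM SDK

Source this file (source spark-build-env.sh) before continuing with Step 3.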

Step 3. Build Apache Spark

cd $SOURCE_ROOT
git clone https://github.com/apache/spark.git
cd spark
git checkout v2.3.2
  • Add s390x support in common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
index aca6fca..ed79f6c 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
@@ -44,6 +44,7 @@ public final class Platform {
   public static final int DOUBLE_ARRAY_OFFSET;

   private static final boolean unaligned;
+  String arch = System.getProperty("os.arch", "");
   static {
     boolean _unaligned;
     String arch = System.getProperty("os.arch", "");
@@ -57,7 +58,14 @@ public final class Platform {
           Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader());
         Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
         unalignedMethod.setAccessible(true);
-        _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
+
+      //Since java.nio.Bits.unaligned() doesn't return true on s390x
+      if(arch.matches("^(s390x|s390x)$")){
+       _unaligned=true;
+      }else{
+       _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
+      }
+
       } catch (Throwable t) {
         // We at least know x86 and x64 support unaligned access.
         //noinspection DynamicRegexReplaceableByCompiledPattern
  • Add s390x support in sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index 577eab6..1bf3126 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -396,7 +396,7 @@ public final class OnHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex, floatData,
           Platform.DOUBLE_ARRAY_OFFSET + rowId * 4L, count * 4L);
     } else {
-      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
       for (int i = 0; i < count; ++i) {
         floatData[i + rowId] = bb.getFloat(srcIndex + (4 * i));
       }
@@ -445,7 +445,7 @@ public final class OnHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex, doubleData,
           Platform.DOUBLE_ARRAY_OFFSET + rowId * 8L, count * 8L);
     } else {
-      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
       for (int i = 0; i < count; ++i) {
         doubleData[i + rowId] = bb.getDouble(srcIndex + (8 * i));
       }
  • Add s390x support in sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
index 5e0cf7d..3b919c7 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
@@ -417,7 +417,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex,
           null, data + rowId * 4L, count * 4L);
     } else {
-      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
       long offset = data + 4L * rowId;
       for (int i = 0; i < count; ++i, offset += 4) {
         Platform.putFloat(null, offset, bb.getFloat(srcIndex + (4 * i)));
@@ -472,7 +472,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
       Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex,
         null, data + rowId * 8L, count * 8L);
     } else {
-      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+      ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
       long offset = data + 8L * rowId;
       for (int i = 0; i < count; ++i, offset += 8) {
         Platform.putDouble(null, offset, bb.getDouble(srcIndex + (8 * i)));
  • The failure of the test case "metrics StatsD sink with Timer" can be resolved by making the change below in core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
diff --git a/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala b/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
--- a/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
+++ b/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
@@ -36,7 +36,7 @@ class StatsdSinkSuite extends SparkFunSuite {
     STATSD_KEY_HOST -> "127.0.0.1"
   )
   private val socketTimeout = 30000 // milliseconds
-  private val socketBufferSize = 8192
+  private val socketBufferSize = 10000

   private def withSocketAndSink(testCode: (DatagramSocket, StatsdSink) => Any): Unit = {
   val socket = new DatagramSocket
  • Test cases from the Hive and SQL modules fail with a known issue. To avoid running those tests, move the related files using the commands given below.
cd $SOURCE_ROOT/spark
mv sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowWriterSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowWriterSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowUtilsSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowUtilsSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcFilterSuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcFilterSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcPartitionDiscoverySuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcPartitionDiscoverySuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala.orig
cd $SOURCE_ROOT/spark
./build/mvn -DskipTests clean package
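
If the Maven build succeeds, the packaged jars for Spark 2.3.2 (built against Scala 2.11 by default) end up under the assembly target directory. Listing them and printing the version is a simple smoke test; the scala-2.11 path segment reflects the default Scala version of this release:

    ls assembly/target/scala-2.11/jars | head
    ./bin/spark-submit --version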

Step 4. Run the test cases (Optional)

  • Java Tests

     cd $SOURCE_ROOT/spark
     ./build/mvn test -DwildcardSuites=none
  • Scala Tests

    cd $SOURCE_ROOT/spark
    ./build/mvn -Dtest=none test
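  • Run a single test suite (optional)

    Individual suites can also be run on their own, following the pattern described on the Spark developer-tools page listed in the references; the suite below (the StatsdSinkSuite patched earlier) is only an example:

     cd $SOURCE_ROOT/spark
     ./build/mvn test -DwildcardSuites=org.apache.spark.metrics.sink.StatsdSinkSuite -Dtest=none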

Note:

  1. Scala test cases of the Kafka 0.10 source for Structured Streaming get aborted. However, they pass when run individually.
  2. The Safe getSimpleName test fails on x86 as well; hence the failure can be ignored.
  3. org.apache.spark.unsafe.hash.Murmur3_x86_32Suite is an x86-specific test case; hence the failure can be ignored.
  4. Test failures related to Hive, Kafka, Hadoop, and ORC can be ignored since these packages and their related test cases are not part of the Spark standalone build.

Step 5. Apache Spark Shell

cd $SOURCE_ROOT/spark
./bin/spark-shell
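
To exercise the build in standalone mode (see the spark-standalone documentation in the references), a master and a worker can be started first and the shell pointed at the master URL; <hostname> below is a placeholder for the machine running the master:

    cd $SOURCE_ROOT/spark
    ./sbin/start-master.sh                             # master web UI on port 8080 by default
    ./sbin/start-slave.sh spark://<hostname>:7077      # start a worker and register it with the master
    ./bin/spark-shell --master spark://<hostname>:7077

Inside the shell, a small job such as sc.parallelize(1 to 1000).sum() confirms that the worker executes tasks.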

References

http://spark.apache.org/docs/latest/building-spark.html

http://spark.apache.org/developer-tools.html#individual-tests

https://spark.apache.org/docs/latest/spark-standalone.html

https://issues.apache.org/jira/browse/SPARK-20984

https://issues.apache.org/jira/browse/ARROW-3476
