Building Apache Spark
The instructions provided below specify the steps to build Apache Spark version 2.3.2 in Standalone Mode on Linux on IBM Z for the following distributions:
- RHEL (7.4, 7.5, 7.6)
- SLES (12 SP4, 15)
- Ubuntu (16.04, 18.04, 19.04)
General Notes:
- When following the steps below, please use a standard permission user unless otherwise specified.
- A directory /<source_root>/ will be referred to in these instructions; this is a temporary writable directory anywhere you would like to place it.
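For example, the working directory could be created and exported like this (the path $HOME/spark_build is only an illustration; any writable location works):
# Example only: pick any writable location for <source_root>
mkdir -p $HOME/spark_build
export SOURCE_ROOT=$HOME/spark_build
cd $SOURCE_ROOT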
Step 1: Build Apache Spark using the build script (optional)
Use the following commands to build Spark using the build script. Please make sure you have wget installed. If you want to build Spark using manual steps instead, go to STEP 2.
wget -q https://raw.githubusercontent.com/linux-on-ibm-z/scripts/master/ApacheSpark/2.3.2/build_spark.sh
# Build Spark
bash build_spark.sh
If the build completes successfully, go to STEP 4. In case of error, check logs for more details or go to STEP 2 to follow manual build steps.
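If you want to keep a record of the script run, its output can be captured to a log file; the file name below is only an illustration:
# Optional: capture the build output for later inspection
bash build_spark.sh 2>&1 | tee build_spark.log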
Step 2: Install dependencies
export SOURCE_ROOT=/<source_root>/
- RHEL (7.4, 7.5, 7.6)
sudo yum groupinstall -y 'Development Tools'
sudo yum install -y wget tar git libtool autoconf maven make
- SLES (12 SP4, 15)
sudo zypper install -y wget tar git libtool autoconf gcc make gcc-c++ zip unzip
- With AdoptOpenJDK
Download and install AdoptOpenJDK (OpenJDK 8 with Eclipse OpenJ9) from here.
- With IBM SDK
Download and install IBM SDK from here.
- Install Maven
cd $SOURCE_ROOT
wget http://mirrors.estointernet.in/apache/maven/maven-3/3.6.1/binaries/apache-maven-3.6.1-bin.tar.gz
tar -xvf apache-maven-3.6.1-bin.tar.gz
export PATH=$PATH:$SOURCE_ROOT/apache-maven-3.6.1/bin
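Before continuing, it can help to confirm that the intended Java and Maven binaries are the ones found on PATH (the exact version output will vary with the distribution and the JDK you chose):
# Sanity check: confirm the JDK and Maven picked up from PATH
java -version
mvn -version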
- Ubuntu (16.04, 18.04, 19.04)
sudo apt-get install -y wget tar git libtool autoconf build-essential maven
- Download and configure Snappy
cd $SOURCE_ROOT
wget https://github.com/google/snappy/releases/download/1.1.3/snappy-1.1.3.tar.gz
tar -zxvf snappy-1.1.3.tar.gz
export SNAPPY_HOME=`pwd`/snappy-1.1.3
cd ${SNAPPY_HOME}
./configure --disable-shared --with-pic
make
- Download the source code for LevelDB and LevelDB JNI
cd $SOURCE_ROOT
git clone -b s390x https://github.com/linux-on-ibm-z/leveldb.git
git clone -b leveldbjni-1.8-s390x https://github.com/linux-on-ibm-z/leveldbjni.git
- Set the environment variables
export JAVA_HOME=/<path to JDK>/
export PATH=$JAVA_HOME/bin:$PATH
export LEVELDB_HOME=`pwd`/leveldb
export LEVELDBJNI_HOME=`pwd`/leveldbjni
export LIBRARY_PATH=${SNAPPY_HOME}
export C_INCLUDE_PATH=${LIBRARY_PATH}
export CPLUS_INCLUDE_PATH=${LIBRARY_PATH}
- Apply the LevelDB patch
cd ${LEVELDB_HOME}
git apply ${LEVELDBJNI_HOME}/leveldb.patch
make libleveldb.a
- Build the jar file
cd ${LEVELDBJNI_HOME}
mvn clean install -P download -Plinux64-s390x -DskipTests
jar -xvf ${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SOURCE_ROOT/leveldbjni/META-INF/native/linux64/s390x
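To confirm that the s390x native library was produced and extracted, the jar contents and the extracted directory can be listed (paths follow the leveldbjni 1.8 layout used above):
# The native library should appear both inside the jar and in the extracted directory
jar -tf ${LEVELDBJNI_HOME}/leveldbjni-linux64-s390x/target/leveldbjni-linux64-s390x-1.8.jar | grep native
ls ${LEVELDBJNI_HOME}/META-INF/native/linux64/s390x/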
- Install sbt (needed to build zstd-jni below)
- RHEL (7.4, 7.5, 7.6)
curl https://bintray.com/sbt/rpm/rpm | sudo tee /etc/yum.repos.d/bintray-sbt-rpm.repo
sudo yum install -y sbt
- SLES (12-SP4, 15)
cd $SOURCE_ROOT
wget https://piccolo.link/sbt-1.2.8.zip
unzip sbt-1.2.8.zip
export PATH=$PATH:$SOURCE_ROOT/sbt/bin/
- Ubuntu (16.04, 18.04, 19.04)
echo "deb https://dl.bintray.com/sbt/debian /" | sudo tee -a /etc/apt/sources.list.d/sbt.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv 2EE0EA64E40A89B84B2DF73499E82A75642AC823
sudo apt-get update
sudo apt-get install sbt
- Build zstd-jni
cd $SOURCE_ROOT
git clone https://github.com/luben/zstd-jni.git
cd zstd-jni
git checkout v1.3.8-2
sbt compile test package
export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:$SOURCE_ROOT/zstd-jni/target/classes/linux/s390x/
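A quick way to confirm that sbt produced the s390x native library is to list the directory just added to LD_LIBRARY_PATH (the shared-object name inside it depends on the zstd-jni build layout):
# The directory should contain the zstd-jni native library for s390x
ls $SOURCE_ROOT/zstd-jni/target/classes/linux/s390x/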
Step 3: Build Apache Spark
- Set the build environment variables and resource limits
export MAVEN_OPTS="-Xmx2g -XX:ReservedCodeCacheSize=512m"
export HADOOP_USER_NAME="hadoop"   # IBM SDK only
ulimit -s unlimited
ulimit -n 999999
- Get the Apache Spark source code and check out version 2.3.2
cd $SOURCE_ROOT
git clone https://github.com/apache/spark.git
cd spark
git checkout v2.3.2
- Add s390x support in common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
diff --git a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
index aca6fca..ed79f6c 100644
--- a/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
+++ b/common/unsafe/src/main/java/org/apache/spark/unsafe/Platform.java
@@ -44,6 +44,7 @@ public final class Platform {
public static final int DOUBLE_ARRAY_OFFSET;
private static final boolean unaligned;
+ String arch = System.getProperty("os.arch", "");
static {
boolean _unaligned;
String arch = System.getProperty("os.arch", "");
@@ -57,7 +58,14 @@ public final class Platform {
Class.forName("java.nio.Bits", false, ClassLoader.getSystemClassLoader());
Method unalignedMethod = bitsClass.getDeclaredMethod("unaligned");
unalignedMethod.setAccessible(true);
- _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
+
+ //Since java.nio.Bits.unaligned() doesn't return true on s390x
+ if(arch.matches("^(s390x|s390x)$")){
+ _unaligned=true;
+ }else{
+ _unaligned = Boolean.TRUE.equals(unalignedMethod.invoke(null));
+ }
+
} catch (Throwable t) {
// We at least know x86 and x64 support unaligned access.
//noinspection DynamicRegexReplaceableByCompiledPattern
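The change above keys off the os.arch system property, because java.nio.Bits.unaligned() does not report unaligned-access support on s390x. To confirm the value your JVM reports before building, a quick check from the shell is (the -XshowSettings option is available in common Java 8 distributions, though the exact output format may differ):
# Print the os.arch value seen by the JVM; on IBM Z this should be s390x
java -XshowSettings:properties -version 2>&1 | grep os.arch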
- Add s390x support in sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
index 577eab6..1bf3126 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OnHeapColumnVector.java
@@ -396,7 +396,7 @@ public final class OnHeapColumnVector extends WritableColumnVector {
Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex, floatData,
Platform.DOUBLE_ARRAY_OFFSET + rowId * 4L, count * 4L);
} else {
- ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+ ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
for (int i = 0; i < count; ++i) {
floatData[i + rowId] = bb.getFloat(srcIndex + (4 * i));
}
@@ -445,7 +445,7 @@ public final class OnHeapColumnVector extends WritableColumnVector {
Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex, doubleData,
Platform.DOUBLE_ARRAY_OFFSET + rowId * 8L, count * 8L);
} else {
- ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+ ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
for (int i = 0; i < count; ++i) {
doubleData[i + rowId] = bb.getDouble(srcIndex + (8 * i));
}
- Add s390x support in sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
diff --git a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
index 5e0cf7d..3b919c7 100644
--- a/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
+++ b/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/OffHeapColumnVector.java
@@ -417,7 +417,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex,
null, data + rowId * 4L, count * 4L);
} else {
- ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+ ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
long offset = data + 4L * rowId;
for (int i = 0; i < count; ++i, offset += 4) {
Platform.putFloat(null, offset, bb.getFloat(srcIndex + (4 * i)));
@@ -472,7 +472,7 @@ public final class OffHeapColumnVector extends WritableColumnVector {
Platform.copyMemory(src, Platform.BYTE_ARRAY_OFFSET + srcIndex,
null, data + rowId * 8L, count * 8L);
} else {
- ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.LITTLE_ENDIAN);
+ ByteBuffer bb = ByteBuffer.wrap(src).order(ByteOrder.BIG_ENDIAN);
long offset = data + 8L * rowId;
for (int i = 0; i < count; ++i, offset += 8) {
Platform.putDouble(null, offset, bb.getDouble(srcIndex + (8 * i)));
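The two column-vector changes above switch the fallback byte order from LITTLE_ENDIAN to BIG_ENDIAN because s390x is a big-endian architecture. If you want to double-check the byte order of the build machine:
# IBM Z systems report "Big Endian" here
lscpu | grep "Byte Order"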
- The failure of the test case "metrics StatsD sink with Timer" can be resolved by making the change below in core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
diff --git a/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala b/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
--- a/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
+++ b/core/src/test/scala/org/apache/spark/metrics/sink/StatsdSinkSuite.scala
@@ -36,7 +36,7 @@ class StatsdSinkSuite extends SparkFunSuite {
STATSD_KEY_HOST -> "127.0.0.1"
)
private val socketTimeout = 30000 // milliseconds
- private val socketBufferSize = 8192
+ private val socketBufferSize = 10000
private def withSocketAndSink(testCode: (DatagramSocket, StatsdSink) => Any): Unit = {
val socket = new DatagramSocket
- Test cases from the Hive and SQL modules fail with a known issue. To avoid running those tests, move the related files using the commands given below.
cd $SOURCE_ROOT/spark
mv sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowConvertersSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ColumnarBatchSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/vectorized/ArrowColumnVectorSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowWriterSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowWriterSuite.scala.orig
mv sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowUtilsSuite.scala sql/core/src/test/scala/org/apache/spark/sql/execution/arrow/ArrowUtilsSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcFilterSuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcFilterSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcQuerySuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcHadoopFsRelationSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcPartitionDiscoverySuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcPartitionDiscoverySuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/HiveOrcSourceSuite.scala.orig
mv sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala sql/hive/src/test/scala/org/apache/spark/sql/hive/orc/OrcReadBenchmark.scala.orig
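If you want to restore these suites later, the renamed files can be moved back; the loop below is only a convenience sketch:
# Restore the test suites that were renamed to *.orig above
cd $SOURCE_ROOT/spark
find sql -name '*.scala.orig' | while read f; do mv "$f" "${f%.orig}"; done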
- Build Apache Spark
cd $SOURCE_ROOT/spark
./build/mvn -DskipTests clean package
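The command above produces a plain standalone build. If Hive or YARN support is needed, the upstream build documentation listed in the references describes additional Maven profiles; for example, a build with those profiles enabled could look like this (profile names as documented for Apache Spark 2.3.x, not specific to IBM Z):
# Optional: build with YARN and Hive support enabled
./build/mvn -Pyarn -Phive -Phive-thriftserver -DskipTests clean package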
Step 4: Run the test cases (optional)
- Java tests
cd $SOURCE_ROOT/spark
./build/mvn test -DwildcardSuites=none
- Scala tests
cd $SOURCE_ROOT/spark
./build/mvn -Dtest=none test
Note:
- The Scala test cases of "Kafka 0.10 Source for Structured Streaming" get aborted; however, they pass when run individually (see the sketch below).
- The "Safe getSimpleName" test fails on x86 as well, hence the failure can be ignored.
- org.apache.spark.unsafe.hash.Murmur3_x86_32Suite is an x86-specific test case, hence the failure can be ignored.
- Test failures related to Hive, Kafka, Hadoop, and ORC can be ignored since these packages and their related test cases are not part of the Spark standalone build.
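To re-run a single Scala suite individually, as mentioned in the first note, point the wildcardSuites property at that suite (see the individual-tests link in the references). For example, a sketch for the StatsD sink suite changed earlier:
# Run only the StatsdSinkSuite
cd $SOURCE_ROOT/spark
./build/mvn -Dtest=none -DwildcardSuites=org.apache.spark.metrics.sink.StatsdSinkSuite test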
Step 5: Start the Apache Spark shell
cd $SOURCE_ROOT/spark
./bin/spark-shell
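As a quick functional check, the bundled examples can also be run from the build directory; SparkPi ships with the standard Spark examples:
# Smoke test: run the SparkPi example locally
cd $SOURCE_ROOT/spark
./bin/run-example SparkPi 10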
References:
http://spark.apache.org/docs/latest/building-spark.html
http://spark.apache.org/developer-tools.html#individual-tests
https://spark.apache.org/docs/latest/spark-standalone.html
The information provided in this article is accurate at the time of writing, but ongoing development in the open-source projects involved may make the information incorrect or obsolete. Please open an issue or contact us on the IBM Z Community if you have any questions or feedback.