The cloud-integration repository provides modules to improve Apache Spark's integration with cloud infrastructures.

The spark-cloud-integration module contains classes and tools to make Spark work better in-cloud:
- Committer integration with the s3a committers.
- Proof of concept cloud-first distcp replacement.
- Serialization for Hadoop `Configuration`: class `ConfigSerDeser`. Use this to get a configuration into an RDD method (a sketch follows this list).
- Trait `HConf` to manipulate the Hadoop options in a Spark configuration.
- Anything else which turns out to be useful.
- Variant of `FileInputStream` for cloud storage: `org.apache.spark.streaming.hortonworks.CloudInputDStream`.
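
The sketch below shows the idea behind `ConfigSerDeser`: wrap the non-serializable Hadoop `Configuration` in a `Serializable` holder so an RDD closure can carry it to the executors. This is a minimal illustration of the pattern, not the repository's actual class; the example object and its use of `fs.defaultFS` are invented for demonstration.

```scala
import java.io.{ObjectInputStream, ObjectOutputStream}

import org.apache.hadoop.conf.Configuration
import org.apache.spark.sql.SparkSession

// Serializable wrapper around a Hadoop Configuration. Configuration
// implements Writable, so serialization delegates to write()/readFields().
class ConfigSerDeser(@transient private var conf: Configuration)
    extends Serializable {

  def get(): Configuration = conf

  private def writeObject(out: ObjectOutputStream): Unit =
    conf.write(out)

  private def readObject(in: ObjectInputStream): Unit = {
    conf = new Configuration(false)
    conf.readFields(in)
  }
}

// Hypothetical usage: capture the driver's Hadoop configuration in an
// RDD closure and read an option from it on the executors.
object ConfigSerDeserExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("ConfigSerDeserExample").getOrCreate()
    val serdes = new ConfigSerDeser(spark.sparkContext.hadoopConfiguration)
    spark.sparkContext.parallelize(1 to 4)
      .map(_ => serdes.get().get("fs.defaultFS"))
      .collect()
      .foreach(println)
    spark.stop()
  }
}
```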
A companion module provides the packaging/integration tests for Spark and cloud storage against AWS, Azure, and OpenStack.

These are basic tests of the core functionality of I/O and streaming, and they verify that the committers work in the presence of inconsistent object storage. As well as running as unit tests, they have CLI entry points which can be used for scalable functional testing, as sketched below.
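
The dual unit-test/CLI pattern can be as simple as sharing one workload method between the test harness and a `main()` invokable via spark-submit. Everything in this sketch is hypothetical (the object name, its arguments, and the workload); it only illustrates the shape of such an entry point.

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical test workload with a CLI entry point: a unit test can call
// run() directly, while spark-submit reaches it through main().
object S3AReadWriteTest {
  // Core workload, shared by unit tests and the CLI.
  def run(spark: SparkSession, dest: String, rows: Int): Unit = {
    spark.range(rows).write.mode("overwrite").csv(dest)
    assert(spark.read.csv(dest).count() == rows)
  }

  // CLI entry point for scalable functional testing:
  // args(0) = destination path, args(1) = row count.
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("S3AReadWriteTest").getOrCreate()
    try run(spark, args(0), args(1).toInt)
    finally spark.stop()
  }
}
```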
The minimal-integration-test module builds a minimal JAR for integration tests, intended to work against all Spark 2.2 versions. Because Spark 2.1 keeps Spark's Logging class private, this module reinstates its own log API, `CloudLogging`, which is used instead; it then copies in the relevant operations from spark-cloud-integration with their logging fixed up.
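
A minimal SLF4J-backed stand-in for Spark's private Logging trait would look roughly like this; the method set mirrors Spark's convention but is an assumption about `CloudLogging`'s actual surface.

```scala
import org.slf4j.{Logger, LoggerFactory}

// Rough sketch of a Logging-style trait: Spark's own
// org.apache.spark.internal.Logging is private[spark], so the module
// carries an SLF4J-backed equivalent of its own.
trait CloudLogging {
  @transient private lazy val log: Logger =
    LoggerFactory.getLogger(this.getClass.getName.stripSuffix("$"))

  protected def logDebug(msg: => String): Unit =
    if (log.isDebugEnabled) log.debug(msg)

  protected def logInfo(msg: => String): Unit =
    if (log.isInfoEnabled) log.info(msg)

  protected def logWarning(msg: => String): Unit =
    if (log.isWarnEnabled) log.warn(msg)

  protected def logError(msg: => String, e: Throwable = null): Unit =
    if (e == null) log.error(msg) else log.error(msg, e)
}
```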
Usage
```
spark-submit --class com.hortonworks.spark.cloud.integration.Generator \
  --master yarn \
  --num-executors 2 \
  --driver-memory 512m \
  --executor-memory 512m \
  --executor-cores 1 \
  minimal-integration-test-1.0-SNAPSHOT.jar \
  adl://example.azuredatalakestore.net/output/dest/1 \
  2 2 15
```