Building a custom Apache Spark

TRAN Ngoc Thach
3 min read · Oct 17, 2022

Introduction

Typically, we download Apache Spark pre-built for a specific version of Hadoop, with the Hadoop libraries for HDFS and YARN already included. However, there are situations where this pre-built package is less suitable. For example:

  • A Hadoop infrastructure has already been set up and is well maintained, with all parts fitting together. In that case, a Hadoop-free Spark build is a good choice: users can point Spark at the existing Hadoop libraries via the environment variable SPARK_DIST_CLASSPATH (see the sketch after this list).
  • A minimal Spark build, in which irrelevant libraries, e.g. hadoop-aws, are excluded. Users then declare exactly which libraries their Spark application needs, e.g. via the --packages argument of the PySpark shell, and Spark downloads those packages and their dependencies automatically (see the example after this list). This approach maximizes users’ freedom of library choice.
  • A convenient Spark build, in which often-used libraries are included, relieving users from specifying additional libraries manually. This approach also brings a hidden benefit: the administrator manages the default libraries, e.g. security updates and dependency consistency. Note that when resolving classes, Spark gives its bundled libraries precedence over user libraries (this behavior can be overridden with spark.driver.userClassPathFirst, but that setting is experimental; use it at your own risk).
  • In certain cases, it is troublesome for a Hadoop-free Spark build to integrate some libraries, e.g. Hive. This can be overcome by building Spark from source against the desired Hadoop version with the Hive switches enabled (-Phive -Phive-thriftserver).
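
For the Hadoop-free case, a minimal sketch of wiring an existing Hadoop installation into Spark, assuming the hadoop command of that installation is on the PATH; the snippet typically goes into conf/spark-env.sh of the Spark build:

    # conf/spark-env.sh (Hadoop-free Spark build)
    # Reuse the classpath of the already-installed Hadoop distribution
    export SPARK_DIST_CLASSPATH=$(hadoop classpath)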
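
For the minimal-build case, a hypothetical PySpark session that fetches hadoop-aws and its transitive dependencies at start-up; the Maven coordinates and version below are only an illustration and must match the Hadoop version in use:

    # Let Spark resolve and download the package automatically
    pyspark --packages org.apache.hadoop:hadoop-aws:3.3.4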

The latter three cases involve building Spark from source code. Many variations are possible by combining different switches, but the author shows at least one approach that works, and readers can go further from there.

Steps

In Terminal:

  1. Increase the memory of Maven.
    + export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
  2. Ensure JAVA_HOME points to a valid JDK 8 or 11, whether Oracle or Temurin. Example:
    + export JAVA_HOME=/usr/lib/jvm/temurin-11-jdk-amd64
  3. Download the Spark source code and extract it.
    + tar -xvzf spark-3.2.2.tgz
  4. Just in case, change the file mode recursively (assuming the OS is Ubuntu) to grant read-write-execute to everyone.
    + chmod -R gou+rwx spark-3.2.2
  5. Move into the just-extracted folder and run:
    + ./dev/make-distribution.sh --name custom-spark --pip --tgz -Pyarn -Phive -Phive-thriftserver -Phadoop-cloud -Dhadoop.version=3.3.4 -e -DskipTests
    Explanation: --name sets the suffix of the final packaged archive, e.g. spark-3.2.2-bin-custom-spark.tgz. -Phive -Phive-thriftserver includes Hive functionality. -Phadoop-cloud pulls in the Hadoop AWS libraries (whether other cloud vendors are also covered is not verified here). -e prints more information in case of build errors.
  6. We end up with spark-3.2.2-bin-custom-spark.tgz, whose structure matches the pre-built packages offered on the Spark download page. This Spark build is ready to deploy (a quick sanity check follows below).
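
As a quick sanity check of the result, one can extract the archive and ask the bundled scripts for the version, or run a built-in example (the folder name below simply follows the --name chosen above):

    tar -xvzf spark-3.2.2-bin-custom-spark.tgz
    cd spark-3.2.2-bin-custom-spark
    ./bin/spark-submit --version      # should report Spark 3.2.2
    ./bin/run-example SparkPi 10      # small local job to verify the build works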

Environment

  • Ubuntu 22.04 LTS (64 bit).
  • OpenJDK 11.0.16.1 Temurin (64 bit).
  • Target Spark version: v3.2.2.
  • Target Hadoop version: v3.3.4.
