Wednesday, June 11, 2014

Moving from Spark 0.9.1 to Spark 1.0.0

I recently had to support Spark 1.0.0 in a project (Magpie).

The conversion from Spark 0.9.1 to Spark 1.0.0 was a bit annoying, as many changes had happened.

Here is a list of changes that I thought were worth mentioning.  Hopefully what I say here can help others.

1) Running examples differences

In Spark 0.9.1, you would run one of the Spark examples (such as SparkPi) like this:

> bin/run-example org.apache.spark.examples.SparkPi spark://SPARKMASTER:7077 
 
In Spark 1.0.0, running examples through run-example requires the Spark master to be specified through the MASTER environment variable and not on the command line.  So for Spark 1.0.0, you'll want to do something like this instead:

> export MASTER="spark://SPARKMASTER:7077"
> bin/run-example org.apache.spark.examples.SparkPi

Otherwise, you'll get a bad input error/exception.

If you don't set the MASTER environment variable, run-example will assume you want to run the example locally.
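
For example, with nothing exported, something like this should compute Pi against a local master (I'm assuming a standard Spark 1.0.0 layout here):

> unset MASTER
> bin/run-example org.apache.spark.examples.SparkPi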

Note that setting the MASTER environment variable is specific to the run-example script.  It won't pick up the default value from the new spark-defaults.conf file.

2) spark-submit script

spark-submit is a new wrapper script for submitting Spark jobs.  Although you can still use spark-class directly, spark-submit is now the primary job submission script. It has the following usage:

> bin/spark-submit --class JOBCLASSTORUN [spark-submit options] APPLICATIONJAR [application args]

Note that the job jar is now passed in on the command line, unlike before with spark-class.
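
For example, something like this should submit SparkPi to a standalone cluster (I'm assuming a binary distribution here, where the examples jar sits under lib/; in a source build the examples jar lives elsewhere; the trailing 10 is just an application argument passed through to SparkPi):

> bin/spark-submit --class org.apache.spark.examples.SparkPi --master spark://SPARKMASTER:7077 lib/spark-examples*.jar 10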

There are all sorts of new options in spark-submit.  Here are some options of note, taken from the --help output (an example invocation follows the list):



  --master MASTER_URL         spark://host:port, mesos://host:port, yarn, or local.
  --class CLASS_NAME          Your application's main class (for Java / Scala apps).
  --jars JARS                 Comma-separated list of local jars to include on the driver
                              and executor classpaths.
  --properties-file FILE      Path to a file from which to load extra properties. If not
                              specified, this will look for conf/spark-defaults.conf.
  --driver-memory MEM         Memory for driver (e.g. 1000M, 2G) (Default: 512M).
  --driver-java-options       Extra Java options to pass to the driver.
  --driver-library-path       Extra library path entries to pass to the driver.
  --driver-class-path         Extra class path entries to pass to the driver. Note that
                              jars added with --jars are automatically included in the
                              classpath.
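
For illustration, a submission using a few of these options might look something like the following; the class name, jars, and paths are made up for the example:

> bin/spark-submit --class org.myorg.MyJob --master spark://SPARKMASTER:7077 --driver-memory 1G --jars /path/to/dep1.jar,/path/to/dep2.jar /path/to/myjob.jar arg1 arg2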


But you probably won't end up setting most of these on the command line; you'll more likely use ...

3) spark-defaults.conf

Previously, options were configured through SPARK_JAVA_OPTS, but that is now deprecated. Everything should now be done through the spark-defaults.conf file. It is read and loaded by spark-submit when you submit a job. By default it is read from conf/spark-defaults.conf, but that can be changed using the --properties-file option. In addition, settings that used to go through SPARK_CLASSPATH or SPARK_LIBRARY_PATH should now be set through spark-defaults.conf as well.

Here are several spark-defaults.conf options of particular note; the full list is in the Spark documentation.


spark.master                     set Spark master, e.g. spark://SPARKMASTER:7077
spark.executor.memory            set executor memory, e.g. 1024m, see more below
spark.executor.extraClassPath    what used to be set by SPARK_CLASSPATH
spark.executor.extraLibraryPath  what used to be set by SPARK_LIBRARY_PATH 
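
For example, a minimal conf/spark-defaults.conf might look something like this (the paths are just placeholders):

spark.master                     spark://SPARKMASTER:7077
spark.executor.memory            1024m
spark.executor.extraClassPath    /path/to/extra/jars/myextra.jar
spark.executor.extraLibraryPath  /path/to/native/libs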

4) deprecated SPARK_MEM environment variable

The SPARK_MEM environment variable has been deprecated and replaced by two new settings so users can configure memory for the Spark executors and the driver separately.

The SPARK_DRIVER_MEMORY environment variable sets the memory for your Spark driver.  This can also be handled via the --driver-memory option in spark-submit.

The memory for Spark executors is now handled by the spark.executor.memory option in spark-defaults.conf.  Documentation indicates the environment variable SPARK_EXECUTOR_MEMORY will also work, but I didn't try that.
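
To sketch it out with arbitrary values, the driver memory can go in the environment:

> export SPARK_DRIVER_MEMORY=1G

while the executor memory goes in conf/spark-defaults.conf:

spark.executor.memory            2g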

5) deprecated spark.local.dir

The configuration option spark.local.dir is now apparently deprecated in favor of the SPARK_LOCAL_DIRS environment variable.
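
For example (the directories are just placeholders; a comma-separated list should work here like it did for spark.local.dir):

> export SPARK_LOCAL_DIRS="/ssd1/spark-scratch,/ssd2/spark-scratch"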

