The conversion from Spark 0.9.1 to Spark 1.0.0 was a bit annoying, as many changes had happened.
Here is a list of the changes I thought were worth mentioning. I hope it helps others.
1) Running examples differences
In Spark 0.9.1, you would run one of the Spark examples (such as SparkPi) like this:
> bin/run-example org.apache.spark.examples.SparkPi spark://SPARKMASTER:7077
In Spark 1.0.0, running examples through run-example requires the Spark master to be specified through the MASTER environment variable and not on the command line; otherwise you'll get a bad input error/exception. So for Spark 1.0.0, you'll want to do something like this instead:
> export MASTER="spark://SPARKMASTER:7077"
> bin/run-example org.apache.spark.examples.SparkPi
If you don't set the MASTER environment variable, run-example will assume you want to run the example locally.
Note that setting the MASTER environment variable is specific to the run-example script. It won't pick up the default value from the new spark-defaults.conf file.
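Since run-example ignores spark-defaults.conf, one workaround is to copy the master setting into MASTER yourself before launching an example. The following is only a sketch: master_from_conf is a hypothetical helper (not part of Spark), and it assumes the conf file uses whitespace-separated "key value" lines and lives at conf/spark-defaults.conf.

```shell
# Hypothetical helper: print the value of the first spark.master line.
# Assumes whitespace-separated "key value" lines in the conf file.
master_from_conf() {
  awk '$1 == "spark.master" { print $2; exit }' "$1"
}

CONF_FILE="conf/spark-defaults.conf"   # assumed location, relative to the Spark directory

# Only fill in MASTER when it is not already set and the conf file exists.
if [ -z "${MASTER:-}" ] && [ -f "$CONF_FILE" ]; then
  found="$(master_from_conf "$CONF_FILE")"
  if [ -n "$found" ]; then
    export MASTER="$found"
  fi
fi

# Then launch the example as usual:
# bin/run-example org.apache.spark.examples.SparkPi
```

If MASTER is already exported, the snippet leaves it alone, so an explicit setting still wins.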
2) spark-submit script
spark-submit is a new wrapper script for submitting Spark jobs. Although you can still use spark-class directly, spark-submit is now the primary job submission script. It has the following usage:
> bin/spark-submit --class JOBCLASSTORUN [spark-submit options] APPLICATIONJAR [application args]
Note that the job jar is now passed in on the command line, unlike before with spark-class.
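To make the argument order concrete, here is a small sketch that assembles a spark-submit command line and prints it instead of running it. The class name, jar path and input.txt argument are made up for illustration; --class and --master are real spark-submit options.

```shell
# Placeholder values for illustration; substitute your own.
APP_CLASS="com.example.MyJob"            # hypothetical main class
APP_JAR="target/my-job.jar"              # hypothetical application jar
MASTER_URL="spark://SPARKMASTER:7077"

# Order matters: spark-submit options first, then the application jar,
# then any arguments meant for the application itself.
SUBMIT_CMD="bin/spark-submit --class $APP_CLASS --master $MASTER_URL $APP_JAR input.txt"

# Print the command so it can be checked before actually running it.
echo "$SUBMIT_CMD"
```

Anything placed after the jar is handed to the application, not to spark-submit.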
There are all sorts of new options in spark-submit. Here are some of note, taken from the --help output:
--master MASTER_URL        spark://host:port, mesos://host:port, yarn, or local.
--class CLASS_NAME         Your application's main class (for Java / Scala apps).
--jars JARS                Comma-separated list of local jars to include on the driver and executor classpaths.
--properties-file FILE     Path to a file from which to load extra properties. If not specified, this will look for conf/spark-defaults.conf.
--driver-memory MEM        Memory for driver (e.g. 1000M, 2G) (Default: 512M).
--driver-java-options      Extra Java options to pass to the driver.
--driver-library-path      Extra library path entries to pass to the driver.
--driver-class-path        Extra class path entries to pass to the driver. Note that jars added with --jars are automatically included in the classpath.
But you probably won't end up using these; you'll more likely use ...
3) spark-defaults.conf file
Previously, options were configured through SPARK_JAVA_OPTS, but that is now deprecated. Everything should now be done through the spark-defaults.conf file, which spark-submit reads and loads when you submit a job. By default it is read from conf/spark-defaults.conf, but that can be changed with the --properties-file option. In addition, anything previously set through SPARK_CLASSPATH or SPARK_LIBRARY_PATH should now be set through spark-defaults.conf as well.
Here are several options for spark-defaults.conf of particular note; the full list is in the Spark documentation.
spark.master                     set Spark master, e.g. spark://SPARKMASTER:7077
spark.executor.memory            set executor memory, e.g. 1024m, see more below
spark.executor.extraClassPath    what used to be set by SPARK_CLASSPATH
spark.executor.extraLibraryPath  what used to be set by SPARK_LIBRARY_PATH
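As a rough illustration of what such a file might look like, the snippet below writes a sample one. Every value and path in it is a placeholder, and CONF_OUT is a made-up variable so that nothing in your real conf directory gets overwritten.

```shell
# Hypothetical output path; copy the result to conf/spark-defaults.conf when happy with it.
CONF_OUT="${CONF_OUT:-./spark-defaults.conf.example}"

# One "key value" pair per line; all values below are placeholders.
cat > "$CONF_OUT" <<'EOF'
spark.master                     spark://SPARKMASTER:7077
spark.executor.memory            1024m
spark.executor.extraClassPath    /path/to/extra/classes
spark.executor.extraLibraryPath  /path/to/native/libs
EOF

# To use it from a non-default location, pass it explicitly:
# bin/spark-submit --properties-file "$CONF_OUT" --class JOBCLASSTORUN APPLICATIONJAR
```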
4) deprecated SPARK_MEM environment variable
The SPARK_MEM environment variable has been deprecated and replaced by two new configurations, so users can configure Spark executor memory and driver memory separately.
The SPARK_DRIVER_MEMORY environment variable sets the memory for your Spark driver. This can also be handled via the --driver-memory option in spark-submit.
The memory for Spark executors is now handled by the spark.executor.memory option in spark-defaults.conf. Documentation indicates the environment variable SPARK_EXECUTOR_MEMORY will also work, but I didn't try that.
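If you have an old SPARK_MEM setting lying around, a migration could look roughly like the sketch below. It simply reuses the old value for both new settings, which is an assumption about what you want; the 1024m fallback is a placeholder.

```shell
# Reuse the old SPARK_MEM value if present; 1024m is only a placeholder fallback.
OLD_MEM="${SPARK_MEM:-1024m}"

# Driver memory: environment variable form.
export SPARK_DRIVER_MEMORY="$OLD_MEM"
# Per-job alternative: bin/spark-submit --driver-memory <value> ...

# Executor memory: the line to add to spark-defaults.conf.
EXECUTOR_MEM_LINE="spark.executor.memory $OLD_MEM"
echo "$EXECUTOR_MEM_LINE"
```

Once the two are split like this, the driver and executor values can be tuned independently.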
5) deprecated spark.local.dir
The configuration option spark.local.dir is now apparently deprecated in favor of the SPARK_LOCAL_DIRS environment variable.
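A minimal sketch of setting it, for example from conf/spark-env.sh. The directories are placeholders and SCRATCH_ROOT is a made-up variable; as I understand it, the value is a comma-separated list just like spark.local.dir was.

```shell
# Hypothetical scratch root; in practice point these at fast local disks.
SCRATCH_ROOT="${SCRATCH_ROOT:-/tmp/spark-scratch}"

# Make sure the directories exist before Spark tries to use them.
mkdir -p "$SCRATCH_ROOT/disk1" "$SCRATCH_ROOT/disk2"

# Comma-separated list of scratch directories.
export SPARK_LOCAL_DIRS="$SCRATCH_ROOT/disk1,$SCRATCH_ROOT/disk2"
```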