Spark Plugin
The Spark Sahara plugin provides a way to provision Apache Spark clusters on OpenStack in a single click and in an easily repeatable fashion.
Currently Spark is installed in standalone mode, with no YARN or Mesos support.
Images
For cluster provisioning, prepared images should be used. The Spark plugin has been developed and tested with the images generated by diskimage-builder (DIB). Those Ubuntu images already have Cloudera CDH4 HDFS and Apache Spark installed.
The Spark plugin requires an image to be tagged in Sahara Image Registry with two tags: 'spark' and '<Spark version>' (e.g. '1.0.0').
You should also specify the username of the default cloud user in the image. For images generated with the DIB it is 'ubuntu'.
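As a sketch, assuming the image has already been uploaded to Glance and the sahara command-line client is available (the image ID below is a placeholder):

    # register the image with its default user, then add the two required tags
    $ sahara image-register --id <image-id> --username ubuntu
    $ sahara image-add-tag --id <image-id> --tag spark
    $ sahara image-add-tag --id <image-id> --tag 1.0.0

The same registration and tagging can also be done from the Sahara panel of the OpenStack dashboard.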
Note that the Spark cluster is deployed using the scripts available in the Spark distribution, which allow starting and stopping all services (master and slaves). As a consequence (and unlike the CDH HDFS daemons), Spark is not deployed as a standard Ubuntu service, and if the virtual machines are rebooted Spark will not be restarted automatically.
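If that happens, a minimal sketch of a manual restart, assuming the standard Spark 1.0.0 standalone scripts and the /opt/spark installation path described below:

    # on the master node: start-all.sh launches the master and,
    # via ssh, every slave listed in conf/slaves
    # (sudo may or may not be needed, depending on the ownership of /opt/spark)
    $ sudo /opt/spark/sbin/start-all.sh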
Spark configuration
Spark needs only a few parameters to work and has sensible defaults. If needed, they can be changed when creating the Sahara cluster template. No node group options are available.
Once the cluster is ready, connect to the master with ssh using the 'ubuntu' user and the appropriate ssh key. Spark is installed in /opt/spark and should be completely configured and ready to start executing jobs. At the bottom of the cluster information page in the OpenStack dashboard, a link to the Spark web interface is provided.
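As an example, a small smoke test run from the master, assuming the stock Spark 1.0.0 binary layout (the examples jar name and the master URL are placeholders to adapt):

    $ ssh ubuntu@<master-ip>
    # run SparkPi from the bundled examples; 7077 is the default standalone master port
    $ /opt/spark/bin/spark-submit --class org.apache.spark.examples.SparkPi \
        --master spark://<master-hostname>:7077 \
        /opt/spark/lib/spark-examples-*.jar 10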
Cluster Validation
When a user creates a Hadoop cluster using the Spark plugin, the cluster topology requested by the user is verified for consistency.
Currently there are the following limitations in cluster topology for the Spark plugin:
- Cluster must contain exactly one HDFS namenode
- Cluster must contain exactly one Spark master
- Cluster must contain at least one Spark slave
- Cluster must contain at least one HDFS datanode
The tested configuration co-locates the NameNode with the Spark master and places a DataNode with each Spark slave to maximize data locality.
Limitations
For now, cluster scaling and EDP are not supported.
Swift support is not available in Spark. Once it is developed in Spark, it will be possible to add it to this plugin.