Features Overview
Cluster Scaling
Cluster scaling enables a user to change the number of running instances without creating a new cluster. A user may change the number of instances in existing Node Groups or add new Node Groups.
If a cluster fails to scale properly, all changes will be rolled back.
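As a sketch, a scale request sent to the Sahara REST API might carry a body like the following (the node group names, count values and template id are illustrative placeholders; consult the Sahara API reference for the exact schema):

```json
{
    "resize_node_groups": [
        {"name": "worker", "count": 4}
    ],
    "add_node_groups": [
        {
            "name": "new-workers",
            "count": 2,
            "node_group_template_id": "<template-id>"
        }
    ]
}
```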
Swift Integration
In order to leverage Swift within Hadoop, including using Swift data
sources from within EDP, Hadoop requires the application of a patch. For
additional information about using Swift with Sahara, including patching
Hadoop and configuring Sahara, please refer to the hadoop-swift
documentation.
Cinder support
Cinder is a block storage service that can be used as an alternative to an ephemeral drive. Using Cinder volumes increases the reliability of data, which is important for the HDFS service.
A user can set how many volumes will be attached to each node in a Node Group and the size of each volume.
All volumes are attached during Cluster creation/scaling operations.
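For example, a Node Group Template may request two 100 GB Cinder volumes per node; a hypothetical template fragment (the name and values are illustrative) could look like:

```json
{
    "name": "worker",
    "volumes_per_node": 2,
    "volumes_size": 100
}
```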
Neutron and Nova Network support
OpenStack clusters may use either Nova or Neutron as the networking service. Sahara supports both, but the networking configuration must be set explicitly at deployment time. By default Sahara behaves as if Nova is used. If an OpenStack cluster uses Neutron, the use_neutron property should be set to True in the Sahara configuration file. Additionally, if the cluster supports network namespaces, the use_namespaces property can be used to enable their usage.
[DEFAULT]
use_neutron=True
use_namespaces=True
Note
If a user other than root will be running the Sahara server instance and namespaces are used, some additional configuration is required; please see the advanced.configuration.guide for more information.
Floating IP Management
Sahara needs to access instances through ssh during cluster setup. To establish a connection Sahara may use either the fixed or the floating IP of an instance. By default the use_floating_ips parameter is set to True, so Sahara will use an instance's floating IP to connect. In this case, the user has two options for ensuring that all instances get a floating IP:
- Nova Network may be configured to assign floating IPs automatically by setting auto_assign_floating_ip to True in nova.conf
- The user may specify a floating IP pool for each Node Group directly.
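The first option corresponds to the following nova.conf setting:

```ini
[DEFAULT]
auto_assign_floating_ip=True
```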
Note: When using floating IPs for management (use_floating_ips=True), every instance in the cluster must have a floating IP, otherwise Sahara will not be able to work with it.
If the use_floating_ips parameter is set to False, Sahara will use the instances' fixed IPs for management. In this case the node where Sahara is running should have access to the instances' fixed IP network. When OpenStack uses Neutron for networking, the user will be able to choose a fixed IP network for all instances in a cluster.
Anti-affinity
One of the problems with Hadoop running on OpenStack is that there is no ability to control where a machine actually runs. We cannot be sure that two new virtual machines are started on different physical machines. As a result, any replication within the cluster is not reliable because all replicas may end up on one physical machine. The anti-affinity feature provides the ability to explicitly tell Sahara to run specified processes on different compute nodes. This is especially useful for the Hadoop data node process, to make HDFS replicas reliable.
Starting with the Juno release, Sahara creates server groups with the
anti-affinity
policy to enable the anti-affinity feature.
Sahara creates one server group per cluster and assigns all instances
with affected processes to this server group. Refer to the Nova
documentation on how server groups work.
This feature is supported by all plugins out of the box.
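For example, to request anti-affinity for the data node process, the cluster (or cluster template) JSON may list the affected process under anti_affinity; a sketch (see the Sahara API reference for the exact schema):

```json
{
    "anti_affinity": ["datanode"]
}
```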
Data-locality
It is extremely important for data processing to work locally (on the same rack, OpenStack compute node or even VM). Hadoop supports the data-locality feature and can schedule jobs to task tracker nodes that are local to the input stream. In this case the task tracker can communicate directly with the local data node.
Sahara supports topology configuration for HDFS and Swift data sources.
To enable data-locality, set the enable_data_locality parameter to True in the Sahara configuration file:
enable_data_locality=True
In this case two topology files must be provided to Sahara. The compute_topology_file and swift_topology_file parameters control the location of the files with the compute and swift node topology descriptions respectively. compute_topology_file should contain the mapping between compute nodes and racks in the following format:
compute1 /rack1
compute2 /rack2
compute3 /rack2
Note that the compute node name must be exactly the same as configured in OpenStack (the host column in the admin list of instances).
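The topology file format is simply whitespace-separated node/rack pairs, one per line. As an illustration of the format only (this parser is not part of Sahara), a minimal sketch:

```python
def parse_topology_file(text):
    """Parse a topology file ("<node> <rack>" per line)
    into a node -> rack mapping."""
    topology = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        node, rack = line.split()
        topology[node] = rack
    return topology

sample = """\
compute1 /rack1
compute2 /rack2
compute3 /rack2
"""
print(parse_topology_file(sample))
```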
swift_topology_file should contain the mapping between swift nodes and racks in the following format:
node1 /rack1
node2 /rack2
node3 /rack2
Note that the swift node names must be exactly the same as configured in the object.builder swift ring. Also make sure that VMs with the task tracker service have direct access to the swift nodes.
Hadoop versions after 1.2.0 support four-layer topology (https://issues.apache.org/jira/browse/HADOOP-8468).
To enable this feature, set the enable_hypervisor_awareness option to True in the Sahara configuration file. In this case Sahara will add the compute node ID as a second level of topology for virtual machines.
Security group management
Sahara allows you to control which security groups will be used for created instances. This can be done by providing the security_groups parameter for the Node Group or Node Group Template. By default an empty list is used, which results in using the default security group.
Sahara may also create a security group for instances in the node group automatically. This security group will only have open ports which are required by instance processes or the Sahara engine. This option is useful for development and for environments secured from outside access, but for production environments it is recommended to control the security group policy manually.
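A hypothetical Node Group Template fragment combining both options (the group name is illustrative, and the automatic option is assumed to be controlled by the auto_security_group flag):

```json
{
    "name": "worker",
    "auto_security_group": true,
    "security_groups": ["hadoop-extra"]
}
```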
Heat Integration
Sahara may use the OpenStack Orchestration engine (aka Heat) to provision nodes for a Hadoop cluster. To make Sahara work with Heat the following steps are required:
- Your OpenStack installation must have 'orchestration' service up and running
- Sahara must contain the following configuration parameter in sahara.conf:
# An engine which will be used to provision infrastructure for Hadoop cluster. (string value)
infrastructure_engine=heat
There is feature parity between the direct and heat infrastructure engines. It is recommended to use the heat engine, since the direct engine will be deprecated at some point.
Multi region deployment
Sahara supports multi region deployment. In this case, each instance of Sahara should have the os_region_name=<region> property set in the configuration file.
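For example, an instance serving a region named RegionOne (the region name is illustrative) would have:

```ini
[DEFAULT]
os_region_name=RegionOne
```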
Hadoop HDFS High Availability
Hadoop HDFS High Availability (HDFS HA) uses two NameNodes in an active/standby architecture to ensure that HDFS will continue to work even when the active NameNode fails. High Availability is achieved by using a set of JournalNodes and ZooKeeper servers along with ZooKeeper Failover Controllers (ZKFC) and some additional configurations and changes to HDFS and other services that use HDFS.
Currently HDFS HA is only supported with the HDP 2.0.6 plugin. The feature is enabled through a cluster_configs parameter in the cluster's JSON:
"cluster_configs": {
"HDFSHA": {
"hdfs.nnha": true
}
}
Plugin Capabilities
The table below provides a plugin capability matrix:

| Feature | Vanilla | HDP | Cloudera | Spark |
|---|---|---|---|---|
| Nova and Neutron network | x | x | x | x |
| Cluster Scaling | x | Scale Up | x | x |
| Swift Integration | x | x | x | N/A |
| Cinder Support | x | x | x | x |
| Data Locality | x | x | N/A | x |
| EDP | x | x | x | x |
Running Sahara in Distributed Mode
The installation.guide suggests launching Sahara as a single 'sahara-all' process. It is also possible to run Sahara in distributed mode with 'sahara-api' and 'sahara-engine' processes running on several machines simultaneously.
sahara-api works as a front-end and serves users' requests. It offloads 'heavy' tasks to sahara-engine via an RPC mechanism. While sahara-engine may be under load, sahara-api by design stays free and hence can quickly respond to user queries.
If Sahara runs on several machines, API requests can be balanced between several sahara-api instances using a load balancer. It is not required to balance the load between different sahara-engine instances, as that is done automatically via the message queue.
If a single machine goes down, others will continue serving users' requests. Hence a better scalability is achieved and some fault tolerance as well. Note that the proposed solution is not a true High Availability. While failure of a single machine does not affect work of other machines, all of the operations running on the failed machine will stop. For example, if a cluster scaling is interrupted, the cluster will be stuck in a half-scaled state. The cluster will probably continue working, but it will be impossible to scale it further or run jobs on it via EDP.
To run Sahara in distributed mode pick several machines on which you want to run Sahara services and follow these steps:
On each machine install and configure Sahara using the installation guide, except:
- Do not run 'sahara-db-manage' or launch Sahara with 'sahara-all'
- Make sure sahara.conf provides a database connection string pointing to a single database shared by all machines.
Run 'sahara-db-manage' as described in the installation guide, but only on a single (arbitrarily picked) machine.
sahara-api and sahara-engine processes use oslo.messaging to communicate with each other. You need to configure it properly on each node (see below).
Run sahara-api and sahara-engine on the desired nodes. On a node you can run both sahara-api and sahara-engine, or you can run them on separate nodes. It does not matter as long as they are configured to use the same message broker and database.
To configure oslo.messaging, first pick the driver you are going to use. Right now three drivers are provided: RabbitMQ, Qpid and ZeroMQ. To use the RabbitMQ or Qpid driver, you will have to set up a messaging broker. The chosen driver must be supplied in sahara.conf in the [DEFAULT]/rpc_backend parameter. Use one of the following values: rabbit, qpid or zmq. Next you have to supply driver-specific options.
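As a sketch, a RabbitMQ-based configuration might look like this (the host and credentials are placeholders, and the rabbit_* option names are assumed from the oslo.messaging Rabbit driver):

```ini
[DEFAULT]
rpc_backend=rabbit
rabbit_host=192.168.1.10
rabbit_port=5672
rabbit_userid=sahara
rabbit_password=secret
```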
Unfortunately, right now there is no documentation with a description of drivers' configuration. The options are available only in source code.
- For Rabbit MQ see
- rabbit_opts variable in impl_rabbit.py
- amqp_opts variable in amqp.py
- For Qpid see
- qpid_opts variable in impl_qpid.py
- amqp_opts variable in amqp.py
- For Zmq see
- zmq_opts variable in impl_zmq.py
- matchmaker_opts variable in matchmaker.py
- matchmaker_redis_opts variable in matchmaker_redis.py
- matchmaker_opts variable in matchmaker_ring.py
You can find the same options defined in sahara.conf.sample. You can use it to find the section names for each option (the matchmaker options are not defined in [DEFAULT]).
Managing instances with limited access
Warning
The indirect VMs access feature is in alpha state. We do not recommend using it in a production environment.
Sahara needs to access instances through ssh during cluster setup. This can be achieved in a number of ways (see neutron-nova-network, floating_ip_management, custom_network_topologies).
But sometimes it is impossible to provide access to all nodes (because of a limited number of floating IPs or because of security policies). In this case access can be gained through other nodes of the cluster. To do that, set is_proxy_gateway=True for the node group you want to use as a proxy. In this case Sahara will communicate with all other instances via instances of this node group.
Note: if use_floating_ips=True and the cluster contains a node group with is_proxy_gateway=True, the requirement to have a floating_ip_pool specified applies only to the proxy node group. Other instances will be accessed via proxy instances using the standard private network.
Note: the Cloudera Hadoop plugin doesn't support access to the Cloudera Manager via a proxy node. This means that for a CDH cluster only the node with the manager can be a proxy gateway node.
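The proxy gateway setup described above can be sketched as a node group fragment (the group name and floating IP pool are illustrative):

```json
{
    "name": "master",
    "is_proxy_gateway": true,
    "floating_ip_pool": "public"
}
```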