Updating features documentation

This change updates the grammar and content of the features page.
Additionally, it moves some of the more configuration-related topics to
the configuration guides, depending on the nature of each feature.

Change-Id: I1489ffbe1fe0f71a3a41a02e46614ab77263b8d8
This commit is contained in:
Michael McCune 2015-04-07 18:41:38 -04:00 committed by Nikolay Starodubtsev
parent e70491d5c9
commit 2729062cef
3 changed files with 537 additions and 421 deletions


@ -5,72 +5,6 @@ This guide addresses specific aspects of Sahara configuration that pertain to
advanced usage. It is divided into sections about various features that can be
utilized, and their related configurations.
Object Storage access using proxy users
---------------------------------------
To improve security for clusters accessing files in Object Storage,
sahara can be configured to use proxy users and delegated trusts for
access. This behavior has been implemented to reduce the need for
storing and distributing user credentials.
The use of proxy users involves creating an Identity domain that will be
designated as the home for these users. Proxy users will be
created on demand by sahara and will only exist during a job execution
which requires Object Storage access. The domain created for the
proxy users must be backed by a driver that allows sahara's admin user to
create new user accounts. This new domain should contain no roles, to limit
the potential access of a proxy user.
Once the domain has been created, sahara must be configured to use it by
adding the domain name and any potential delegated roles that must be used
for Object Storage access to the configuration file. With the
domain enabled in sahara, users will no longer be required to enter
credentials for their data sources and job binaries referenced in
Object Storage.
Detailed instructions
^^^^^^^^^^^^^^^^^^^^^
First a domain must be created in the Identity service to hold proxy
users created by sahara. This domain must have an identity backend driver
that allows for sahara to create new users. The default SQL engine is
sufficient but if your keystone identity is backed by LDAP or similar
then domain specific configurations should be used to ensure sahara's
access. Please see the `Keystone documentation`_ for more information.
.. _Keystone documentation: http://docs.openstack.org/developer/keystone/configuration.html#domain-specific-drivers
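As a brief illustration only (the file locations and driver value below are
assumptions for this sketch, not requirements from this guide), enabling
domain specific drivers in keystone and backing the example ``sahara_proxy``
domain used below with the SQL identity driver could resemble:
.. sourcecode:: cfg
# keystone.conf (assumed path)
[identity]
domain_specific_drivers_enabled = True
domain_config_dir = /etc/keystone/domains
# /etc/keystone/domains/keystone.sahara_proxy.conf (assumed path)
[identity]
driver = keystone.identity.backends.sql.Identity
Consult the `Keystone documentation`_ linked above for the authoritative
option names and values in your release.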
With the domain created, sahara's configuration file should be updated to
include the new domain name and any potential roles that will be needed. For
this example let's assume that the name of the proxy domain is
``sahara_proxy`` and the roles needed by proxy users will be ``Member`` and
``SwiftUser``.
.. sourcecode:: cfg
[DEFAULT]
use_domain_for_proxy_users=True
proxy_user_domain_name=sahara_proxy
proxy_user_role_names=Member,SwiftUser
..
A note on the use of roles. In the context of the proxy user, any roles
specified here are roles intended to be delegated to the proxy user from the
user with access to Object Storage. More specifically, any roles that
are required for Object Storage access by the project owning the object
store must be delegated to the proxy user for authentication to be
successful.
Finally, the stack administrator must ensure that images registered with
sahara have the latest version of the Hadoop swift filesystem plugin
installed. The sources for this plugin can be found in the
`sahara extra repository`_. For more information on images or swift
integration see the sahara documentation sections
:ref:`diskimage-builder-label` and :ref:`swift-integration-label`.
.. _Sahara extra repository: http://github.com/openstack/sahara-extra
.. _custom_network_topologies:
Custom network topologies
@ -106,6 +40,229 @@ a custom network namespace:
[DEFAULT]
proxy_command='ip netns exec ns_for_{network_id} nc {host} {port}'
.. _data_locality_configuration:
Data-locality configuration
---------------------------
Hadoop provides the data-locality feature to enable task tracker and
data nodes the capability of spawning on the same rack, Compute node,
or virtual machine. Sahara exposes this functionality to the user
through a few configuration parameters and user defined topology files.
To enable data-locality, set the ``enable_data_locality`` parameter to
``True`` in the sahara configuration file
.. sourcecode:: cfg
[DEFAULT]
enable_data_locality=True
With data locality enabled, you must now specify the topology files
for the Compute and Object Storage services. These files are
specified in the sahara configuration file as follows:
.. sourcecode:: cfg
[DEFAULT]
compute_topology_file=/etc/sahara/compute.topology
swift_topology_file=/etc/sahara/swift.topology
The ``compute_topology_file`` should contain mappings between Compute
nodes and racks in the following format:
.. sourcecode:: cfg
compute1 /rack1
compute2 /rack2
compute3 /rack2
Note that the Compute node names must be exactly the same as configured in
OpenStack (``host`` column in admin list for instances).
The ``swift_topology_file`` should contain mappings between Object Storage
nodes and racks in the following format:
.. sourcecode:: cfg
node1 /rack1
node2 /rack2
node3 /rack2
Note that the Object Storage node names must be exactly the same as
configured in the object ring. Also, you should ensure that instances
with the task tracker process have direct access to the Object Storage
nodes.
Hadoop versions after 1.2.0 support four-layer topology (for more detail
please see `HADOOP-8468 JIRA issue`_). To enable this feature set the
``enable_hypervisor_awareness`` parameter to ``True`` in the configuration
file. In this case sahara will add the Compute node ID as a second level of
topology for virtual machines.
.. _HADOOP-8468 JIRA issue: https://issues.apache.org/jira/browse/HADOOP-8468
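For example, enabling the four-layer topology alongside the basic
data-locality support combines the options described above in the sahara
configuration file:
.. sourcecode:: cfg
[DEFAULT]
enable_data_locality=True
enable_hypervisor_awareness=True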
.. _distributed-mode-configuration:
Distributed mode configuration
------------------------------
Sahara can be configured to run in a distributed mode that creates a
separation between the API and engine processes. This allows the API
process to remain relatively free to handle requests while offloading
intensive tasks to the engine processes.
The ``sahara-api`` application works as a front-end and serves user
requests. It offloads 'heavy' tasks to the ``sahara-engine`` process
via RPC mechanisms. While the ``sahara-engine`` process could be loaded
with tasks, ``sahara-api`` stays free and hence may quickly respond to
user queries.
If sahara runs on several hosts, the API requests could be
balanced between several ``sahara-api`` hosts using a load balancer.
It is not required to balance load between different ``sahara-engine``
hosts as this will be automatically done via the message broker.
If a single host becomes unavailable, other hosts will continue
serving user requests. Hence better scalability is achieved, along with
some fault tolerance. Note that distributed mode does not provide true
high availability. While the failure of a single host does not
affect the work of the others, all of the operations running on
the failed host will stop. For example, if a cluster scaling is
interrupted, the cluster will be stuck in a half-scaled state. The
cluster might continue working, but it will be impossible to scale it
further or run jobs on it via EDP.
To run sahara in distributed mode pick several hosts on which
you want to run sahara services and follow these steps:
* On each host install and configure sahara using the
`installation guide <../installation.guide.html>`_
except:
* Do not run ``sahara-db-manage`` or launch sahara with ``sahara-all``
* Ensure that each configuration file provides a database connection
string to a single database for all hosts.
* Run ``sahara-db-manage`` as described in the installation guide,
but only on a single (arbitrarily picked) host.
* The ``sahara-api`` and ``sahara-engine`` processes use oslo.messaging to
communicate with each other. You will need to configure it properly on
each host (see below).
* Run ``sahara-api`` and ``sahara-engine`` on the desired hosts. You may
run both processes on the same or separate hosts as long as they are
configured to use the same message broker and database.
To configure oslo.messaging, first you will need to choose a message
broker driver. Currently there are three drivers provided: RabbitMQ, Qpid
or ZeroMQ. For the RabbitMQ or Qpid drivers please see the
:ref:`notification-configuration` documentation for an explanation of
common configuration options.
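As a minimal sketch (the host names, credentials, and database URL shown
here are placeholders rather than values from this guide), each host's
configuration file could point at the shared broker and database as follows:
.. sourcecode:: cfg
[DEFAULT]
rpc_backend=rabbit
[database]
# every sahara-api and sahara-engine host must use the same database
connection=mysql://sahara:SAHARA_DBPASS@database.example.com/sahara
[oslo_messaging_rabbit]
rabbit_host=broker.example.com
rabbit_userid=sahara
rabbit_password=RABBIT_PASS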
For an expanded view of all the options provided by each message broker
driver in oslo.messaging please refer to the options available in the
respective source trees:
* For Rabbit MQ see
* rabbit_opts variable in `impl_rabbit.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/impl_rabbit.py?id=1.4.0#n38>`_
* amqp_opts variable in `amqp.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/amqp.py?id=1.4.0#n37>`_
* For Qpid see
* qpid_opts variable in `impl_qpid.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/impl_qpid.py?id=1.4.0#n40>`_
* amqp_opts variable in `amqp.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/amqp.py?id=1.4.0#n37>`_
* For Zmq see
* zmq_opts variable in `impl_zmq.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/impl_zmq.py?id=1.4.0#n49>`_
* matchmaker_opts variable in `matchmaker.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/matchmaker.py?id=1.4.0#n27>`_
* matchmaker_redis_opts variable in `matchmaker_redis.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/matchmaker_redis.py?id=1.4.0#n26>`_
* matchmaker_opts variable in `matchmaker_ring.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/matchmaker_ring.py?id=1.4.0#n27>`_
These options will also be present in the generated sample configuration
file. For instructions on creating the configuration file please see the
:doc:`configuration.guide`.
External key manager usage (EXPERIMENTAL)
-----------------------------------------
Sahara generates and stores several passwords during the course of operation.
To harden sahara's usage of passwords it can be instructed to use an
external key manager for storage and retrieval of these secrets. To enable
this feature there must first be an OpenStack Key Manager service deployed
within the stack. Currently, the barbican project is the only key manager
supported by sahara.
With a Key Manager service deployed on the stack, sahara must be configured
to enable the external storage of secrets. This is accomplished by editing
the sahara configuration file as follows:
.. sourcecode:: cfg
[DEFAULT]
use_external_key_manager=True
.. TODO (mimccune)
this language should be removed once a new keystone authentication
section has been created in the configuration file.
Additionally, at this time there are two more values which must be provided
to ensure proper access for sahara to the Key Manager service. These are
the Identity domain for the administrative user and the domain for the
administrative project. By default these values will appear as:
.. sourcecode:: cfg
[DEFAULT]
admin_user_domain_name=default
admin_project_domain_name=default
With all of these values configured and the Key Manager service deployed,
sahara will begin storing its secrets in the external manager.
Indirect instance access through proxy nodes
--------------------------------------------
.. warning::
The indirect VMs access feature is in alpha state. We do not
recommend using it in a production environment.
Sahara needs to access instances through SSH during cluster setup. This
access can be obtained a number of different ways (see
:ref:`neutron-nova-network`, :ref:`floating_ip_management`,
:ref:`custom_network_topologies`). Sometimes it is impossible to provide
access to all nodes (because of limited numbers of floating IPs or security
policies). In these cases access can be gained using other nodes of the
cluster as proxy gateways. To enable this set ``is_proxy_gateway=True``
for the node group you want to use as proxy. Sahara will communicate with
all other cluster instances through the instances of this node group.
Note, if ``use_floating_ips=true`` and the cluster contains a node group with
``is_proxy_gateway=True``, the requirement to have ``floating_ip_pool``
specified is applied only to the proxy node group. Other instances will be
accessed through proxy instances using the standard private network.
Note, the Cloudera Hadoop plugin doesn't support access to Cloudera manager
through a proxy node. This means that for CDH clusters only nodes with
the Cloudera manager can be designated as proxy gateway nodes.
Multi region deployment
-----------------------
Sahara supports multi region deployment. To enable this option each
instance of sahara should have the ``os_region_name=<region>``
parameter set in the configuration file. The following example demonstrates
configuring sahara to use the ``RegionOne`` region:
.. sourcecode:: cfg
[DEFAULT]
os_region_name=RegionOne
.. _non-root-users:
Non-root users
--------------
@ -166,39 +323,111 @@ set the following parameter in your sahara configuration file:
For more information on rootwrap please refer to the
`official Rootwrap documentation <https://wiki.openstack.org/wiki/Rootwrap>`_
External key manager usage (EXPERIMENTAL)
-----------------------------------------
Sahara generates and stores several passwords during the course of operation.
To harden sahara's usage of passwords it can be instructed to use an
external key manager for storage and retrieval of these secrets. To enable
this feature there must first be an OpenStack Key Manager service deployed
within the stack. Currently, the barbican project is the only key manager
supported by sahara.
With a Key Manager service deployed on the stack, sahara must be configured
to enable the external storage of secrets. This is accomplished by editing
the configuration file as follows:
.. sourcecode:: cfg
[DEFAULT]
use_external_key_manager=True
.. TODO (mimccune)
this language should be removed once a new keystone authentication
section has been created in the configuration file.
Additionally, at this time there are two more values which must be provided
to ensure proper access for sahara to the Key Manager service. These are
the Identity domain for the administrative user and the domain for the
administrative project. By default these values will appear as:
.. sourcecode:: cfg
[DEFAULT]
admin_user_domain_name=default
admin_project_domain_name=default
With all of these values configured and the Key Manager service deployed,
sahara will begin storing its secrets in the external manager.
Object Storage access using proxy users
---------------------------------------
To improve security for clusters accessing files in Object Storage,
sahara can be configured to use proxy users and delegated trusts for
access. This behavior has been implemented to reduce the need for
storing and distributing user credentials.
The use of proxy users involves creating an Identity domain that will be
designated as the home for these users. Proxy users will be
created on demand by sahara and will only exist during a job execution
which requires Object Storage access. The domain created for the
proxy users must be backed by a driver that allows sahara's admin user to
create new user accounts. This new domain should contain no roles, to limit
the potential access of a proxy user.
Once the domain has been created, sahara must be configured to use it by
adding the domain name and any potential delegated roles that must be used
for Object Storage access to the sahara configuration file. With the
domain enabled in sahara, users will no longer be required to enter
credentials for their data sources and job binaries referenced in
Object Storage.
Detailed instructions
^^^^^^^^^^^^^^^^^^^^^
First a domain must be created in the Identity service to hold proxy
users created by sahara. This domain must have an identity backend driver
that allows for sahara to create new users. The default SQL engine is
sufficient but if your keystone identity is backed by LDAP or similar
then domain specific configurations should be used to ensure sahara's
access. Please see the `Keystone documentation`_ for more information.
.. _Keystone documentation: http://docs.openstack.org/developer/keystone/configuration.html#domain-specific-drivers
With the domain created, sahara's configuration file should be updated to
include the new domain name and any potential roles that will be needed. For
this example let's assume that the name of the proxy domain is
``sahara_proxy`` and the roles needed by proxy users will be ``Member`` and
``SwiftUser``.
.. sourcecode:: cfg
[DEFAULT]
use_domain_for_proxy_users=True
proxy_user_domain_name=sahara_proxy
proxy_user_role_names=Member,SwiftUser
..
A note on the use of roles. In the context of the proxy user, any roles
specified here are roles intended to be delegated to the proxy user from the
user with access to Object Storage. More specifically, any roles that
are required for Object Storage access by the project owning the object
store must be delegated to the proxy user for authentication to be
successful.
Finally, the stack administrator must ensure that images registered with
sahara have the latest version of the Hadoop swift filesystem plugin
installed. The sources for this plugin can be found in the
`sahara extra repository`_. For more information on images or swift
integration see the sahara documentation sections
:ref:`diskimage-builder-label` and :ref:`swift-integration-label`.
.. _Sahara extra repository: http://github.com/openstack/sahara-extra
.. _volume_instance_locality_configuration:
Volume instance locality configuration
--------------------------------------
The Block Storage service provides the ability to define volume instance
locality to ensure that instance volumes are created on the same host
as the hypervisor. The ``InstanceLocalityFilter`` provides the mechanism
for the selection of a storage provider located on the same physical
host as an instance.
To enable this functionality for instances of a specific node group, the
``volume_local_to_instance`` field in the node group template should be
set to ``True`` and some extra configurations are needed:
* The cinder-volume service should be launched on every physical host and at
least one physical host should run both cinder-scheduler and
cinder-volume services.
* ``InstanceLocalityFilter`` should be added to the list of default filters
(``scheduler_default_filters`` in cinder) for the Block Storage
configuration; see the example after this list.
* The Extended Server Attributes extension needs to be active in the Compute
service (this is true by default in nova), so that the
``OS-EXT-SRV-ATTR:host`` property is returned when requesting instance
info.
* The user making the call needs to have sufficient rights for the property to
be returned by the Compute service.
This can be done in either of the following ways:
* by changing nova's ``policy.json`` to allow the user access to the
``extended_server_attributes`` option.
* by designating an account with privileged rights in the cinder
configuration:
.. sourcecode:: cfg
os_privileged_user_name =
os_privileged_user_password =
os_privileged_user_tenant =
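Relating to the filter requirement noted in the list above, a cinder
configuration fragment that appends ``InstanceLocalityFilter`` to the
scheduler defaults might look like the following (the other filter names
shown are cinder's usual defaults and may differ in your deployment):
.. sourcecode:: cfg
# cinder.conf
[DEFAULT]
scheduler_default_filters=AvailabilityZoneFilter,CapacityFilter,CapabilitiesFilter,InstanceLocalityFilter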
Note that if the host has no space for volume creation, the volume will be
created with an ``Error`` state and cannot be used.


@ -5,6 +5,9 @@ This guide covers the steps for a basic configuration of sahara.
It will help you to configure the service in the most simple
manner.
Basic configuration
-------------------
Sahara is packaged with a basic sample configuration file:
``sahara.conf.sample-basic``. This file contains all the essential
parameters that are required for sahara. We recommend creating your
@ -70,8 +73,66 @@ file which control the level of logging output; ``verbose`` and
level will be set to INFO, and with ``debug`` set to ``true`` it will
be set to DEBUG. By default, sahara's log level is set to WARNING.
Sahara notifications configuration
----------------------------------
.. _neutron-nova-network:
Networking configuration
------------------------
By default sahara is configured to use the nova-network implementation
of OpenStack Networking. If an OpenStack cluster uses Neutron,
then the ``use_neutron`` parameter should be set to ``True`` in the
sahara configuration file. Additionally, if the cluster supports network
namespaces the ``use_namespaces`` property can be used to enable their usage.
.. sourcecode:: cfg
[DEFAULT]
use_neutron=True
use_namespaces=True
.. note::
If a user other than ``root`` will be running the Sahara server
instance and namespaces are used, some additional configuration is
required, please see :ref:`non-root-users` for more information.
.. _floating_ip_management:
Floating IP management
++++++++++++++++++++++
During cluster setup sahara must access instances through a secure
shell (SSH). To establish this connection it may use either the fixed
or floating IP address of an instance. By default sahara is configured
to use floating IP addresses for access. This is controlled by the
``use_floating_ips`` configuration parameter. With this setup the user
has two options for ensuring that all instances gain a floating IP
address:
* If using nova-network, it may be configured to assign floating
IP addresses automatically by setting the ``auto_assign_floating_ip``
parameter to ``True`` in the nova configuration file
(usually ``nova.conf``); see the example after this list.
* The user may specify a floating IP address pool for each node
group directly.
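For the nova-network option noted in the list above, a minimal fragment of
the nova configuration file could be:
.. sourcecode:: cfg
# nova.conf (nova-network deployments only)
[DEFAULT]
auto_assign_floating_ip=True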
.. warning::
When using floating IP addresses for management
(``use_floating_ips=True``) **every** instance in the cluster must have
a floating IP address, otherwise sahara will not be able to utilize
that cluster.
If not using floating IP addresses (``use_floating_ips=False``) sahara
will use fixed IP addresses for instance management. When using neutron
for the Networking service the user will be able to choose the
fixed IP network for all instances in a cluster. Whether using nova-network
or neutron it is important to ensure that all hosts running sahara
have access to the fixed IP networks.
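For example, to have sahara manage instances over their fixed IP addresses,
disable the parameter in the sahara configuration file:
.. sourcecode:: cfg
[DEFAULT]
use_floating_ips=False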
.. _notification-configuration:
Notifications configuration
---------------------------
Sahara can be configured to send notifications to the OpenStack
Telemetry module. To enable this functionality the following parameters
@ -129,15 +190,38 @@ in the ``[oslo_messaging_qpid]`` section:
qpid_password=
..
.. _orchestration-configuration:
Orchestration configuration
---------------------------
By default sahara is configured to use the direct engine for instance
creation. This engine makes calls directly to the services required
for instance provisioning. Sahara can be configured to use the OpenStack
Orchestration service for this task instead of the direct engine.
To configure sahara to utilize the Orchestration service for instance
provisioning the ``infrastructure_engine`` parameter should be modified in
the configuration file as follows:
.. sourcecode:: cfg
[DEFAULT]
infrastructure_engine=heat
There is feature parity between the direct and heat infrastructure
engines. We recommend using the heat engine for provisioning, as the
direct engine is planned for deprecation.
.. _policy-configuration-label:
Sahara policy configuration
Policy configuration
---------------------------
Sahara's public API calls may be restricted to certain sets of users by
using a policy configuration file. The location of the policy file(s)
is controlled by the ``policy_file`` and ``policy_dirs`` parameters
in the ``[DEFAULT]`` section. By default sahara will search for
in the ``[oslo_policy]`` section. By default sahara will search for
a ``policy.json`` file in the same directory as the configuration file.
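A minimal sketch of these options (``policy.json`` is the default described
above; the ``policy.d`` directory value is an assumed example):
.. sourcecode:: cfg
[oslo_policy]
policy_file=policy.json
policy_dirs=policy.d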
Examples


@ -1,248 +1,99 @@
Features Overview
=================
Cluster Scaling
---------------
The mechanism of cluster scaling is designed to enable a user to change the
number of running instances without creating a new cluster.
A user may change the number of instances in existing Node Groups or add new Node
Groups.
If a cluster fails to scale properly, all changes will be rolled back.
Swift Integration
-----------------
In order to leverage Swift within Hadoop, including using Swift data sources
from within EDP, Hadoop requires the application of a patch.
For additional information about using Swift with Sahara, including patching
Hadoop and configuring Sahara, please refer to the :doc:`hadoop-swift`
documentation.
Cinder support
--------------
Cinder is a block storage service that can be used as an alternative for an
ephemeral drive. Using Cinder volumes increases reliability of data which is
important for HDFS service.
A user can set how many volumes will be attached to each node in a Node Group
and the size of each volume.
All volumes are attached during Cluster creation/scaling operations.
.. _neutron-nova-network:
Neutron and Nova Network support
--------------------------------
OpenStack clusters may use Nova or Neutron as a networking service. Sahara
supports both, but when deployed a special configuration for networking
should be set explicitly. By default Sahara will behave as if Nova is used.
If an OpenStack cluster uses Neutron, then the ``use_neutron`` property should
be set to ``True`` in the Sahara configuration file. Additionally, if the
cluster supports network namespaces the ``use_namespaces`` property can be
used to enable their usage.
.. sourcecode:: cfg
[DEFAULT]
use_neutron=True
use_namespaces=True
.. note::
If a user other than ``root`` will be running the Sahara server
instance and namespaces are used, some additional configuration is
required, please see the :doc:`advanced.configuration.guide` for more
information.
.. _floating_ip_management:
Floating IP Management
----------------------
Sahara needs to access instances through ssh during a Cluster setup. To
establish a connection Sahara may
use both: fixed and floating IP of an Instance. By default
``use_floating_ips`` parameter is set to ``True``, so
Sahara will use Floating IP of an Instance to connect. In this case, the user has
two options for how to make all instances
get a floating IP:
* Nova Network may be configured to assign floating IPs automatically by
setting ``auto_assign_floating_ip`` to ``True`` in ``nova.conf``
* User may specify a floating IP pool for each Node Group directly.
Note: When using floating IPs for management (``use_floating_ip=True``)
**every** instance in the Cluster should have a floating IP,
otherwise Sahara will not be able to work with it.
If the ``use_floating_ips`` parameter is set to ``False`` Sahara will use
Instances' fixed IPs for management. In this case
the node where Sahara is running should have access to Instances' fixed IP
network. When OpenStack uses Neutron for
networking, a user will be able to choose fixed IP network for all instances
in a Cluster.
This page highlights some of the most prominent features available in
sahara. The guidance provided here is primarily focused on the
runtime aspects of sahara; for discussions about configuring the sahara
server processes please see the :doc:`configuration.guide` and
:doc:`advanced.configuration.guide`.
Anti-affinity
-------------
One of the problems in Hadoop running on OpenStack is that there is no
ability to control where the machine is actually running.
We cannot be sure that two new virtual machines are started on different
physical machines. As a result, any replication with the cluster
One of the problems with running data processing applications on OpenStack
is the inability to control where an instance is actually running. It is
not always possible to ensure that two new virtual machines are started on
different physical machines. As a result, any replication within the cluster
is not reliable because all replicas may turn up on one physical machine.
The anti-affinity feature provides an ability to explicitly tell Sahara to run
specified processes on different compute nodes. This
is especially useful for the Hadoop data node process to make HDFS replicas
reliable.
To remedy this, sahara provides the anti-affinity feature to explicitly
command all instances of the specified processes to spawn on different
Compute nodes. This is especially useful for Hadoop data node processes
to increase HDFS replica reliability.
Starting with the Juno release, Sahara creates server groups with the
``anti-affinity`` policy to enable the anti-affinity feature. Sahara creates one
server group per cluster and assigns all instances with affected processes to
this server group. Refer to the Nova documentation on how server groups work.
Starting with the Juno release, sahara can create server groups with the
``anti-affinity`` policy to enable this feature. Sahara creates one server
group per cluster and assigns all instances with affected processes to
this server group. Refer to the `Nova documentation`_ on how server groups
work.
This feature is supported by all plugins out of the box.
This feature is supported by all plugins out of the box, and can be enabled
during the cluster template creation.
.. _Nova documentation: http://docs.openstack.org/developer/nova
Block Storage support
---------------------
OpenStack Block Storage (cinder) can be used as an alternative for
ephemeral drives on instances. Using Block Storage volumes increases the
reliability of data which is important for HDFS service.
A user can set how many volumes will be attached to each instance in a
node group, and the size of each volume.
All volumes are attached during cluster creation/scaling operations.
Cluster scaling
---------------
Cluster scaling allows users to change the number of running instances
in a cluster without needing to recreate the cluster. Users may
increase or decrease the number of instances in node groups or add
new node groups to existing clusters.
If a cluster fails to scale properly, all changes will be rolled back.
Data-locality
-------------
It is extremely important for data processing to work locally (on the same rack,
OpenStack compute node or even VM). Hadoop supports the data-locality feature and can schedule jobs to
task tracker nodes that are local for input stream. In this case task tracker
could communicate directly with the local data node.
Sahara supports topology configuration for HDFS and Swift data sources.
It is extremely important for data processing applications to perform
work locally on the same rack, OpenStack Compute node, or virtual
machine. Hadoop supports a data-locality feature and can schedule jobs
to task tracker nodes that are local for the input stream. In this
manner the task tracker nodes can communicate directly with the local
data nodes.
To enable data-locality set ``enable_data_locality`` parameter to ``True`` in
Sahara configuration file
Sahara supports topology configuration for HDFS and Object Storage
data sources. For more information on configuring this option please
see the :ref:`data_locality_configuration` documentation.
.. sourcecode:: cfg
enable_data_locality=True
In this case two files with topology must be provided to Sahara.
Options ``compute_topology_file`` and ``swift_topology_file`` parameters
control location of files with compute and swift nodes topology descriptions
correspondingly.
``compute_topology_file`` should contain mapping between compute nodes and
racks in the following format:
.. sourcecode:: cfg
compute1 /rack1
compute2 /rack2
compute3 /rack2
Note that the compute node name must be exactly the same as configured in
OpenStack (``host`` column in admin list for instances).
``swift_topology_file`` should contain mapping between swift nodes and
racks in the following format:
.. sourcecode:: cfg
node1 /rack1
node2 /rack2
node3 /rack2
Note that the swift node names must be exactly the same as configured in the object.builder
swift ring. Also make sure that VMs with the task tracker service have direct access
to swift nodes.
Hadoop versions after 1.2.0 support four-layer topology
(https://issues.apache.org/jira/browse/HADOOP-8468). To enable this feature
set ``enable_hypervisor_awareness`` option to ``True`` in Sahara configuration
file. In this case Sahara will add the compute node ID as a second level of
topology for Virtual Machines.
Volume-to-instance locality
---------------------------
Having an instance and an attached volume on the same physical host can be very
helpful in order to achieve high-performance disk I/O. To achieve this,
volume-to-instance locality should be used.
Cinder has ``InstanceLocalityFilter`` which enables selection of a storage
back-end located on the host where the instance's hypervisor is running. It
allows volumes to be created on the same physical host as the instance.
To enable this functionality for instances of a specific node group, the
``volume_local_to_instance`` field in node group template should be set to
``True`` and some extra configurations are needed:
* Cinder-volume service should be launched on every physical host and at least
one physical host should run both cinder-scheduler and cinder-volume services.
* ``InstanceLocalityFilter`` should be added to the list of default filters
(``scheduler_default_filters`` in Cinder config).
* The Extended Server Attributes extension needs to be active in Nova
(this is true by default), so that the ``OS-EXT-SRV-ATTR:host`` property is
returned when requesting instance info.
* The user making the call needs to have sufficient rights for the property to
be returned by Nova.
This can be made:
* by changing Nova's ``policy.json`` (the ``extended_server_attributes`` option)
* by setting an account with privileged rights in Cinder config:
.. sourcecode:: cfg
os_privileged_user_name =
os_privileged_user_password =
os_privileged_user_tenant =
It should be noted that in a situation when the host has no space for volume
creation, the created volume will have an ``Error`` state and can not be used.
Security group management
-------------------------
Sahara allows you to control which security groups will be used for created
instances. This can be done by providing the ``security_groups`` parameter for
the Node Group or Node Group Template. By default an empty list is used that
will result in using the default security group.
Sahara may also create a security group for instances in the node group
automatically. This security group will only have open ports which are
required by instance processes or the Sahara engine. This option is useful
for development and secured from outside environments, but for production
environments it is recommended to control the security group policy manually.
Heat Integration
Distributed Mode
----------------
Sahara may use
`OpenStack Orchestration engine <https://wiki.openstack.org/wiki/Heat>`_
(aka Heat) to provision nodes for Hadoop cluster.
To make Sahara work with Heat the following steps are required:
The :doc:`installation.guide` suggests launching sahara as a single
``sahara-all`` process. It is also possible to run sahara in distributed
mode with ``sahara-api`` and ``sahara-engine`` processes running on several
machines simultaneously. Running in distributed mode allows sahara to
offload intensive tasks to the engine processes while keeping the API
process free to handle requests.
* Your OpenStack installation must have 'orchestration' service up and running
* Sahara must contain the following configuration parameter in *sahara.conf*:
.. sourcecode:: cfg
# An engine which will be used to provision infrastructure for Hadoop cluster. (string value)
infrastructure_engine=heat
There is a feature parity between direct and heat infrastructure engines. It is
recommended to use the heat engine since the direct engine will be deprecated at some
point.
Multi region deployment
-----------------------
Sahara supports multi region deployment. In this case, each instance of Sahara
should have the ``os_region_name=<region>`` property set in the
configuration file.
For an expanded discussion of configuring sahara to run in distributed
mode please see the :ref:`distributed-mode-configuration` documentation.
Hadoop HDFS High Availability
-----------------------------
Hadoop HDFS High Availability (HDFS HA) uses 2 Namenodes in an active/standby
architecture to ensure that HDFS will continue to work even when the active namenode fails.
The High Availability is achieved by using a set of JournalNodes and Zookeeper servers along
with ZooKeeper Failover Controllers (ZKFC) and some additional configurations and changes to
HDFS and other services that use HDFS.
Currently HDFS HA is only supported with the HDP 2.0.6 plugin. The feature is enabled through
a cluster_configs parameter in the cluster's JSON:
Hadoop HDFS High Availability (HDFS HA) provides an architecture to ensure
that HDFS will continue to work in the event of an active namenode failure.
It uses 2 namenodes in an active/standby configuration to provide this
availability.
High availability is achieved by using a set of journalnodes and Zookeeper
servers along with ZooKeeper Failover Controllers (ZKFC) and additional
configuration changes to HDFS and other services that use HDFS.
Currently HDFS HA is only supported with the HDP 2.0.6 plugin. The feature
is enabled through a ``cluster_configs`` parameter in the cluster's JSON:
.. sourcecode:: cfg
@ -252,9 +103,38 @@ a cluster_configs parameter in the cluster's JSON:
}
}
Networking support
------------------
Sahara supports both the nova-network and neutron implementations of
OpenStack Networking. By default sahara is configured to behave as if
the nova-network implementation is available. For OpenStack installations
that are using the neutron project please see :ref:`neutron-nova-network`.
Object Storage support
----------------------
Sahara can use OpenStack Object Storage (swift) to store job binaries and data
sources utilized by its job executions and clusters. In order to
leverage this support within Hadoop, including using Object Storage
for data sources for EDP, Hadoop requires the application of
a patch. For additional information about enabling this support,
including patching Hadoop and configuring sahara, please refer to
the :doc:`hadoop-swift` documentation.
Orchestration support
---------------------
Sahara may use the
`OpenStack Orchestration engine <https://wiki.openstack.org/wiki/Heat>`_
(heat) to provision nodes for clusters. For more information about
enabling Orchestration usage in sahara please see
:ref:`orchestration-configuration`.
Plugin Capabilities
-------------------
The below tables provides a plugin capability matrix:
The following table provides a plugin capability matrix:
+--------------------------+---------+----------+----------+-------+
| | Plugin |
@ -274,111 +154,34 @@ The below tables provides a plugin capability matrix:
| EDP | x | x | x | x |
+--------------------------+---------+----------+----------+-------+
Running Sahara in Distributed Mode
----------------------------------
Security group management
-------------------------
The :doc:`installation.guide` suggests to launch
Sahara as a single 'sahara-all' process. It is also possible to run Sahara
in distributed mode with 'sahara-api' and 'sahara-engine' processes running
on several machines simultaneously.
.. TODO (mimccune)
This section could use an example to show how security groups are
used.
Sahara-api works as a front-end and serves users' requests. It
offloads 'heavy' tasks to the sahara-engine via RPC mechanism. While the
sahara-engine could be loaded, sahara-api by design stays free
and hence may quickly respond on user queries.
Sahara allows you to control which security groups will be used for created
instances. This can be done by providing the ``security_groups`` parameter for
the node group or node group template. The default for this option is an
empty list, which will result in the default project security group being
used for the instances.
If Sahara runs on several machines, the API requests could be
balanced between several sahara-api instances using a load balancer.
It is not required to balance load between different sahara-engine
instances, as that will be automatically done via a message queue.
Sahara may also create a security group for instances in the node group
automatically. This security group will only contain open ports for required
instance processes and the sahara engine. This option is useful
for development and for when your installation is secured from outside
environments. For production environments we recommend controlling the
security group policy manually.
If a single machine goes down, others will continue serving
users' requests. Hence a better scalability is achieved and some
fault tolerance as well. Note that the proposed solution is not
a true High Availability. While failure of a single machine does not
affect work of other machines, all of the operations running on
the failed machine will stop. For example, if a cluster
scaling is interrupted, the cluster will be stuck in a half-scaled state.
The cluster will probably continue working, but it will be impossible
to scale it further or run jobs on it via EDP.
Volume-to-instance locality
---------------------------
To run Sahara in distributed mode pick several machines on which
you want to run Sahara services and follow these steps:
Having an instance and an attached volume on the same physical host can
be very helpful in order to achieve high-performance disk I/O operations.
To achieve this, sahara provides access to the Block Storage
volume instance locality functionality.
* On each machine install and configure Sahara using the
`installation guide <../installation.guide.html>`_
except:
* Do not run 'sahara-db-manage' or launch Sahara with 'sahara-all'
* Make sure sahara.conf provides database connection string to a
single database on all machines.
* Run 'sahara-db-manage' as described in the installation guide,
but only on a single (arbitrarily picked) machine.
* sahara-api and sahara-engine processes use oslo.messaging to
communicate with each other. You need to configure it properly on
each node (see below).
* run sahara-api and sahara-engine on the desired nodes. On a node
you can run both sahara-api and sahara-engine or you can run them on
separate nodes. It does not matter as long as they are configured
to use the same message broker and database.
To configure oslo.messaging, first you need to pick the driver you are
going to use. Right now three drivers are provided: Rabbit MQ, Qpid or Zmq.
To use Rabbit MQ or Qpid driver, you will have to setup messaging broker.
The picked driver must be supplied in ``sahara.conf`` in
``[DEFAULT]/rpc_backend`` parameter. Use one the following values:
``rabbit``, ``qpid`` or ``zmq``. Next you have to supply
driver-specific options.
Unfortunately, right now there is no documentation with a description of
drivers' configuration. The options are available only in source code.
* For Rabbit MQ see
* rabbit_opts variable in `impl_rabbit.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/impl_rabbit.py?id=1.4.0#n38>`_
* amqp_opts variable in `amqp.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/amqp.py?id=1.4.0#n37>`_
* For Qpid see
* qpid_opts variable in `impl_qpid.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/impl_qpid.py?id=1.4.0#n40>`_
* amqp_opts variable in `amqp.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/amqp.py?id=1.4.0#n37>`_
* For Zmq see
* zmq_opts variable in `impl_zmq.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/impl_zmq.py?id=1.4.0#n49>`_
* matchmaker_opts variable in `matchmaker.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/matchmaker.py?id=1.4.0#n27>`_
* matchmaker_redis_opts variable in `matchmaker_redis.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/matchmaker_redis.py?id=1.4.0#n26>`_
* matchmaker_opts variable in `matchmaker_ring.py <https://git.openstack.org/cgit/openstack/oslo.messaging/tree/oslo/messaging/_drivers/matchmaker_ring.py?id=1.4.0#n27>`_
You can find the same options defined in ``sahara.conf.sample``. You can use
it to find section names for each option (matchmaker options are
defined not in ``[DEFAULT]``)
Managing instances with limited access
--------------------------------------
.. warning::
The indirect VMs access feature is in alpha state. We do not
recommend using it in a production environment.
Sahara needs to access instances through ssh during a Cluster setup. This
could be obtained by a number of ways (see :ref:`neutron-nova-network`,
:ref:`floating_ip_management`, :ref:`custom_network_topologies`). But
sometimes it is impossible to provide access to all nodes (because of limited
numbers of floating IPs or security policies). In this case
access can be gained using other nodes of the cluster. To do that set
``is_proxy_gateway=True`` for the node group you want to use as proxy. In this
case Sahara will communicate with all other instances via instances of this
node group.
Note, if ``use_floating_ips=true`` and the cluster contains a node group with
``is_proxy_gateway=True``, requirement to have ``floating_ip_pool`` specified
is applied only to the proxy node group. Other instances will be accessed via
proxy instances using standard private network.
Note, Cloudera hadoop plugin doesn't support access to Cloudera manager via
proxy node. This means that for CDH clusters only the node with the manager
could be a proxy gateway node.
For more information on using volume instance locality with sahara,
please see the :ref:`volume_instance_locality_configuration`
documentation.