security-doc/security-guide/section_data-processing-introduction-to-data-processing.xml
Michael McCune e958f07f31 Adding data processing chapter
This change introduces the data processing service chapter as chapter 13
in the security guide.

Changes
* adding data processing chapter to index
* adding data processing chapter file
* adding introduction section file
* adding architecture image
* adding deployment section file
* adding configuration and hardening section file
* adding case studies section file
* adding data processing to the introduction to openstack section

Change-Id: I50c5066373f7c9bd75eb956cbb163f27d6a63058
Closes-bug: 1415218
2015-02-19 16:59:14 -05:00


<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
xmlns:xi="http://www.w3.org/2001/XInclude"
xmlns:xlink="http://www.w3.org/1999/xlink"
version="5.0"
xml:id="data-processing-introduction-to-data-processing">
<?dbhtml stop-chunking?>
<title>Introduction to Data processing</title>
<para>
The Data processing service for OpenStack (sahara) provides a platform
for provisioning and managing clusters of instances that run processing
frameworks such as Hadoop and Spark. Through the OpenStack dashboard
or REST API, users can upload and execute framework
applications, which may access data in object storage or external
providers. The data processing controller uses the Orchestration
service to create clusters of instances, which may exist as
long-running groups that can grow and shrink as requested, or as
transient groups created for a single workload.
</para>
<para>
The service controller is responsible for creating, maintaining,
and destroying any instances created for its clusters. The controller
uses the Networking service to establish network paths between
itself and the cluster instances. It also manages the deployment
and life cycle of user applications that are to be run on the
clusters. The instances within a cluster contain the core of a
framework's processing engine, and the Data processing service provides
several options for creating and managing the connections to these
instances.
</para>
<para>
Data processing resources (clusters, jobs, and data sources) are
segregated by projects defined within the Identity service. These
resources are shared within a project, so it is important to
understand the access needs of those who are using the
service. Activities within projects (for example, launching clusters
or uploading jobs) can be restricted further through the use of
role-based access controls.
</para>
<para>
In this chapter we discuss how to assess the needs of data processing
users with respect to their applications, the data that they use, and
their expected capabilities within a project. We will also demonstrate
a number of hardening techniques for the service controller and its
clusters, and provide examples of various controller configurations
and user management approaches to ensure an adequate level of security
and privacy.
</para>
<section xml:id="data-processing-introduction-to-data-processing-architecture">
<title>Architecture</title>
<para>
The following diagram presents a conceptual view of how the Data
processing service fits into the greater OpenStack ecosystem.
</para>
<para>
<inlinemediaobject>
<imageobject role="html">
<imagedata contentdepth="621" contentwidth="955" fileref="static/data_processing_architecture.png" format="PNG" scalefit="1"/>
</imageobject>
<imageobject role="fo">
<imagedata contentdepth="100%" fileref="static/data_processing_architecture.png" format="PNG" scalefit="1" width="100%"/>
</imageobject>
</inlinemediaobject>
</para>
<para>
The Data processing service makes heavy use of the Compute,
Orchestration, Image, and Block Storage services during the
provisioning of clusters. It also uses one or more networks,
created by the Networking service and supplied during cluster creation,
for administrative access to the instances. While users are running
framework applications, the controller and the clusters access the
Object Storage service. Given these service dependencies, we
recommend following the instructions outlined in
<xref linkend="documentation"/> for cataloging all the components of
an installation.
</para>
</section>
<section xml:id="data-processing-introduction-to-data-processing-technologies-involved">
<title>Technologies involved</title>
<para>
The Data processing service is responsible for the deployment and
management of several applications. For a complete understanding of
the security options provided, we recommend that operators have a
general familiarity with these applications. The list of highlighted
technologies is broken into two sections: first, high-priority
applications that have a greater impact on security, and second,
supporting applications with a lower impact.
</para>
<para>Higher impact</para>
<itemizedlist>
<listitem>
<para>
<link xlink:href="https://hadoop.apache.org/">
Hadoop
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html">
Hadoop secure mode docs
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">
HDFS
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://spark.apache.org/">
Spark
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://spark.apache.org/docs/latest/security.html">
Spark Security
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://storm.apache.org/">
Storm
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://zookeeper.apache.org/">
ZooKeeper
</link>
</para>
</listitem>
</itemizedlist>
<para>Lower impact</para>
<itemizedlist>
<listitem>
<para>
<link xlink:href="https://oozie.apache.org/">
Oozie
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://hive.apache.org/">
Hive
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://pig.apache.org/">
Pig
</link>
</para>
</listitem>
</itemizedlist>
<para>
These technologies comprise the core of the frameworks that are
deployed with the Data processing service. In addition to these
technologies, the service also includes bundled frameworks provided by
third-party vendors. These bundled frameworks are built using the same
core pieces described above, plus configurations and applications that
the vendors include. For more information on the third-party framework
bundles, see the following links:
</para>
<itemizedlist>
<listitem>
<para>
<link xlink:href="https://www.cloudera.com/content/cloudera/en/documentation.html#CDH">
Cloudera CDH
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="http://docs.hortonworks.com/">
Hortonworks Data Platform
</link>
</para>
</listitem>
<listitem>
<para>
<link xlink:href="https://www.mapr.com/products/mapr-distribution-including-apache-hadoop">
MapR
</link>
</para>
</listitem>
</itemizedlist>
</section>
<section xml:id="data-processing-introduction-to-data-processing-user-access-resources">
<title>User access to resources</title>
<para>
The resources (clusters, jobs, and data sources) of the Data
processing service are shared within the scope of a project. Although
a single controller installation may manage several sets of resources,
each set is scoped to a single project. Given this
constraint, we recommend that user membership in projects be monitored
closely to maintain proper segregation of resources.
</para>
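<para>
One way to picture the membership monitoring recommended above is a
periodic comparison of actual project membership against an approved
list. The following Python sketch is illustrative only: the function
name, project names, and user names are hypothetical, and the data is
hard-coded rather than retrieved from the Identity service.
</para>

```python
# Hypothetical audit helper: flag users whose project membership was
# never approved. Nothing here calls a real OpenStack API.

def unexpected_members(current, approved):
    """Return {project: sorted unapproved users} for projects with drift."""
    report = {}
    for project, members in current.items():
        # Anyone present in the project but absent from the approved
        # list could reach every cluster, job, and data source in it.
        extra = set(members) - set(approved.get(project, ()))
        if extra:
            report[project] = sorted(extra)
    return report


# Illustrative data only; a real audit would pull memberships from the
# Identity service instead of hard-coding them.
approved = {"analytics": {"alice", "bob"}, "research": {"carol"}}
current = {"analytics": {"alice", "bob", "mallory"}, "research": {"carol"}}

print(unexpected_members(current, approved))
```

<para>
Because resources are shared project-wide, each user reported by a
check of this kind should be reviewed before sensitive clusters or data
sources are added to the project.
</para>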
<para>
As the security requirements of organizations deploying this service
will vary based on their specific needs, we recommend that operators
focus on data privacy, cluster management, and end-user applications as
a starting point for evaluating the needs of their users. These
decisions will help guide the process of configuring user access to
the service. For an expanded discussion of data privacy, see
<xref linkend="tenant-data"/>.
</para>
<para>
The default assumption for a data processing installation is that
users will have access to all functionality within their projects.
When more granular control is required, the Data processing
service provides a policy file (as described in
<xref linkend="identity-policies"/>). These configurations will be
highly dependent on the needs of the installing organization, and as
such there is no general advice on their usage. See
<xref linkend="data-processing-configuration-and-hardening-role-based-access-control-policies"/>
for details.
</para>
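<para>
The interaction between project scoping and role-based restrictions
can be sketched with a toy model. The Python below is not the
service's policy engine and does not use the real policy file syntax;
the rule names, role names, and the <literal>is_allowed</literal>
helper are assumptions made for illustration. An empty rule stands for
the permissive default, where any member of the project may perform the
action.
</para>

```python
# Toy model of project scoping plus role-based restriction.
# Rule names, roles, and helpers are hypothetical, not the service's own.
from dataclasses import dataclass


@dataclass(frozen=True)
class User:
    name: str
    project_id: str
    roles: frozenset


# Hypothetical policy: action name mapped to the role it requires.
# An empty string means any member of the project may act.
POLICY = {
    "clusters:create": "cluster-admin",
    "jobs:upload": "",
    "default": "",
}


def is_allowed(user, action, resource_project_id, policy=POLICY):
    """Return True if the user may perform the action on the resource."""
    # Resources are scoped to a single project, so cross-project
    # access is denied before any role is considered.
    if user.project_id != resource_project_id:
        return False
    # Actions without an explicit rule fall back to the default rule.
    required = policy.get(action, policy.get("default", ""))
    return required == "" or required in user.roles


alice = User("alice", "p1", frozenset({"cluster-admin"}))
bob = User("bob", "p1", frozenset())

print(is_allowed(alice, "clusters:create", "p1"))
print(is_allowed(bob, "clusters:create", "p1"))
print(is_allowed(bob, "jobs:upload", "p1"))
print(is_allowed(alice, "clusters:create", "p2"))
```

<para>
In this sketch, tightening a single rule narrows who may launch
clusters without affecting other activities in the project, which is
the kind of trade-off an operator weighs when moving away from the
permissive default.
</para>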
</section>
</section>