This change introduces the data processing service chapter as chapter 13 in the security guide.

Changes
* adding data processing chapter to index
* adding data processing chapter file
* adding introduction section file
* adding architecture image
* adding deployment section file
* adding configuration and hardening section file
* adding case studies section file
* adding data processing to the introduction to openstack section

Change-Id: I50c5066373f7c9bd75eb956cbb163f27d6a63058
Closes-bug: 1415218
230 lines
9.1 KiB
XML
<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="data-processing-introduction-to-data-processing">
  <?dbhtml stop-chunking?>
  <title>Introduction to Data processing</title>
  <para>
    The Data processing service for OpenStack (sahara) provides a platform
    for the provisioning and management of instance clusters using processing
    frameworks such as Hadoop and Spark. Through the OpenStack dashboard
    or REST API, users can upload and execute framework
    applications that may access data in Object Storage or external
    providers. The data processing controller uses the Orchestration
    service to create clusters of instances, which may exist as
    long-running groups that can grow and shrink as requested, or as
    transient groups created for a single workload.
  </para>
  <para>
    The service controller is responsible for creating, maintaining,
    and destroying any instances created for its clusters. The controller
    uses the Networking service to establish network paths between
    itself and the cluster instances. It also manages the deployment
    and life cycle of user applications that are to be run on the
    clusters. The instances within a cluster contain the core of a
    framework's processing engine, and the Data processing service provides
    several options for creating and managing the connections to these
    instances.
  </para>
  <para>
    Data processing resources (clusters, jobs, and data sources) are
    segregated by projects defined within the Identity service. These
    resources are shared within a project, so it is important to
    understand the access needs of those who are using the
    service. Activities within projects (for example, launching clusters
    or uploading jobs) can be restricted further through the use of
    role-based access controls.
  </para>
  <para>
    In this chapter, we discuss how to assess the needs of data processing
    users with respect to their applications, the data that they use, and
    their expected capabilities within a project. We also demonstrate
    a number of hardening techniques for the service controller and its
    clusters, and provide examples of various controller configurations
    and user management approaches to ensure an adequate level of security
    and privacy.
  </para>
  <section xml:id="data-processing-introduction-to-data-processing-architecture">
    <title>Architecture</title>
    <para>
      The following diagram presents a conceptual view of how the Data
      processing service fits into the greater OpenStack ecosystem.
    </para>
    <para>
      <inlinemediaobject>
        <imageobject role="html">
          <imagedata contentdepth="621" contentwidth="955" fileref="static/data_processing_architecture.png" format="PNG" scalefit="1"/>
        </imageobject>
        <imageobject role="fo">
          <imagedata contentdepth="100%" fileref="static/data_processing_architecture.png" format="PNG" scalefit="1" width="100%"/>
        </imageobject>
      </inlinemediaobject>
    </para>
    <para>
      The Data processing service makes heavy use of the Compute,
      Orchestration, Image, and Block Storage services during the
      provisioning of clusters. It also uses one or more networks,
      created by the Networking service and supplied at cluster creation,
      for administrative access to the instances. While users are running
      framework applications, the controller and the clusters access
      the Object Storage service. Given these service usages, we
      recommend following the instructions outlined in
      <xref linkend="documentation"/> for cataloging all the components of
      an installation.
    </para>
  </section>
  <section xml:id="data-processing-introduction-to-data-processing-technologies-involved">
    <title>Technologies involved</title>
    <para>
      The Data processing service is responsible for the deployment and
      management of several applications. For a complete understanding of
      the security options provided, we recommend that operators have a
      general familiarity with these applications. The list of highlighted
      technologies is broken into two sections: first, high-priority
      applications that have a greater impact on security, and second,
      supporting applications with a lower impact.
    </para>
    <para>Higher impact</para>
    <itemizedlist>
      <listitem>
        <para>
          <link xlink:href="https://hadoop.apache.org/">
            Hadoop
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html">
            Hadoop secure mode docs
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">
            HDFS
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://spark.apache.org/">
            Spark
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://spark.apache.org/docs/latest/security.html">
            Spark Security
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://storm.apache.org/">
            Storm
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://zookeeper.apache.org/">
            ZooKeeper
          </link>
        </para>
      </listitem>
    </itemizedlist>
    <para>Lower impact</para>
    <itemizedlist>
      <listitem>
        <para>
          <link xlink:href="https://oozie.apache.org/">
            Oozie
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://hive.apache.org/">
            Hive
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://pig.apache.org/">
            Pig
          </link>
        </para>
      </listitem>
    </itemizedlist>
    <para>
      These technologies comprise the core of the frameworks that are
      deployed with the Data processing service. In addition to these
      technologies, the service also includes bundled frameworks provided by
      third-party vendors. These bundled frameworks are built using the same
      core pieces described above, plus configurations and applications that
      the vendors include. For more information on the third-party framework
      bundles, see the following links:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <link xlink:href="https://www.cloudera.com/content/cloudera/en/documentation.html#CDH">
            Cloudera CDH
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="http://docs.hortonworks.com/">
            Hortonworks Data Platform
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://www.mapr.com/products/mapr-distribution-including-apache-hadoop">
            MapR
          </link>
        </para>
      </listitem>
    </itemizedlist>
  </section>
  <section xml:id="data-processing-introduction-to-data-processing-user-access-resources">
    <title>User access to resources</title>
    <para>
      The resources (clusters, jobs, and data sources) of the Data
      processing service are shared within the scope of a project. Although
      a single controller installation may manage several sets of resources,
      each set is scoped to a single project. Given this
      constraint, we recommend that user membership in projects is monitored
      closely to maintain proper segregation of resources.
    </para>
    <para>
      As the security requirements of organizations deploying this service
      will vary based on their specific needs, we recommend that operators
      focus on data privacy, cluster management, and end-user applications as
      a starting point for evaluating the needs of their users. These
      decisions will help guide the process of configuring user access to
      the service. For an expanded discussion of data privacy, see
      <xref linkend="tenant-data"/>.
    </para>
    <para>
      The default assumption for a data processing installation is that
      users will have access to all functionality within their projects. If
      more granular control is required, the Data processing
      service provides a policy file (as described in
      <xref linkend="identity-policies"/>). These configurations will be
      highly dependent on the needs of the installing organization, so
      we offer no general advice on their usage; see
      <xref linkend="data-processing-configuration-and-hardening-role-based-access-control-policies"/>
      for details.
    </para>
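    <para>
      As a minimal sketch of such a restriction, the following policy file
      fragment leaves every operation open to project members except
      manipulation of the image registry, which it limits to users holding
      an administrative role. The rule names and the
      <literal>admin</literal> role name are illustrative assumptions, not
      recommendations; verify them against the policy definitions shipped
      with your release before adapting the fragment.
    </para>
    <programlisting language="json">{
    "default": "",
    "data-processing:images:register": "role:admin",
    "data-processing:images:unregister": "role:admin",
    "data-processing:images:add_tags": "role:admin",
    "data-processing:images:remove_tags": "role:admin"
}</programlisting>
    <para>
      The empty <literal>default</literal> rule keeps the permissive
      baseline described above, so only the operations listed explicitly
      are tightened.
    </para>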
  </section>
</section>