<?xml version="1.0" encoding="UTF-8"?>
<section xmlns="http://docbook.org/ns/docbook"
  xmlns:xi="http://www.w3.org/2001/XInclude"
  xmlns:xlink="http://www.w3.org/1999/xlink"
  version="5.0"
  xml:id="data-processing-introduction-to-data-processing">
  <?dbhtml stop-chunking?>
  <title>Introduction to Data processing</title>
  <para>
    The Data processing service for OpenStack (sahara) provides a platform
    for the provisioning and management of instance clusters using processing
    frameworks such as Hadoop and Spark. Through the OpenStack dashboard
    or REST API, users can upload and execute framework applications
    that may access data in object storage or external providers. The
    data processing controller uses the Orchestration service to create
    clusters of instances, which may exist as long-running groups that
    can grow and shrink as requested, or as transient groups created for
    a single workload.
  </para>
  <para>
    The service controller is responsible for creating, maintaining,
    and destroying any instances created for its clusters. The controller
    uses the Networking service to establish network paths between
    itself and the cluster instances. It also manages the deployment
    and lifecycle of user applications that run on the clusters. The
    instances within a cluster contain the core of a framework's
    processing engine, and the Data processing service provides several
    options for creating and managing the connections to these
    instances.
  </para>
  <para>
    Data processing resources (clusters, jobs, and data sources) are
    segregated by projects defined within the Identity service. These
    resources are shared within a project, so it is important to
    understand the access needs of those who are using the
    service. Activities within projects (for example, launching clusters
    or uploading jobs) can be restricted further through the use of
    role-based access controls.
  </para>
  <para>
    In this chapter, we discuss how to assess the needs of data processing
    users with respect to their applications, the data that they use, and
    their expected capabilities within a project. We also demonstrate
    a number of hardening techniques for the service controller and its
    clusters, and provide examples of various controller configurations
    and user management approaches to ensure an adequate level of security
    and privacy.
  </para>
  <section xml:id="data-processing-introduction-to-data-processing-architecture">
    <title>Architecture</title>
    <para>
      The following diagram presents a conceptual view of how the Data
      processing service fits into the greater OpenStack ecosystem.
    </para>
    <para>
      <inlinemediaobject>
        <imageobject role="html">
          <imagedata contentdepth="621" contentwidth="955" fileref="static/data_processing_architecture.png" format="PNG" scalefit="1"/>
        </imageobject>
        <imageobject role="fo">
          <imagedata contentdepth="100%" fileref="static/data_processing_architecture.png" format="PNG" scalefit="1" width="100%"/>
        </imageobject>
      </inlinemediaobject>
    </para>
    <para>
      The Data processing service makes heavy use of the Compute,
      Orchestration, Image, and Block Storage services during the
      provisioning of clusters. It also uses one or more networks,
      created by the Networking service and provided during cluster
      creation, for administrative access to the instances. While users
      are running framework applications, the controller and the clusters
      access the Object Storage service. Given this reliance on other
      services, we recommend following the instructions outlined in
      <xref linkend="documentation"/> for cataloging all the components of
      an installation.
    </para>
  </section>
  <section xml:id="data-processing-introduction-to-data-processing-technologies-involved">
    <title>Technologies involved</title>
    <para>
      The Data processing service is responsible for the deployment and
      management of several applications. For a complete understanding of
      the security options provided, we recommend that operators have a
      general familiarity with these applications. The list of highlighted
      technologies is split into two groups: first, higher impact
      applications that have a greater effect on security, and second,
      supporting applications with a lower impact.
    </para>
    <para>Higher impact</para>
    <itemizedlist>
      <listitem>
        <para>
          <link xlink:href="https://hadoop.apache.org/">
            Hadoop
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SecureMode.html">
            Hadoop secure mode docs
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HdfsUserGuide.html">
            HDFS
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://spark.apache.org/">
            Spark
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://spark.apache.org/docs/latest/security.html">
            Spark Security
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://storm.apache.org/">
            Storm
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://zookeeper.apache.org/">
            ZooKeeper
          </link>
        </para>
      </listitem>
    </itemizedlist>
    <para>Lower impact</para>
    <itemizedlist>
      <listitem>
        <para>
          <link xlink:href="https://oozie.apache.org/">
            Oozie
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://hive.apache.org/">
            Hive
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://pig.apache.org/">
            Pig
          </link>
        </para>
      </listitem>
    </itemizedlist>
    <para>
      These technologies comprise the core of the frameworks that are
      deployed with the Data processing service. In addition to these
      technologies, the service also includes bundled frameworks provided by
      third-party vendors. These bundled frameworks are built using the same
      core pieces described above, plus configurations and applications that
      the vendors include. For more information on the third-party framework
      bundles, see the following links:
    </para>
    <itemizedlist>
      <listitem>
        <para>
          <link xlink:href="https://www.cloudera.com/content/cloudera/en/documentation.html#CDH">
            Cloudera CDH
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="http://docs.hortonworks.com/">
            Hortonworks Data Platform
          </link>
        </para>
      </listitem>
      <listitem>
        <para>
          <link xlink:href="https://www.mapr.com/products/mapr-distribution-including-apache-hadoop">
            MapR
          </link>
        </para>
      </listitem>
    </itemizedlist>
  </section>
  <section xml:id="data-processing-introduction-to-data-processing-user-access-resources">
    <title>User access to resources</title>
    <para>
      The resources (clusters, jobs, and data sources) of the Data
      processing service are shared within the scope of a project. Although
      a single controller installation may manage several sets of resources,
      each set is scoped to a single project. Given this constraint, we
      recommend monitoring user membership in projects closely to maintain
      proper segregation of resources.
    </para>
    <para>
      Because the security requirements of organizations deploying this
      service vary with their specific needs, we recommend that operators
      focus on data privacy, cluster management, and end-user applications as
      a starting point for evaluating the needs of their users. These
      decisions help guide the process of configuring user access to
      the service. For an expanded discussion of data privacy, see
      <xref linkend="tenant-data"/>.
    </para>
    <para>
      The default assumption for a data processing installation is that
      users have access to all functionality within their projects. If
      more granular control is required, the Data processing service
      provides a policy file (as described in
      <xref linkend="identity-policies"/>). These configurations are
      highly dependent on the needs of the installing organization, so
      there is no general advice on their usage. See
      <xref linkend="data-processing-configuration-and-hardening-role-based-access-control-policies"/>
      for details.
    </para>
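    <para>
      As an illustration only, the following policy file fragment sketches
      how cluster creation and deletion might be limited to administrators
      while all other operations remain available to every project member.
      The rule names and the <literal>role:admin</literal> check are
      assumptions made for this example and are not taken from this
      chapter; verify them against the policy documentation for the
      deployed release before use.
    </para>
    <programlisting language="json">{
    "default": "",
    "data-processing:clusters:create": "role:admin",
    "data-processing:clusters:delete": "role:admin"
}</programlisting>
    <para>
      In this sketch, the empty <literal>default</literal> rule leaves
      unlisted operations unrestricted, which matches the default
      assumption described above. An organization with stricter needs
      might instead start from a restrictive default and grant individual
      operations to specific roles.
    </para>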
  </section>
</section>