Add some S3 doc

Perhaps some more to follow, someday, but this is nice to have. Change-Id: I2235f903105049432de24d89a88b40f753fd93d6
2018-08-07 17:30:26 -04:00 · 2018-08-07 17:30:26 -04:00 · 141a67e7c7
parent 2c6232c9ad
commit 141a67e7c7
2 changed files with 88 additions and 0 deletions
--- a/doc/source/user/edp-s3.rst
+++ b/doc/source/user/edp-s3.rst
@ -0,0 +1,87 @@
 ==============================
 EDP with S3-like Object Stores
 ==============================
 Overview and rationale of S3 integration
 ========================================
 Since the Rocky release, Sahara clusters have full support for interaction with
 S3-like object stores, for example Ceph Rados Gateway. Through the abstractions
 offered by EDP, a Sahara job execution may consume input data and job binaries
 stored in S3, as well as write back its output data to S3.
 The copying of job binaries from S3 to a cluster is performed by the botocore
 library. A job's input and output to and from S3 is handled by the Hadoop-S3A
 driver.
 It's also worth noting that the Hadoop-S3A driver may be more mature and
 performant than the Hadoop-SwiftFS driver (either as hosted by Apache or in
 the sahara-extra respository).
 Sahara clusters are also provisioned such that data in S3-like storage can also
 be accessed when manually interacting with the cluster; in other words: the
 needed libraries are properly situated.
 Considerations for deployers
 ============================
 The S3 integration features can function without any specific deployment
 requirement. This is because the EDP S3 abstractions can point to an arbitrary
 S3 endpoint.
 Deployers may want to consider using Sahara's optional integration with secret
 storage to protect the S3 access and secret keys that users will provide. Also,
 if using Rados Gateway for S3, deployers may want to use Keystone for RGW auth
 so that users can simply request Keystone EC2 credentials to access RGW's S3.
 S3 user experience
 ==================
 Below, details about how to use the S3 integration features are discussed.
 EDP job binaries in S3
 ----------------------
 The ``url`` must be in the format ``s3://bucket/path/to/object``, similar to
 the format used for binaries in Swift. The ``extra`` structure must contain
 ``accesskey``, ``secretkey``, and ``endpoint``, which is the URL of the S3
 service, including the protocol ``http`` or ``https``.
 As mentioned above, the binary will be copied to the cluster before execution,
 by use of the botocore library. This also means that the set of credentials
 used to access this binary may be entirely different than those for accessing
 a data source.
 EDP data sources in S3
 ----------------------
 The ``url`` should be in the format ``s3://bucket/path/to/object``, although
 upon execution the protocol will be automatically changed to ``s3a``. The
 ``credentials`` does not have any required values, although the following may
 be set:
 * ``accesskey`` and ``secretkey``
 * ``endpoint``, which is the URL of the S3 service, without the protocl
 * ``ssl``, which must be a boolean
 * ``bucket_in_path``, to indicate whether the S3 service uses
  virtual-hosted-style or path-style URLs, and must be a boolean
 The values above are optional, as they may be set in the cluster's
 ``core-site.xml`` or as configuration values of the job execution, as follows,
 as dictated by the options understood by the Hadoop-S3A driver:
 * ``fs.s3a.access.key``, corresponding to ``accesskey``
 * ``fs.s3a.secret.key``, corresponding to ``secretkey``
 * ``fs.s3a.endpoint``, corresponding to ``endpoint``
 * ``fs.s3a.connection.ssl.enabled``, corresponding to ``ssl``
 * ``fs.s3a.path.style.access``, corresponding to ``bucket_in_path``
 In the case of ``fs.s3a.path.style.access``, a default value is determined by
 the Hadoop-S3A driver if none is set: virtual-hosted-style URLs are assumed
 unless told otherwise, or if the endpoint is a raw IP address.
 Additional configuration values are supported by the Hadoop-S3A driver, and are
 discussed in its official documentation.
 It is recommended that the EDP data source abstraction is used, rather than
 handling bare arguments and configuration values.
 If any S3 configuration values are to be set at execution time, including such
 situations in which those values are contained by the EDP data source
 abstraction, then ``edp.spark.adapt_for_swift`` or ``edp.java.adapt_for_oozie``
 must be set to ``true`` as appropriate.
--- a/doc/source/user/index.rst
+++ b/doc/source/user/index.rst
@ -39,6 +39,7 @@ Elastic Data Processing
   :maxdepth: 2
   edp
   edp-s3
 Guest Images