
Merge "Add some S3 doc"

tags/9.0.0.0rc1^0
Zuul, 9 months ago
parent commit bd72585709
2 changed files with 88 additions and 0 deletions
1. doc/source/user/edp-s3.rst (+87 -0)
2. doc/source/user/index.rst (+1 -0)

doc/source/user/edp-s3.rst (+87 -0)

@@ -0,0 +1,87 @@
+==============================
+EDP with S3-like Object Stores
+==============================
+
+Overview and rationale of S3 integration
+========================================
+Since the Rocky release, Sahara clusters have full support for interaction with
+S3-like object stores, for example Ceph Rados Gateway. Through the abstractions
+offered by EDP, a Sahara job execution may consume input data and job binaries
+stored in S3, as well as write back its output data to S3.
+
+The copying of job binaries from S3 to a cluster is performed by the botocore
+library. A job's input and output to and from S3 are handled by the Hadoop-S3A
+driver.
+
+It's also worth noting that the Hadoop-S3A driver may be more mature and
+performant than the Hadoop-SwiftFS driver (either as hosted by Apache or in
+the sahara-extra repository).
+
+Sahara clusters are also provisioned such that data in S3-like storage can be
+accessed when manually interacting with the cluster; in other words, the
+needed libraries are properly situated.
+
+Considerations for deployers
+============================
+The S3 integration features can function without any specific deployment
+requirement. This is because the EDP S3 abstractions can point to an arbitrary
+S3 endpoint.
+
+Deployers may want to consider using Sahara's optional integration with secret
+storage to protect the S3 access and secret keys that users will provide. Also,
+if using Rados Gateway for S3, deployers may want to use Keystone for RGW auth
+so that users can simply request Keystone EC2 credentials to access RGW's S3.
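+
+In that case, a user could request EC2-style credentials with the standard
+OpenStack CLI (a minimal sketch; the access and secret values it prints are
+then usable as the S3 keys discussed below)::
+
+    $ openstack ec2 credentials create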
+
+S3 user experience
+==================
+The sections below describe how to use the S3 integration features.
+
+EDP job binaries in S3
+----------------------
+The ``url`` must be in the format ``s3://bucket/path/to/object``, similar to
+the format used for binaries in Swift. The ``extra`` structure must contain
+``accesskey``, ``secretkey``, and ``endpoint``, which is the URL of the S3
+service, including the protocol ``http`` or ``https``.
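+
+For example, a job binary registered with Sahara might carry a structure like
+the following (an illustrative sketch; the name, bucket, object, and endpoint
+values are all hypothetical)::
+
+    {
+        "name": "my-wordcount.jar",
+        "url": "s3://mybucket/jobs/wordcount.jar",
+        "extra": {
+            "accesskey": "AKIAEXAMPLE",
+            "secretkey": "...",
+            "endpoint": "https://rgw.example.com"
+        }
+    }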
+
+As mentioned above, the binary will be copied to the cluster before execution
+by the botocore library. This also means that the set of credentials used to
+access this binary may be entirely different from those for accessing a data
+source.
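+
+A rough botocore sketch of such a copy, assuming a hypothetical endpoint and
+credentials (this illustrates the mechanism, not Sahara's exact internal
+code)::
+
+    import botocore.session
+
+    # Build a low-level S3 client pointed at the configured endpoint.
+    session = botocore.session.get_session()
+    client = session.create_client(
+        's3',
+        endpoint_url='https://rgw.example.com',  # hypothetical endpoint
+        aws_access_key_id='AKIAEXAMPLE',         # from the binary's "extra"
+        aws_secret_access_key='...',
+    )
+
+    # Fetch the binary's bytes so they can be placed on the cluster.
+    obj = client.get_object(Bucket='mybucket', Key='jobs/wordcount.jar')
+    binary_data = obj['Body'].read()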
+
+EDP data sources in S3
+----------------------
+The ``url`` should be in the format ``s3://bucket/path/to/object``, although
+upon execution the protocol will be automatically changed to ``s3a``. The
+``credentials`` structure does not have any required values, although the
+following may be set (see the example after this list):
+
+* ``accesskey`` and ``secretkey``
+* ``endpoint``, which is the URL of the S3 service, without the protocol
+* ``ssl``, which must be a boolean
+* ``bucket_in_path``, to indicate whether the S3 service uses
+  virtual-hosted-style or path-style URLs; it must also be a boolean
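+
+For instance, a data source using these credentials might be defined as
+follows (an illustrative sketch; every value shown is hypothetical)::
+
+    {
+        "name": "my-input",
+        "type": "s3",
+        "url": "s3://mybucket/input/data.csv",
+        "credentials": {
+            "accesskey": "AKIAEXAMPLE",
+            "secretkey": "...",
+            "endpoint": "rgw.example.com",
+            "ssl": true,
+            "bucket_in_path": false
+        }
+    }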
+
+The values above are optional, as they may instead be set in the cluster's
+``core-site.xml`` or as configuration values of the job execution, using the
+following options understood by the Hadoop-S3A driver (see the sketch after
+this list):
+
+* ``fs.s3a.access.key``, corresponding to ``accesskey``
+* ``fs.s3a.secret.key``, corresponding to ``secretkey``
+* ``fs.s3a.endpoint``, corresponding to ``endpoint``
+* ``fs.s3a.connection.ssl.enabled``, corresponding to ``ssl``
+* ``fs.s3a.path.style.access``, corresponding to ``bucket_in_path``
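+
+A minimal ``core-site.xml`` fragment setting some of these options might look
+like the following (an illustrative sketch; the endpoint and key values are
+hypothetical)::
+
+    <configuration>
+      <property>
+        <name>fs.s3a.access.key</name>
+        <value>AKIAEXAMPLE</value>
+      </property>
+      <property>
+        <name>fs.s3a.secret.key</name>
+        <value>...</value>
+      </property>
+      <property>
+        <name>fs.s3a.endpoint</name>
+        <value>rgw.example.com</value>
+      </property>
+      <property>
+        <name>fs.s3a.connection.ssl.enabled</name>
+        <value>true</value>
+      </property>
+    </configuration>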
+
+In the case of ``fs.s3a.path.style.access``, a default value is determined by
+the Hadoop-S3A driver if none is set: virtual-hosted-style URLs are assumed
+unless told otherwise, or unless the endpoint is a raw IP address.
+
+Additional configuration values are supported by the Hadoop-S3A driver; they
+are discussed in its official documentation.
+
+It is recommended that the EDP data source abstraction be used, rather than
+handling bare arguments and configuration values.
+
+If any S3 configuration values are to be set at execution time, including
+cases in which those values are contained in the EDP data source abstraction,
+then ``edp.spark.adapt_for_swift`` or ``edp.java.adapt_for_oozie`` must be set
+to ``true``, as appropriate.
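+
+For example, a Spark job execution consuming such a data source might include
+the following in its job configs (a minimal sketch of the flag described
+above)::
+
+    {
+        "configs": {
+            "edp.spark.adapt_for_swift": true
+        }
+    }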

doc/source/user/index.rst (+1 -0)

@@ -39,6 +39,7 @@ Elastic Data Processing
   :maxdepth: 2

   edp
+   edp-s3


 Guest Images
