Merge "Swift tiering specification"

2016-05-05 03:46:40 +00:00 · 2016-05-05 03:46:40 +00:00 · cced93eba2
parent 4cf7f71121 08226f79b6
commit cced93eba2
2 changed files with 496 additions and 0 deletions
--- a/specs/in_progress/images/tiering_overview.png
+++ b/specs/in_progress/images/tiering_overview.png
--- a/specs/in_progress/tiering.rst
+++ b/specs/in_progress/tiering.rst
@@ -0,0 +1,496 @@
+::
+
+  This work is licensed under a Creative Commons Attribution 3.0
+  Unported License.
+  http://creativecommons.org/licenses/by/3.0/legalcode
+
+*************************
+Automated Tiering Support
+*************************
+
+1. Problem Description
+======================
+Data hosted on long-term storage systems experience gradual changes in
+access patterns as part of their information lifecycles. For example,
+empirical studies by companies such as Facebook show that as image data
+age, they become progressively less likely to be accessed by users, with
+access rates at times dropping exponentially [1].
+Long retention periods, as is the case with data stored on cold storage
+systems like Swift, increase the possibility of such changes.
+
+Tiering is an important feature provided by many traditional file & block
+storage systems to deal with changes in data “temperature”. It enables
+seamless movement of inactive data from high performance storage media to
+low-cost, high capacity storage media to meet customers’ TCO (total cost of
+ownership) requirements. As scale-out object storage systems like Swift are
+starting to natively support multiple media types like SSD, HDD, tape and
+different storage policies such as replication and erasure coding, it becomes
+imperative to complement the wide range of available storage tiers (both
+virtual and physical) with automated data tiering.
+
+
+2. Tiering Use Cases in Swift
+=============================
+Swift users and operators can adapt to changes in access characteristics of
+objects by transparently converting their storage policies to cater to the
+goal of matching overall business needs ($/GB, performance, availability) with
+where and how the objects are stored.
+
+The following are examples of how objects can be moved between Swift
+containers of different storage policies as they age:
+
+[SSD-based container] --> [HDD-based container]
+
+[HDD-based container] --> [Tape-based container]
+
+[Replication policy container] --> [Erasure coded policy container]
+
+In some customer environments, a Swift container may not be the last storage
+tier. Examples of archival-class stores lower in cost than Swift include
+specialized tape-based systems [2] and public cloud archival solutions such as
+Amazon Glacier and Google Nearline storage. Analogous to the tiering feature
+proposed here, Amazon S3 already has built-in support for moving objects
+between S3 and Glacier based on user-defined rules. Red Hat Ceph has
+recently added tiering capabilities as well.
+
+
+3. Goals
+========
+The main goal of this document is to propose a tiering feature in Swift that
+enables seamless movement of objects between containers belonging to different
+storage policies. It is “seamless” because users will not experience any
+disruption in namespace, access API, or availability of the objects subject to
+tiering.
+
+Through new Swift API enhancements, Swift users and operators alike will have
+the ability to specify a tiering relationship between two containers and the
+associated data movement rules.
+
+The focus of this proposal is to identify, create and bring together the
+necessary building blocks towards a baseline tiering implementation natively
+within Swift. While this narrow scope is intentional, the expectation is that
+the baseline tiering implementation will lay the foundation for, and not
+preclude, more advanced tiering features in the future.
+
+4. Feature Dependencies
+=======================
+The following in-progress Swift features (aka specs) have been identified as
+core dependencies for this tiering proposal.
+
+1. Swift Symbolic Links [3]
+2. Changing Storage Policies [4]
+
+A few other specs are classified as nice-to-have dependencies, meaning that
+if they evolve into full implementations we will be able to demonstrate the
+tiering feature with advanced use cases and capabilities. However, they are
+not considered mandatory requirements for the first version of tiering.
+
+3. Metadata storage/search [5]
+4. Tape support in Swift [6]
+
+5. Implementation
+=================
+The proposed tiering implementation depends on several building blocks, some
+of which are unique to tiering, like the requisite API changes. They will be
+described in their entirety. Others like symlinks are independent features and
+have uses beyond tiering. Instead of re-inventing the wheel, the tiering
+implementation aims to leverage specific constructs that will be available
+through these in-progress features.
+
+5.1 Overview
+------------
+For a quick overview of the tiering implementation, refer to the figure
+(images/tiering_overview.png). It highlights the flow of actions taking place
+within the proposed tiering engine.
+
+1. A Swift client creates a tiering relationship between two Swift containers
+   by marking the source container with appropriate metadata.
+2. A background process named tiering-coordinator examines the source
+   container and iterates through its objects.
+3. Tiering-coordinator identifies candidate objects for movement and de-stages
+   each object to the target container by issuing a copy request to an object
+   server.
+4. After an object is copied, tiering-coordinator replaces it with a symlink in
+   the source container pointing to the corresponding object in the target
+   container.
+
+
+5.2 API Changes
+---------------
+Swift clients will be able to create a tiering relationship between two
+containers, i.e., source and target containers, by adding the following
+metadata to the source container.
+
+X-Container-Tiering-Target: <target_container_name>
+X-Container-Tiering-Age: <threshold_object_age>
+
+The metadata values can be set when the source container is created (PUT)
+or later as part of a container metadata update (POST). Object age refers to
+the time elapsed since the object’s creation time (creation time is stored
+with the object as the ‘X-Timestamp’ header).
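
As a rough illustration of the container metadata update described above, the sketch below builds (but does not send) a POST request that tags a source container. The storage URL, token, container names, and the age value are placeholders, and `build_tiering_request` is a hypothetical helper, not part of Swift or of this spec; the unit of the age value is left unspecified, as in the proposal.

```python
import urllib.request


def build_tiering_request(storage_url, token, source, target, age):
    """Build a container POST that sets the proposed tiering metadata.

    Hypothetical helper: only the two X-Container-Tiering-* header names
    come from the proposal; everything else is illustrative.
    """
    headers = {
        "X-Auth-Token": token,
        "X-Container-Tiering-Target": target,
        "X-Container-Tiering-Age": str(age),
    }
    url = "{}/{}".format(storage_url.rstrip("/"), source)
    # The request is only constructed here; nothing is sent over the network.
    return urllib.request.Request(url, headers=headers, method="POST")


# Placeholder endpoint and credentials, for illustration only.
req = build_tiering_request(
    "https://swift.example.com/v1/AUTH_test", "example-token", "S", "T", 30
)
```

Passing the same headers on the container PUT would cover the creation-time case.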
+
+The user semantics of setting the above container metadata are as follows.
+When objects in the source container become older than the specified threshold
+time, they become candidates for being de-staged to the target container. There
+are no guarantees on when exactly they will be moved or the precise location of
+the objects at any given time. Swift will operate on them asynchronously and
+relocate objects based on user-specified tiering rules. Once the tiering
+metadata is set on the source container, the user can expect levels of
+performance, reliability, etc. for its objects commensurate with the storage
+policy of either the source or target container.
+
+The tiering metadata can be overridden for individual objects in the source
+container by setting the following per-object metadata:
+
+X-Object-Tiering-Target: <target_container_name>
+X-Object-Tiering-Age: <object_age_in_minutes>
+
+Tiering metadata set on an object takes precedence over the tiering metadata
+set on the hosting container. However, if a container is not tagged with any
+tiering metadata, the objects inside it will not be considered for tiering,
+regardless of whether they carry tiering metadata of their own. Also, if the
+tiering age threshold in the object metadata is lower than the value set on
+the container, it will not take effect until the container age criterion is
+met.
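
The precedence rules above can be sketched as a small pure function. This is one possible reading, not the spec's implementation: it assumes both age values use the same unit, that a container counts as "tagged" only when both container headers are present, and that metadata is available as plain dictionaries. The function name `effective_tiering_rule` is hypothetical.

```python
def effective_tiering_rule(container_meta, object_meta):
    """Return (target, age_threshold), or None if the object is not tiered.

    Assumption: both ages are integers in the same (unspecified) unit.
    """
    c_target = container_meta.get("X-Container-Tiering-Target")
    c_age = container_meta.get("X-Container-Tiering-Age")
    # Without container-level tiering metadata, object metadata is ignored.
    if c_target is None or c_age is None:
        return None

    # Object-level target overrides the container-level target.
    target = object_meta.get("X-Object-Tiering-Target", c_target)

    # An object age below the container age cannot take effect earlier
    # than the container criterion, so the larger of the two applies.
    age = int(c_age)
    if "X-Object-Tiering-Age" in object_meta:
        age = max(age, int(object_meta["X-Object-Tiering-Age"]))
    return target, age
```

For example, with a container threshold of 60, an object override of 10 still yields an effective threshold of 60, while an override of 90 yields 90.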
+
+An important invariant preserved by the tiering feature is the namespace of
+objects. As will be explained in later sections, after objects are moved they
+will be replaced immediately by symlinks that allow users to continue
+foreground operations on objects as if no migration had taken place. Please
+refer to section 7 on open issues for further commentary on the API topic.
+
+To summarize, here are the steps that a Swift user must perform in order to
+initiate tiering of objects from a source container (S) to a target
+container (T) over time.
+
+1. Create containers S and T with the desired storage policies, say
+   replication and erasure coding respectively.
+2. Set the tiering-related metadata (X-Container-Tiering-*) on container S
+   as described earlier in this section.
+3. Deposit objects into container S.
+4. If needed, override the default container settings for individual objects
+   inside container S by setting object metadata (X-Object-Tiering-*).
+
+It will also be possible to create cascading tiering relationships between
+more than two containers. For example, a sequence of tiering relationships
+between containers C1 -> C2 -> C3 can be established by setting appropriate
+tiering metadata on C1 and C2. When an object is old enough to be moved from
+C1, it will be deposited in C2. The timer will then start on the moved object
+in C2 and depending on the age settings on C2, the object will eventually be
+migrated to C3.
+
+
+5.3 Tiering Coordinator Process
+-------------------------------
+The tiering-coordinator is a background process similar to container-sync,
+container-reconciler and other container-* processes running on each container
+server. We can potentially re-use one of the existing container processes,
+specifically either container-sync or container-reconciler to perform the job of
+tiering-coordinator, but for the purposes of this discussion it will be assumed
+that it is a separate process.
+
+The key actions performed by tiering-coordinator are
+
+(a) Walk through containers marked with tiering metadata
+(b) Identify candidate objects for tiering within those containers
+(c) Initiate copy requests on candidate objects
+(d) Replace source objects with corresponding symlinks
+
+We will discuss (a) and (b) in this section and cover (c) and (d) in subsequent
+sections. Note that in the first version of tiering, only one metric, object
+age, will be used to determine the eligibility of an object for migration.
+
+The tiering-coordinator performs its operations in a series of rounds. In each
+round, it iterates through containers whose SQLite DBs it has direct access to
+on the container server it is running on. It checks if the container has the
+right X-Container-Tiering-* metadata. If present, it starts the scanning process
+to identify candidate objects. The scanning process leverages a convenient (but
+not necessary) property of the container DB that objects are listed in the
+chronological order of their creation times. That is, the first index in the
+container DB points to the object with oldest creation time, followed by next
+younger object and so on. As such, the scanning process described below is
+optimized for the object age criterion chosen for tiering v1 implementation.
+For extending to other tiering metrics, we refer the reader to section 6.1 for
+discussion.
+
+Each container DB will have two persistent markers to track the progress of
+tiering: tiering_sync_start and tiering_sync_end. The marker tiering_sync_start
+refers to the index in the container DB up to which objects have already
+been processed. The marker tiering_sync_end refers to the index beyond which
+objects have not yet been considered for tiering. All the objects that fall
+between the two markers are the ones for which tiering is currently in progress.
+Persisting the markers in the container DB allows the coordinator to resume
+quickly from previous work after a container server crash or reboot.
+
+When a container is selected for tiering for the first time, both markers
+are initialized to -1. If the first object is old enough to meet the
+X-Container-Tiering-Age criterion, tiering_sync_start is set to 0. Then the
+second marker, tiering_sync_end, is advanced to the lesser of two values:
+(i) tiering_sync_start + tier_max_objects_per_round (the latter will be a
+configurable value in /etc/swift/container.conf) or (ii) the largest index in
+the container DB whose corresponding object meets the tiering age criterion.
+
+The above marker settings will ensure two invariants. First, all objects
+between (and including) tiering_sync_start and tiering_sync_end are candidates
+for moving to the target container. Second, the number of objects processed
+on the container in a single round is bounded by the configuration parameter
+(tier_max_objects_per_round, say 200). This ensures that the coordinator
+process round-robins effectively among all containers on the server in each
+round without spending an undue amount of time on only a few.
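
A minimal sketch of how one round's candidate range might be chosen, under stated assumptions: creation timestamps are given as a list sorted in index order, "older than" is strict, and the start marker is read as the last index already processed so that the per-round bound holds exactly. That reading is an interpretation of the marker description, not a confirmed detail, and `plan_round` is a hypothetical name.

```python
import bisect


def plan_round(timestamps, now, age_threshold, last_processed, max_per_round):
    """Return the inclusive (first, last) index range to tier, or None.

    timestamps: creation times in container DB order (ascending).
    last_processed: last index already tiered, -1 if none yet.
    """
    cutoff = now - age_threshold
    # Objects created strictly before the cutoff are old enough; because the
    # list is sorted, they form a prefix whose length bisect_left returns.
    last_eligible = bisect.bisect_left(timestamps, cutoff) - 1
    end = min(last_processed + max_per_round, last_eligible)
    if end <= last_processed:
        return None  # nothing new is eligible in this round
    return (last_processed + 1, end)
```

With timestamps `[10, 20, 30, 40, 50]`, `now=100`, a threshold of 55 and at most 2 objects per round, successive rounds would cover indices (0, 1) and then (2, 3), after which no further objects are eligible.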
+
+After the markers are fixed, tiering-coordinator will issue a copy request
+for each object within the range. When the copy requests are completed, it
+updates tiering_sync_start = tiering_sync_end and moves on to the next
+container. When tiering-coordinator re-visits the same container after
+completing the current round, it restarts the scanning routine described
+above from tiering_sync_start = tiering_sync_end (except they are not both
+-1 this time).
+
+In a typical Swift cluster, each container DB is replicated three times and
+resides on multiple container servers. Therefore, without proper
+synchronization, tiering-coordinator processes can end up conflicting with
+each other by processing the same container and same objects within. This
+can potentially lead to race conditions with non-deterministic behavior. We
+can overcome this issue by adopting the divide-and-conquer approach
+employed by the container-sync process. The range of object indices between
+(tiering_sync_start, tiering_sync_end) can be initially split up into as
+many disjoint regions as the number of tiering-coordinator processes
+operating on the same container. As they work through the object indices,
+each process might additionally complete others’ portions depending on the
+collective progress. For a detailed description of how container-sync
+processes implicitly communicate and make group progress, please refer
+to [7].
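
The initial split into disjoint regions could look like the sketch below. It uses a simple contiguous partition purely for illustration; it is not a description of how container-sync actually divides work, and `split_range` is a hypothetical helper.

```python
def split_range(first, last, num_processes):
    """Split the inclusive index range [first, last] into disjoint regions.

    Returns one entry per process: an inclusive (lo, hi) tuple, or None
    when there are more processes than indices.
    """
    total = last - first + 1
    base, extra = divmod(total, num_processes)
    regions = []
    lo = first
    for i in range(num_processes):
        # The first `extra` processes take one additional index each.
        size = base + (1 if i < extra else 0)
        regions.append((lo, lo + size - 1) if size else None)
        lo += size
    return regions
```

For instance, `split_range(0, 9, 3)` assigns (0, 3), (4, 6) and (7, 9) to the three coordinator processes.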
+
+5.4 Object Copy Mechanism
+-------------------------
+For each candidate object that the tiering-coordinator deems eligible to move to
+the target container, it issues an ‘object copy’ request using an API call
+supported by the object servers. The API call will map to a method used by
+object-transferrer daemons running on the object servers. The
+tiering-coordinator can select any of the object servers (by looking up the ring
+data structure corresponding to the object in the source container policy) as a
+destination for the request.
+
+The object-transferrer daemon is supposed to be optimized for converting an
+object from one storage policy to another. As per the ‘Changing policies’ spec,
+the object-transferrer daemon will be equipped with the right techniques to move
+objects between Replication -> EC, EC -> EC, etc. Alternatively, in the absence
+of object-transferrer, the tiering coordinator can simply make use of the
+server-side ‘COPY’ API that vanilla Swift exposes to regular clients. It can
+send the COPY request to a swift proxy server to clone the source object into
+the target container. The proxy server will perform the copy by first reading in
+(GET request) the object from any of the source object servers and creating a
+copy (PUT request) of the object in the target object servers. While this will
+work correctly for the purposes of the tiering coordinator, making use of the
+object-transferrer interface is likely to be a better option. Leveraging the
+specialized code in object-transferrer through a well-defined interface for
+copying an object between two different storage policy containers will make the
+overall tiering process efficient.
+
+Here is an example interface represented by a function call in the
+object-transferrer code:
+
+def copy_object(source_obj_path, target_obj_path): ...
+
+The above method can be a wrapper over similar functionality used by the
+object-transferrer daemon. The tiering-coordinator will invoke this function
+through an HTTP call, for example:
+
+copy_object("/A/S/O", "/A/T/O")
+
+where S is the source container and T is the target container. Note that the
+object name in the target container will be the same as in the source container.
+
+Upon receiving the copy request, the object server will first check if the
+source path is a symlink object. If it is a symlink, it will respond with an
+error to the tiering-coordinator to indicate that a symlink already exists.
+This behavior will ensure idempotence and guard against situations where
+tiering-coordinator crashes and retries a previously completed object copy
+request. Also, it avoids tiering for sparse objects such as symlinks created
+by users. Secondly, the object server will check if the source object has
+tiering metadata in the form of X-Object-Tiering-* that overrides the default
+tiering settings on the source container. It may or may not perform the object
+copy depending on the result.
+
+5.5 Symlink Creation
+--------------------
+After an object is successfully copied to the destination container, the
+tiering-coordinator will issue a ‘symlink create’ request to a proxy server to
+replace the source object with a reference to the destination object. Waiting
+until the object copy is completed before replacing it with a symlink ensures
+safety in case of failures. The system could end up with an extra target
+object without a symlink pointing to it, but never with the converse (a
+symlink without a target object), which would constitute data loss. Note that
+the symlink feature is currently work-in-progress and will also be available
+as an external API to Swift clients.
+
+When the symlink is created by the tiering-coordinator, it will need to ensure
+that the original object’s ‘X-Timestamp’ value is preserved on the symlink
+object. Therefore, it is proposed that in the symlink creation request, the
+original time field can be provided (tiering-coordinator can quickly read the
+original values from container DB entry) as object user metadata, which is
+translated internally to a special sysmeta field by the symlink middleware.
+On subsequent user requests, the sysmeta field storing the correct creation
+timestamp will be sent to the user.
+
+With the symlink successfully created, Swift users can continue to issue object
+requests such as GET and PUT to the original namespace
+/Account/Container/Object. The symlink middleware will ensure that Swift users
+do not notice the presence
+of a symlink object unless a query parameter ‘?symlink=true’ [3] is explicitly
+provided with the object request.
+
+Users can also continue to read and update object metadata as before. It is not
+entirely clear at the time of this writing if the symlink object will store a
+copy of user metadata in its own extended attributes or if it will fetch the
+metadata from the referenced object for every HEAD/GET on the object. We will
+defer to whichever implementation the symlink feature provides.
+
+An interesting race condition is possible due to the time window between object
+copy request and symlink creation. If there is an interim PUT request issued by
+a swift user between the two, it will be overwritten by the internal symlink
+created by the tiering-coordinator. This is an incorrect behavior that we need
+to protect against. We can use the same technique [8] (with help of a second
+vector timestamp) that container-reconciler uses to resolve a similar race
+condition. The tiering-coordinator, at the time of symlink creation, can detect
+the race condition and undo the COPY request. It will have to delete the object
+that was created in the destination container. Though this is wasted work in
+the face of such race conditions, we expect them to be rare scenarios. If the
+user conceives tiering rules properly, there ought to be little to no
+foreground traffic for the object that is being tiered.
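
The detect-and-undo flow can be illustrated with in-memory dictionaries standing in for the source and target containers. This is a deliberately simplified sketch: it compares a single timestamp rather than using the vector-timestamp technique referenced above, the check and the write are not atomic, and `finish_tiering` and the dictionary layout are hypothetical.

```python
def finish_tiering(source, target, name, copied_timestamp):
    """Replace a copied object with a symlink, or undo the copy on a race.

    source/target: dicts mapping object name -> {"timestamp": ..., ...}.
    copied_timestamp: the source object's timestamp when it was copied.
    Returns True if the symlink was created, False if the copy was undone.
    """
    current = source.get(name)
    if current is None or current["timestamp"] != copied_timestamp:
        # The object was overwritten or deleted after the copy was taken:
        # discard the stale copy and leave the source untouched.
        target.pop(name, None)
        return False
    # Preserve the original creation time on the symlink entry.
    source[name] = {"symlink_to": name, "timestamp": copied_timestamp}
    return True
```

If a user PUT lands between the copy and this step, the newer source object survives and the stale copy in the target is removed, at the cost of the wasted copy.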
+
+6. Future Work
+===============
+
+6.1 Other Tiering Criteria
+--------------------------
+The first version of tiering implementation will be heavily tailored (especially
+the scanning mechanism of tiering-coordinator) to the object age criterion. The
+convenient property of container DBs, which store objects in the same order as
+they are created or overwritten, lends itself to very efficient linear scanning
+for candidate objects.
+
+In the future, we should be able to support advanced criteria such as read
+frequency counts, object size, metadata-based selection, etc. For example,
+consider the following hypothetical criterion:
+
+"Tier objects from container S to container T if older than 1 month AND size >
+1GB AND tagged with metadata ‘surveillance-video’"
+
+When the metadata search feature [5] is available in Swift, tiering-coordinator
+should be able to run queries to quickly retrieve the set of object names that
+match ad-hoc criteria on both user and system metadata. As the metadata search
+feature evolves, we should be able to leverage it to add custom metadata such
+as read counts for our purposes.
+
+6.2 Integration with External Storage Tiers
+-------------------------------------------
+The first implementation of tiering will only support object movement between
+Swift containers. In order to establish a tiering relationship between a swift
+container and an external storage backend, the backend must be mounted in Swift
+as a native container through the DiskFile API or other integration mechanisms.
+For instance, a target container fully hosted on GlusterFS or Seagate Kinetic
+drives can be created through Swift-on-file or Kinetic DiskFile implementations
+respectively.
+
+The Swift community believes that a similar integration approach is necessary
+to support external storage systems as tiering targets. There is already work
+underway to integrate tape-based systems in Swift. In the same vein, future
+work is needed to integrate external systems like Amazon Glacier or vendor
+archival products via DiskFile drivers or other means.
+
+7. Open Issues
+==============
+This section is structured as a series of questions and possible answers. With
+more feedback from the swift community, the open issues will be resolved and
+merged into the main document.
+
+Q1: Can the target container exist on a different account than the source
+container?
+
+Ans: The proposed API assumes that the target container is always on the same
+account as the source container. If this restriction is lifted, the proposed
+API needs to be modified appropriately.
+
+Q2: When the client sets the tiering metadata on the source container, should
+the target container exist at that time? What if the user has no permissions on
+the target container? When is all the error checking done?
+
+Ans: The error checking can be deferred to the tiering-coordinator process. The
+background process, upon detecting that the target container is unavailable can
+skip performing any tiering activity on the source container and move on to the
+next container. However, it might be better to detect errors in the client path
+and report early. If the latter approach is chosen, middleware functionality is
+needed to sanity check tiering metadata set on containers.
+
+Q3: How is the target container presented to the client? Would it be just like
+any other container with read/write permissions?
+
+Ans: The target container will be just like any other container. The client is
+responsible for manipulating the contents in the target container correctly. In
+particular, it should be aware that there might be symlinks in source container
+pointing to target objects. Deletions or overwrites of objects directly using
+the target container namespace could render some symlinks useless or obsolete.
+
+Q4: What is the behavior when conflicting tiering metadata are set over a
+period of time? For example, if the tiering age threshold is increased on a
+container with a POST metadata operation, will previously de-staged objects
+be brought back to the source container to match the new tiering rule?
+
+Ans: Perhaps not. The new tiering metadata should probably only be applied to
+objects that have not yet been processed by tiering-coordinator. Previous
+actions performed by tiering-coordinator based on older metadata need not be
+reversed.
+
+Q5: When a user issues a PUT operation to an object that has been de-staged to
+the target container earlier, what is the behavior?
+
+Ans: The default symlink behavior should apply but it’s not clear what it will
+be. Will an overwrite PUT cause the symlink middleware to delete both the
+symlink and the object being pointed to?
+
+Q6: When a user issues a GET operation to an object that has been de-staged to
+the target container earlier, will it be promoted back to source container?
+
+Ans: The proposed implementation does not transparently promote objects back
+to an upper tier. If needed, such behavior can be added with the help of a
+tiering middleware in the proxy server.
+
+Q7: There is a mention of the ability to set cascading tiering relationships
+between multiple containers, C1 -> C2 -> C3. What if there is a cycle in this
+relationship graph?
+
+Ans: A cycle should be prevented, else we can run into at least one complicated
+situation where a symlink might be pointing to an object on the same container
+with the same name, thereby overwriting the symlink! It is possible to detect
+cycles at the time of tiering metadata creation in the client path with a
+tiering-specific middleware that is entrusted with the cycle detection by
+iterating through existing tiering relationships.
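
A cycle check of the kind suggested here might be sketched as follows, assuming each container has at most one tiering target and that existing relationships are available as a dictionary from source to target. The function name and data layout are hypothetical.

```python
def would_create_cycle(relationships, source, target):
    """Return True if adding source -> target would close a tiering cycle.

    relationships: dict mapping each source container to its target.
    """
    seen = set()
    current = target
    # Follow the chain of targets starting from the proposed target.
    while current is not None and current not in seen:
        if current == source:
            return True  # the chain leads back to the proposed source
        seen.add(current)
        current = relationships.get(current)
    return False
```

With C1 -> C2 -> C3 already in place, a request to tier C3 into C1 would be rejected, while tiering C3 into a new container C4 would be allowed.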
+
+Q8: Are there any unexpected interactions of tiering with existing or new
+features like SLO/DLO, encryption, container sharding, etc.?
+
+Ans: SLO and DLO segments should continue to work as expected. If an object
+server receives an object copy request for an SLO manifest object from a
+tiering-coordinator, it will iteratively perform the copy for each constituent
+object. Each constituent object will be replaced by a symlink. Encryption
+should also work correctly as it is almost entirely orthogonal to the tiering
+feature. Each object is treated as an opaque set of bytes by the tiering engine
+and it does not pay any heed to whether the object is cipher text or not.
+Dealing with container sharding might be tricky. Tiering-coordinator expects
+to linearly walk through the indices of a container DB. If the container DB
+is fragmented and stored in many different container servers, the scanning
+process can get complicated. Any ideas there?
+
+8. References
+=============
+
+1.  http://www.enterprisetech.com/2013/10/25/facebook-loads-innovative-cold-storage-datacenter/
+2.  http://www-03.ibm.com/systems/storage/tape/
+3.  Symlinks in Swift. https://review.openstack.org/#/c/173609/
+4.  Changing storage policies in Swift. https://review.openstack.org/#/c/168761/
+5.  Add metadata search in Swift. https://review.openstack.org/#/c/180918/
+6.  Tape support in Swift. https://etherpad.openstack.org/p/liberty-swift-tape-storage
+7.  http://docs.openstack.org/developer/swift/overview_container_sync.html
+8.  Container reconciler section at http://docs.openstack.org/developer/swift/overview_policies.html