There have been members of the community running sharding in production and
it's awesome. It's just the auto-sharding part of Swift that remains
experimental. This patch removes the big sharding warning from the top of the
sharding overview page and better emphasises that it's the auto_shard option
that isn't ready for production use.

Change-Id: Id2c842cffad58fb6fd5e1d12619c46ffcb38f8a5

.. _sharding_doc:

==================
Container Sharding
==================

Container sharding is an operator controlled feature that may be used to shard
very large container databases into a number of smaller shard containers.

.. note::

    It is strongly recommended that operators gain experience of sharding
    containers in a non-production cluster before using in production.

    The sharding process involves moving all sharding container database
    records via the container replication engine; the time taken to complete
    sharding is dependent upon the existing cluster load and the performance of
    the container database being sharded.

    There is currently no documented process for reversing the sharding
    process once sharding has been enabled.


----------
Background
----------
The metadata for each container in Swift is stored in an SQLite database. This
metadata includes: information about the container such as its name,
modification time and current object count; user metadata that may have been
written to the container by clients; a record of every object in the container.
The container database object records are used to generate container listings
in response to container GET requests; each object record stores the object's
name, size, hash and content-type as well as associated timestamps.

As the number of objects in a container increases, so does the number of
object records in the container database. Eventually the container database
performance starts to degrade and the time taken to update an object record
increases. This can result in object updates timing out, with a corresponding
increase in the backlog of pending :ref:`asynchronous updates
<architecture_updaters>` on object servers. Container databases are typically
replicated on several nodes and any database performance degradation can also
result in longer :doc:`container replication <overview_replication>` times.

The point at which container database performance starts to degrade depends
upon the choice of hardware in the container ring. Anecdotal evidence suggests
that containers with tens of millions of object records have noticeably
degraded performance.

This performance degradation can be avoided by ensuring that clients use an
object naming scheme that disperses objects across a number of containers
thereby distributing load across a number of container databases. However, that
is not always desirable nor is it under the control of the cluster operator.

Swift's container sharding feature provides the operator with a mechanism to
distribute the load on a single client-visible container across multiple,
hidden, shard containers, each of which stores a subset of the container's
object records. Clients are unaware of container sharding; clients continue to
use the same API to access a container that, if sharded, maps to a number of
shard containers within the Swift cluster.

------------------------
Deployment and operation
------------------------

Upgrade Considerations
----------------------

It is essential that all servers in a Swift cluster have been upgraded to
support the container sharding feature before attempting to shard a container.

Identifying containers in need of sharding
------------------------------------------

Container sharding is currently initiated by the ``swift-manage-shard-ranges``
CLI tool :ref:`described below <swift-manage-shard-ranges>`. Operators must
first identify containers that are candidates for sharding. To assist with
this, the :ref:`sharder_daemon` inspects the size of containers that it visits
and writes a list of sharding candidates to recon cache. For example::

    "sharding_candidates": {
        "found": 1,
        "top": [
            {
                "account": "AUTH_test",
                "container": "c1",
                "file_size": 497763328,
                "meta_timestamp": "1525346445.31161",
                "node_index": 2,
                "object_count": 3349028,
                "path": <path_to_db>,
                "root": "AUTH_test/c1"
            }
        ]
    }

A container is considered to be a sharding candidate if its object count is
greater than or equal to the ``shard_container_threshold`` option.
The number of candidates reported is limited to a number configured by the
``recon_candidates_limit`` option such that only the largest candidate
containers are included in the ``sharding_candidates`` data.

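
The recon cache is a JSON file maintained on each container server. As a
hedged illustration, assuming the default ``recon_cache_path`` of
``/var/cache/swift`` and that the sharder has already populated the cache, the
candidate list might be dumped on a container node with::

    $ python -c "import json; print(json.load(open('/var/cache/swift/container.recon'))['sharding_candidates'])"
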
.. _swift-manage-shard-ranges:

``swift-manage-shard-ranges`` CLI tool
--------------------------------------

.. automodule:: swift.cli.manage_shard_ranges
    :members:
    :show-inheritance:


.. _sharder_daemon:

``container-sharder`` daemon
----------------------------

Once sharding has been enabled for a container, the act of sharding is
performed by the :ref:`container-sharder`.  The :ref:`container-sharder` daemon
must be running on all container servers. The ``container-sharder`` daemon
periodically visits each container database to perform any container sharding
tasks that are required.

The ``container-sharder`` daemon requires a ``[container-sharder]`` config
section to exist in the container server configuration file; a sample config
section is shown in the `container-server.conf-sample` file.

.. note::

    The ``auto_shard`` option is currently **NOT** recommended for production
    systems and should be set to ``false`` (the default value).

    Several of the ``[container-sharder]`` config options are only significant
    when the ``auto_shard`` option is enabled. This option enables the
    ``container-sharder`` daemon to automatically identify containers that are
    candidates for sharding and initiate the sharding process, instead of using
    the ``swift-manage-shard-ranges`` tool.

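
The sharding-related options referred to in this document live in that config
section. The following sketch collects them in one place; the option names are
those mentioned on this page, but the values are purely illustrative and the
sample config file remains the authoritative reference for defaults::

    [container-sharder]
    # automatic sharding is not ready for production use
    auto_shard = false
    # object count at which a container is reported as a sharding candidate
    shard_container_threshold = 1000000
    # number of candidates reported in the recon cache
    recon_candidates_limit = 5
    # number of shard ranges cleaved per visit to a sharding container
    cleave_batch_size = 2
    # alternative location of the internal client config, if required
    internal_client_conf_path = /etc/swift/internal-client.conf
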
The container sharder uses an internal client and therefore requires an
internal client configuration file to exist. By default the internal-client
configuration file is expected to be found at
`/etc/swift/internal-client.conf`. An alternative location for the
configuration file may be specified using the ``internal_client_conf_path``
option in the ``[container-sharder]`` config section.

The content of the internal-client configuration file should be the same as the
`internal-client.conf-sample` file. In particular, the internal-client
configuration should have::

    account_autocreate = True

in the ``[proxy-server]`` section.

A container database may require several visits by the ``container-sharder``
daemon before it is fully sharded. On each visit the ``container-sharder``
daemon will move a subset of object records to new shard containers by cleaving
new shard container databases from the original. By default, two shards are
processed per visit; this number may be configured by the ``cleave_batch_size``
option.

The ``container-sharder`` daemon periodically writes progress data for
containers that are being sharded to recon cache. For example::

    "sharding_in_progress": {
        "all": [
            {
                "account": "AUTH_test",
                "active": 0,
                "cleaved": 2,
                "container": "c1",
                "created": 5,
                "db_state": "sharding",
                "error": null,
                "file_size": 26624,
                "found": 0,
                "meta_timestamp": "1525349617.46235",
                "node_index": 1,
                "object_count": 3349030,
                "path": <path_to_db>,
                "root": "AUTH_test/c1",
                "state": "sharding"
            }
        ]
    }

This example indicates that from a total of 7 shard ranges, 2 have been cleaved
whereas 5 remain in created state waiting to be cleaved.

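
As with the sharding candidates, this progress data is written to the
container recon cache and can be read straight from the cache file; the path
below assumes the default ``recon_cache_path`` and is illustrative only::

    $ python -c "import json; print(json.load(open('/var/cache/swift/container.recon'))['sharding_in_progress'])"
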
Shard containers are created in an internal account and are not visible to
clients. By default, shard containers for an account ``AUTH_test`` are created
in the internal account ``.shards_AUTH_test``.

Once a container has started sharding, object updates to that container may be
redirected to the shard container. The ``container-sharder`` daemon is also
responsible for sending updates of a shard's object count and bytes_used to the
original container so that aggregate object count and bytes used values can be
returned in responses to client requests.

.. note::

    The ``container-sharder`` daemon must continue to run on all container
    servers in order for shard object stats updates to be generated.


--------------
Under the hood
--------------

Terminology
-----------

================== ====================================================
Name               Description
================== ====================================================
Root container     The original container that lives in the
                   user's account. It holds references to its
                   shard containers.
Retiring DB        The original database file that is to be sharded.
Fresh DB           A database file that will replace the retiring
                   database.
Epoch              A timestamp at which the fresh DB is created; the
                   epoch value is embedded in the fresh DB filename.
Shard range        A range of the object namespace defined by a lower
                   bound and upper bound.
Shard container    A container that holds object records for a shard
                   range. Shard containers exist in a hidden account
                   mirroring the user's account.
Parent container   The container from which a shard container has been
                   cleaved. When first sharding a root container each
                   shard's parent container will be the root container.
                   When sharding a shard container each shard's parent
                   container will be the sharding shard container.
Misplaced objects  Items that don't belong in a container's shard
                   range. These will be moved to their correct
                   location by the container-sharder.
Cleaving           The act of moving object records within a shard
                   range to a shard container database.
Shrinking          The act of merging a small shard container into
                   another shard container in order to delete the
                   small shard container.
Donor              The shard range that is shrinking away.
Acceptor           The shard range into which a donor is merged.
================== ====================================================


Finding shard ranges
--------------------

The end goal of sharding a container is to replace the original container
database which has grown very large with a number of shard container databases,
each of which is responsible for storing a range of the entire object
namespace. The first step towards achieving this is to identify an appropriate
set of contiguous object namespaces, known as shard ranges, each of which
contains a similar sized portion of the container's current object content.

Shard ranges cannot simply be selected by sharding the namespace uniformly,
because object names are not guaranteed to be distributed uniformly. If the
container were naively sharded into two shard ranges, one containing all
object names up to `m` and the other containing all object names beyond `m`,
then if all object names actually start with `o` the outcome would be an
extremely unbalanced pair of shard containers.

It is also too simplistic to assume that every container that requires sharding
can be sharded into two. This might be the goal in the ideal world, but in
practice there will be containers that have grown very large and should be
sharded into many shards. Furthermore, the time required to find the exact
mid-point of the existing object names in a large SQLite database would
increase with container size.

For these reasons, shard ranges of size `N` are found by searching for the
`Nth` object in the database table, sorted by object name, and then searching
for the `(2 * N)th` object, and so on until all objects have been searched. For
a container that has exactly `2N` objects, the end result is the same as
sharding the container at the midpoint of its object names. In practice
sharding would typically be enabled for containers with greater than `2N`
objects and more than two shard ranges will be found, the last one probably
containing fewer than `N` objects. With containers having large multiples of
`N` objects, shard ranges can be identified in batches, which enables a more
scalable solution.

To illustrate this process, consider a very large container in a user account
``acct`` that is a candidate for sharding:

.. image:: images/sharding_unsharded.svg

The :ref:`swift-manage-shard-ranges` tool ``find`` sub-command searches the
object table for the `Nth` object whose name will become the upper bound of the
first shard range, and the lower bound of the second shard range. The lower
bound of the first shard range is the empty string.

For the purposes of this example the first upper bound is `cat`:

.. image:: images/sharding_scan_basic.svg

:ref:`swift-manage-shard-ranges` continues to search the container to find
further shard ranges, with the final upper bound also being the empty string.

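
For example, a hedged sketch of how the ``find`` sub-command might be invoked
against a replica of the container database, asking for shard ranges of
roughly 500000 rows each and saving the result for later use (consult the
tool's ``--help`` output for the exact arguments supported by your release)::

    $ swift-manage-shard-ranges <path_to_db> find 500000 > /tmp/shard_ranges.json
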
Enabling sharding
-----------------

Once shard ranges have been found, the :ref:`swift-manage-shard-ranges`
``replace`` sub-command is used to insert them into the `shard_ranges` table
of the container database. In addition to its lower and upper bounds, each
shard range is given a unique name.

The ``enable`` sub-command then creates some final state required to initiate
sharding the container, including a special shard range record referred to as
the container's `own_shard_range` whose name is equal to the container's path.
This is used to keep a record of the object namespace that the container
covers, which for user containers is always the entire namespace. Sharding of
the container will only begin when its own shard range's state has been set to
``SHARDING``.

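
Continuing the earlier sketch, the found shard ranges might be loaded and
sharding enabled with something like the following; as before, the exact
arguments are illustrative and the tool's ``--help`` output is authoritative::

    $ swift-manage-shard-ranges <path_to_db> replace /tmp/shard_ranges.json
    $ swift-manage-shard-ranges <path_to_db> enable
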
The :class:`~swift.common.utils.ShardRange` class
-------------------------------------------------

The :class:`~swift.common.utils.ShardRange` class provides methods for
interacting with the attributes and state of a shard range. The class
encapsulates the following properties:

* The name of the shard range which is also the name of the shard container
  used to hold object records in its namespace.
* Lower and upper bounds which define the object namespace of the shard range.
* A deleted flag.
* A timestamp at which the bounds and deleted flag were last modified.
* The object stats for the shard range i.e. object count and bytes used.
* A timestamp at which the object stats were last modified.
* The state of the shard range, and an epoch, which is the timestamp used in
  the shard container's database file name.
* A timestamp at which the state and epoch were last modified.

A shard range progresses through the following states:

* FOUND: the shard range has been identified in the container that is to be
  sharded but no resources have been created for it.
* CREATED: a shard container has been created to store the contents of the
  shard range.
* CLEAVED: the sharding container's contents for the shard range have been
  copied to the shard container from *at least one replica* of the sharding
  container.
* ACTIVE: a sharding container's constituent shard ranges are moved to this
  state when all shard ranges in the sharding container have been cleaved.
* SHRINKING: the shard range has been enabled for shrinking; or
* SHARDING: the shard range has been enabled for sharding into further
  sub-shards.
* SHARDED: the shard range has completed sharding or shrinking; the container
  will typically now have a number of constituent ACTIVE shard ranges.

.. note::

    Shard range state represents the most advanced state of the shard range on
    any replica of the container. For example, a shard range in CLEAVED state
    may not have completed cleaving on all replicas but has cleaved on at least
    one replica.

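
The shard ranges recorded in a container database, along with their current
states, can be inspected on a container node; for example, the
``swift-manage-shard-ranges`` tool provides an ``info`` sub-command for this
purpose (shown here as a hedged sketch; the sub-command name and its output
format may vary between releases)::

    $ swift-manage-shard-ranges <path_to_db> info
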
Fresh and retiring database files
---------------------------------

As alluded to earlier, writing to a large container causes increased latency
for the container servers. Once sharding has been initiated on a container it
is desirable to stop writing to the large database; ultimately it will be
unlinked. This is primarily achieved by redirecting object updates to new shard
containers as they are created (see :ref:`redirecting_updates` below), but some
object updates may still need to be accepted by the root container and other
container metadata must still be modifiable.

To render the large `retiring` database effectively read-only, when the
:ref:`sharder_daemon` finds a container with a set of shard range records,
including an `own_shard_range`, it first creates a fresh database file which
will ultimately replace the existing `retiring` database. For a retiring DB
whose filename is::

    <hash>.db

the fresh database file name is of the form::

    <hash>_<epoch>.db

where `epoch` is a timestamp stored in the container's `own_shard_range`.

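
For example, while a container is sharding its directory might contain both
files side by side; the hash below is made up for illustration and the epoch
reuses the timestamp format seen in the recon examples above::

    dcd7d21a1d0cd2d0a1e869db2a626c16.db
    dcd7d21a1d0cd2d0a1e869db2a626c16_1525349617.46235.db
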
The fresh DB has a copy of the shard ranges table from the retiring DB and all
other container metadata apart from the object records. Once a fresh DB file
has been created it is used to store any new object updates and no more object
records are written to the retiring DB file.

Once the sharding process has completed, the retiring DB file will be unlinked
leaving only the fresh DB file in the container's directory. There are
therefore three states that the container DB directory may be in during the
sharding process: UNSHARDED, SHARDING and SHARDED.

.. image:: images/sharding_db_states.svg

If the container ever shrinks to the point that it has no shards then the fresh
DB starts to store object records, behaving the same as an unsharded container.
This is known as the COLLAPSED state.

In summary, the DB states that any container replica may be in are:

- UNSHARDED - In this state there is just one standard container database. All
  containers are originally in this state.
- SHARDING - There are now two databases, the retiring database and a fresh
  database. The fresh database stores any metadata, container level stats,
  an object holding table, and a table that stores shard ranges.
- SHARDED - There is only one database, the fresh database, which has one or
  more shard ranges in addition to its own shard range. The retiring database
  has been unlinked.
- COLLAPSED - There is only one database, the fresh database, which has only
  its own shard range and stores object records.

.. note::

    DB state is unique to each replica of a container and is not necessarily
    synchronised with shard range state.

Creating shard containers
-------------------------

The :ref:`sharder_daemon` next creates a shard container for each shard range
using the shard range name as the name of the shard container:

.. image:: /images/sharding_cleave_basic.svg

Each shard container has an `own_shard_range` record which has the
lower and upper bounds of the object namespace for which it is responsible, and
a reference to the sharding user container, which is referred to as the
`root_container`. Unlike the `root_container`, the shard container's
`own_shard_range` does not cover the entire namespace.

A shard range name takes the form ``<shard_a>/<shard_c>`` where `<shard_a>`
is a hidden account and `<shard_c>` is a container name that is derived from
the root container.

The account name `<shard_a>` used for shard containers is formed by prefixing
the user account with the string ``.shards_``. This avoids namespace collisions
and also keeps all the shard containers out of view from users of the account.

The container name for each shard container has the form::

  <root container name>-<hash of parent container>-<timestamp>-<shard index>

where `root container name` is the name of the user container to which the
contents of the shard container belong, `parent container` is the name of the
container from which the shard is being cleaved, `timestamp` is the time at
which the shard range was created and `shard index` is the position of the
shard range in the name-ordered list of shard ranges for the `parent
container`.

When sharding a user container the parent container name will be the same as
the root container. However, if a *shard container* grows to a size that it
requires sharding, then the parent container name for its shards will be the
name of the sharding shard container.

For example, consider a user container with path ``AUTH_user/c`` which is
sharded into two shard containers whose name will be::

  .shards_AUTH_user/c-<hash(c)>-1234512345.12345-0
  .shards_AUTH_user/c-<hash(c)>-1234512345.12345-1

If the first shard container is subsequently sharded into a further two shard
containers then they will be named::

  .shards_AUTH_user/c-<hash(c-<hash(c)>-1234567890.12345-0)>-1234567890.12345-0
  .shards_AUTH_user/c-<hash(c-<hash(c)>-1234567890.12345-0)>-1234567890.12345-1

This naming scheme guarantees that shards, and shards of shards, each have a
unique name of bounded length.


Cleaving shard containers
-------------------------

Having created empty shard containers, the sharder daemon will proceed to
cleave objects from the retiring database to each shard range. Cleaving occurs
in batches of two (by default) shard ranges, so if a container has more than
two shard ranges then the daemon must visit it multiple times to complete
cleaving.

To cleave a shard range the daemon creates a shard database for the shard
container on a local device. This device may be one of the shard container's
primary nodes but often it will not be. Object records from the corresponding
shard range namespace are then copied from the retiring DB to this shard DB.

Swift's container replication mechanism is then used to replicate the shard DB
to its primary nodes. Checks are made to ensure that the new shard container DB
has been replicated to a sufficient number of its primary nodes before it is
considered to have been successfully cleaved. By default the daemon requires
successful replication of a new shard broker to at least a quorum of the
container ring's replica count, but this requirement can be tuned using the
``shard_replication_quorum`` option.

Once a shard range has been successfully cleaved from a retiring database the
daemon transitions its state to ``CLEAVED``. It should be noted that this state
transition occurs as soon as any one of the retiring DB replicas has cleaved
the shard range, and therefore does not imply that all retiring DB replicas
have cleaved that range. The significance of the state transition is that the
shard container is now considered suitable for contributing to object listings,
since its contents are present on a quorum of its primary nodes and are the
same as at least one of the retiring DBs for that namespace.

Once a shard range is in the ``CLEAVED`` state, the requirement for
'successful' cleaving of other instances of the retiring DB may optionally be
relaxed since it is not so imperative that their contents are replicated
*immediately* to their primary nodes. The ``existing_shard_replication_quorum``
option can be used to reduce the quorum required for a cleaved shard range to
be considered successfully replicated by the sharder daemon.

.. note::

    Once cleaved, shard container DBs will continue to be replicated by the
    normal `container-replicator` daemon so that they will eventually be fully
    replicated to all primary nodes regardless of any replication quorum options
    used by the sharder daemon.

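
As a hedged illustration for a three-replica container ring, the two quorum
options discussed above might be tuned as follows; the values are examples
only and the defaults shipped in the sample config file should be checked
before changing them::

    [container-sharder]
    # require success on 2 of 3 primaries before a new shard range is
    # considered cleaved
    shard_replication_quorum = 2
    # accept a single successful replication for ranges already in the
    # CLEAVED state
    existing_shard_replication_quorum = 1
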
The cleaving progress of each replica of a retiring DB must be
tracked independently of the shard range state. This is done using a per-DB
CleavingContext object that maintains a cleaving cursor for the retiring DB
that it is associated with. The cleaving cursor is simply the upper bound of
the last shard range to have been cleaved *from that particular retiring DB*.

Each CleavingContext is stored in the sharding container's sysmeta under a key
that is the ``id`` of the retiring DB. Since all container DB files have a
unique ``id``, this guarantees that each retiring DB will have a unique
CleavingContext. Furthermore, if the retiring DB file is changed, for example
by an rsync_then_merge replication operation which might change the contents of
the DB's object table, then it will get a new unique CleavingContext.

A CleavingContext maintains other state that is used to ensure that a retiring
DB is only considered to be fully cleaved, and ready to be deleted, if *all* of
its object rows have been cleaved to a shard range.

Once all shard ranges have been cleaved from the retiring DB it is deleted. The
container is now represented by the fresh DB which has a table of shard range
records that point to the shard containers that store the container's object
records.

.. _redirecting_updates:

Redirecting object updates
--------------------------

Once a shard container exists, object updates arising from new client requests
and async pending files are directed to the shard container instead of the root
container. This takes load off of the root container.

For a sharded (or partially sharded) container, when the proxy receives a new
object request it issues a GET request to the container for data describing a
shard container to which the object update should be sent. The proxy then
annotates the object request with the shard container location so that the
object server will forward object updates to the shard container. If those
updates fail then the async pending file that is written on the object server
contains the shard container location.

When the object updater processes async pending files for previously failed
object updates, it may not find a shard container location. In this case the
updater sends the update to the `root container`, which returns a redirection
response with the shard container location.

.. note::

    Object updates are directed to shard containers as soon as they exist, even
    if the retiring DB object records have not yet been cleaved to the shard
    container. This prevents further writes to the retiring DB and also avoids
    the fresh DB being polluted by new object updates. The goal is to
    ultimately have all object records in the shard containers and none in the
    root container.

Building container listings
---------------------------

Listing requests for a sharded container are handled by querying the shard
containers for components of the listing. The proxy forwards the client listing
request to the root container, as it would for an unsharded container, but the
container server responds with a list of shard ranges rather than objects. The
proxy then queries each shard container in namespace order for their listing,
until either the listing length limit is reached or all shard ranges have been
listed.

While a container is still in the process of sharding, only *cleaved* shard
ranges are used when building a container listing. Shard ranges that have not
yet been cleaved will not have any object records from the root container. The
root container continues to provide listings for the uncleaved part of its
namespace.

.. note::

   New object updates are redirected to shard containers that have not yet been
   cleaved. These updates will not therefore be included in container listings
   until their shard range has been cleaved.

Example request redirection
---------------------------

As an example, consider a sharding container in which 3 shard ranges have been
found ending in cat, giraffe and igloo. Their respective shard containers have
been created so update requests for objects up to "igloo" are redirected to the
appropriate shard container. The root DB continues to handle listing requests
and update requests for any object name beyond "igloo".

.. image:: images/sharding_scan_load.svg

The sharder daemon cleaves objects from the retiring DB to the shard range DBs;
it also moves any misplaced objects from the root container's fresh DB to the
shard DB. Cleaving progress is represented by the blue line. Once the first
shard range has been cleaved listing requests for that namespace are directed
to the shard container. The root container still provides listings for the
remainder of the namespace.

.. image:: images/sharding_cleave1_load.svg

The process continues: the sharder cleaves the next range and a new range is
found with upper bound of "linux". Now the root container only needs to handle
listing requests up to "giraffe" and update requests for objects whose name is
greater than "linux". Load will continue to diminish on the root DB and be
dispersed across the shard DBs.

.. image:: images/sharding_cleave2_load.svg


Container replication
---------------------

Shard range records are replicated between container DB replicas in much the
same way as object records are for unsharded containers. However, the usual
replication of object records between replicas of a container is halted as soon
as a container is capable of being sharded. Instead, object records are moved
to their new locations in shard containers. This avoids unnecessary replication
traffic between container replicas.

To facilitate this, shard ranges are both 'pushed' and 'pulled' during
replication, prior to any attempt to replicate objects. This means that the
node initiating replication learns about shard ranges from the destination node
early during the replication process and is able to skip object replication if
it discovers that it has shard ranges and is able to shard.

.. note::

    When the destination DB for container replication is missing then the
    'complete_rsync' replication mechanism is still used and in this case both
    object records and shard range records are copied to the destination node.

Container deletion
------------------

Sharded containers may be deleted by a ``DELETE`` request just like an
unsharded container. A sharded container must be empty before it can be
deleted, which implies that all of its shard containers must have reported that
they are empty.

Shard containers are *not* immediately deleted when their root container is
deleted; the shard containers remain undeleted so that they are able to
continue to receive object updates that might arrive after the root container
has been deleted. Shard containers continue to update their deleted root
container with their object stats. If a shard container does receive object
updates that cause it to no longer be empty then the root container will no
longer be considered deleted once that shard container sends an object stats
update.


Sharding a shard container
--------------------------

A shard container may grow to a size that requires it to be sharded.
``swift-manage-shard-ranges`` may be used to identify shard ranges within a
shard container and enable sharding in the same way as for a root container.
When a shard is sharding it notifies the root container of its shard ranges so
that the root container can start to redirect object updates to the new
'sub-shards'. When the shard has completed sharding the root is aware of all
the new sub-shards and the sharding shard deletes its shard range record in the
root container shard ranges table. At this point the root container is aware of
all the new sub-shards which collectively cover the namespace of the
now-deleted shard.

There is no hierarchy of shards beyond the root container and its immediate
shards. When a shard shards, its sub-shards are effectively re-parented with
the root container.


Shrinking a shard container
---------------------------

A shard container's contents may reduce to a point where the shard container is
no longer required. If this happens then the shard container may be shrunk into
another shard range. Shrinking is achieved in a similar way to sharding: an
'acceptor' shard range is written to the shrinking shard container's shard
ranges table; unlike sharding, where shard ranges each cover a subset of the
sharding container's namespace, the acceptor shard range is a superset of the
shrinking shard range.

Once given an acceptor shard range the shrinking shard will cleave itself to
its acceptor, and then delete itself from the root container shard ranges
table.