diff --git a/doc/admin-guide-cloud-rst/source/figures/objectstorage-arch.png b/doc/admin-guide-cloud-rst/source/figures/objectstorage-arch.png
new file mode 100644
index 0000000000..3b7978b673
Binary files /dev/null and b/doc/admin-guide-cloud-rst/source/figures/objectstorage-arch.png differ
diff --git a/doc/admin-guide-cloud-rst/source/figures/objectstorage-nodes.png b/doc/admin-guide-cloud-rst/source/figures/objectstorage-nodes.png
new file mode 100644
index 0000000000..e7a0396f5f
Binary files /dev/null and b/doc/admin-guide-cloud-rst/source/figures/objectstorage-nodes.png differ
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage.rst b/doc/admin-guide-cloud-rst/source/objectstorage.rst
index 94baa41e8c..ebfce18c9e 100644
--- a/doc/admin-guide-cloud-rst/source/objectstorage.rst
+++ b/doc/admin-guide-cloud-rst/source/objectstorage.rst
@@ -2,9 +2,6 @@
 Object Storage
 ==============
 
-Contents
-~~~~~~~~
-
 .. toctree::
    :maxdepth: 2
 
@@ -12,13 +9,13 @@ Contents
    objectstorage_features.rst
    objectstorage_characteristics.rst
    objectstorage_components.rst
-   objectstorage-monitoring.rst
-   objectstorage-admin.rst
-
-.. TODO (karenb)
    objectstorage_ringbuilder.rst
    objectstorage_arch.rst
    objectstorage_replication.rst
    objectstorage_account_reaper.rst
    objectstorage_tenant_specific_image_storage.rst
+   objectstorage-monitoring.rst
+   objectstorage-admin.rst
+
+.. TODO (karenb)
    objectstorage_troubleshoot.rst
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage_account_reaper.rst b/doc/admin-guide-cloud-rst/source/objectstorage_account_reaper.rst
new file mode 100644
index 0000000000..ebab97ffa3
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage_account_reaper.rst
@@ -0,0 +1,50 @@
==============
Account reaper
==============

In the background, the account reaper removes data from deleted
accounts.

A reseller marks an account for deletion by issuing a ``DELETE`` request
on the account's storage URL. This action sets the ``status`` column of
the ``account_stat`` table in the account database and its replicas to
``DELETED``, marking the account's data for deletion.

Typically, no specific retention time or undelete facility is provided.
However, you can set a ``delay_reaping`` value in the
``[account-reaper]`` section of the :file:`account-server.conf` file to
delay the actual deletion of data. Currently, to undelete an account you
must update the account database replicas directly: set the ``status``
column to an empty string and update the ``put_timestamp`` to be greater
than the ``delete_timestamp``.

.. note::

   It is on the development to-do list to write a utility that performs
   this task, preferably through a REST call.

The account reaper runs on each account server and occasionally scans
the server for account databases marked for deletion. It reaps only the
accounts for which the server is the primary node, so that multiple
account servers do not try to reap the same account simultaneously.
Using multiple servers to delete one account might improve deletion
speed, but would require coordination to avoid duplicated work. Speed is
not a major concern for data deletion, and large accounts are not
deleted often.

Deleting an account is simple. For each container in the account, all
objects are deleted and then the container is deleted. Failed deletion
requests do not stop the overall process, but they cause it to fail
eventually (for example, if an object delete times out, the container
and the account cannot be deleted). The account reaper keeps trying to
delete an account until it is empty, at which point the database reclaim
process within the ``db_replicator`` removes the database files.
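
The following minimal Python sketch illustrates the deletion order and
failure handling described above. It models the account as an in-memory
mapping and uses placeholder delete functions; it is not the actual
``swift.account.reaper`` implementation::

    # Illustrative model: containers mapped to the objects they hold.
    account = {
        "container-a": ["obj-1", "obj-2"],
        "container-b": ["obj-3"],
    }

    def delete_object(container, obj):
        """Placeholder for an object DELETE request; may fail or time out."""
        print("DELETE object %s/%s" % (container, obj))
        return True

    def delete_container(container):
        """Placeholder for a container DELETE request."""
        print("DELETE container %s" % container)
        return True

    def reap_account(account):
        """Delete all objects, then each empty container; report success."""
        for container, objects in list(account.items()):
            remaining = [o for o in objects if not delete_object(container, o)]
            account[container] = remaining
            # A container can be deleted only once it is empty.
            if not remaining and delete_container(container):
                del account[container]
        return not account

    # The real reaper retries on its next scan until the account is empty.
    while not reap_account(account):
        pass
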
A persistent error state may prevent the deletion of an object or
container. If this happens, you will see a message in the log, for
example::

    Account <name> has not been reaped since <date>

You can control when this message is logged with the ``reap_warn_after``
value in the ``[account-reaper]`` section of the
:file:`account-server.conf` file. The default value is 30 days.
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage_arch.rst b/doc/admin-guide-cloud-rst/source/objectstorage_arch.rst
new file mode 100644
index 0000000000..7bc1a37456
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage_arch.rst
@@ -0,0 +1,81 @@
====================
Cluster architecture
====================

Access tier
~~~~~~~~~~~
Large-scale deployments segment off an access tier, which is considered
the Object Storage system's central hub. The access tier fields the
incoming API requests from clients and moves data in and out of the
system. This tier consists of front-end load balancers, SSL terminators,
and authentication services. It runs the (distributed) brain of the
Object Storage system: the proxy server processes.

**Object Storage architecture**

|

.. image:: figures/objectstorage-arch.png

|

Because access servers are collocated in their own tier, you can scale
out read/write access regardless of the storage capacity. For example,
if a cluster is on the public Internet, requires SSL termination, and
has a high demand for data access, you can provision many access
servers. However, if the cluster is on a private network and used
primarily for archival purposes, you need fewer access servers.

Because this is an HTTP-addressable storage service, you can incorporate
a load balancer into the access tier.

Typically, the tier consists of a collection of 1U servers. These
machines use a moderate amount of RAM and are network I/O intensive.
Because these systems field each incoming API request, you should
provision them with two high-throughput (10GbE) interfaces: one for the
incoming front-end requests and the other for the back-end access to the
object storage nodes to put and fetch data.

Factors to consider
-------------------
For most publicly facing deployments, as well as private deployments
available across a wide-reaching corporate network, use SSL to encrypt
traffic to the client. SSL adds significant processing load when
establishing sessions with clients, which is why you have to provision
more capacity in the access layer. SSL may not be required for private
deployments on trusted networks.

Storage nodes
~~~~~~~~~~~~~
In most configurations, each of the five zones should have an equal
amount of storage capacity. Storage nodes use a reasonable amount of
memory and CPU. Metadata needs to be readily available to return objects
quickly. The object stores run services not only to field incoming
requests from the access tier, but also to run replicators, auditors,
and reapers. You can provision storage nodes with single gigabit or
10 gigabit network interfaces, depending on the expected workload and
desired performance.

**Object Storage (swift)**

|

.. image:: figures/objectstorage-nodes.png

|

Currently, a 2 TB or 3 TB SATA disk delivers good performance for the
price.
You can use desktop-grade drives if you have responsive remote hands in
the datacenter, and enterprise-grade drives if you do not.

Factors to consider
-------------------
Keep in mind the desired I/O performance for single-threaded requests.
This system does not use RAID, so a single disk handles each request for
an object. Disk performance impacts single-threaded response rates.

To achieve higher throughput, the object storage system is designed to
handle concurrent uploads and downloads. The network I/O capacity (1GbE,
bonded 1GbE pair, or 10GbE) should match your desired concurrent
throughput needs for reads and writes.
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage_replication.rst b/doc/admin-guide-cloud-rst/source/objectstorage_replication.rst
new file mode 100644
index 0000000000..ef8707cd0d
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage_replication.rst
@@ -0,0 +1,96 @@
===========
Replication
===========

Because each replica in Object Storage functions independently and
clients generally require only a simple majority of nodes to respond to
consider an operation successful, transient failures like network
partitions can quickly cause replicas to diverge. These differences are
eventually reconciled by asynchronous, peer-to-peer replicator
processes. The replicator processes traverse their local file systems
and concurrently perform operations in a manner that balances load
across physical disks.

Replication uses a push model, with records and files generally only
being copied from local to remote replicas. This is important because
data on the node might not belong there (as in the case of handoffs and
ring changes), and a replicator cannot know which data it should pull in
from elsewhere in the cluster. Any node that contains data must ensure
that data gets to where it belongs. The ring handles replica placement.

To replicate deletions in addition to creations, every deleted record or
file in the system is marked by a tombstone. The replication process
cleans up tombstones after a time period known as the *consistency
window*. This window encompasses the replication duration and the length
of time a transient failure can remove a node from the cluster.
Tombstone cleanup must be tied to replication to reach replica
convergence.

If a replicator detects that a remote drive has failed, the replicator
uses the ``get_more_nodes`` interface for the ring to choose an
alternate node with which to synchronize. The replicator can maintain
desired levels of replication during disk failures, though some replicas
might not be in an immediately usable location.

.. note::

   The replicator does not maintain desired levels of replication when
   other failures occur, such as entire node failures, because most
   failures are transient.

The main replication types are:

- Database replication, which replicates containers and objects.
- Object replication, which replicates object data.

Database replication
~~~~~~~~~~~~~~~~~~~~
Database replication completes a low-cost hash comparison to determine
whether two replicas already match. Normally, this check can quickly
verify that most databases in the system are already synchronized. If
the hashes differ, the replicator synchronizes the databases by sharing
records added since the last synchronization point.

This synchronization point is a high water mark that notes the last
record at which two databases were known to be synchronized, and is
stored in each database as a tuple of the remote database ID and record
ID. Database IDs are unique across all replicas of the database, and
record IDs are monotonically increasing integers. After all new records
are pushed to the remote database, the entire synchronization table of
the local database is pushed, so the remote database can guarantee that
it is synchronized with everything with which the local database was
previously synchronized.
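
The following minimal sketch illustrates the hash comparison and
sync-point logic described above, using in-memory lists in place of the
account and container databases. The structures and names are
illustrative; this is not the actual ``db_replicator`` code::

    import hashlib

    def db_hash(records):
        """Low-cost comparison value for a replica's contents."""
        return hashlib.md5(repr(sorted(records)).encode()).hexdigest()

    def replicate(local, remote, sync_points):
        """Push records added since the last known synchronization point."""
        if db_hash(local["records"]) == db_hash(remote["records"]):
            return  # replicas already match; nothing to do
        last_sync = sync_points.get(remote["id"], -1)
        # Records are (monotonically increasing record ID, value) tuples.
        new_records = [r for r in local["records"] if r[0] > last_sync]
        remote["records"].extend(new_records)
        # Advance the high water mark for this peer.
        sync_points[remote["id"]] = local["records"][-1][0]

    local = {"id": "db-a", "records": [(1, "row1"), (2, "row2"), (3, "row3")]}
    remote = {"id": "db-b", "records": [(1, "row1")]}
    sync_points = {"db-b": 1}  # high water mark from a previous pass

    replicate(local, remote, sync_points)
    print(remote["records"])   # now includes records 2 and 3
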
If a replica is missing, the whole local database file is transmitted to
the peer by using rsync(1) and is assigned a new unique ID.

In practice, database replication can process hundreds of databases per
concurrency setting per second (up to the number of available CPUs or
disks) and is bound by the number of database transactions that must be
performed.

Object replication
~~~~~~~~~~~~~~~~~~
The initial implementation of object replication performed an rsync to
push data from a local partition to all remote servers where it was
expected to reside. While this worked at small scale, replication times
skyrocketed once directory structures could no longer be held in RAM.
This scheme was modified to save a hash of the contents of each suffix
directory to a per-partition hashes file. The hash for a suffix
directory is invalidated when the contents of that suffix directory are
modified.

The object replication process reads in hash files and calculates any
invalidated hashes. Then, it transmits the hashes to each remote server
that should hold the partition, and only suffix directories with
differing hashes on the remote server are rsynced. After pushing files
to the remote server, the replication process notifies it to recalculate
hashes for the rsynced suffix directories.

The number of uncached directories that object replication must
traverse, usually as a result of invalidated suffix directory hashes,
impedes performance. To provide acceptable replication speeds, object
replication is designed to invalidate around 2 percent of the hash space
on a normal node each day.
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage_ringbuilder.rst b/doc/admin-guide-cloud-rst/source/objectstorage_ringbuilder.rst
new file mode 100644
index 0000000000..225dc438e9
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage_ringbuilder.rst
@@ -0,0 +1,181 @@
============
Ring-builder
============

Use the swift-ring-builder utility to build and manage rings. This
utility assigns partitions to devices and writes an optimized Python
structure to a gzipped, serialized file on disk for transmission to the
servers. The server processes occasionally check the modification time
of the file and reload in-memory copies of the ring structure as needed.
Because of the way the ring-builder manages changes to the ring, using a
slightly older version of the ring means that one of the three replicas
for a subset of the partitions will be incorrect. You can work around
this issue.

The ring-builder also keeps its own builder file with the ring
information and additional data required to build future rings. It is
very important to keep multiple backup copies of these builder files.
One option is to copy the builder files out to every server while
copying the ring files themselves. Another is to upload the builder
files into the cluster itself. If you lose the builder file, you have to
create a new ring from scratch. Nearly all partitions would be assigned
to different devices and, therefore, nearly all of the stored data would
have to be replicated to new locations. So, recovery from a builder file
loss is possible, but data would be unreachable for an extended time.
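
For reference, a typical builder workflow creates the builder file, adds
devices, and rebalances to produce the ring file that is distributed to
the servers. The part power, device addresses, ports, and weights below
are illustrative values only::

    $ swift-ring-builder object.builder create 18 3 1
    $ swift-ring-builder object.builder add r1z1-192.168.1.10:6000/sda1 100
    $ swift-ring-builder object.builder add r1z2-192.168.1.11:6000/sda1 100
    $ swift-ring-builder object.builder add r1z3-192.168.1.12:6000/sda1 100
    $ swift-ring-builder object.builder rebalance
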
Ring data structure
~~~~~~~~~~~~~~~~~~~
The ring data structure consists of three top level fields: a list of
devices in the cluster, a list of lists of device IDs indicating
partition-to-device assignments, and an integer indicating the number of
bits to shift an MD5 hash to calculate the partition for the hash.

Partition assignment list
~~~~~~~~~~~~~~~~~~~~~~~~~
This is a list of ``array('H')`` arrays of device IDs. The outermost
list contains an ``array('H')`` for each replica. Each ``array('H')``
has a length equal to the partition count for the ring. Each integer in
the ``array('H')`` is an index into the above list of devices. The
partition list is known internally to the Ring class as
``_replica2part2dev_id``.

So, to create a list of device dictionaries assigned to a partition, the
Python code would look like::

    devices = [self.devs[part2dev_id[partition]] for
               part2dev_id in self._replica2part2dev_id]

That code is a little simplistic because it does not account for the
removal of duplicate devices. If a ring has more replicas than devices,
a partition will have more than one replica on a device.

``array('H')`` is used for memory conservation, as there may be millions
of partitions.

Replica counts
~~~~~~~~~~~~~~
To support the gradual change in replica counts, a ring can have a real
number of replicas and is not restricted to an integer number of
replicas.

A fractional replica count applies to the whole ring, not to individual
partitions. It indicates the average number of replicas for each
partition. For example, a replica count of 3.2 means that 20 percent of
partitions have four replicas and 80 percent have three replicas.

The replica count is adjustable.

Example::

    $ swift-ring-builder account.builder set_replicas 4
    $ swift-ring-builder account.builder rebalance

You must rebalance the ring for a new replica count to take effect.
Adjusting the replica count is especially useful in globally distributed
clusters, where operators generally want an equal number of replicas and
regions. Therefore, when an operator adds or removes a region, the
operator adds or removes a replica. Removing unneeded replicas saves on
the cost of disks.

You can gradually increase the replica count at a rate that does not
adversely affect cluster performance.

For example::

    $ swift-ring-builder object.builder set_replicas 3.01
    $ swift-ring-builder object.builder rebalance
    ...

    $ swift-ring-builder object.builder set_replicas 3.02
    $ swift-ring-builder object.builder rebalance
    ...

Changes take effect only after the ring is rebalanced. Therefore, if you
intend to change from 3 replicas to 3.01 but you accidentally type 2.01,
no data is lost.

Additionally, the ``swift-ring-builder X.builder create`` command can
now take a decimal argument for the number of replicas.

Partition shift value
~~~~~~~~~~~~~~~~~~~~~
The partition shift value is known internally to the Ring class as
``_part_shift``. This value is used to shift an MD5 hash to calculate
the partition where the data for that hash should reside. Only the top
four bytes of the hash are used in this process. For example, to compute
the partition for the :file:`/account/container/object` path using
Python::

    partition = unpack_from('>I',
        md5('/account/container/object').digest())[0] >> self._part_shift

For a ring generated with a part power of P, the partition shift value
is ``32 - P``.
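
The fragment above relies on Ring class internals
(``self._part_shift``). The following self-contained sketch shows the
same computation with explicit imports and an assumed part power of 18,
so you can verify the shift rule independently::

    from hashlib import md5
    from struct import unpack_from

    part_power = 18                # illustrative value
    part_shift = 32 - part_power   # the rule described above

    path = b'/account/container/object'
    partition = unpack_from('>I', md5(path).digest())[0] >> part_shift
    print(partition)               # an integer in [0, 2**part_power)
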
Build the ring
~~~~~~~~~~~~~~
The ring builder process includes these high-level steps:

#. The utility calculates the number of partitions to assign to each
   device based on the weight of the device. For example, for a
   partition power of 20, the ring has 1,048,576 partitions. One
   thousand devices of equal weight each want 1,048.576 partitions
   (1,048,576 divided by 1,000). The devices are sorted by the number of
   partitions they desire and kept in order throughout the
   initialization process.

   .. note::

      Each device is also assigned a random tiebreaker value that is
      used when two devices desire the same number of partitions. This
      tiebreaker is not stored on disk anywhere, and so two different
      rings created with the same parameters will have different
      partition assignments. For repeatable partition assignments,
      ``RingBuilder.rebalance()`` takes an optional seed value that
      seeds the Python pseudo-random number generator.

#. The ring builder assigns each partition replica to the device that
   desires the most partitions at that point while keeping it as far
   away as possible from other replicas. The ring builder prefers to
   assign a replica to a device in a region that does not already have a
   replica. If no such region is available, the ring builder searches
   for a device in a different zone, or on a different server. If it
   does not find one, it looks for a device with no replicas. Finally,
   if all options are exhausted, the ring builder assigns the replica to
   the device that has the fewest replicas already assigned.

   .. note::

      The ring builder assigns multiple replicas to one device only if
      the ring has fewer devices than it has replicas.

#. When building a new ring from an old ring, the ring builder
   recalculates the desired number of partitions for each device.

#. The ring builder unassigns partitions and gathers these partitions
   for reassignment, as follows:

   - The ring builder unassigns any assigned partitions from any removed
     devices and adds these partitions to the gathered list.
   - The ring builder unassigns any partition replicas that can be
     spread out for better durability and adds these partitions to the
     gathered list.
   - The ring builder unassigns random partitions from any devices that
     have more partitions than they need and adds these partitions to
     the gathered list.

#. The ring builder reassigns the gathered partitions to devices by
   using a similar method to the one described previously.

#. When the ring builder reassigns a replica to a partition, the ring
   builder records the time of the reassignment. The ring builder uses
   this value when it gathers partitions for reassignment so that no
   partition is moved twice in a configurable amount of time. The
   RingBuilder class knows this configurable amount of time as
   ``min_part_hours``. The ring builder ignores this restriction for
   replicas of partitions on removed devices, because removal of a
   device happens only on device failure and reassignment is the only
   choice.

These steps do not always perfectly rebalance a ring due to the random
nature of gathering partitions for reassignment. To help reach a more
balanced ring, the rebalance process is repeated until it is nearly
perfect (less than 1 percent off) or until the balance does not improve
by at least 1 percent (indicating that a perfect balance probably cannot
be reached because of wildly imbalanced zones or too many recently moved
partitions).
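
Because rebalancing is partly random and may need to be repeated, you
can inspect the result after each pass. Running swift-ring-builder with
only the builder file as its argument prints summary information about
the builder, including the current balance, without modifying it. The
builder file name below is illustrative::

    $ swift-ring-builder object.builder rebalance
    $ swift-ring-builder object.builder
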
diff --git a/doc/admin-guide-cloud-rst/source/objectstorage_tenant_specific_image_storage.rst b/doc/admin-guide-cloud-rst/source/objectstorage_tenant_specific_image_storage.rst
new file mode 100644
index 0000000000..caa55cf453
--- /dev/null
+++ b/doc/admin-guide-cloud-rst/source/objectstorage_tenant_specific_image_storage.rst
@@ -0,0 +1,31 @@
=============================================================
Configure tenant-specific image locations with Object Storage
=============================================================

For some deployers, it is not ideal to store all images in one place
that all tenants and users can access. To address this, you can
configure the Image service to store image data in tenant-specific image
locations. Then, only the following tenants can use the Image service to
access the created image:

- The tenant who owns the image
- Tenants that are defined in ``swift_store_admin_tenants`` and that
  have admin-level accounts

**To configure tenant-specific image locations**

#. Configure swift as your ``default_store`` in the
   :file:`glance-api.conf` file.

#. Set these configuration options in the :file:`glance-api.conf` file:

   ``swift_store_multi_tenant``
      Set to ``True`` to enable tenant-specific storage locations. The
      default is ``False``.

   ``swift_store_admin_tenants``
      Specify a list of tenant IDs that can grant read and write access
      to all Object Storage containers that are created by the Image
      service.

With this configuration, images are stored in an Object Storage service
(swift) endpoint that is pulled from the service catalog for the
authenticated user.
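
A minimal excerpt of the relevant :file:`glance-api.conf` settings might
look like the following. The tenant IDs are placeholders, and depending
on the release these options belong in the ``[glance_store]`` section
or, for older releases, in ``[DEFAULT]``::

    [glance_store]
    default_store = swift
    swift_store_multi_tenant = True
    swift_store_admin_tenants = <tenant_id_1>,<tenant_id_2>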