diff --git a/doc/source/admin/figures/objectstorage-accountscontainers.png b/doc/source/admin/figures/objectstorage-accountscontainers.png new file mode 100644 index 00000000..4df7326a Binary files /dev/null and b/doc/source/admin/figures/objectstorage-accountscontainers.png differ diff --git a/doc/source/admin/figures/objectstorage-arch.png b/doc/source/admin/figures/objectstorage-arch.png new file mode 100644 index 00000000..3b7978b6 Binary files /dev/null and b/doc/source/admin/figures/objectstorage-arch.png differ diff --git a/doc/source/admin/figures/objectstorage-buildingblocks.png b/doc/source/admin/figures/objectstorage-buildingblocks.png new file mode 100644 index 00000000..8499ca1e Binary files /dev/null and b/doc/source/admin/figures/objectstorage-buildingblocks.png differ diff --git a/doc/source/admin/figures/objectstorage-nodes.png b/doc/source/admin/figures/objectstorage-nodes.png new file mode 100644 index 00000000..e7a0396f Binary files /dev/null and b/doc/source/admin/figures/objectstorage-nodes.png differ diff --git a/doc/source/admin/figures/objectstorage-partitions.png b/doc/source/admin/figures/objectstorage-partitions.png new file mode 100644 index 00000000..7e319ca0 Binary files /dev/null and b/doc/source/admin/figures/objectstorage-partitions.png differ diff --git a/doc/source/admin/figures/objectstorage-replication.png b/doc/source/admin/figures/objectstorage-replication.png new file mode 100644 index 00000000..8ce13091 Binary files /dev/null and b/doc/source/admin/figures/objectstorage-replication.png differ diff --git a/doc/source/admin/figures/objectstorage-ring.png b/doc/source/admin/figures/objectstorage-ring.png new file mode 100644 index 00000000..22ef3120 Binary files /dev/null and b/doc/source/admin/figures/objectstorage-ring.png differ diff --git a/doc/source/admin/figures/objectstorage-usecase.png b/doc/source/admin/figures/objectstorage-usecase.png new file mode 100644 index 00000000..5d7c8f42 Binary files /dev/null and b/doc/source/admin/figures/objectstorage-usecase.png differ diff --git a/doc/source/admin/figures/objectstorage-zones.png b/doc/source/admin/figures/objectstorage-zones.png new file mode 100644 index 00000000..ee5ffbf7 Binary files /dev/null and b/doc/source/admin/figures/objectstorage-zones.png differ diff --git a/doc/source/admin/figures/objectstorage.png b/doc/source/admin/figures/objectstorage.png new file mode 100644 index 00000000..9454065c Binary files /dev/null and b/doc/source/admin/figures/objectstorage.png differ diff --git a/doc/source/admin/index.rst b/doc/source/admin/index.rst new file mode 100644 index 00000000..c4f8f48f --- /dev/null +++ b/doc/source/admin/index.rst @@ -0,0 +1,22 @@ +=================================== +OpenStack Swift Administrator Guide +=================================== + +.. 
toctree:: + :maxdepth: 2 + + objectstorage-intro.rst + objectstorage-features.rst + objectstorage-characteristics.rst + objectstorage-components.rst + objectstorage-ringbuilder.rst + objectstorage-arch.rst + objectstorage-replication.rst + objectstorage-large-objects.rst + objectstorage-auditors.rst + objectstorage-EC.rst + objectstorage-account-reaper.rst + objectstorage-tenant-specific-image-storage.rst + objectstorage-monitoring.rst + objectstorage-admin.rst + objectstorage-troubleshoot.rst diff --git a/doc/source/admin/objectstorage-EC.rst b/doc/source/admin/objectstorage-EC.rst new file mode 100644 index 00000000..e01179ab --- /dev/null +++ b/doc/source/admin/objectstorage-EC.rst @@ -0,0 +1,31 @@ +============== +Erasure coding +============== + +Erasure coding is a set of algorithms that allows the reconstruction of +missing data from a set of original data. In theory, erasure coding uses +less capacity with similar durability characteristics as replicas. +From an application perspective, erasure coding support is transparent. +Object Storage (swift) implements erasure coding as a Storage Policy. +See `Storage Policies +`_ +for more details. + +There is no external API related to erasure coding. Create a container using a +Storage Policy; the interaction with the cluster is the same as any +other durability policy. Because support implements as a Storage Policy, +you can isolate all storage devices that associate with your cluster's +erasure coding capability. It is entirely possible to share devices between +storage policies, but for erasure coding it may make more sense to use +not only separate devices but possibly even entire nodes dedicated for erasure +coding. + +.. important:: + + The erasure code support in Object Storage is considered beta in Kilo. + Most major functionality is included, but it has not been tested or + validated at large scale. This feature relies on ``ssync`` for durability. + We recommend deployers do extensive testing and not deploy production + data using an erasure code storage policy. + If any bugs are found during testing, please report them to + https://bugs.launchpad.net/swift diff --git a/doc/source/admin/objectstorage-account-reaper.rst b/doc/source/admin/objectstorage-account-reaper.rst new file mode 100644 index 00000000..0acdc205 --- /dev/null +++ b/doc/source/admin/objectstorage-account-reaper.rst @@ -0,0 +1,51 @@ +============== +Account reaper +============== + +The purpose of the account reaper is to remove data from the deleted accounts. + +A reseller marks an account for deletion by issuing a ``DELETE`` request +on the account's storage URL. This action sets the ``status`` column of +the account_stat table in the account database and replicas to +``DELETED``, marking the account's data for deletion. + +Typically, a specific retention time or undelete are not provided. +However, you can set a ``delay_reaping`` value in the +``[account-reaper]`` section of the ``account-server.conf`` file to +delay the actual deletion of data. At this time, to undelete you have to update +the account database replicas directly, set the status column to an +empty string and update the put_timestamp to be greater than the +delete_timestamp. + +.. note:: + + It is on the development to-do list to write a utility that performs + this task, preferably through a REST call. + +The account reaper runs on each account server and scans the server +occasionally for account databases marked for deletion. 
It only fires up +on the accounts for which the server is the primary node, so that +multiple account servers aren't trying to do it simultaneously. Using +multiple servers to delete one account might improve the deletion speed +but requires coordination to avoid duplication. Speed really is not a +big concern with data deletion, and large accounts aren't deleted often. + +Deleting an account is simple. For each account container, all objects +are deleted and then the container is deleted. Deletion requests that +fail will not stop the overall process but will cause the overall +process to fail eventually (for example, if an object delete times out, +you will not be able to delete the container or the account). The +account reaper keeps trying to delete an account until it is empty, at +which point the database reclaim process within the db\_replicator will +remove the database files. + +A persistent error state may prevent the deletion of an object or +container. If this happens, you will see a message in the log, for example: + +.. code-block:: console + + Account has not been reaped since + +You can control when this is logged with the ``reap_warn_after`` value in the +``[account-reaper]`` section of the ``account-server.conf`` file. +The default value is 30 days. diff --git a/doc/source/admin/objectstorage-admin.rst b/doc/source/admin/objectstorage-admin.rst new file mode 100644 index 00000000..f0323e49 --- /dev/null +++ b/doc/source/admin/objectstorage-admin.rst @@ -0,0 +1,11 @@ +======================================== +System administration for Object Storage +======================================== + +By understanding Object Storage concepts, you can better monitor and +administer your storage solution. The majority of the administration +information is maintained in developer documentation at +`docs.openstack.org/developer/swift/ `__. + +See the `OpenStack Configuration Reference `__ +for a list of configuration options for Object Storage. diff --git a/doc/source/admin/objectstorage-arch.rst b/doc/source/admin/objectstorage-arch.rst new file mode 100644 index 00000000..4e2f5143 --- /dev/null +++ b/doc/source/admin/objectstorage-arch.rst @@ -0,0 +1,88 @@ +==================== +Cluster architecture +==================== + +Access tier +~~~~~~~~~~~ +Large-scale deployments segment off an access tier, which is considered +the Object Storage system's central hub. The access tier fields the +incoming API requests from clients and moves data in and out of the +system. This tier consists of front-end load balancers, ssl-terminators, +and authentication services. It runs the (distributed) brain of the +Object Storage system: the proxy server processes. + +.. note:: + + If you want to use OpenStack Identity API v3 for authentication, you + have the following options available in ``/etc/swift/dispersion.conf``: + ``auth_version``, ``user_domain_name``, ``project_domain_name``, + and ``project_name``. + +**Object Storage architecture** + + +.. figure:: figures/objectstorage-arch.png + + +Because access servers are collocated in their own tier, you can scale +out read/write access regardless of the storage capacity. For example, +if a cluster is on the public Internet, requires SSL termination, and +has a high demand for data access, you can provision many access +servers. However, if the cluster is on a private network and used +primarily for archival purposes, you need fewer access servers. 
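+
+As a hedged illustration of the Identity API v3 options mentioned in the
+note above, a minimal ``/etc/swift/dispersion.conf`` sketch might look like
+the following; the endpoint, credential, and domain values are
+placeholders, not defaults:
+
+.. code-block:: ini
+
+   [dispersion]
+   auth_url = http://controller.example.com:5000/v3
+   auth_version = 3
+   auth_user = dispersion
+   auth_key = DISPERSION_PASS
+   project_name = service
+   project_domain_name = Default
+   user_domain_name = Default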
+ +Since this is an HTTP addressable storage service, you may incorporate a +load balancer into the access tier. + +Typically, the tier consists of a collection of 1U servers. These +machines use a moderate amount of RAM and are network I/O intensive. +Since these systems field each incoming API request, you should +provision them with two high-throughput (10GbE) interfaces - one for the +incoming ``front-end`` requests and the other for the ``back-end`` access to +the object storage nodes to put and fetch data. + +Factors to consider +------------------- + +For most publicly facing deployments as well as private deployments +available across a wide-reaching corporate network, you use SSL to +encrypt traffic to the client. SSL adds significant processing load to +establish sessions between clients, which is why you have to provision +more capacity in the access layer. SSL may not be required for private +deployments on trusted networks. + +Storage nodes +~~~~~~~~~~~~~ + +In most configurations, each of the five zones should have an equal +amount of storage capacity. Storage nodes use a reasonable amount of +memory and CPU. Metadata needs to be readily available to return objects +quickly. The object stores run services not only to field incoming +requests from the access tier, but to also run replicators, auditors, +and reapers. You can provision object stores provisioned with single +gigabit or 10 gigabit network interface depending on the expected +workload and desired performance. + +**Object Storage (swift)** + + +.. figure:: figures/objectstorage-nodes.png + + + +Currently, a 2 TB or 3 TB SATA disk delivers good performance for the +price. You can use desktop-grade drives if you have responsive remote +hands in the datacenter and enterprise-grade drives if you don't. + +Factors to consider +------------------- + +You should keep in mind the desired I/O performance for single-threaded +requests. This system does not use RAID, so a single disk handles each +request for an object. Disk performance impacts single-threaded response +rates. + +To achieve apparent higher throughput, the object storage system is +designed to handle concurrent uploads/downloads. The network I/O +capacity (1GbE, bonded 1GbE pair, or 10GbE) should match your desired +concurrent throughput needs for reads and writes. diff --git a/doc/source/admin/objectstorage-auditors.rst b/doc/source/admin/objectstorage-auditors.rst new file mode 100644 index 00000000..1a3a5783 --- /dev/null +++ b/doc/source/admin/objectstorage-auditors.rst @@ -0,0 +1,30 @@ +============== +Object Auditor +============== + +On system failures, the XFS file system can sometimes truncate files it is +trying to write and produce zero-byte files. The object-auditor will catch +these problems but in the case of a system crash it is advisable to run +an extra, less rate limited sweep, to check for these specific files. +You can run this command as follows: + +.. code-block:: console + + $ swift-object-auditor /path/to/object-server/config/file.conf once -z 1000 + +.. note:: + + "-z" means to only check for zero-byte files at 1000 files per second. + +It is useful to run the object auditor on a specific device or set of devices. +You can run the object-auditor once as follows: + +.. code-block:: console + + $ swift-object-auditor /path/to/object-server/config/file.conf once \ + --devices=sda,sdb + +.. note:: + + This will run the object auditor on only the ``sda`` and ``sdb`` devices. + This parameter accepts a comma-separated list of values. 
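+
+For routine operation, the auditor's regular sweeps are rate limited through
+the ``[object-auditor]`` section of the ``object-server.conf`` file. The
+following sketch shows the relevant options; the option names exist in
+Object Storage, but the values here are only illustrative:
+
+.. code-block:: ini
+
+   [object-auditor]
+   # Maximum number of files checked per second by the regular sweep.
+   files_per_second = 20
+   # Maximum number of bytes read per second while hashing object contents.
+   bytes_per_second = 10000000
+   # Rate limit for the separate zero-byte-file sweep.
+   zero_byte_files_per_second = 50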
diff --git a/doc/source/admin/objectstorage-characteristics.rst b/doc/source/admin/objectstorage-characteristics.rst new file mode 100644 index 00000000..1b520708 --- /dev/null +++ b/doc/source/admin/objectstorage-characteristics.rst @@ -0,0 +1,43 @@ +============================== +Object Storage characteristics +============================== + +The key characteristics of Object Storage are that: + +- All objects stored in Object Storage have a URL. + +- All objects stored are replicated 3✕ in as-unique-as-possible zones, + which can be defined as a group of drives, a node, a rack, and so on. + +- All objects have their own metadata. + +- Developers interact with the object storage system through a RESTful + HTTP API. + +- Object data can be located anywhere in the cluster. + +- The cluster scales by adding additional nodes without sacrificing + performance, which allows a more cost-effective linear storage + expansion than fork-lift upgrades. + +- Data does not have to be migrated to an entirely new storage system. + +- New nodes can be added to the cluster without downtime. + +- Failed nodes and disks can be swapped out without downtime. + +- It runs on industry-standard hardware, such as Dell, HP, and + Supermicro. + +.. _objectstorage-figure: + +Object Storage (swift) + +.. figure:: figures/objectstorage.png + +Developers can either write directly to the Swift API or use one of the +many client libraries that exist for all of the popular programming +languages, such as Java, Python, Ruby, and C#. Amazon S3 and RackSpace +Cloud Files users should be very familiar with Object Storage. Users new +to object storage systems will have to adjust to a different approach +and mindset than those required for a traditional filesystem. diff --git a/doc/source/admin/objectstorage-components.rst b/doc/source/admin/objectstorage-components.rst new file mode 100644 index 00000000..11b8feac --- /dev/null +++ b/doc/source/admin/objectstorage-components.rst @@ -0,0 +1,258 @@ +========== +Components +========== + +Object Storage uses the following components to deliver high +availability, high durability, and high concurrency: + +- **Proxy servers** - Handle all of the incoming API requests. + +- **Rings** - Map logical names of data to locations on particular + disks. + +- **Zones** - Isolate data from other zones. A failure in one zone + does not impact the rest of the cluster as data replicates + across zones. + +- **Accounts and containers** - Each account and container are + individual databases that are distributed across the cluster. An + account database contains the list of containers in that account. A + container database contains the list of objects in that container. + +- **Objects** - The data itself. + +- **Partitions** - A partition stores objects, account databases, and + container databases and helps manage locations where data lives in + the cluster. + + +.. _objectstorage-building-blocks-figure: + +**Object Storage building blocks** + +.. figure:: figures/objectstorage-buildingblocks.png + + +Proxy servers +------------- + +Proxy servers are the public face of Object Storage and handle all of +the incoming API requests. Once a proxy server receives a request, it +determines the storage node based on the object's URL, for example: +https://swift.example.com/v1/account/container/object. Proxy servers +also coordinate responses, handle failures, and coordinate timestamps. + +Proxy servers use a shared-nothing architecture and can be scaled as +needed based on projected workloads. 
A minimum of two proxy servers +should be deployed for redundancy. If one proxy server fails, the others +take over. + +For more information concerning proxy server configuration, see +`Configuration Reference +`_. + +Rings +----- + +A ring represents a mapping between the names of entities stored on disks +and their physical locations. There are separate rings for accounts, +containers, and objects. When other components need to perform any +operation on an object, container, or account, they need to interact +with the appropriate ring to determine their location in the cluster. + +The ring maintains this mapping using zones, devices, partitions, and +replicas. Each partition in the ring is replicated, by default, three +times across the cluster, and partition locations are stored in the +mapping maintained by the ring. The ring is also responsible for +determining which devices are used for handoff in failure scenarios. + +Data can be isolated into zones in the ring. Each partition replica is +guaranteed to reside in a different zone. A zone could represent a +drive, a server, a cabinet, a switch, or even a data center. + +The partitions of the ring are equally divided among all of the devices +in the Object Storage installation. When partitions need to be moved +around (for example, if a device is added to the cluster), the ring +ensures that a minimum number of partitions are moved at a time, and +only one replica of a partition is moved at a time. + +You can use weights to balance the distribution of partitions on drives +across the cluster. This can be useful, for example, when differently +sized drives are used in a cluster. + +The ring is used by the proxy server and several background processes +(like replication). + + +.. _objectstorage-ring-figure: + +**The ring** + +.. figure:: figures/objectstorage-ring.png + +These rings are externally managed. The server processes themselves +do not modify the rings, they are instead given new rings modified by +other tools. + +The ring uses a configurable number of bits from an ``MD5`` hash for a path +as a partition index that designates a device. The number of bits kept +from the hash is known as the partition power, and 2 to the partition +power indicates the partition count. Partitioning the full ``MD5`` hash ring +allows other parts of the cluster to work in batches of items at once +which ends up either more efficient or at least less complex than +working with each item separately or the entire cluster all at once. + +Another configurable value is the replica count, which indicates how +many of the partition-device assignments make up a single ring. For a +given partition number, each replica's device will not be in the same +zone as any other replica's device. Zones can be used to group devices +based on physical locations, power separations, network separations, or +any other attribute that would improve the availability of multiple +replicas at the same time. + +Zones +----- + +Object Storage allows configuring zones in order to isolate failure +boundaries. If possible, each data replica resides in a separate zone. +At the smallest level, a zone could be a single drive or a grouping of a +few drives. If there were five object storage servers, then each server +would represent its own zone. Larger deployments would have an entire +rack (or multiple racks) of object servers, each representing a zone. 
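+
+Zone membership is declared when a device is added to a ring. As a hedged
+example (the IP address, port, device name, and weight are placeholders), a
+device in region 1, zone 2 could be added with:
+
+.. code-block:: console
+
+   $ swift-ring-builder object.builder add r1z2-192.0.2.11:6000/sdb 100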
+The goal of zones is to allow the cluster to tolerate significant +outages of storage servers without losing all replicas of the data. + + +.. _objectstorage-zones-figure: + +**Zones** + +.. figure:: figures/objectstorage-zones.png + + +Accounts and containers +----------------------- + +Each account and container is an individual SQLite database that is +distributed across the cluster. An account database contains the list of +containers in that account. A container database contains the list of +objects in that container. + + +.. _objectstorage-accountscontainers-figure: + +**Accounts and containers** + +.. figure:: figures/objectstorage-accountscontainers.png + + +To keep track of object data locations, each account in the system has a +database that references all of its containers, and each container +database references each object. + +Partitions +---------- + +A partition is a collection of stored data. This includes account databases, +container databases, and objects. Partitions are core to the replication +system. + +Think of a partition as a bin moving throughout a fulfillment center +warehouse. Individual orders get thrown into the bin. The system treats +that bin as a cohesive entity as it moves throughout the system. A bin +is easier to deal with than many little things. It makes for fewer +moving parts throughout the system. + +System replicators and object uploads/downloads operate on partitions. +As the system scales up, its behavior continues to be predictable +because the number of partitions is a fixed number. + +Implementing a partition is conceptually simple, a partition is just a +directory sitting on a disk with a corresponding hash table of what it +contains. + + +.. _objectstorage-partitions-figure: + +**Partitions** + +.. figure:: figures/objectstorage-partitions.png + + +Replicators +----------- + +In order to ensure that there are three copies of the data everywhere, +replicators continuously examine each partition. For each local +partition, the replicator compares it against the replicated copies in +the other zones to see if there are any differences. + +The replicator knows if replication needs to take place by examining +hashes. A hash file is created for each partition, which contains hashes +of each directory in the partition. Each of the three hash files is +compared. For a given partition, the hash files for each of the +partition's copies are compared. If the hashes are different, then it is +time to replicate, and the directory that needs to be replicated is +copied over. + +This is where partitions come in handy. With fewer things in the system, +larger chunks of data are transferred around (rather than lots of little +TCP connections, which is inefficient) and there is a consistent number +of hashes to compare. + +The cluster eventually has a consistent behavior where the newest data +has a priority. + + +.. _objectstorage-replication-figure: + +**Replication** + +.. figure:: figures/objectstorage-replication.png + + +If a zone goes down, one of the nodes containing a replica notices and +proactively copies data to a handoff location. + +Use cases +--------- + +The following sections show use cases for object uploads and downloads +and introduce the components. + + +Upload +~~~~~~ + +A client uses the REST API to make a HTTP request to PUT an object into +an existing container. The cluster receives the request. First, the +system must figure out where the data is going to go. 
To do this, the +account name, container name, and object name are all used to determine +the partition where this object should live. + +Then a lookup in the ring figures out which storage nodes contain the +partitions in question. + +The data is then sent to each storage node where it is placed in the +appropriate partition. At least two of the three writes must be +successful before the client is notified that the upload was successful. + +Next, the container database is updated asynchronously to reflect that +there is a new object in it. + + +.. _objectstorage-usecase-figure: + +**Object Storage in use** + +.. figure:: figures/objectstorage-usecase.png + + +Download +~~~~~~~~ + +A request comes in for an account/container/object. Using the same +consistent hashing, the partition name is generated. A lookup in the +ring reveals which storage nodes contain that partition. A request is +made to one of the storage nodes to fetch the object and, if that fails, +requests are made to the other nodes. diff --git a/doc/source/admin/objectstorage-features.rst b/doc/source/admin/objectstorage-features.rst new file mode 100644 index 00000000..6548ceb9 --- /dev/null +++ b/doc/source/admin/objectstorage-features.rst @@ -0,0 +1,63 @@ +===================== +Features and benefits +===================== + +.. list-table:: + :header-rows: 1 + :widths: 10 40 + + * - Features + - Benefits + * - Leverages commodity hardware + - No lock-in, lower price/GB. + * - HDD/node failure agnostic + - Self-healing, reliable, data redundancy protects from failures. + * - Unlimited storage + - Large and flat namespace, highly scalable read/write access, + able to serve content directly from storage system. + * - Multi-dimensional scalability + - Scale-out architecture: Scale vertically and + horizontally-distributed storage. Backs up and archives large + amounts of data with linear performance. + * - Account/container/object structure + - No nesting, not a traditional file system: Optimized for scale, + it scales to multiple petabytes and billions of objects. + * - Built-in replication 3✕ + data redundancy (compared with 2✕ on + RAID) + - A configurable number of accounts, containers and object copies + for high availability. + * - Easily add capacity (unlike RAID resize) + - Elastic data scaling with ease. + * - No central database + - Higher performance, no bottlenecks. + * - RAID not required + - Handle many small, random reads and writes efficiently. + * - Built-in management utilities + - Account management: Create, add, verify, and delete users; + Container management: Upload, download, and verify; Monitoring: + Capacity, host, network, log trawling, and cluster health. + * - Drive auditing + - Detect drive failures preempting data corruption. + * - Expiring objects + - Users can set an expiration time or a TTL on an object to + control access. + * - Direct object access + - Enable direct browser access to content, such as for a control + panel. + * - Realtime visibility into client requests + - Know what users are requesting. + * - Supports S3 API + - Utilize tools that were designed for the popular S3 API. + * - Restrict containers per account + - Limit access to control usage by user. + * - Support for NetApp, Nexenta, Solidfire + - Unified support for block volumes using a variety of storage + systems. + * - Snapshot and backup API for block volumes. + - Data protection and recovery for VM data. + * - Standalone volume API available + - Separate endpoint and API for integration with other compute + systems. 
+ * - Integration with Compute + - Fully integrated with Compute for attaching block volumes and + reporting on usage. diff --git a/doc/source/admin/objectstorage-intro.rst b/doc/source/admin/objectstorage-intro.rst new file mode 100644 index 00000000..c5061e8a --- /dev/null +++ b/doc/source/admin/objectstorage-intro.rst @@ -0,0 +1,23 @@ +============================== +Introduction to Object Storage +============================== + +OpenStack Object Storage (swift) is used for redundant, scalable data +storage using clusters of standardized servers to store petabytes of +accessible data. It is a long-term storage system for large amounts of +static data which can be retrieved and updated. Object Storage uses a +distributed architecture +with no central point of control, providing greater scalability, +redundancy, and permanence. Objects are written to multiple hardware +devices, with the OpenStack software responsible for ensuring data +replication and integrity across the cluster. Storage clusters scale +horizontally by adding new nodes. Should a node fail, OpenStack works to +replicate its content from other active nodes. Because OpenStack uses +software logic to ensure data replication and distribution across +different devices, inexpensive commodity hard drives and servers can be +used in lieu of more expensive equipment. + +Object Storage is ideal for cost effective, scale-out storage. It +provides a fully distributed, API-accessible storage platform that can +be integrated directly into applications or used for backup, archiving, +and data retention. diff --git a/doc/source/admin/objectstorage-large-objects.rst b/doc/source/admin/objectstorage-large-objects.rst new file mode 100644 index 00000000..33d803bc --- /dev/null +++ b/doc/source/admin/objectstorage-large-objects.rst @@ -0,0 +1,35 @@ +==================== +Large object support +==================== + +Object Storage (swift) uses segmentation to support the upload of large +objects. By default, Object Storage limits the download size of a single +object to 5GB. Using segmentation, uploading a single object is virtually +unlimited. The segmentation process works by fragmenting the object, +and automatically creating a file that sends the segments together as +a single object. This option offers greater upload speed with the possibility +of parallel uploads. + +Large objects +~~~~~~~~~~~~~ +The large object is comprised of two types of objects: + +- **Segment objects** store the object content. You can divide your + content into segments, and upload each segment into its own segment + object. Segment objects do not have any special features. You create, + update, download, and delete segment objects just as you would normal + objects. + +- A **manifest object** links the segment objects into one logical + large object. When you download a manifest object, Object Storage + concatenates and returns the contents of the segment objects in the + response body of the request. The manifest object types are: + + - **Static large objects** + - **Dynamic large objects** + +To find out more information on large object support, see `Large objects +`_ +in the OpenStack End User Guide, or `Large Object Support +`_ +in the developer documentation. 
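+
+As a hedged example (the container and file names are placeholders), the
+``swift`` command-line client can create the segment objects and the
+manifest automatically during an upload:
+
+.. code-block:: console
+
+   $ swift upload my_container my_large_file --segment-size 1073741824
+
+This splits ``my_large_file`` into 1 GB segments, uploads them to a
+separate segments container, and then creates a manifest object so that the
+content can later be downloaded as a single object.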
diff --git a/doc/source/admin/objectstorage-monitoring.rst b/doc/source/admin/objectstorage-monitoring.rst new file mode 100644 index 00000000..dac37f79 --- /dev/null +++ b/doc/source/admin/objectstorage-monitoring.rst @@ -0,0 +1,228 @@ +========================= +Object Storage monitoring +========================= + +.. note:: + + This section was excerpted from a blog post by `Darrell + Bishop `_ and + has since been edited. + +An OpenStack Object Storage cluster is a collection of many daemons that +work together across many nodes. With so many different components, you +must be able to tell what is going on inside the cluster. Tracking +server-level meters like CPU utilization, load, memory consumption, disk +usage and utilization, and so on is necessary, but not sufficient. + +Swift Recon +~~~~~~~~~~~ + +The Swift Recon middleware (see +`Defining Storage Policies `_) +provides general machine statistics, such as load average, socket +statistics, ``/proc/meminfo`` contents, as well as Swift-specific meters: + +- The ``MD5`` sum of each ring file. + +- The most recent object replication time. + +- Count of each type of quarantined file: Account, container, or + object. + +- Count of "async_pendings" (deferred container updates) on disk. + +Swift Recon is middleware that is installed in the object servers +pipeline and takes one required option: A local cache directory. To +track ``async_pendings``, you must set up an additional cron job for +each object server. You access data by either sending HTTP requests +directly to the object server or using the ``swift-recon`` command-line +client. + +There are Object Storage cluster statistics but the typical +server meters overlap with existing server monitoring systems. To get +the Swift-specific meters into a monitoring system, they must be polled. +Swift Recon acts as a middleware meters collector. The +process that feeds meters to your statistics system, such as +``collectd`` and ``gmond``, should already run on the storage node. +You can choose to either talk to Swift Recon or collect the meters +directly. + +Swift-Informant +~~~~~~~~~~~~~~~ + +Swift-Informant middleware (see +`swift-informant `_) has +real-time visibility into Object Storage client requests. It sits in the +pipeline for the proxy server, and after each request to the proxy server it +sends three meters to a ``StatsD`` server: + +- A counter increment for a meter like ``obj.GET.200`` or + ``cont.PUT.404``. + +- Timing data for a meter like ``acct.GET.200`` or ``obj.GET.200``. + [The README says the meters look like ``duration.acct.GET.200``, but + I do not see the ``duration`` in the code. I am not sure what the + Etsy server does but our StatsD server turns timing meters into five + derivative meters with new segments appended, so it probably works as + coded. The first meter turns into ``acct.GET.200.lower``, + ``acct.GET.200.upper``, ``acct.GET.200.mean``, + ``acct.GET.200.upper_90``, and ``acct.GET.200.count``]. + +- A counter increase by the bytes transferred for a meter like + ``tfer.obj.PUT.201``. + +This is used for receiving information on the quality of service clients +experience with the timing meters, as well as sensing the volume of the +various modifications of a request server type, command, and response +code. Swift-Informant requires no change to core Object +Storage code because it is implemented as middleware. However, it gives +no insight into the workings of the cluster past the proxy server. 
+If the responsiveness of one storage node degrades, you can only see +that some of the requests are bad, either as high latency or error +status codes. + +Statsdlog +~~~~~~~~~ + +The `Statsdlog `_ +project increments StatsD counters based on logged events. Like +Swift-Informant, it is also non-intrusive, however statsdlog can track +events from all Object Storage daemons, not just proxy-server. The +daemon listens to a UDP stream of syslog messages, and StatsD counters +are incremented when a log line matches a regular expression. Meter +names are mapped to regex match patterns in a JSON file, allowing +flexible configuration of what meters are extracted from the log stream. + +Currently, only the first matching regex triggers a StatsD counter +increment, and the counter is always incremented by one. There is no way +to increment a counter by more than one or send timing data to StatsD +based on the log line content. The tool could be extended to handle more +meters for each line and data extraction, including timing data. But a +coupling would still exist between the log textual format and the log +parsing regexes, which would themselves be more complex to support +multiple matches for each line and data extraction. Also, log processing +introduces a delay between the triggering event and sending the data to +StatsD. It would be preferable to increment error counters where they +occur and send timing data as soon as it is known to avoid coupling +between a log string and a parsing regex and prevent a time delay +between events and sending data to StatsD. + +The next section describes another method for gathering Object Storage +operational meters. + +Swift StatsD logging +~~~~~~~~~~~~~~~~~~~~ + +StatsD (see `Measure Anything, Measure Everything +`_) +was designed for application code to be deeply instrumented. Meters are +sent in real-time by the code that just noticed or did something. The +overhead of sending a meter is extremely low: a ``sendto`` of one UDP +packet. If that overhead is still too high, the StatsD client library +can send only a random portion of samples and StatsD approximates the +actual number when flushing meters upstream. + +To avoid the problems inherent with middleware-based monitoring and +after-the-fact log processing, the sending of StatsD meters is +integrated into Object Storage itself. The submitted change set (see +``_) currently reports 124 meters +across 15 Object Storage daemons and the tempauth middleware. Details of +the meters tracked are in the `Administrator's +Guide `_. + +The sending of meters is integrated with the logging framework. To +enable, configure ``log_statsd_host`` in the relevant config file. You +can also specify the port and a default sample rate. The specified +default sample rate is used unless a specific call to a statsd logging +method (see the list below) overrides it. Currently, no logging calls +override the sample rate, but it is conceivable that some meters may +require accuracy (``sample_rate=1``) while others may not. + +.. code-block:: ini + + [DEFAULT] + # ... + log_statsd_host = 127.0.0.1 + log_statsd_port = 8125 + log_statsd_default_sample_rate = 1 + +Then the LogAdapter object returned by ``get_logger()``, usually stored +in ``self.logger``, has these new methods: + +- ``set_statsd_prefix(self, prefix)`` Sets the client library stat + prefix value which gets prefixed to every meter. The default prefix + is the ``name`` of the logger such as ``object-server``, + ``container-auditor``, and so on. 
This is currently used to turn + ``proxy-server`` into one of ``proxy-server.Account``, + ``proxy-server.Container``, or ``proxy-server.Object`` as soon as the + Controller object is determined and instantiated for the request. + +- ``update_stats(self, metric, amount, sample_rate=1)`` Increments + the supplied meter by the given amount. This is used when you need + to add or subtract more that one from a counter, like incrementing + ``suffix.hashes`` by the number of computed hashes in the object + replicator. + +- ``increment(self, metric, sample_rate=1)`` Increments the given counter + meter by one. + +- ``decrement(self, metric, sample_rate=1)`` Lowers the given counter + meter by one. + +- ``timing(self, metric, timing_ms, sample_rate=1)`` Record that the + given meter took the supplied number of milliseconds. + +- ``timing_since(self, metric, orig_time, sample_rate=1)`` + Convenience method to record a timing meter whose value is "now" + minus an existing timestamp. + +.. note:: + + These logging methods may safely be called anywhere you have a + logger object. If StatsD logging has not been configured, the methods + are no-ops. This avoids messy conditional logic each place a meter is + recorded. These example usages show the new logging methods: + + .. code-block:: python + + # swift/obj/replicator.py + def update(self, job): + # ... + begin = time.time() + try: + hashed, local_hash = tpool.execute(tpooled_get_hashes, job['path'], + do_listdir=(self.replication_count % 10) == 0, + reclaim_age=self.reclaim_age) + # See tpooled_get_hashes "Hack". + if isinstance(hashed, BaseException): + raise hashed + self.suffix_hash += hashed + self.logger.update_stats('suffix.hashes', hashed) + # ... + finally: + self.partition_times.append(time.time() - begin) + self.logger.timing_since('partition.update.timing', begin) + + .. code-block:: python + + # swift/container/updater.py + def process_container(self, dbfile): + # ... + start_time = time.time() + # ... + for event in events: + if 200 <= event.wait() < 300: + successes += 1 + else: + failures += 1 + if successes > failures: + self.logger.increment('successes') + # ... + else: + self.logger.increment('failures') + # ... + # Only track timing data for attempted updates: + self.logger.timing_since('timing', start_time) + else: + self.logger.increment('no_changes') + self.no_changes += 1 diff --git a/doc/source/admin/objectstorage-replication.rst b/doc/source/admin/objectstorage-replication.rst new file mode 100644 index 00000000..32cd33ad --- /dev/null +++ b/doc/source/admin/objectstorage-replication.rst @@ -0,0 +1,98 @@ +=========== +Replication +=========== + +Because each replica in Object Storage functions independently and +clients generally require only a simple majority of nodes to respond to +consider an operation successful, transient failures like network +partitions can quickly cause replicas to diverge. These differences are +eventually reconciled by asynchronous, peer-to-peer replicator +processes. The replicator processes traverse their local file systems +and concurrently perform operations in a manner that balances load +across physical disks. + +Replication uses a push model, with records and files generally only +being copied from local to remote replicas. This is important because +data on the node might not belong there (as in the case of hand offs and +ring changes), and a replicator cannot know which data it should pull in +from elsewhere in the cluster. 
Any node that contains data must ensure +that data gets to where it belongs. The ring handles replica placement. + +To replicate deletions in addition to creations, every deleted record or +file in the system is marked by a tombstone. The replication process +cleans up tombstones after a time period known as the ``consistency +window``. This window defines the duration of the replication and how +long transient failure can remove a node from the cluster. Tombstone +cleanup must be tied to replication to reach replica convergence. + +If a replicator detects that a remote drive has failed, the replicator +uses the ``get_more_nodes`` interface for the ring to choose an +alternate node with which to synchronize. The replicator can maintain +desired levels of replication during disk failures, though some replicas +might not be in an immediately usable location. + +.. note:: + + The replicator does not maintain desired levels of replication when + failures such as entire node failures occur; most failures are + transient. + +The main replication types are: + +- Database replication + Replicates containers and objects. + +- Object replication + Replicates object data. + +Database replication +~~~~~~~~~~~~~~~~~~~~ + +Database replication completes a low-cost hash comparison to determine +whether two replicas already match. Normally, this check can quickly +verify that most databases in the system are already synchronized. If +the hashes differ, the replicator synchronizes the databases by sharing +records added since the last synchronization point. + +This synchronization point is a high water mark that notes the last +record at which two databases were known to be synchronized, and is +stored in each database as a tuple of the remote database ID and record +ID. Database IDs are unique across all replicas of the database, and +record IDs are monotonically increasing integers. After all new records +are pushed to the remote database, the entire synchronization table of +the local database is pushed, so the remote database can guarantee that +it is synchronized with everything with which the local database was +previously synchronized. + +If a replica is missing, the whole local database file is transmitted to +the peer by using rsync(1) and is assigned a new unique ID. + +In practice, database replication can process hundreds of databases per +concurrency setting per second (up to the number of available CPUs or +disks) and is bound by the number of database transactions that must be +performed. + +Object replication +~~~~~~~~~~~~~~~~~~ + +The initial implementation of object replication performed an rsync to +push data from a local partition to all remote servers where it was +expected to reside. While this worked at small scale, replication times +skyrocketed once directory structures could no longer be held in RAM. +This scheme was modified to save a hash of the contents for each suffix +directory to a per-partition hashes file. The hash for a suffix +directory is no longer valid when the contents of that suffix directory +is modified. + +The object replication process reads in hash files and calculates any +invalidated hashes. Then, it transmits the hashes to each remote server +that should hold the partition, and only suffix directories with +differing hashes on the remote server are rsynced. After pushing files +to the remote server, the replication process notifies it to recalculate +hashes for the rsynced suffix directories. 
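+
+The following is a minimal, illustrative sketch of the suffix-hash
+comparison described above; the function and data structures are simplified
+stand-ins rather than Swift's actual replicator code:
+
+.. code-block:: python
+
+   def suffixes_to_sync(local_hashes, remote_hashes):
+       """Return the suffix directories whose content hashes differ.
+
+       Both arguments map suffix directory names to the hash of their
+       contents, as recorded in a partition's hashes file.
+       """
+       return [suffix for suffix, local_hash in local_hashes.items()
+               if remote_hashes.get(suffix) != local_hash]
+
+   # Only the suffix directory whose hash differs ('a3f') would be rsynced.
+   local = {'a3f': 'd41d8cd98f00', '07b': '9e107d9d372b'}
+   remote = {'a3f': '68b329da9893', '07b': '9e107d9d372b'}
+   print(suffixes_to_sync(local, remote))  # ['a3f']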
+ +The number of uncached directories that object replication must +traverse, usually as a result of invalidated suffix directory hashes, +impedes performance. To provide acceptable replication speeds, object +replication is designed to invalidate around 2 percent of the hash space +on a normal node each day. diff --git a/doc/source/admin/objectstorage-ringbuilder.rst b/doc/source/admin/objectstorage-ringbuilder.rst new file mode 100644 index 00000000..ddd6f606 --- /dev/null +++ b/doc/source/admin/objectstorage-ringbuilder.rst @@ -0,0 +1,228 @@ +============ +Ring-builder +============ + +Use the swift-ring-builder utility to build and manage rings. This +utility assigns partitions to devices and writes an optimized Python +structure to a gzipped, serialized file on disk for transmission to the +servers. The server processes occasionally check the modification time +of the file and reload in-memory copies of the ring structure as needed. +If you use a slightly older version of the ring, one of the three +replicas for a partition subset will be incorrect because of the way the +ring-builder manages changes to the ring. You can work around this +issue. + +The ring-builder also keeps its own builder file with the ring +information and additional data required to build future rings. It is +very important to keep multiple backup copies of these builder files. +One option is to copy the builder files out to every server while +copying the ring files themselves. Another is to upload the builder +files into the cluster itself. If you lose the builder file, you have to +create a new ring from scratch. Nearly all partitions would be assigned +to different devices and, therefore, nearly all of the stored data would +have to be replicated to new locations. So, recovery from a builder file +loss is possible, but data would be unreachable for an extended time. + +Ring data structure +~~~~~~~~~~~~~~~~~~~ + +The ring data structure consists of three top level fields: a list of +devices in the cluster, a list of lists of device ids indicating +partition to device assignments, and an integer indicating the number of +bits to shift an MD5 hash to calculate the partition for the hash. + +Partition assignment list +~~~~~~~~~~~~~~~~~~~~~~~~~ + +This is a list of ``array('H')`` of devices ids. The outermost list +contains an ``array('H')`` for each replica. Each ``array('H')`` has a +length equal to the partition count for the ring. Each integer in the +``array('H')`` is an index into the above list of devices. The partition +list is known internally to the Ring class as ``_replica2part2dev_id``. + +So, to create a list of device dictionaries assigned to a partition, the +Python code would look like: + +.. code-block:: python + + devices = [self.devs[part2dev_id[partition]] for + part2dev_id in self._replica2part2dev_id] + +That code is a little simplistic because it does not account for the +removal of duplicate devices. If a ring has more replicas than devices, +a partition will have more than one replica on a device. + +``array('H')`` is used for memory conservation as there may be millions +of partitions. + +Overload +~~~~~~~~ + +The ring builder tries to keep replicas as far apart as possible while +still respecting device weights. When it can not do both, the overload +factor determines what happens. Each device takes an extra +fraction of its desired partitions to allow for replica dispersion; +after that extra fraction is exhausted, replicas are placed closer +together than optimal. 
+ +The overload factor lets the operator trade off replica +dispersion (durability) against data dispersion (uniform disk usage). + +The default overload factor is 0, so device weights are strictly +followed. + +With an overload factor of 0.1, each device accepts 10% more +partitions than it otherwise would, but only if it needs to maintain +partition dispersion. + +For example, consider a 3-node cluster of machines with equal-size disks; +node A has 12 disks, node B has 12 disks, and node C has +11 disks. The ring has an overload factor of 0.1 (10%). + +Without the overload, some partitions would end up with replicas only +on nodes A and B. However, with the overload, every device can accept +up to 10% more partitions for the sake of dispersion. The +missing disk in C means there is one disk's worth of partitions +to spread across the remaining 11 disks, which gives each +disk in C an extra 9.09% load. Since this is less than the 10% +overload, there is one replica of each partition on each node. + +However, this does mean that the disks in node C have more data +than the disks in nodes A and B. If 80% full is the warning +threshold for the cluster, node C's disks reach 80% full while A +and B's disks are only 72.7% full. + + +Replica counts +~~~~~~~~~~~~~~ + +To support the gradual change in replica counts, a ring can have a real +number of replicas and is not restricted to an integer number of +replicas. + +A fractional replica count is for the whole ring and not for individual +partitions. It indicates the average number of replicas for each +partition. For example, a replica count of 3.2 means that 20 percent of +partitions have four replicas and 80 percent have three replicas. + +The replica count is adjustable. For example: + +.. code-block:: console + + $ swift-ring-builder account.builder set_replicas 4 + $ swift-ring-builder account.builder rebalance + +You must rebalance the replica ring in globally distributed clusters. +Operators of these clusters generally want an equal number of replicas +and regions. Therefore, when an operator adds or removes a region, the +operator adds or removes a replica. Removing unneeded replicas saves on +the cost of disks. + +You can gradually increase the replica count at a rate that does not +adversely affect cluster performance. For example: + +.. code-block:: console + + $ swift-ring-builder object.builder set_replicas 3.01 + $ swift-ring-builder object.builder rebalance + ... + + $ swift-ring-builder object.builder set_replicas 3.02 + $ swift-ring-builder object.builder rebalance + ... + +Changes take effect after the ring is rebalanced. Therefore, if you +intend to change from 3 replicas to 3.01 but you accidentally type +2.01, no data is lost. + +Additionally, the :command:`swift-ring-builder X.builder create` command can +now take a decimal argument for the number of replicas. + +Partition shift value +~~~~~~~~~~~~~~~~~~~~~ + +The partition shift value is known internally to the Ring class as +``_part_shift``. This value is used to shift an MD5 hash to calculate +the partition where the data for that hash should reside. Only the top +four bytes of the hash is used in this process. For example, to compute +the partition for the ``/account/container/object`` path using Python: + +.. code-block:: python + + partition = unpack_from('>I', + md5('/account/container/object').digest())[0] >> + self._part_shift + +For a ring generated with part\_power P, the partition shift value is +``32 - P``. 
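+
+As a self-contained variant of the snippet above (the path and the part
+power of 16 are arbitrary values chosen for illustration):
+
+.. code-block:: python
+
+   from hashlib import md5
+   from struct import unpack_from
+
+   part_power = 16
+   part_shift = 32 - part_power
+
+   # Only the top four bytes of the MD5 hash are used.
+   path = b'/account/container/object'
+   partition = unpack_from('>I', md5(path).digest())[0] >> part_shift
+   print(partition)  # an integer in the range [0, 2**part_power)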
+ +Build the ring +~~~~~~~~~~~~~~ + +The ring builder process includes these high-level steps: + +#. The utility calculates the number of partitions to assign to each + device based on the weight of the device. For example, for a + partition at the power of 20, the ring has 1,048,576 partitions. One + thousand devices of equal weight each want 1,048.576 partitions. The + devices are sorted by the number of partitions they desire and kept + in order throughout the initialization process. + + .. note:: + + Each device is also assigned a random tiebreaker value that is + used when two devices desire the same number of partitions. This + tiebreaker is not stored on disk anywhere, and so two different + rings created with the same parameters will have different + partition assignments. For repeatable partition assignments, + ``RingBuilder.rebalance()`` takes an optional seed value that + seeds the Python pseudo-random number generator. + +#. The ring builder assigns each partition replica to the device that + requires most partitions at that point while keeping it as far away + as possible from other replicas. The ring builder prefers to assign a + replica to a device in a region that does not already have a replica. + If no such region is available, the ring builder searches for a + device in a different zone, or on a different server. If it does not + find one, it looks for a device with no replicas. Finally, if all + options are exhausted, the ring builder assigns the replica to the + device that has the fewest replicas already assigned. + + .. note:: + + The ring builder assigns multiple replicas to one device only if + the ring has fewer devices than it has replicas. + +#. When building a new ring from an old ring, the ring builder + recalculates the desired number of partitions that each device wants. + +#. The ring builder unassigns partitions and gathers these partitions + for reassignment, as follows: + + - The ring builder unassigns any assigned partitions from any + removed devices and adds these partitions to the gathered list. + - The ring builder unassigns any partition replicas that can be + spread out for better durability and adds these partitions to the + gathered list. + - The ring builder unassigns random partitions from any devices that + have more partitions than they need and adds these partitions to + the gathered list. + +#. The ring builder reassigns the gathered partitions to devices by + using a similar method to the one described previously. + +#. When the ring builder reassigns a replica to a partition, the ring + builder records the time of the reassignment. The ring builder uses + this value when it gathers partitions for reassignment so that no + partition is moved twice in a configurable amount of time. The + RingBuilder class knows this configurable amount of time as + ``min_part_hours``. The ring builder ignores this restriction for + replicas of partitions on removed devices because removal of a device + happens on device failure only, and reassignment is the only choice. + +These steps do not always perfectly rebalance a ring due to the random +nature of gathering partitions for reassignment. To help reach a more +balanced ring, the rebalance process is repeated until near perfect +(less than 1 percent off) or when the balance does not improve by at +least 1 percent (indicating we probably cannot get perfect balance due +to wildly imbalanced zones or too many partitions recently moved). 
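+
+As a hedged end-to-end example (the device addresses, names, and weights
+are placeholders), a small object ring with a partition power of 10, three
+replicas, and a ``min_part_hours`` of 1 could be built as follows:
+
+.. code-block:: console
+
+   $ swift-ring-builder object.builder create 10 3 1
+   $ swift-ring-builder object.builder add r1z1-192.0.2.1:6000/sda 100
+   $ swift-ring-builder object.builder add r1z2-192.0.2.2:6000/sda 100
+   $ swift-ring-builder object.builder add r1z3-192.0.2.3:6000/sda 100
+   $ swift-ring-builder object.builder rebalance
+
+The ``rebalance`` step runs the assignment process described above and
+writes ``object.ring.gz``, which is then distributed to every node in the
+cluster.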
diff --git a/doc/source/admin/objectstorage-tenant-specific-image-storage.rst b/doc/source/admin/objectstorage-tenant-specific-image-storage.rst new file mode 100644 index 00000000..69855d8e --- /dev/null +++ b/doc/source/admin/objectstorage-tenant-specific-image-storage.rst @@ -0,0 +1,32 @@ +============================================================== +Configure project-specific image locations with Object Storage +============================================================== + +For some deployers, it is not ideal to store all images in one place to +enable all projects and users to access them. You can configure the Image +service to store image data in project-specific image locations. Then, +only the following projects can use the Image service to access the +created image: + +- The project who owns the image +- Projects that are defined in ``swift_store_admin_tenants`` and that + have admin-level accounts + +**To configure project-specific image locations** + +#. Configure swift as your ``default_store`` in the + ``glance-api.conf`` file. + +#. Set these configuration options in the ``glance-api.conf`` file: + + - swift_store_multi_tenant + Set to ``True`` to enable tenant-specific storage locations. + Default is ``False``. + + - swift_store_admin_tenants + Specify a list of tenant IDs that can grant read and write access to all + Object Storage containers that are created by the Image service. + +With this configuration, images are stored in an Object Storage service +(swift) endpoint that is pulled from the service catalog for the +authenticated user. diff --git a/doc/source/admin/objectstorage-troubleshoot.rst b/doc/source/admin/objectstorage-troubleshoot.rst new file mode 100644 index 00000000..2df93080 --- /dev/null +++ b/doc/source/admin/objectstorage-troubleshoot.rst @@ -0,0 +1,208 @@ +=========================== +Troubleshoot Object Storage +=========================== + +For Object Storage, everything is logged in ``/var/log/syslog`` (or +``messages`` on some distros). Several settings enable further +customization of logging, such as ``log_name``, ``log_facility``, and +``log_level``, within the object server configuration files. + +Drive failure +~~~~~~~~~~~~~ + +Problem +------- + +Drive failure can prevent Object Storage performing replication. + +Solution +-------- + +In the event that a drive has failed, the first step is to make sure the +drive is unmounted. This will make it easier for Object Storage to work +around the failure until it has been resolved. If the drive is going to +be replaced immediately, then it is just best to replace the drive, +format it, remount it, and let replication fill it up. + +If you cannot replace the drive immediately, then it is best to leave it +unmounted, and remove the drive from the ring. This will allow all the +replicas that were on that drive to be replicated elsewhere until the +drive is replaced. Once the drive is replaced, it can be re-added to the +ring. + +You can look at error messages in the ``/var/log/kern.log`` file for +hints of drive failure. + +Server failure +~~~~~~~~~~~~~~ + +Problem +------- + +The server is potentially offline, and may have failed, or require a +reboot. + +Solution +-------- + +If a server is having hardware issues, it is a good idea to make sure +the Object Storage services are not running. This will allow Object +Storage to work around the failure while you troubleshoot. 
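+
+For example, assuming the standard ``swift-init`` helper is installed on
+the node, all Object Storage services on the affected server can be
+stopped with:
+
+.. code-block:: console
+
+   # swift-init all stop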
If the server just needs a reboot, or a small amount of work that should
only last a couple of hours, then it is probably best to let Object
Storage work around the failure and get the machine fixed and back
online. When the machine comes back online, replication will make sure
that anything that is missing during the downtime will get updated.

If the server has more serious issues, then it is probably best to
remove all of the server's devices from the ring. Once the server has
been repaired and is back online, the server's devices can be added back
into the ring. It is important that the devices are reformatted before
putting them back into the ring, as they are likely to be responsible
for a different set of partitions than before.

Detect failed drives
~~~~~~~~~~~~~~~~~~~~

Problem
-------

When drives fail, it can be difficult to detect that a drive has failed
and to determine the details of the failure.

Solution
--------

It has been our experience that when a drive is about to fail, error
messages appear in the ``/var/log/kern.log`` file. There is a script called
``swift-drive-audit`` that can be run via cron to watch for bad drives. If
errors are detected, it will unmount the bad drive, so that Object
Storage can work around it. The script takes a configuration file with
the following settings:

.. list-table:: **Description of configuration options for [drive-audit] in drive-audit.conf**
   :header-rows: 1

   * - Configuration option = Default value
     - Description
   * - ``device_dir = /srv/node``
     - Directory devices are mounted under
   * - ``error_limit = 1``
     - Number of errors to find before a device is unmounted
   * - ``log_address = /dev/log``
     - Location where syslog sends the logs to
   * - ``log_facility = LOG_LOCAL0``
     - Syslog log facility
   * - ``log_file_pattern = /var/log/kern.*[!.][!g][!z]``
     - Location of the log file, with a globbing pattern, that is
       checked for device errors in order to locate device blocks with
       errors
   * - ``log_level = INFO``
     - Logging level
   * - ``log_max_line_length = 0``
     - Caps the length of log lines to the value given; no limit if set to 0,
       the default.
   * - ``log_to_console = False``
     - No help text available for this option.
   * - ``minutes = 60``
     - Number of minutes to look back in ``/var/log/kern.log``
   * - ``recon_cache_path = /var/cache/swift``
     - Directory where stats for a few items will be stored
   * - ``regex_pattern_1 = \berror\b.*\b(dm-[0-9]{1,2}\d?)\b``
     - No help text available for this option.
   * - ``unmount_failed_device = True``
     - No help text available for this option.

.. warning::

   This script has only been tested on Ubuntu 10.04; use with caution on
   other operating systems in production.

Emergency recovery of ring builder files
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Problem
-------

An emergency, such as the loss of the ring builder files without a
usable backup, might prevent you from restoring the cluster to
operational status.

Solution
--------

You should always keep a backup of swift ring builder files. However, if
an emergency occurs, this procedure may assist in returning your cluster
to an operational state.

Using existing swift tools, there is no way to recover a builder file
from a ``ring.gz`` file. However, if you have knowledge of Python, it
is possible to construct a builder file that is pretty close to the one
you have lost.

.. warning::

   This procedure is a last resort for emergency circumstances. It
   requires knowledge of the swift python code and may not succeed.
#. Load the ring and a new ringbuilder object in a Python REPL:

   .. code-block:: python

      >>> from swift.common.ring import RingData, RingBuilder
      >>> ring = RingData.load('/path/to/account.ring.gz')

#. Start copying the data we have in the ring into the builder:

   .. code-block:: python

      >>> import math
      >>> # one array per replica, mapping partition -> device id
      >>> partitions = len(ring._replica2part2dev_id[0])
      >>> replicas = len(ring._replica2part2dev_id)

      >>> builder = RingBuilder(int(math.log(partitions, 2)), replicas, 1)
      >>> builder.devs = ring.devs
      >>> builder._replica2part2dev = ring._replica2part2dev_id
      >>> builder._last_part_moves_epoch = 0
      >>> from array import array
      >>> builder._last_part_moves = array('B', (0 for _ in range(partitions)))
      >>> builder._set_parts_wanted()
      >>> for d in builder._iter_devs():
              d['parts'] = 0
      >>> for p2d in builder._replica2part2dev:
              for dev_id in p2d:
                  builder.devs[dev_id]['parts'] += 1

   This is the extent of the recoverable fields.

#. For ``min_part_hours``, you either have to remember the value you
   used or just make up a new one:

   .. code-block:: python

      >>> builder.change_min_part_hours(24)  # or whatever you want it to be

#. Validate the builder. If this raises an exception, check your
   previous code:

   .. code-block:: python

      >>> builder.validate()

#. After it validates, save the builder and create a new ``account.builder``:

   .. code-block:: python

      >>> import pickle
      >>> pickle.dump(builder.to_dict(), open('account.builder', 'wb'), protocol=2)
      >>> exit()

#. You should now have a file called ``account.builder`` in the current
   working directory. Run
   :command:`swift-ring-builder account.builder write_ring` and compare the new
   ``account.ring.gz`` to the ``account.ring.gz`` that you started
   from. They are probably not byte-for-byte identical, but if you load them
   in a REPL and their ``_replica2part2dev_id`` and ``devs`` attributes are
   the same (or nearly so), then you are in good shape.

#. Repeat the procedure for ``container.ring.gz`` and
   ``object.ring.gz``, and you might get usable builder files.

diff --git a/doc/source/index.rst b/doc/source/index.rst index 0b206f5b..ecc79c25 100644 --- a/doc/source/index.rst +++ b/doc/source/index.rst @@ -93,6 +93,7 @@ Administrator Documentation
   replication_network
   logs
   ops_runbook/index
+  admin/index

Object Storage v1 REST API Documentation
========================================