[arch-guide] Adding arch guidance to storage design

Update information about object storage replication.

Change-Id: Ic1bdaa50569711b9e7c8fb0b8fe8c3a2160c027e
Implements: blueprint arch-design-pike
Partial-Bug: #1676238
Ben Silverman 2017-05-06 11:03:06 -04:00 committed by Alexandra Settle
parent d93f860955
commit 17cec8d51e


@ -182,8 +182,82 @@ Cost
incompatibility and interoperability issues.
Performance
The latency of storage I/O requests indicates performance. Performance
requirements affect which solution you choose.
Performance of block based storage is typically measured in the maximum read
and write operations to non-contiguous storage locations per second. This
measurement typically applies to SAN, hard drives, and solid state drives.
While IOPS can be broadly measured, it is not an official benchmark, although
many vendors use it to communicate performance levels. Since
there are no real standards for measuring IOPS, vendor test results may vary,
sometimes wildly. However, along with transfer rate, which measures the speed
at which data can be transferred to contiguous storage locations, IOPS can be
used in a performance evaluation. Typically, transfer rate is expressed in
bytes per second, while IOPS is expressed as an integer.
To calculate IOPS for a single drive you could use:
IOPS = 1 / (AverageLatency + AverageSeekTime)
For example:
Average Latency for Single Disk = 2.99ms or .00299 seconds
Average Seek Time for Single Disk = 4.7ms or .0047 seconds
IOPS = 1/(.00299 + .0047)
IOPS = 130
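The single-drive formula above can be sketched in Python. The function name ``single_disk_iops`` and its parameter names are illustrative assumptions, not part of any OpenStack tool. Both inputs are expected in seconds.

```python
def single_disk_iops(avg_latency_s: float, avg_seek_s: float) -> float:
    """Estimate IOPS for one drive as 1 / (latency + seek time), in seconds."""
    service_time = avg_latency_s + avg_seek_s
    if service_time <= 0:
        raise ValueError("latency plus seek time must be positive")
    return 1.0 / service_time


# Values from the example above: 2.99 ms latency, 4.7 ms seek time.
iops = single_disk_iops(0.00299, 0.0047)
print(round(iops))  # 130
```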
To calculate maximum IOPS for a disk array:
Maximum Read IOPS:
To calculate the maximum read IOPS for a disk array, multiply the number of
disks in the array by the maximum read IOPS per disk.
maxReadIOPS = nDisks * diskMaxIOPS
For example, an array of fifteen 10K spinning disks would be calculated as follows:
maxReadIOPS = 15 * 130
maxReadIOPS = 1950
Maximum write IOPS per array:
Determining the maximum *write* IOPS is a little different because most
administrators configure disk replication using RAID. Since the RAID
controller requires IOPS itself, there is a write penalty. The severity of
the write penalty is determined by the type of RAID used.
===========  ==========
RAID Type    Penalty
===========  ==========
1            2
5            4
10           2
===========  ==========
.. note::
RAID 5 has the worst penalty because it requires the most cross-disk writes.
Therefore, using the above examples, a 15 disk array using RAID 5 is
capable of 1950 read IOPS. However, we need to apply the penalty when
determining the *write* IOPS:
.. code-block:: none
maxWriteIOPS = 1950 / 4
maxWriteIOPS = 487.5
A RAID 5 array delivers only 25% of its read IOPS for writes, while a RAID
1 array in this case would produce a maximum of 975 write IOPS.
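The array calculations above can be sketched as follows. The names ``RAID_WRITE_PENALTY``, ``max_read_iops``, and ``max_write_iops`` are illustrative assumptions. The penalty values are copied from the table above, and the per-disk figure of 130 IOPS comes from the single-disk example.

```python
# Write penalties from the table above, keyed by RAID level.
RAID_WRITE_PENALTY = {1: 2, 5: 4, 10: 2}


def max_read_iops(n_disks: int, disk_max_iops: float) -> float:
    """Maximum read IOPS: number of disks times per-disk IOPS."""
    return n_disks * disk_max_iops


def max_write_iops(n_disks: int, disk_max_iops: float, raid_level: int) -> float:
    """Maximum write IOPS: read IOPS divided by the RAID write penalty."""
    penalty = RAID_WRITE_PENALTY[raid_level]
    return max_read_iops(n_disks, disk_max_iops) / penalty


print(max_read_iops(15, 130))      # 1950
print(max_write_iops(15, 130, 5))  # 487.5
print(max_write_iops(15, 130, 1))  # 975.0
```

This is only an upper bound under the simple model in this section. It ignores caching, controller behavior, and mixed read/write workloads.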
What about SSD? DRAM SSD?
In an HDD, data transfer is sequential. The actual read/write head "seeks" a
point in the hard drive to execute the operation. Seek time is significant.
Transfer rate can also be influenced by file system fragmentation and the
layout. Finally, the mechanical nature of hard disks also has certain
performance limitations.
In an SSD, data transfer is *not* sequential; it is random, so it is faster.
Read performance is consistent because the physical location of data is
irrelevant: SSDs have no read/write heads and thus no delays due to
head motion (seeking).
.. note::
Some basic benchmarks for small read/writes:
- **HDDs**: Small reads 175 IOPS, Small writes 280 IOPS
- **Flash SSDs**: Small reads 1075 IOPS (6x), Small writes 21 IOPS (0.1x)
- **DRAM SSDs**: Small reads 4091 IOPS (23x), Small writes 4184 IOPS (14x)
Scalability
Scalability, along with expandability, is a major consideration in a
@ -207,9 +281,9 @@ Configure Block Storage resource nodes with advanced RAID controllers
and high-performance disks to provide fault tolerance at the hardware
level.
Deploy high performing storage solutions such as SSD drives or
flash storage systems for applications requiring additional performance out
of Block Storage devices.
We recommend deploying high performing storage solutions such as SSD
drives or flash storage systems for applications requiring additional
performance out of Block Storage devices.
In environments that place substantial demands on Block Storage, we
recommend using multiple storage pools. In this case, each pool of
@ -418,9 +492,51 @@ nodes and proxy servers should make use of a design which is scalable.
Redundancy
----------
.. TODO
Replication
-----------
.. TODO
Replicas in Object Storage function independently, and clients only
require a majority of nodes to respond to a request in order for an
operation to be considered successful. Thus, transient failures like
network partitions can quickly cause replicas to diverge.
These differences are eventually reconciled by
asynchronous, peer-to-peer replicator processes. The replicator processes
traverse their local filesystems, concurrently performing operations in a
manner that balances load across physical disks.
Replication uses a push model, with records and files generally only being
copied from local to remote replicas. This is important because data on the
node may not belong there (as in the case of handoffs and ring changes), and a
replicator cannot know what data exists elsewhere in the cluster that it
should pull in. It is the duty of any node that contains data to ensure that
data gets to where it belongs. Replica placement is handled by the ring.
Every deleted record or file in the system is marked by a tombstone, so that
deletions can be replicated alongside creations. The replication process cleans
up tombstones after a time period known as the consistency window. The
consistency window encompasses replication duration and the length of time a
transient failure can remove a node from the cluster. Tombstone cleanup must be
tied to replication to reach replica convergence.
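The push model and tombstone handling described above can be illustrated with a toy in-memory model. This is a minimal sketch, not Swift's implementation: the ``Record`` type and the ``delete``, ``push``, and ``reclaim`` functions are hypothetical names, and timestamps are plain floats.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class Record:
    timestamp: float
    data: Optional[bytes]  # None marks a tombstone

    @property
    def is_tombstone(self) -> bool:
        return self.data is None


Replica = Dict[str, Record]


def delete(replica: Replica, name: str, now: float) -> None:
    """Record a deletion as a tombstone instead of dropping the key."""
    replica[name] = Record(timestamp=now, data=None)


def push(local: Replica, remote: Replica) -> None:
    """Push model: copy each local record that is newer than the remote's."""
    for name, record in local.items():
        current = remote.get(name)
        if current is None or record.timestamp > current.timestamp:
            remote[name] = record


def reclaim(replica: Replica, now: float, consistency_window: float) -> None:
    """Drop tombstones older than the consistency window."""
    expired = [n for n, r in replica.items()
               if r.is_tombstone and now - r.timestamp > consistency_window]
    for name in expired:
        del replica[name]


a: Replica = {"obj": Record(1.0, b"v1")}
b: Replica = {"obj": Record(1.0, b"v1")}
delete(a, "obj", now=2.0)
push(a, b)  # the deletion reaches b as a tombstone
assert b["obj"].is_tombstone
reclaim(a, now=100.0, consistency_window=50.0)
reclaim(b, now=100.0, consistency_window=50.0)
assert "obj" not in a and "obj" not in b
```

If ``reclaim`` ran on ``a`` before ``push`` had delivered the tombstone, ``b`` would keep the stale object indefinitely, which is why cleanup has to wait out the consistency window.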
If a replicator detects that a remote drive has failed, the replicator uses the
``get_more_nodes`` interface for the ring to choose an alternative node with
which to synchronize. The replicator can maintain desired levels of replication
in the face of disk failures, though some replicas may not be in an immediately
usable location.
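The fallback behavior can be sketched with a toy ring. ``ToyRing`` is a hypothetical stand-in, not Swift's actual ring class. Only the method name ``get_more_nodes`` is taken from the text above, while ``get_part_nodes``, ``sync_targets``, and the placement logic are simplifying assumptions. The sketch also treats a failed drive as a failed node for brevity.

```python
from typing import Iterator, List, Set


class ToyRing:
    """Hypothetical stand-in for a ring; not Swift's actual Ring class."""

    def __init__(self, nodes: List[str], replicas: int = 3) -> None:
        self.nodes = nodes
        self.replicas = replicas

    def get_part_nodes(self, part: int) -> List[str]:
        """Return the primary nodes for a partition (toy placement)."""
        start = part % len(self.nodes)
        return [self.nodes[(start + i) % len(self.nodes)]
                for i in range(self.replicas)]

    def get_more_nodes(self, part: int) -> Iterator[str]:
        """Yield alternative (handoff) nodes outside the primary set."""
        primaries = set(self.get_part_nodes(part))
        for node in self.nodes:
            if node not in primaries:
                yield node


def sync_targets(ring: ToyRing, part: int, failed: Set[str]) -> List[str]:
    """Replace each failed primary with the next healthy handoff node."""
    handoffs = (n for n in ring.get_more_nodes(part) if n not in failed)
    targets = []
    for node in ring.get_part_nodes(part):
        if node in failed:
            node = next(handoffs, None)
            if node is None:
                continue  # no healthy handoff left
        targets.append(node)
    return targets


ring = ToyRing(["n0", "n1", "n2", "n3", "n4"])
print(sync_targets(ring, part=0, failed={"n1"}))  # ['n0', 'n3', 'n2']
```

The replica pushed to ``n3`` keeps the replica count up, but it sits on a handoff node rather than a primary, which matches the point that some replicas may not be in an immediately usable location.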
.. note::
The replicator does not maintain desired levels of replication when other
failures occur, such as entire node failures, because most failures are
transient.
Replication is an area of active development, and implementation details
are likely to change over time.
There are two major classes of replicator: the db replicator, which replicates
accounts and containers, and the object replicator, which replicates object
data.
For more information, please see the `Swift replication page <https://docs.openstack.org/developer/swift/overview_replication.html>`_.