
=================
Ring File Formats
=================

The ring is the most important data structure in Swift. How this data structure
has been serialized to disk has changed over the years.

Initially, ring files contained three key pieces of information:

* the part_power value (often stored as ``part_shift := 32 - part_power``)

  * which determines how many partitions are in the ring,

* the device list

  * which includes all the disks participating in the ring, and

* the replica-to-part-to-device table

  * which has all ``replica_count * (2 ** part_power)`` partition assignments.

However, the need to extend the serialization format with additional data
structures led to the creation of a new ring v2 format.

Ring files have always been gzipped when serialized, though the inner,
raw format has evolved over the years.

Ring v0
-------

Initially, rings were simply pickle dumps of the RingData object. `With
Swift 1.3.0 <https://opendev.org/openstack/swift/commit/fc6391ea>`__, this
changed to pickling a pure-stdlib data structure, but the core concept
was the same.

.. note::

    Swift 2.36.0 dropped support for v0 rings.

Ring v1
-------

Pickle presented some problems, however. While `there are security
concerns <https://docs.python.org/3/library/pickle.html>`__ around unpickling
untrusted data, security boundaries are generally drawn such that rings are
assumed to be trusted. Ultimately, what pushed us to a new format were
`performance considerations <https://bugs.launchpad.net/swift/+bug/1031954>`__.

Starting in `Swift 1.7.0 <https://opendev.org/openstack/swift/commit/f8ce43a2>`__,
Swift began using a new format (while still being willing to read the old one).
The new format starts with some magic so we may identify it as such::

    +---------------+-------+
    |'R' '1' 'N' 'G'| <vrs> |
    +---------------+-------+

where ``<vrs>`` is a network-order two-byte version number (which is always 1).
After that, a JSON object is serialized as::

    +---------------+-------...---+
    | <data-length> | <data ... > |
    +---------------+-------...---+

where ``<data-length>`` is the network-order four-byte length (in bytes) of
``<data>``, which is the ASCII-encoded JSON-serialized object. This object
has at minimum three keys:

* ``devs`` for the device list
* ``part_shift`` (i.e., ``32 - part_power``)
* ``replica_count`` for the integer number of part-to-device rows to read

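As a rough illustration of the layout described so far, the sketch below
parses the magic, version, and JSON metadata from an already-decompressed
stream. The helper name ``read_v1_header`` is hypothetical and is not part of
Swift's API; the example data is built in memory rather than read from a real
ring.

.. code:: python

    import io
    import json
    import struct


    def read_v1_header(fp):
        """Read the magic, version and JSON metadata from a decompressed stream."""
        magic = fp.read(4)
        if magic != b'R1NG':
            raise ValueError('bad magic: %r' % (magic,))
        (version,) = struct.unpack('!H', fp.read(2))  # network-order, two bytes
        if version != 1:
            raise ValueError('unexpected version: %d' % version)
        (length,) = struct.unpack('!I', fp.read(4))  # network-order, four bytes
        return json.loads(fp.read(length).decode('ascii'))


    # Build a tiny in-memory example instead of opening a ring on disk.
    payload = json.dumps(
        {'devs': [], 'part_shift': 30, 'replica_count': 3}).encode('ascii')
    stream = io.BytesIO(b'R1NG' + struct.pack('!HI', 1, len(payload)) + payload)
    metadata = read_v1_header(stream)

For a ring on disk, the stream would instead come from something like
``gzip.open(path, 'rb')``, after which the replica-to-part-to-device table
follows at the current position.
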
The replica-to-part-to-device table then follows::

    +-------+-------+...+-------+-------+
    | <dev> | <dev> |...| <dev> | <dev> |
    +-------+-------+...+-------+-------+
    | <dev> | <dev> |...| <dev> | <dev> |
    +-------+-------+...+-------+-------+
    |              ...                  |
    +-------+-------+...+-------+-------+
    | <dev> | <dev> |...|
    +-------+-------+...+

Each ``<dev>`` is a host-order two-byte index into the ``devs`` list. Every row
except the last has exactly ``2 ** part_power`` entries; the last row may
have the same or fewer.

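The row layout can be sketched as follows. This is a minimal illustration,
not Swift's loader: ``read_v1_rows`` is a hypothetical helper, it assumes the
whole table is already in memory, and its ``byteorder`` argument stands in for
the metadata key of the same name (defaulting to the reader's host order).

.. code:: python

    import array
    import sys


    def read_v1_rows(table, part_power, replica_count, byteorder=sys.byteorder):
        """Split a v1 assignment table into one array of device IDs per row."""
        row_bytes = 2 * 2 ** part_power  # two bytes per <dev>
        rows = []
        for i in range(replica_count):
            row = array.array('H')  # unsigned two-byte integers, host order
            # The final slice may be shorter than row_bytes (fractional replica).
            row.frombytes(table[i * row_bytes:(i + 1) * row_bytes])
            if byteorder != sys.byteorder:
                row.byteswap()  # table was written on a machine of other order
            rows.append(row)
        return rows

With ``part_power=2`` and ten device IDs, this yields two full rows of four
entries and a final row of two.
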
The metadata object has proven quite versatile: new keys have been added
to provide additional information while remaining backwards-compatible.
In order, the following new fields have been added:

* ``byteorder`` specifies whether the host-order for the
  replica-to-part-to-device table is "big" or "little" endian. Added in
  `Swift 2.12.0 <https://opendev.org/openstack/swift/commit/1ec6e2bb>`__,
  this allows rings written on big-endian machines to be read on
  little-endian machines and vice-versa.
* ``next_part_power`` indicates whether a partition-power increase is in
  progress. Added in `Swift 2.15.0 <https://opendev.org/openstack/swift/commit/e1140666>`__,
  this will have one of two values, if present: the ring's current
  ``part_power``, indicating that there may be hardlinks to clean up,
  or ``part_power + 1``, indicating that hardlinks may need to be created.
  See :ref:`the documentation<modify_part_power>`
  for more information.
* ``version`` specifies the version number of the ring-builder that was used
  to write this ring. Added in `Swift 2.24.0 <https://opendev.org/openstack/swift/commit/6853616a>`__,
  this allows rings from different machines to be compared to determine
  which is newer.

Ring v2
-------

The way that v1 rings dealt with fractional replicas made it impossible
to reliably serialize additional large data structures after the
replica-to-part-to-device table. The v2 format has been designed to be
extensible.

The new format starts with magic similar to v1::

    +---------------+-------+
    |'R' '1' 'N' 'G'| <vrs> |
    +---------------+-------+

where ``<vrs>`` is again a network-order two-byte version number (which is now 2).
By bumping the version number, we ensure that old versions of Swift refuse to
read the ring, rather than misinterpret the content.

After that, a series of BLOBs are serialized, each as::

    +-------------------------------+-------...---+
    | <data-length>                 | <data ... > |
    +-------------------------------+-------...---+

where ``<data-length>`` is the network-order eight-byte length (in bytes) of
``<data>``. Each BLOB is preceded by a ``Z_FULL_FLUSH`` to allow it to be
decompressed without reading the whole file.

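To show why the flush matters, here is a minimal sketch using Python's
``zlib`` module. It is not Swift's ``RingWriter``; the helper names
``write_blobs`` and ``read_blob`` are hypothetical, and the magic and tail are
omitted. Each BLOB is written after a ``Z_FULL_FLUSH``, and the recorded
compressed offset is later used to start decompression mid-file.

.. code:: python

    import struct
    import zlib


    def write_blobs(blobs):
        """Gzip a list of byte strings, returning (data, compressed_offsets)."""
        compressor = zlib.compressobj(wbits=31)  # wbits=31 selects gzip framing
        out = bytearray()
        offsets = []
        for blob in blobs:
            # Z_FULL_FLUSH byte-aligns the output and resets the compression
            # state, so decompression can begin at this point.
            out += compressor.flush(zlib.Z_FULL_FLUSH)
            offsets.append(len(out))
            out += compressor.compress(struct.pack('!Q', len(blob)) + blob)
        out += compressor.flush(zlib.Z_FINISH)
        return bytes(out), offsets


    def read_blob(data, offset):
        """Decompress a single BLOB starting at a flush point."""
        inflater = zlib.decompressobj(wbits=-15)  # raw deflate, no gzip header
        # For brevity this inflates everything after ``offset``; a real reader
        # would stop once it had the bytes it needed.
        raw = inflater.decompress(data[offset:])
        (length,) = struct.unpack('!Q', raw[:8])
        return raw[8:8 + length]

Because every offset sits on a flush point, ``read_blob`` never needs the
bytes that precede it, while the output as a whole remains an ordinary gzip
stream.
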
The order of the BLOBs isn't important, although they do tend to be written
in the order Swift will read them while loading. This reduces the disk seeks
necessary to load.

The final BLOB is an index: a JSON object mapping named sections to an array
of offsets within the file, like

.. code::

    {
      section: [
        compressed start,
        uncompressed start,
        compressed end,
        uncompressed end,
        checksum method,
        checksum value
      ],
      ...
    }

Section names may be arbitrary strings, but the "swift/" prefix is reserved
for upstream use. The start/end values mark the beginning and ending of the
section's BLOB. Note that some end values may be ``null`` if they were not
known when the index was written -- in particular, this *will* be true for
the index itself. The checksum method should be one of ``"md5"``, ``"sha1"``,
``"sha256"``, or ``"sha512"``; other values will be ignored in anticipation
of a need to support further algorithms. The checksum value will be the
hex-encoded digest of the uncompressed section's bytes. Like end values,
checksum data may be ``null`` if not known when the index is written.

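A reader might apply these rules roughly as follows. This is only a sketch:
``verify_section`` is a hypothetical helper, and exactly which bytes the digest
covers (for example, whether the length prefix is included) is left to the
caller here as an assumption.

.. code:: python

    import hashlib

    SUPPORTED_METHODS = {'md5', 'sha1', 'sha256', 'sha512'}


    def verify_section(index_entry, section_bytes):
        """Return True or False for a checked section, None if unchecked."""
        method, expected = index_entry[4], index_entry[5]
        if method not in SUPPORTED_METHODS or expected is None:
            # Unknown method, or checksum not known when the index was written.
            return None
        return hashlib.new(method, section_bytes).hexdigest() == expected

Returning ``None`` rather than ``False`` keeps "could not check" distinct from
"checked and mismatched".
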
Finally, a "tail" is written:

* the gzip stream is flushed with another ``Z_FULL_FLUSH``,
* the stream is switched to uncompressed,
* the eight-byte offset of the uncompressed start of the index is written,
* the gzip stream is flushed with another ``Z_FULL_FLUSH``,
* the eight-byte offset of the compressed start of the index is written,
* the gzip stream is flushed with another ``Z_FULL_FLUSH``, and
* the gzip stream is closed; this involves:

  * flushing the underlying deflate stream with ``Z_FINISH``
  * writing ``CRC32`` (of the full uncompressed data)
  * writing ``ISIZE`` (the length of the full uncompressed data ``mod 2 ** 32``)

By switching to uncompressed, we can know exactly how many bytes will be
written in the tail, so that when reading we can quickly seek to and read the
index offset, seek to the index start, and read the index. From there we
can do similar things for any other section.

* Seek to the end of the file
* Go back 31 bytes in the underlying file; this should leave us at the start of
  the deflate block containing the offset for the compressed start
* Decompress 8 bytes from the deflate stream to get the location of the
  compressed start of the index BLOB
* Seek to that location
* Read/decompress the size of the index BLOB
* Read/decompress the JSON-serialized index.

.. note:: These 31 bytes comprise the deflate block containing the 8-byte
    location, a ``Z_FULL_FLUSH`` block, the ``Z_FINISH`` block, and the
    ``CRC32`` and ``ISIZE``. For more information, see `RFC 1951`_ (for the
    deflate stream) and `RFC 1952`_ (for the gzip format).

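The first three reading steps can be sketched as below. ``read_index_offset``
is a hypothetical helper, not Swift's API. The sketch assumes the offset is
stored in network order like other v2 data, and that the three deflate blocks
in the tail are byte-aligned stored blocks of 13, 5, and 5 bytes (a five-byte
stored-block header each, per RFC 1951).

.. code:: python

    import io
    import struct
    import zlib

    TAIL_LENGTH = 31  # 13 + 5 + 5 bytes of deflate blocks, then CRC32 and ISIZE


    def read_index_offset(fp):
        """Return the compressed start of the index BLOB recorded in the tail."""
        fp.seek(-TAIL_LENGTH, io.SEEK_END)
        tail = fp.read(TAIL_LENGTH)
        inflater = zlib.decompressobj(wbits=-15)  # raw deflate, no gzip header
        # Inflating stops at the final block; CRC32 and ISIZE are left unread.
        (offset,) = struct.unpack('!Q', inflater.decompress(tail)[:8])
        return offset

The caller would then seek to the returned offset and inflate the index BLOB
from there, starting with its eight-byte length.
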
The sections currently defined upstream are as follows:

* ``swift/index`` - The swift index
* ``swift/ring/metadata`` - Ring metadata serialized as JSON
* ``swift/ring/devices`` - The devices data structure, serialized as JSON.

  * This has been separated from the ring metadata structure in v1 as it
    gets large

* ``swift/ring/assignments`` - The ring replica2part2dev_id data structure

.. note::
    Third-parties may find it useful to add their own sections; however,
    the ``swift/`` prefix is reserved for future upstream enhancements.

swift/ring/metadata
~~~~~~~~~~~~~~~~~~~
This BLOB is an ASCII-encoded JSON object full of metadata, similar
to v1 rings. It has the following required keys:

* ``part_shift``
* ``dev_id_bytes`` specifies the number of bytes used for each ``<dev>`` in the
  replica-to-part-to-device table; will be one of 2, 4, or 8

Additionally, there are several optional keys which may be present:

* ``next_part_power``
* ``version``

Notice that two keys are no longer present: ``replica_count`` is no longer
needed as the size of the replica-to-part-to-device table is explicit, and
``byteorder`` is not needed as all data in v2 rings should be written using
network-order.

swift/ring/devices
~~~~~~~~~~~~~~~~~~
This BLOB contains a list of Swift device dictionaries. It was separated out
from the metadata BLOB because it can become a large structure in its own right.

swift/ring/assignments
~~~~~~~~~~~~~~~~~~~~~~
This BLOB is the replica-to-part-to-device table. Its length will be
``replicas * (2 ** part_power) * dev_id_bytes``, where ``replicas`` is the exact
(potentially fractional) replica count for the ring. Unlike in v1, each
``<dev>`` is written using network-order.

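A minimal sketch of decoding such a payload follows. ``parse_assignments`` is
a hypothetical helper; it assumes the eight-byte length prefix has already
been stripped and that rows are laid out one replica after another, as in the
v1 table.

.. code:: python

    import struct

    # struct codes for unsigned integers of each permitted width
    DEV_ID_FORMATS = {2: 'H', 4: 'I', 8: 'Q'}


    def parse_assignments(payload, part_power, dev_id_bytes):
        """Split an assignments payload into per-replica rows of device IDs."""
        code = DEV_ID_FORMATS[dev_id_bytes]
        count = len(payload) // dev_id_bytes
        # '!' selects network (big-endian) order with no padding.
        dev_ids = struct.unpack('!%d%s' % (count, code), payload)
        row_length = 2 ** part_power
        return [list(dev_ids[i:i + row_length])
                for i in range(0, count, row_length)]

For a ring with 2.5 replicas and ``part_power=2``, the payload holds ten IDs:
two full rows of four and a final row of two.
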
Note that this is why we increased the size of ``<data-length>`` as compared to
the v1 format -- otherwise, we may not be able to represent rings with both
high ``replica_count`` and high ``part_power``.

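As a worked example of that limit, using the length formula given for the
assignments BLOB, the snippet below checks one set of illustrative values
(three replicas, ``part_power`` of 30, two-byte device IDs) against the
largest lengths that four and eight bytes can express. The values are chosen
only to show the arithmetic.

.. code:: python

    FOUR_BYTE_MAX = 2 ** 32 - 1   # largest length a four-byte field can hold
    EIGHT_BYTE_MAX = 2 ** 64 - 1  # largest length an eight-byte field can hold


    def assignments_size(replicas, part_power, dev_id_bytes):
        """Size in bytes of the replica-to-part-to-device table."""
        return int(replicas * 2 ** part_power * dev_id_bytes)


    size = assignments_size(replicas=3, part_power=30, dev_id_bytes=2)
    fits_in_four_bytes = size <= FOUR_BYTE_MAX
    fits_in_eight_bytes = size <= EIGHT_BYTE_MAX

Here ``size`` is 6 GiB, which exceeds the four-byte maximum of just under
4 GiB but fits comfortably in eight bytes.
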
.. _RFC 1952: https://rfc-editor.org/rfc/rfc1952
.. _RFC 1951: https://rfc-editor.org/rfc/rfc1951