
Tim Burke ae062f8b09 ring: Introduce a v2 ring format

There's a bunch of moving pieces here:

- Add a new RingWriter class.

  Stick it in a new swift.common.ring.io module. You *can* use it like
  the old gzip file, but you can also define named sections which can
  be referenced later on read. Section names may be arbitrary strings,
  but the "swift/" prefix is reserved for upstream use. Sections must
  contain a single length-value encoded BLOB. If sections are used, an
  additional BLOB is written at the end containing a JSON section-index,
  followed by an uncompressed offset for the index.

  Move RingReader to ring/io.py, too.

- Clean up some ring metadata handling:

  - Drop MD5 tracking in RingReader. It was brittle at best anyway, and
    nothing uses it. YAGNI

  - Fix size/raw_size attributes when loading only metadata.

- Add the ability to seek within RingReaders, though you need to know
  what you're doing and only seek to flush points.

- Let RingBuilder objects change how wide their replica2part2dev_id
  arrays are. Add a dev_id_bytes key to serialized ring metadata.

  dev_id_bytes may be either 2 or 4, but 4 requires v2 rings. We
  considered allowing dev_id_bytes of 1, but dropped it as unnecessary
  complexity for a niche use case.

- swift-ring-builder version subcommand added, which takes a ring. This
  lets operators see the serialization format of a ring on disk:

  $ swift-ring-builder object.ring.gz version
  object.ring.gz: Serialization version: 2 (2-byte IDs), build version: 54

Signed-off-by: Tim Burke <tim.burke@gmail.com>
Change-Id: Ia0ac4ea2006d8965d7fdb6659d355c77386adb70

2025-07-21 11:37:15 -07:00

9.8 KiB


Ring File Formats

The ring is the most important data structure in Swift. How this data structure is serialized to disk has changed over the years.

Initially, ring files contained three key pieces of information:

- the part_power value (often stored as part_shift := 32 - part_power), which determines how many partitions are in the ring,
- the device list, which includes all the disks participating in the ring, and
- the replica-to-part-to-device table, which has all replica_count * (2 ** part_power) partition assignments.

However, the need to add more data structures to the serialized ring led to the creation of a new, extensible ring v2 format.

Ring files have always been gzipped when serialized, though the inner, raw format has evolved over the years.

Ring v0

Initially, rings were simply pickle dumps of the RingData object. With Swift 1.3.0, this changed to pickling a pure-stdlib data structure, but the core concept was the same.

Swift 2.36.0 dropped support for v0 rings.

Ring v1

Pickle presented some problems, however. While there are security concerns around unpickling untrusted data, security boundaries are generally drawn such that rings are assumed to be trusted. Ultimately, what pushed us to a new format were performance considerations.

Starting in Swift 1.7.0, Swift began using a new format (while still being willing to read the old one). The new format starts with some magic so we may identify it as such:

+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+

where <vrs> is a network-order two-byte version number (which is always 1). After that, a JSON object is serialized as:

+---------------+-------...---+
| <data-length> | <data ... > |
+---------------+-------...---+

where <data-length> is the network-order four-byte length (in bytes) of <data>, which is the ASCII-encoded JSON-serialized object. This object has at minimum three keys:

devs for the device list
part_shift (i.e., 32 - part_power)
replica_count for the integer number of part-to-device rows to read
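
As a rough illustration, the header described above could be parsed as follows. This is a minimal sketch that operates on already-decompressed bytes; the function name parse_v1_header is hypothetical and this is not Swift's actual reader.

```python
import json
import struct


def parse_v1_header(raw):
    """Parse the magic, version, and JSON metadata of a decompressed v1 ring.

    Hypothetical helper for illustration. Returns (metadata, offset), where
    offset is where the replica-to-part-to-device table begins.
    """
    # Four magic bytes followed by a network-order two-byte version.
    magic, version = struct.unpack("!4sH", raw[:6])
    if magic != b"R1NG":
        raise ValueError("not a ring file")
    if version != 1:
        raise ValueError("unsupported ring version: %d" % version)
    # Network-order four-byte length of the JSON metadata that follows.
    (length,) = struct.unpack("!I", raw[6:10])
    metadata = json.loads(raw[10:10 + length].decode("ascii"))
    return metadata, 10 + length
```

The returned offset marks the first byte of the table that follows the metadata.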

The replica-to-part-to-device table then follows:

+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
|                ...                |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...|
+-------+-------+...+

Each <dev> is a host-order two-byte index into the devs list. Every row except the last has exactly 2 ** part_power entries; the last row may have the same or fewer.
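
The row layout above, including a short final row, might be decoded along these lines. This is a sketch under stated assumptions: read_v1_table is a hypothetical name, payload is the decompressed data following the metadata, and the caller already knows the byte order of the machine that wrote the ring.

```python
import struct
import sys


def read_v1_table(payload, part_shift, replica_count, byteorder=sys.byteorder):
    """Split a v1 replica-to-part-to-device table into rows.

    Hypothetical helper for illustration; byteorder is the host order of
    the machine that wrote the ring ("big" or "little").
    """
    parts = 2 ** (32 - part_shift)  # part_shift = 32 - part_power
    prefix = "<" if byteorder == "little" else ">"
    row_bytes = 2 * parts           # two bytes per <dev>
    rows = []
    for row in range(replica_count):
        chunk = payload[row * row_bytes:(row + 1) * row_bytes]
        count = len(chunk) // 2     # the last row may be short
        rows.append(list(struct.unpack("%s%dH" % (prefix, count), chunk)))
    return rows
```

With part_shift = 30 (four partitions) and ten entries on disk, this yields two full rows and a final row of two entries, i.e. 2.5 replicas.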

The metadata object has proven quite versatile: new keys have been added to provide additional information while remaining backwards-compatible. In order, the following new fields have been added:

byteorder specifies whether the host-order for the replica-to-part-to-device table is "big" or "little" endian. Added in Swift 2.12.0, this allows rings written on big-endian machines to be read on little-endian machines and vice-versa.
next_part_power indicates whether a partition-power increase is in progress. Added in Swift 2.15.0, this will have one of two values, if present: the ring's current part_power, indicating that there may be hardlinks to clean up, or part_power + 1, indicating that hardlinks may need to be created. See the modify_part_power documentation for more information.
version specifies the version number of the ring-builder that was used to write this ring. Added in Swift 2.24.0, this allows rings from different machines to be compared to determine which is newer.

Ring v2

The way that v1 rings dealt with fractional replicas made it impossible to reliably serialize additional large data structures after the replica-to-part-to-device table. The v2 format has been designed to be extensible.

The new format starts with magic similar to v1:

+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+

where <vrs> is again a network-order two-byte version number (which is now 2). By bumping the version number, we ensure that old versions of Swift refuse to read the ring, rather than misinterpret the content.

After that, a series of BLOBs are serialized, each as:

+-------------------------------+-------...---+
| <data-length>                 | <data ... > |
+-------------------------------+-------...---+

where <data-length> is the network-order eight-byte length (in bytes) of <data>. Each BLOB is preceded by a Z_FULL_FLUSH to allow it to be decompressed without reading the whole file.
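
To illustrate why the flush matters, the sketch below frames BLOBs with an eight-byte length, flushes before each one, and then decompresses a single BLOB starting from its recorded offset. All function names here are hypothetical and this is not Swift's RingWriter; it only demonstrates the technique using the standard zlib module.

```python
import struct
import zlib


def pack_blob(data):
    """Prefix data with its network-order eight-byte length."""
    return struct.pack("!Q", len(data)) + data


def write_demo(blobs):
    """Return (file_bytes, offsets) for a gzip stream of flushed BLOBs."""
    comp = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)  # gzip
    out = comp.compress(b"R1NG" + struct.pack("!H", 2))
    offsets = []
    for data in blobs:
        # Z_FULL_FLUSH byte-aligns the output and resets the dictionary,
        # so decompression can start again from this exact offset.
        out += comp.flush(zlib.Z_FULL_FLUSH)
        offsets.append(len(out))
        out += comp.compress(pack_blob(data))
    out += comp.flush()  # Z_FINISH, plus the gzip CRC32/ISIZE trailer
    return out, offsets


def read_blob_at(file_bytes, offset):
    """Decompress one BLOB starting at a flush point, as raw deflate."""
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    raw = inflater.decompress(file_bytes[offset:])
    (length,) = struct.unpack("!Q", raw[:8])
    return raw[8:8 + length]
```

For brevity, read_blob_at decompresses everything after the offset; a real reader would stop once it had the bytes it needed.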

The order of the BLOBs isn't important, although they do tend to be written in the order Swift will read them while loading. This reduces the disk seeks necessary to load.

The final BLOB is an index: a JSON object mapping named sections to an array of offsets within the file, like

{
    section: [
        compressed start,
        uncompressed start,
        compressed end,
        uncompressed end,
        checksum method,
        checksum value
    ],
    ...
}

Section names may be arbitrary strings, but the "swift/" prefix is reserved for upstream use. The start/end values mark the beginning and ending of the section's BLOB. Note that some end values may be null if they were not known when the index was written -- in particular, this will be true for the index itself. The checksum method should be one of "md5", "sha1", "sha256", or "sha512"; other values will be ignored in anticipation of a need to support further algorithms. The checksum value will be the hex-encoded digest of the uncompressed section's bytes. Like end values, checksum data may be null if not known when the index is written.
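
A reader might apply the checksum rules above roughly as follows. This is a sketch only: check_section is a hypothetical name, and exactly which bytes the digest covers (for example, whether the length prefix is included) is not specified here, so section_bytes is simply whatever the writer hashed.

```python
import hashlib

# Methods named in the format description; anything else is ignored.
KNOWN_METHODS = {"md5", "sha1", "sha256", "sha512"}


def check_section(entry, section_bytes):
    """Check an index entry's checksum against a section's bytes.

    Hypothetical helper. Returns True or False when the checksum can be
    verified, or None when it is missing or uses an unrecognised method.
    """
    method, expected = entry[4], entry[5]
    if method is None or expected is None or method not in KNOWN_METHODS:
        return None
    return hashlib.new(method, section_bytes).hexdigest() == expected
```

Returning None rather than failing keeps unknown methods and null checksums distinct from an actual mismatch.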

Finally, a "tail" is written:

the gzip stream is flushed with another Z_FULL_FLUSH,
the stream is switched to uncompressed,
the eight-byte offset of the uncompressed start of the index is written,
the gzip stream is flushed with another Z_FULL_FLUSH,
the eight-byte offset of the compressed start of the index is written,
the gzip stream is flushed with another Z_FULL_FLUSH, and
the gzip stream is closed; this involves:
- flushing the underlying deflate stream with Z_FINISH
- writing CRC32 (of the full uncompressed data)
- writing ISIZE (the length of the full uncompressed data mod 2 ** 32)

By switching to uncompressed, we can know exactly how many bytes will be written in the tail, so that when reading we can quickly seek to and read the index offset, seek to the index start, and read the index. From there we can do similar things for any other section. To locate and read the index:

Seek to the end of the file
Go back 31 bytes in the underlying file; this should leave us at the start of the deflate block containing the offset for the compressed start
Decompress 8 bytes from the deflate stream to get the location of the compressed start of the index BLOB
Seek to that location
Read/decompress the size of the index BLOB
Read/decompress the JSON-serialized index.

Note

These 31 bytes comprise the deflate block containing the 8-byte location, a Z_FULL_FLUSH block, the Z_FINISH block, and the CRC32 and ISIZE. For more information, see RFC 1951 (for the deflate stream) and RFC 1952 (for the gzip format).
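
The read steps and the 31-byte tail can be sketched as follows. The names inflate_at and read_index are hypothetical, and two things are assumed rather than stated above: that the eight-byte offsets are network-order like the BLOB lengths, and that every block in the tail is a stored (uncompressed) deflate block.

```python
import io
import json
import struct
import zlib

# Tail layout in bytes, assuming stored blocks throughout (RFC 1951/1952).
STORED_OVERHEAD = 5           # 1 header byte + 2-byte LEN + 2-byte NLEN
TAIL_SIZE = (
    (STORED_OVERHEAD + 8)     # stored block holding the 8-byte offset
    + STORED_OVERHEAD         # empty stored block from Z_FULL_FLUSH
    + STORED_OVERHEAD         # empty final block from Z_FINISH
    + 4 + 4                   # gzip trailer: CRC32 and ISIZE
)                             # 13 + 5 + 5 + 8 = 31


def inflate_at(fp, offset, nbytes):
    """Decompress nbytes of raw deflate data starting at a flush point."""
    fp.seek(offset)
    inflater = zlib.decompressobj(-zlib.MAX_WBITS)
    out = b""
    while len(out) < nbytes and not inflater.eof:
        chunk = fp.read(65536)
        if not chunk:
            break
        out += inflater.decompress(chunk)
    if len(out) < nbytes:
        raise ValueError("truncated ring file")
    return out[:nbytes]


def read_index(fp):
    """Follow the tail to the index BLOB and return the parsed index."""
    end = fp.seek(0, io.SEEK_END)
    # Assumption: the offset is a network-order unsigned 64-bit integer.
    (start,) = struct.unpack("!Q", inflate_at(fp, end - TAIL_SIZE, 8))
    (length,) = struct.unpack("!Q", inflate_at(fp, start, 8))
    blob = inflate_at(fp, start, 8 + length)
    return json.loads(blob[8:])
```

Decompressing the index twice (once for its length, once for its body) keeps the sketch short; a real reader would keep one decompressor and continue reading from it.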

The sections currently defined upstream are as follows:

swift/index - The Swift index
swift/ring/metadata - Ring metadata, serialized as JSON
swift/ring/devices - The device list, serialized as JSON
- This has been separated from the metadata structure used in v1, as it can get large
swift/ring/assignments - The ring replica2part2dev_id data structure

Note

Third-parties may find it useful to add their own sections; however, the swift/ prefix is reserved for future upstream enhancements.

swift/ring/metadata

This BLOB is an ASCII-encoded JSON object full of metadata, similar to v1 rings. It has the following required keys:

part_shift
dev_id_bytes specifies the number of bytes used for each <dev> in the replica-to-part-to-device table; will be one of 2, 4, or 8

Additionally, there are several optional keys which may be present:

next_part_power
version

Notice that two keys are no longer present: replica_count is no longer needed as the size of the replica-to-part-to-device table is explicit, and byteorder is not needed as all data in v2 rings should be written using network-order.

swift/ring/devices

This BLOB contains a list of Swift device dictionaries. It was separated out from the metadata BLOB because it can become a large structure in its own right.

swift/ring/assignments

This BLOB is the replica-to-part-to-device table. Its length will be replicas * (2 ** part_power) * dev_id_bytes, where replicas is the exact (potentially fractional) replica count for the ring. Unlike in v1, each <dev> is written using network-order.

Note that this is why we increased the size of <data-length> as compared to the v1 format -- otherwise, we may not be able to represent rings with both high replica_count and high part_power.
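
As an illustration of the length formula and of why a four-byte length can fall short, here is a minimal sketch of decoding the table. decode_assignments is a hypothetical helper rather than Swift's loader, and the sizes in the final comment are illustrative values only.

```python
import struct

# struct format characters for each permitted dev_id_bytes value.
FORMATS = {2: "H", 4: "I", 8: "Q"}


def decode_assignments(blob, part_power, dev_id_bytes):
    """Split the network-order assignment table into per-replica rows.

    Hypothetical helper; the last row is short for fractional replicas.
    """
    parts = 2 ** part_power
    count = len(blob) // dev_id_bytes
    fmt = "!%d%s" % (count, FORMATS[dev_id_bytes])
    ids = struct.unpack(fmt, blob)
    return [list(ids[i:i + parts]) for i in range(0, count, parts)]


# Illustrative sizes: 3 replicas, part_power 30, 4-byte IDs would need
# 3 * 2**30 * 4 bytes (12 GiB), more than a four-byte length can express.
assert 3 * 2 ** 30 * 4 > 2 ** 32 - 1
```

For example, a 40-byte BLOB with part_power 2 and dev_id_bytes 4 holds ten IDs, which decodes to two full rows and a final row of two (2.5 replicas).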
