
Ring File Formats
The ring is the most important data structure in Swift. How this data structure has been serialized to disk has changed over the years.
Initially, ring files contained three key pieces of information:
- the part_power value (often stored as part_shift := 32 - part_power), which determines how many partitions are in the ring,
- the device list, which includes all the disks participating in the ring, and
- the replica-to-part-to-device table, which has all replica_count * (2 ** part_power) partition assignments.
But the need to extend the serialization format with additional data structures has led to the creation of a new ring v2 format.
Ring files have always been gzipped when serialized, though the inner, raw format has evolved over the years.
Ring v0
Initially, rings were simply pickle dumps of the RingData object. With Swift 1.3.0, this changed to pickling a pure-stdlib data structure, but the core concept was the same.
Swift 2.36.0 dropped support for v0 rings.
Ring v1
Pickle presented some problems, however. While there are security concerns around unpickling untrusted data, security boundaries are generally drawn such that rings are assumed to be trusted. Ultimately, what pushed us to a new format were performance considerations.
Starting in Swift 1.7.0, Swift began using a new format (while still being willing to read the old one). The new format starts with some magic so we may identify it as such:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+
where <vrs> is a network-order two-byte version number (which is always 1). After that, a JSON object is serialized as:
+---------------+-------...---+
| <data-length> | <data ... > |
+---------------+-------...---+
where <data-length> is the network-order four-byte length (in bytes) of <data>, which is the ASCII-encoded JSON-serialized object. This object has at minimum three keys:
- devs for the device list
- part_shift (i.e., 32 - part_power)
- replica_count for the integer number of part-to-device rows to read
The replica-to-part-to-device table then follows:
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...| <dev> | <dev> |
+-------+-------+...+-------+-------+
| ... |
+-------+-------+...+-------+-------+
| <dev> | <dev> |...|
+-------+-------+...+
Each <dev> is a host-order two-byte index into the devs list. Every row except the last has exactly 2 ** part_power entries; the last row may have the same or fewer.
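The v1 layout described above can be sketched in a few lines of Python. This is a minimal illustration, not Swift's own reader or writer: the helper names `dump_v1` and `load_v1` are hypothetical, and the table is written in whatever byte order the host uses.

```python
import array
import gzip
import io
import json
import struct


def dump_v1(devs, part_power, rows):
    """Serialize a toy v1-style ring (hypothetical helper, host byte order)."""
    meta = json.dumps({
        "devs": devs,
        "part_shift": 32 - part_power,
        "replica_count": len(rows),  # number of rows, including a short last row
    }).encode("ascii")
    buf = io.BytesIO()
    with gzip.GzipFile(fileobj=buf, mode="wb") as f:
        f.write(b"R1NG" + struct.pack("!H", 1))      # magic + version
        f.write(struct.pack("!I", len(meta)) + meta)  # four-byte length + JSON
        for row in rows:
            f.write(array.array("H", row).tobytes())  # two-byte device indexes
    return buf.getvalue()


def load_v1(blob):
    """Parse what dump_v1 wrote; returns (metadata, list of array rows)."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as f:
        magic, version = struct.unpack("!4sH", f.read(6))
        if magic != b"R1NG" or version != 1:
            raise ValueError("not a v1 ring")
        (length,) = struct.unpack("!I", f.read(4))
        meta = json.loads(f.read(length))
        parts = 2 ** (32 - meta["part_shift"])
        rows = []
        for _ in range(meta["replica_count"]):
            row = array.array("H")
            row.frombytes(f.read(2 * parts))  # the last row may come up short
            rows.append(row)
    return meta, rows
```

With a part power of 2 and 2.5 replicas, the third row holds only two entries, so the reader cannot know where the table ends without reaching the end of the file.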
The metadata object has proven quite versatile: new keys have been added to provide additional information while remaining backwards-compatible. In order, the following new fields have been added:
- byteorder specifies whether the host-order for the replica-to-part-to-device table is "big" or "little" endian. Added in Swift 2.12.0, this allows rings written on big-endian machines to be read on little-endian machines and vice-versa.
- next_part_power indicates whether a partition-power increase is in progress. Added in Swift 2.15.0, this will have one of two values, if present: the ring's current part_power, indicating that there may be hardlinks to clean up, or part_power + 1, indicating that hardlinks may need to be created. See the documentation <modify_part_power> for more information.
- version specifies the version number of the ring-builder that was used to write this ring. Added in Swift 2.24.0, this allows rings from different machines to be compared to determine which is newer.
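As a rough illustration of how the byteorder key can be honored when reading a v1 table row, the sketch below swaps bytes only when the writer's byte order differs from the reader's. The function name `decode_row` is hypothetical and not part of Swift.

```python
import array
import sys


def decode_row(raw, byteorder):
    """Decode one row of two-byte device indexes written in `byteorder`."""
    row = array.array("H")
    row.frombytes(raw)
    if byteorder != sys.byteorder:
        row.byteswap()  # writer and reader disagree, so swap each entry
    return row
```

For example, `decode_row(b"\x00\x01\x01\x02", "big")` yields the device indexes 1 and 258 regardless of the reading host's own byte order.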
Ring v2
The way that v1 rings dealt with fractional replicas made it impossible to reliably serialize additional large data structures after the replica-to-part-to-device table. The v2 format has been designed to be extensible.
The new format starts with magic similar to v1:
+---------------+-------+
|'R' '1' 'N' 'G'| <vrs> |
+---------------+-------+
where <vrs> is again a network-order two-byte version number (which is now 2). By bumping the version number, we ensure that old versions of Swift refuse to read the ring, rather than misinterpret the content.
After that, a series of BLOBs are serialized, each as:
+-------------------------------+-------...---+
| <data-length> | <data ... > |
+-------------------------------+-------...---+
where <data-length> is the network-order eight-byte length (in bytes) of <data>. Each BLOB is preceded by a Z_FULL_FLUSH to allow it to be decompressed without reading the whole file.
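To make the role of Z_FULL_FLUSH concrete, here is a minimal sketch using Python's zlib module. It is not Swift's RingWriter: `write_blobs` and `read_blob` are hypothetical helpers, and the compressed offsets are simply returned to the caller instead of being stored in an index.

```python
import struct
import zlib

MAGIC = b"R1NG" + struct.pack("!H", 2)


def write_blobs(blobs):
    """Write BLOBs into one gzip stream; return (file_bytes, offsets)."""
    comp = zlib.compressobj(6, zlib.DEFLATED, 31)  # wbits=31 selects gzip
    out = bytearray(comp.compress(MAGIC))
    offsets = []
    for data in blobs:
        # A full flush byte-aligns the output and resets the compressor,
        # so decompression can later start exactly at this offset.
        out += comp.flush(zlib.Z_FULL_FLUSH)
        offsets.append(len(out))
        out += comp.compress(struct.pack("!Q", len(data)) + data)
    out += comp.flush(zlib.Z_FINISH)
    return bytes(out), offsets


def read_blob(file_bytes, offset):
    """Read one BLOB starting at a compressed offset from write_blobs."""
    inflater = zlib.decompressobj(-15)  # raw deflate: no gzip header here
    header = inflater.decompress(file_bytes[offset:], 8)
    (length,) = struct.unpack("!Q", header)
    if length == 0:
        return b""  # max_length=0 would mean "no limit", so stop early
    return inflater.decompress(inflater.unconsumed_tail, length)
```

Because each BLOB starts on a flush point, `read_blob` never has to decompress anything written before the offset it is given.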
The order of the BLOBs isn't important, although they do tend to be written in the order Swift will read them while loading. This reduces the disk seeks necessary to load.
The final BLOB is an index: a JSON object mapping named sections to an array of offsets within the file, like
{
    section: [
        compressed start,
        uncompressed start,
        compressed end,
        uncompressed end,
        checksum method,
        checksum value
    ],
    ...
}
Section names may be arbitrary strings, but the "swift/" prefix is reserved for upstream use. The start/end values mark the beginning and ending of the section's BLOB. Note that some end values may be null if they were not known when the index was written -- in particular, this will be true for the index itself. The checksum method should be one of "md5", "sha1", "sha256", or "sha512"; other values will be ignored in anticipation of a need to support further algorithms. The checksum value will be the hex-encoded digest of the uncompressed section's bytes. Like end values, checksum data may be null if not known when the index is written.
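A reader might apply the checksum rules above roughly as follows. This is a sketch under assumptions: `verify_section` is a hypothetical helper, and the caller is responsible for passing exactly the uncompressed bytes that the writer hashed.

```python
import hashlib

KNOWN_METHODS = {"md5", "sha1", "sha256", "sha512"}


def verify_section(entry, section_bytes):
    """Check a six-element index entry against a section's uncompressed bytes.

    Returns True or False when a checksum can be checked, and None when the
    checksum is null or uses a method this reader does not recognize.
    """
    method, expected = entry[4], entry[5]
    if method is None or expected is None or method not in KNOWN_METHODS:
        return None  # unknown or missing checksums are ignored, not errors
    return hashlib.new(method, section_bytes).hexdigest() == expected
```

Returning None rather than raising keeps older readers working when a newer writer introduces a checksum method they do not know.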
Finally, a "tail" is written:
- the gzip stream is flushed with another Z_FULL_FLUSH,
- the stream is switched to uncompressed,
- the eight-byte offset of the uncompressed start of the index is written,
- the gzip stream is flushed with another Z_FULL_FLUSH,
- the eight-byte offset of the compressed start of the index is written,
- the gzip stream is flushed with another Z_FULL_FLUSH, and
- the gzip stream is closed; this involves:
  - flushing the underlying deflate stream with Z_FINISH
  - writing CRC32 (of the full uncompressed data)
  - writing ISIZE (the length of the full uncompressed data mod 2 ** 32)
By switching to uncompressed, we can know exactly how many bytes will be written in the tail, so that when reading we can quickly seek to and read the index offset, seek to the index start, and read the index. From there we can do similar things for any other section.
- Seek to the end of the file
- Go back 31 bytes in the underlying file; this should leave us at the start of the deflate block containing the offset for the compressed start
- Decompress 8 bytes from the deflate stream to get the location of the compressed start of the index BLOB
- Seek to that location
- Read/decompress the size of the index BLOB
- Read/decompress the JSON-serialized index.
Note
These 31 bytes are the deflate block containing the 8-byte location, a Z_FULL_FLUSH block, the Z_FINISH block, and the CRC32 and ISIZE. For more information, see RFC 1951 (for the deflate stream) and RFC 1952 (for the gzip format).
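The tail layout and the read steps above can be exercised end to end with a toy writer and reader. This is only a sketch, not Swift's implementation: the helper names are hypothetical, the gzip header and trailer are written by hand, "switching to uncompressed" is modelled by continuing the deflate stream with a level-0 compressor, checksums are left null, and recording the "end" offsets after the flush is an assumption.

```python
import json
import struct
import zlib

GZIP_HEADER = b"\x1f\x8b\x08\x00\x00\x00\x00\x00\x00\xff"  # minimal RFC 1952 header
MAGIC = b"R1NG" + struct.pack("!H", 2)
TAIL_SIZE = 31  # stored block (13) + full flush (5) + finish (5) + CRC32/ISIZE (8)


def write_ring(sections):
    """Build a toy v2-style file from a {name: bytes} mapping."""
    out = bytearray(GZIP_HEADER)
    raw = bytearray()  # every uncompressed byte, kept for CRC32 and ISIZE
    comp = zlib.compressobj(6, zlib.DEFLATED, -15)  # raw deflate stream

    def emit(data):
        raw.extend(data)
        out.extend(comp.compress(data))

    def full_flush():
        out.extend(comp.flush(zlib.Z_FULL_FLUSH))

    def emit_blob(data):
        emit(struct.pack("!Q", len(data)) + data)

    emit(MAGIC)
    full_flush()
    index = {}
    for name, data in sections.items():
        start = (len(out), len(raw))
        emit_blob(data)
        full_flush()
        index[name] = [start[0], start[1], len(out), len(raw), None, None]

    index_start = (len(out), len(raw))
    index["swift/index"] = [index_start[0], index_start[1], None, None, None, None]
    emit_blob(json.dumps(index).encode("ascii"))
    full_flush()

    # Assumption: continue the same deflate stream with a level-0 compressor,
    # so the remaining blocks are stored verbatim and have a known size.
    comp = zlib.compressobj(0, zlib.DEFLATED, -15)
    emit(struct.pack("!Q", index_start[1]))  # uncompressed start of the index
    full_flush()
    emit(struct.pack("!Q", index_start[0]))  # compressed start of the index
    full_flush()
    out.extend(comp.flush(zlib.Z_FINISH))
    out.extend(struct.pack("<II", zlib.crc32(raw), len(raw) % 2 ** 32))
    return bytes(out)


def read_blob(buf, offset):
    """Decompress one length-prefixed BLOB starting at a flush point."""
    inflater = zlib.decompressobj(-15)
    (length,) = struct.unpack("!Q", inflater.decompress(buf[offset:], 8))
    return inflater.decompress(inflater.unconsumed_tail, length) if length else b""


def read_index(buf):
    """Find the index via the last 31 bytes, then load it."""
    inflater = zlib.decompressobj(-15)
    (index_start,) = struct.unpack("!Q", inflater.decompress(buf[-TAIL_SIZE:], 8))
    return json.loads(read_blob(buf, index_start))
```

Once the index is loaded, any section can be fetched with `read_blob(buf, index[name][0])`, without decompressing the sections written before it.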
The currently defined sections and section names upstream are as follows:
- swift/index - The swift index
- swift/ring/metadata - Ring metadata serialized as JSON
- swift/ring/devices - Devices JSON-serialized data structure.
  - This has been separated from the ring metadata structure in v1 as it gets large
- swift/ring/assignments - The ring replica2part2dev_id data structure
Note
Third-parties may find it useful to add their own sections; however, the swift/ prefix is reserved for future upstream enhancements.
swift/ring/metadata
This BLOB is an ASCII-encoded JSON object full of metadata, similar to v1 rings. It has the following required keys:
- part_shift
- dev_id_bytes specifies the number of bytes used for each <dev> in the replica-to-part-to-device table; will be one of 2, 4, or 8
Additionally, there are several optional keys which may be present:
- next_part_power
- version
Notice that two keys are no longer present: replica_count is no longer needed as the size of the replica-to-part-to-device table is explicit, and byteorder is not needed as all data in v2 rings should be written using network-order.
swift/ring/devices
This BLOB contains a list of Swift device dictionaries. It was separated out from the metadata BLOB because it can become a large structure in its own right.
swift/ring/assignments
This BLOB is the replica-to-part-to-device table. Its length will be replicas * (2 ** part_power) * dev_id_bytes, where replicas is the exact (potentially fractional) replica count for the ring. Unlike in v1, each <dev> is written using network-order.
Note that this is why we increased the size of <data-length> as compared to the v1 format -- otherwise, we may not be able to represent rings with both high replica_count and high part_power.
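The length formula is easy to check with a little arithmetic. The function below is a hypothetical helper, and the ring sizes are illustrative values chosen for the example, not figures from the source.

```python
def assignments_length(replicas, part_power, dev_id_bytes):
    """Length in bytes of the replica-to-part-to-device table."""
    return int(replicas * (2 ** part_power) * dev_id_bytes)


# A small ring: 2.5 replicas, 16 partitions, two-byte device IDs.
small = assignments_length(2.5, 4, 2)

# A large ring: 3 replicas, part_power of 32, four-byte device IDs.
large = assignments_length(3, 32, 4)

# The largest value a four-byte, v1-style length field could hold.
FOUR_BYTE_MAX = 2 ** 32 - 1
```

Here `small` is 80 bytes, while `large` is 3 * 2 ** 34 bytes and exceeds `FOUR_BYTE_MAX`, so a table of that size only fits behind an eight-byte length.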