From cc2f0f4ed6f12554b7d8e8cb61e14f2b103445a0 Mon Sep 17 00:00:00 2001
From: Samuel Merritt
Date: Thu, 4 Dec 2014 18:37:24 -0800
Subject: [PATCH] Speed up reading and writing xattrs for object metadata

Object metadata is stored as a pickled hash: first the data is
pickled, then split into strings of length <= 254, then stored in a
series of extended attributes named "user.swift.metadata",
"user.swift.metadata1", "user.swift.metadata2", and so forth.

The choice of length 254 is odd, undocumented, and dates back to the
initial commit of Swift. From talking to people, I believe this was an
attempt to fit the first xattr in the inode, thus avoiding a seek.

However, it doesn't work. XFS _either_ stores all the xattrs together
in the inode (local), _or_ it spills them all to blocks located
outside the inode (extents or btree). Using short xattrs actually
hurts us here; by splitting into more pieces, we end up with more
names to store, thus reducing the metadata size that'll fit in the
inode.

[Source: http://xfs.org/docs/xfsdocs-xml-dev/XFS_Filesystem_Structure//tmp/en-US/html/Extended_Attributes.html]

I did some benchmarking of read_metadata with various xattr sizes
against an XFS filesystem on a spinning disk, no VMs involved.

Summary:

 name | rank | runs |   mean    |    sd     | timesBaseline
------|------|------|-----------|-----------|--------------
32768 |    1 | 2500 | 0.0001195 |  3.75e-05 | 1.0
16384 |    2 | 2500 | 0.0001348 | 1.869e-05 | 1.12809122912
 8192 |    3 | 2500 | 0.0001604 | 2.708e-05 | 1.34210998858
 4096 |    4 | 2500 | 0.0002326 | 0.0004816 | 1.94623473988
 2048 |    5 | 2500 | 0.0003414 | 0.0001409 | 2.85674781189
 1024 |    6 | 2500 | 0.0005457 | 0.0001741 | 4.56648611635
  254 |    7 | 2500 |  0.001848 |  0.001663 | 15.4616067887

Here, "name" is the chunk size for the pickled metadata. A total
metadata size of around 31.5 KiB was used, so the "32768" runs
represent storing everything in one single xattr, while the "254" runs
represent things as they are without this change.
Since bigger xattr chunks make things go faster, the new chunk size is
64 KiB. That's the biggest xattr that XFS allows.

Reading of metadata from existing files is unaffected; the
read_metadata() function already handles xattrs of any size.

On non-XFS filesystems, this is no worse than what came before: ext4
has a limit of one block (typically 4 KiB) for all xattrs (names and
values) taken together [1], so this change slightly increases the
amount of Swift metadata that can be stored on ext4. ZFS let me store
an xattr with an 8 MiB value, so that's plenty. It'll probably go
further, but I stopped there.

[1] https://ext4.wiki.kernel.org/index.php/Ext4_Disk_Layout#Extended_Attributes

Change-Id: Ie22db08ac0050eda693de4c30d4bc0d620e7f7d4
---
 swift/obj/diskfile.py | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/swift/obj/diskfile.py b/swift/obj/diskfile.py
index f828a16d02..809b5ecb4d 100644
--- a/swift/obj/diskfile.py
+++ b/swift/obj/diskfile.py
@@ -125,7 +125,7 @@ def read_metadata(fd):
     return pickle.loads(metadata)
 
 
-def write_metadata(fd, metadata):
+def write_metadata(fd, metadata, xattr_size=65536):
     """
     Helper function to write pickled metadata for an object file.
 
@@ -137,8 +137,8 @@ def write_metadata(fd, metadata):
     while metastr:
         try:
             xattr.setxattr(fd, '%s%s' % (METADATA_KEY, key or ''),
-                           metastr[:254])
-            metastr = metastr[254:]
+                           metastr[:xattr_size])
+            metastr = metastr[xattr_size:]
             key += 1
         except IOError as e:
             for err in 'ENOTSUP', 'EOPNOTSUPP':
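The split-across-numbered-xattrs scheme that this patch tunes can be sketched in pure Python. The sketch below uses a plain dict in place of real setxattr/getxattr calls so it runs without an XFS mount; the function names, the dict store, and the choice of pickle protocol are illustrative assumptions, not Swift's exact code.

```python
import pickle

# Key prefix matching the naming scheme described above: the first chunk
# has no numeric suffix, later chunks are suffixed 1, 2, 3, ...
METADATA_KEY = 'user.swift.metadata'


def write_chunked(xattrs, metadata, xattr_size=65536):
    """Pickle metadata and split it into xattr-sized chunks.

    `xattrs` is a dict standing in for the file's extended attributes.
    """
    metastr = pickle.dumps(metadata, protocol=2)
    key = 0
    while metastr:
        # `key or ''` yields '' for the first chunk, then 1, 2, ...
        xattrs['%s%s' % (METADATA_KEY, key or '')] = metastr[:xattr_size]
        metastr = metastr[xattr_size:]
        key += 1


def read_chunked(xattrs):
    """Reassemble the chunks in order and unpickle.

    Like read_metadata(), this works no matter what chunk size was used
    at write time, which is why the patch can change the size safely.
    """
    metadata = b''
    key = 0
    while ('%s%s' % (METADATA_KEY, key or '')) in xattrs:
        metadata += xattrs['%s%s' % (METADATA_KEY, key or '')]
        key += 1
    return pickle.loads(metadata)
```

Writing with a tiny chunk size and reading back shows the round trip: `write_chunked(store, meta, xattr_size=16)` produces several numbered keys, and `read_chunked(store)` returns the original dict regardless of how many chunks were written.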